**Introduction**

**Mental Health Significance** 

Mental health is one of the key components toward overall well-being, as it influences an individual's quality of life. Despite its significance, mental health disorders are a global challenge that are seemingly worsening throughout the years. A CNN/Kaiser Family Foundation poll recently reported that "90% of Americans feel that we are in a mental health crisis" (**citation**). There are a variety of factors contributign to mental health disorders, and vary from individual to individual. However, trends exist on a country level such as econimic indicators, demographics, etc, that make individuals more susceptible to mental health disorders. Gaining an understanding of these factors that influence mental health and developing effective preventive strategies are highly valued for worldwide health initiatives. 

More information regarding mental health can be found at the World Health Organization (WHO) website: https://www.who.int/news-room/fact-sheets/detail/mental-health-strengthening-our-response 

**Purpose**

The purpose of this study is to analyze mental health disorders around the world and contribute toward developing predictive models for mental health on a global scale. Various factors that potentially influence mental health disorders around the world will be analyzed, such as demographics, economic indicators, and more. This study will use statistical analysis and machine learning techniques in order to gain insight toward the patterns of mental health disorders, and attempt to predict countries' susceptibilities to mental health disorders. 

This study is especially applicable to data science because it aims to address the very complex nature of mental health disorders around the world through using statistical analysis and machine learning techniques. By understanding the factors that contribute toward mental health disorders, people such as healthcare professionals and policymakers can design preventive strategies for specific populations that are susceptible to mental health disorders. From this, the burden of mental health disorders may be mitigated and overall improve the well-being of individuals around the world. 

**Part 1: Data Collection**

This section includes the functionality for collecting all relevant data for this study. First, all Python libraries are imported here for the study. 

In [68]:
import pandas as pd 
from functools import reduce

Factors that could potentially influence mental health throughout the world include: 
- Population density 
- GDP per capita 
- Unemployment percentage
- Healthcare expenditure 
- ** need some for demographics 

Data sources:

- Mental health dataset by country: https://www.kaggle.com/datasets/thedevastator/uncover-global-trends-in-mental-health-disorder
- Population dataset by country: https://www.kaggle.com/datasets/chandanchoudhury/world-population-dataset?select=world_population_by_year_1950_2023.csv 
- GDP dataset by country: https://www.kaggle.com/datasets/tmishinev/world-country-gdp-19602021 
- Unemployment dataset by country: https://www.kaggle.com/datasets/pantanjali/unemployment-dataset 
- Healthcare expenditure by country: https://www.kaggle.com/datasets/mjshri23/life-expectancy-and-socio-economic-world-bank 
- ** data sources for demographics 

**Gathering Data**

The factors (independent variables) analyzed in this study are population density, GDP per capita, unemployment percentage, healthcare expenditure, **demographics. The following demonstrates how the data is gathered, which will be reading from a csv file of each dataset listed above and storing the result in a DataFrame. 

Mental health data

This DataFrame will include mental health information (schizophrenia %, bipolar disorder %, eating disorders %, anxiety disorders %, drug use disorders %, depression %, and alcohol use disorders %) for each country for the years of 1990 through 2017. 

In [69]:
mental_health_data = pd.read_csv("Datasets/Mental health Depression disorder Data.csv", low_memory=False)
mental_health_data.head()

Unnamed: 0,index,Entity,Code,Year,Schizophrenia (%),Bipolar disorder (%),Eating disorders (%),Anxiety disorders (%),Drug use disorders (%),Depression (%),Alcohol use disorders (%)
0,0,Afghanistan,AFG,1990,0.16056,0.697779,0.101855,4.82883,1.677082,4.071831,0.672404
1,1,Afghanistan,AFG,1991,0.160312,0.697961,0.099313,4.82974,1.684746,4.079531,0.671768
2,2,Afghanistan,AFG,1992,0.160135,0.698107,0.096692,4.831108,1.694334,4.088358,0.670644
3,3,Afghanistan,AFG,1993,0.160037,0.698257,0.094336,4.830864,1.70532,4.09619,0.669738
4,4,Afghanistan,AFG,1994,0.160022,0.698469,0.092439,4.829423,1.716069,4.099582,0.66926


Population data 

Two DataFrames are created here, one for global country statistics, and the other for population of each country from the years 1950 through 2023. These datasets will be used in conjunction in order to determine the population density for each country in this range of years. 

In [70]:
country_data = pd.read_csv("Datasets/world_country_stats.csv")
country_data.head()

Unnamed: 0,country,region,land_area,fertility_rate,median_age
0,Afghanistan,Asia,652860,4.4,17.0
1,Albania,Europe,27400,1.4,38.0
2,Algeria,Africa,2381740,2.8,28.0
3,American Samoa,Oceania,200,2.2,29.0
4,Andorra,Europe,470,1.1,43.0


In [71]:
population_data = pd.read_csv("Datasets/world_population_by_year_1950_2023.csv")
population_data.head()

Unnamed: 0,country,1950,1951,1952,1953,1954,1955,1956,1957,1958,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,Afghanistan,7480461,7571537,7667533,7764546,7864285,7971931,8087727,8210201,8333826,...,32716210,33753499,34636207,35643418,36686784,37769499,38972230,40099462,41128771,42239854
1,Albania,1252582,1289168,1326948,1366744,1409005,1453730,1500624,1549571,1600983,...,2884102,2882481,2881063,2879355,2877013,2873883,2866849,2854710,2842321,2832439
2,Algeria,9019866,9271734,9521702,9771686,10011541,10242288,10473168,10703251,10933784,...,38760168,39543154,40339329,41136546,41927007,42705368,43451666,44177969,44903225,45606480
3,American Samoa,19032,19425,19561,19670,19758,19826,19902,19937,19918,...,52217,51368,50448,49463,48424,47321,46189,45035,44273,43914
4,Andorra,6005,5827,5454,5308,5566,6116,6705,7330,7994,...,71621,71746,72540,73837,75013,76343,77700,79034,79824,80088


GDP per capita data

This DataFrame will contain the GDP per capita in USD for each country between the years 1960 and 2021. 

In [72]:
gdp_data = pd.read_csv("Datasets/world_country_gdp_usd.csv")
gdp_data.head()

Unnamed: 0,Country Name,Country Code,year,GDP_USD,GDP_per_capita_USD
0,Aruba,ABW,1960,,
1,Africa Eastern and Southern,AFE,1960,21290590000.0,162.726326
2,Afghanistan,AFG,1960,537777800.0,59.773234
3,Africa Western and Central,AFW,1960,10404140000.0,107.930722
4,Angola,AGO,1960,,


Unemployment data

This DataFrame includes the percentage of individuals in each country that were unemployed in each year from 1991 through 2021.

In [73]:
unemployment_data = pd.read_csv("Datasets/unemployment analysis.csv")
unemployment_data.head()

Unnamed: 0,Country Name,Country Code,1991,1992,1993,1994,1995,1996,1997,1998,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Africa Eastern and Southern,AFE,7.8,7.84,7.85,7.84,7.83,7.84,7.86,7.81,...,6.56,6.45,6.41,6.49,6.61,6.71,6.73,6.91,7.56,8.11
1,Afghanistan,AFG,10.65,10.82,10.72,10.73,11.18,10.96,10.78,10.8,...,11.34,11.19,11.14,11.13,11.16,11.18,11.15,11.22,11.71,13.28
2,Africa Western and Central,AFW,4.42,4.53,4.55,4.54,4.53,4.57,4.6,4.66,...,4.64,4.41,4.69,4.63,5.57,6.02,6.04,6.06,6.77,6.84
3,Angola,AGO,4.21,4.21,4.23,4.16,4.11,4.1,4.09,4.07,...,7.35,7.37,7.37,7.39,7.41,7.41,7.42,7.42,8.33,8.53
4,Albania,ALB,10.31,30.01,25.26,20.84,14.61,13.93,16.88,20.05,...,13.38,15.87,18.05,17.19,15.42,13.62,12.3,11.47,13.33,11.82


Healthcare expenditure data

The following dataframe will contain information regarding each country's healthcare expenditure expressed as a percentage of their GDP, unemployment %, etc, from 2001 through 2019. 

In [74]:
# healthcare expenditure dataset (has healthcare expenditure %, education expenditure %, unemployment %, country's income group, etc)
expenditure_data = pd.read_csv("Datasets/life expectancy.csv")
expenditure_data.head()

Unnamed: 0,Country Name,Country Code,Region,IncomeGroup,Year,Life Expectancy World Bank,Prevelance of Undernourishment,CO2,Health Expenditure %,Education Expenditure %,Unemployment,Corruption,Sanitation,Injuries,Communicable,NonCommunicable
0,Afghanistan,AFG,South Asia,Low income,2001,56.308,47.8,730.0,,,10.809,,,2179727.1,9689193.7,5795426.38
1,Angola,AGO,Sub-Saharan Africa,Lower middle income,2001,47.059,67.5,15960.0,4.483516,,4.004,,,1392080.71,11190210.53,2663516.34
2,Albania,ALB,Europe & Central Asia,Upper middle income,2001,74.288,4.9,3230.0,7.139524,3.4587,18.575001,,40.520895,117081.67,140894.78,532324.75
3,Andorra,AND,Europe & Central Asia,High income,2001,,,520.0,5.865939,,,,21.78866,1697.99,695.56,13636.64
4,United Arab Emirates,ARE,Middle East & North Africa,High income,2001,74.544,2.8,97200.0,2.48437,,2.493,,,144678.14,65271.91,481740.7


**Part 2: Data Processing/Cleaning**

After the data has been collected, it must be properly formatted and cleaned in order to be merged for later processing. 

Mental health data



The mental health dataset will need to be cleaned as to drop all irrelevant rows and columns from the table, as well as typecasting the columns to their proper types. The 'Entity' column will be renamed to "Country Name" to remain consistent along all dataframes for later merging. Furthermore, the rows will need to be reduced so that only years from 2001-2017 are in the table. 

In [75]:
# dropping useless info from table 
mental_health_data = mental_health_data.drop(index=mental_health_data.index[6468:])

# typecasting columns to correct types 
mental_health_data['Year'] = mental_health_data['Year'].astype(int)
mental_health_data['Schizophrenia (%)'] = mental_health_data['Schizophrenia (%)'].astype(float)
mental_health_data['Bipolar disorder (%)'] = mental_health_data['Bipolar disorder (%)'].astype(float)
mental_health_data['Eating disorders (%)'] = mental_health_data['Eating disorders (%)'].astype(float)

# dropping rows with years not used in the study (only 2001-2017)
mental_health_data = mental_health_data[(mental_health_data['Year'] >= 2001) & (mental_health_data['Year'] <= 2017)]

# dropping columns irrelevant to study 
mental_health_data = mental_health_data.drop(columns=['index', 'Code'])

# renaming 'Entity' column to 'Country Name'
mental_health_data = mental_health_data.rename(columns={'Entity': 'Country Name'})

In [76]:
mental_health_data.head()

Unnamed: 0,Country Name,Year,Schizophrenia (%),Bipolar disorder (%),Eating disorders (%),Anxiety disorders (%),Drug use disorders (%),Depression (%),Alcohol use disorders (%)
11,Afghanistan,2001,0.161957,0.700499,0.086517,4.831409,1.839123,4.121381,0.661158
12,Afghanistan,2002,0.162414,0.701141,0.087023,4.838318,1.934326,4.124928,0.659213
13,Afghanistan,2003,0.162916,0.70186,0.087189,4.845538,2.051106,4.12523,0.657354
14,Afghanistan,2004,0.163377,0.702556,0.088158,4.851512,2.163044,4.126384,0.656132
15,Afghanistan,2005,0.163706,0.703078,0.088933,4.854684,2.247443,4.126908,0.655686


Population data 

The population data currently is in a format such that there is a column for the country name, as well as columns for each year in the table (1950-2023) with its values being the population. In order for this table to be in a tidy format, the data should be reshaped using melting so that the columns are country, year, and population. 

In [77]:
# melting the population data to reshape it 
global_population_data = population_data.melt(id_vars=['country'], var_name='year', value_name='population')

From here, we will need to calculate the population density for each country during this time frame. The 'country_data' dataframe will be used in conjunction with this 'global_population_data' dataframe, as it contains relevant information such as the land area (in km^2) for each country. The two dataframes will first be merged on the country name and then a new column will be created which calculates the population / land area for each country. 

In [78]:
# merge both dataframes on the 'country' column
density_data = pd.merge(global_population_data, country_data, on='country')

# creating 'population_density' column
density_data['population_density'] = density_data['population'] / density_data['land_area']
density_data.head()

Unnamed: 0,country,year,population,region,land_area,fertility_rate,median_age,population_density
0,Afghanistan,1950,7480461,Asia,652860,4.4,17.0,11.457986
1,Albania,1950,1252582,Europe,27400,1.4,38.0,45.714672
2,Algeria,1950,9019866,Africa,2381740,2.8,28.0,3.787091
3,American Samoa,1950,19032,Oceania,200,2.2,29.0,95.16
4,Andorra,1950,6005,Europe,470,1.1,43.0,12.776596


Finally, the year column is typecasted to an int, and all years not within the range of 2001-2017 are dropped. Furthermore, irrelevant columns in the table are dropped for this study. The 'country' and 'year' columns will also be renamed to remain consistent with the above.  

In [79]:
# typecasting year to correct int
density_data['year'] = density_data['year'].astype(int)

# dropping rows with years not used in the study (only 2001-2017)
density_data[(density_data['year'] >= 2001) & (density_data['year'] <= 2017)]

# dropping irrelevant columns for study 
density_data = density_data.drop(columns=['population', 'land_area', 'fertility_rate', 'median_age', 'region'])

# renaming 'country' to 'Country Name' and 'year' to 'Year'
density_data = density_data.rename(columns={'country': 'Country Name', 'year': 'Year'})

In [80]:
density_data.head()

Unnamed: 0,Country Name,Year,population_density
0,Afghanistan,1950,11.457986
1,Albania,1950,45.714672
2,Algeria,1950,3.787091
3,American Samoa,1950,95.16
4,Andorra,1950,12.776596


GDP per capita data 

The GDP per capita data is already in a proper format for merging the tables, so all that needs to be done is filtering the table to only have rows from the years 2001-2017, dropping any unnecessary columns, and renaming the 'year' column. 

In [81]:
# dropping rows with years not used in the study (only 2001-2017)
gdp_data = gdp_data[(gdp_data['year'] >= 2001) & (gdp_data['year'] <= 2017)]

# dropping irrelevant columns for study
gdp_data = gdp_data.drop(columns=['Country Code', 'GDP_USD'])

# renaming 'year to 'Year'
gdp_data = gdp_data.rename(columns={'year': 'Year'})

In [82]:
gdp_data.head()

Unnamed: 0,Country Name,Year,GDP_per_capita_USD
10906,Aruba,2001,20417.77596
10907,Africa Eastern and Southern,2001,633.548479
10908,Afghanistan,2001,
10909,Africa Western and Central,2001,539.338735
10910,Angola,2001,527.333529


Unemployment data

Similar to the population data, the unemployment data is formatted with columns for country name, country code, and a column for each year in the table (1991-2021) with values for the percentage of the population in that country that was unemployed. This dataframe will be melted as well, so that the resulting columns in the dataframe are country name, country code, year, and unemployment %. 

In [83]:
# melting the unemployment data to reshape it 
unemployment_data = unemployment_data.melt(id_vars=['Country Name', 'Country Code'], var_name='Year', value_name='Unemployment (%)')

Next, the 'year' column will need to be typecasted as an int, and the rows will be filtered so that only years from 2001-2017 are in the table. Furthermore, irrelevant columns from the table are dropped. 

In [84]:
# typecasting year to int 
unemployment_data['Year'] = unemployment_data['Year'].astype(int)

# dropping rows with years not used in the study (only 2001-2017)
unemployment_data = unemployment_data[(unemployment_data['Year'] >= 2001) & (unemployment_data['Year'] <= 2017)]

# dropping irrelevant columns for study
unemployment_data = unemployment_data.drop(columns=['Country Code'])

In [85]:
unemployment_data.head()

Unnamed: 0,Country Name,Year,Unemployment (%)
2350,Africa Eastern and Southern,2001,7.73
2351,Afghanistan,2001,10.81
2352,Africa Western and Central,2001,4.87
2353,Angola,2001,4.0
2354,Albania,2001,18.58


Healthcare expenditure data

The healthcare expenditure data is properly formatted, but needs to be filtered to contain years from 2001-2017 and unecessary columns from the table are dropped. 

In [86]:
# dropping rows with years not used in the study (only 2001-2017)
expenditure_data = expenditure_data[(expenditure_data['Year'] >= 2001) & (expenditure_data['Year'] <= 2017)]

# dropping irrelevant columns for study 
expenditure_data = expenditure_data.drop(columns=['Country Code', 'Prevelance of Undernourishment', 'CO2', 'Education Expenditure %', 'Corruption', 'Sanitation', 'Injuries', 'Communicable', 'NonCommunicable', 'Unemployment'])

In [87]:
expenditure_data.head()

Unnamed: 0,Country Name,Region,IncomeGroup,Year,Life Expectancy World Bank,Health Expenditure %
0,Afghanistan,South Asia,Low income,2001,56.308,
1,Angola,Sub-Saharan Africa,Lower middle income,2001,47.059,4.483516
2,Albania,Europe & Central Asia,Upper middle income,2001,74.288,7.139524
3,Andorra,Europe & Central Asia,High income,2001,,5.865939
4,United Arab Emirates,Middle East & North Africa,High income,2001,74.544,2.48437


**Merging the Data**

Now that the data has all been properly formatted, these datasets will be merged together. There are 5 dataframes that will be merged, and each contain the following data from each year during the 2001-2017 time period:

- **mental_health_data**: contains all mental health disorder data for each country
- **density_data**: contains the population density for each country 
- **gdp_data**: contains the GDP per capita in USD for each country 
- **unemployment_data**: contains the percentage of the population in each country that was unemployed
- **expenditure_data**: contains data on the life expectancy and health expenditure % for each country

These dataframes will be merged using the reduce function in Python along with the Panda's merge function, which continuously merges each dataframe on a left merge, starting with the main dataframe, the mental health data. 

In [88]:
# list of all dataframes used for merging 
frames = [mental_health_data, density_data, gdp_data, unemployment_data, expenditure_data]

# using reduce function to merge the dataframes 
df = reduce(lambda left_df, right_df: pd.merge(left_df, right_df, on=['Country Name', 'Year'], how='left'), frames)

df.head()

Unnamed: 0,Country Name,Year,Schizophrenia (%),Bipolar disorder (%),Eating disorders (%),Anxiety disorders (%),Drug use disorders (%),Depression (%),Alcohol use disorders (%),population_density,GDP_per_capita_USD,Unemployment (%),Region,IncomeGroup,Life Expectancy World Bank,Health Expenditure %
0,Afghanistan,2001,0.161957,0.700499,0.086517,4.831409,1.839123,4.121381,0.661158,30.15751,,10.81,South Asia,Low income,56.308,
1,Afghanistan,2002,0.162414,0.701141,0.087023,4.838318,1.934326,4.124928,0.659213,32.166553,179.426579,11.26,South Asia,Low income,56.784,9.44339
2,Afghanistan,2003,0.162916,0.70186,0.087189,4.845538,2.051106,4.12523,0.657354,34.686043,190.683814,11.14,South Asia,Low income,57.271,8.941258
3,Afghanistan,2004,0.163377,0.702556,0.088158,4.851512,2.163044,4.126384,0.656132,36.077491,211.382074,10.99,South Asia,Low income,57.772,9.808474
4,Afghanistan,2005,0.163706,0.703078,0.088933,4.854684,2.247443,4.126908,0.655686,37.391157,242.031313,11.22,South Asia,Low income,58.29,9.94829


**Cleaning the Merged Data**

From some analysis (not shown here) showing all the unique country names present in the dataframe, there seems to have been many entries from the mental health data where country names were broad regions throughout the world, such as 'Central Asia', 'North America', etc. Since this study is on the country level, and not region level, all rows with country names similar to these will be dropped from the table. 

In [89]:
names_drop = ['Australasia', 'Central Asia', 'Central Europe', 'Central Europe, Eastern Europe, and Central Asia', 
              'Central Latin America', 'Central Sub-Saharan Africa', 'Eastern Europe', 'East Asia', 
              'Eastern Sub-Saharan Africa', 'High SDI', 'High-income', 'High-income Asia Pacific', 
              'High-middle SDI', 'Latin America and Caribbean', 'Low SDI', 'Low-middle SDI', 'Middle SDI', 
              'Micronesia (country)', 'North Africa and Middle East', 'North America', 'Oceania', 
              'Southeast Asia', 'Southeast Asia, East Asia, and Oceania', 'Southern Latin America', 
              'Southern Sub-Saharan Africa', 'South Asia', 'Sub-Saharan Africa', 'Tropical Latin America',
              'United States Virgin Islands', 'Western Europe', 'Western Sub-Saharan Africa', 'World']

df = df[~df['Country Name'].isin(names_drop)]

Furthermore, we also want to make sure we remove any countries from the DataFrame that did not have corresponding data from the other tables during the merge. There were 7 new columns that were merged onto the mental health data (population density, GDP per capita in USD, unemployment %, region, income group, life expectancy, and healthcare expenditure %). For this study, we will assume that if there are more than 3 of these columns that contain NaN values for each country (during every year from 2001-2017), there is not sufficient data for the particular country and the country should be excluded from the study. The following shows that 100% of the countries in the table that did not meet this criteria occurred every year during the study period (17 years), showing there was not corresponding data for these countries during the merge. 

In [90]:
# showing the countries and how many rows for the country did not meet the data criteria 
# the number of rows not meeting the criteria for each country should be 17, as this is each year during the study 
filtered = df[df.isna().sum(axis=1) > 4]
filtered.groupby('Country Name').size()

Country Name
Andean Latin America                17
Bahamas                             17
Brunei                              17
Cape Verde                          17
Caribbean                           17
Congo                               17
Czech Republic                      17
Democratic Republic of Congo        17
Egypt                               17
England                             17
Gambia                              17
Iran                                17
Kyrgyzstan                          17
Laos                                17
Macedonia                           17
North Korea                         17
Northern Ireland                    17
Palestine                           17
Russia                              17
Saint Lucia                         17
Saint Vincent and the Grenadines    17
Scotland                            17
Slovakia                            17
South Korea                         17
Swaziland                           17
Syria       

These countries will be dropped from the DataFrame as there is not sufficient data to include for the analysis. 

In [91]:
df_cleaned = df[df.isna().sum(axis=1) <= 4]

In [92]:
filtered = df_cleaned[df_cleaned.isna().sum(axis=1) > 4]
filtered.groupby('Country Name').size()

Series([], dtype: int64)

In [93]:
df = df_cleaned

**Fixing Missing Data**

Now that any insufficient countries for the analysis have been removed from the DataFrame, we will see how much more missing data remains in the table. The following shows the proportion of rows in the table that contain at least one NaN value. 

In [94]:
print("Number of rows containing an NaN value: ", len(df[df.isna().any(axis=1)]))
print("Proportion of NaN rows in the entire DataFrame: ", len(df[df.isna().any(axis=1)]) / len(df))

Number of rows containing an NaN value:  323
Proportion of NaN rows in the entire DataFrame:  0.11377245508982035


Although this proportion is very small, there should be no missing values present in the table during further analysis. Below shows the proportion of missing values in the table for each column, as some columns may have more missing data than others. This will allow us to determine how to proceed with filling in missing data values. 

In [95]:
nans = df.isna().mean() 
nans

Country Name                  0.000000
Year                          0.000000
Schizophrenia (%)             0.000000
Bipolar disorder (%)          0.000000
Eating disorders (%)          0.000000
Anxiety disorders (%)         0.000000
Drug use disorders (%)        0.000000
Depression (%)                0.000000
Alcohol use disorders (%)     0.000000
population_density            0.011976
GDP_per_capita_USD            0.010919
Unemployment (%)              0.065868
Region                        0.000000
IncomeGroup                   0.000000
Life Expectancy World Bank    0.029588
Health Expenditure %          0.057415
dtype: float64

Population density missing data

Below shows that the two countries without population density data for the study period are Cote d'Ivoire and Sao Tome and Principe. Neither of these countries have population desnity data for the entire study period (2001-2017).

** figure out how to handle this 

In [96]:
pop = df['population_density'].isna()
pop = df[pop]
pop['Country Name'].unique()

array(["Cote d'Ivoire", 'Sao Tome and Principe'], dtype=object)

GDP per capita missing data

The rows in the table with missing values in the GDP per capita column are more scattered, as there are not any countries in the table with missing GDP values throughout the whole study period. Here, we can use a linear regression model to predict the GDP per capita in USD for these rows. 

** handling missing data

In [97]:
gdp = df['GDP_per_capita_USD'].isna()
df[gdp]

Unnamed: 0,Country Name,Year,Schizophrenia (%),Bipolar disorder (%),Eating disorders (%),Anxiety disorders (%),Drug use disorders (%),Depression (%),Alcohol use disorders (%),population_density,GDP_per_capita_USD,Unemployment (%),Region,IncomeGroup,Life Expectancy World Bank,Health Expenditure %
0,Afghanistan,2001,0.161957,0.700499,0.086517,4.831409,1.839123,4.121381,0.661158,30.15751,,10.81,South Asia,Low income,56.308,
51,American Samoa,2001,0.249614,0.466843,0.17983,3.289246,0.76038,2.945743,1.126103,291.62,,,East Asia & Pacific,Upper middle income,,
1167,Eritrea,2012,0.158294,0.616069,0.100636,3.646936,0.520459,3.860216,1.755663,32.203921,,5.53,Sub-Saharan Africa,Low income,63.238,3.780682
1168,Eritrea,2013,0.15831,0.616258,0.100789,3.649057,0.520482,3.859559,1.750575,32.637297,,5.58,Sub-Saharan Africa,Low income,63.726,5.007354
1169,Eritrea,2014,0.158354,0.616504,0.100744,3.652097,0.520971,3.861139,1.745641,32.905198,,5.67,Sub-Saharan Africa,Low income,64.201,4.057079
1170,Eritrea,2015,0.158428,0.616791,0.1013,3.655814,0.522229,3.861327,1.740921,33.069366,,5.83,Sub-Saharan Africa,Low income,64.664,4.453355
1171,Eritrea,2016,0.158544,0.61714,0.101816,3.660275,0.523399,3.863449,1.736561,33.319673,,5.9,Sub-Saharan Africa,Low income,65.111,3.518888
1172,Eritrea,2017,0.158711,0.617569,0.102503,3.665668,0.524821,3.867377,1.732617,33.633,,5.94,Sub-Saharan Africa,Low income,65.538,3.727679
1394,Guam,2001,0.260766,0.4789,0.240416,3.320364,0.789588,3.439942,1.214632,299.118519,,13.21,East Asia & Pacific,High income,75.373,
2567,Northern Mariana Islands,2001,0.261945,0.489008,0.239339,3.33354,0.734966,3.024075,1.472767,172.780435,,,East Asia & Pacific,High income,,


Unemployment % missing data 

** each country is missing the data throughout the whole study period 

figure out how to handle this 

In [98]:
unemployment = df['Unemployment (%)'].isna()
for i, r in df[unemployment].iterrows():
    print(r['Country Name'], r['Year'])

American Samoa 2001
American Samoa 2002
American Samoa 2003
American Samoa 2004
American Samoa 2005
American Samoa 2006
American Samoa 2007
American Samoa 2008
American Samoa 2009
American Samoa 2010
American Samoa 2011
American Samoa 2012
American Samoa 2013
American Samoa 2014
American Samoa 2015
American Samoa 2016
American Samoa 2017
Andorra 2001
Andorra 2002
Andorra 2003
Andorra 2004
Andorra 2005
Andorra 2006
Andorra 2007
Andorra 2008
Andorra 2009
Andorra 2010
Andorra 2011
Andorra 2012
Andorra 2013
Andorra 2014
Andorra 2015
Andorra 2016
Andorra 2017
Antigua and Barbuda 2001
Antigua and Barbuda 2002
Antigua and Barbuda 2003
Antigua and Barbuda 2004
Antigua and Barbuda 2005
Antigua and Barbuda 2006
Antigua and Barbuda 2007
Antigua and Barbuda 2008
Antigua and Barbuda 2009
Antigua and Barbuda 2010
Antigua and Barbuda 2011
Antigua and Barbuda 2012
Antigua and Barbuda 2013
Antigua and Barbuda 2014
Antigua and Barbuda 2015
Antigua and Barbuda 2016
Antigua and Barbuda 2017
Bermuda 2001
B

Life expectancy missing data 

** figure out how to handle these

all of these are also missing for the unemployment columns

In [99]:
life_exp = df['Life Expectancy World Bank'].isna()
for i, r in df[life_exp].iterrows():
    print(r['Country Name'], r['Year'])

American Samoa 2001
American Samoa 2002
American Samoa 2003
American Samoa 2004
American Samoa 2005
American Samoa 2006
American Samoa 2007
American Samoa 2008
American Samoa 2009
American Samoa 2010
American Samoa 2011
American Samoa 2012
American Samoa 2013
American Samoa 2014
American Samoa 2015
American Samoa 2016
American Samoa 2017
Andorra 2001
Andorra 2002
Andorra 2003
Andorra 2004
Andorra 2005
Andorra 2006
Andorra 2007
Andorra 2008
Andorra 2009
Andorra 2010
Andorra 2011
Andorra 2012
Andorra 2013
Andorra 2014
Andorra 2015
Andorra 2016
Andorra 2017
Dominica 2001
Dominica 2003
Dominica 2004
Dominica 2005
Dominica 2006
Dominica 2007
Dominica 2008
Dominica 2009
Dominica 2010
Dominica 2011
Dominica 2012
Dominica 2013
Dominica 2014
Dominica 2015
Dominica 2016
Dominica 2017
Marshall Islands 2001
Marshall Islands 2002
Marshall Islands 2003
Marshall Islands 2004
Marshall Islands 2005
Marshall Islands 2006
Marshall Islands 2007
Marshall Islands 2008
Marshall Islands 2009
Marshall Islands 

Healthcare expenditure % missing data

In [100]:
health = df['Health Expenditure %'].isna()
for i, r in df[health].iterrows():
    print(r['Country Name'], r['Year'])

Afghanistan 2001
American Samoa 2001
American Samoa 2002
American Samoa 2003
American Samoa 2004
American Samoa 2005
American Samoa 2006
American Samoa 2007
American Samoa 2008
American Samoa 2009
American Samoa 2010
American Samoa 2011
American Samoa 2012
American Samoa 2013
American Samoa 2014
American Samoa 2015
American Samoa 2016
American Samoa 2017
Bermuda 2001
Bermuda 2002
Bermuda 2003
Bermuda 2004
Bermuda 2005
Bermuda 2006
Bermuda 2007
Bermuda 2008
Bermuda 2009
Bermuda 2010
Bermuda 2011
Bermuda 2012
Bermuda 2013
Bermuda 2014
Bermuda 2015
Bermuda 2016
Bermuda 2017
Greenland 2001
Greenland 2002
Greenland 2003
Greenland 2004
Greenland 2005
Greenland 2006
Greenland 2007
Greenland 2008
Greenland 2009
Greenland 2010
Greenland 2011
Greenland 2012
Greenland 2013
Greenland 2014
Greenland 2015
Greenland 2016
Greenland 2017
Guam 2001
Guam 2002
Guam 2003
Guam 2004
Guam 2005
Guam 2006
Guam 2007
Guam 2008
Guam 2009
Guam 2010
Guam 2011
Guam 2012
Guam 2013
Guam 2014
Guam 2015
Guam 2016
Guam 20

**Part 3: Exploratory Data Analysis**

