This notebook is dedicated to investigating and cleaning the anxiety dataset.

In [1]:
%matplotlib inline
import warnings
warnings.simplefilter('ignore')
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
anxiety_df = pd.read_csv('../data/Raw/with-anxiety-disorders.csv')

In [3]:
anxiety_df.shape

(6468, 4)

In [4]:
anxiety_df.head()

Unnamed: 0,Entity,Code,Year,Prevalence - Anxiety disorders - Sex: Both - Age: Age-standardized (Percent) (%)
0,Afghanistan,AFG,1990,4.82883
1,Afghanistan,AFG,1991,4.82974
2,Afghanistan,AFG,1992,4.831108
3,Afghanistan,AFG,1993,4.830864
4,Afghanistan,AFG,1994,4.829423


In [5]:
anxiety_df['Year'].unique()

array([1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
       2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
       2012, 2013, 2014, 2015, 2016, 2017])

In [6]:
anxiety_df['Entity'].nunique()

231

After making some observations about what is included in the data, I rename the long prevalence column to make it easier to manipulate.

In [7]:
anxiety_df = anxiety_df.rename(columns={'Prevalence - Anxiety disorders - Sex: Both - Age: Age-standardized (Percent) (%)':'Anxiety_percent'})

Since the dataset lacks information about which continent each country belongs to, I load a second dataset that lists countries with their regions and continents.

In [8]:
continents_df = pd.read_csv('../data/Raw/countries_continents2.csv')
continents_df.head()

Unnamed: 0.1,Unnamed: 0,Country,Region,Continent
0,0,Afghanistan,Southern Asia,Asia
1,1,Åland Islands,Northern Europe,Europe
2,2,Albania,Southern Europe,Europe
3,3,Algeria,Northern Africa,Africa
4,4,American Samoa,Polynesia,Oceania


In [9]:
continents_df['Country'].unique()

array(['Afghanistan', 'Åland Islands', 'Albania', 'Algeria',
       'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba',
       'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain',
       'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin',
       'Bermuda', 'Bhutan', 'Bolivia', 'Bonaire, Sint Eustatius and Saba',
       'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil',
       'British Indian Ocean Territory', 'British Virgin Islands',
       'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cape Verde',
       'Cambodia', 'Cameroon', 'Canada', 'Cayman Islands',
       'Central African Republic', 'Chad', 'Chile', 'China',
       'China, Hong Kong Special Administrative Region',
       'China, Macao Special Administrative Region', 'Christmas Island',
       'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo',
       'Cook Islands', 'Costa Rica', "Cote d'Ivoire", 'Croatia'

I then merge this dataset with the anxiety dataset and check how many countries matched.

In [10]:
merged_df = anxiety_df.merge(continents_df, left_on='Entity', right_on='Country')
merged_df.shape

(5348, 8)

In [11]:
merged_df['Entity'].nunique()

191

I am specifically interested in exploring anxiety rates by country. Each source dataset contains more locations than survive the merge, because the default inner merge keeps only names that appear in both. The merged dataset still has 191 unique entities, which is roughly the number of countries in the world, so I am satisfied that this is enough for analysis, recognizing that some data points are being excluded.
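
One way to see exactly which entities drop out of a merge like this is pandas' `indicator` option. The sketch below is only an illustration on made-up frames (`toy_anxiety` and `toy_continents` are hypothetical stand-ins, not the real data):

```python
import pandas as pd

# Hypothetical stand-ins for anxiety_df and continents_df
toy_anxiety = pd.DataFrame({
    "Entity": ["Afghanistan", "Albania", "World", "United States"],
    "Year": [1990, 1990, 1990, 1990],
})
toy_continents = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "United States of America"],
    "Continent": ["Asia", "Europe", "North America"],
})

# A left merge keeps every anxiety row; the _merge column records
# whether a matching country name was found in the continents table
check = toy_anxiety.merge(
    toy_continents,
    left_on="Entity",
    right_on="Country",
    how="left",
    indicator=True,
)

# Entities that would be lost in the default inner merge
unmatched = sorted(check.loc[check["_merge"] == "left_only", "Entity"].unique())
print(unmatched)
```

Running the same check on the real frames would list the entities that are excluded, which makes it easier to tell aggregates (such as regions) apart from countries whose names are merely spelled differently.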

I then examine the mean, min, and max values for anxiety per continent:

In [12]:
merged_df.groupby('Continent')['Anxiety_percent'].agg(['mean', 'min','max'])

Continent,mean,min,max
Africa,3.427705,2.835318,5.072361
Asia,3.777915,2.023393,7.174615
Europe,4.461878,2.867995,7.6808
North America,4.29027,2.799207,6.971995
Oceania,3.868394,3.150086,8.96733
South America,4.837711,2.512714,6.39738


To further analyze the data, I imported a dataset that includes the population of each country for every year represented in the anxiety dataset.

In [13]:
pop_df = pd.read_csv('../data/Cleaned/world_pop_1000s_clean')
pop_df.shape

(285, 32)

In [14]:
pop_df.head()

Unnamed: 0,code,Location,1990,1991,1992,1993,1994,1995,1996,1997,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,Average
0,900,World,5327231.0,5414289.0,5498920.0,5581598.0,5663150.0,5744213.0,5824892.0,5905046.0,...,6956824.0,7041194.0,7125828.0,7210582.0,7295291.0,7379797.0,7464022.0,7547859.0,7631091.0,6472563.0
1,901,More developed regions,1145508.0,1150893.0,1155966.0,1160737.0,1165232.0,1169481.0,1173490.0,1177284.0,...,1234768.0,1239557.0,1244115.0,1248454.0,1252615.0,1256622.0,1260479.0,1264146.0,1267559.0,1207273.0
2,902,Less developed regions,4181723.0,4263396.0,4342953.0,4420861.0,4497919.0,4574731.0,4651402.0,4727762.0,...,5722056.0,5801637.0,5881713.0,5962129.0,6042676.0,6123175.0,6203543.0,6283713.0,6363532.0,5265290.0
3,941,Least developed countries,506276.0,520262.0,534731.0,549559.0,564581.0,579682.0,594801.0,609983.0,...,836615.0,856471.0,876867.0,897793.0,919223.0,941131.0,963520.0,986385.0,1009691.0,737016.9
4,934,"Less developed regions, excluding least develo...",3675448.0,3743134.0,3808222.0,3871301.0,3933338.0,3995050.0,4056601.0,4117778.0,...,4885441.0,4945165.0,5004846.0,5064335.0,5123453.0,5182043.0,5240024.0,5297327.0,5353841.0,4528273.0


In [15]:
merged_df['Entity'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra',
       'Angola', 'Antigua and Barbuda', 'Argentina', 'Armenia',
       'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain',
       'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin',
       'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'Brunei', 'Bulgaria', 'Burkina Faso',
       'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde',
       'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia',
       'Comoros', 'Congo', 'Costa Rica', "Cote d'Ivoire", 'Croatia',
       'Cuba', 'Cyprus', 'Denmark', 'Djibouti', 'Dominica',
       'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador',
       'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Fiji',
       'Finland', 'France', 'Gabon', 'Gambia', 'Georgia', 'Germany',
       'Ghana', 'Greece', 'Greenland', 'Grenada', 'Guam', 'Guatemala',
       'Guinea', 'Guinea-Bissau', 'Guyana', 'Hai

In [18]:
pop_df['Location'].unique()

array(['World', 'More developed regions', 'Less developed regions',
       'Least developed countries',
       'Less developed regions, excluding least developed countries',
       'Less developed regions, excluding China',
       'Land-locked Developing Countries (LLDC)',
       'Small Island Developing States (SIDS)', 'High-income countries',
       'Middle-income countries', 'Upper-middle-income countries',
       'Lower-middle-income countries', 'Low-income countries',
       'No income group available', 'Africa', 'Asia', 'Europe',
       'Latin America and the Caribbean', 'Northern America', 'Oceania',
       'Sub-Saharan Africa', 'Eastern Africa', 'Burundi', 'Comoros',
       'Djibouti', 'Eritrea', 'Ethiopia', 'Kenya', 'Madagascar', 'Malawi',
       'Mauritius', 'Mayotte', 'Mozambique', 'Réunion', 'Rwanda',
       'Seychelles', 'Somalia', 'South Sudan', 'Uganda', 'Tanzania',
       'Zambia', 'Zimbabwe', ' Middle Africa', 'Angola', 'Cameroon',
       'Central African Republic', 'C

After importing and inspecting the population dataset, I realized that some of its country names differed from those in my anxiety dataset (for example, one had United States and the other had United States of America). I went back into the population dataset and renamed the countries so that they match (see XXXX).
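
The actual renaming was done outside this notebook. As a rough sketch of how such a fix could look in pandas, the example below assumes, purely for illustration, that the population file used the longer name; `toy_pop` and `name_map` are hypothetical, and the numbers are placeholders:

```python
import pandas as pd

# Hypothetical population table; values are placeholders, not real data
toy_pop = pd.DataFrame({
    "Location": ["United States of America", " Middle Africa", "Kenya"],
    "1990": [1.0, 2.0, 3.0],
})

# Strip stray whitespace first (the real data contains ' Middle Africa')
toy_pop["Location"] = toy_pop["Location"].str.strip()

# Map population-file names onto the names used in the anxiety data;
# names that are not in the mapping are left unchanged
name_map = {"United States of America": "United States"}
toy_pop["Location"] = toy_pop["Location"].replace(name_map)

print(toy_pop["Location"].tolist())
```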

The population dataset was in wide format: each country had a single row, with one column per year. In the anxiety dataset, however, year was a variable and each country had one row per year. Therefore, I melted the population dataset so that each country-year combination has its own row.

In [19]:
pop_long_df=pop_df.melt(id_vars=['code','Location','Average'])
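
To make the reshaping concrete, here is a small sketch of the same `melt` call on a made-up wide table (`toy_wide` is hypothetical and its values are placeholders):

```python
import pandas as pd

# Hypothetical wide table: one row per location, one column per year
toy_wide = pd.DataFrame({
    "code": [1, 2],
    "Location": ["A", "B"],
    "Average": [15.0, 35.0],
    "1990": [10.0, 30.0],
    "1991": [20.0, 40.0],
})

# Columns not listed in id_vars are stacked into two new columns:
# 'variable' (the old column name) and 'value' (the cell contents)
toy_long = toy_wide.melt(id_vars=["code", "Location", "Average"])
print(toy_long)
```

Each of the two locations now has one row per year, and the year labels end up as strings in the `variable` column because they started out as column names.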

I then ran a few checks to make sure everything had transferred correctly.

In [20]:
pop_df['Location'].unique()

array(['World', 'More developed regions', 'Less developed regions',
       'Least developed countries',
       'Less developed regions, excluding least developed countries',
       'Less developed regions, excluding China',
       'Land-locked Developing Countries (LLDC)',
       'Small Island Developing States (SIDS)', 'High-income countries',
       'Middle-income countries', 'Upper-middle-income countries',
       'Lower-middle-income countries', 'Low-income countries',
       'No income group available', 'Africa', 'Asia', 'Europe',
       'Latin America and the Caribbean', 'Northern America', 'Oceania',
       'Sub-Saharan Africa', 'Eastern Africa', 'Burundi', 'Comoros',
       'Djibouti', 'Eritrea', 'Ethiopia', 'Kenya', 'Madagascar', 'Malawi',
       'Mauritius', 'Mayotte', 'Mozambique', 'Réunion', 'Rwanda',
       'Seychelles', 'Somalia', 'South Sudan', 'Uganda', 'Tanzania',
       'Zambia', 'Zimbabwe', ' Middle Africa', 'Angola', 'Cameroon',
       'Central African Republic', 'C

In [21]:
pop_long_df['Location'].value_counts()

Northern America                                 58
Latin America and the Caribbean                  58
Europe                                           58
Congo                                            58
Dominican Republic                               29
 Middle Africa                                   29
Marshall Islands                                 29
Trinidad and Tobago                              29
Namibia                                          29
Colombia                                         29
United States Virgin Islands                     29
Northern Africa and Western Asia                 29
Hungary                                          29
Kazakhstan                                       29
China, Taiwan Province of China                  29
Nicaragua                                        29
Least developed countries                        29
Timor                                            29
Bangladesh                                       29
Oceania (exc

In [22]:
29*285

8265
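
With 29 year columns and 285 rows in the wide table, the long table should have 8265 rows. The counts of 58 shown earlier suggest that a few names (Congo, for example) may appear on two rows of the wide table. The sketch below shows one way to check both things on made-up data (`toy_wide` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical wide table in which one name appears on two rows
toy_wide = pd.DataFrame({
    "Location": ["Congo", "Congo", "Kenya"],
    "1990": [1.0, 2.0, 3.0],
    "1991": [1.5, 2.5, 3.5],
})
n_years = 2  # number of year columns in this toy table

toy_long = toy_wide.melt(id_vars=["Location"])

# After melting, the row count should be years x original rows
assert len(toy_long) == n_years * len(toy_wide)

# Any name with more rows than there are years is repeated in the source
counts = toy_long["Location"].value_counts()
repeated = sorted(counts[counts > n_years].index)
print(repeated)
```

Repeated names matter later on, because merging on a name that is not unique produces more than one match per country-year.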

In [23]:
pop_long_df['Location'].unique()

array(['World', 'More developed regions', 'Less developed regions',
       'Least developed countries',
       'Less developed regions, excluding least developed countries',
       'Less developed regions, excluding China',
       'Land-locked Developing Countries (LLDC)',
       'Small Island Developing States (SIDS)', 'High-income countries',
       'Middle-income countries', 'Upper-middle-income countries',
       'Lower-middle-income countries', 'Low-income countries',
       'No income group available', 'Africa', 'Asia', 'Europe',
       'Latin America and the Caribbean', 'Northern America', 'Oceania',
       'Sub-Saharan Africa', 'Eastern Africa', 'Burundi', 'Comoros',
       'Djibouti', 'Eritrea', 'Ethiopia', 'Kenya', 'Madagascar', 'Malawi',
       'Mauritius', 'Mayotte', 'Mozambique', 'Réunion', 'Rwanda',
       'Seychelles', 'Somalia', 'South Sudan', 'Uganda', 'Tanzania',
       'Zambia', 'Zimbabwe', ' Middle Africa', 'Angola', 'Cameroon',
       'Central African Republic', 'C

Melting left the year labels in the `variable` column as strings, so I converted them to integers to match the `Year` column of the anxiety dataset.

In [24]:
pop_long_df['variable']=pop_long_df['variable'].astype('int')
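
A quick way to confirm the conversion is to inspect the column's dtype before and after. This is a minimal sketch on a made-up frame (`toy_long` is hypothetical):

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Hypothetical long table whose year labels are still strings
toy_long = pd.DataFrame({"variable": ["1990", "1991"], "value": [1.0, 2.0]})

was_integer = is_integer_dtype(toy_long["variable"])

# Same conversion as in the cell above
toy_long["variable"] = toy_long["variable"].astype("int")

is_integer = is_integer_dtype(toy_long["variable"])
print(was_integer, is_integer)
```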

I was finally able to merge my anxiety dataset with the population dataset on year and country name, and found that 189 of the geographic entities matched.

In [25]:
ap_df = merged_df.merge(pop_long_df, left_on=['Year', 'Entity'], right_on=['variable', 'Location'])
ap_df['Entity'].nunique()

189
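
Because a merge silently duplicates rows when the right-hand keys are not unique, it can be worth asking pandas to verify the relationship. The sketch below uses the `validate` argument of `merge` on made-up frames (`left` and `right` are hypothetical stand-ins for `merged_df` and `pop_long_df`):

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical stand-ins; 'Congo' appears twice for the same year on the right
left = pd.DataFrame({"Entity": ["Congo", "Kenya"], "Year": [1990, 1990]})
right = pd.DataFrame({
    "Location": ["Congo", "Congo", "Kenya"],
    "variable": [1990, 1990, 1990],
    "value": [1.0, 2.0, 3.0],
})

# Without validation the merge succeeds, but Congo is matched twice
plain = left.merge(
    right, left_on=["Year", "Entity"], right_on=["variable", "Location"]
)

# With validate="many_to_one", pandas raises if right-hand keys repeat
try:
    left.merge(
        right,
        left_on=["Year", "Entity"],
        right_on=["variable", "Location"],
        validate="many_to_one",
    )
    duplicate_keys = False
except MergeError:
    duplicate_keys = True

print(len(plain), duplicate_keys)
```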

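I previewed renaming the `value` column to `Population`. Because the result is not assigned back, `ap_df` itself is unchanged and still has a `value` column; the values shown are still in thousands.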

In [27]:
ap_df.rename(columns={'value' : 'Population'})

Unnamed: 0.1,Entity,Code,Year,Anxiety_percent,Unnamed: 0,Country,Region,Continent,code,Location,Average,variable,Population
0,Afghanistan,AFG,1990,4.828830,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1990,12412.0
1,Afghanistan,AFG,1991,4.829740,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1991,13299.0
2,Afghanistan,AFG,1992,4.831108,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1992,14486.0
3,Afghanistan,AFG,1993,4.830864,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1993,15817.0
4,Afghanistan,AFG,1994,4.829423,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1994,17076.0
5,Afghanistan,AFG,1995,4.828337,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1995,18111.0
6,Afghanistan,AFG,1996,4.828083,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1996,18853.0
7,Afghanistan,AFG,1997,4.827726,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1997,19357.0
8,Afghanistan,AFG,1998,4.826971,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1998,19738.0
9,Afghanistan,AFG,1999,4.826413,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1999,20171.0
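
As a small illustration of why the preview does not change `ap_df`, the sketch below shows that `rename` returns a new DataFrame by default (`toy` is a hypothetical frame):

```python
import pandas as pd

# Hypothetical frame with a 'value' column, as produced by melt
toy = pd.DataFrame({"value": [12412.0, 13299.0]})

# rename returns a renamed copy; toy keeps its original column name
preview = toy.rename(columns={"value": "Population"})

print(list(toy.columns), list(preview.columns))
```

To make such a rename permanent, the result would have to be assigned back to a variable.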


The population dataset gave the population in thousands. I multiplied the `value` column by 1,000 and stored the result in a new `Population` column, so that it holds the number of people.

In [28]:
ap_df['Population']=ap_df['value']*1000

In [29]:
ap_df.head()

Unnamed: 0.1,Entity,Code,Year,Anxiety_percent,Unnamed: 0,Country,Region,Continent,code,Location,Average,variable,value,Population
0,Afghanistan,AFG,1990,4.82883,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.62069,1990,12412.0,12412000.0
1,Afghanistan,AFG,1991,4.82974,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.62069,1991,13299.0,13299000.0
2,Afghanistan,AFG,1992,4.831108,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.62069,1992,14486.0,14486000.0
3,Afghanistan,AFG,1993,4.830864,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.62069,1993,15817.0,15817000.0
4,Afghanistan,AFG,1994,4.829423,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.62069,1994,17076.0,17076000.0


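I created a new variable giving, for each row, the estimated number of people with an anxiety disorder: the prevalence percentage divided by 100, multiplied by the population. Because the prevalence is age-standardized, this is an approximation rather than an exact count.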

In [30]:
ap_df['Anxiety_population']=ap_df['Anxiety_percent']/100 * ap_df['Population']
ap_df.head()

Unnamed: 0.1,Entity,Code,Year,Anxiety_percent,Unnamed: 0,Country,Region,Continent,code,Location,Average,variable,value,Population,Anxiety_population
0,Afghanistan,AFG,1990,4.82883,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.62069,1990,12412.0,12412000.0,599354.342985
1,Afghanistan,AFG,1991,4.82974,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.62069,1991,13299.0,13299000.0,642307.172127
2,Afghanistan,AFG,1992,4.831108,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.62069,1992,14486.0,14486000.0,699834.357854
3,Afghanistan,AFG,1993,4.830864,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.62069,1993,15817.0,15817000.0,764097.693301
4,Afghanistan,AFG,1994,4.829423,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.62069,1994,17076.0,17076000.0,824672.301042
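
As a worked check of the two steps above, the arithmetic for the first row (Afghanistan, 1990) can be reproduced by hand. The inputs are the rounded values displayed in the tables, so the result only agrees with the table to within rounding:

```python
# Values as displayed for Afghanistan, 1990
anxiety_percent = 4.82883        # prevalence in percent
population_thousands = 12412.0   # population in thousands

# Thousands -> number of people
population = population_thousands * 1000

# Percent -> fraction, then scale by the population
anxiety_population = anxiety_percent / 100 * population

print(round(anxiety_population))  # 599354
```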


I then dropped some redundant columns, renamed the `Average` column, and exported the new cleaned anxiety dataset to a CSV.

In [31]:
ap2_df = ap_df.drop(['Country', 'Location','value', 'variable', 'code'], axis=1)

In [32]:
ap3_df = ap2_df.rename(columns={'Average':'Average_pop'})

In [33]:
ap3_df.to_csv('../data/Cleaned/anxiety_clean', index=False)

I also exported the new, long population dataset to a CSV.

In [34]:
pop_long_df.to_csv('../data/Cleaned/world_pop_long', index=False)

In [35]:
ap_df

Unnamed: 0.1,Entity,Code,Year,Anxiety_percent,Unnamed: 0,Country,Region,Continent,code,Location,Average,variable,value,Population,Anxiety_population
0,Afghanistan,AFG,1990,4.828830,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1990,12412.0,12412000.0,5.993543e+05
1,Afghanistan,AFG,1991,4.829740,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1991,13299.0,13299000.0,6.423072e+05
2,Afghanistan,AFG,1992,4.831108,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1992,14486.0,14486000.0,6.998344e+05
3,Afghanistan,AFG,1993,4.830864,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1993,15817.0,15817000.0,7.640977e+05
4,Afghanistan,AFG,1994,4.829423,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1994,17076.0,17076000.0,8.246723e+05
5,Afghanistan,AFG,1995,4.828337,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1995,18111.0,18111000.0,8.744602e+05
6,Afghanistan,AFG,1996,4.828083,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1996,18853.0,18853000.0,9.102385e+05
7,Afghanistan,AFG,1997,4.827726,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1997,19357.0,19357000.0,9.345029e+05
8,Afghanistan,AFG,1998,4.826971,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1998,19738.0,19738000.0,9.527476e+05
9,Afghanistan,AFG,1999,4.826413,0,Afghanistan,Southern Asia,Asia,4,Afghanistan,24737.620690,1999,20171.0,20171000.0,9.735357e+05
