In [1]:
%matplotlib inline
import warnings
warnings.simplefilter('ignore')
import pandas as pd 
import matplotlib.pyplot as plt

I imported my human resources dataset and ran some initial investigations.

In [2]:
hr_df = pd.read_csv('../data/Raw/human_resources.csv')

In [3]:
hr_df['Country'].nunique()

154

In [4]:
hr_df['Country'].unique()

array(['Afghanistan', 'Albania', 'Angola', 'Antigua and Barbuda',
       'Argentina', 'Armenia', 'Australia', 'Azerbaijan', 'Bahrain',
       'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Bhutan',
       'Bolivia (Plurinational State of)', 'Bosnia and Herzegovina',
       'Brazil', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso',
       'Burundi', 'Cambodia', 'Canada', 'Central African Republic',
       'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Congo',
       'Cook Islands', 'Costa Rica', "Côte d'Ivoire", 'Croatia', 'Cuba',
       'Cyprus', 'Czechia', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia',
       'Eswatini', 'Ethiopia', 'Fiji', 'Finland', 'France', 'Gabon',
       'Gambia', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Grenada',
       'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti',
       'Honduras', 'Hungary', 'India', 'Indonesia',
       'Iran (Islamic Republic of)', 'Iraq', 'Israel', 'It

Upon realizing that many of the country names differed from those in my other datasets, I renamed them so that they would match those datasets in later merges.

In [5]:
hr_df['Country'] = hr_df['Country'].replace({'Bolivia (Plurinational State of)':'Bolivia', 
                      'Brunei Darussalam':'Brunei', 'Cabo Verde':'Cape Verde', 'Côte d’Ivoire':"Cote d'Ivoire",
                            'Czechia':'Czech Republic', "Democratic People's Republic of Korea":'North Korea'})

In [6]:
hr_df['Country'] = hr_df['Country'].replace({'Eswatini':'Swaziland', 'Iran (Islamic Republic of)':'Iran',
                                      "Lao People's Democratic Republic":'Laos', 'Micronesia (Federated States of)':'Micronesia (country)',
                                      'Republic of Korea':'South Korea', 'Republic of Moldova':'Moldova',
                                       'Russian Federation':'Russia'})

In [7]:
hr_df['Country'] = hr_df['Country'].replace({'State of Palestine':'Palestine', 'Syrian Arab Republic':'Syria',
                                      'Timor-Leste':'Timor', 'United Kingdom of Great Britain and Northern Ireland':'United Kingdom',
                                      'United Republic of Tanzania':'Tanzania', 'United States of America':'United States',
                                      'Venezuela (Bolivarian Republic of)':'Venezuela', 'Viet Nam':'Vietnam'})
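
`Series.replace` with a dict silently skips any key that is not present in the column. A key that differs from the data by a single character (for example, a typographic apostrophe where the data has a straight one) therefore leaves the original name unchanged. Below is a minimal sketch of a sanity check on a toy frame; `toy_df`, `mapping`, and `unmatched_keys` are illustrative names, not objects from this notebook.

```python
import pandas as pd

# Toy stand-in for hr_df; the real notebook reads this column from a CSV.
toy_df = pd.DataFrame({"Country": ["Czechia", "Côte d'Ivoire", "Viet Nam"]})

# Hypothetical mapping; the second key uses a curly apostrophe (U+2019),
# so it does not match the straight apostrophe in the data.
mapping = {
    "Czechia": "Czech Republic",
    "Côte d\u2019Ivoire": "Cote d'Ivoire",
    "Viet Nam": "Vietnam",
}

# Keys that never occur in the column are no-ops for replace().
present = set(toy_df["Country"].unique())
unmatched_keys = [key for key in mapping if key not in present]

toy_df["Country"] = toy_df["Country"].replace(mapping)
print(unmatched_keys)
print(toy_df["Country"].tolist())
```

An unmatched key is not necessarily a mistake, since the country may simply be absent from this file, but each one is worth a quick look.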
    

I then imported the continents dataset and merged it with the human resources dataset.

In [8]:
continents_df = pd.read_csv('../data/Raw/countries_continents2.csv')

In [9]:
continents_df.head()

Unnamed: 0.1,Unnamed: 0,Country,Region,Continent
0,0,Afghanistan,Southern Asia,Asia
1,1,Åland Islands,Northern Europe,Europe
2,2,Albania,Southern Europe,Europe
3,3,Algeria,Northern Africa,Africa
4,4,American Samoa,Polynesia,Oceania


In [10]:
cr_df = hr_df.merge(continents_df, on='Country')
cr_df.head()

Unnamed: 0.1,Country,Year,Psychiatrists working in mental health sector (per 100 000 population),Nurses working in mental health sector (per 100 000 population),Social workers working in mental health sector (per 100 000 population),Psychologists working in mental health sector (per 100 000 population),Unnamed: 0,Region,Continent
0,Afghanistan,2016,0.231,0.098,,0.296,0,Southern Asia,Asia
1,Albania,2016,1.471,6.876,1.06,1.231,2,Southern Europe,Europe
2,Angola,2016,0.057,0.66,0.022,0.179,6,Middle Africa,Africa
3,Antigua and Barbuda,2016,1.001,7.005,4.003,,9,Caribbean,North America
4,Argentina,2016,21.705,,,222.572,10,South America,South America


In [11]:
cr_df['Country'].unique()

array(['Afghanistan', 'Albania', 'Angola', 'Antigua and Barbuda',
       'Argentina', 'Armenia', 'Australia', 'Azerbaijan', 'Bahrain',
       'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Bhutan',
       'Bolivia', 'Bosnia and Herzegovina', 'Brazil', 'Brunei',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Canada',
       'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia',
       'Comoros', 'Congo', 'Cook Islands', 'Costa Rica', 'Croatia',
       'Cuba', 'Cyprus', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia',
       'Swaziland', 'Ethiopia', 'Fiji', 'Finland', 'France', 'Gabon',
       'Gambia', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Grenada',
       'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti',
       'Honduras', 'Hungary', 'India', 'Indonesia', 'Iran', 'Iraq',
       'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan', 'Kenya',
       'Kiribati', 'Kyrgyzstan', 'Latvia', 'Leb

In [12]:
cr_df['Year'].unique()

array([2016, 2015, 2017, 2013, 2014])

In [13]:
cr_df['Country'].nunique()

151
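
The country count falls from 154 before the merge to 151 after it. This is presumably because `merge` defaults to an inner join, which drops rows with no match in the continents table. One way to see which names were dropped is a left join with `indicator=True`. The sketch below uses toy frames; `left`, `right`, and the country `Atlantis` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for hr_df and continents_df (illustrative values only).
left = pd.DataFrame({"Country": ["Albania", "Congo", "Atlantis"]})
right = pd.DataFrame({
    "Country": ["Albania", "Congo"],
    "Continent": ["Europe", "Africa"],
})

# how="left" keeps every row of `left`; indicator=True adds a `_merge`
# column that reads "left_only" where no continent row matched.
checked = left.merge(right, on="Country", how="left", indicator=True)
dropped = checked.loc[checked["_merge"] == "left_only", "Country"].tolist()
print(dropped)
```

Names that show up in `dropped` are candidates for another entry in the renaming dictionaries.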

In [14]:
cr_df.shape

(151, 9)

In [15]:
cr_df['Year'].value_counts()

2016    106
2017     22
2015     20
2013      2
2014      1
Name: Year, dtype: int64

Based on the above investigations, this dataset covers a short range of years (2013 to 2017) that is not evenly distributed: 106 of the 151 rows are from 2016. Many countries also appear to be missing from this dataset, which has only 151 rows, one per country.

I then renamed the columns to make them easier to manipulate.

In [16]:
cr_df = cr_df.rename(columns={'Psychiatrists working in mental health sector (per 100 000 population)':'Psychiatrists_per100000',
                                                      'Nurses working in mental health sector (per 100 000 population)': 'Nurses_per100000',
                                                     'Social workers working in mental health sector (per 100 000 population)':'Social_workers_per100000',
                                                     'Psychologists working in mental health sector (per 100 000 population)':'Psychologists_per100000'})


Earlier, a sample of the dataset showed that some rows were missing values. I created a filter for each of the four human resource types in the dataset and used them to count how many rows had data for each type and how many had data for all four. Each filter keeps rows with a value greater than zero, which excludes missing values.

In [17]:
psychiatrists=cr_df['Psychiatrists_per100000']>0
nurses=cr_df['Nurses_per100000']>0
social_workers=cr_df['Social_workers_per100000']>0
psychologists=cr_df['Psychologists_per100000']>0

In [18]:
no_missing_data_df=cr_df[psychiatrists & nurses & social_workers & psychologists]
no_missing_data_df.shape

(72, 9)

In [19]:
# 72 rows have values for all four resource types
no_missing_data_df.head()

Unnamed: 0.1,Country,Year,Psychiatrists_per100000,Nurses_per100000,Social_workers_per100000,Psychologists_per100000,Unnamed: 0,Region,Continent
1,Albania,2016,1.471,6.876,1.06,1.231,2,Southern Europe,Europe
2,Angola,2016,0.057,0.66,0.022,0.179,6,Middle Africa,Africa
5,Armenia,2016,3.84,11.245,0.274,0.788,11,Western Asia,Asia
7,Azerbaijan,2016,3.452,6.717,0.114,1.165,15,Western Asia,Asia
8,Bahrain,2017,5.467,27.918,1.458,1.239,17,Western Asia,Asia


In [20]:
have_psychiatrists=cr_df[psychiatrists].shape[0]
have_psychiatrists
# 142 rows have data on psychiatrists

142

In [21]:
have_social_workers=cr_df[social_workers].shape[0]
have_social_workers
# 97 rows have data on social workers

97

In [22]:
have_nurses=cr_df[nurses].shape[0]
have_nurses
#124 data points have info about nurses

124

In [23]:
have_psychologists=cr_df[psychologists].shape[0]
have_psychologists
#117 data points have info about psychologists

117
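
The `> 0` filters count a row as having data only when the value is positive, so a missing value (`NaN`) and a reported zero are both excluded. If the goal is strictly to count non-missing entries, `DataFrame.count()` is a possible shortcut. Below is a sketch with made-up numbers; `toy` is not a frame from this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Psychiatrists_per100000": [0.231, 1.471, 0.0],
    "Nurses_per100000": [0.098, None, 7.005],
})

# count() ignores NaN but keeps zeros.
non_missing = toy.count()

# (toy > 0) is False for both NaN and zero, like the filters in cell 17.
positive = (toy > 0).sum()

# Number of rows in which every column is present.
complete_rows = toy.dropna().shape[0]
print(non_missing.to_dict(), positive.to_dict(), complete_rows)
```

The two counts agree whenever the data contains no zeros, which would need to be checked on the real file.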

I then imported the population dataset and merged it with the resources dataset. 

In [24]:
pop_df = pd.read_csv('../data/Cleaned/world_pop_long')
pop_df.head()

Unnamed: 0,code,Location,Average,variable,value
0,900,World,6472563.0,1990,5327231.0
1,901,More developed regions,1207273.0,1990,1145508.0
2,902,Less developed regions,5265290.0,1990,4181723.0
3,941,Least developed countries,737016.9,1990,506276.0
4,934,"Less developed regions, excluding least develo...",4528273.0,1990,3675448.0


In [25]:
rp_df = cr_df.merge(pop_df, left_on=['Year', 'Country'], right_on=['variable', 'Location'])
rp_df.head()

Unnamed: 0.1,Country,Year,Psychiatrists_per100000,Nurses_per100000,Social_workers_per100000,Psychologists_per100000,Unnamed: 0,Region,Continent,code,Location,Average,variable,value
0,Afghanistan,2016,0.231,0.098,,0.296,0,Southern Asia,Asia,4,Afghanistan,24737.62069,2016,35383.0
1,Albania,2016,1.471,6.876,1.06,1.231,2,Southern Europe,Europe,8,Albania,3055.275862,2016,2886.0
2,Angola,2016,0.057,0.66,0.022,0.179,6,Middle Africa,Africa,24,Angola,19765.931034,2016,28842.0
3,Antigua and Barbuda,2016,1.001,7.005,4.003,,9,Caribbean,North America,28,Antigua and Barbuda,80.137931,2016,95.0
4,Argentina,2016,21.705,,,222.572,10,South America,South America,32,Argentina,38496.068966,2016,43508.0


No countries were lost in the merge.

In [26]:
rp_df['Country'].nunique()

151

I then created a column holding the actual population, since the source values were expressed in thousands.

In [27]:
rp_df['Population']=rp_df['value']*1000
rp_df.head()

Unnamed: 0.1,Country,Year,Psychiatrists_per100000,Nurses_per100000,Social_workers_per100000,Psychologists_per100000,Unnamed: 0,Region,Continent,code,Location,Average,variable,value,Population
0,Afghanistan,2016,0.231,0.098,,0.296,0,Southern Asia,Asia,4,Afghanistan,24737.62069,2016,35383.0,35383000.0
1,Albania,2016,1.471,6.876,1.06,1.231,2,Southern Europe,Europe,8,Albania,3055.275862,2016,2886.0,2886000.0
2,Angola,2016,0.057,0.66,0.022,0.179,6,Middle Africa,Africa,24,Angola,19765.931034,2016,28842.0,28842000.0
3,Antigua and Barbuda,2016,1.001,7.005,4.003,,9,Caribbean,North America,28,Antigua and Barbuda,80.137931,2016,95.0,95000.0
4,Argentina,2016,21.705,,,222.572,10,South America,South America,32,Argentina,38496.068966,2016,43508.0,43508000.0


I then created columns holding the absolute number of workers of each type in each country.

In [28]:
rp_df['#psychiatrists']=rp_df['Psychiatrists_per100000'] * rp_df['value']/100
rp_df['#nurses']=rp_df['Nurses_per100000'] * rp_df['value']/100
rp_df['#social_workers']=rp_df['Social_workers_per100000'] * rp_df['value']/100
rp_df['#psychologists']=rp_df['Psychologists_per100000'] * rp_df['value']/100
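
The division by 100 follows from the units. The rates are per 100 000 people and `value` holds the population in thousands, so count = rate × (value × 1000) / 100 000 = rate × value / 100. Below is a quick check of that arithmetic using the Afghanistan figures shown earlier (0.231 psychiatrists per 100 000, population 35383 thousand); `headcount` is a helper name introduced only for this example.

```python
import math

def headcount(rate_per_100k, population_thousands):
    """Convert a per-100 000 rate into an absolute count."""
    population = population_thousands * 1000
    return rate_per_100k * population / 100_000

# Long form: scale the population up, then apply the rate.
afghanistan_psychiatrists = headcount(0.231, 35383.0)

# Short form used in the notebook: rate * value / 100.
shortcut = 0.231 * 35383.0 / 100

print(afghanistan_psychiatrists, shortcut)
print(math.isclose(afghanistan_psychiatrists, shortcut))
```

Both forms give roughly 81.7 psychiatrists for the whole country.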

Next, I created a variable that represented the total resources in a country.

In [29]:
rp_df['Total_resources']=rp_df['#psychiatrists']+rp_df['#nurses']+rp_df['#social_workers']+rp_df['#psychologists']
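
Because `+` propagates `NaN`, `Total_resources` is missing for any country that lacks even one of the four counts. If a partial total is acceptable for the analysis (an assumption, not something this notebook states), a row-wise `sum` with `min_count=1` is one alternative. Below is a sketch on made-up numbers; `toy`, `strict_total`, and `partial_total` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "#psychiatrists": [42.0, 81.0, None],
    "#nurses": [198.0, None, None],
})

# `+` yields NaN as soon as one operand is NaN.
strict_total = toy["#psychiatrists"] + toy["#nurses"]

# sum(axis=1) skips NaN; min_count=1 keeps the total NaN only when
# every component in the row is missing.
partial_total = toy.sum(axis=1, min_count=1)
print(strict_total.tolist(), partial_total.tolist())
```

A partial total understates the workforce for countries with missing categories, so it would need to be labeled as such in any later comparison.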

In [30]:
rp_df.head()

Unnamed: 0.1,Country,Year,Psychiatrists_per100000,Nurses_per100000,Social_workers_per100000,Psychologists_per100000,Unnamed: 0,Region,Continent,code,Location,Average,variable,value,Population,#psychiatrists,#nurses,#social_workers,#psychologists,Total_resources
0,Afghanistan,2016,0.231,0.098,,0.296,0,Southern Asia,Asia,4,Afghanistan,24737.62069,2016,35383.0,35383000.0,81.73473,34.67534,,104.73368,
1,Albania,2016,1.471,6.876,1.06,1.231,2,Southern Europe,Europe,8,Albania,3055.275862,2016,2886.0,2886000.0,42.45306,198.44136,30.5916,35.52666,307.01268
2,Angola,2016,0.057,0.66,0.022,0.179,6,Middle Africa,Africa,24,Angola,19765.931034,2016,28842.0,28842000.0,16.43994,190.3572,6.34524,51.62718,264.76956
3,Antigua and Barbuda,2016,1.001,7.005,4.003,,9,Caribbean,North America,28,Antigua and Barbuda,80.137931,2016,95.0,95000.0,0.95095,6.65475,3.80285,,
4,Argentina,2016,21.705,,,222.572,10,South America,South America,32,Argentina,38496.068966,2016,43508.0,43508000.0,9443.4114,,,96836.62576,


I then dropped unnecessary columns.

In [31]:
rp3_df = rp_df.drop(columns=['Unnamed: 0', 'code', 'Location', 'Average', 'variable', 'value'])

In [32]:
rp3_df.head()

Unnamed: 0,Country,Year,Psychiatrists_per100000,Nurses_per100000,Social_workers_per100000,Psychologists_per100000,Region,Continent,Population,#psychiatrists,#nurses,#social_workers,#psychologists,Total_resources
0,Afghanistan,2016,0.231,0.098,,0.296,Southern Asia,Asia,35383000.0,81.73473,34.67534,,104.73368,
1,Albania,2016,1.471,6.876,1.06,1.231,Southern Europe,Europe,2886000.0,42.45306,198.44136,30.5916,35.52666,307.01268
2,Angola,2016,0.057,0.66,0.022,0.179,Middle Africa,Africa,28842000.0,16.43994,190.3572,6.34524,51.62718,264.76956
3,Antigua and Barbuda,2016,1.001,7.005,4.003,,Caribbean,North America,95000.0,0.95095,6.65475,3.80285,,
4,Argentina,2016,21.705,,,222.572,South America,South America,43508000.0,9443.4114,,,96836.62576,


The cleaned dataset was exported.

In [33]:
rp3_df.to_csv('../data/Cleaned/resources_clean', index=False)