In [None]:
# Data Source: https://www.kaggle.com/worldbank/world-development-indicators
# Folder: 'world-development-indicators'

# Analysis of Healthcare Indicators of the most developed Countries

[Sauro Grandi – July 14, 2018]

The analysis of healthcare indicators of the most developed countries has been conducted on the dataset:
World Development Indicators Dataset (https://www.kaggle.com/worldbank/world-development-indicators)

For most countries, the dataset contains many indicators on different topics, including healthcare: annual health expenditure, the share of the population aged 65 and over, the number of hospital beds per 1,000 people, etc.

## Motivation
This analysis compares the most developed countries on indicators related to their healthcare systems.
After identifying significant indicators of healthcare status, the study aims to show which countries perform better than others, whether there are trends over the time frame analyzed, and whether there are hidden correlations between the indicators.
This matters because it helps to understand which healthcare systems are more efficient than others and provides elements for identifying ways to improve them.

## Research Questions
This study considers the following countries: France, Germany, Italy, UK, Japan, USA, Canada, Switzerland, Denmark, Netherlands, Belgium, Finland, Sweden, Norway, China, over the period from 2000 to 2010. The analysis aims to answer the following questions:
- Which countries have the highest health expenditure per capita? Is there a trend over the years considered?
- Which countries have more hospital beds per 1,000 people? Is there a trend over the years considered?
- Which countries have more physicians per 1,000 people? Is there a trend over the years considered?
- Is there a correlation between "Health expenditure per capita" and "Hospital beds per 1,000 people"?

## Exploration of the dataset world-development-indicators

In [None]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt

In [None]:
data = pd.read_csv('../input/Indicators.csv')
data.shape

In [None]:
# How many unique country codes and indicator names are there?
countries = data['CountryCode'].unique().tolist()
indicators = data['IndicatorName'].unique().tolist()
print('How many unique country codes and indicator names are there?')
print('Countries size: %d' % len(countries))
print('Indicators size: %d' % len(indicators))

### Indicators selection
Choice of a set of indicators related to the healthcare domain.
IndicatorName values that match:
- Health
- Hospital
- Physician
- Population ages

In [None]:
mask = data['IndicatorName'].str.contains('Health|Hospital|Physician|Population ages')
indicatorsFilter = data[mask]['IndicatorName'].unique().tolist()
#indicatorsFilter
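
The pattern passed to `str.contains` is a regular expression, so `|` means "or". The following is a minimal sketch on a hand-made Series (the names in it are illustrative, and the `None` entry is an assumption added to show missing values). Passing `na=False` makes missing names count as non-matches instead of producing NaN in the mask.

```python
import pandas as pd

# Illustrative indicator names; the None entry stands for a missing name
names = pd.Series([
    'Health expenditure per capita (current US$)',
    'Hospital beds (per 1,000 people)',
    'Physicians (per 1,000 people)',
    'GDP (current US$)',
    None,
])

# '|' separates regex alternatives; na=False maps missing names to False
mask = names.str.contains('Health|Hospital|Physician', na=False)
matched = names[mask].tolist()
print(matched)
```

Whether `Indicators.csv` actually contains missing indicator names is not checked here; `na=False` is only a defensive choice.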

Refinement of indicatorsFilter to the following indicators:
- Health expenditure per capita (current US$)
- Hospital beds (per 1,000 people)
- Physicians (per 1,000 people)

In [None]:
indicatorsFilter = ['Health expenditure per capita (current US$)',
                    'Hospital beds (per 1,000 people)', 
                    'Physicians (per 1,000 people)'
                    ]

### Countries selection
Choice of a set representing the most developed Countries. It must include: 
- G7 members + Switzerland, Denmark, Netherlands, Belgium, Finland, Sweden, Norway
- North America
- China and Japan

In [None]:
countriesFilter = ['CAN', 'FRA', 'DEU', 'IT', 'JPN', 'GBR', 'USA', 
                  'CHE', 'DNK', 'NLD', 'BEL', 'SWE', 'NOR', 'FIN']

### Years of interest
Choice of a time span from 2000 to 2010 inclusive.

In [None]:
yearsFilter = [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010]

## Data Cleaning
### Checking missing data

In [None]:
print ('No data available for the following Countries-Indicator:')
for country in countriesFilter:
    for indicator in indicatorsFilter:
        df = data[(data['CountryCode'] == country) &  (data['IndicatorName'] == indicator)]
        size = df.size
        if size == 0:
            print('Country %s, Indicator %s, size %d'% (country, indicator, size))
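
The nested loop above filters the full frame once per pair. A set of the (country, indicator) pairs that are present gives the same answer in one pass. This is a minimal sketch on an invented frame; the helper name `missing_pairs` and the values are assumptions, not part of the original notebook.

```python
import itertools
import pandas as pd

def missing_pairs(df, countries, indicators):
    """Return the (country, indicator) pairs that have no rows in df."""
    present = set(zip(df['CountryCode'], df['IndicatorName']))
    return [pair for pair in itertools.product(countries, indicators)
            if pair not in present]

# Tiny illustrative frame, not real World Bank values
toy = pd.DataFrame({
    'CountryCode': ['FRA', 'FRA', 'DEU'],
    'IndicatorName': ['Hospital beds (per 1,000 people)',
                      'Physicians (per 1,000 people)',
                      'Hospital beds (per 1,000 people)'],
    'Year': [2000, 2000, 2000],
    'Value': [1.0, 2.0, 3.0],
})

gaps = missing_pairs(
    toy,
    ['FRA', 'DEU', 'IT'],
    ['Hospital beds (per 1,000 people)', 'Physicians (per 1,000 people)'],
)
for country, indicator in gaps:
    print('Country %s, Indicator %s' % (country, indicator))
```

A country code that does not occur in the frame at all shows up as missing for every indicator.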

### Cleaning
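Unfortunately, the check finds no data for Italy (at least for the Health expenditure indicator). Note that countriesFilter uses the code 'IT', while every other entry is a three-letter code (Italy's ISO 3166-1 alpha-3 code is 'ITA'), so the empty result may come from the code rather than from a real gap in the dataset.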

In [None]:
# Let's remove Italy from the countriesFilter (China was never added to it)
countriesFilter = ['CAN', 'FRA', 'DEU', 'JPN', 'GBR', 'USA', 
                  'CHE', 'DNK', 'NLD', 'BEL', 'SWE', 'NOR', 'FIN']

### Filling

In [None]:
# Let's reduce the dataset extracting data by CountryCode and IndicatorName corresponding to our choice
filterFull = (data['CountryCode'].isin(countriesFilter)) & (data['IndicatorName'].isin(indicatorsFilter)) & (data['Year'].isin(yearsFilter))
data = data.loc[filterFull]
health_exp_df = data[data['IndicatorName'] == 'Health expenditure per capita (current US$)']
hosp_bed_df = data[data['IndicatorName'] == 'Hospital beds (per 1,000 people)']
phys_df = data[data['IndicatorName'] == 'Physicians (per 1,000 people)']

In [None]:
# Dataset size for each country and each indicator
# We accept a maximum of 3 years of missing data (at least 8 of the 11 years present)
for indicator_df in (health_exp_df, hosp_bed_df, phys_df):
    # Print the indicator name as a header for the countries listed below it
    print(indicator_df['IndicatorName'].iloc[0])
    for country in countriesFilter:
        df = indicator_df[indicator_df['CountryCode'] == country]
        if df.shape[0] < 8:
            print(country + ': ' + 'has more than 3 years of missing data')
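
The same coverage check can be written as a grouped count. The sketch below uses invented rows and assumes the `CountryCode` and `Year` columns used above; `sparse_countries` and `min_years` are hypothetical names. It counts distinct years rather than rows, which is equivalent as long as each country has at most one row per year.

```python
import pandas as pd

def sparse_countries(df, countries, min_years=8):
    """Return the countries with fewer than min_years distinct years in df."""
    counts = df.groupby('CountryCode')['Year'].nunique()
    # reindex adds countries with no rows at all, with a count of 0
    counts = counts.reindex(countries, fill_value=0)
    return counts[counts < min_years].index.tolist()

# Illustrative coverage: FRA has 9 years, CHE has 2, NOR has none
toy = pd.DataFrame({
    'CountryCode': ['FRA'] * 9 + ['CHE'] * 2,
    'Year': list(range(2000, 2009)) + [2000, 2005],
})

sparse = sparse_countries(toy, ['FRA', 'CHE', 'NOR'])
print(sparse)
```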

We will then remove: CHE, NLD, SWE, NOR for 'Hospital beds' analysis and CAN, JPN, GBR, USA, CHE, BEL for 'Physicians (per 1,000 people)' analysis

In [None]:
# dataframe with only the Year column
date_df = pd.DataFrame({'Year' : [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,2010]})

An outer join with date_df adds a row of NaN values for every year that is missing for a country. Applying a forward fill to the sorted result then replaces each NaN with the value from the previous available year.

In [None]:
def fill_missing_years(indicator_df, country):
    # Outer-join one country's rows with date_df, sort by year, forward-fill the gaps
    return (indicator_df[indicator_df['CountryCode'] == country]
            .merge(date_df, on='Year', how='outer')
            .sort_values(by='Year', ascending=True)
            .ffill())

hosp_bed_FRA_merged = fill_missing_years(hosp_bed_df, 'FRA')
hosp_bed_DEU_merged = fill_missing_years(hosp_bed_df, 'DEU')
hosp_bed_GBR_merged = fill_missing_years(hosp_bed_df, 'GBR')
hosp_bed_DNK_merged = fill_missing_years(hosp_bed_df, 'DNK')
hosp_bed_BEL_merged = fill_missing_years(hosp_bed_df, 'BEL')
hosp_bed_NOR_merged = fill_missing_years(hosp_bed_df, 'NOR')
hosp_bed_FIN_merged = fill_missing_years(hosp_bed_df, 'FIN')
hosp_bed_CAN_merged = fill_missing_years(hosp_bed_df, 'CAN')
hosp_bed_USA_merged = fill_missing_years(hosp_bed_df, 'USA')
hosp_bed_JPN_merged = fill_missing_years(hosp_bed_df, 'JPN')

In [None]:
phys_FRA_merged = fill_missing_years(phys_df, 'FRA')
phys_DEU_merged = fill_missing_years(phys_df, 'DEU')
phys_DNK_merged = fill_missing_years(phys_df, 'DNK')
phys_NLD_merged = fill_missing_years(phys_df, 'NLD')
phys_SWE_merged = fill_missing_years(phys_df, 'SWE')
phys_FIN_merged = fill_missing_years(phys_df, 'FIN')
phys_NOR_merged = fill_missing_years(phys_df, 'NOR')
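
The effect of the outer join followed by a forward fill is easiest to see on a small example. The sketch below uses invented values and a shortened year range (both are assumptions for illustration). It shows that a gap between two observations is filled, while a gap before the first observation stays NaN because there is no earlier value to carry forward.

```python
import pandas as pd

# Shortened, illustrative year range and values
date_df = pd.DataFrame({'Year': [2000, 2001, 2002, 2003]})
toy = pd.DataFrame({
    'CountryCode': ['FRA', 'FRA'],
    'Year': [2001, 2003],
    'Value': [4.0, 3.5],
})

# The outer join adds rows for 2000 and 2002 with NaN in the other columns
merged = (toy.merge(date_df, on='Year', how='outer')
             .sort_values(by='Year')
             .reset_index(drop=True))

# Forward fill copies the 2001 row values into 2002; 2000 has no predecessor
filled = merged.ffill()
print(filled)
```

If leading gaps matter, chaining `.bfill()` after `.ffill()` would fill them from the next available year; whether that is appropriate for these indicators is a judgment call.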

## Visualization
### Investigation on the Indicator 'Health expenditure per capita (current US$)'

In [None]:
# Extract a list of values per Country
health_exp_FRA = health_exp_df[health_exp_df['CountryCode'] == 'FRA']['Value'].values
health_exp_DEU = health_exp_df[health_exp_df['CountryCode'] == 'DEU']['Value'].values
health_exp_GBR = health_exp_df[health_exp_df['CountryCode'] == 'GBR']['Value'].values
health_exp_CHE = health_exp_df[health_exp_df['CountryCode'] == 'CHE']['Value'].values
health_exp_DNK = health_exp_df[health_exp_df['CountryCode'] == 'DNK']['Value'].values
health_exp_NLD = health_exp_df[health_exp_df['CountryCode'] == 'NLD']['Value'].values
health_exp_BEL = health_exp_df[health_exp_df['CountryCode'] == 'BEL']['Value'].values
health_exp_SWE = health_exp_df[health_exp_df['CountryCode'] == 'SWE']['Value'].values
health_exp_NOR = health_exp_df[health_exp_df['CountryCode'] == 'NOR']['Value'].values
health_exp_FIN = health_exp_df[health_exp_df['CountryCode'] == 'FIN']['Value'].values
health_exp_CAN = health_exp_df[health_exp_df['CountryCode'] == 'CAN']['Value'].values
health_exp_USA = health_exp_df[health_exp_df['CountryCode'] == 'USA']['Value'].values
health_exp_JPN = health_exp_df[health_exp_df['CountryCode'] == 'JPN']['Value'].values

In [None]:
# Bar Chart
years = np.array(yearsFilter)
width = 0.05

fig, ax = plt.subplots(figsize=(15, 8))

# create
plt_FRA = ax.bar(years,health_exp_FRA, width)
plt_DEU = ax.bar(years + width,health_exp_DEU, width)
plt_GBR = ax.bar(years + 2*width,health_exp_GBR, width)
plt_CHE = ax.bar(years + 3*width,health_exp_CHE, width)
plt_DNK = ax.bar(years + 4*width,health_exp_DNK, width)
plt_NLD = ax.bar(years + 5*width,health_exp_NLD, width)

plt_BEL = ax.bar(years + 6*width,health_exp_BEL, width)
plt_SWE = ax.bar(years + 7*width,health_exp_SWE, width)
plt_NOR = ax.bar(years + 8*width,health_exp_NOR, width)
plt_FIN = ax.bar(years + 9*width,health_exp_FIN, width)
plt_CAN = ax.bar(years + 10*width,health_exp_CAN, width)
plt_USA = ax.bar(years + 11*width,health_exp_USA, width)
plt_JPN = ax.bar(years + 12*width,health_exp_JPN, width)

# Axes and Labels
ax.set_xlim(years[0]-3*width, years[len(years)-1]+10*width)
ax.set_xlabel('Year')
ax.set_xticks(years+2*width)
xtickNames = ax.set_xticklabels(years)
plt.setp(xtickNames, rotation=45, fontsize=10)

ax.set_ylabel(health_exp_df['IndicatorName'].iloc[0])
#label the figure
ax.set_title(health_exp_df['IndicatorName'].iloc[0])
ax.legend( (plt_FRA[0], plt_DEU[0], plt_GBR[0], plt_CHE[0], plt_DNK[0], plt_NLD[0], 
          plt_BEL[0], plt_SWE[0], plt_NOR[0], plt_FIN[0], plt_CAN[0], plt_USA[0], plt_JPN[0]),
          ('FRA', 'DEU', 'GBR', 'CHE', 'DNK', 'NLD', 'BEL', 'SWE', 'NOR', 'FIN', 'CAN', 'USA', 'JPN') )

plt.show()

Switzerland and Norway have the highest health expenditure per capita.
As expected, health expenditure per capita increased steadily over the period.
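
The chart above places each country's bar at `years + k*width` and puts the tick at `years + 2*width`, so the tick is not centred under the group. A small helper can compute offsets that centre any number of bars on the tick. This is a sketch only; `bar_offsets` and `group_width` are hypothetical names, and 0.8 is an arbitrary default.

```python
import numpy as np

def bar_offsets(n_series, group_width=0.8):
    """Return (offsets, bar_width) that centre n_series bars on each tick."""
    bar_width = group_width / n_series
    # Offsets of the bar centres relative to the tick position
    offsets = (np.arange(n_series) - (n_series - 1) / 2) * bar_width
    return offsets, bar_width

offsets, bar_width = bar_offsets(4)
print(offsets, bar_width)
```

With these values, the k-th series would be drawn with `ax.bar(years + offsets[k], values_k, bar_width)` and the ticks set with `ax.set_xticks(years)`, since `ax.bar` centres each bar on its x position by default.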

### Investigation on the Indicator 'Hospital beds (per 1,000 people)'
                   

In [None]:
# Extract a list of values per Country
hosp_bed_FRA = hosp_bed_FRA_merged[hosp_bed_FRA_merged['CountryCode'] == 'FRA']['Value'].values
hosp_bed_DEU = hosp_bed_DEU_merged[hosp_bed_DEU_merged['CountryCode'] == 'DEU']['Value'].values
hosp_bed_GBR = hosp_bed_GBR_merged[hosp_bed_GBR_merged['CountryCode'] == 'GBR']['Value'].values
hosp_bed_DNK = hosp_bed_DNK_merged[hosp_bed_DNK_merged['CountryCode'] == 'DNK']['Value'].values
hosp_bed_BEL = hosp_bed_BEL_merged[hosp_bed_BEL_merged['CountryCode'] == 'BEL']['Value'].values
hosp_bed_NOR = hosp_bed_NOR_merged[hosp_bed_NOR_merged['CountryCode'] == 'NOR']['Value'].values
hosp_bed_FIN = hosp_bed_FIN_merged[hosp_bed_FIN_merged['CountryCode'] == 'FIN']['Value'].values
hosp_bed_CAN = hosp_bed_CAN_merged[hosp_bed_CAN_merged['CountryCode'] == 'CAN']['Value'].values
hosp_bed_USA = hosp_bed_USA_merged[hosp_bed_USA_merged['CountryCode'] == 'USA']['Value'].values
hosp_bed_JPN = hosp_bed_JPN_merged[hosp_bed_JPN_merged['CountryCode'] == 'JPN']['Value'].values

In [None]:
years = np.array(yearsFilter)
width = 0.05

fig, ax = plt.subplots(figsize=(12, 8))

# create
plt_FRA = ax.bar(years,hosp_bed_FRA, width)
plt_DEU = ax.bar(years + width,hosp_bed_DEU, width)
plt_GBR = ax.bar(years + 2*width,hosp_bed_GBR, width)
plt_DNK = ax.bar(years + 3*width,hosp_bed_DNK, width)
plt_BEL = ax.bar(years + 4*width,hosp_bed_BEL, width)
plt_NOR = ax.bar(years + 5*width,hosp_bed_NOR, width)
plt_FIN = ax.bar(years + 6*width,hosp_bed_FIN, width)
plt_CAN = ax.bar(years + 7*width,hosp_bed_CAN, width)
plt_USA = ax.bar(years + 8*width,hosp_bed_USA, width)
plt_JPN = ax.bar(years + 9*width,hosp_bed_JPN, width)

# Axes and Labels
ax.set_xlim(years[0]-3*width, years[len(years)-1]+10*width)
ax.set_xlabel('Year')
ax.set_xticks(years+2*width)
xtickNames = ax.set_xticklabels(years)
plt.setp(xtickNames, rotation=45, fontsize=10)

ax.set_ylabel(hosp_bed_df['IndicatorName'].iloc[0])
#label the figure
ax.set_title(hosp_bed_df['IndicatorName'].iloc[0])
ax.legend( (plt_FRA[0], plt_DEU[0], plt_GBR[0], plt_DNK[0],
          plt_BEL[0], plt_NOR[0], plt_FIN[0], plt_CAN[0], plt_USA[0], plt_JPN[0]),
          ('FRA', 'DEU', 'GBR', 'DNK', 'BEL', 'NOR', 'FIN', 'CAN', 'USA', 'JPN') )

plt.show()

There is a remarkable difference (more than double) between the number of hospital beds provided by Japan and that of the other countries.
This indicator remains roughly constant over the interval 2000-2010.

### Investigation on the Indicator 'Physicians (per 1,000 people)'
                   

In [None]:
# Extract a list of values per Country
phys_FRA = phys_FRA_merged[phys_FRA_merged['CountryCode'] == 'FRA']['Value'].values
phys_DEU = phys_DEU_merged[phys_DEU_merged['CountryCode'] == 'DEU']['Value'].values
phys_DNK = phys_DNK_merged[phys_DNK_merged['CountryCode'] == 'DNK']['Value'].values
phys_NLD = phys_NLD_merged[phys_NLD_merged['CountryCode'] == 'NLD']['Value'].values
phys_SWE = phys_SWE_merged[phys_SWE_merged['CountryCode'] == 'SWE']['Value'].values
phys_FIN = phys_FIN_merged[phys_FIN_merged['CountryCode'] == 'FIN']['Value'].values
phys_NOR = phys_NOR_merged[phys_NOR_merged['CountryCode'] == 'NOR']['Value'].values

In [None]:
years = np.array(yearsFilter)
width = 0.1

fig, ax = plt.subplots(figsize=(15, 5))

# create
plt_FRA = ax.bar(years,phys_FRA, width)
plt_DEU = ax.bar(years + width,phys_DEU, width)
plt_DNK = ax.bar(years + 2*width,phys_DNK, width)
plt_NLD = ax.bar(years + 3*width,phys_NLD, width)
plt_SWE = ax.bar(years + 4*width,phys_SWE, width)
plt_FIN = ax.bar(years + 5*width,phys_FIN, width)
plt_NOR = ax.bar(years + 6*width,phys_NOR, width)

# Axes and Labels
ax.set_xlim(years[0]-3*width, years[len(years)-1]+10*width)
ax.set_xlabel('Year')
ax.set_xticks(years+2*width)
xtickNames = ax.set_xticklabels(years)
plt.setp(xtickNames, rotation=45, fontsize=10)

ax.set_ylabel(phys_df['IndicatorName'].iloc[0])
#label the figure
ax.set_title(phys_df['IndicatorName'].iloc[0])
ax.legend( (plt_FRA[0], plt_DEU[0], plt_DNK[0],
          plt_NLD[0], plt_SWE[0], plt_FIN[0], plt_NOR[0]),
          ('FRA', 'DEU', 'DNK', 'NLD', 'SWE', 'FIN', 'NOR') )

plt.show()

In this case, the number of physicians in Germany and France stands out compared with the other European countries included in the study.
Here too, the indicator remains roughly constant over the interval 2000-2010.

### Correlation between "Health expenditure per capita" and "Hospital beds (per 1,000 people)"

In [None]:
# Reset the country filter to the countries with data available for both indicators
countriesFilter = ['CAN', 'FRA', 'DEU', 'JPN', 'GBR', 'USA', 
                   'DNK', 'BEL', 'FIN']
filterCorr = (data['CountryCode'].isin(countriesFilter)) & (data['IndicatorName'] == 'Health expenditure per capita (current US$)') & (data['Year'].isin(yearsFilter))

# Extract the two datasets
health_exp_df = data[filterCorr]
hosp_bed_df = pd.concat([hosp_bed_FRA_merged,hosp_bed_CAN_merged,hosp_bed_DEU_merged,
                    hosp_bed_JPN_merged,hosp_bed_GBR_merged,hosp_bed_USA_merged,
                    hosp_bed_DNK_merged,hosp_bed_BEL_merged,hosp_bed_FIN_merged])

Scatterplot of the 2 Indicators

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

fig, axis = plt.subplots()
# Grid lines, Xticks, Xlabel, Ylabel

axis.yaxis.grid(True)
axis.set_title('Health expenditure per capita vs. Hospital beds (per 1,000 people)',fontsize=10)
axis.set_xlabel(health_exp_df['IndicatorName'].iloc[0],fontsize=10)
axis.set_ylabel(hosp_bed_df['IndicatorName'].iloc[0],fontsize=10)

X = health_exp_df['Value']
Y = hosp_bed_df['Value']

axis.scatter(X, Y)
plt.show()

Surprisingly, there seems to be no correlation between the two indicators.
Let's calculate the correlation matrix.

In [None]:
np.corrcoef(health_exp_df['Value'],hosp_bed_df['Value'])

A correlation coefficient of 0.14 is very weak.
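
`np.corrcoef` and `scatter` pair the two inputs by position, so the coefficient is meaningful only if both frames list the same (country, year) pairs in the same order. The hospital-bed frame above is concatenated country by country, while the order of the expenditure frame follows the CSV, so it is worth checking the alignment. The sketch below uses invented values and shows one way to pair rows explicitly by merging on `CountryCode` and `Year`; the suffix names are assumptions.

```python
import numpy as np
import pandas as pd

# Invented values for two countries and two years, not real data
exp = pd.DataFrame({
    'CountryCode': ['FRA', 'JPN', 'FRA', 'JPN'],
    'Year': [2000, 2000, 2001, 2001],
    'Value': [100.0, 200.0, 110.0, 220.0],
})  # ordered by year, then country
beds = pd.DataFrame({
    'CountryCode': ['FRA', 'FRA', 'JPN', 'JPN'],
    'Year': [2000, 2001, 2000, 2001],
    'Value': [4.0, 4.4, 8.0, 8.8],
})  # ordered by country, then year

# Pair each expenditure value with the bed value of the same country and year;
# dropna removes pairs where a value is still missing
paired = exp.merge(beds, on=['CountryCode', 'Year'],
                   suffixes=('_exp', '_beds')).dropna()

r_aligned = np.corrcoef(paired['Value_exp'], paired['Value_beds'])[0, 1]
r_positional = np.corrcoef(exp['Value'], beds['Value'])[0, 1]
print(r_aligned, r_positional)
```

In this toy data the two indicators are exactly proportional, so the aligned coefficient is 1, while the positional one is much lower only because the rows are ordered differently. Whether the same effect influences the 0.14 above depends on the row order in `Indicators.csv`.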

## Acknowledgements
Only the World Development Indicators dataset has been used for this analysis.
Some data cleaning was necessary. Unfortunately, no data was found for Italy (at least for the Health expenditure indicator), so it was removed from the dataset.
Where possible, missing data has been filled by keeping the indicator constant from the previous year.