### Hanan Sherka
### Final Project
### Maternal Mortality Predictive Modeling
### Sconyers, Period 2
### Due: December 18th, 2018 

#### Introduction 

   For this project, I returned to the Maternal Mortality dataset and measured the correlation between maternal mortality rates and the percent of births attended by a skilled health professional. In this data, maternal mortality is defined as the death of a woman while she is pregnant or within 42 days after her pregnancy, if the death is related in any way to the pregnancy. The ratios provided are the number of deaths per 100,000 live births. 
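As a quick illustration of the "per 100,000 live births" unit, here is a minimal sketch of the arithmetic. The function name and the counts are hypothetical and are not taken from the dataset.

```python
def maternal_mortality_ratio(maternal_deaths, live_births):
    """Return maternal deaths per 100,000 live births."""
    if live_births <= 0:
        raise ValueError("live_births must be positive")
    return maternal_deaths * 100_000 / live_births


# Hypothetical counts, for illustration only
example_ratio = maternal_mortality_ratio(maternal_deaths=40, live_births=10_000)
print(example_ratio)
```

Scaling to 100,000 live births makes countries with very different numbers of births directly comparable.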

I got this data from Gapminder, a website that provides international data on an array of topics. The dataset on skilled health professionals originally came from the World Bank. 
     Health Professional Presence: http://data.worldbank.org/indicator/SH.STA.BRTC.ZS
     Link to Maternal Mortality Rates data: https://www.gapminder.org/data/documentation/gd010/

#### Dataset Preparation

##### In Excel 

A lot of data was missing in the "births attended by skilled health professionals" dataset. I did not want the missing data to skew the predictive model, so in the edited Excel file I created a new column that counted the number of non-NaN values for each country. Across the 22 years of data (I would only be using 18), I deleted any country that had fewer than 7 non-NaN values. 
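The same filtering step could also be done in pandas instead of Excel. The sketch below is only an illustration: the tiny table, the name `df_skilled`, and the threshold of 2 are made up so the example stays small (the project itself used a threshold of 7).

```python
import numpy as np
import pandas as pd

# Made-up table standing in for the skilled-attendance data
df_skilled = pd.DataFrame({
    "country": ["A", "B", "C"],
    "2011": [55.0, np.nan, 99.0],
    "2012": [57.0, np.nan, np.nan],
    "2013": [59.0, 40.0, 98.0],
})

year_cols = [c for c in df_skilled.columns if c != "country"]

# Count the non-NaN yearly values for each country (one row per country)
df_skilled["Number of non-NaN Values"] = df_skilled[year_cols].notna().sum(axis=1)

# Keep only countries with at least `min_values` observations
min_values = 2  # illustrative; the project used 7
df_filtered = df_skilled[df_skilled["Number of non-NaN Values"] >= min_values]
print(df_filtered["country"].tolist())
```

If the count column is not needed, `df_skilled.dropna(thresh=min_values, subset=year_cols)` should give the same rows in one step.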

##### Open, Read, and Basic Info 

In [115]:
# Always include aliased packages
import math as m 
import numpy as np
import scipy as sp 
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.linear_model import LogisticRegression

In [116]:
# One line each to open and read a CSV file into a data frame, using the first row as the 
# header, then close the file. 
df_skilled_edit = pd.read_csv('deleted_countries.csv')
df_mm = pd.read_csv('maternal_mortality_ratio_per_100000_live_births.csv')

In [117]:
# Print the first couple of lines to ensure that everything is correct in the maternal 
# mortality data frame
df_mm.head ()

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
0,Afghanistan,,,,,,,,,,...,,730.0,,,,,500.0,,,400.0
1,Albania,,,,,,,,,,...,,24.0,,,,,21.0,,,21.0
2,Algeria,,,,,,,,,,...,,100.0,,,,,92.0,,,89.0
3,Andorra,,,,,,,,,,...,,,,,,,,,,
4,Angola,,,,,,,,,,...,,750.0,,,,,530.0,,,460.0


In [118]:
# Print basic information on the dataset about maternal mortality 
df_mm.info ()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 187 entries, 0 to 186
Columns: 215 entries, country to 2013
dtypes: float64(214), object(1)
memory usage: 314.2+ KB


In [119]:
# Print the first couple of lines to ensure that everything is correct in the skilled
# health professionals present at birth data frame 
df_skilled_edit.head ()

Unnamed: 0,country,1984,1985,1986,1987,1988,1989,1990,1991,1992,...,2005,2006,2007,2008,2009,2010,2011,2012,2013,Number of non-Nan Values
0,Cote d'Ivoire,,,,,,,,,,...,55.1,56.8,,,,,,59.4,,7
1,Tanzania,,,,,,,,,43.9,...,45.1,,,,,48.9,,61.4,,7
2,Turkey,,,,,,,,,,...,,,89.3,91.3,95.0,,,,97.4,7
3,Turkmenistan,,,,,,,,,,...,99.7,99.5,,,,,,,,7
4,Afghanistan,,,,,,,,,,...,,18.9,,24.0,,34.3,38.6,39.9,,7


In [120]:
# Print basic information on the dataset about skilled health professionals present at birth 
df_skilled_edit.info ()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108 entries, 0 to 107
Data columns (total 32 columns):
country                      108 non-null object
1984                         2 non-null float64
1985                         1 non-null float64
1986                         9 non-null float64
1987                         9 non-null float64
1988                         5 non-null float64
1989                         27 non-null float64
1990                         47 non-null float64
1991                         44 non-null float64
1992                         42 non-null float64
1993                         48 non-null float64
1994                         48 non-null float64
1995                         66 non-null float64
1996                         56 non-null float64
1997                         57 non-null float64
1998                         70 non-null float64
1999                         69 non-null float64
2000                         94 non-null float64
2001              

##### Merging 

In [121]:
# Create a dataset for the years 1995-2013
df_new = pd.merge (df_mm [['country','1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', 
                           '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', 
                          '2011', '2012', '2013']],
                   df_skilled_edit [['country','1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', 
                           '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', 
                          '2011', '2012', '2013']],
                    on = 'country')
df_new.head ()

Unnamed: 0,country,1995_x,1996_x,1997_x,1998_x,1999_x,2000_x,2001_x,2002_x,2003_x,...,2004_y,2005_y,2006_y,2007_y,2008_y,2009_y,2010_y,2011_y,2012_y,2013_y
0,Afghanistan,1200.0,,,,,1100.0,,,,...,,,18.9,,24.0,,34.3,38.6,39.9,
1,Albania,29.0,,,,,28.0,,,,...,99.3,99.8,99.0,,,99.3,,,,
2,Antigua and Barbuda,,,,,,25.2,,,,...,100.0,99.9,99.9,100.0,100.0,100.0,100.0,,100.0,100.0
3,Argentina,60.0,,,,,63.0,,,,...,99.1,99.1,99.4,99.4,94.8,97.9,95.0,97.1,98.2,97.0
4,Armenia,51.0,,,,,43.0,,,,...,99.5,97.8,99.7,99.9,99.9,100.0,99.5,100.0,100.0,100.0


In [122]:
# Replace the _x and _y merge suffixes with the name of the dataframe each column came from
df_new.rename(columns={df_new.columns[1]: "1995_mm",
                         df_new.columns[2]: "1996_mm",
                         df_new.columns[3]: "1997_mm",
                         df_new.columns[4]: "1998_mm",
                         df_new.columns[5]: "1999_mm",
                         df_new.columns[6]: "2000_mm",
                         df_new.columns[7]: "2001_mm",
                         df_new.columns[8]: "2002_mm",
                         df_new.columns[9]: "2003_mm",
                         df_new.columns[10]: "2004_mm",
                         df_new.columns[11]: "2005_mm",
                         df_new.columns[12]: "2006_mm",
                         df_new.columns[13]: "2007_mm",
                         df_new.columns[14]: "2008_mm",
                         df_new.columns[15]: "2009_mm",
                         df_new.columns[16]: "2010_mm",
                         df_new.columns[17]: "2011_mm",
                         df_new.columns[18]: "2012_mm",
                         df_new.columns[19]: "2013_mm",
                         df_new.columns[20]: "1995_skilled",
                         df_new.columns[21]: "1996_skilled",
                         df_new.columns[22]: "1997_skilled",
                         df_new.columns[23]: "1998_skilled",
                         df_new.columns[24]: "1999_skilled",
                         df_new.columns[25]: "2000_skilled",
                         df_new.columns[26]: "2001_skilled",
                         df_new.columns[27]: "2002_skilled",
                         df_new.columns[28]: "2003_skilled",
                         df_new.columns[29]: "2004_skilled",
                         df_new.columns[30]: "2005_skilled",
                         df_new.columns[31]: "2006_skilled",
                         df_new.columns[32]: "2007_skilled",
                         df_new.columns[33]: "2008_skilled",
                         df_new.columns[34]: "2009_skilled",
                         df_new.columns[35]: "2010_skilled",
                         df_new.columns[36]: "2011_skilled",
                         df_new.columns[37]: "2012_skilled",
                         df_new.columns[38]: "2013_skilled"}, 
                 inplace=True)
df_new.head()

Unnamed: 0,country,1995_mm,1996_mm,1997_mm,1998_mm,1999_mm,2000_mm,2001_mm,2002_mm,2003_mm,...,2004_skilled,2005_skilled,2006_skilled,2007_skilled,2008_skilled,2009_skilled,2010_skilled,2011_skilled,2012_skilled,2013_skilled
0,Afghanistan,1200.0,,,,,1100.0,,,,...,,,18.9,,24.0,,34.3,38.6,39.9,
1,Albania,29.0,,,,,28.0,,,,...,99.3,99.8,99.0,,,99.3,,,,
2,Antigua and Barbuda,,,,,,25.2,,,,...,100.0,99.9,99.9,100.0,100.0,100.0,100.0,,100.0,100.0
3,Argentina,60.0,,,,,63.0,,,,...,99.1,99.1,99.4,99.4,94.8,97.9,95.0,97.1,98.2,97.0
4,Armenia,51.0,,,,,43.0,,,,...,99.5,97.8,99.7,99.9,99.9,100.0,99.5,100.0,100.0,100.0
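The merge and rename steps above could likely be combined by passing `suffixes` to `pd.merge` and building the year list with a comprehension. This is a sketch on made-up two-year frames (`df_mm_demo` and `df_skilled_demo` are hypothetical stand-ins for `df_mm` and `df_skilled_edit`), not the code the project ran.

```python
import pandas as pd

# Made-up stand-ins for the two source dataframes
df_mm_demo = pd.DataFrame(
    {"country": ["A", "B"], "1995": [1200.0, 29.0], "1996": [1100.0, 28.0]}
)
df_skilled_demo = pd.DataFrame(
    {"country": ["A", "C"], "1995": [10.0, 90.0], "1996": [12.0, 95.0]}
)

# Build the list of year columns instead of typing each one
# (the project would use range(1995, 2014))
years = [str(year) for year in range(1995, 1997)]
cols = ["country"] + years

# suffixes names the overlapping columns directly, so no rename is needed
df_merged = pd.merge(
    df_mm_demo[cols],
    df_skilled_demo[cols],
    on="country",
    suffixes=("_mm", "_skilled"),
)
print(df_merged.columns.tolist())
```

Because `pd.merge` defaults to an inner join, only countries present in both frames are kept, which matches the behavior of the merge above.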


#### Data Modeling

##### Finding Correlation Coefficient  

I selected a few years and computed the correlation coefficient for each, to see whether the correlation in this data is strong enough to build predictive models from.

In [123]:
# Create a new dataframe with just the 1995 skilled health professionals and maternal mortality
# columns, and find the correlation coefficient
df_new1995 = df_new[['1995_skilled', '1995_mm']]
df_new1995.corr()

Unnamed: 0,1995_skilled,1995_mm
1995_skilled,1.0,-0.686061
1995_mm,-0.686061,1.0


One interesting thing to note: before I deleted the countries that had fewer than 7 non-NaN values, the correlation coefficient for 1995 was approximately -0.8, which is a lot stronger. 

In [124]:
# Create a new dataframe with just the 2000 skilled health professionals and maternal mortality
# columns, and find the correlation coefficient
df_new2000 = df_new[['2000_skilled', '2000_mm']]
df_new2000.corr()

Unnamed: 0,2000_skilled,2000_mm
2000_skilled,1.0,-0.798158
2000_mm,-0.798158,1.0


In [125]:
# Create a new dataframe with just the 2013 skilled health professionals and maternal mortality
# columns, and find the correlation coefficient
df_new2013 = df_new[['2013_skilled', '2013_mm']]
df_new2013.corr()

Unnamed: 0,2013_skilled,2013_mm
2013_skilled,1.0,-0.753106
2013_mm,-0.753106,1.0
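Rather than checking one year per cell, the same correlation could be computed for every year in a loop. The sketch below uses made-up numbers and assumes the `<year>_skilled` / `<year>_mm` column naming used in `df_new`.

```python
import numpy as np
import pandas as pd

# Made-up values that follow the column naming of df_new
df_demo = pd.DataFrame({
    "1995_skilled": [20.0, 50.0, 80.0, 99.0, np.nan],
    "1995_mm": [900.0, 400.0, 150.0, 20.0, 300.0],
    "2000_skilled": [25.0, 55.0, np.nan, 99.0, 70.0],
    "2000_mm": [800.0, 350.0, 120.0, 15.0, 200.0],
})

correlations = {}
for year in ["1995", "2000"]:
    # Series.corr only uses rows where both values are present
    correlations[year] = df_demo[f"{year}_skilled"].corr(df_demo[f"{year}_mm"])

corr_by_year = pd.Series(correlations, name="pearson_r")
print(corr_by_year)
```

For the real data, the loop would run over the years from 1995 to 2013. Years in which maternal mortality is mostly NaN would give unreliable or NaN coefficients, so it is worth checking how many complete pairs each year has.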


##### Predictive Modeling

My goal was to create a working predictive model. The code below did not run successfully, but I left it in to show the approach. 

In [126]:
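# Create x (features) and y (response)
# Note: array[0, 17] would select a single cell; slices select whole column ranges.
# Column 0 is the country, columns 1-19 are maternal mortality (1995-2013),
# and columns 20-38 are skilled attendance (1995-2013).
array = df_new.values
x = array[:, 20:39]  # skilled attendance columns (still contain NaN)
y = array[:, 1:20]   # maternal mortality columns (still contain NaN)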

###### Logistic Regression

In [127]:
# import the class
from sklearn.linear_model import LogisticRegression

#instantiate the model (using default parameters)
logreg = LogisticRegression()

# Fit the model with the data 
logreg.fit(x,y)

# Predict the response values for the observations in x 
logreg.predict(x)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
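The `ValueError` above is raised because scikit-learn estimators such as `LogisticRegression` do not accept NaN values. One possible way forward, sketched here on made-up numbers, is to pick a single year, drop the rows where either value is missing, and fit a regression instead of a classifier. `LinearRegression` is swapped in here as an assumption on my part, since maternal mortality is a continuous number rather than a class label; it is not what the notebook originally tried.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up values standing in for two columns of df_new
df_demo = pd.DataFrame({
    "2013_skilled": [20.0, 95.0, np.nan, 90.0, 60.0, 50.0],
    "2013_mm": [700.0, 30.0, 90.0, 60.0, np.nan, 400.0],
})

# Keep only rows where both the feature and the response are present
df_clean = df_demo[["2013_skilled", "2013_mm"]].dropna()

X = df_clean[["2013_skilled"]].to_numpy()  # 2D array: (n_samples, 1)
y = df_clean["2013_mm"].to_numpy()         # 1D array: (n_samples,)

# Fit a straight line: maternal mortality as a function of skilled attendance
linreg = LinearRegression()
linreg.fit(X, y)
y_pred = linreg.predict(X)
print(len(y_pred), linreg.coef_[0])
```

A negative coefficient would agree with the negative correlations found earlier. Dropping rows is the simplest option; it also shrinks an already small dataset, so results should be read with caution.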

In [None]:
# Store the predicted response values 
y_pred = logreg.predict(x)

# Check how many predictions were generated 
(len(y_pred))

In [None]:
# Compute classification accuracy for the logistic regression model
from sklearn import metrics 
print(metrics.accuracy_score(y, y_pred))

###### KNN (K=5)

In [None]:
from sklearn.neighbors import KNeighborsClassifier 
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(x,y)
y_pred = knn.predict(x)
print(metrics.accuracy_score(y, y_pred))

###### KNN (K=1)

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(x,y)
y_pred = knn.predict(x)
print(metrics.accuracy_score(y, y_pred))
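Logistic regression and KNN are classifiers, so they need a categorical response, and accuracy measured on the training data tends to be overly optimistic (with K=1, each training point is usually its own nearest neighbor). The sketch below shows one way this could be set up. The synthetic data, the `high_mm` label, and the median cutoff are all assumptions made for illustration; none of them come from the project.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: mortality falls as skilled attendance rises, plus noise
rng = np.random.default_rng(0)
skilled = rng.uniform(10, 100, size=60)
mm = 1000 - 9 * skilled + rng.normal(0, 60, size=60)
df_demo = pd.DataFrame({"2013_skilled": skilled, "2013_mm": mm})

# Assumed labeling: 1 if maternal mortality is above the median, else 0
df_demo["high_mm"] = (df_demo["2013_mm"] > df_demo["2013_mm"].median()).astype(int)

X = df_demo[["2013_skilled"]].to_numpy()
y = df_demo["high_mm"].to_numpy()

# Hold out a quarter of the rows so accuracy is measured on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
test_accuracy = metrics.accuracy_score(y_test, knn.predict(X_test))
print(test_accuracy)
```

With the real data, rows containing NaN would still need to be dropped or filled before the split.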

#### Visualization 

In [None]:
# 2013 scatter plot of correlation between skilled people present and maternal mortality rate
sns.relplot(x="2013_skilled", y = "2013_mm",  
            data = df_new, kind="scatter" )

In [None]:
# 1995 scatter plot of correlation between skilled people present and maternal mortality rate
sns.relplot(x="1995_skilled", y = "1995_mm",  
            data = df_new, kind="scatter" )

#### Analysis and Conclusion 

The correlation between the percent of births attended by skilled health professionals and maternal mortality rates is strong and negative (about -0.69 for 1995, -0.80 for 2000, and -0.75 for 2013). I was not able to make a working predictive model, and I was left with a lot of questions about why so much data was missing. I had to remove so much data that it is difficult to say that anything here is definitive. 

#### Acknowledgments 

I got most of the code from the Kaggle tutorial video http://blog.kaggle.com/2015/10/23/scikit-learn-video-9-better-evaluation-of-classification-models/ and from the old lab. 