# Conducting a Regression Analysis to Predict 'Survival to 65 years' 
In this notebook, I use World Development Indicators data for India from the 'indicator.csv' file to predict the 'Survival to 65 years (% of cohort)' feature, taken as the average of the 'Survival to age 65, female (% of cohort)' and 'Survival to age 65, male (% of cohort)' features. To extract the relevant data, I did some of the filtering by hand and some in Python.

I ended up extracting the following features of interest: Birth rate, crude (per 1,000 people); CO2 emissions; Death rate, crude (per 1,000 people); Fertility rate, total (births per woman); Final consumption expenditure; GDP per capita; Gross domestic income; Gross national expenditure; Household final consumption expenditure; Mortality rate, adult, female (per 1,000 female adults); Mortality rate, adult, male (per 1,000 male adults); Net bilateral aid flows from DAC donors; Population, total; Year

I found that 'Survival to 65 years' could be predicted effectively from the above features using a DecisionTreeRegressor model. As it turned out, the data proved very amenable to this analysis: I ended up with a very high r2_score, which I found quite strange! I speculate that a similar analysis on data from other countries and regions would produce similar results.

In [None]:
# Much of the code on this page is inspired by code I came across through the udacity nanodegree on machine learning...

import numpy as np
import pandas as pd
from IPython.display import display
%matplotlib inline

try:
    data = pd.read_csv("newKaggleFile.csv")
    print("dataset loaded...")
except IOError:
    # Without the data nothing below can run, so report and re-raise
    print("The dataset could not be loaded...")
    raise

data = data[np.isfinite(data['Birth rate, crude (per 1,000 people)'])]
survival = data['Survival to age 65 (% of cohort)']

features = data.drop(['Survival to age 65, male (% of cohort)', 'Survival to age 65, female (% of cohort)', 'Survival to age 65 (% of cohort)'], axis = 1)

print("Number of datapoints: {}\nNumber of variables per datapoint: {}".format(*data.shape))

display(data[:10])
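
The target column used above, 'Survival to age 65 (% of cohort)', is described in the introduction as the average of the female and male columns. That step was done while preparing the CSV and is not shown in this notebook, so the following is only a sketch of how it could be computed in pandas. The DataFrame `toy` and its numbers are made up for illustration and are not taken from the real data.

```python
import pandas as pd

# Toy values for illustration only; they are NOT from the real dataset.
toy = pd.DataFrame({
    "Year": [1990, 2000, 2010],
    "Survival to age 65, female (% of cohort)": [60.0, 66.0, 72.0],
    "Survival to age 65, male (% of cohort)": [56.0, 60.0, 64.0],
})

sex_columns = [
    "Survival to age 65, female (% of cohort)",
    "Survival to age 65, male (% of cohort)",
]

# Row-wise unweighted mean of the two sex-specific columns
toy["Survival to age 65 (% of cohort)"] = toy[sex_columns].mean(axis=1)

combined = toy["Survival to age 65 (% of cohort)"].tolist()
print(combined)
```

Note that this is a simple unweighted mean; it does not account for the relative sizes of the female and male cohorts.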

In [None]:
# Statistics of Dataset

# Survival to 65 through the years
lowestSurvival = np.min(survival)
highestSurvival = np.max(survival)
meanSurvival = np.mean(survival)
medianSurvival = np.median(survival)
stdDevSurvival = np.std(survival)


print("Survival to 65")
print("Minimum: {}".format(lowestSurvival))
print("Maximum: {}".format(highestSurvival))
print("Mean: {}".format(meanSurvival))
print("Median: {}".format(medianSurvival))
print("Standard deviation: {}".format(stdDevSurvival))
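
As a side note on the statistics above: `np.std` uses the population formula (`ddof=0`) by default, whereas pandas' `Series.std` and `Series.describe` use the sample formula (`ddof=1`), so the two can disagree noticeably on a small dataset. The sketch below illustrates the difference on a made-up series; `toy_survival` is a hypothetical name and its values are not real survival figures.

```python
import numpy as np
import pandas as pd

# Toy series for illustration only; not real survival figures.
toy_survival = pd.Series([50.0, 55.0, 60.0, 65.0])

# NumPy default: population standard deviation (divide by n)
population_std = np.std(toy_survival.to_numpy())

# pandas default: sample standard deviation (divide by n - 1)
sample_std = toy_survival.std()

# describe() reports count, mean, std (sample), min, quartiles and max at once
print(toy_survival.describe())
print(population_std, sample_std)
```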

### Analyzing economic and non-economic features
In the absence of information about which features best predict 'Survival to 65 years', I analyzed the economic and non-economic features separately, and then the entire feature space.

In [None]:
economicfeatures = features.drop(['Birth rate, crude (per 1,000 people)', 'CO2 emissions', 'Death rate, crude (per 1,000 people)', 'Fertility rate, total (births per woman)', 'Mortality rate, adult, female (per 1,000 female adults)', 'Population, total', 'Mortality rate, adult, male (per 1,000 male adults)'], axis = 1)
noneconomicfeatures = features.drop(['Final consumption expenditure', 'GDP per capita', 'Gross domestic income', 'Gross national expenditure', 'Household final consumption expenditure', 'Net bilateral aid flows from DAC donors'], axis = 1) 
display(economicfeatures[:5])
display(noneconomicfeatures[:5])

### Preparing the data
The following code splits the data into training and testing points for the regression model analysis. 

In [None]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
# Divide data into training and testing sets
X_train_noneconomic, X_test_noneconomic, y_train_noneconomic, y_test_noneconomic = train_test_split(noneconomicfeatures, survival, test_size = 0.2, random_state = 2)
X_train_economic,X_test_economic, y_train_economic, y_test_economic = train_test_split(economicfeatures, survival, test_size = 0.2, random_state = 2)
X_train, X_test, y_train, y_test = train_test_split(features, survival, test_size = 0.2, random_state = 2)
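
Because all three calls above use the same number of rows, the same `test_size` and the same `random_state`, they should select the same rows, which means the three target splits are identical. The sketch below checks this on synthetic data and shows an equivalent alternative: split once, then pick column subsets from the result. The column names `a`, `b`, `c` and all `toy_` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the column names are hypothetical.
rng = np.random.RandomState(0)
toy_features = pd.DataFrame(rng.rand(20, 3), columns=["a", "b", "c"])
toy_target = pd.Series(rng.rand(20))

# Two separate calls on different column subsets, same random_state
_, _, y_train_1, y_test_1 = train_test_split(
    toy_features[["a"]], toy_target, test_size=0.2, random_state=2)
X_train_all, X_test_all, y_train_2, y_test_2 = train_test_split(
    toy_features, toy_target, test_size=0.2, random_state=2)

# Both calls pick the same rows, so the targets line up exactly
same_rows = y_train_1.index.equals(y_train_2.index)
print(same_rows)

# Equivalent: split once, then select a column subset from each part
X_train_a = X_train_all[["a"]]
X_test_a = X_test_all[["a"]]
```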


### Principal Component Analysis
There are a large number of features in the features dataset. I suspected that there are underlying latent features driving the economic and the non-economic features respectively. I sought to find them using PCA. 

In [None]:
from sklearn.decomposition import PCA
# Find PCA transformed data (non-economic features) 
pca_noneconomic = PCA(n_components = 3)
pca_noneconomic.fit(X_train_noneconomic)

X_train_noneconomic_pca = pca_noneconomic.transform(X_train_noneconomic)
X_test_noneconomic_pca = pca_noneconomic.transform(X_test_noneconomic)

# Find PCA transformed data (economic features)
pca_economic = PCA(n_components = 3)
pca_economic.fit(X_train_economic)

X_train_economic_pca = pca_economic.transform(X_train_economic)
X_test_economic_pca = pca_economic.transform(X_test_economic)

# Find PCA transformed data (economic & non-economic features)
pca = PCA(n_components = 5)
pca.fit(X_train)

X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
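
One thing worth checking here: PCA is sensitive to the scale of each column, and features such as 'Population, total' are orders of magnitude larger than rates such as 'Fertility rate, total (births per woman)'. Without standardization, the leading components may mostly reflect the largest-valued columns. The sketch below is not part of the original analysis; it uses synthetic data (`X_toy` is a hypothetical name) to compare the explained variance ratio with and without a `StandardScaler` step.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two small-scale columns and one very large-scale column
rng = np.random.RandomState(0)
small = rng.normal(size=(50, 2))
large = rng.normal(scale=1e6, size=(50, 1))
X_toy = np.hstack([small, large])

# PCA on raw values: the large-scale column dominates the first component
raw_pca = PCA(n_components=2).fit(X_toy)

# PCA after standardizing each column to zero mean and unit variance
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X_toy)

raw_ratio = raw_pca.explained_variance_ratio_
scaled_ratio = scaled_pca.named_steps["pca"].explained_variance_ratio_
print(raw_ratio)
print(scaled_ratio)
```

If this were applied to the notebook's data, the pipeline would be fitted on the training split only and then used to transform the test split, in the same way as the `pca.fit` / `pca.transform` calls above.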

### Building a metric for gauging the model's efficacy

In [None]:
from sklearn.metrics import r2_score
def performance_metric(y_true, y_predict):
    return r2_score(y_true, y_predict)
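
For reference, `r2_score` computes the coefficient of determination, R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The short sketch below reproduces that by hand on made-up numbers (`y_true_toy` and `y_pred_toy` are hypothetical) and compares it with scikit-learn's result.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration only
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: how far predictions are from the truth
ss_res = np.sum((y_true_toy - y_pred_toy) ** 2)

# Total sum of squares: how far the truth is from its own mean
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y_true_toy, y_pred_toy)
print(r2_manual, r2_sklearn)
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values mean worse than that baseline.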

### Implementing the Decision Tree Regression model on the data
I found that the entirety of the data (economic + non-economic features) yielded the best r2_score. In the remainder of the analysis, I used the entirety of the data to build my final model.

In [None]:
from sklearn.tree import DecisionTreeRegressor
# Testing DecisionTreeRegressor on data 

# Economic feature set
reg_economic = DecisionTreeRegressor()
reg_economic.fit(X_train_economic_pca, y_train_economic)
y_pred_economic = reg_economic.predict(X_test_economic_pca)
print('Decision Tree Regressor r2_score on Economic data is: {}'.format(performance_metric(y_test_economic, y_pred_economic)))

# Non-economic feature set
reg_noneconomic = DecisionTreeRegressor()
reg_noneconomic.fit(X_train_noneconomic_pca, y_train_noneconomic)
# Predict with the regressor that was fitted on the non-economic features
y_pred_noneconomic = reg_noneconomic.predict(X_test_noneconomic_pca)
print('Decision Tree Regressor r2_score on Non-economic data is: {}'.format(performance_metric(y_test_noneconomic, y_pred_noneconomic)))

# (Economic + non-economic) feature set
reg_ = DecisionTreeRegressor()
reg_.fit(X_train_pca, y_train)
y_pred_ = reg_.predict(X_test_pca)
print('Decision Tree Regressor r2_score on (Economic + non-economic) data is: {}'.format(performance_metric(y_test, y_pred_)))
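
A single 80/20 split can give a noisy score when the dataset is small, so it may be worth cross-checking the numbers above with k-fold cross-validation. The sketch below is an assumption about how that could look, not something done in the original analysis; it runs on synthetic data (`X_toy`, `y_toy` are hypothetical) and fixes `random_state` so that the tree is reproducible.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data for illustration only
rng = np.random.RandomState(0)
X_toy = rng.rand(60, 4)
y_toy = 10.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=60)

# Five shuffled folds; each row is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=2)

scores = cross_val_score(
    DecisionTreeRegressor(random_state=2), X_toy, y_toy, cv=cv, scoring="r2")

# The spread across folds hints at how stable the score is
print(scores.mean(), scores.std())
```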

### Optimizing the Decision Tree Regressor parameters on the data
Oddly enough, I found that tuning the Decision Tree Regressor's parameters did not yield appreciably better results. Am I not optimizing the correct parameters?

In [None]:
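from sklearn.metrics import make_scorer
# GridSearchCV and ShuffleSplit live in sklearn.model_selection
# (sklearn.grid_search and sklearn.cross_validation were removed)
from sklearn.model_selection import GridSearchCV, ShuffleSplit
def fit_model(X, y):
    # Setting parameters for GridSearchCV
    # ShuffleSplit no longer takes the sample count; n_iter became n_splits
    cv_sets = ShuffleSplit(n_splits = 10, test_size = 0.2, random_state = 2)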
    regressor = DecisionTreeRegressor()
    params = {'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'splitter': ['best', 'random']}
    scoring_func = make_scorer(performance_metric)
    
    # Create GridSearch object
    grid = GridSearchCV(regressor, params, cv = cv_sets, scoring = scoring_func)
    
    # Fit GridSearch object to the data
    grid = grid.fit(X, y)
    
    # Return best estimator using inbuilt method
    return grid.best_estimator_
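
On the question of which parameters to tune: besides `max_depth` and `splitter`, `DecisionTreeRegressor` also exposes `min_samples_split` and `min_samples_leaf`, which control how small a node may get and are common candidates for a grid search. The sketch below is one possible wider grid, run on synthetic data (`X_toy`, `y_toy` are hypothetical); whether it helps on the real data is not something this notebook establishes. If the default tree already scores very highly, there may simply be little room left to improve.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data for illustration only
rng = np.random.RandomState(0)
X_toy = rng.rand(80, 3)
y_toy = 5.0 * X_toy[:, 0] + rng.normal(scale=0.3, size=80)

# A wider grid: depth plus the two node-size constraints
params = {
    "max_depth": [2, 4, 6, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

cv_sets = ShuffleSplit(n_splits=5, test_size=0.2, random_state=2)
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=2), params, cv=cv_sets, scoring="r2")
grid.fit(X_toy, y_toy)

# Inspect what was chosen and how it scored across the CV splits
best_params = grid.best_params_
best_cv_score = grid.best_score_
print(best_params, best_cv_score)
```

Looking at `best_params_` and `best_score_`, rather than only the returned estimator, makes it easier to see whether the search actually changed anything relative to the defaults.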

In [None]:
optimalReg = fit_model(X_train_pca, y_train)
y_pred_optimal = optimalReg.predict(X_test_pca)
print('The r2_score of the optimal regressor is: {}'.format(performance_metric(y_test, y_pred_optimal)))

### Conclusion
So there you have it. The decision tree regression model performed surprisingly well in predicting 'Survival to 65 years'. As mentioned in the introduction, I believe similar analyses on data from other countries would yield similarly strong results.