# Scoring
Let's create a merged dataframe for food consumption data from 2000-2012 with the same columns as our merged dataframe from 1970-2000 for use in scoring the models that we created.

In [1]:
import pickle
import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer
import statsmodels.formula.api as sm
import sklearn as sk
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor as rf
%matplotlib inline

## Get predictors from 1970-2000 food consumption dataframe

In [2]:
# Load merged dataframe from 1970-2000
out = open('data/final/food_1970_2000_cleaned.p', 'r')
food_1970_2000 = pickle.load(out)  # named food_1970_2000 to match the cells below
out.close()

In [32]:
food_1970_2000.head()

Country,Plantains,"Sugar, Raw Equivalent","Beverages, Alcoholic",Olives (including preserved),Cloves,Coconuts - Incl Copra,"Vegetables, Other",Sesame seed,Wine,Apples and products,...,Mutton & Goat Meat,Pelagic Fish,Bovine Meat,"Molluscs, Other","Fish, Body Oil","Aquatic Animals, Others",Honey,"Offals, Edible",Demersal Fish,Cream
Afghanistan,29.068781,14.402,0.002333,0.065333,29.068781,0.015,92.210667,2.55,0.0,3.435667,...,27.100667,29.068781,16.762667,29.068781,29.068781,29.068781,0.687667,9.166,29.068781,0.0
Albania,32.502693,50.132333,3.201,7.017667,32.502693,0.006333,206.557667,32.502693,6.353667,15.048,...,12.073,1.050333,18.681,0.416,0.0,0.0,0.488667,6.734667,0.604,0.005333
Algeria,0.0,69.079333,0.108667,1.432667,0.004667,0.012,88.779667,0.124,0.033333,3.993667,...,12.729333,6.693333,9.742333,0.001,0.0,0.0,0.247333,3.079333,0.950333,25.450712
Angola,26.325172,26.933333,1.855,26.325172,0.0,0.0,63.024333,0.512,8.947333,0.423333,...,1.418,25.743,21.081,0.000333,0.0,0.0,5.557,3.521333,3.859667,26.325172
Argentina,44.950187,113.051667,10.766,1.235333,0.000333,0.584,127.668333,0.006667,175.032667,39.242667,...,9.304333,2.740667,192.834333,0.675,0.0,0.0,0.505,19.933333,13.161333,0.157


In [33]:
# Get list of columns in 1970-2000 dataframe
predictors_1970_2000 = food_1970_2000.columns

In [34]:
print len(predictors_1970_2000)
print predictors_1970_2000

82
Index([u'Plantains', u'Sugar, Raw Equivalent', u'Beverages, Alcoholic',
       u'Olives (including preserved)', u'Cloves', u'Coconuts - Incl Copra',
       u'Vegetables, Other', u'Sesame seed', u'Wine', u'Apples and products',
       u'Rape and Mustard Oil', u'Maize and products',
       u'Groundnuts (Shelled Eq)', u'Barley and products', u'Maize Germ Oil',
       u'Beer', u'Groundnut Oil', u'Pineapples and products',
       u'Pulses, Other and products', u'Sugar (Raw Equivalent)', u'Palm Oil',
       u'Oilcrops, Other', u'Dates', u'Oats', u'Soyabeans', u'Beans',
       u'Sesameseed Oil', u'Grapes and products (excl wine)',
       u'Beverages, Fermented', u'Potatoes and products', u'Cottonseed Oil',
       u'Onions', u'Coffee and products', u'Roots, Other', u'Infant food',
       u'Cereals, Other', u'Pepper', u'Peas', u'Nuts and products',
       u'Cocoa Beans and products', u'Wheat and products',
       u'Cassava and products', u'Sunflowerseed Oil', u'Palmkernel Oil',
       u'Pime

## Create merged dataframe for food consumption data from 2000-2012

In [35]:
out = open('data/clean/crops.p', 'r')
crops = pickle.load(out)
out.close()
out = open('data/clean/meat.p', 'r')
meat = pickle.load(out)
out.close()

In [36]:
time_period = range(2000, 2012)

# Calculate the mean for each crop/meat over the period
# (note: range(2000, 2012) covers 2000 through 2011; 2012 itself is excluded)
food_2000_2012 = pd.DataFrame(index=food_1970_2000.index)

for crop in crops.iterkeys():
    food_2000_2012[crop] = crops[crop][time_period].mean(axis=1)
    
for m in meat.iterkeys():
    food_2000_2012[m] = meat[m][time_period].mean(axis=1)

food_2000_2012.head()

Country,Ricebran Oil,Oilcrops,Plantains,"Sugar, Raw Equivalent","Beverages, Alcoholic",Roots & Tuber Dry Equiv,Vegetable Oils,Olives (including preserved),Cloves,Millet and products,...,Offals,Bovine Meat,"Molluscs, Other","Fish, Body Oil","Aquatic Animals, Others",Animal fats,Honey,"Offals, Edible",Demersal Fish,Cream
Afghanistan,,1.73,,17.044167,0.014167,5.510833,7.859167,0.110833,,1.921667,...,5.305833,14.946667,,,,5.070833,0.385,5.305833,,0.035833
Albania,,33.371667,,79.6525,2.841667,18.851667,17.659167,32.464167,,,...,11.9125,39.284167,0.798333,0.0,0.005833,8.506667,1.718333,11.964167,1.660833,0.335
Algeria,,4.374167,0.0,80.4225,0.12,27.8775,37.601667,3.374167,0.008333,,...,3.27,14.145,0.0075,0.0,0.0,1.9425,0.2525,3.27,1.1475,
Angola,,5.051667,,33.684167,3.4525,190.768333,23.63,,0.0,12.783333,...,2.9575,23.000833,0.018333,0.0,0.0,1.321667,3.910833,2.9575,15.949167,
Argentina,,1.890833,,130.554167,4.538333,28.936667,38.73,0.973333,0.0,,...,15.671667,152.111667,2.696667,0.0,0.0,9.235833,0.310833,15.673333,8.484167,0.224167
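
The averaging step above takes, for each country, the mean of the yearly columns in `time_period`. The sketch below illustrates this on a made-up frame (the `wine` table and its values are hypothetical, not taken from the data), and shows how the end point of `range` changes which years are included.

```python
import pandas as pd

# Hypothetical yearly table: rows are countries, columns are years 2000..2012
years = list(range(2000, 2013))
wine = pd.DataFrame(
    [[float(y - 2000) for y in years],  # country A: 0, 1, ..., 12
     [1.0] * len(years)],               # country B: constant 1.0
    index=['A', 'B'],
    columns=years,
)

# range() excludes its end point, so this window stops at 2011
period_exclusive = list(range(2000, 2012))
# extending the end point by one includes 2012 as well
period_inclusive = list(range(2000, 2013))

# Row-wise mean over the selected year columns
mean_excl = wine[period_exclusive].mean(axis=1)
mean_incl = wine[period_inclusive].mean(axis=1)
```

Whether 2012 belongs in the window is a modelling choice; the point is only that the two windows give different averages whenever consumption changes over time.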


In [37]:
print "Percentage NaN cells before dropping:", food_2000_2012.isnull().sum().sum() / float(food_2000_2012.shape[0] * food_2000_2012.shape[1])

Percentage NaN cells before dropping: 0.146263572493


As expected, some columns remain sparse even after averaging over 12 years: about 14.6% of all cells are NaN (the value printed above is a fraction, not a percentage). Let's see whether dropping the columns not found in the merged 1970-2000 dataframe alleviates this.

In [38]:
food_2000_2012_cleaned = food_2000_2012[predictors_1970_2000]
food_2000_2012_cleaned.head()

Country,Plantains,"Sugar, Raw Equivalent","Beverages, Alcoholic",Olives (including preserved),Cloves,Coconuts - Incl Copra,"Vegetables, Other",Sesame seed,Wine,Apples and products,...,Mutton & Goat Meat,Pelagic Fish,Bovine Meat,"Molluscs, Other","Fish, Body Oil","Aquatic Animals, Others",Honey,"Offals, Edible",Demersal Fish,Cream
Afghanistan,,17.044167,0.014167,0.110833,,0.000833,90.563333,1.6175,0.026667,3.591667,...,15.015,,14.946667,,,,0.385,5.305833,,0.035833
Albania,,79.6525,2.841667,32.464167,,0.196667,344.9525,,13.975,44.381667,...,18.471667,6.845,39.284167,0.798333,0.0,0.005833,1.718333,11.964167,1.660833,0.335
Algeria,0.0,80.4225,0.12,3.374167,0.008333,0.4275,177.48,0.243333,0.02,21.436667,...,16.134167,10.195833,14.145,0.0075,0.0,0.0,0.2525,3.27,1.1475,
Angola,,33.684167,3.4525,,0.0,0.243333,84.195,0.341667,11.0425,1.181667,...,2.984167,16.210833,23.000833,0.018333,0.0,0.0,3.910833,2.9575,15.949167,
Argentina,,130.554167,4.538333,0.973333,0.0,0.863333,107.671667,0.0475,79.016667,40.675,...,3.7,3.441667,152.111667,2.696667,0.0,0.0,0.310833,15.673333,8.484167,0.224167


In [39]:
# Sanity check to see what percentage of cells are missing
print "Percentage NaN cells after dropping:", food_2000_2012_cleaned.isnull().sum().sum() / float(food_2000_2012_cleaned.shape[0] * food_2000_2012_cleaned.shape[1])

Percentage NaN cells after dropping: 0.0883750395946
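
Restricting the new dataframe to the training predictors, as done above, assumes every training column exists in the new data. A slightly more defensive sketch, on made-up column names and values, uses `reindex`, which keeps the training column order and fills any column absent from the new data with NaN instead of raising an error.

```python
import pandas as pd

nan = float('nan')

# Hypothetical predictor list from the training period
train_cols = ['Plantains', 'Wine', 'Honey']

# Hypothetical new-period frame: lacks 'Plantains', has an extra column
new = pd.DataFrame(
    {'Wine': [1.0, 2.0], 'Honey': [0.5, nan], 'Ricebran Oil': [nan, nan]},
    index=['A', 'B'],
)

# Keep exactly the training columns, in training order;
# columns missing from `new` come back as all-NaN
aligned = new.reindex(columns=train_cols)

# Fraction (not percentage) of missing cells after alignment
nan_fraction = aligned.isnull().values.mean()
```

Any all-NaN column produced this way still needs imputation before the frame can be passed to a fitted model.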


### Imputation of Missing Values
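Let's use the same mean imputation as before to fill in the remaining NaNs (justification can be found in the notebook called Missing Data). Note that `Imputer(axis=1)` imputes along rows, so each missing cell is filled with that country's mean across food items rather than the global average for the food item; the repeated values within each row of the output confirm this.

In [42]:
# Impute by mean along rows (axis=1), i.e. each country's average across food items;
# axis=0 would give the global average per food item instead
imp = Imputer(axis=1)
food_2000_2012_cleaned = pd.DataFrame(imp.fit_transform(food_2000_2012_cleaned),
                                      index=food_2000_2012_cleaned.index,
                                      columns=food_2000_2012_cleaned.columns)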

In [16]:
food_2000_2012_cleaned.head()

Country,Plantains,"Sugar, Raw Equivalent","Beverages, Alcoholic",Olives (including preserved),Cloves,Coconuts - Incl Copra,"Vegetables, Other",Sesame seed,Wine,Apples and products,...,Mutton & Goat Meat,Pelagic Fish,Bovine Meat,"Molluscs, Other","Fish, Body Oil","Aquatic Animals, Others",Honey,"Offals, Edible",Demersal Fish,Cream
Afghanistan,14.654771,17.044167,0.014167,0.110833,14.654771,0.000833,90.563333,1.6175,0.026667,3.591667,...,15.015,14.654771,14.946667,14.654771,14.654771,14.654771,0.385,5.305833,14.654771,0.035833
Albania,24.197568,79.6525,2.841667,32.464167,24.197568,0.196667,344.9525,24.197568,13.975,44.381667,...,18.471667,6.845,39.284167,0.798333,0.0,0.005833,1.718333,11.964167,1.660833,0.335
Algeria,0.0,80.4225,0.12,3.374167,0.008333,0.4275,177.48,0.243333,0.02,21.436667,...,16.134167,10.195833,14.145,0.0075,0.0,0.0,0.2525,3.27,1.1475,19.935206
Angola,20.927607,33.684167,3.4525,20.927607,0.0,0.243333,84.195,0.341667,11.0425,1.181667,...,2.984167,16.210833,23.000833,0.018333,0.0,0.0,3.910833,2.9575,15.949167,20.927607
Argentina,22.523144,130.554167,4.538333,0.973333,0.0,0.863333,107.671667,0.0475,79.016667,40.675,...,3.7,3.441667,152.111667,2.696667,0.0,0.0,0.310833,15.673333,8.484167,0.224167
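
The direction of mean imputation matters: filling along columns uses the average for a food item across countries, while filling along rows uses a country's average across food items. The NumPy sketch below shows the difference on a made-up 3x3 matrix; `impute_mean` is a hypothetical helper written for illustration, not the scikit-learn `Imputer`.

```python
import numpy as np

# Toy matrix: rows are countries, columns are food items (values are made up)
X = np.array([[1.0, np.nan, 3.0],
              [5.0, 6.0, np.nan],
              [np.nan, 8.0, 9.0]])

def impute_mean(values, axis):
    """Fill NaNs with the mean taken along `axis` (0: per column, 1: per row)."""
    filled = values.copy()
    means = np.nanmean(values, axis=axis)
    rows, cols = np.where(np.isnan(values))
    # Pick the column mean or the row mean for each missing cell
    filled[rows, cols] = means[cols] if axis == 0 else means[rows]
    return filled

per_food = impute_mean(X, axis=0)     # average per food item across countries
per_country = impute_mean(X, axis=1)  # average per country across food items
```

With row-wise filling, every imputed cell in a given row receives the same value, which is the pattern visible in the table above.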


In [44]:
# Sanity check to confirm that no cells are missing after imputation
print "Percentage NaN cells after imputation:", food_2000_2012_cleaned.isnull().sum().sum() / float(food_2000_2012_cleaned.shape[0] * food_2000_2012_cleaned.shape[1])

Percentage NaN cells after imputation: 0.0


In [47]:
# Save dataframe for use later 
pickle.dump(food_2000_2012_cleaned, open('data/final/food_2000_2012_cleaned.p', 'wb'))

# Scoring Models

Now that we have cleaned the 2000-2012 food categories, we can use this dataframe to score the models we fit earlier on the 1970-2000 food categories and the 2000 response datasets. The responses for scoring are the corresponding 2012 datasets: the age-standardized mortality rates and the percentage risk of death from non-communicable diseases.

In [4]:
# loading the response datasets from pickled files
out = open('data/clean/deaths_100k.p', 'r')
deaths_100k = pickle.load(out)
out.close()
out = open('data/clean/risk.p', 'r')
risk_of_death = pickle.load(out)
out.close()
out = open('data/clean/countries_to_drop.p', 'r')
countries_to_drop = pickle.load(out)
out.close()

In [3]:
# Load merged dataframe from 2000-2012
out = open('data/final/food_2000_2012_cleaned.p', 'r')
food_2000_2012_cleaned = pickle.load(out)
out.close()

First we need to drop the same countries we dropped from the corresponding response datasets from 2000.

In [5]:
# collecting the individual datasets
deaths_100k_all_2012 = deaths_100k['all'][2012]
deaths_100k_cancer_2012 = deaths_100k['cancer'][2012]
deaths_100k_cardio_2012 = deaths_100k['cardio'][2012]
deaths_100k_diabetes_2012 = deaths_100k['diabetes'][2012]
deaths_100k_resp_2012 = deaths_100k['resp'][2012]

risk_of_death_2012 = risk_of_death[2012]

In [6]:
# dropping the countries we previously dropped from the response datasets for 2000
deaths_100k_all_2012 = deaths_100k_all_2012.drop(countries_to_drop)
deaths_100k_cancer_2012 = deaths_100k_cancer_2012.drop(countries_to_drop)
deaths_100k_cardio_2012 = deaths_100k_cardio_2012.drop(countries_to_drop)
deaths_100k_diabetes_2012 = deaths_100k_diabetes_2012.drop(countries_to_drop)
deaths_100k_resp_2012 = deaths_100k_resp_2012.drop(countries_to_drop)

risk_of_death_2012 = risk_of_death_2012.drop(countries_to_drop)
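
Because the scoring functions used later pair predictions and responses by position, it is worth confirming that the response rows line up with the predictor rows after dropping countries. A minimal sketch on made-up data follows; the country labels, values, and the `to_drop` list are all hypothetical.

```python
import pandas as pd

# Hypothetical predictors and response with differently ordered indexes
X = pd.DataFrame({'Wine': [1.0, 2.0, 3.0]},
                 index=['Albania', 'Algeria', 'Angola'])
y = pd.Series([10.0, 20.0, 30.0, 40.0],
              index=['Algeria', 'Albania', 'Angola', 'Atlantis'])

# Drop countries excluded from the analysis
to_drop = ['Atlantis']
y = y.drop(to_drop)

# Reorder the response to follow the predictor index
y_aligned = y.reindex(X.index)

# Any NaN here would signal a country present in X but missing from y
has_missing = bool(y_aligned.isnull().any())
```

If the two indexes already match, `reindex` is a no-op, so the check costs little.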

Now we can use these response datasets to check the predictive accuracy of the various models we fit on the 1970-2000 food data.

## Scoring the Random Forest Models

First we load the random forest models we calculated earlier.

In [67]:
# loading the random forest models
out = open('data/models/deaths_all_forest_best.p', 'r')
deaths_all_forest_best = pickle.load(out)
out.close()
out = open('data/models/deaths_cancer_forest_best.p', 'r')
deaths_cancer_forest_best = pickle.load(out)
out.close()
out = open('data/models/deaths_cardio_forest_best.p', 'r')
deaths_cardio_forest_best = pickle.load(out)
out.close()
out = open('data/models/deaths_diabetes_forest_best.p', 'r')
deaths_diabetes_forest_best = pickle.load(out)
out.close()
out = open('data/models/deaths_resp_forest_best.p', 'r')
deaths_resp_forest_best = pickle.load(out)
out.close()
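
The open/load/close pattern is repeated for every model. One way to shorten it is a small helper; `load_pickle` below is a hypothetical name, and the demo round-trips a throwaway object through a temporary file, not one of the project's model files.

```python
import os
import pickle
import tempfile

def load_pickle(path):
    """Read one pickled object, closing the file even if unpickling fails."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Round-trip demo on a temporary file (a stand-in for the real model files)
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, 'example_model.p')
with open(tmp_path, 'wb') as f:
    pickle.dump({'name': 'example'}, f)

loaded = load_pickle(tmp_path)
```

With such a helper, each model load becomes a single line, for example `deaths_all_forest_best = load_pickle('data/models/deaths_all_forest_best.p')`.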

Now we can use these models to compute predictions using the food data from 2000-2012 and check the accuracy of these random forest models on the actual 2012 response datasets.

In [68]:
deaths_all_forest_score = deaths_all_forest_best.score(food_2000_2012_cleaned, deaths_100k_all_2012)
deaths_cancer_forest_score = deaths_cancer_forest_best.score(food_2000_2012_cleaned, deaths_100k_cancer_2012)
deaths_cardio_forest_score = deaths_cardio_forest_best.score(food_2000_2012_cleaned, deaths_100k_cardio_2012)
deaths_diabetes_forest_score = deaths_diabetes_forest_best.score(food_2000_2012_cleaned, deaths_100k_diabetes_2012)
deaths_resp_forest_score = deaths_resp_forest_best.score(food_2000_2012_cleaned, deaths_100k_resp_2012)
forest_scores = [deaths_all_forest_score, deaths_cancer_forest_score, deaths_cardio_forest_score,
                deaths_diabetes_forest_score, deaths_resp_forest_score]

In [69]:
diseases = ['All Diseases', 'Cancer', 'Cardiovascular Disease', 'Diabetes', 'Respiratory Disease']
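# NOTE: range(4) prints only the first four entries, so the Respiratory Disease score is omitted;
# range(len(diseases)) would print all five
for i in range(4):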
    print 'Random Forest Score for {}: {}'.format(diseases[i], forest_scores[i])

Random Forest Score for All Diseases: 0.495116535531
Random Forest Score for Cancer: 0.340476365255
Random Forest Score for Cardiovascular Disease: 0.522441892445
Random Forest Score for Diabetes: 0.322944245907
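
Iterating over the names and scores together avoids hard-coding the number of entries. The sketch below uses placeholder scores (hypothetical values, not results from the models) purely to show the pattern.

```python
diseases = ['All Diseases', 'Cancer', 'Cardiovascular Disease',
            'Diabetes', 'Respiratory Disease']

# Placeholder values for illustration only; these are NOT model results
placeholder_scores = [0.1, 0.2, 0.3, 0.4, 0.5]

# zip pairs each name with its score, so no index range is needed
lines = ['Random Forest Score for {}: {}'.format(name, score)
         for name, score in zip(diseases, placeholder_scores)]

for line in lines:
    print(line)
```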


While our Random Forest models had interpretable results for the significant features, they performed rather poorly in terms of prediction accuracy, with R^2 values between roughly 0.32 and 0.52 for the four responses printed above. Perhaps some of the simpler models might do better.

# Scoring the Multiple Linear and Lasso Regression Models

## Part I: Original Responses

While using the Statsmodels version of Linear Regression provides very helpful summaries, one of its major downsides is that it cannot calculate R^2 directly on new feature and response values. Thus, we need to write a function that can do so.

In [51]:
# defining a function to calculate R^2 on the 2000-2012 data
def r_squared(model, x, y):
    # getting predictions as a flat array
    y_pred = np.asarray(model.predict(x)).ravel()
    y_true = np.asarray(y).ravel()
    # calculating R^2
    rss = np.sum((y_true - y_pred)**2)
    # note that we use uncentered total sum of squares since statsmodels does not include intercept
    tss = np.sum(y_true**2)
    r_2 = 1 - (rss/tss)
    return r_2
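
The choice of total sum of squares has a large effect on the resulting value. As a rough illustration on made-up numbers, the sketch below scores a "model" that always predicts the mean of the response: the centered R^2 (the definition used by scikit-learn's `score`) is zero, while the uncentered R^2 is close to one. The function names and data are hypothetical.

```python
import numpy as np

def r2_uncentered(y, y_pred):
    """1 - RSS / sum(y^2): the baseline is predicting zero everywhere."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 1 - np.sum((y - y_pred) ** 2) / np.sum(y ** 2)

def r2_centered(y, y_pred):
    """1 - RSS / sum((y - mean(y))^2): the baseline is predicting the mean."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

# Made-up response whose mean (100) is far from zero relative to its spread
y = np.array([100.0, 110.0, 90.0, 105.0, 95.0])
# A "model" that ignores the features and always predicts the mean
y_pred = np.full_like(y, y.mean())

uncentered = r2_uncentered(y, y_pred)
centered = r2_centered(y, y_pred)
```

For the same predictions the uncentered value is never smaller than the centered one, so scores computed with the two definitions should be compared with care.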

Now we can load the initial Multiple Linear and Lasso Regression models on the original response datasets.

In [53]:
# loading the linreg models
out = open('data/models/risk_results.p', 'r')
risk_results = pickle.load(out)
out.close()
out = open('data/models/deaths_all_results.p', 'r')
deaths_all_results = pickle.load(out)
out.close()
out = open('data/models/deaths_cancer_results.p', 'r')
deaths_cancer_results = pickle.load(out)
out.close()
out = open('data/models/deaths_cardio_results.p', 'r')
deaths_cardio_results = pickle.load(out)
out.close()
out = open('data/models/deaths_diabetes_results.p', 'r')
deaths_diabetes_results = pickle.load(out)
out.close()
out = open('data/models/deaths_resp_results.p', 'r')
deaths_resp_results = pickle.load(out)
out.close()

# and now lasso
out = open('data/models/risk_results_lasso.p', 'r')
risk_results_lasso = pickle.load(out)
out.close()
out = open('data/models/deaths_all_results_lasso.p', 'r')
deaths_all_results_lasso = pickle.load(out)
out.close()
out = open('data/models/deaths_cancer_results_lasso.p', 'r')
deaths_cancer_results_lasso = pickle.load(out)
out.close()
out = open('data/models/deaths_cardio_results_lasso.p', 'r')
deaths_cardio_results_lasso = pickle.load(out)
out.close()
out = open('data/models/deaths_diabetes_results_lasso.p', 'r')
deaths_diabetes_results_lasso = pickle.load(out)
out.close()
out = open('data/models/deaths_resp_results_lasso.p', 'r')
deaths_resp_results_lasso = pickle.load(out)
out.close()

And now we can use the function we defined earlier to check the accuracy of these models. First we will calculate for all the Multiple Linear Regression Models.

In [62]:
# computing scores for each linear regression model
risk_2012_score = r_squared(risk_results, food_2000_2012_cleaned, risk_of_death_2012)
deaths_all_2012_score = r_squared(deaths_all_results, food_2000_2012_cleaned, deaths_100k_all_2012)
deaths_cancer_2012_score = r_squared(deaths_cancer_results, food_2000_2012_cleaned, deaths_100k_cancer_2012)
deaths_cardio_2012_score = r_squared(deaths_cardio_results, food_2000_2012_cleaned, deaths_100k_cardio_2012)
deaths_diabetes_2012_score = r_squared(deaths_diabetes_results, food_2000_2012_cleaned, deaths_100k_diabetes_2012)
deaths_resp_2012_score = r_squared(deaths_resp_results, food_2000_2012_cleaned, deaths_100k_resp_2012)
linreg_scores = [deaths_all_2012_score, deaths_cancer_2012_score, deaths_cardio_2012_score, 
                     deaths_diabetes_2012_score, deaths_resp_2012_score]

In [70]:
print 'Score of the Multiple Linear Regression Model on Percent Risk of Death: {}'.format(risk_2012_score)
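# NOTE: range(4) omits the fifth entry (Respiratory Disease); range(len(diseases)) would print all five
for i in range(4):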
    print 'Score of the Multiple Linear Regression Model on Mortality Rate of {}: {}'.format(diseases[i], linreg_scores[i])

Score of the Multiple Linear Regression Model on Percent Risk of Death: 0.915253484891
Score of the Multiple Linear Regression Model on Mortality Rate of All Diseases: 0.919602530282
Score of the Multiple Linear Regression Model on Mortality Rate of Cancer: 0.889741617712
Score of the Multiple Linear Regression Model on Mortality Rate of Cardiovascular Disease: 0.872543320109
Score of the Multiple Linear Regression Model on Mortality Rate of Diabetes: 0.701972348111


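In terms of prediction, the basic Multiple Linear Regression Models we fit performed quite well, with much higher scores than the Random Forests. Note, however, that the two sets of scores are not directly comparable: `r_squared` uses the uncentered total sum of squares, whereas scikit-learn's `score` uses the centered one, and the uncentered version yields larger values when the mean of the response is far from zero.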

Now we will calculate R^2 for all Lasso Models.

In [65]:
# computing scores for each lasso model
risk_2012_lasso_score = r_squared(risk_results_lasso, food_2000_2012_cleaned, risk_of_death_2012)
deaths_all_2012_lasso_score = r_squared(deaths_all_results_lasso, food_2000_2012_cleaned, deaths_100k_all_2012)
deaths_cancer_2012_lasso_score = r_squared(deaths_cancer_results_lasso, food_2000_2012_cleaned, deaths_100k_cancer_2012)
deaths_cardio_2012_lasso_score = r_squared(deaths_cardio_results_lasso, food_2000_2012_cleaned, deaths_100k_cardio_2012)
deaths_diabetes_2012_lasso_score = r_squared(deaths_diabetes_results_lasso, food_2000_2012_cleaned, deaths_100k_diabetes_2012)
deaths_resp_2012_lasso_score = r_squared(deaths_resp_results_lasso, food_2000_2012_cleaned, deaths_100k_resp_2012)
lasso_scores = [deaths_all_2012_lasso_score, deaths_cancer_2012_lasso_score, deaths_cardio_2012_lasso_score, 
                     deaths_diabetes_2012_lasso_score, deaths_resp_2012_lasso_score]

In [66]:
print 'Score of the Lasso Model on Percent Risk of Death: {}'.format(risk_2012_lasso_score)
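# NOTE: range(4) omits the fifth entry (Respiratory Disease); range(len(diseases)) would print all five
for i in range(4):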
    print 'Score of the Lasso Model on Mortality Rate of {}: {}'.format(diseases[i], lasso_scores[i])

Score of the Lasso Model on Percent Risk of Death: 0.930914320073
Score of the Lasso Model on Mortality Rate of All Diseases: 0.929043911136
Score of the Lasso Model on Mortality Rate of Cancer: 0.926452510359
Score of the Lasso Model on Mortality Rate of Cardiovascular Disease: 0.88974124637
Score of the Lasso Model on Mortality Rate of Diabetes: 0.782755182613


Based on the R^2 calculations, Lasso has the best prediction accuracy: it scores higher than the Multiple Linear Regression models on every response shown (for example, 0.926 versus 0.890 for cancer) and well above our Random Forest models. However, in our analysis we found some potential issues with the assumptions of linear regression that we attempted to address using log transformations, so we should evaluate those models as well.

## Part II: Log Transformed Responses

Now we will check the accuracy of the Multiple Linear Regression and Lasso models on the log transformed response datasets. As we mentioned before, performing a log transformation on the percentage risk of death from non-communicable diseases made little sense and we focused mainly on the age-standardized mortality rate data in this analysis.

In [47]:
# loading the log models
out = open('data/models/deaths_all_results_log.p', 'r')
deaths_all_results_log = pickle.load(out)
out.close()
out = open('data/models/deaths_cancer_results_log.p', 'r')
deaths_cancer_results_log = pickle.load(out)
out.close()
out = open('data/models/deaths_cardio_results_log.p', 'r')
deaths_cardio_results_log = pickle.load(out)
out.close()
out = open('data/models/deaths_diabetes_results_log.p', 'r')
deaths_diabetes_results_log = pickle.load(out)
out.close()
out = open('data/models/deaths_resp_results_log.p', 'r')
deaths_resp_results_log = pickle.load(out)
out.close()

# and now lasso
out = open('data/models/deaths_all_results_lasso_log.p', 'r')
deaths_all_results_lasso_log = pickle.load(out)
out.close()
out = open('data/models/deaths_cancer_results_lasso_log.p', 'r')
deaths_cancer_results_lasso_log = pickle.load(out)
out.close()
out = open('data/models/deaths_cardio_results_lasso_log.p', 'r')
deaths_cardio_results_lasso_log = pickle.load(out)
out.close()
out = open('data/models/deaths_diabetes_results_lasso_log.p', 'r')
deaths_diabetes_results_lasso_log = pickle.load(out)
out.close()
out = open('data/models/deaths_resp_results_lasso_log.p', 'r')
deaths_resp_results_lasso_log = pickle.load(out)
out.close()

Since these models predict log transformed responses, we need to back-transform their predictions via exponentiation before computing R^2 against the 2012 response data in their original units. This allows us to compare these R^2 values to our other calculations.

In [41]:
# defining a function to calculate R^2 for the log-response models on the 2012 data in original units
def r_squared_log(model, x, y):
    # getting predictions (on the log scale)
    y_pred_log = np.asarray(model.predict(x)).ravel()
    # transforming predictions from log model back to original units
    y_pred_norm = np.exp(y_pred_log)
    y_true = np.asarray(y).ravel()
    # calculating R^2
    rss = np.sum((y_true - y_pred_norm)**2)
    # note that we use uncentered total sum of squares since statsmodels does not include intercept
    tss = np.sum(y_true**2)
    r_2 = 1 - (rss/tss)
    return r_2
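
Exponentiation magnifies errors made on the log scale, which is one way a model can look accurate in log units yet score badly in original units. The sketch below uses made-up responses and made-up log-scale errors (all hypothetical) and applies the same uncentered R^2 formula on both scales.

```python
import numpy as np

def r2_uncentered(y, y_pred):
    """Uncentered R^2: 1 - RSS / sum(y^2)."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 1 - np.sum((y - y_pred) ** 2) / np.sum(y ** 2)

# Made-up responses in original units
y = np.array([50.0, 100.0, 200.0, 400.0])

# Hypothetical log-scale predictions: true log value plus an error
# that grows for the larger responses
log_errors = np.array([0.1, -0.1, 0.5, 1.0])
y_pred_log = np.log(y) + log_errors

# Score on the log scale, then on the original scale after back-transforming
r2_log_scale = r2_uncentered(np.log(y), y_pred_log)
r2_original_scale = r2_uncentered(y, np.exp(y_pred_log))
```

Here an additive error of 1.0 on the log scale becomes a multiplicative factor of about 2.7 after exponentiation, so the largest response dominates the residual sum of squares and the original-scale score turns negative.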

In [42]:
# scoring the log models against the 2012 responses in their original units
deaths_all_log_2012_score = r_squared_log(deaths_all_results_log, food_2000_2012_cleaned, deaths_100k_all_2012)
deaths_cancer_log_2012_score = r_squared_log(deaths_cancer_results_log, food_2000_2012_cleaned, deaths_100k_cancer_2012)
deaths_cardio_log_2012_score = r_squared_log(deaths_cardio_results_log, food_2000_2012_cleaned, deaths_100k_cardio_2012)
deaths_diabetes_log_2012_score = r_squared_log(deaths_diabetes_results_log, food_2000_2012_cleaned, deaths_100k_diabetes_2012)
deaths_resp_log_2012_score = r_squared_log(deaths_resp_results_log, food_2000_2012_cleaned, deaths_100k_resp_2012)
linreg_log_scores = [deaths_all_log_2012_score, deaths_cancer_log_2012_score, deaths_cardio_log_2012_score, 
                     deaths_diabetes_log_2012_score, deaths_resp_log_2012_score]

In [55]:
# showing the scores
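# NOTE: range(4) omits the fifth entry (Respiratory Disease); range(len(diseases)) would print all five
for i in range(4):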
    print 'Score of transformed Linear Regression Model for Mortality Rate of {}: {}'.format(diseases[i], linreg_log_scores[i])

Score of transformed Linear Regression Model for Mortality Rate of All Diseases: -5.06540622297
Score of transformed Linear Regression Model for Mortality Rate of Cancer: -0.427193256646
Score of transformed Linear Regression Model for Mortality Rate of Cardiovascular Disease: -2.2903460927
Score of transformed Linear Regression Model for Mortality Rate of Diabetes: 0.419601282373


In [49]:
# and now the same for lasso
deaths_all_log_2012_lasso_score = r_squared_log(deaths_all_results_lasso_log, food_2000_2012_cleaned, deaths_100k_all_2012)
deaths_cancer_log_2012_lasso_score = r_squared_log(deaths_cancer_results_lasso_log, food_2000_2012_cleaned, deaths_100k_cancer_2012)
deaths_cardio_log_2012_lasso_score = r_squared_log(deaths_cardio_results_lasso_log, food_2000_2012_cleaned, deaths_100k_cardio_2012)
deaths_diabetes_log_2012_lasso_score = r_squared_log(deaths_diabetes_results_lasso_log, food_2000_2012_cleaned, deaths_100k_diabetes_2012)
deaths_resp_log_2012_lasso_score = r_squared_log(deaths_resp_results_lasso_log, food_2000_2012_cleaned, deaths_100k_resp_2012)
lasso_log_scores = [deaths_all_log_2012_lasso_score, deaths_cancer_log_2012_lasso_score, deaths_cardio_log_2012_lasso_score, 
                     deaths_diabetes_log_2012_lasso_score, deaths_resp_log_2012_lasso_score]

In [50]:
# showing the scores
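# NOTE: range(4) omits the fifth entry (Respiratory Disease); range(len(diseases)) would print all five
for i in range(4):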
    print 'Score of transformed Lasso Model for {}: {}'.format(diseases[i], lasso_log_scores[i])

Score of transformed Lasso Model for All Diseases: -2.43198498402
Score of transformed Lasso Model for Cancer: 0.213108769092
Score of transformed Lasso Model for Cardiovascular Disease: -1.15191140322
Score of transformed Lasso Model for Diabetes: 0.636022677391


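Although the log transformed models somewhat addressed our concerns about normality in the residuals, they performed very poorly in terms of prediction accuracy. Several of the scores are negative, meaning the back-transformed predictions have a larger sum of squared errors than predicting zero for every country would.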

# Conclusion

Based on these scoring results, and despite the concerns about the regression assumptions, the Lasso models with the original response variables seem to be the best choice because they achieved the highest R^2 on the 2012 data. Moreover, with the coefficients and p-values we can identify food categories that seem to have significant positive and negative effects on both the percentage risk of death from all non-communicable diseases and the different age-standardized mortality rates, which we could not do with Random Forests.