In [1]:
%matplotlib inline

from copy import copy
import os
import sys

import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np

pd.set_option("display.max_columns", 101)
pd.set_option("display.float_format", lambda x: "%.2f" % x)

# silence library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

## Load features

We extracted features for each county in notebooks 1.5 and 1.6, and target variables in notebook 1.0. In this section, we load that earlier work so we can build our models.

In [2]:
geo_features = pd.read_csv("../data/processed/county-geo-features.csv", index_col=0)
demo_features = pd.read_csv("../data/processed/county-demo-features.csv", index_col=0)

all_features = geo_features.join(demo_features)

targets = pd.read_csv("../data/processed/targets-by-county.csv", index_col=0)

### Clean data
We can't fit models with `nan`s in the data, so we'll drop those counties for now.

In [3]:
# 3 counties don't have data for some columns;
# the mask is True only for counties where every feature is present
counties_to_model = all_features.notnull().all(axis=1)

geo_features = geo_features[counties_to_model]
demo_features = demo_features[counties_to_model]
all_features = all_features[counties_to_model]

targets = targets[counties_to_model]
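
A minimal sketch of the same row filter on a toy frame. The county names and values are made up for illustration. Indexing the targets with the mask works because both frames share the same index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three counties, one with a missing feature value
toy_features = pd.DataFrame(
    {"rainfall": [1.0, np.nan, 3.0], "population": [10.0, 20.0, 30.0]},
    index=["county_a", "county_b", "county_c"],
)
toy_targets = pd.DataFrame({"wealth": [0.1, 0.2, 0.3]}, index=toy_features.index)

# True only for rows where every feature is present
complete_rows = toy_features.notnull().all(axis=1)

# The same mask filters features and targets, so the rows stay aligned
toy_features_clean = toy_features[complete_rows]
toy_targets_clean = toy_targets[complete_rows]

print(toy_features_clean.index.tolist())
```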

## Model classes

To start, we'll fit a simple linear model and a simple non-linear model. These will give us a sense of how well our different features can predict our different targets across model classes.

In [4]:
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

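from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20

# Non-linear model: random forest, tuned over forest size and tree depth
nonlinear = GridSearchCV(RandomForestRegressor(),
                         dict(n_estimators=[5, 10, 20, 40],
                              max_depth=[2, 3, 5, 10]),
                         verbose=1)

# Linear model: Lasso, tuned over regularization strength, intercept, and normalization
# (`normalize` was removed from Lasso in scikit-learn 1.2, so this grid assumes an older version)
linear = GridSearchCV(Lasso(),
                      dict(alpha=[1.0, 10.0, 100.0],
                           fit_intercept=[True, False],
                           normalize=[True, False]),
                      verbose=1)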
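
As a quick sanity check on the size of each search, the sketch below counts the parameter combinations in the two grids. It uses only the standard library, and the fold count of 3 is an assumption taken from the "Fitting 3 folds" lines in the cell output further down.

```python
from itertools import product

# Parameter grids copied from the two searches above
nonlinear_grid = dict(n_estimators=[5, 10, 20, 40], max_depth=[2, 3, 5, 10])
linear_grid = dict(alpha=[1.0, 10.0, 100.0],
                   fit_intercept=[True, False],
                   normalize=[True, False])


def count_candidates(grid):
    """Number of distinct parameter combinations in a grid."""
    return len(list(product(*grid.values())))


n_folds = 3  # assumption: matches the fold count reported in the fit logs

for name, grid in [("linear", linear_grid), ("nonlinear", nonlinear_grid)]:
    n_candidates = count_candidates(grid)
    print(name, n_candidates, "candidates,", n_candidates * n_folds, "fits")
```

The counts (12 and 16 candidates, so 36 and 48 fits) agree with the verbose output of the fitting cell.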



## Fit the models

Now we'll fit both classes of models for each combination of features + target variable. After fitting the models, we can see which combinations are the most successful.

In [5]:
model_fits_dfs = []
fitted_models = {}

for model_class_str, model_class in zip(['linear', 'nonlinear'], [linear, nonlinear]):
    model_results = dict()
    fitted_models[model_class_str] = dict()
    
    for target in targets.columns:
        model_results[target] = dict()
        fitted_models[model_class_str][target] = dict()
        
        for f_name, features in zip(['geo', 'demo', 'both'],
                                    [geo_features, demo_features, all_features]):
            
            model_class.fit(features, targets[target])
            
            model_results[target][f_name] = model_class.best_score_
            fitted_models[model_class_str][target][f_name] = copy(model_class)
            
    model_fits_dfs.append(pd.DataFrame(model_results))
    
            
            

Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:    0.2s finished


Fitting 3 folds for each of 12 candidates, totalling 36 fits
Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:    0.1s finished


Fitting 3 folds for each of 12 candidates, totalling 36 fits
Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:    0.2s finished


Fitting 3 folds for each of 12 candidates, totalling 36 fits
Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:    0.2s finished


Fitting 3 folds for each of 12 candidates, totalling 36 fits
Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:    0.4s finished


Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:    0.3s finished


Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:    0.2s finished


Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:    0.4s finished


Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed:    1.2s finished


Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed:    1.1s finished


Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed:    1.4s finished


Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed:    1.2s finished


Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed:    1.1s finished


Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed:    1.4s finished


Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed:    1.2s finished


Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed:    1.1s finished


Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed:    1.4s finished


Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed:    1.2s finished


Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed:    1.1s finished


Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed:    1.3s finished


### Results of Linear Models

Scores are reported as cross-validated $R^2$, which ranges over $(-\infty, 1]$: a score of 1 means the model explains all of the variance in the data, 0 matches always predicting the mean, and increasingly negative scores mean increasingly poor predictions.

In [6]:
model_fits_dfs[0]

| features | expenditure | loan_repay_rate | monthly_income | wealth |
|---|---|---|---|---|
| both | 0.47 | -0.01 | -17.37 | 0.35 |
| demo | 0.72 | 0.18 | -5.13 | 0.67 |
| geo | -0.06 | 0.02 | -17.11 | -0.27 |
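
The negative scores in the table follow from the definition $R^2 = 1 - SS_{res}/SS_{tot}$: a model whose squared error exceeds that of always predicting the mean scores below zero. A minimal pure-Python sketch with made-up numbers (not the county data):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]  # illustrative values only

perfect = r_squared(y_true, y_true)                    # exact predictions
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])    # always predict the mean
far_off = r_squared(y_true, [10.0, 10.0, 10.0, 10.0])  # predictions far from every value

print(perfect, mean_only, far_off)
```

Here `perfect` is 1, `mean_only` is 0, and `far_off` is strongly negative, which is the regime the `monthly_income` column is in.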


### Results of Non-linear Models

Scores are reported as cross-validated $R^2$, which ranges over $(-\infty, 1]$: a score of 1 means the model explains all of the variance in the data, 0 matches always predicting the mean, and increasingly negative scores mean increasingly poor predictions.

In [8]:
model_fits_dfs[1]

| features | expenditure | loan_repay_rate | monthly_income | wealth |
|---|---|---|---|---|
| both | 0.42 | 0.34 | -3.38 | 0.48 |
| demo | 0.59 | 0.35 | -3.74 | 0.80 |
| geo | 0.09 | 0.28 | -3.90 | 0.27 |


# Summary

To start, a caveat: these models give only rough estimates of how well we can predict county-level wealth. Two issues mean that the modeling exercises should guide our future work rather than serve as conclusions in themselves. First, we have more features than observations (>100 features, only 47 counties). Second, many of our features are highly collinear. Given these caveats, the coefficients and feature importances in these models are more distracting than enlightening. Instead, we look only at the explanatory power of the different target/feature combinations.
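
To illustrate the first issue with synthetic data (not the county features): when there are more features than rows, ordinary least squares can reproduce the training targets almost exactly even when the features are pure noise, so in-sample fit says little about predictive power. The sketch below assumes NumPy, uses random numbers throughout, and picks a simple hold-out split for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

# Roughly the shape described in the caveat above: more features than counties
n_counties, n_features = 47, 100
X = rng.normal(size=(n_counties, n_features))  # pure-noise features
y = rng.normal(size=n_counties)                # target unrelated to X


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Fit least squares on the first 35 rows and hold out the remaining 12
train, test = slice(0, 35), slice(35, None)
coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

train_r2 = r_squared(y[train], X[train] @ coef)  # essentially 1: the fit interpolates
test_r2 = r_squared(y[test], X[test] @ coef)     # much lower on held-out rows

print("train R^2:", round(train_r2, 3), "held-out R^2:", round(test_r2, 3))
```

This is why the scores in this notebook come from cross-validation rather than from the training fit.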

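We can see that the non-linear model is generally better at making county-level predictions than the linear model (which is not surprising), although not uniformly: the linear model scores higher on expenditure with the demo features (0.72 vs. 0.59) and with both feature sets (0.47 vs. 0.42). The non-linear model is best at predicting the wealth variable (followed by expenditure and then loan repayment), while the linear model does about equally well on expenditure (0.72) and wealth (0.67) with the demo features. Neither model class predicts monthly income; every score for that target is negative. As we recall from our analysis of target variables, wealth is most highly correlated with expenditure and then loan repayment.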
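
A quick check of the comparison between model classes, using the scores copied from the two results tables above. The variable names are illustrative and are not part of the original notebook.

```python
import pandas as pd

feature_sets = ["both", "demo", "geo"]
target_names = ["expenditure", "loan_repay_rate", "monthly_income", "wealth"]

# Cross-validated R^2 scores copied from the two results tables
linear_scores = pd.DataFrame(
    [[0.47, -0.01, -17.37, 0.35],
     [0.72, 0.18, -5.13, 0.67],
     [-0.06, 0.02, -17.11, -0.27]],
    index=feature_sets, columns=target_names)
nonlinear_scores = pd.DataFrame(
    [[0.42, 0.34, -3.38, 0.48],
     [0.59, 0.35, -3.74, 0.80],
     [0.09, 0.28, -3.90, 0.27]],
    index=feature_sets, columns=target_names)

# Positive entries: the non-linear model scored higher
improvement = nonlinear_scores - linear_scores

# (feature set, target) pairs where the linear model scored higher
linear_wins = [(f, t) for f in feature_sets for t in target_names
               if improvement.loc[f, t] < 0]

print(improvement)
print(linear_wins)
```

`linear_wins` contains two pairs, both for expenditure (with the `both` and `demo` feature sets); the non-linear model scores higher on the other ten combinations.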

The geographic variables have little explanatory power at this level of aggregation. This is not entirely surprising given that, at the county level, we're capturing individuals both within the agricultural value chains and outside of them. We may expect things like rainfall and soil properties to be more predictive if we consider only farming households when looking at wealth. It may be productive to check whether we can select just these households in the DHS data and then re-evaluate our models.