## Lecture 27 Notebook: SVM and Parameter Tuning
Duncan Callaway
November 27, 2018

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

In [None]:
pd.options.display.max_columns = 100

Let's import the environmental and demographic datasets from CalEnviroScreen (CES):

In [None]:
env = pd.read_csv('ces3results_environment.csv')
demog = pd.read_csv('ces3results_demographics.csv')

In [None]:
env.head()

In [None]:
demog.head()

In [None]:
print('Enviro cols are', env.columns)
print('Demographics cols are', demog.columns)

Now merge them...

In [None]:
all = 
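
The merge cell above is left blank for the exercise. As a minimal sketch of what a merge on a shared key looks like, the example below joins two toy tables. The frames `env_toy` and `demog_toy`, the `Poverty` column, and the choice of `'Census Tract'` as the join key are all assumptions for illustration. In the real CES files, check the printed column lists first, since a column name with stray whitespace will not match its clean counterpart.

```python
import pandas as pd

# Toy stand-ins for the two CES tables (hypothetical values and columns).
env_toy = pd.DataFrame({
    'Census Tract': [6001400100, 6001400200, 6001400300],
    'Imp. Water Bodies': [0, 3, 7],
})
demog_toy = pd.DataFrame({
    'Census Tract': [6001400100, 6001400200, 6001400400],
    'Poverty': [12.5, 40.1, 22.0],
})

# An inner join keeps only the tracts that appear in both tables.
merged_toy = pd.merge(env_toy, demog_toy, on='Census Tract', how='inner')

# .shape reports (number of rows, number of columns) of the merged frame.
print(merged_toy.shape)
```

Only two tracts appear in both toy tables, so the merged frame has two rows. Comparing the merged row count with the row counts of the inputs is a quick way to notice keys that failed to match.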

Let's look at the size of the new dataframe

There is a lot of data in this frame, much of which is correlated.  What kind of prediction exercises could we do?  Why would that be relevant as a resource allocation problem?









For this demonstration we're going to look at impaired water bodies.  
- CES documents how many pollutants are found in nearby water bodies.
- Map [here](https://oehha.ca.gov/calenviroscreen/indicator/impaired-water-bodies)
- Hypothetical situation: suppose other indicators in the data set are updated more quickly than the impaired water body measures.  In this case we'd like to predict impaired water body statistics from the other CES data.  
- Let's go a step further and see if we can do that prediction with only demographic and health measures for the communities.

In [None]:
np.mean(all.loc[:,'Imp. Water Bodies']>=3)

In [None]:
all = all.dropna()

In [None]:
X = all.loc[:,'Asthma':]
X = X.drop(['Census Tract ', ' CES 3.0 Score', 'CES 3.0 Percentile', ' CES 3.0 \nPercentile Range', 'California \nCounty'], axis=1)
X.columns

In [None]:
y_waste = all[['Solid Waste']]
y_water = all[['Imp. Water Bodies']]!=0 
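
To make the binary target concrete, here is a small sketch on made-up counts (the `toy` frame is hypothetical, not CES data). It shows that comparing a column against zero gives a boolean target, and that the mean of that target is the share of positive cases, which in turn sets the accuracy of a trivial majority-class guess.

```python
import pandas as pd

# Hypothetical pollutant counts for eight tracts.
toy = pd.DataFrame({'Imp. Water Bodies': [0, 0, 2, 5, 0, 3, 0, 0]})

# Double brackets keep a one-column DataFrame; the comparison yields booleans.
y_toy = toy[['Imp. Water Bodies']] != 0

# The mean of a boolean column is the fraction of True values.
positive_rate = y_toy['Imp. Water Bodies'].mean()
print(positive_rate)

# Always predicting the majority class achieves this accuracy.
baseline_accuracy = max(positive_rate, 1 - positive_rate)
print(baseline_accuracy)
```

Any classifier trained later is only useful if its test accuracy beats this baseline.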

## Predicting whether water bodies are contaminated

In this section we'll fit an SVM, using cross-validation to compare parameter options, to predict whether or not the water bodies near each community are contaminated, based on the community's socio-economic metrics.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = 
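
The split cell above is left blank for the exercise. The sketch below shows one way `train_test_split` can be called, using synthetic data. The 25% test share, the fixed `random_state`, and the stratification on the target are assumptions for illustration, not necessarily the choices made in lecture.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and a one-column boolean target (hypothetical data).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
y_toy = pd.DataFrame({'Imp. Water Bodies': rng.random(100) < 0.4})

# Hold out 25% of the rows; stratify so both splits keep similar class shares.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.25,
    random_state=42,
    stratify=y_toy['Imp. Water Bodies'],
)

print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible, which helps when comparing several classifiers on the same held-out rows.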

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

In [None]:
SV_model = SVC()
param_dist = {...}

rnd_search = RandomizedSearchCV(SV_model, param_distributions=param_dist, 
                                cv=3, n_iter=4, n_jobs=4)

rnd_search.fit(X_train, y_train['Imp. Water Bodies'])

print(rnd_search.best_score_)
print(rnd_search.best_params_)
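
The parameter dictionary above is elided. As a hedged illustration of what a search space for `SVC` can look like, the sketch below samples `C` and `gamma` from log-uniform distributions and picks a kernel from a list. The ranges, the synthetic data from `make_classification`, and the name `param_dist_example` are assumptions chosen for the example, not the values used in lecture.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

# Small synthetic classification problem (stand-in for the CES features).
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# Distributions are sampled on each iteration; lists are sampled uniformly.
param_dist_example = {
    'C': loguniform(1e-2, 1e2),      # regularization strength
    'gamma': loguniform(1e-3, 1e0),  # RBF kernel width (ignored by 'linear')
    'kernel': ['rbf', 'linear'],
}

# Four random draws, each scored with 3-fold cross-validation.
search = RandomizedSearchCV(SVC(), param_distributions=param_dist_example,
                            cv=3, n_iter=4, random_state=0)
search.fit(X_tr, y_tr)

print(search.best_score_)   # mean cross-validated accuracy of the best draw
print(search.best_params_)
```

Log-uniform sampling is a common choice for `C` and `gamma` because useful values often span several orders of magnitude. With only four iterations the search covers very little of the space, so a larger `n_iter` is usually worth the extra run time.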

In [None]:
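# Pass the target as a 1-D column, matching how the model was fit
tuned_train_score = rnd_search.score(X_train, y_train['Imp. Water Bodies'])
tuned_test_score = rnd_search.score(X_test, y_test['Imp. Water Bodies'])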

print('Train Score: ', tuned_train_score)
print('Test Score: ', tuned_test_score)

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
y_pred = rnd_search.predict(X_test)
confusion_matrix(y_test, y_pred)
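
For reading the output, scikit-learn's `confusion_matrix` puts true classes on the rows and predicted classes on the columns, with labels in sorted order. The sketch below uses hand-made labels (hypothetical, not model output) to show how the four cells map to true negatives, false positives, false negatives, and true positives.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels: 1 = contaminated, 0 = not contaminated.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat  = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
print(cm)

# For a binary problem, ravel() unpacks the cells in this order.
tn, fp, fn, tp = cm.ravel()

# Precision: share of predicted positives that are correct.
precision = tp / (tp + fp)
# Recall: share of actual positives that were found.
recall = tp / (tp + fn)
print(precision, recall)
```

Accuracy alone can hide an imbalance between the two error types, so it helps to look at the off-diagonal cells separately.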

## Let's try different classifiers

### KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier as KNC

In [None]:
X_train, X_test, y_train, y_test = 

In [None]:
KNC_model = ...
param_dist = {...}

KNC_search = RandomizedSearchCV(KNC_model, param_distributions=param_dist, 
                                cv=3, n_iter=100, n_jobs=4)

KNC_search.fit(X_train, y_train['Imp. Water Bodies'])

print(KNC_search.best_score_)
print(KNC_search.best_params_)
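
The model and parameter dictionary above are elided. The sketch below shows one possible search space for a k-nearest-neighbors classifier on synthetic data. The ranges and the name `knn_param_dist_example` are assumptions for illustration only.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier as KNC

# Small synthetic classification problem (stand-in for the CES features).
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

knn_param_dist_example = {
    'n_neighbors': randint(1, 30),       # integers 1..29 (upper bound excluded)
    'weights': ['uniform', 'distance'],  # equal votes vs. closer-counts-more
    'p': [1, 2],                         # Manhattan vs. Euclidean distance
}

# Ten random draws, each scored with 3-fold cross-validation.
knn_search = RandomizedSearchCV(KNC(), param_distributions=knn_param_dist_example,
                                cv=3, n_iter=10, random_state=0)
knn_search.fit(X_syn, y_syn)

print(knn_search.best_score_)
print(knn_search.best_params_)
```

Because KNN relies on distances, features on very different scales can dominate the result, so standardizing the inputs before the search is often worth trying.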

In [None]:
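# Pass the target as a 1-D column, matching how the model was fit
KNC_train_score = KNC_search.score(X_train, y_train['Imp. Water Bodies'])
KNC_test_score = KNC_search.score(X_test, y_test['Imp. Water Bodies'])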

print('Train Score: ', KNC_train_score)
print('Test Score: ', KNC_test_score)

y_pred = KNC_search.predict(X_test)
confusion_matrix(y_test, y_pred)

### Random forest

In [None]:
from sklearn.ensemble import RandomForestClassifier as RFC

In [None]:
RFC_model = ...

RFC_model.fit(X_train, y_train['Imp. Water Bodies'])
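
The model definition above is elided. As a minimal sketch of fitting a random forest without a parameter search, the example below uses synthetic data. The settings `n_estimators=100` and `max_depth=5` and the name `rfc_example` are assumptions for illustration, not the values used in lecture.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import train_test_split

# Small synthetic classification problem (stand-in for the CES features).
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# Limiting tree depth is one way to reduce overfitting on the training set.
rfc_example = RFC(n_estimators=100, max_depth=5, random_state=0)
rfc_example.fit(X_tr, y_tr)

train_acc = rfc_example.score(X_tr, y_tr)
test_acc = rfc_example.score(X_te, y_te)
print('Train accuracy:', train_acc)
print('Test accuracy:', test_acc)

# Impurity-based importances sum to 1 across the features.
importances = rfc_example.feature_importances_
print(importances)
```

A large gap between train and test accuracy suggests the forest is overfitting. The feature importances give a rough view of which inputs drive the predictions.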

In [None]:
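# Pass the target as a 1-D column, matching how the model was fit
RFC_train_score = RFC_model.score(X_train, y_train['Imp. Water Bodies'])
RFC_test_score = RFC_model.score(X_test, y_test['Imp. Water Bodies'])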

print('Train Score: ', RFC_train_score)
print('Test Score: ', RFC_test_score)

y_pred = RFC_model.predict(X_test)
confusion_matrix(y_test, y_pred)