### Merging Train Data with Weather Data to Begin the Modeling Process


Everything I've done so far on modeling with the train set is set up below. From here, I think we'll want to try various methods with each model (including bagging and grid search) to optimize them and find the best parameters to use. I would normally go back to previous labs to find the code for this, and I trust you have those available as well.

Let me know if you have any questions as you tune the models.

Let's hope we get some great results!

In [2]:
import pandas as pd
import numpy as np
import pandas_profiling as pdp

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn import metrics, svm
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier



# display plots in the notebook
%matplotlib inline
# increase default figure and font sizes for easier viewing
plt.rcParams['figure.figsize'] = (16, 8)
plt.rcParams['font.size'] = 12

In [3]:
#reading in cleaned train data
train = pd.read_csv("../assets/clean_train.csv", index_col=0)
train["Date"] = pd.to_datetime(train["Date"], infer_datetime_format=True)

In [4]:
train.head()

Unnamed: 0,Date,Species,Trap,Latitude,Longitude,NumMosquitos,WnvPresent,Year,Parent_Trap,Is_Satellite,Mos_WNV_Prob,Trap_Ever_Wnv,Num_Years_Trap_WNV_Detection,Max_One_Year_Trap_WNV_Detections,Total_Train_Trap_WNV_Detections,Trap_WNV_Prob,NumMos_3ob_avg,Trap_Species_WNV_Prob
0,2007-05-29,CULEX PIPIENS/RESTUANS,T002,41.95469,-87.800991,1.0,0.0,2007.0,T002,0.0,0.058808,1.0,4,7.0,15.0,0.102041,10.0,0.142857
1,2007-06-05,CULEX PIPIENS/RESTUANS,T002,41.95469,-87.800991,3.0,0.0,2007.0,T002,0.0,0.058808,1.0,4,7.0,15.0,0.102041,19.0,0.142857
2,2007-06-26,CULEX PIPIENS/RESTUANS,T002,41.95469,-87.800991,1.0,0.0,2007.0,T002,0.0,0.058808,1.0,4,7.0,15.0,0.102041,35.666667,0.142857
3,2007-06-29,CULEX PIPIENS/RESTUANS,T002,41.95469,-87.800991,2.0,0.0,2007.0,T002,0.0,0.058808,1.0,4,7.0,15.0,0.102041,40.333333,0.142857
4,2007-07-02,CULEX PIPIENS/RESTUANS,T002,41.95469,-87.800991,3.0,0.0,2007.0,T002,0.0,0.058808,1.0,4,7.0,15.0,0.102041,75.0,0.142857


In [5]:
#scaling train data
scaled_train = StandardScaler().fit_transform(train[["Latitude", "Longitude", "NumMosquitos", "Mos_WNV_Prob", 
                                                     "Trap_Ever_Wnv", "Num_Years_Trap_WNV_Detection", 
                                                     "Max_One_Year_Trap_WNV_Detections", "Total_Train_Trap_WNV_Detections", 
                                                     "Trap_WNV_Prob", "NumMos_3ob_avg", "Trap_Species_WNV_Prob"]])

scaled_train = pd.DataFrame(scaled_train, columns = ["Latitude", "Longitude", "NumMosquitos", "Mos_WNV_Prob", 
                                                     "Trap_Ever_Wnv", "Num_Years_Trap_WNV_Detection", 
                                                     "Max_One_Year_Trap_WNV_Detections", "Total_Train_Trap_WNV_Detections", 
                                                     "Trap_WNV_Prob", "NumMos_3ob_avg", "Trap_Species_WNV_Prob"])

#adding date values and other categories/unscaled variables into scaled dataframe
scaled_train["Date"] = train["Date"]
scaled_train["Species"] = train["Species"]
scaled_train["Trap"] = train["Trap"]
scaled_train["WnvPresent"] = train["WnvPresent"]
scaled_train["Year"] = train["Year"]
scaled_train["Parent_Trap"] = train["Parent_Trap"]
scaled_train["Is_Satellite"] = train["Is_Satellite"]

scaled_train.head()


Unnamed: 0,Latitude,Longitude,NumMosquitos,Mos_WNV_Prob,Trap_Ever_Wnv,Num_Years_Trap_WNV_Detection,Max_One_Year_Trap_WNV_Detections,Total_Train_Trap_WNV_Detections,Trap_WNV_Prob,NumMos_3ob_avg,Trap_Species_WNV_Prob,Date,Species,Trap,WnvPresent,Year,Parent_Trap,Is_Satellite
0,1.032541,-1.263449,-0.198905,0.17102,0.359928,2.050194,1.162658,1.998768,1.154647,-1.126237,1.395763,2007-05-29,CULEX PIPIENS/RESTUANS,T002,0.0,2007.0,T002,0.0
1,1.032541,-1.263449,-0.172266,0.17102,0.359928,2.050194,1.162658,1.998768,1.154647,-1.110581,1.395763,2007-06-05,CULEX PIPIENS/RESTUANS,T002,0.0,2007.0,T002,0.0
2,1.032541,-1.263449,-0.198905,0.17102,0.359928,2.050194,1.162658,1.998768,1.154647,-1.081589,1.395763,2007-06-26,CULEX PIPIENS/RESTUANS,T002,0.0,2007.0,T002,0.0
3,1.032541,-1.263449,-0.185585,0.17102,0.359928,2.050194,1.162658,1.998768,1.154647,-1.073471,1.395763,2007-06-29,CULEX PIPIENS/RESTUANS,T002,0.0,2007.0,T002,0.0
4,1.032541,-1.263449,-0.172266,0.17102,0.359928,2.050194,1.162658,1.998768,1.154647,-1.013169,1.395763,2007-07-02,CULEX PIPIENS/RESTUANS,T002,0.0,2007.0,T002,0.0
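
The cell above rebuilds the frame column by column, which relies on `train` having a default 0..n-1 index so that the assignments line up. A minimal alternative sketch, assuming the column list is defined once (`num_cols` and `scale_columns` are hypothetical names, and the tiny frame is a made-up stand-in for `train`), scales the numeric columns in a copy and leaves everything else untouched:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_columns(df, cols):
    """Return a copy of df with `cols` standardized; other columns are unchanged."""
    out = df.copy()
    out[cols] = StandardScaler().fit_transform(out[cols].astype(float))
    return out

# Toy stand-in for `train` (values are made up for illustration)
toy = pd.DataFrame({
    "Latitude": [41.95, 41.99, 41.77, 41.87],
    "NumMosquitos": [1.0, 3.0, 1.0, 2.0],
    "Trap": ["T002", "T015", "T048", "T050"],
}, index=[10, 11, 12, 13])  # non-default index on purpose

num_cols = ["Latitude", "NumMosquitos"]
scaled_toy = scale_columns(toy, num_cols)
print(scaled_toy)
```

Because the scaled values are written back into a copy of the original frame, the index and the categorical columns never have to be re-attached.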


In [6]:
#reading in cleaned weather data
weather = pd.read_csv("../assets/clean_weather.csv", index_col=0)
weather.reset_index(inplace=True, drop=True)
weather["Date"] = pd.to_datetime(weather["Date"], infer_datetime_format=True)

In [7]:
weather.head()

Unnamed: 0,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,Precip_7d_avg,wind_abv_1std
0,2007-05-01,83,50,67.0,14.0,51,56.0,0.0,2.0,448,1849,0.0,29.1,29.82,1.7,27,9.2,0.0,0.0
1,2007-05-02,59,42,51.0,-3.0,42,47.0,14.0,0.0,447,1850,0.0,29.38,30.09,13.0,4,13.4,0.0,1.0
2,2007-05-03,66,46,56.0,2.0,40,48.0,9.0,0.0,446,1851,0.0,29.39,30.12,11.7,7,11.9,0.0,1.0
3,2007-05-04,66,49,58.0,4.0,41,50.0,7.0,0.0,444,1852,0.001,29.31,30.05,10.4,8,10.8,0.00025,0.0
4,2007-05-05,66,53,60.0,5.0,38,49.0,5.0,0.0,443,1853,0.001,29.4,30.1,11.7,7,12.0,0.0004,1.0


In [8]:
#dropping Depart, WetBulb, Heat, Cool, Sunrise, Sunset, SeaLevel, and ResultSpeed from weather, as these columns
#seem either duplicative or unimportant for purposes of this analysis
#after conducting EDA of weather, I'm also dropping Tmax, Tmin, and ResultDir

#If you change which variables are kept, you may also need to adjust the code for scaling the weather data
#below

weather.drop(["Tmax", "Tmin", "ResultDir","Depart", "WetBulb", "Heat", "Cool", "Sunrise", "Sunset", "SeaLevel", 
              "ResultSpeed"], axis=1, inplace=True)

weather.head()

Unnamed: 0,Date,Tavg,DewPoint,PrecipTotal,StnPressure,AvgSpeed,Precip_7d_avg,wind_abv_1std
0,2007-05-01,67.0,51,0.0,29.1,9.2,0.0,0.0
1,2007-05-02,51.0,42,0.0,29.38,13.4,0.0,1.0
2,2007-05-03,56.0,40,0.0,29.39,11.9,0.0,1.0
3,2007-05-04,58.0,41,0.001,29.31,10.8,0.00025,0.0
4,2007-05-05,60.0,38,0.001,29.4,12.0,0.0004,1.0


In [9]:
#scaling weather data before merging with train data, to take advantage of ALL weather data when scaling, rather than
#just observations that would appear in the train set
scaled_weather = StandardScaler().fit_transform(weather[["Tavg", "DewPoint", "PrecipTotal", "StnPressure", 
                                                                     "AvgSpeed", "Precip_7d_avg"]])

scaled_weather = pd.DataFrame(scaled_weather, columns = ["Tavg", "DewPoint", "PrecipTotal", "StnPressure", 
                                                                     "AvgSpeed", "Precip_7d_avg"])

#adding date values and the unscaled wind_abv_1std column into scaled dataframe
scaled_weather["Date"] = weather["Date"]
scaled_weather["wind_abv_1std"] = weather["wind_abv_1std"]

scaled_weather.head()

Unnamed: 0,Tavg,DewPoint,PrecipTotal,StnPressure,AvgSpeed,Precip_7d_avg,Date,wind_abv_1std
0,0.037433,-0.222912,-0.319916,-0.987494,0.197482,-0.755462,2007-05-01,0.0
1,-1.485236,-1.066664,-0.319916,0.806746,1.515681,-0.755462,2007-05-02,1.0
2,-1.009402,-1.254164,-0.319916,0.870826,1.044896,-0.755462,2007-05-03,1.0
3,-0.819068,-1.160414,-0.31754,0.358186,0.699653,-0.754059,2007-05-04,0.0
4,-0.628734,-1.441665,-0.31754,0.934906,1.076281,-0.753217,2007-05-05,1.0


In [10]:
#inner merging weather with train data
combine = pd.merge(scaled_weather, scaled_train, how= "inner", left_on='Date', right_on='Date')
combine.drop("Year", axis=1, inplace=True)

combine

Unnamed: 0,Tavg,DewPoint,PrecipTotal,StnPressure,AvgSpeed,Precip_7d_avg,Date,wind_abv_1std,Latitude,Longitude,...,Max_One_Year_Trap_WNV_Detections,Total_Train_Trap_WNV_Detections,Trap_WNV_Prob,NumMos_3ob_avg,Trap_Species_WNV_Prob,Species,Trap,WnvPresent,Parent_Trap,Is_Satellite
0,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,1.032541,-1.263449,...,1.162658,1.998768,1.154647,-1.126237,1.395763,CULEX PIPIENS/RESTUANS,T002,0.0,T002,0.0
1,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,1.214515,-1.546838,...,0.163922,-0.243101,1.705585,-1.126237,1.769439,CULEX PIPIENS/RESTUANS,T015,0.0,T015,0.0
2,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,0.210971,0.482575,...,-0.501901,-0.243101,-0.603421,-1.126237,-0.305108,CULEX PIPIENS/RESTUANS,T048,0.0,T048,0.0
3,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,0.700966,0.006295,...,-1.167725,-1.058326,-1.293965,-1.126237,-0.846294,CULEX PIPIENS/RESTUANS,T050,0.0,T050,0.0
4,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,0.725562,0.745953,...,0.496834,-0.039294,-0.349225,-1.126237,-0.074438,CULEX PIPIENS/RESTUANS,T054,0.0,T054,0.0
5,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,-1.466128,0.215080,...,1.162658,0.368318,2.612426,-1.126237,2.096405,CULEX PIPIENS/RESTUANS,T086,0.0,T086,0.0
6,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,0.436274,0.990130,...,-1.167725,-1.058326,-1.293965,-1.126237,-0.846294,CULEX PIPIENS/RESTUANS,T129,0.0,T129,0.0
7,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,1.449405,-1.199137,...,0.496834,-0.039294,2.991107,-1.126237,1.769439,CULEX PIPIENS/RESTUANS,T143,0.0,T143,0.0
8,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,1.621079,0.083504,...,-1.167725,-1.058326,-1.293965,-1.126237,-0.846294,CULEX PIPIENS/RESTUANS,T148,0.0,T148,0.0
9,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,1.032541,-1.263449,...,1.162658,1.998768,1.154647,-1.093186,-0.846294,CULEX RESTUANS,T002,0.0,T002,0.0
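
Because the merge is on `Date` alone, it keeps one row per trap observation only if the weather table has exactly one row per date. A hedged sketch of how that assumption could be checked, using toy frames (`toy_weather` and `toy_train` are made-up stand-ins) and the `validate` argument of `pd.merge`:

```python
import pandas as pd

toy_weather = pd.DataFrame({
    "Date": pd.to_datetime(["2007-05-29", "2007-06-05"]),
    "Tavg": [0.70, -0.12],
})
toy_train = pd.DataFrame({
    "Date": pd.to_datetime(["2007-05-29", "2007-05-29", "2007-06-05"]),
    "Trap": ["T002", "T015", "T002"],
})

# validate="one_to_many" raises MergeError if a date appears twice in the weather frame
merged = pd.merge(toy_weather, toy_train, how="inner", on="Date", validate="one_to_many")

# The inner merge should not add rows; it can only drop train rows with no weather match
print(len(toy_train), len(merged))
```

Comparing the row counts before and after the merge is a cheap way to spot both duplicated weather dates and train dates with no weather record.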


In [11]:
combine.columns

Index([u'Tavg', u'DewPoint', u'PrecipTotal', u'StnPressure', u'AvgSpeed',
       u'Precip_7d_avg', u'Date', u'wind_abv_1std', u'Latitude', u'Longitude',
       u'NumMosquitos', u'Mos_WNV_Prob', u'Trap_Ever_Wnv',
       u'Num_Years_Trap_WNV_Detection', u'Max_One_Year_Trap_WNV_Detections',
       u'Total_Train_Trap_WNV_Detections', u'Trap_WNV_Prob', u'NumMos_3ob_avg',
       u'Trap_Species_WNV_Prob', u'Species', u'Trap', u'WnvPresent',
       u'Parent_Trap', u'Is_Satellite'],
      dtype='object')

# Modeling to Predict Presence of WNV

In [12]:
#dropping "Parent_Trap", "Is_Satellite" because I don't believe these will be helpful at this point
#also dropping Date, Species, and Trap, as these are categorical variables that we're already accounting for in a 
#number of ways (including global Mos_WNV_Prob and Trap_WNV_Prob variables, among others)

X = combine.copy()

#You can add or remove variables as you like and rerun this cell to produce your desired X - you shouldn't 
#need to go back above this cell unless you want to change the data, the scaling, or the variables created above.
X.drop(["Date", "Parent_Trap", "Is_Satellite", "Species", "Trap", "WnvPresent"], axis=1, inplace=True)

#preparing y values
y = combine["WnvPresent"]

In [13]:
X.head()

Unnamed: 0,Tavg,DewPoint,PrecipTotal,StnPressure,AvgSpeed,Precip_7d_avg,wind_abv_1std,Latitude,Longitude,NumMosquitos,Mos_WNV_Prob,Trap_Ever_Wnv,Num_Years_Trap_WNV_Detection,Max_One_Year_Trap_WNV_Detections,Total_Train_Trap_WNV_Detections,Trap_WNV_Prob,NumMos_3ob_avg,Trap_Species_WNV_Prob
0,0.703601,0.43334,-0.319916,0.870826,-0.649931,0.103875,0.0,1.032541,-1.263449,-0.198905,0.17102,0.359928,2.050194,1.162658,1.998768,1.154647,-1.126237,1.395763
1,0.703601,0.43334,-0.319916,0.870826,-0.649931,0.103875,0.0,1.214515,-1.546838,-0.198905,0.17102,0.359928,-0.715267,0.163922,-0.243101,1.705585,-1.126237,1.769439
2,0.703601,0.43334,-0.319916,0.870826,-0.649931,0.103875,0.0,0.210971,0.482575,-0.198905,0.17102,0.359928,0.206553,-0.501901,-0.243101,-0.603421,-1.126237,-0.305108
3,0.703601,0.43334,-0.319916,0.870826,-0.649931,0.103875,0.0,0.700966,0.006295,-0.198905,0.17102,-2.778333,-1.637088,-1.167725,-1.058326,-1.293965,-1.126237,-0.846294
4,0.703601,0.43334,-0.319916,0.870826,-0.649931,0.103875,0.0,0.725562,0.745953,-0.185585,0.17102,0.359928,-0.715267,0.496834,-0.039294,-0.349225,-1.126237,-0.074438


In [14]:
y.head()

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: WnvPresent, dtype: float64

In [15]:
#creating function to fit and test a classification model (created for SVM lab)
def do_cm_cr(model, X, y, names): 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)    #predicted class labels (not probabilities) for X_test
    print("Output for Tested Model:")
    print("Confusion Matrix of Predictions: ")
    print("")
    print(confusion_matrix(y_test, y_pred)) # Actual values are rows (0, 1), while predicted are columns (0, 1)
    print("")
    #printing classification report
    #precision is true positives / (true positives + false positives) - of all predicted positives, % correct
    #recall is true positives / (true positives + false negatives) - of all actual positives, % found
    #f1-score is the harmonic mean of precision and recall; it reaches its best value at 1 and worst at 0
    #support is the number of true instances of each class
    print("Classification Matrix: ")
    print("")
    print(classification_report(y_test, y_pred, target_names=names))
    return model.score(X_test,y_test)    #accuracy on the test split
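
`do_cm_cr` scores each model on a single 67/33 split and returns accuracy. With so few positive cases, a single split can be noisy, so a cross-validated ROC AUC may be a useful companion number. The sketch below is an assumption about how that could look, not part of the workflow above; `cv_auc` is a hypothetical helper and the data come from `make_classification` rather than from `X` and `y`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def cv_auc(model, X, y, n_splits=5):
    """Mean and standard deviation of ROC AUC over stratified folds."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    return scores.mean(), scores.std()

# Synthetic imbalanced data (about 5% positives), standing in for the real X and y
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, weights=[0.95, 0.05],
                                     random_state=0)
mean_auc, std_auc = cv_auc(LogisticRegression(max_iter=1000), X_demo, y_demo)
print(round(mean_auc, 3), round(std_auc, 3))
```

Stratified folds keep the share of positives roughly equal in every fold, which matters when only about one row in twenty is positive.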

In [16]:
#we'll start with Logistic Regression
logreg = LogisticRegression()

do_cm_cr(logreg, X, y, ["no_Wnv", "yes_Wnv"])


Output for Tested Model:
Confusion Matrix of Predictions: 

[[2628   18]
 [ 136   15]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.95      0.99      0.97      2646
    yes_Wnv       0.45      0.10      0.16       151

avg / total       0.92      0.94      0.93      2797



0.94494100822309612
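
The accuracy of 0.945 looks strong, but it is worth reading against the confusion matrix. Recomputing the metrics by hand from the four counts above shows that a model predicting "no WNV" for every row would score slightly higher on accuracy:

```python
import numpy as np

# Confusion matrix from the logistic regression output above
# rows = actual (no_Wnv, yes_Wnv), columns = predicted (no_Wnv, yes_Wnv)
cm = np.array([[2628, 18],
               [136, 15]])
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / float(cm.sum())
precision = tp / float(tp + fp)          # of predicted WNV, share that were right
recall = tp / float(tp + fn)             # of actual WNV, share that were found
baseline = (tn + fp) / float(cm.sum())   # accuracy of always predicting no_Wnv

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(baseline, 4))
```

For this reason, the `yes_Wnv` precision and recall are probably better guides than the accuracy score when comparing the models below.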

In [17]:
#k-nearest neighbors
knnc = KNeighborsClassifier()

do_cm_cr(knnc, X, y, ["no_Wnv", "yes_Wnv"])

#After scaling the X data, knnc performs about as well as logistic regression


Output for Tested Model:
Confusion Matrix of Predictions: 

[[2629   17]
 [ 135   16]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.95      0.99      0.97      2646
    yes_Wnv       0.48      0.11      0.17       151

avg / total       0.93      0.95      0.93      2797



0.94565606006435465

In [18]:
#SVM with linear kernel
lin_svm = svm.SVC(kernel='linear')

do_cm_cr(lin_svm, X, y, ["no_Wnv", "yes_Wnv"])

#SVM with linear kernel predicts 0 instances of WNV - wow.

Output for Tested Model:
Confusion Matrix of Predictions: 

[[2646    0]
 [ 151    0]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.95      1.00      0.97      2646
    yes_Wnv       0.00      0.00      0.00       151

avg / total       0.89      0.95      0.92      2797





0.94601358598498386
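
A classifier that predicts no positives at all is a common symptom of class imbalance: with roughly 5% positive rows, ignoring them costs very little accuracy. One option that may be worth trying is `class_weight='balanced'`, which scikit-learn's `SVC` accepts. The sketch below uses synthetic one-feature data (not `X` and `y`) purely to illustrate the effect on recall.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: one noisy feature, 5% positives shifted by +1.5
rng = np.random.RandomState(0)
n_neg, n_pos = 1900, 100
X_demo = np.concatenate([rng.normal(0.0, 1.0, (n_neg, 1)),
                         rng.normal(1.5, 1.0, (n_pos, 1))])
y_demo = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33,
                                          random_state=42, stratify=y_demo)

plain = SVC(kernel="linear").fit(X_tr, y_tr)
balanced = SVC(kernel="linear", class_weight="balanced").fit(X_tr, y_tr)

# Recall on the positive class, with and without class weighting
plain_recall = recall_score(y_te, plain.predict(X_te))
balanced_recall = recall_score(y_te, balanced.predict(X_te))
print(plain_recall, balanced_recall)
```

Weighting trades precision for recall, so the confusion matrix should be checked again after any such change.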

In [19]:
#SVM with rbf kernel
rbf_svm = svm.SVC(kernel='rbf')

do_cm_cr(rbf_svm, X, y, ["no_Wnv", "yes_Wnv"])

#SVM with rbf kernel predicted only 6 instances of WNV, and just 2 of those were correct.

Output for Tested Model:
Confusion Matrix of Predictions: 

[[2642    4]
 [ 149    2]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.95      1.00      0.97      2646
    yes_Wnv       0.33      0.01      0.03       151

avg / total       0.91      0.95      0.92      2797



0.94529853414372544

In [20]:
#normal decision tree
drc = DecisionTreeClassifier()

do_cm_cr(drc, X, y, ["no_Wnv", "yes_Wnv"])

Output for Tested Model:
Confusion Matrix of Predictions: 

[[2527  119]
 [ 104   47]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.96      0.96      0.96      2646
    yes_Wnv       0.28      0.31      0.30       151

avg / total       0.92      0.92      0.92      2797



0.92027171969967825
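
The single tree finds more WNV cases than any other model in this notebook (recall 0.31) but at low precision (0.28), which is the kind of high-variance behavior bagging is meant to dampen. A minimal sketch of bagging a decision tree follows, assuming synthetic data in place of `X` and `y`; the base estimator is passed positionally because its keyword name differs across scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 5% positives), standing in for the real X and y
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, weights=[0.95, 0.05],
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33,
                                          random_state=42, stratify=y_demo)

# 50 trees, each fit on a bootstrap sample of the training rows
bag = BaggingClassifier(DecisionTreeClassifier(random_state=0), n_estimators=50,
                        random_state=0)
bag.fit(X_tr, y_tr)
print(bag.score(X_te, y_te))
```

On the real data, the fitted `bag` could be passed to `do_cm_cr` like any other model to compare its confusion matrix with the single tree's.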

In [21]:
#random forest
rfc = RandomForestClassifier()

do_cm_cr(rfc, X, y, ["no_Wnv", "yes_Wnv"])

Output for Tested Model:
Confusion Matrix of Predictions: 

[[2629   17]
 [ 128   23]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.95      0.99      0.97      2646
    yes_Wnv       0.57      0.15      0.24       151

avg / total       0.93      0.95      0.93      2797



0.94815874150875934

In [22]:
#gradient boosting classifier
gbc = GradientBoostingClassifier()

do_cm_cr(gbc, X, y, ["no_Wnv", "yes_Wnv"])

Output for Tested Model:
Confusion Matrix of Predictions: 

[[2628   18]
 [ 121   30]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.96      0.99      0.97      2646
    yes_Wnv       0.62      0.20      0.30       151

avg / total       0.94      0.95      0.94      2797



0.95030389703253482
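
Gradient boosting has the best accuracy and `yes_Wnv` precision of the models tried, so it is a reasonable first candidate for the grid search mentioned at the top of this notebook. The sketch below is only an assumption about how that search might be set up: the parameter grid is illustrative, the data are synthetic, and `scoring="f1"` is chosen so that the search rewards finding WNV cases rather than overall accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data (about 10% positives), standing in for the real X and y
X_demo, y_demo = make_classification(n_samples=600, n_features=8, weights=[0.9, 0.1],
                                     random_state=0)

# Illustrative grid only; sensible ranges depend on the real data
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

`search.best_estimator_` is refit on all the data passed to `fit`, so it should be evaluated on rows the search never saw.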

In [23]:
#Checking feature importances of variables following gbc
#(feature_importances_ is in the same order as X.columns, so the two can be paired directly)
for c, d in zip(X.columns, gbc.feature_importances_):
    print([c, d])
            
#I previously tried dropping Trap_Ever_Wnv, wind_abv_1std, Total_Train_Trap_WNV_Detections, and 
#Num_Years_Trap_WNV_Detection, as these had the worst feature importances among the tested variables; however, this
#did not drastically change the results of the models.

#Number of mosquitos is by far the most important feature in classifying the presence of WNV.  If we can develop an
#effective model for predicting WNV, we should be golden.

['Tavg', 0.054068586306598852]
['DewPoint', 0.034558000876238636]
['PrecipTotal', 0.030168226142438498]
['StnPressure', 0.020448777286988989]
['AvgSpeed', 0.029106315536584545]
['Precip_7d_avg', 0.048946398767138098]
['wind_abv_1std', 0.0057970443830018502]
['Latitude', 0.044301300002922393]
['Longitude', 0.025813409379131725]
['NumMosquitos', 0.3426699954978209]
['Mos_WNV_Prob', 0.024597136825983804]
['Trap_Ever_Wnv', 0.0]
['Num_Years_Trap_WNV_Detection', 0.0029263096854629991]
['Max_One_Year_Trap_WNV_Detections', 0.0091364806509597225]
['Total_Train_Trap_WNV_Detections', 0.010370844238486211]
['Trap_WNV_Prob', 0.032302624790540219]
['NumMos_3ob_avg', 0.093196311649226593]
['Trap_Species_WNV_Prob', 0.19159223798047581]
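
The list is easier to read when sorted. As a small sketch, the values printed above (rounded to four decimals) can be put in a `pandas.Series` and ranked; with the fitted model at hand, `pd.Series(gbc.feature_importances_, index=X.columns)` would build the same kind of object without retyping.

```python
import pandas as pd

# Importances copied from the output above, rounded to four decimals
importances = pd.Series({
    "Tavg": 0.0541, "DewPoint": 0.0346, "PrecipTotal": 0.0302, "StnPressure": 0.0204,
    "AvgSpeed": 0.0291, "Precip_7d_avg": 0.0489, "wind_abv_1std": 0.0058,
    "Latitude": 0.0443, "Longitude": 0.0258, "NumMosquitos": 0.3427,
    "Mos_WNV_Prob": 0.0246, "Trap_Ever_Wnv": 0.0, "Num_Years_Trap_WNV_Detection": 0.0029,
    "Max_One_Year_Trap_WNV_Detections": 0.0091, "Total_Train_Trap_WNV_Detections": 0.0104,
    "Trap_WNV_Prob": 0.0323, "NumMos_3ob_avg": 0.0932, "Trap_Species_WNV_Prob": 0.1916,
})

# Largest importance first
ranked = importances.sort_values(ascending=False)
print(ranked.head(3))
```

The top three features (`NumMosquitos`, `Trap_Species_WNV_Prob`, `NumMos_3ob_avg`) account for roughly 63% of the total importance.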
