### Merging Train Data with Weather Data to Begin the Modeling Process


Below is everything I've done so far to model the train set. From here, I think we should tune each model (for example with bagging and grid search) to find the best parameters. I would normally pull the code for this from previous labs, and I trust you have those available as well.

Let me know if you have any questions as you tune the models.

Let's hope we get some great results!

In [1]:
import pandas as pd
import numpy as np
import pandas_profiling as pdp

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn import metrics, svm
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier



# display plots in the notebook
%matplotlib inline
# increase default figure and font sizes for easier viewing
plt.rcParams['figure.figsize'] = (16, 8)
plt.rcParams['font.size'] = 12

In [7]:
#reading in cleaned train data
train = pd.read_csv("../assets/clean_train.csv", index_col=0)
train["Date"] = pd.to_datetime(train["Date"], infer_datetime_format=True)

In [8]:
train.head()

Unnamed: 0,Date,Species,Trap,Latitude,Longitude,NumMosquitos,WnvPresent,Year,Parent_Trap,Is_Satellite,Mos_WNV_Prob,Trap_Ever_Wnv,Num_Years_Trap_WNV_Detection,Max_One_Year_Trap_WNV_Detections,Total_Train_Trap_WNV_Detections,Trap_WNV_Prob,NumMos_3ob_avg,Trap_Species_WNV_Prob
0,2007-05-29,CULEX PIPIENS/RESTUANS,T002,41.95469,-87.800991,1.0,0.0,2007.0,T002,0.0,0.058808,1.0,4,7.0,15.0,0.102041,10.0,0.142857
1,2007-06-05,CULEX PIPIENS/RESTUANS,T002,41.95469,-87.800991,3.0,0.0,2007.0,T002,0.0,0.058808,1.0,4,7.0,15.0,0.102041,19.0,0.142857
2,2007-06-26,CULEX PIPIENS/RESTUANS,T002,41.95469,-87.800991,1.0,0.0,2007.0,T002,0.0,0.058808,1.0,4,7.0,15.0,0.102041,35.666667,0.142857
3,2007-06-29,CULEX PIPIENS/RESTUANS,T002,41.95469,-87.800991,2.0,0.0,2007.0,T002,0.0,0.058808,1.0,4,7.0,15.0,0.102041,40.333333,0.142857
4,2007-07-02,CULEX PIPIENS/RESTUANS,T002,41.95469,-87.800991,3.0,0.0,2007.0,T002,0.0,0.058808,1.0,4,7.0,15.0,0.102041,75.0,0.142857


In [9]:
#scaling train data
scaled_train = StandardScaler().fit_transform(train[["Latitude", "Longitude", "NumMosquitos", "Mos_WNV_Prob", 
                                                     "Trap_Ever_Wnv", "Num_Years_Trap_WNV_Detection", 
                                                     "Max_One_Year_Trap_WNV_Detections", "Total_Train_Trap_WNV_Detections", 
                                                     "Trap_WNV_Prob", "NumMos_3ob_avg", "Trap_Species_WNV_Prob"]])

scaled_train = pd.DataFrame(scaled_train, columns = ["Latitude", "Longitude", "NumMosquitos", "Mos_WNV_Prob", 
                                                     "Trap_Ever_Wnv", "Num_Years_Trap_WNV_Detection", 
                                                     "Max_One_Year_Trap_WNV_Detections", "Total_Train_Trap_WNV_Detections", 
                                                     "Trap_WNV_Prob", "NumMos_3ob_avg", "Trap_Species_WNV_Prob"])

#adding date values and other categories/unscaled variables into scaled dataframe
scaled_train["Date"] = train["Date"]
scaled_train["Species"] = train["Species"]
scaled_train["Trap"] = train["Trap"]
scaled_train["WnvPresent"] = train["WnvPresent"]
scaled_train["Year"] = train["Year"]
scaled_train["Parent_Trap"] = train["Parent_Trap"]
scaled_train["Is_Satellite"] = train["Is_Satellite"]

scaled_train.head()


Unnamed: 0,Latitude,Longitude,NumMosquitos,Mos_WNV_Prob,Trap_Ever_Wnv,Num_Years_Trap_WNV_Detection,Max_One_Year_Trap_WNV_Detections,Total_Train_Trap_WNV_Detections,Trap_WNV_Prob,NumMos_3ob_avg,Trap_Species_WNV_Prob,Date,Species,Trap,WnvPresent,Year,Parent_Trap,Is_Satellite
0,1.032541,-1.263449,-0.198905,0.17102,0.359928,2.050194,1.162658,1.998768,1.154647,-1.126237,1.395763,2007-05-29,CULEX PIPIENS/RESTUANS,T002,0.0,2007.0,T002,0.0
1,1.032541,-1.263449,-0.172266,0.17102,0.359928,2.050194,1.162658,1.998768,1.154647,-1.110581,1.395763,2007-06-05,CULEX PIPIENS/RESTUANS,T002,0.0,2007.0,T002,0.0
2,1.032541,-1.263449,-0.198905,0.17102,0.359928,2.050194,1.162658,1.998768,1.154647,-1.081589,1.395763,2007-06-26,CULEX PIPIENS/RESTUANS,T002,0.0,2007.0,T002,0.0
3,1.032541,-1.263449,-0.185585,0.17102,0.359928,2.050194,1.162658,1.998768,1.154647,-1.073471,1.395763,2007-06-29,CULEX PIPIENS/RESTUANS,T002,0.0,2007.0,T002,0.0
4,1.032541,-1.263449,-0.172266,0.17102,0.359928,2.050194,1.162658,1.998768,1.154647,-1.013169,1.395763,2007-07-02,CULEX PIPIENS/RESTUANS,T002,0.0,2007.0,T002,0.0


In [10]:
#reading in cleaned weather data
weather = pd.read_csv("../assets/clean_weather.csv", index_col=0)
weather.reset_index(inplace=True, drop=True)
weather["Date"] = pd.to_datetime(weather["Date"], infer_datetime_format=True)

In [11]:
weather.head()

Unnamed: 0,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,Precip_7d_avg,wind_abv_1std
0,2007-05-01,83,50,67.0,14.0,51,56.0,0.0,2.0,448,1849,0.0,29.1,29.82,1.7,27,9.2,0.0,0.0
1,2007-05-02,59,42,51.0,-3.0,42,47.0,14.0,0.0,447,1850,0.0,29.38,30.09,13.0,4,13.4,0.0,1.0
2,2007-05-03,66,46,56.0,2.0,40,48.0,9.0,0.0,446,1851,0.0,29.39,30.12,11.7,7,11.9,0.0,1.0
3,2007-05-04,66,49,58.0,4.0,41,50.0,7.0,0.0,444,1852,0.001,29.31,30.05,10.4,8,10.8,0.00025,0.0
4,2007-05-05,66,53,60.0,5.0,38,49.0,5.0,0.0,443,1853,0.001,29.4,30.1,11.7,7,12.0,0.0004,1.0


In [12]:
#dropping Depart, Wetbulb, Heat, Cool, Sunrise, Sunset, SeaLevel, ResultSpeed from weather, as many of these columns
#seem either duplicative or unimportant for purposes of this analysis
#after conducting EDA of weather, I'm also dropping Tmax, Tmin, and ResultDir

#If you want to mess around with changing the variables, you may need to adjust the code for scaling the weather data
#below

weather.drop(["Tmax", "Tmin", "ResultDir","Depart", "WetBulb", "Heat", "Cool", "Sunrise", "Sunset", "SeaLevel", 
              "ResultSpeed"], axis=1, inplace=True)

weather.head()

Unnamed: 0,Date,Tavg,DewPoint,PrecipTotal,StnPressure,AvgSpeed,Precip_7d_avg,wind_abv_1std
0,2007-05-01,67.0,51,0.0,29.1,9.2,0.0,0.0
1,2007-05-02,51.0,42,0.0,29.38,13.4,0.0,1.0
2,2007-05-03,56.0,40,0.0,29.39,11.9,0.0,1.0
3,2007-05-04,58.0,41,0.001,29.31,10.8,0.00025,0.0
4,2007-05-05,60.0,38,0.001,29.4,12.0,0.0004,1.0


In [13]:
#scaling weather data before merging with train data, to take advantage of ALL weather data when scaling, rather than
#just observations that would appear in the train set
scaled_weather = StandardScaler().fit_transform(weather[["Tavg", "DewPoint", "PrecipTotal", "StnPressure", 
                                                                     "AvgSpeed", "Precip_7d_avg"]])

scaled_weather = pd.DataFrame(scaled_weather, columns = ["Tavg", "DewPoint", "PrecipTotal", "StnPressure", 
                                                                     "AvgSpeed", "Precip_7d_avg"])

#adding date values into scaled dataframe
scaled_weather["Date"] = weather["Date"]
scaled_weather["wind_abv_1std"] = weather["wind_abv_1std"]

scaled_weather.head()

Unnamed: 0,Tavg,DewPoint,PrecipTotal,StnPressure,AvgSpeed,Precip_7d_avg,Date,wind_abv_1std
0,0.037433,-0.222912,-0.319916,-0.987494,0.197482,-0.755462,2007-05-01,0.0
1,-1.485236,-1.066664,-0.319916,0.806746,1.515681,-0.755462,2007-05-02,1.0
2,-1.009402,-1.254164,-0.319916,0.870826,1.044896,-0.755462,2007-05-03,1.0
3,-0.819068,-1.160414,-0.31754,0.358186,0.699653,-0.754059,2007-05-04,0.0
4,-0.628734,-1.441665,-0.31754,0.934906,1.076281,-0.753217,2007-05-05,1.0
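Both scaling cells above rebuild a DataFrame from the scaled array and then copy the unscaled columns back in, which relies on the two frames sharing the same index. A minimal sketch of an alternative, assuming a hypothetical helper named `scale_columns` (not part of the original notebook), scales the chosen columns in place on a copy so the index and the remaining columns are carried along automatically:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_columns(df, cols):
    """Return a copy of df with `cols` standardized; other columns are untouched."""
    out = df.copy()
    # Assigning back into the copy keeps the original index, so no realignment is needed.
    out[cols] = StandardScaler().fit_transform(df[cols])
    return out


# Tiny made-up frame with a non-default index, only to show the behaviour.
demo = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2007-05-01", "2007-05-02", "2007-05-03"]),
        "Tavg": [67.0, 51.0, 56.0],
        "wind_abv_1std": [0.0, 1.0, 1.0],
    },
    index=[10, 11, 12],
)
scaled_demo = scale_columns(demo, ["Tavg"])
print(scaled_demo)
```

With the notebook's data this would be called as `scale_columns(weather, ["Tavg", "DewPoint", ...])`, and the column list would only need to be written once.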


In [14]:
#inner merging weather with train data
combine = pd.merge(scaled_weather, scaled_train, how= "inner", left_on='Date', right_on='Date')
combine.drop("Year", axis=1, inplace=True)

combine

Unnamed: 0,Tavg,DewPoint,PrecipTotal,StnPressure,AvgSpeed,Precip_7d_avg,Date,wind_abv_1std,Latitude,Longitude,...,Max_One_Year_Trap_WNV_Detections,Total_Train_Trap_WNV_Detections,Trap_WNV_Prob,NumMos_3ob_avg,Trap_Species_WNV_Prob,Species,Trap,WnvPresent,Parent_Trap,Is_Satellite
0,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,1.032541,-1.263449,...,1.162658,1.998768,1.154647,-1.126237,1.395763,CULEX PIPIENS/RESTUANS,T002,0.0,T002,0.0
1,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,1.214515,-1.546838,...,0.163922,-0.243101,1.705585,-1.126237,1.769439,CULEX PIPIENS/RESTUANS,T015,0.0,T015,0.0
2,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,0.210971,0.482575,...,-0.501901,-0.243101,-0.603421,-1.126237,-0.305108,CULEX PIPIENS/RESTUANS,T048,0.0,T048,0.0
3,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,0.700966,0.006295,...,-1.167725,-1.058326,-1.293965,-1.126237,-0.846294,CULEX PIPIENS/RESTUANS,T050,0.0,T050,0.0
4,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,0.725562,0.745953,...,0.496834,-0.039294,-0.349225,-1.126237,-0.074438,CULEX PIPIENS/RESTUANS,T054,0.0,T054,0.0
5,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,-1.466128,0.215080,...,1.162658,0.368318,2.612426,-1.126237,2.096405,CULEX PIPIENS/RESTUANS,T086,0.0,T086,0.0
6,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,0.436274,0.990130,...,-1.167725,-1.058326,-1.293965,-1.126237,-0.846294,CULEX PIPIENS/RESTUANS,T129,0.0,T129,0.0
7,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,1.449405,-1.199137,...,0.496834,-0.039294,2.991107,-1.126237,1.769439,CULEX PIPIENS/RESTUANS,T143,0.0,T143,0.0
8,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,1.621079,0.083504,...,-1.167725,-1.058326,-1.293965,-1.126237,-0.846294,CULEX PIPIENS/RESTUANS,T148,0.0,T148,0.0
9,0.703601,0.433340,-0.319916,0.870826,-0.649931,0.103875,2007-05-29,0.0,1.032541,-1.263449,...,1.162658,1.998768,1.154647,-1.093186,-0.846294,CULEX RESTUANS,T002,0.0,T002,0.0
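An inner merge silently drops any trap observation whose date has no weather row, and it duplicates observations if the weather table ever holds more than one row per date. The sketch below shows one way to check for both, using the `validate` and `indicator` options of `pd.merge` on two small made-up frames (`train_demo` and `weather_demo` are illustrative names, not the notebook's data):

```python
import pandas as pd

weather_demo = pd.DataFrame({
    "Date": pd.to_datetime(["2007-05-29", "2007-06-05"]),
    "Tavg": [0.70, -0.10],
})
train_demo = pd.DataFrame({
    "Date": pd.to_datetime(["2007-05-29", "2007-05-29", "2007-06-05", "2007-06-26"]),
    "Trap": ["T002", "T015", "T002", "T002"],
})

# validate="many_to_one" raises if weather_demo has duplicate dates;
# indicator=True adds a "_merge" column showing where each row came from.
merged = pd.merge(
    train_demo, weather_demo, how="left", on="Date",
    validate="many_to_one", indicator=True,
)

# Rows that an inner merge would have dropped.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

If `unmatched` is empty on the real data, the inner merge used above loses nothing.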


In [15]:
combine.columns

Index([u'Tavg', u'DewPoint', u'PrecipTotal', u'StnPressure', u'AvgSpeed',
       u'Precip_7d_avg', u'Date', u'wind_abv_1std', u'Latitude', u'Longitude',
       u'NumMosquitos', u'Mos_WNV_Prob', u'Trap_Ever_Wnv',
       u'Num_Years_Trap_WNV_Detection', u'Max_One_Year_Trap_WNV_Detections',
       u'Total_Train_Trap_WNV_Detections', u'Trap_WNV_Prob', u'NumMos_3ob_avg',
       u'Trap_Species_WNV_Prob', u'Species', u'Trap', u'WnvPresent',
       u'Parent_Trap', u'Is_Satellite'],
      dtype='object')

# Modeling to Predict Presence of WNV

In [16]:
#dropping "Parent_Trap", "Is_Satellite" because I don't believe these will be helpful at this point
#also dropping Date, Species, and Trap, as these are categorical variables that we're already accounting for in a 
#number of ways (including global Mos_WNV_Prob and Trap_WNV_Prob variables, among others)

X = combine.copy()

#You can add or remove variables as you like and rerun this cell to produce your desired X - you shouldn't 
#need to go back above this cell unless you want to change the data, variables, or scaling set up there.
X.drop(["Date", "Parent_Trap", "Is_Satellite", "Species", "Trap", "WnvPresent"], axis=1, inplace=True)

#preparing y values
y = combine["WnvPresent"]

In [17]:
X.head()

Unnamed: 0,Tavg,DewPoint,PrecipTotal,StnPressure,AvgSpeed,Precip_7d_avg,wind_abv_1std,Latitude,Longitude,NumMosquitos,Mos_WNV_Prob,Trap_Ever_Wnv,Num_Years_Trap_WNV_Detection,Max_One_Year_Trap_WNV_Detections,Total_Train_Trap_WNV_Detections,Trap_WNV_Prob,NumMos_3ob_avg,Trap_Species_WNV_Prob
0,0.703601,0.43334,-0.319916,0.870826,-0.649931,0.103875,0.0,1.032541,-1.263449,-0.198905,0.17102,0.359928,2.050194,1.162658,1.998768,1.154647,-1.126237,1.395763
1,0.703601,0.43334,-0.319916,0.870826,-0.649931,0.103875,0.0,1.214515,-1.546838,-0.198905,0.17102,0.359928,-0.715267,0.163922,-0.243101,1.705585,-1.126237,1.769439
2,0.703601,0.43334,-0.319916,0.870826,-0.649931,0.103875,0.0,0.210971,0.482575,-0.198905,0.17102,0.359928,0.206553,-0.501901,-0.243101,-0.603421,-1.126237,-0.305108
3,0.703601,0.43334,-0.319916,0.870826,-0.649931,0.103875,0.0,0.700966,0.006295,-0.198905,0.17102,-2.778333,-1.637088,-1.167725,-1.058326,-1.293965,-1.126237,-0.846294
4,0.703601,0.43334,-0.319916,0.870826,-0.649931,0.103875,0.0,0.725562,0.745953,-0.185585,0.17102,0.359928,-0.715267,0.496834,-0.039294,-0.349225,-1.126237,-0.074438


In [18]:
X.corr()

Unnamed: 0,Tavg,DewPoint,PrecipTotal,StnPressure,AvgSpeed,Precip_7d_avg,wind_abv_1std,Latitude,Longitude,NumMosquitos,Mos_WNV_Prob,Trap_Ever_Wnv,Num_Years_Trap_WNV_Detection,Max_One_Year_Trap_WNV_Detections,Total_Train_Trap_WNV_Detections,Trap_WNV_Prob,NumMos_3ob_avg,Trap_Species_WNV_Prob
Tavg,1.0,0.876242,0.180548,-0.383693,0.091863,0.225744,0.124256,-0.018759,0.048384,0.067075,0.047317,-0.063175,-0.080766,-0.055277,-0.065027,-0.021037,0.318111,0.001931
DewPoint,0.876242,1.0,0.350445,-0.457547,0.120259,0.324439,0.049358,0.001223,0.031337,0.052781,0.060987,-0.05684,-0.069494,-0.056513,-0.061721,-0.017744,0.264425,0.010483
PrecipTotal,0.180548,0.350445,1.0,-0.415715,0.31407,0.402501,-0.011569,0.031448,-0.010485,-0.021465,-0.02577,-0.010639,-0.014292,-0.024479,-0.023609,-0.024019,-0.045547,-0.031834
StnPressure,-0.383693,-0.457547,-0.415715,1.0,-0.424773,-0.342939,-0.051756,-0.007825,-0.003726,-0.007707,0.050455,-0.000967,0.00215,0.014225,0.009382,0.014488,-0.04867,0.041679
AvgSpeed,0.091863,0.120259,0.31407,-0.424773,1.0,0.054371,0.554625,-0.013066,0.030484,0.001905,-0.060709,0.005615,-0.002054,0.011423,0.00585,0.003031,-0.091766,-0.024117
Precip_7d_avg,0.225744,0.324439,0.402501,-0.342939,0.054371,1.0,-0.078754,0.002294,-0.001225,0.012598,0.004222,0.010164,-0.005639,-0.012124,-0.011677,-0.003251,0.091575,-0.003401
wind_abv_1std,0.124256,0.049358,-0.011569,-0.051756,0.554625,-0.078754,1.0,0.008527,0.030369,0.009803,0.038263,-0.036466,-0.049007,-0.009173,-0.021068,-0.006716,0.054369,0.012886
Latitude,-0.018759,0.001223,0.031448,-0.007825,-0.013066,0.002294,0.008527,1.0,-0.636842,-0.058984,-0.013314,-0.173041,-0.047374,0.048137,0.132531,0.167725,-0.017432,0.10947
Longitude,0.048384,0.031337,-0.010485,-0.003726,0.030484,-0.001225,0.030369,-0.636842,1.0,-0.001723,0.00508,-0.058438,-0.296591,-0.302407,-0.41647,-0.416883,0.043603,-0.271536
NumMosquitos,0.067075,0.052781,-0.021465,-0.007707,0.001905,0.012598,0.009803,-0.058984,-0.001723,1.0,0.071555,0.058374,0.086617,0.236292,0.21633,0.158996,0.167137,0.199527


In [19]:
y.head()

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: WnvPresent, dtype: float64

In [20]:
#creating function to fit and test classification model (created for SVM lab)
def do_cm_cr(model, X, y, names): 
    #stratified split keeps the share of WNV-positive rows the same in train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)    #predicted class labels (not probabilities) for X_test
    print("Output for Tested Model:")
    print("Confusion Matrix of Predictions: ")
    print("")
    print(confusion_matrix(y_test, y_pred)) # Actual values are rows (0, 1), while predicted are columns (0, 1)
    print("")
    #printing classification report
    #precision is true positives / (true positives + false positives) - of all predicted positives, % correct
    #recall is true positives / (true positives + false negatives) - of all actual positives, % found
    #f1-score is the harmonic mean of precision and recall; it reaches its best value at 1 and worst at 0.
    #support is number of true values for each class
    print("Classification Matrix: ")
    print("")
    print(classification_report(y_test, y_pred, target_names=names))
    return model.score(X_test,y_test)

In [21]:
#we'll start with Logistic Regression
logreg = LogisticRegression()

do_cm_cr(logreg, X, y, ["no_Wnv", "yes_Wnv"])


Output for Tested Model:
Confusion Matrix of Predictions: 

[[2628   18]
 [ 136   15]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.95      0.99      0.97      2646
    yes_Wnv       0.45      0.10      0.16       151

avg / total       0.92      0.94      0.93      2797



0.94494100822309612
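The precision, recall, and f1 definitions in the comments of `do_cm_cr` can be checked by hand against the confusion matrix printed for logistic regression. The short calculation below uses only the four counts from that output; the "baseline" line is an added comparison that assumes a model which always predicts `no_Wnv`:

```python
# Counts copied from the logistic regression confusion matrix above.
tn, fp = 2628, 18   # actual no_Wnv row
fn, tp = 136, 15    # actual yes_Wnv row
total = tn + fp + fn + tp

precision = tp / float(tp + fp)                       # 15 / 33
recall = tp / float(tp + fn)                          # 15 / 151
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / float(total)

# Accuracy of always predicting no_Wnv on the same test split.
baseline = (tn + fp) / float(total)

print(round(precision, 2), round(recall, 2), round(f1, 2))
print(accuracy, baseline)
```

The accuracy reproduces the 0.9449 returned above, while the always-negative baseline comes out at about 0.9460, so accuracy alone says little here; recall on `yes_Wnv` is the number to watch.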

In [22]:
#k-nearest neighbors
knnc = KNeighborsClassifier()

do_cm_cr(knnc, X, y, ["no_Wnv", "yes_Wnv"])

#After scaling the X data, KNN performs about as well as logistic regression


Output for Tested Model:
Confusion Matrix of Predictions: 

[[2629   17]
 [ 135   16]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.95      0.99      0.97      2646
    yes_Wnv       0.48      0.11      0.17       151

avg / total       0.93      0.95      0.93      2797



0.94565606006435465

In [17]:
#SVM with linear kernel
lin_svm = svm.SVC(kernel='linear')

do_cm_cr(lin_svm, X, y, ["no_Wnv", "yes_Wnv"])

#The linear-kernel SVM predicted 0 instances of WNV - wow.

Output for Tested Model:
Confusion Matrix of Predictions: 

[[2646    0]
 [ 151    0]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.95      1.00      0.97      2646
    yes_Wnv       0.00      0.00      0.00       151

avg / total       0.89      0.95      0.92      2797





0.94601358598498386
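A classifier that never predicts the positive class is a common symptom of class imbalance: only about 5% of the test rows are WNV-positive. One option worth trying is the `class_weight="balanced"` setting of `SVC`, which penalizes mistakes on the rare class more heavily. The sketch below is not run on the WNV data; it uses a synthetic dataset from `make_classification` with roughly 5% positives as a stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with roughly 5% positives (an assumption, not the WNV data).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.95], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

# Fit the same linear SVM with and without class weighting and compare
# recall on the positive class.
recalls = {}
for label, weight in [("default", None), ("balanced", "balanced")]:
    clf = SVC(kernel="linear", class_weight=weight)
    clf.fit(X_tr, y_tr)
    recalls[label] = recall_score(y_te, clf.predict(X_te))

print(recalls)
```

Higher recall from weighting usually comes at the cost of precision, so both numbers should be compared on the real data before adopting it.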

In [18]:
#SVM with rbf kernel
rbf_svm = svm.SVC(kernel='rbf')

do_cm_cr(rbf_svm, X, y, ["no_Wnv", "yes_Wnv"])

#The RBF-kernel SVM predicted only 6 instances of WNV, and just 2 of those were correct.

Output for Tested Model:
Confusion Matrix of Predictions: 

[[2642    4]
 [ 149    2]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.95      1.00      0.97      2646
    yes_Wnv       0.33      0.01      0.03       151

avg / total       0.91      0.95      0.92      2797



0.94529853414372544

In [19]:
#normal decision tree
drc = DecisionTreeClassifier()

do_cm_cr(drc, X, y, ["no_Wnv", "yes_Wnv"])

Output for Tested Model:
Confusion Matrix of Predictions: 

[[2530  116]
 [ 105   46]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.96      0.96      0.96      2646
    yes_Wnv       0.28      0.30      0.29       151

avg / total       0.92      0.92      0.92      2797



0.92098677154093667

In [20]:
#random forest
rfc = RandomForestClassifier()

do_cm_cr(rfc, X, y, ["no_Wnv", "yes_Wnv"])

Output for Tested Model:
Confusion Matrix of Predictions: 

[[2632   14]
 [ 132   19]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.95      0.99      0.97      2646
    yes_Wnv       0.58      0.13      0.21       151

avg / total       0.93      0.95      0.93      2797



0.94780121558813013

In [21]:
#gradient boosting classifier
gbc = GradientBoostingClassifier()

do_cm_cr(gbc, X, y, ["no_Wnv", "yes_Wnv"])

Output for Tested Model:
Confusion Matrix of Predictions: 

[[2627   19]
 [ 121   30]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.96      0.99      0.97      2646
    yes_Wnv       0.61      0.20      0.30       151

avg / total       0.94      0.95      0.94      2797



0.94994637111190561

In [22]:
#Checking feature importances of variables following gbc
for c, d in zip(X.columns, gbc.feature_importances_):
    print([c, d])

#I previously tried dropping Trap_Ever_Wnv, wind_abv_1std, Total_Train_Trap_WNV_Detections, and 
#Num_Years_Trap_WNV_Detection, as these were among the lowest feature importances of the tested variables; however,
#this did not drastically change the results of the models.

#Number of mosquitos is by far the most important feature in classifying the presence of WNV.  If we can develop an
#effective model for predicting WNV, we should be golden.

['Tavg', 0.053006167244174639]
['DewPoint', 0.031450611979099152]
['PrecipTotal', 0.029645393908876754]
['StnPressure', 0.022909706786445789]
['AvgSpeed', 0.030933443377738606]
['Precip_7d_avg', 0.04705209946408808]
['wind_abv_1std', 0.0034367008223394335]
['Latitude', 0.042982894170365979]
['Longitude', 0.025842454251265922]
['NumMosquitos', 0.34266999549776522]
['Mos_WNV_Prob', 0.024597136825989643]
['Trap_Ever_Wnv', 0.0]
['Num_Years_Trap_WNV_Detection', 0.0044232200903721467]
['Max_One_Year_Trap_WNV_Detections', 0.0091364806509659485]
['Total_Train_Trap_WNV_Detections', 0.010370844238486689]
['Trap_WNV_Prob', 0.032084885623779001]
['NumMos_3ob_avg', 0.095173741825279309]
['Trap_Species_WNV_Prob', 0.19428422324296782]
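The importances are easier to read when sorted. A small sketch, using a handful of the values printed above (rounded to four decimals) in a pandas Series:

```python
import pandas as pd

# A few of the importances printed above, rounded to four decimals.
importances = pd.Series({
    "Tavg": 0.0530,
    "NumMosquitos": 0.3427,
    "Trap_Ever_Wnv": 0.0,
    "NumMos_3ob_avg": 0.0952,
    "Trap_Species_WNV_Prob": 0.1943,
})

# Sort from most to least important and total the share of the top three.
ranked = importances.sort_values(ascending=False)
top3_share = ranked.head(3).sum()

print(ranked)
print(top3_share)
```

In the notebook itself the Series could be built directly as `pd.Series(gbc.feature_importances_, index=X.columns)`. The three leading features (`NumMosquitos`, `Trap_Species_WNV_Prob`, `NumMos_3ob_avg`) account for roughly 63% of the total importance.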


In [23]:
from time import time
from sklearn.pipeline import Pipeline

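#NOTE: this pipeline is only used for the printout below; the grid search itself runs on a bare
#LogisticRegression, and the pipeline is never fit
pipeline = Pipeline([
        ('logreg', LogisticRegression()),('svm', svm.SVC())])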


parameters = {
    'penalty': ('l1', 'l2'),
    'C': (0.0001, 0.001, 0.01, 0.1, 0.5, 0.75, 1.0, 2.5, 5.0, 10.0, 100.0, 1000.0),
#     'kernel': ('linear', 'rbf', 'sigmoid')
}

if __name__ == "__main__":
    grid_search = GridSearchCV(LogisticRegression(), parameters, n_jobs=-1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    t0 = time()
    grid_search.fit(X, y)
    print("done in %0.3fs" % (time() - t0))
    print()
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
('pipeline:', ['logreg', 'svm'])
parameters:
done in 4.004s
()
Best score: 0.947
Best parameters set:
	C: 0.01
	penalty: 'l1'
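When no `scoring` argument is given, `GridSearchCV` ranks classifiers by accuracy, and the best score of 0.947 is about what a model that always predicts `no_Wnv` achieves on this data. A sketch of the same kind of search scored by ROC AUC with stratified folds follows. The synthetic data, the `class_weight` entry in the grid, and the fold settings are all assumptions made for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data with roughly 5% positives (not the WNV data).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.95], random_state=0
)

param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "class_weight": [None, "balanced"],
}

# Stratified folds keep the positive rate similar in every split;
# scoring="roc_auc" ranks candidates by how well they order positives above negatives.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="roc_auc", cv=cv
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Other scorers such as `"f1"` or `"recall"` can be swapped in, depending on which kind of error matters more.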


In [24]:
from time import time
from sklearn.pipeline import Pipeline

# pipeline = Pipeline([
#         ('logreg', LogisticRegression()),('svm', svm.SVC())])


parameters = {
    'gamma': (0.00001, 0.0001, 0.001, 0.01, 0.1, 0.5),
    'C': (0.0001, 0.001, 0.01, 0.1, 0.5, 0.75, 1.0),
}

if __name__ == "__main__":
    grid_search = GridSearchCV(svm.SVC(kernel='rbf'), parameters, n_jobs=-1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    t0 = time()
    grid_search.fit(X, y)
    print("done in %0.3fs" % (time() - t0))
    print()
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
('pipeline:', ['logreg', 'svm'])
parameters:


KeyboardInterrupt: 
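The exhaustive grid above covers 42 parameter combinations, each refit once per fold, which is slow for an RBF SVM. One possible alternative is `RandomizedSearchCV`, which samples a fixed number of candidates from the same ranges. The sketch below is only an illustration on a small synthetic dataset; `n_iter`, the fold count, and the scorer are arbitrary choices:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Small synthetic stand-in dataset (not the WNV data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.9], random_state=0
)

# Sample C and gamma on a log scale over the same ranges as the grid above.
param_distributions = {
    "C": loguniform(1e-4, 1.0),
    "gamma": loguniform(1e-5, 0.5),
}

search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions,
    n_iter=8,           # number of sampled candidates, chosen arbitrarily
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
```

The run time is then controlled by `n_iter` rather than by the size of the grid.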

In [None]:
do_cm_cr(grid_search, X, y, ["no_Wnv", "yes_Wnv"])

In [25]:
logreg = LogisticRegression(penalty='l1', C= 0.01)

In [26]:
do_cm_cr(logreg, X, y, ["no_Wnv", "yes_Wnv"])

Output for Tested Model:
Confusion Matrix of Predictions: 

[[2644    2]
 [ 148    3]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.95      1.00      0.97      2646
    yes_Wnv       0.60      0.02      0.04       151

avg / total       0.93      0.95      0.92      2797



0.94637111190561318

In [29]:
svm2 = svm.SVC(C=.01, kernel='linear')
do_cm_cr(svm2, X, y, ["no_Wnv", "yes_Wnv"])

Output for Tested Model:
Confusion Matrix of Predictions: 

[[2646    0]
 [ 151    0]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.95      1.00      0.97      2646
    yes_Wnv       0.00      0.00      0.00       151

avg / total       0.89      0.95      0.92      2797



0.94601358598498386

In [90]:
svm1 = svm.SVC(C=.75, gamma=0.01, kernel='rbf')
do_cm_cr(svm1, X, y, ["no_Wnv", "yes_Wnv"])

Output for Tested Model:
Confusion Matrix of Predictions: 

[[2645    1]
 [ 145    6]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.95      1.00      0.97      2646
    yes_Wnv       0.86      0.04      0.08       151

avg / total       0.94      0.95      0.92      2797



0.94780121558813013

In [42]:
from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(n_estimators=150)
# scores = cross_val_score(clf, X, y)
do_cm_cr(clf, X, y, ["no_Wnv", "yes_Wnv"])

Output for Tested Model:
Confusion Matrix of Predictions: 

[[2630   16]
 [ 115   36]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.96      0.99      0.98      2646
    yes_Wnv       0.69      0.24      0.35       151

avg / total       0.94      0.95      0.94      2797



0.95316410439756882
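Bagging, mentioned at the top of this notebook, has not been tried yet. A minimal sketch with `BaggingClassifier` wrapped around a decision tree is shown below, again on synthetic stand-in data. The number of estimators and the f1 scorer are assumptions rather than tuned choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 10% positives (not the WNV data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9], random_state=0
)

# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)

# Five-fold cross-validation scored by f1 on the positive class.
scores = cross_val_score(bag, X_demo, y_demo, cv=5, scoring="f1")

print(scores.mean())
```

On the notebook's data the same `bag` object could be passed straight to `do_cm_cr(bag, X, y, ["no_Wnv", "yes_Wnv"])` for comparison with the single decision tree above.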

In [None]:
pipeline = 

In [36]:
from time import time
from sklearn.ensemble import AdaBoostClassifier

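#NOTE: these values are strings, not estimator objects, which is the likely cause of the AttributeError in the
#output below; GridSearchCV needs instances here, e.g. ExtraTreeClassifier() and RandomForestClassifier()
parameters = {
    'base_estimator': ('ExtraTreeClassifier', 'RandomForestClassifier') 
}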


if __name__ == "__main__":
    grid_search = GridSearchCV(AdaBoostClassifier(algorithm = 'SAMME'), parameters, n_jobs=-1)
    print("Performing grid search...")
    print("parameters:")
    t0 = time()
    grid_search.fit(X, y)
    print("done in %0.3fs" % (time() - t0))
    print()
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
parameters:


JoblibAttributeError: JoblibAttributeError
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
/Users/rowan/anaconda/lib/python2.7/runpy.py in _run_module_as_main(mod_name='ipykernel.__main__', alter_argv=1)
    169     pkg_name = mod_name.rpartition('.')[0]
    170     main_globals = sys.modules["__main__"].__dict__
    171     if alter_argv:
    172         sys.argv[0] = fname
    173     return _run_code(code, main_globals, None,
--> 174                      "__main__", fname, loader, pkg_name)
        fname = '/Users/rowan/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py'
        loader = <pkgutil.ImpLoader instance>
        pkg_name = 'ipykernel'
    175 
    176 def run_module(mod_name, init_globals=None,
    177                run_name=None, alter_sys=False):
    178     """Execute a module's code without importing it

...........................................................................
/Users/rowan/anaconda/lib/python2.7/runpy.py in _run_code(code=<code object <module> at 0x104ed22b0, file "/Use...2.7/site-packages/ipykernel/__main__.py", line 1>, run_globals={'__builtins__': <module '__builtin__' (built-in)>, '__doc__': None, '__file__': '/Users/rowan/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py', '__loader__': <pkgutil.ImpLoader instance>, '__name__': '__main__', '__package__': 'ipykernel', 'app': <module 'ipykernel.kernelapp' from '/Users/rowan...python2.7/site-packages/ipykernel/kernelapp.pyc'>}, init_globals=None, mod_name='__main__', mod_fname='/Users/rowan/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py', mod_loader=<pkgutil.ImpLoader instance>, pkg_name='ipykernel')
     67         run_globals.update(init_globals)
     68     run_globals.update(__name__ = mod_name,
     69                        __file__ = mod_fname,
     70                        __loader__ = mod_loader,
     71                        __package__ = pkg_name)
---> 72     exec code in run_globals
        code = <code object <module> at 0x104ed22b0, file "/Use...2.7/site-packages/ipykernel/__main__.py", line 1>
        run_globals = {'__builtins__': <module '__builtin__' (built-in)>, '__doc__': None, '__file__': '/Users/rowan/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py', '__loader__': <pkgutil.ImpLoader instance>, '__name__': '__main__', '__package__': 'ipykernel', 'app': <module 'ipykernel.kernelapp' from '/Users/rowan...python2.7/site-packages/ipykernel/kernelapp.pyc'>}
     73     return run_globals
     74 
     75 def _run_module_code(code, init_globals=None,
     76                     mod_name=None, mod_fname=None,

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py in <module>()
      1 
      2 
----> 3 
      4 if __name__ == '__main__':
      5     from ipykernel import kernelapp as app
      6     app.launch_new_instance()
      7 
      8 
      9 
     10 

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/traitlets/config/application.py in launch_instance(cls=<class 'ipykernel.kernelapp.IPKernelApp'>, argv=None, **kwargs={})
    653 
    654         If a global instance already exists, this reinitializes and starts it
    655         """
    656         app = cls.instance(**kwargs)
    657         app.initialize(argv)
--> 658         app.start()
        app.start = <bound method IPKernelApp.start of <ipykernel.kernelapp.IPKernelApp object>>
    659 
    660 #-----------------------------------------------------------------------------
    661 # utility functions, for convenience
    662 #-----------------------------------------------------------------------------

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/ipykernel/kernelapp.py in start(self=<ipykernel.kernelapp.IPKernelApp object>)
    469             return self.subapp.start()
    470         if self.poller is not None:
    471             self.poller.start()
    472         self.kernel.start()
    473         try:
--> 474             ioloop.IOLoop.instance().start()
    475         except KeyboardInterrupt:
    476             pass
    477 
    478 launch_new_instance = IPKernelApp.launch_instance

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/zmq/eventloop/ioloop.py in start(self=<zmq.eventloop.ioloop.ZMQIOLoop object>)
    172             )
    173         return loop
    174     
    175     def start(self):
    176         try:
--> 177             super(ZMQIOLoop, self).start()
        self.start = <bound method ZMQIOLoop.start of <zmq.eventloop.ioloop.ZMQIOLoop object>>
    178         except ZMQError as e:
    179             if e.errno == ETERM:
    180                 # quietly return on ETERM
    181                 pass

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/tornado/ioloop.py in start(self=<zmq.eventloop.ioloop.ZMQIOLoop object>)
    882                 self._events.update(event_pairs)
    883                 while self._events:
    884                     fd, events = self._events.popitem()
    885                     try:
    886                         fd_obj, handler_func = self._handlers[fd]
--> 887                         handler_func(fd_obj, events)
        handler_func = <function null_wrapper>
        fd_obj = <zmq.sugar.socket.Socket object>
        events = 1
    888                     except (OSError, IOError) as e:
    889                         if errno_from_exception(e) == errno.EPIPE:
    890                             # Happens when the client closes the connection
    891                             pass

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/tornado/stack_context.py in null_wrapper(*args=(<zmq.sugar.socket.Socket object>, 1), **kwargs={})
    270         # Fast path when there are no active contexts.
    271         def null_wrapper(*args, **kwargs):
    272             try:
    273                 current_state = _state.contexts
    274                 _state.contexts = cap_contexts[0]
--> 275                 return fn(*args, **kwargs)
        args = (<zmq.sugar.socket.Socket object>, 1)
        kwargs = {}
    276             finally:
    277                 _state.contexts = current_state
    278         null_wrapper._wrapped = True
    279         return null_wrapper

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py in _handle_events(self=<zmq.eventloop.zmqstream.ZMQStream object>, fd=<zmq.sugar.socket.Socket object>, events=1)
    435             # dispatch events:
    436             if events & IOLoop.ERROR:
    437                 gen_log.error("got POLLERR event on ZMQStream, which doesn't make sense")
    438                 return
    439             if events & IOLoop.READ:
--> 440                 self._handle_recv()
        self._handle_recv = <bound method ZMQStream._handle_recv of <zmq.eventloop.zmqstream.ZMQStream object>>
    441                 if not self.socket:
    442                     return
    443             if events & IOLoop.WRITE:
    444                 self._handle_send()

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py in _handle_recv(self=<zmq.eventloop.zmqstream.ZMQStream object>)
    467                 gen_log.error("RECV Error: %s"%zmq.strerror(e.errno))
    468         else:
    469             if self._recv_callback:
    470                 callback = self._recv_callback
    471                 # self._recv_callback = None
--> 472                 self._run_callback(callback, msg)
        self._run_callback = <bound method ZMQStream._run_callback of <zmq.eventloop.zmqstream.ZMQStream object>>
        callback = <function null_wrapper>
        msg = [<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>]
    473                 
    474         # self.update_state()
    475         
    476 

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py in _run_callback(self=<zmq.eventloop.zmqstream.ZMQStream object>, callback=<function null_wrapper>, *args=([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],), **kwargs={})
    409         close our socket."""
    410         try:
    411             # Use a NullContext to ensure that all StackContexts are run
    412             # inside our blanket exception handler rather than outside.
    413             with stack_context.NullContext():
--> 414                 callback(*args, **kwargs)
        callback = <function null_wrapper>
        args = ([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],)
        kwargs = {}
    415         except:
    416             gen_log.error("Uncaught exception, closing connection.",
    417                           exc_info=True)
    418             # Close the socket on an uncaught exception from a user callback

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/tornado/stack_context.py in null_wrapper(*args=([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],), **kwargs={})
    270         # Fast path when there are no active contexts.
    271         def null_wrapper(*args, **kwargs):
    272             try:
    273                 current_state = _state.contexts
    274                 _state.contexts = cap_contexts[0]
--> 275                 return fn(*args, **kwargs)
        args = ([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],)
        kwargs = {}
    276             finally:
    277                 _state.contexts = current_state
    278         null_wrapper._wrapped = True
    279         return null_wrapper

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/ipykernel/kernelbase.py in dispatcher(msg=[<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>])
    271         if self.control_stream:
    272             self.control_stream.on_recv(self.dispatch_control, copy=False)
    273 
    274         def make_dispatcher(stream):
    275             def dispatcher(msg):
--> 276                 return self.dispatch_shell(stream, msg)
        msg = [<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>]
    277             return dispatcher
    278 
    279         for s in self.shell_streams:
    280             s.on_recv(make_dispatcher(s), copy=False)

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/ipykernel/kernelbase.py in dispatch_shell(self=<ipykernel.ipkernel.IPythonKernel object>, stream=<zmq.eventloop.zmqstream.ZMQStream object>, msg={'buffers': [], 'content': {u'allow_stdin': True, u'code': u'from time import time\nfrom sklearn.ensemble i...%r" % (param_name, best_parameters[param_name]))', u'silent': False, u'stop_on_error': True, u'store_history': True, u'user_expressions': {}}, 'header': {'date': '2017-04-25T16:52:12.230652', u'msg_id': u'C9EF0CFED444482A81AA05F64C2FB4B3', u'msg_type': u'execute_request', u'session': u'64E2A59B07284E4B867D980F90A6F066', u'username': u'username', u'version': u'5.0'}, 'metadata': {}, 'msg_id': u'C9EF0CFED444482A81AA05F64C2FB4B3', 'msg_type': u'execute_request', 'parent_header': {}})
    223             self.log.error("UNKNOWN MESSAGE TYPE: %r", msg_type)
    224         else:
    225             self.log.debug("%s: %s", msg_type, msg)
    226             self.pre_handler_hook()
    227             try:
--> 228                 handler(stream, idents, msg)
        handler = <bound method IPythonKernel.execute_request of <ipykernel.ipkernel.IPythonKernel object>>
        stream = <zmq.eventloop.zmqstream.ZMQStream object>
        idents = ['64E2A59B07284E4B867D980F90A6F066']
        msg = {'buffers': [], 'content': {u'allow_stdin': True, u'code': u'from time import time\nfrom sklearn.ensemble i...%r" % (param_name, best_parameters[param_name]))', u'silent': False, u'stop_on_error': True, u'store_history': True, u'user_expressions': {}}, 'header': {'date': '2017-04-25T16:52:12.230652', u'msg_id': u'C9EF0CFED444482A81AA05F64C2FB4B3', u'msg_type': u'execute_request', u'session': u'64E2A59B07284E4B867D980F90A6F066', u'username': u'username', u'version': u'5.0'}, 'metadata': {}, 'msg_id': u'C9EF0CFED444482A81AA05F64C2FB4B3', 'msg_type': u'execute_request', 'parent_header': {}}
    229             except Exception:
    230                 self.log.error("Exception in message handler:", exc_info=True)
    231             finally:
    232                 self.post_handler_hook()

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/ipykernel/kernelbase.py in execute_request(self=<ipykernel.ipkernel.IPythonKernel object>, stream=<zmq.eventloop.zmqstream.ZMQStream object>, ident=['64E2A59B07284E4B867D980F90A6F066'], parent={'buffers': [], 'content': {u'allow_stdin': True, u'code': u'from time import time\nfrom sklearn.ensemble i...%r" % (param_name, best_parameters[param_name]))', u'silent': False, u'stop_on_error': True, u'store_history': True, u'user_expressions': {}}, 'header': {'date': '2017-04-25T16:52:12.230652', u'msg_id': u'C9EF0CFED444482A81AA05F64C2FB4B3', u'msg_type': u'execute_request', u'session': u'64E2A59B07284E4B867D980F90A6F066', u'username': u'username', u'version': u'5.0'}, 'metadata': {}, 'msg_id': u'C9EF0CFED444482A81AA05F64C2FB4B3', 'msg_type': u'execute_request', 'parent_header': {}})
    385         if not silent:
    386             self.execution_count += 1
    387             self._publish_execute_input(code, parent, self.execution_count)
    388 
    389         reply_content = self.do_execute(code, silent, store_history,
--> 390                                         user_expressions, allow_stdin)
        user_expressions = {}
        allow_stdin = True
    391 
    392         # Flush output before sending the reply.
    393         sys.stdout.flush()
    394         sys.stderr.flush()

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/ipykernel/ipkernel.py in do_execute(self=<ipykernel.ipkernel.IPythonKernel object>, code=u'from time import time\nfrom sklearn.ensemble i...%r" % (param_name, best_parameters[param_name]))', silent=False, store_history=True, user_expressions={}, allow_stdin=True)
    191 
    192         self._forward_input(allow_stdin)
    193 
    194         reply_content = {}
    195         try:
--> 196             res = shell.run_cell(code, store_history=store_history, silent=silent)
        res = undefined
        shell.run_cell = <bound method ZMQInteractiveShell.run_cell of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        code = u'from time import time\nfrom sklearn.ensemble i...%r" % (param_name, best_parameters[param_name]))'
        store_history = True
        silent = False
    197         finally:
    198             self._restore_input()
    199 
    200         if res.error_before_exec is not None:

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/ipykernel/zmqshell.py in run_cell(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, *args=(u'from time import time\nfrom sklearn.ensemble i...%r" % (param_name, best_parameters[param_name]))',), **kwargs={'silent': False, 'store_history': True})
    496             )
    497         self.payload_manager.write_payload(payload)
    498 
    499     def run_cell(self, *args, **kwargs):
    500         self._last_traceback = None
--> 501         return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
        self.run_cell = <bound method ZMQInteractiveShell.run_cell of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        args = (u'from time import time\nfrom sklearn.ensemble i...%r" % (param_name, best_parameters[param_name]))',)
        kwargs = {'silent': False, 'store_history': True}
    502 
    503     def _showtraceback(self, etype, evalue, stb):
    504         # try to preserve ordering of tracebacks and print statements
    505         sys.stdout.flush()

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.py in run_cell(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, raw_cell=u'from time import time\nfrom sklearn.ensemble i...%r" % (param_name, best_parameters[param_name]))', store_history=True, silent=False, shell_futures=True)
   2712                 self.displayhook.exec_result = result
   2713 
   2714                 # Execute the user code
   2715                 interactivity = "none" if silent else self.ast_node_interactivity
   2716                 has_raised = self.run_ast_nodes(code_ast.body, cell_name,
-> 2717                    interactivity=interactivity, compiler=compiler, result=result)
        interactivity = 'last_expr'
        compiler = <IPython.core.compilerop.CachingCompiler instance>
   2718                 
   2719                 self.last_execution_succeeded = not has_raised
   2720 
   2721                 # Reset this so later displayed values do not modify the

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.py in run_ast_nodes(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, nodelist=[<_ast.ImportFrom object>, <_ast.ImportFrom object>, <_ast.Assign object>, <_ast.If object>], cell_name='<ipython-input-36-25a0cda1b9d6>', interactivity='none', compiler=<IPython.core.compilerop.CachingCompiler instance>, result=<ExecutionResult object at 116908650, execution_..._before_exec=None error_in_exec=None result=None>)
   2816 
   2817         try:
   2818             for i, node in enumerate(to_run_exec):
   2819                 mod = ast.Module([node])
   2820                 code = compiler(mod, cell_name, "exec")
-> 2821                 if self.run_code(code, result):
        self.run_code = <bound method ZMQInteractiveShell.run_code of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        code = <code object <module> at 0x11790a730, file "<ipython-input-36-25a0cda1b9d6>", line 9>
        result = <ExecutionResult object at 116908650, execution_..._before_exec=None error_in_exec=None result=None>
   2822                     return True
   2823 
   2824             for i, node in enumerate(to_run_interactive):
   2825                 mod = ast.Interactive([node])

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.py in run_code(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, code_obj=<code object <module> at 0x11790a730, file "<ipython-input-36-25a0cda1b9d6>", line 9>, result=<ExecutionResult object at 116908650, execution_..._before_exec=None error_in_exec=None result=None>)
   2876         outflag = 1  # happens in more places, so it's easier as default
   2877         try:
   2878             try:
   2879                 self.hooks.pre_run_code_hook()
   2880                 #rprint('Running code', repr(code_obj)) # dbg
-> 2881                 exec(code_obj, self.user_global_ns, self.user_ns)
        code_obj = <code object <module> at 0x11790a730, file "<ipython-input-36-25a0cda1b9d6>", line 9>
        self.user_global_ns = {'AdaBoostClassifier': <class 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'>, 'DecisionTreeClassifier': <class 'sklearn.tree.tree.DecisionTreeClassifier'>, 'GradientBoostingClassifier': <class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'>, 'GridSearchCV': <class 'sklearn.model_selection._search.GridSearchCV'>, 'In': ['', u"import pandas as pd\nimport numpy as np\nimpor...size'] = (16, 8)\nplt.rcParams['font.size'] = 12", u'#reading in cleaned train data\ntrain = pd.rea...etime(train["Date"], infer_datetime_format=True)', u'train.head()', u'#scaling train data\nscaled_train = StandardSc...] = train["Is_Satellite"]\n\nscaled_train.head()', u'#reading in cleaned weather data\nweather = pd...ime(weather["Date"], infer_datetime_format=True)', u'weather.head()', u'#dropping Depart, Wetbulb, Heat, Cool, Sunrise...Speed"], axis=1, inplace=True)\n\nweather.head()', u'#scaling weather data before merging with trai...eather["wind_abv_1std"]\n\nscaled_weather.head()', u'#inner merging weather with train data\ncombin...ne.drop("Year", axis=1, inplace=True)\n\ncombine', u'combine.columns', u'#dropping "Parent_Trap", "Is_Satellite" becaus...\n#preparing y values\ny = combine["WnvPresent"]', u'X.head()', u'y.head()', u'#creating function to test and fit classificat...s=names))\n    return model.score(X_test,y_test)', u'#we\'ll start with Logistic Regression\nlogreg...n\ndo_cm_cr(logreg, X, y, ["no_Wnv", "yes_Wnv"])', u'#k-nearest neighbors\nknnc = KNeighborsClassif...forms effectively as well as logistic regression', u'#SVM with linear kernal\nlin_svm = svm.SVC(ker...h linear model predict 0 instances of WNV - wow.', u'#SVM with rbf kernal\nrbf_svm = svm.SVC(kernel...V, but at least it predicted that one correctly.', u'#normal decision tree\ndrc = DecisionTreeClass...()\n\ndo_cm_cr(drc, X, y, ["no_Wnv", "yes_Wnv"])', ...], 'KNeighborsClassifier': <class 'sklearn.neighbors.classification.KNeighborsClassifier'>, 
'LogisticRegression': <class 'sklearn.linear_model.logistic.LogisticRegression'>, 'Out': {3:         Date                 Species  Trap  Lati...            0.142857  
4               0.142857  , 4:    Latitude  Longitude  NumMosquitos  Mos_WNV_Pr...      0.0  
4  2007.0        T002           0.0  , 6:         Date  Tmax  Tmin  Tavg  Depart  DewPoint...      7      12.0        0.00040            1.0  , 7:         Date  Tavg  DewPoint  PrecipTotal  StnPr...          0.0  
4        0.00040            1.0  , 8:        Tavg  DewPoint  PrecipTotal  StnPressure ...04            0.0  
4 2007-05-05            1.0  , 9:           Tavg  DewPoint  PrecipTotal  StnPressu...    T233          0.0  

[8475 rows x 24 columns], 10: Index([u'Tavg', u'DewPoint', u'PrecipTotal', u'S...nt_Trap', u'Is_Satellite'],
      dtype='object'), 12:        Tavg  DewPoint  PrecipTotal  StnPressure ...0.349225       -1.126237              -0.074438  , 13: 0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: WnvPresent, dtype: float64, 15: 0.94494100822309612, ...}, 'RandomForestClassifier': <class 'sklearn.ensemble.forest.RandomForestClassifier'>, 'StandardScaler': <class 'sklearn.preprocessing.data.StandardScaler'>, ...}
        self.user_ns = {'AdaBoostClassifier': <class 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'>, 'DecisionTreeClassifier': <class 'sklearn.tree.tree.DecisionTreeClassifier'>, 'GradientBoostingClassifier': <class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'>, 'GridSearchCV': <class 'sklearn.model_selection._search.GridSearchCV'>, 'In': ['', u"import pandas as pd\nimport numpy as np\nimpor...size'] = (16, 8)\nplt.rcParams['font.size'] = 12", u'#reading in cleaned train data\ntrain = pd.rea...etime(train["Date"], infer_datetime_format=True)', u'train.head()', u'#scaling train data\nscaled_train = StandardSc...] = train["Is_Satellite"]\n\nscaled_train.head()', u'#reading in cleaned weather data\nweather = pd...ime(weather["Date"], infer_datetime_format=True)', u'weather.head()', u'#dropping Depart, Wetbulb, Heat, Cool, Sunrise...Speed"], axis=1, inplace=True)\n\nweather.head()', u'#scaling weather data before merging with trai...eather["wind_abv_1std"]\n\nscaled_weather.head()', u'#inner merging weather with train data\ncombin...ne.drop("Year", axis=1, inplace=True)\n\ncombine', u'combine.columns', u'#dropping "Parent_Trap", "Is_Satellite" becaus...\n#preparing y values\ny = combine["WnvPresent"]', u'X.head()', u'y.head()', u'#creating function to test and fit classificat...s=names))\n    return model.score(X_test,y_test)', u'#we\'ll start with Logistic Regression\nlogreg...n\ndo_cm_cr(logreg, X, y, ["no_Wnv", "yes_Wnv"])', u'#k-nearest neighbors\nknnc = KNeighborsClassif...forms effectively as well as logistic regression', u'#SVM with linear kernal\nlin_svm = svm.SVC(ker...h linear model predict 0 instances of WNV - wow.', u'#SVM with rbf kernal\nrbf_svm = svm.SVC(kernel...V, but at least it predicted that one correctly.', u'#normal decision tree\ndrc = DecisionTreeClass...()\n\ndo_cm_cr(drc, X, y, ["no_Wnv", "yes_Wnv"])', ...], 'KNeighborsClassifier': <class 'sklearn.neighbors.classification.KNeighborsClassifier'>, 'LogisticRegression': 
<class 'sklearn.linear_model.logistic.LogisticRegression'>, 'Out': {3:         Date                 Species  Trap  Lati...            0.142857  
4               0.142857  , 4:    Latitude  Longitude  NumMosquitos  Mos_WNV_Pr...      0.0  
4  2007.0        T002           0.0  , 6:         Date  Tmax  Tmin  Tavg  Depart  DewPoint...      7      12.0        0.00040            1.0  , 7:         Date  Tavg  DewPoint  PrecipTotal  StnPr...          0.0  
4        0.00040            1.0  , 8:        Tavg  DewPoint  PrecipTotal  StnPressure ...04            0.0  
4 2007-05-05            1.0  , 9:           Tavg  DewPoint  PrecipTotal  StnPressu...    T233          0.0  

[8475 rows x 24 columns], 10: Index([u'Tavg', u'DewPoint', u'PrecipTotal', u'S...nt_Trap', u'Is_Satellite'],
      dtype='object'), 12:        Tavg  DewPoint  PrecipTotal  StnPressure ...0.349225       -1.126237              -0.074438  , 13: 0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: WnvPresent, dtype: float64, 15: 0.94494100822309612, ...}, 'RandomForestClassifier': <class 'sklearn.ensemble.forest.RandomForestClassifier'>, 'StandardScaler': <class 'sklearn.preprocessing.data.StandardScaler'>, ...}
   2882             finally:
   2883                 # Reset our crash handler in place
   2884                 sys.excepthook = old_excepthook
   2885         except SystemExit as e:

...........................................................................
/Users/rowan/Desktop/GA/Project 4/Kaggle_Comp_West_Nile/assets/<ipython-input-36-25a0cda1b9d6> in <module>()
      9 if __name__ == "__main__":
     10     grid_search = GridSearchCV(AdaBoostClassifier(algorithm = 'SAMME'), parameters, n_jobs=-1)
     11     print("Performing grid search...")
     12     print("parameters:")
     13     t0 = time()
---> 14     grid_search.fit(X, y)
     15     print("done in %0.3fs" % (time() - t0))
     16     print()
     17     print("Best score: %0.3f" % grid_search.best_score_)
     18     print("Best parameters set:")

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_search.py in fit(self=GridSearchCV(cv=None, error_score='raise',
     ...train_score=True,
       scoring=None, verbose=0), X=          Tavg  DewPoint  PrecipTotal  StnPressu...             1.293851  

[8475 rows x 18 columns], y=0       0.0
1       0.0
2       0.0
3       0.0
... 0.0
8474    0.0
Name: WnvPresent, dtype: float64, groups=None)
    940 
    941         groups : array-like, with shape (n_samples,), optional
    942             Group labels for the samples used while splitting the dataset into
    943             train/test set.
    944         """
--> 945         return self._fit(X, y, groups, ParameterGrid(self.param_grid))
        self._fit = <bound method GridSearchCV._fit of GridSearchCV(...rain_score=True,
       scoring=None, verbose=0)>
        X =           Tavg  DewPoint  PrecipTotal  StnPressu...             1.293851  

[8475 rows x 18 columns]
        y = 0       0.0
1       0.0
2       0.0
3       0.0
... 0.0
8474    0.0
Name: WnvPresent, dtype: float64
        groups = None
        self.param_grid = {'base_estimator': ('ExtraTreeClassifier', 'RandomForestClassifier')}
    946 
    947 
    948 class RandomizedSearchCV(BaseSearchCV):
    949     """Randomized search on hyper parameters.

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_search.py in _fit(self=GridSearchCV(cv=None, error_score='raise',
     ...train_score=True,
       scoring=None, verbose=0), X=          Tavg  DewPoint  PrecipTotal  StnPressu...             1.293851  

[8475 rows x 18 columns], y=0       0.0
1       0.0
2       0.0
3       0.0
... 0.0
8474    0.0
Name: WnvPresent, dtype: float64, groups=None, parameter_iterable=<sklearn.model_selection._search.ParameterGrid object>)
    559                                   fit_params=self.fit_params,
    560                                   return_train_score=self.return_train_score,
    561                                   return_n_test_samples=True,
    562                                   return_times=True, return_parameters=True,
    563                                   error_score=self.error_score)
--> 564           for parameters in parameter_iterable
        parameters = undefined
        parameter_iterable = <sklearn.model_selection._search.ParameterGrid object>
    565           for train, test in cv_iter)
    566 
    567         # if one choose to see train score, "out" will contain train score info
    568         if self.return_train_score:

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=-1), iterable=<generator object <genexpr>>)
    763             if pre_dispatch == "all" or n_jobs == 1:
    764                 # The iterable was consumed all at once by the above for loop.
    765                 # No need to wait for async callbacks to trigger to
    766                 # consumption.
    767                 self._iterating = False
--> 768             self.retrieve()
        self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=-1)>
    769             # Make sure that we get a last message telling us we are done
    770             elapsed_time = time.time() - self._start_time
    771             self._print('Done %3i out of %3i | elapsed: %s finished',
    772                         (len(self._output), len(self._output),

---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
AttributeError                                     Tue Apr 25 16:52:12 2017
PID: 40595                  Python 2.7.13: /Users/rowan/anaconda/bin/python
...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        func = <function _fit_and_score>
        args = (AdaBoostClassifier(algorithm='SAMME', base_estim...ing_rate=1.0, n_estimators=50, random_state=None),           Tavg  DewPoint  PrecipTotal  StnPressu...             1.293851  

[8475 rows x 18 columns], 0       0.0
1       0.0
2       0.0
3       0.0
... 0.0
8474    0.0
Name: WnvPresent, dtype: float64, <function _passthrough_scorer>, array([1862, 1864, 1897, ..., 8472, 8473, 8474]), array([   0,    1,    2, ..., 2859, 2860, 2861]), 0, {'base_estimator': 'ExtraTreeClassifier'})
        kwargs = {'error_score': 'raise', 'fit_params': {}, 'return_n_test_samples': True, 'return_parameters': True, 'return_times': True, 'return_train_score': True}
        self.items = [(<function _fit_and_score>, (AdaBoostClassifier(algorithm='SAMME', base_estim...ing_rate=1.0, n_estimators=50, random_state=None),           Tavg  DewPoint  PrecipTotal  StnPressu...             1.293851  

[8475 rows x 18 columns], 0       0.0
1       0.0
2       0.0
3       0.0
... 0.0
8474    0.0
Name: WnvPresent, dtype: float64, <function _passthrough_scorer>, array([1862, 1864, 1897, ..., 8472, 8473, 8474]), array([   0,    1,    2, ..., 2859, 2860, 2861]), 0, {'base_estimator': 'ExtraTreeClassifier'}), {'error_score': 'raise', 'fit_params': {}, 'return_n_test_samples': True, 'return_parameters': True, 'return_times': True, 'return_train_score': True})]
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator=AdaBoostClassifier(algorithm='SAMME', base_estim...ing_rate=1.0, n_estimators=50, random_state=None), X=          Tavg  DewPoint  PrecipTotal  StnPressu...             1.293851  

[8475 rows x 18 columns], y=0       0.0
1       0.0
2       0.0
3       0.0
... 0.0
8474    0.0
Name: WnvPresent, dtype: float64, scorer=<function _passthrough_scorer>, train=array([1862, 1864, 1897, ..., 8472, 8473, 8474]), test=array([   0,    1,    2, ..., 2859, 2860, 2861]), verbose=0, parameters={'base_estimator': 'ExtraTreeClassifier'}, fit_params={}, return_train_score=True, return_parameters=True, return_n_test_samples=True, return_times=True, error_score='raise')
    233 
    234     try:
    235         if y_train is None:
    236             estimator.fit(X_train, **fit_params)
    237         else:
--> 238             estimator.fit(X_train, y_train, **fit_params)
        estimator.fit = <bound method AdaBoostClassifier.fit of AdaBoost...ng_rate=1.0, n_estimators=50, random_state=None)>
        X_train =           Tavg  DewPoint  PrecipTotal  StnPressu...             1.293851  

[5649 rows x 18 columns]
        y_train = 1862    1.0
1864    1.0
1897    1.0
1899    1.0
... 0.0
8474    0.0
Name: WnvPresent, dtype: float64
        fit_params = {}
    239 
    240     except Exception as e:
    241         # Note fit time as time until error
    242         fit_time = time.time() - start_time

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/sklearn/ensemble/weight_boosting.py in fit(self=AdaBoostClassifier(algorithm='SAMME', base_estim...ing_rate=1.0, n_estimators=50, random_state=None), X=          Tavg  DewPoint  PrecipTotal  StnPressu...             1.293851  

[5649 rows x 18 columns], y=1862    1.0
1864    1.0
1897    1.0
1899    1.0
... 0.0
8474    0.0
Name: WnvPresent, dtype: float64, sample_weight=None)
    406         # Check that algorithm is supported
    407         if self.algorithm not in ('SAMME', 'SAMME.R'):
    408             raise ValueError("algorithm %s is not supported" % self.algorithm)
    409 
    410         # Fit
--> 411         return super(AdaBoostClassifier, self).fit(X, y, sample_weight)
        self.fit = <bound method AdaBoostClassifier.fit of AdaBoost...ng_rate=1.0, n_estimators=50, random_state=None)>
        X =           Tavg  DewPoint  PrecipTotal  StnPressu...             1.293851  

[5649 rows x 18 columns]
        y = 1862    1.0
1864    1.0
1897    1.0
1899    1.0
... 0.0
8474    0.0
Name: WnvPresent, dtype: float64
        sample_weight = None
    412 
    413     def _validate_estimator(self):
    414         """Check the estimator and set the base_estimator_ attribute."""
    415         super(AdaBoostClassifier, self)._validate_estimator(

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/sklearn/ensemble/weight_boosting.py in fit(self=AdaBoostClassifier(algorithm='SAMME', base_estim...ing_rate=1.0, n_estimators=50, random_state=None), X=array([[ 0.79876753,  1.27709243,  0.25030721, ....  1.83600026,
        -0.07615402,  1.29385104]]), y=array([ 1.,  1.,  1., ...,  0.,  0.,  0.]), sample_weight=array([ 0.00017702,  0.00017702,  0.00017702, ...,  0.00017702,
        0.00017702,  0.00017702]))
    123                 raise ValueError(
    124                     "Attempting to fit with a non-positive "
    125                     "weighted number of samples.")
    126 
    127         # Check parameters
--> 128         self._validate_estimator()
        self._validate_estimator = <bound method AdaBoostClassifier._validate_estim...ng_rate=1.0, n_estimators=50, random_state=None)>
    129 
    130         # Clear any previous fit results
    131         self.estimators_ = []
    132         self.estimator_weights_ = np.zeros(self.n_estimators, dtype=np.float64)

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/sklearn/ensemble/weight_boosting.py in _validate_estimator(self=AdaBoostClassifier(algorithm='SAMME', base_estim...ing_rate=1.0, n_estimators=50, random_state=None))
    422                     "AdaBoostClassifier with algorithm='SAMME.R' requires "
    423                     "that the weak learner supports the calculation of class "
    424                     "probabilities with a predict_proba method.\n"
    425                     "Please change the base estimator or set "
    426                     "algorithm='SAMME' instead.")
--> 427         if not has_fit_parameter(self.base_estimator_, "sample_weight"):
        self.base_estimator_ = 'ExtraTreeClassifier'
    428             raise ValueError("%s doesn't support sample_weight."
    429                              % self.base_estimator_.__class__.__name__)
    430 
    431     def _boost(self, iboost, X, y, sample_weight, random_state):

...........................................................................
/Users/rowan/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py in has_fit_parameter(estimator='ExtraTreeClassifier', parameter='sample_weight')
    588     >>> from sklearn.svm import SVC
    589     >>> has_fit_parameter(SVC(), "sample_weight")
    590     True
    591 
    592     """
--> 593     return parameter in signature(estimator.fit).parameters
        parameter = 'sample_weight'
        estimator.fit.parameters = undefined
    594 
    595 
    596 def check_symmetric(array, tol=1E-10, raise_warning=True,
    597                     raise_exception=False):

AttributeError: 'str' object has no attribute 'fit'
___________________________________________________________________________
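
The `AttributeError` above comes from `base_estimator_` being the string `'ExtraTreeClassifier'` (visible in the `_validate_estimator` frame). AdaBoost expects an estimator object with a `fit` method, not a class name. Below is a minimal sketch of a grid that passes instances instead, assuming the search was meant to vary the base estimator. The synthetic data, the grid values, and the names `X_demo`, `y_demo`, `est_key`, and `grid` are illustrative assumptions, not the ones used in the failed cell. The key lookup is there because newer scikit-learn releases renamed `base_estimator` to `estimator`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

# Synthetic, imbalanced stand-in data (assumption: swap in the real X and y)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

ada = AdaBoostClassifier(random_state=0)

# The parameter is called 'base_estimator' in older scikit-learn releases
# and 'estimator' in newer ones, so look up whichever name is available.
est_key = "estimator" if "estimator" in ada.get_params() else "base_estimator"

param_grid = {
    # Pass estimator *instances*, not strings such as 'ExtraTreeClassifier'
    est_key: [
        DecisionTreeClassifier(max_depth=1),
        ExtraTreeClassifier(max_depth=1),
    ],
    "n_estimators": [25, 50],
}

grid = GridSearchCV(ada, param_grid, cv=3, scoring="roc_auc")
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```

Both tree classifiers accept `sample_weight` in `fit`, which is the property the `has_fit_parameter` check in the traceback is testing for.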

In [40]:
# list the hyperparameter names AdaBoostClassifier accepts (the valid keys for a param grid)
AdaBoostClassifier().get_params().keys()

['n_estimators',
 'base_estimator',
 'random_state',
 'learning_rate',
 'algorithm']

In [24]:
import xgboost

# fit a gradient-boosted tree classifier with the default hyperparameters
model = xgboost.XGBClassifier()
model.fit(X, y)

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)

In [25]:
# print the confusion matrix, classification report and accuracy for the model
do_cm_cr(model, X, y, ["no_Wnv", "yes_Wnv"])

Output for Tested Model:
Confusion Matrix of Predictions: 

[[2630   16]
 [ 125   26]]

Classification Matrix: 

             precision    recall  f1-score   support

     no_Wnv       0.95      0.99      0.97      2646
    yes_Wnv       0.62      0.17      0.27       151

avg / total       0.94      0.95      0.94      2797



0.9495888451912764
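
The accuracy figure is easy to misread with classes this imbalanced. As a rough worked check, the sketch below recomputes the reported metrics from the confusion matrix printed above, assuming the scikit-learn convention that rows are actual classes and columns are predicted classes (the row sums match the `support` column). It also computes the accuracy of always predicting `no_Wnv`, which is about 0.946, so the model's accuracy sits only slightly above that baseline while recall on `yes_Wnv` stays low. The variable names are illustrative.

```python
# Counts copied from the confusion matrix above
# (assumed layout: rows = actual class, columns = predicted class)
tn, fp = 2630, 16   # actual no_Wnv
fn, tp = 125, 26    # actual yes_Wnv

total = tn + fp + fn + tp

accuracy = (tn + tp) / float(total)
precision = tp / float(tp + fp)   # of predicted yes_Wnv, how many were right
recall = tp / float(tp + fn)      # of actual yes_Wnv, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts no_Wnv
baseline = (tn + fp) / float(total)

print("accuracy  %.4f" % accuracy)
print("baseline  %.4f" % baseline)
print("precision %.2f" % precision)
print("recall    %.2f" % recall)
print("f1        %.2f" % f1)
```

Given the gap between accuracy and recall, a metric such as recall, F1 or ROC AUC on the `yes_Wnv` class is probably a more informative target when tuning.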