# Model Scoring and Kaggle Submissions

In this notebook we score all the models we built and select the best one for future use. This requires some brief cleaning of the test data, included below, which __mostly__ follows the same pattern as the cleaning of the training data.

In [57]:
import pandas as pd
import numpy as np
import pickle

from sklearn.model_selection import train_test_split


## Testing Data

In [44]:
test = pd.read_csv('../data/test.csv', index_col=0)
weather = pd.read_csv('../data/weather_cleaned.csv')
spray = pd.read_csv('../data/spray_cleaned.csv')

In [45]:
test.columns = test.columns.map(lambda x: x.lower())
weather.columns = weather.columns.map(lambda x: x.lower())

In [46]:
test.species = test.species.map({'CULEX PIPIENS/RESTUANS': 'CULEX PIPIENS/RESTUANS',
                   'CULEX RESTUANS': 'CULEX RESTUANS',
                   'CULEX PIPIENS': 'CULEX PIPIENS',
                   'CULEX TERRITANS': 'CULEX OTHER', 
                   'CULEX SALINARIUS': 'CULEX OTHER',
                   'CULEX TARSALIS': 'CULEX OTHER',
                   'CULEX ERRATICUS': 'CULEX OTHER'})

test.species = test.species.fillna('CULEX PIPIENS')
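
The mapping cell above relies on the fact that `Series.map` with a dictionary returns `NaN` for any value that is not a key, which the following `fillna` then replaces with a default. A minimal sketch of that behavior on toy data (the species values and the shortened mapping here are illustrative, not taken from the test file):

```python
import pandas as pd

# Hypothetical toy values, including one label that is missing from the mapping.
species = pd.Series(['CULEX PIPIENS', 'CULEX TARSALIS', 'UNSPECIFIED CULEX'])

# Shortened, illustrative version of the mapping used in the notebook.
species_map = {'CULEX PIPIENS': 'CULEX PIPIENS',
               'CULEX TARSALIS': 'CULEX OTHER'}

mapped = species.map(species_map)          # labels not in the dict become NaN
cleaned = mapped.fillna('CULEX PIPIENS')   # NaN falls back to the default label

print(cleaned.tolist())
```

Any species label that was never seen in training therefore ends up as `CULEX PIPIENS`, so the dummy columns created later stay within the set the models were trained on.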

In [47]:
test.columns

Index(['date', 'address', 'species', 'block', 'street', 'trap',
       'addressnumberandstreet', 'latitude', 'longitude', 'addressaccuracy'],
      dtype='object')

In [48]:
test['station'] = np.where(test['latitude'] >= 41.892, 1, 2)

In [49]:
test_weather = pd.merge(test, weather, on=['date', 'station'])

In [53]:
train = pd.read_csv('../data/train_weather_spray_merged.csv')

In [10]:
set(train.columns).difference(test_weather.columns)

{'nummosquitos', 'spray_nearby', 'wnvpresent'}

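The only feature left to engineer for the test data is `spray_nearby`. The other two columns, `nummosquitos` and `wnvpresent`, are absent from the test data and are not used as features.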

In [11]:
from math import sin, cos, radians, asin, sqrt

def global_distance(lon1, lat1, lon2, lat2):
    """Great-circle (haversine) distance in miles between two lon/lat points."""
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 3956  # radius of the Earth in miles

    return c * r
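
A quick way to gain confidence in a haversine function like the one above is to check two cases with known answers: the distance from a point to itself is zero, and one degree of latitude along a meridian equals the radius times pi/180. The sketch below re-implements the same formula under a hypothetical name, `haversine_miles`, so that it runs on its own; the coordinates are arbitrary.

```python
from math import sin, cos, radians, asin, sqrt, pi

def haversine_miles(lon1, lat1, lon2, lat2, r=3956):
    """Great-circle distance in miles; r is the Earth radius assumed in the notebook."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * r * asin(sqrt(a))

# Same point twice: the distance should be exactly zero.
same_point = haversine_miles(-87.65, 41.85, -87.65, 41.85)

# One degree of latitude on the same meridian: should equal r * pi / 180.
one_degree_lat = haversine_miles(-87.65, 41.0, -87.65, 42.0)

print(same_point, one_degree_lat, 3956 * pi / 180)
```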

In [12]:
# map each trap id to its (longitude, latitude) pair
traps = {}
for index, row in test_weather.iterrows():
    traps[row['trap']] = (row['longitude'], row['latitude'])


In [13]:
# minimum distance (in miles) from each trap to any spray location
trap_distances = {}

for trap in traps:
    lon, lat = traps[trap]
    
    for index, spray_row in spray.iterrows():
        tmp_dist = global_distance(lon, lat, spray_row['Longitude'], spray_row['Latitude'])
        # keep the smallest distance seen so far for this trap
        if trap in trap_distances:
            trap_distances[trap] = min(tmp_dist, trap_distances[trap])
        else:
            trap_distances[trap] = tmp_dist
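
The nested loop above computes one distance at a time, which can be slow when there are many spray records. As an alternative, the same minimum distances can be sketched with NumPy broadcasting. This is not the code used in this notebook; the function name `min_haversine_miles` and the coordinates are hypothetical, and the 0.125 mile cutoff mirrors the threshold this notebook uses for `spray_nearby`.

```python
import numpy as np

def min_haversine_miles(trap_lon, trap_lat, spray_lon, spray_lat, r=3956):
    """For each trap, the distance in miles to its closest spray point."""
    # Traps become a column and sprays a row, so broadcasting builds the
    # full (n_traps, n_sprays) grid of pairwise distances in one step.
    lon1 = np.radians(np.asarray(trap_lon, dtype=float))[:, None]
    lat1 = np.radians(np.asarray(trap_lat, dtype=float))[:, None]
    lon2 = np.radians(np.asarray(spray_lon, dtype=float))[None, :]
    lat2 = np.radians(np.asarray(spray_lat, dtype=float))[None, :]

    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist = 2 * r * np.arcsin(np.sqrt(a))
    return dist.min(axis=1)

# Hypothetical coordinates: the first trap sits exactly on a spray point,
# the second is roughly half a mile from the closest one.
trap_lon, trap_lat = [-87.65, -87.80], [41.85, 41.95]
spray_lon, spray_lat = [-87.65, -87.70, -87.79], [41.85, 41.90, 41.95]

nearest = min_haversine_miles(trap_lon, trap_lat, spray_lon, spray_lat)
spray_nearby = (nearest < .125).astype(float)
print(nearest, spray_nearby)
```

The full grid holds one float per trap and spray pair, so for very large inputs it may be necessary to process the traps in chunks.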

The columns listed below are the only features our models use, so we store the test data restricted to these columns, in this order.


In [14]:
cols = ['latitude', 'longitude', 'addressaccuracy', 'spray_nearby', 'station',
       'tmax', 'tmin', 'tavg', 'dewpoint', 'wetbulb', 'heat', 'cool',
       'preciptotal', 'stnpressure', 'sealevel', 'resultspeed', 'resultdir',
       'avgspeed', 'ts', 'sq', 'fg+', 'gr', 'br', 'tsra', 'dz', 'bcfg', 'hz',
       'fu', 'sn', 'fg', 'vcts', 'ra', 'mifg', 'vcfg', 'species_CULEX OTHER',
       'species_CULEX PIPIENS', 'species_CULEX PIPIENS/RESTUANS',
       'species_CULEX RESTUANS']

In [15]:
test_weather['spray_nearby'] = (test_weather['trap'].map(trap_distances) < .125).map(float)

In [16]:
test_weather_dummies = pd.get_dummies(test_weather, columns=['species'])

As a brief check, we restrict the data to the feature columns and confirm the shape: we should have 38 columns.

In [17]:
test_weather_dummies = test_weather_dummies[cols]

In [18]:
test_weather_dummies.shape

(116293, 38)
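
Selecting columns with `df[cols]` raises a `KeyError` if one of the expected dummy columns is missing, which can happen when a species category never occurs in the data being encoded. One defensive option, sketched here on a hypothetical toy frame rather than the real test data, is `DataFrame.reindex` with `fill_value=0`, which creates any missing column filled with zeros and also enforces the column order.

```python
import pandas as pd

# Hypothetical toy frame in which only two species appear.
toy = pd.DataFrame({'latitude': [41.9, 41.8],
                    'species': ['CULEX PIPIENS', 'CULEX RESTUANS']})

# Columns the (hypothetical) model expects, in training order.
expected_cols = ['latitude', 'species_CULEX OTHER',
                 'species_CULEX PIPIENS', 'species_CULEX RESTUANS']

dummies = pd.get_dummies(toy, columns=['species'])

# Missing dummy columns are added as all zeros; order follows expected_cols.
aligned = dummies.reindex(columns=expected_cols, fill_value=0)
print(aligned.shape)
```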

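In the training data, `preciptotal` was an object column, and we converted the 'T' for trace amounts to a very small floating-point value. We check whether the test data still has any object-typed columns; the empty result below shows that no conversion is needed here.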

In [19]:
test_weather_dummies.select_dtypes(include='O').columns

Index([], dtype='object')
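
For reference, if an object-typed precipitation column did remain, the 'T' for trace amounts could be converted along the following lines. This is only a sketch: the value `TRACE_VALUE` is an assumed placeholder (the value used for the training data is not shown in this notebook), and the sample strings are made up.

```python
import pandas as pd

# Assumed placeholder for a trace amount; not the value used in training.
TRACE_VALUE = 0.001

# Hypothetical raw precipitation strings, with 'T' marking trace amounts.
precip = pd.Series(['0.00', 'T', '0.12', '  T'])

stripped = precip.str.strip()
is_trace = stripped.eq('T')

# Blank out the trace markers, parse the rest as numbers, then fill the
# trace positions with the small placeholder value.
values = pd.to_numeric(stripped.mask(is_trace))
cleaned = values.mask(is_trace, TRACE_VALUE)
print(cleaned.tolist())
```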

In [20]:
test_weather_dummies.to_csv('../data/test_merged.csv', index=False)

## Model Scoring
We use the training data to collect each model's scores and the test data to generate the predictions we submit to Kaggle.

In [92]:
X = pd.get_dummies(train, columns=['species'])[cols]
y = train['wnvpresent']

with open('../models/scaler.pkl', 'rb') as ss_prefit:
    ss = pickle.load(ss_prefit)

with open('../models/pca.pkl', 'rb') as pca_prefit:
    pca = pickle.load(pca_prefit)
    
with open('../models/pca_restricted.pkl', 'rb') as pca_prefit_rest:
    pca_rest = pickle.load(pca_prefit_rest)
    
    
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=41)

X_train = ss.transform(X_train)
X_test = ss.transform(X_test)
test_weather_dummies = ss.transform(test_weather_dummies)

#for the models with PCA and all features. 
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
test_weather_dummies_pca = pca.transform(test_weather_dummies)
    
# for the models with limited features: the scaled data is a NumPy array,
# so we look up each column's position by name.
limited_cols = ['latitude', 'longitude', 'tmax', 'species_CULEX PIPIENS']
X_cols = X.columns.tolist()
limited_cols = [X_cols.index(col) for col in limited_cols]

X_train_limited = X_train[:,limited_cols]
X_test_limited = X_test[:,limited_cols]
test_weather_dummies_limited = test_weather_dummies[:,limited_cols]
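
Because the scaler returns a plain NumPy array, column names are lost and the restricted feature set has to be selected by position, as the cell above does. A minimal sketch of that lookup on a hypothetical four-column array (the names and numbers are illustrative only):

```python
import numpy as np

# Hypothetical feature names, in the order the columns had before scaling.
feature_names = ['latitude', 'longitude', 'tmax', 'tmin']

# Hypothetical scaled data: two rows, one column per feature name.
scaled = np.array([[0.1, -1.2, 0.5, 0.3],
                   [1.4, 0.7, -0.9, -0.2]])

wanted = ['latitude', 'tmax']

# Translate names to integer positions, then slice the columns.
positions = [feature_names.index(name) for name in wanted]
limited = scaled[:, positions]
print(positions, limited.shape)
```

This only works if the name list reflects the exact column order that was passed to the scaler, which is why the notebook takes it from `X.columns`.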




In [101]:
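def model_submission(name, model, test_data):
    """Write the model's predicted probabilities to a Kaggle submission CSV."""
    # probability of the positive class (WnvPresent = 1)
    preds = model.predict_proba(test_data)[:,1]
    # submission Ids run from 1 to the number of test rows
    preds = pd.Series(data=preds, index=range(1,preds.shape[0]+1))
    preds.to_csv(f'../kaggle_submissions/{name}.csv', index=True, index_label='Id', header=['WnvPresent'])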
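
To see the file layout a function like `model_submission` produces without loading a real model, the sketch below uses a hypothetical stand-in classifier, `ConstantModel`, and writes to a temporary directory instead of `../kaggle_submissions/`. The helper name `write_submission` is also an assumption made for this example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

class ConstantModel:
    """Hypothetical stand-in for a fitted classifier: P(class 1) is always 0.25."""
    def predict_proba(self, X):
        n = len(X)
        return np.column_stack([np.full(n, 0.75), np.full(n, 0.25)])

def write_submission(path, model, test_data):
    # Second column of predict_proba is the probability of the positive class.
    preds = model.predict_proba(test_data)[:, 1]
    out = pd.DataFrame({'Id': range(1, len(preds) + 1), 'WnvPresent': preds})
    out.to_csv(path, index=False)

# Write a three-row demo file to a temporary directory and read it back.
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
write_submission(path, ConstantModel(), np.zeros((3, 2)))
check = pd.read_csv(path)
print(check)
```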

In [108]:
# we'll use a dictionary to keep track of our (train, test) scores
scores = {}
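
Once a dictionary of `(train, test)` score pairs has been filled in, one way to compare models is to turn it into a small table sorted by the held-out score. The sketch below uses placeholder numbers and model names that are purely illustrative; they are not results from this notebook.

```python
import pandas as pd

# Placeholder numbers for illustration only; NOT results from this notebook.
example_scores = {'model_a': (0.95, 0.90),
                  'model_b': (0.92, 0.91)}

table = pd.DataFrame.from_dict(example_scores, orient='index',
                               columns=['train_score', 'test_score'])

# A large gap between train and test scores is a hint of overfitting.
table['gap'] = table['train_score'] - table['test_score']

# Rank by held-out score and pick the top row.
table = table.sort_values('test_score', ascending=False)
best = table.index[0]
print(table)
print(best)
```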

## Logistic Regression and AdaBoost with PCA
First we assess the models trained on the full feature set after it was reduced with PCA.

In [103]:
with open('../models/ada.pkl', 'rb') as file:
    ada = pickle.load(file)
    
model_submission('ada', ada, test_weather_dummies_pca)
scores['ada'] = ada.score(X_train_pca, y_train), ada.score(X_test_pca, y_test)

In [130]:
with open('../models/log_reg_pca.pkl', 'rb') as file:
    log_reg_pca = pickle.load(file)
    
model_submission('log_reg_pca', log_reg_pca, test_weather_dummies_pca)
scores['log_reg_pca'] = log_reg_pca.score(X_train_pca, y_train), log_reg_pca.score(X_test_pca, y_test)


## Logistic Regression, Decision Tree, Random Forest, Bagged Decision Tree and KNN
Most of the models we tried were fit without applying PCA first.

In [114]:
with open('../models/dt.pkl', 'rb') as file:
    dt = pickle.load(file)
    
model_submission('dt', dt, test_weather_dummies)
scores['dt'] = dt.score(X_train, y_train), dt.score(X_test, y_test)

In [115]:
with open('../models/rf.pkl', 'rb') as file:
    rf = pickle.load(file)
    
model_submission('rf', rf, test_weather_dummies)
scores['rf'] = rf.score(X_train, y_train), rf.score(X_test, y_test)

In [116]:
with open('../models/bag.pkl', 'rb') as file:
    bag = pickle.load(file)
    
model_submission('bag', bag, test_weather_dummies)
scores['bag'] = bag.score(X_train, y_train), bag.score(X_test, y_test)

In [117]:
with open('../models/knn.pkl', 'rb') as file:
    knn = pickle.load(file)
    
model_submission('knn', knn, test_weather_dummies)
scores['knn'] = knn.score(X_train, y_train), knn.score(X_test, y_test)

## Decision Tree and Random Forest with restricted features
We tried a few models with a restricted feature set.

In [120]:
with open('../models/dt_lf.pkl', 'rb') as file:
    dt_lf = pickle.load(file)
    
model_submission('dt_lf', dt_lf, test_weather_dummies_limited)
scores['dt_lf'] = dt_lf.score(X_train_limited, y_train), dt_lf.score(X_test_limited, y_test)

In [121]:
with open('../models/rf_lf.pkl', 'rb') as file:
    rf_lf = pickle.load(file)
    
model_submission('rf_lf', rf_lf, test_weather_dummies_limited)
scores['rf_lf'] = rf_lf.score(X_train_limited, y_train), rf_lf.score(X_test_limited, y_test)

In [122]:
log_reg_pca.best_estimator_


LogisticRegression(C=0.01291549665014884, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)