# Let's Combine Models!

### Joe and Keenan

In this notebook, we are going to try combining our two most successful models to get a better score.

In [52]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import log_loss
# train_test_split, GridSearchCV and KFold live in sklearn.model_selection
# (scikit-learn 0.18+); the cross_validation and grid_search modules were removed in 0.20.
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.naive_bayes import BernoulliNB

%matplotlib inline

#Load Data with pandas, and parse the first column into datetime
train=pd.read_csv('train.csv', parse_dates = ['Dates'])
test=pd.read_csv('test.csv', parse_dates = ['Dates'])

# Keep only rows whose X and Y coordinates lie within 3 standard deviations of the column mean.
train = train[np.abs(train.X-train.X.mean())<=(3*train.X.std())] 
train = train[np.abs(train.Y-train.Y.mean())<=(3*train.Y.std())] 

train.head()


|   | Dates | Category | Descript | DayOfWeek | PdDistrict | Resolution | Address | X | Y |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-05-13 23:53:00 | WARRANTS | WARRANT ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
| 1 | 2015-05-13 23:53:00 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
| 2 | 2015-05-13 23:33:00 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | VANNESS AV / GREENWICH ST | -122.424363 | 37.800414 |
| 3 | 2015-05-13 23:30:00 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Wednesday | NORTHERN | NONE | 1500 Block of LOMBARD ST | -122.426995 | 37.800873 |
| 4 | 2015-05-13 23:30:00 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Wednesday | PARK | NONE | 100 Block of BRODERICK ST | -122.438738 | 37.771541 |
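
As a quick illustration of what the 3-standard-deviation filter above does, here is a minimal sketch on made-up coordinates. The helper name `drop_outliers` and the toy values are hypothetical, not part of the original notebook; the helper filters one column after another, the same order as the two lines above.

```python
import numpy as np
import pandas as pd

def drop_outliers(df, columns, n_std=3.0):
    """Filter column by column, keeping rows within n_std standard deviations of the mean."""
    for col in columns:
        deviation = (df[col] - df[col].mean()).abs()
        df = df[deviation <= n_std * df[col].std()]
    return df

# Toy data: 50 evenly spaced, plausible points plus one placeholder row far away.
toy = pd.DataFrame({
    "X": np.append(np.linspace(-122.45, -122.40, 50), -120.5),
    "Y": np.append(np.linspace(37.74, 37.80, 50), 90.0),
})

cleaned = drop_outliers(toy, ["X", "Y"])
print("rows before: {}, rows after: {}".format(len(toy), len(cleaned)))
```

Only the far-away row should be dropped here; on the real data the cutoff depends on how spread out the coordinates are.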


### Process the Data again

In [53]:
#Convert crime labels to numbers
le_crime = preprocessing.LabelEncoder()
crime = le_crime.fit_transform(train.Category)
 
#Get binarized weekdays, districts, and hours.
days = pd.get_dummies(train.DayOfWeek)
district = pd.get_dummies(train.PdDistrict)
hour = train.Dates.dt.hour
hour = pd.get_dummies(hour) 
x = train.X
y = train.Y
 
#Build new array
train_data = pd.concat([hour, days, district, x, y], axis=1)
train_data['crime']=crime
 
#Repeat for test data
days = pd.get_dummies(test.DayOfWeek)
district = pd.get_dummies(test.PdDistrict)
 
hour = test.Dates.dt.hour
hour = pd.get_dummies(hour) 
x = test.X
y = test.Y
 
test_data = pd.concat([hour, days, district, x, y], axis=1)

print train_data.head()


   0  1  2  3  4  5  6  7  8  9  ...    MISSION  NORTHERN  PARK  RICHMOND  \
0  0  0  0  0  0  0  0  0  0  0  ...          0         1     0         0   
1  0  0  0  0  0  0  0  0  0  0  ...          0         1     0         0   
2  0  0  0  0  0  0  0  0  0  0  ...          0         1     0         0   
3  0  0  0  0  0  0  0  0  0  0  ...          0         1     0         0   
4  0  0  0  0  0  0  0  0  0  0  ...          0         0     1         0   

   SOUTHERN  TARAVAL  TENDERLOIN           X          Y  crime  
0         0        0           0 -122.425892  37.774599     37  
1         0        0           0 -122.425892  37.774599     21  
2         0        0           0 -122.424363  37.800414     21  
3         0        0           0 -122.426995  37.800873     16  
4         0        0           0 -122.438738  37.771541     16  

[5 rows x 44 columns]
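
One thing to watch when repeating `get_dummies` on the test set: the columns are only created for values that actually appear, so the test frame could in principle end up with fewer columns than the training frame. A minimal sketch of one way to guard against that, using toy frames (`train_toy` and `test_toy` are placeholders, not the real data):

```python
import pandas as pd

train_toy = pd.DataFrame({"PdDistrict": ["NORTHERN", "PARK", "MISSION"]})
test_toy = pd.DataFrame({"PdDistrict": ["PARK", "PARK"]})

train_dummies = pd.get_dummies(train_toy.PdDistrict)
test_dummies = pd.get_dummies(test_toy.PdDistrict)  # only has a PARK column

# Give the test frame exactly the training columns, in the same order,
# filling categories that never occur in the test data with 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```

Whether this is needed here depends on whether every hour, weekday and district shows up in both files; the sketch just makes the assumption explicit.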


In [54]:
features = ['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday',
 'Wednesday', 'BAYVIEW', 'CENTRAL', 'INGLESIDE', 'MISSION',
 'NORTHERN', 'PARK', 'RICHMOND', 'SOUTHERN', 'TARAVAL', 'TENDERLOIN', 'X', 'Y']

# Add in hours of the day into the features
features2 = list(range(24))
features = features + features2

print features

['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday', 'BAYVIEW', 'CENTRAL', 'INGLESIDE', 'MISSION', 'NORTHERN', 'PARK', 'RICHMOND', 'SOUTHERN', 'TARAVAL', 'TENDERLOIN', 'X', 'Y', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]


In [75]:
training, validation = train_test_split(train_data, train_size=.5)
print type(training)

model = BernoulliNB()
model.fit(training[features], training['crime'])
predicted = np.array(model.predict_proba(validation[features]))
print predicted.shape
print "Naive Bayes: " , log_loss(validation['crime'], predicted) 

 <class 'pandas.core.frame.DataFrame'>
(438991, 39)
Naive Bayes:  2.58478544941
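
To put the 2.585 in context, it helps to know what a completely uninformed model would score. The sketch below is a plain NumPy version of multiclass log loss, written for illustration only (it is not scikit-learn's implementation, and the function name `multiclass_log_loss` is made up). It assumes labels are integer column indices, as produced by `LabelEncoder`, and evaluates a uniform guess over the 39 categories.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)  # make each row sum to 1
    return -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))

n_classes = 39
y_true = np.array([37, 21, 21, 16, 16])  # the encoded labels from the head() output above

# A model that knows nothing assigns 1/39 to every category.
uniform = np.full((len(y_true), n_classes), 1.0 / n_classes)
baseline = multiclass_log_loss(y_true, uniform)  # ln(39), about 3.6636
print("uniform baseline: {:.4f}".format(baseline))
```

So anything below roughly 3.66 is doing better than guessing every category equally.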


We are going to try combining two models, BernoulliNB and Logistic Regression, which are two of the models that have worked best on Kaggle in the past. I'm using a method similar to the ensemble we learned in the Dataquest tutorial on the Titanic dataset.

In [82]:
training, validation = train_test_split(train_data, train_size=.5)

algorithms = [[BernoulliNB(), features], [LogisticRegression(C=.01), features]]
full_predictions = []

for alg, alg_features in algorithms:
    # Fit each model on its own feature list and keep its class probabilities
    alg.fit(training[alg_features], training['crime'])
    predictions = np.array(alg.predict_proba(validation[alg_features]))
    full_predictions.append(predictions)

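# Weighted average: Naive Bayes counts twice as much as Logistic Regression
predictions = (full_predictions[0] * 2 + full_predictions[1]) / 3
# NOTE: astype(int) truncates probabilities below 1 to 0; the score below comes from this run
predictions = predictions.astype(int)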

print predictions.shape
print 'Ensemble: ', log_loss(validation['crime'], predictions)


 (438991, 39)
Ensemble:  3.66356164613


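As you can see, this resulted in a score of 3.663, which is pretty terrible: it is worse than Naive Bayes alone (2.585). The cause is almost certainly the `astype(int)` cast rather than the ensemble itself. The averaged probabilities are all below 1, so the cast truncates them to 0, and 3.6636 is ln(39), the log loss of a uniform guess over the 39 crime categories. So this run doesn't really tell us whether combining the two models helps.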
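
To see how an integer cast affects a weighted average of probabilities, here is a small self-contained sketch. Everything in it is a stand-in: `fake_model_proba` is a hypothetical placeholder for `predict_proba`, the labels are random, and the log loss helper is a plain NumPy version that assumes the metric clips and renormalizes each row.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)
    return -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))

rng = np.random.RandomState(1)
n_rows, n_classes = 200, 39
y_true = rng.randint(0, n_classes, size=n_rows)

def fake_model_proba(sharpness):
    """Placeholder for predict_proba: random scores nudged toward the true class."""
    scores = rng.rand(n_rows, n_classes)
    scores[np.arange(n_rows), y_true] += sharpness
    return scores / scores.sum(axis=1, keepdims=True)

proba_nb = fake_model_proba(2.0)  # stands in for BernoulliNB
proba_lr = fake_model_proba(1.0)  # stands in for LogisticRegression

blended = (2 * proba_nb + proba_lr) / 3  # weighted average, rows still sum to 1
truncated = blended.astype(int)          # every entry is below 1, so all become 0

loss_blended = multiclass_log_loss(y_true, blended)
loss_truncated = multiclass_log_loss(y_true, truncated)  # collapses to ln(39)
print("blended: {:.4f}  truncated: {:.4f}".format(loss_blended, loss_truncated))
```

On the real models, the equivalent change would be to score the averaged probabilities directly, without the cast. Whether that beats either model on its own is something only a rerun can tell.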

## Let's try a Gradient Boosting Classifier!

Because why not? Let's import it from sklearn, choose some parameters that seemed decent when we used this model on the Titanic dataset, and run the model to see how its log loss compares to the previous models.

In [84]:
from sklearn.ensemble import GradientBoostingClassifier

training, validation = train_test_split(train_data, train_size=.5)
model=GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3)
model.fit(training[features], training['crime'])
predicted = np.array(model.predict_proba(validation[features]))

print "Gradient Boosting!: " , log_loss(validation['crime'], predicted)

Gradient Boosting!:  2.54999099019


Oh yeah! At 2.550, that's an improvement over our earlier scores (2.585 for Naive Bayes). However, I remember that gradient boosting was very prone to overfitting on the Titanic dataset, so I don't want to get too confident at this point. Anyway, let's make a Kaggle submission and see!
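
Before submitting, one cheap way to look at the overfitting worry is to compare the log loss on the training half with the log loss on the validation half: a large gap suggests the model is memorizing. This is a minimal sketch on synthetic data from `make_classification`, which stands in for the crime features; the dataset sizes are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the real features and labels.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.5, random_state=0, stratify=y)

model = GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3)
model.fit(X_train, y_train)

# Same metric on both halves; labels= pins the column order to model.classes_.
train_loss = log_loss(y_train, model.predict_proba(X_train), labels=model.classes_)
val_loss = log_loss(y_val, model.predict_proba(X_val), labels=model.classes_)
print("train: {:.4f}  validation: {:.4f}  gap: {:.4f}".format(
    train_loss, val_loss, val_loss - train_loss))
```

On the notebook's data, the same comparison would use `training[features]` and `validation[features]` with the model fitted in the cell above.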

In [85]:
model=GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3)
model.fit(training[features], training['crime'])
predicted = model.predict_proba(test_data[features])
result=pd.DataFrame(predicted, columns=le_crime.classes_)
result.to_csv('testResultGradientBoosting.csv', index = True, index_label = 'Id' )


Yay! This model achieved a score of 2.54949 on Kaggle, good enough for 425th place as of right now. It definitely took a lot longer to run, about 30-60 minutes. I would like to try a grid search over parameters like the random state, n_estimators (adding more of them while checking that I'm not overfitting), or the max depth. However, this is going to take way too long, unfortunately. I am pleased with the improvement, though!
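
For reference, here is roughly what that grid search could look like. This is a sketch only: it runs on a small synthetic dataset so that it finishes quickly, and the values in `param_grid` are placeholders rather than recommendations. It uses `scoring="neg_log_loss"`, so scores are negated log losses and higher (closer to 0) is better.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic 3-class problem standing in for the crime data.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Placeholder grid: 2 x 2 = 4 combinations, each fitted on 3 folds.
param_grid = {"n_estimators": [10, 25], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_grid,
    scoring="neg_log_loss",
    cv=3,
)
search.fit(X, y)

print("best params: {}".format(search.best_params_))
print("best cross-validated log loss: {:.4f}".format(-search.best_score_))
```

Since one fit took 30-60 minutes on the full data, every extra combination and fold multiplies that cost, so running the search on a random subsample of `train_data` first might be a practical compromise.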