In [44]:
import pandas as pd 
from matplotlib import pyplot as plt
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,precision_score


from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# Part 2: Random forest and weather
The aim here is to recreate the work you did in Part 1-3 of the Week 7 lecture. I've phrased things differently relative to the exercise to make the purpose more clear.

## Part 2A: Random forest binary classification.

1. Using the instructions and material from Week 7, **build a random forest classifier to distinguish between two types (you choose) of crime using only spatio-temporal (where/when) features** of data describing the two crimes. When you're done, you should be able to give the classifier a place and a time, and it should tell you which of the two types of crime happened there.

In [45]:
data = pd.read_csv("../datasets/Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv",index_col = 0,low_memory=False) 
data.drop(data.tail(5).index,inplace=True)  # drop the last 5 rows, which contain errors

In [46]:
#Time measures
data["Hour"]= pd.to_datetime(data["Time"], format='%H:%M').dt.hour
data["Date"]= pd.to_datetime(data["Date"], format='%m/%d/%Y')
data["Month"] = data['Date'].dt.month
data["Day"] = data['Date'].dt.day

#crime categories analyzed
type1="VEHICLE THEFT" #influenced by weather conditions
type2="FRAUD" #not influenced

#take the same amount of samples and generate dataframe containing dataset for classification
df1=data[data["Category"]==type1].sample(12000)[['Hour', 'Day', 'Month', 'PdDistrict','Category','Date']]
df2=data[data["Category"]==type2].sample(12000)[['Hour', 'Day', 'Month', 'PdDistrict','Category','Date']]

df=pd.concat([df1,df2])

#modify categorical values
df['PdDistrict']=df['PdDistrict'].astype('category')
df['PdDistrict_cat']=df['PdDistrict'].cat.codes
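
The difference between the integer codes used above and one-hot encoding can be illustrated with a minimal sketch. The toy frame and its district values are assumptions made for illustration, not the real dataset.

```python
import pandas as pd

# Toy frame standing in for the crime data (illustrative values only).
toy = pd.DataFrame({"PdDistrict": ["MISSION", "CENTRAL", "MISSION", "TARAVAL"]})

# Integer codes: a single column, categories numbered in sorted order.
codes = toy["PdDistrict"].astype("category").cat.codes

# One-hot: one indicator column per distinct category.
one_hot = pd.get_dummies(toy["PdDistrict"], prefix="PdDistrict")

print(codes.tolist())   # one integer per row
print(one_hot.shape)    # (rows, number of distinct districts)
```

Integer codes keep the feature matrix narrow but impose an arbitrary order on the districts; one-hot avoids that order at the cost of one extra column per category.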

In [47]:
#train-validation/ test split -> 70/30
msk = np.random.rand(len(df)) < 0.7
df_train = df[msk]
df_test = df[~msk]
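
The mask-based split above draws new random numbers on every run, so the split sizes and the class proportions in each part vary. A seeded, stratified alternative could look like the sketch below; it uses scikit-learn's `train_test_split` on a toy frame (the frame and the names `toy`, `toy_train`, `toy_test` are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy balanced dataset standing in for the sampled crime data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Hour": rng.integers(0, 24, size=200),
    "Category": ["VEHICLE THEFT"] * 100 + ["FRAUD"] * 100,
})

# 70/30 split that keeps the class balance in both parts and is
# reproducible thanks to the fixed random_state.
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, stratify=toy["Category"], random_state=0
)

print(len(toy_train), len(toy_test))
print(toy_test["Category"].value_counts().to_dict())
```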

In [48]:
#train with grid CV
ranfor_params = {'fit__n_estimators' : range(50,200,10) , 'fit__max_depth' : range(1,6) , 'fit__max_features' : ["sqrt","log2"]}

pipe = Pipeline( [('fit',RandomForestClassifier(bootstrap=True, oob_score=True,random_state=0))] )

randomForest = GridSearchCV(pipe, 
                            param_grid=ranfor_params,
                            scoring='accuracy',
                            cv = 5, verbose=0, n_jobs=-1).fit( np.array(df_train[['Hour', 'Day', 'Month', 'PdDistrict_cat']]), df_train['Category']).best_estimator_

In [49]:
#print(randomForest)
print("Accuracy from CV results on train set: "+str(randomForest.score( np.array(df_train[['Hour', 'Day', 'Month', 'PdDistrict_cat']]), df_train['Category'])))

#predict using best parameters
predictions=randomForest.predict(np.array(df_test[['Hour', 'Day', 'Month', 'PdDistrict_cat']]))

#compute evaluation measures
accuracy = accuracy_score(df_test['Category'],predictions)
print("Accuracy on the test set: "+str(accuracy))


Accuracy from CV results on train set: 0.6786926236828005
Accuracy on the test set: 0.6866583368041094


1. **Explain your choices for training/test data, features, and encoding.** (You decide how to present your results, but here are some example topics to consider: Did you balance the training data? What are the pros/cons of balancing? Do you think your model is overfitting? Did you choose to do cross-validation? Which specific features did you end up using? Why? Which features (if any) did you one-hot encode? Why ... or why not?)
    
      I split the data into 70% for training and validation and 30% for the test set. The two classes are balanced, since I sampled the same number of incidents (12,000) from each. I separated the timestamp into several time variables (hour, day, month) to have more predictors to work with; since we have many samples, the curse of dimensionality is not a major issue in this problem. I encoded the categorical variable (`PdDistrict`) as simple integer codes: one-hot encoding could add many misleading features and is safer to use with regularized models. I used Grid Search CV, an "Exhaustive search over specified parameter values for an estimator." \[1\], to tune the forest and find the best parameters. Random forests are ensemble methods known for rarely overfitting when they use many trees and proper parameter values \[2\]. Adding cross-validation makes the parameter tuning more reliable, because every candidate is checked on repeated train/validation splits. The accuracy on the training set is very close to the accuracy on the test set, which suggests that the model generalizes well. Both values are also above the 50% expected from random guessing on balanced classes, so the model does not appear to underfit either.

2. **Discuss the model performance. Report accuracy.**

     As stated above, the performance is better than random. Accuracy is the number of correct predictions divided by the total number of predictions. We observe 67.83% correctly classified elements on the training set and 67.67% on the test set. The cell output above shows 67.87% and 68.67% instead; the sampling and the split are not seeded, so the numbers vary between runs. With the reported figures we perform, as expected, a little worse on unseen data, but given our reliable tuning procedure we can expect a performance on further unseen data points similar to the one on our test set.
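
The "train set" score printed above comes from the refitted best estimator evaluated on the same data it was trained on, which is not the same thing as the cross-validated accuracy. The sketch below, on synthetic data from `make_classification` with a small illustrative grid (both are assumptions, not the notebook's data or grid), shows how the three numbers could be read from a fitted `GridSearchCV`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the four spatio-temporal features.
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [20, 50], "max_depth": [2, 4]},
    scoring="accuracy",
    cv=5,
).fit(X_train, y_train)

# Mean accuracy over the 5 validation folds for the best parameter set.
cv_accuracy = search.best_score_
# Accuracy of the refitted best model on the data it was trained on.
train_accuracy = search.score(X_train, y_train)
# Accuracy on data never used for fitting or tuning.
test_accuracy = search.score(X_test, y_test)

print(search.best_params_)
print(f"CV: {cv_accuracy:.3f}  train: {train_accuracy:.3f}  test: {test_accuracy:.3f}")
```

Comparing the cross-validated score with the test score is usually a fairer overfitting check than comparing the training score with the test score.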

## Part 2B: Info from weather features.

Now add features from weather data to your random forest.

In [50]:
weather = pd.read_csv("../datasets/weather_data.csv",index_col = 0,low_memory=False) 

#merging databases
weather["Hour"]     =  pd.to_datetime(weather.index, format="%Y-%m-%dT%H:%M:%S.%fZ").hour
weather["Date"]     =  pd.to_datetime(weather.index, format="%Y-%m-%dT%H:%M:%S.%fZ").date
weather["DateHour"] =  weather["Date"].astype(str)+"-"+weather["Hour"].astype(str)
df1["DateHour"]     =  df1["Date"].astype(str)+"-"+df1["Hour"].astype(str)
df2["DateHour"]     =  df2["Date"].astype(str)+"-"+df2["Hour"].astype(str)

df1.set_index(["DateHour"],inplace= True)
df2.set_index(["DateHour"],inplace= True)
weather.set_index(["DateHour"],inplace= True)

merge1=pd.merge(df1,weather, how='inner', left_index=True, right_index=True)
merge1=merge1.dropna()
merge2=pd.merge(df2,weather, how='inner', left_index=True, right_index=True)
merge2=merge2.dropna()
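
The merge above builds a string key and joins on the index, which is why the overlapping `Hour` column comes back as `Hour_x`. A possible alternative, sketched here on toy frames (the column names `timestamp` and `temperature` and all values are assumptions for illustration), is to join on a real hourly timestamp and select only the weather columns that are needed.

```python
import pandas as pd

# Toy crime records: a calendar date plus the hour of the incident.
crimes = pd.DataFrame({
    "Date": pd.to_datetime(["2018-01-01", "2018-01-01", "2018-01-02"]),
    "Hour": [9, 17, 9],
    "Category": ["FRAUD", "VEHICLE THEFT", "FRAUD"],
})

# Toy hourly weather observations (no reading for 2018-01-02).
weather_toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2018-01-01 09:00", "2018-01-01 17:00"]),
    "temperature": [11.5, 14.0],
})

# Build the same hourly key on both sides as a datetime, not a string.
crimes["DateHour"] = crimes["Date"] + pd.to_timedelta(crimes["Hour"], unit="h")
weather_toy["DateHour"] = weather_toy["timestamp"].dt.floor("h")

# Inner join keeps only incidents with a matching weather reading.
merged = crimes.merge(
    weather_toy[["DateHour", "temperature"]], on="DateHour", how="inner"
)
print(merged)
```

Because only `DateHour` and `temperature` are taken from the weather frame, no column names collide and no `_x`/`_y` suffixes appear.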

In [51]:
#categorical codifications
df=pd.concat([merge1.sample(900),merge2.sample(900)])#due to the smaller csv bug

df['weather']=df['weather'].astype('category')
df['Weather_cat']=df['weather'].cat.codes


df['PdDistrict']=df['PdDistrict'].astype('category')
df['PdDistrict_cat']=df['PdDistrict'].cat.codes

In [54]:
#train-validation/ test split -> 70/30
msk = np.random.rand(len(df)) < 0.7
df_train = df[msk]
df_test = df[~msk]

In [55]:
#train with grid CV
ranfor_params = {'fit__n_estimators' : range(50,200,10) , 'fit__max_depth' : range(1,6) , 'fit__max_features' : ["sqrt","log2"]}

pipe = Pipeline( [('fit',RandomForestClassifier(bootstrap=True, oob_score=True,random_state=0))] )

randomForest = GridSearchCV(pipe, 
                            param_grid=ranfor_params,
                            scoring='accuracy',
                            cv = 5, verbose=0, n_jobs=-1).fit(np.array(df_train[['Hour_x', 'Day', 'Month', 'PdDistrict_cat','Weather_cat','temperature']]), df_train['Category']).best_estimator_

In [56]:
#print(randomForest)
print("Accuracy from CV results on train set: "+str(randomForest.score( np.array(df_train[['Hour_x', 'Day', 'Month', 'PdDistrict_cat','Weather_cat','temperature']]), df_train['Category'])))

#predict using best parameters
predictions=randomForest.predict(np.array(df_test[['Hour_x', 'Day', 'Month', 'PdDistrict_cat','Weather_cat','temperature']]))

#compute evaluation measures
accuracy = accuracy_score(df_test['Category'],predictions)
print("Accuracy on the test set: "+str(accuracy))


Accuracy from CV results on train set: 0.7180480247869868
Accuracy on the test set: 0.6365422396856582


1. **Report accuracy.**

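    On the **train set the accuracy is 0.7167449139280125**
    
    On the **test set the accuracy is 0.6800766283524904**

    The cell output above shows 0.7180 and 0.6365 instead; the sampling and the split are not seeded, so the numbers vary between runs.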
    

2. **Discuss how the model performance changes relative to the version with no weather data.**

    We see a small increase in performance. This is reasonable, since we chose two classes that are expected to differ in how strongly they depend on the weather; however, the increase is small. This may mean that the model already performs about as well (or as badly) without weather data, or that the weather data is not very useful for our classification. Note also that the weather model was fitted on 1,800 merged samples rather than 24,000, so the two accuracies are not directly comparable. We also notice a larger gap between train and test performance. A high accuracy on the training set combined with a poor one on the test set is an indicator of overfitting. In our case, considering the cross-validated tuning and Leo Breiman's results on random forests \[2\], the gap is still small enough (about 3.7 percentage points) that we do not suspect overfitting.
    

3. **Discuss what you have learned about crime from including weather data in your model.**

    Since the performance increases by only a very small amount, we have no statistically conclusive evidence that the weather is related to the frequency of "Vehicle Theft" or "Fraud" crimes. Still, the weather features we added appear to carry some predictive signal, because the accuracy increased after adding them.
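
One way to probe whether an added feature actually contributes is permutation importance: shuffle one column of the test set and measure how much the accuracy drops. The sketch below uses synthetic data with one informative and one pure-noise feature (both made up for illustration, not the weather data) and scikit-learn's `permutation_importance`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
informative = rng.normal(size=n)  # stands in for a feature that matters
noise = rng.normal(size=n)        # stands in for a feature that does not
y = (informative + 0.5 * rng.normal(size=n) > 0).astype(int)
X = np.column_stack([informative, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = RandomForestClassifier(
    n_estimators=100, max_depth=4, random_state=0
).fit(X_train, y_train)

# Mean drop in test accuracy when each column is shuffled 10 times.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0, scoring="accuracy"
)
for name, mean, std in zip(
    ["informative", "noise"], result.importances_mean, result.importances_std
):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Applied to the weather model, an importance near zero for `Weather_cat` or `temperature` would suggest that the small accuracy change is within run-to-run noise.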

## References

\[1\] -> [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

\[2\] -> [Random Forests by Leo Breiman, Statistics Department, University of California, Berkeley](https://link.springer.com/content/pdf/10.1023/A:1010933404324.pdf)