# Part 1: Decision trees
* *There are two main kinds of decision trees depending on the type of output (numeric vs. categorical). What are they?*
* *Explain in your own words: Why is entropy useful when deciding where to split the data?*
* *Why are trees prone to overfitting?*
* *Explain (in your own words) how random forests help prevent overfitting.*

***[ANSWERS TO QUESTIONS]***

Loading the dataset, before we begin:

In [93]:
import requests
from matplotlib import pyplot as plt
import numpy as np
import csv
import pandas as pd
from pandas import DataFrame
%matplotlib inline
import geoplotlib
from geoplotlib.utils import BoundingBox
from geoplotlib.colors import ColorMap
import sklearn
from sklearn.tree import DecisionTreeClassifier,export_graphviz
import math
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

ImportError: No module named model_selection

In [2]:
# Load it into a Dataframe using pandas
path = '..\data\sfpd_incidents.csv'
df = pd.read_csv(path)
df.head()

Unnamed: 0,IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Address,X,Y,Location,PdId
0,150060275,NON-CRIMINAL,LOST PROPERTY,Monday,01/19/2015,14:00,MISSION,NONE,18TH ST / VALENCIA ST,-122.421582,37.761701,"(37.7617007179518, -122.42158168137)",15006027571000
1,150098210,ROBBERY,"ROBBERY, BODILY FORCE",Sunday,02/01/2015,15:45,TENDERLOIN,NONE,300 Block of LEAVENWORTH ST,-122.414406,37.784191,"(37.7841907151119, -122.414406029855)",15009821003074
2,150098210,ASSAULT,AGGRAVATED ASSAULT WITH BODILY FORCE,Sunday,02/01/2015,15:45,TENDERLOIN,NONE,300 Block of LEAVENWORTH ST,-122.414406,37.784191,"(37.7841907151119, -122.414406029855)",15009821004014
3,150098210,SECONDARY CODES,DOMESTIC VIOLENCE,Sunday,02/01/2015,15:45,TENDERLOIN,NONE,300 Block of LEAVENWORTH ST,-122.414406,37.784191,"(37.7841907151119, -122.414406029855)",15009821015200
4,150098226,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM OF VEHICLES",Tuesday,01/27/2015,19:00,NORTHERN,NONE,LOMBARD ST / LAGUNA ST,-122.431119,37.800469,"(37.8004687042875, -122.431118543788)",15009822628160


*The chief wants you to start from real data and build a system that replicates the functionality in the Minority Report system. Imagine, we find out that certain type of crime is going to take place - as well as the exact time of the crime - but that we don't know where, then Suneman wants an algorithm that will predict which district the crime is most likely to take place in. Specifically, let's build an algorithm that predicts the location of a crime based on its type and time.*

* *Use the category of the crimes to build a decision tree that predicts the corresponding district. You can implement the ID3 tree in the DSFS book, or use the [DecisionTreeClassifier](http://scikit-learn.org/stable/modules/tree.html) class in scikit-learn. For training, you can use 90% of the data and test the tree prediction on the remaining 10%.*

In [None]:
label_encoder = 

In [86]:
num_observations = len(df['IncidntNum'])
training_set_size = int(math.ceil(.9*num_observations))

In order to fit the classifier, we must transform the data and convert all the string attributes to categorical attributes, since they're not supported by the DecisionTreeClassifier. Therefore, we use **one-out-of-K** coding in order to associate to each category its numerical representation.

In [87]:
list_categories = list(set(df['Category']))
categories_coding = {category:i for i,category in enumerate(list_categories)}

list_districts = list(set(df['PdDistrict']))
districts_coding = {district:i for i,district in enumerate(list_districts)}

With two simple functions we convert the categorical value to a numerical value and add it as a new column to the original dataframe:

In [88]:
def get_num_value(coding_dict,value):
    return coding_dict[value]
    
df['Category_Num'] = df['Category'].apply(lambda x: get_num_value(categories_coding,x))
df['PdDistrict_Num'] = df['PdDistrict'].apply(lambda x: get_num_value(districts_coding,x))
df[['Category','Category_Num','PdDistrict','PdDistrict_Num']].head()

Unnamed: 0,Category,Category_Num,PdDistrict,PdDistrict_Num
0,NON-CRIMINAL,28,MISSION,5
1,ROBBERY,10,TENDERLOIN,6
2,ASSAULT,34,TENDERLOIN,6
3,SECONDARY CODES,2,TENDERLOIN,6
4,VANDALISM,27,NORTHERN,2


In [89]:
#Get total training and test set
training_set = df.sample(training_set_size)
test_set = df.drop(training_set.index) #total - training

#Get features and labels
training_features = np.array(training_set['Category_Num'])
training_labels = np.array(training_set['PdDistrict_Num'])
test_labels = np.array(test_set['PdDistrict_Num'])

#Reshaping arrays
training_features = training_features[:, None]
training_labels = training_labels[:, None]
test_labels = test_labels[:, None]

In [90]:
#Fit the classifier
clf = DecisionTreeClassifier()
clf.fit(training_features,training_labels)
pred = clf.predict(test_labels)

* *What is the fraction of correct predictions?*

Let's create a support DataFrame to compare the actual values with the predicted values and calculate the fraction of correct predictions:

In [91]:
df_predictions = DataFrame(np.column_stack([test_set['PdDistrict_Num'],pred]),columns=['PdDistrict_Real','PdDistrict_Pred'])
total = len(df_predictions.index)
num_correct_predictions = len(df_predictions[df_predictions['PdDistrict_Real']==df_predictions['PdDistrict_Pred']].index)
ratio = num_correct_predictions*100.0/total
print 'Fraction of correct predictions: %.2f%%' %ratio

Fraction of correct predictions: 18.04%


* *What are the correct predictions if you restrict the training/prediction to single districts (for example, predicting Mission vs. all other districts, etc)?*

* *Compare it to the random guess, what would you get if you'd guess a district randomly?*
* *And if you'd guess always one of the districts (for example the district with the most crimes)?*