The SF Crime dataset poses an interesting question: given a time and a location, predict the category of crime that occurred.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import log_loss
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
import numpy as np

# convert the Dates column of our provided data from string to datetime format.
train=pd.read_csv('train.csv', parse_dates = ['Dates'])
test=pd.read_csv('test.csv', parse_dates = ['Dates'])

# Print the first 2 rows of the dataframe.
print(train.head(2))

                Dates        Category                  Descript  DayOfWeek  \
0 2015-05-13 23:53:00        WARRANTS            WARRANT ARREST  Wednesday   
1 2015-05-13 23:53:00  OTHER OFFENSES  TRAFFIC VIOLATION ARREST  Wednesday   

  PdDistrict      Resolution             Address           X          Y  
0   NORTHERN  ARREST, BOOKED  OAK ST / LAGUNA ST -122.425892  37.774599  
1   NORTHERN  ARREST, BOOKED  OAK ST / LAGUNA ST -122.425892  37.774599  


The dataset provides several pieces of information in its columns. My first step was to convert the categorical columns into numerical features. Plain integer codes, however, would suggest that nearby values are similar or ordered, which is not true for weekdays or districts, so we binarize (one-hot encode) these columns instead.
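To illustrate the difference, here is a minimal sketch on a toy column. The `toy` series is made up for this example (its values are borrowed from the PdDistrict column); it is not the real training data.

```python
import pandas as pd
from sklearn import preprocessing

# Toy district column (hypothetical, four rows only).
toy = pd.Series(['NORTHERN', 'PARK', 'MISSION', 'PARK'])

# Integer codes follow alphabetical order: MISSION=0, NORTHERN=1, PARK=2.
# A model could wrongly read NORTHERN as lying "between" the other two.
codes = preprocessing.LabelEncoder().fit_transform(toy)

# One-hot columns give each district its own 0/1 indicator,
# so no district looks numerically closer to another.
one_hot = pd.get_dummies(toy)

print(codes)
print(one_hot)
```

Each row of `one_hot` has exactly one active column, which is what we want for unordered categories.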

First, let's convert the crime category labels to integer values using scikit-learn's LabelEncoder class.

In [10]:
# Convert crime labels to numbers
le_crime = preprocessing.LabelEncoder()
encodedCrime = le_crime.fit_transform(train.Category)
# Map the integer codes back to the original category names
humanReadableCrime = le_crime.inverse_transform(encodedCrime)
print(humanReadableCrime)
print(encodedCrime)

(878049,)
[37 21 21 ..., 16 35 12]
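As a small sanity check of how the encoder behaves, the sketch below runs the same round trip on a hand-made list. The `toy_categories` list is an assumption for illustration; only the two category names seen in the output above come from the data.

```python
from sklearn import preprocessing

# Hypothetical labels; 'ASSAULT' is added just to have a third class.
toy_categories = ['WARRANTS', 'OTHER OFFENSES', 'OTHER OFFENSES', 'ASSAULT']

le = preprocessing.LabelEncoder()
encoded = le.fit_transform(toy_categories)

# classes_ holds the sorted unique labels; a label's code is its index there.
print(le.classes_)
print(encoded)

# inverse_transform maps the codes back to the original strings.
decoded = le.inverse_transform(encoded)
print(decoded)
```

Keeping the fitted encoder around matters later: its `classes_` attribute is what lets us turn predicted codes back into category names.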


Now we can create matrices that hold binarized (one-hot) information about the day of the week and the police district in which each incident took place.

In [6]:
#Get binarized weekdays and district
days = pd.get_dummies(train.DayOfWeek)
district = pd.get_dummies(train.PdDistrict)

Once we extract the hour from the Dates column, we can binarize it as well and combine everything into a single feature table.

In [57]:
# Get the hour portion from the "Dates" column
hour = train.Dates.dt.hour
# Create a binarized matrix with one column per hour; hour[23] marks the incidents that happened during hour 23
hour = pd.get_dummies(hour)

# Build a new table with binary info on hour, day and district, plus the numbered crime
train_data = pd.concat([hour, days, district], axis=1)
train_data['crime'] = encodedCrime
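One detail worth knowing about the hour step: `get_dummies` only creates columns for the hours that actually appear in the data. The sketch below shows this on three made-up timestamps and, as an optional extra that the notebook itself does not do, reindexes to all 24 hours so that the columns are the same for any dataset.

```python
import pandas as pd

# Three hypothetical timestamps standing in for the Dates column.
toy = pd.DataFrame({'Dates': pd.to_datetime([
    '2015-05-13 23:53:00',
    '2015-05-13 08:10:00',
    '2015-05-12 23:05:00',
])})

# .dt.hour extracts the hour of day as an integer (23, 8, 23 here).
hour = toy.Dates.dt.hour

# Only hours 8 and 23 occur, so only two dummy columns are created.
observed = pd.get_dummies(hour)

# Optional: force one column per hour 0..23, filling absent hours with 0.
hour_dummies = observed.reindex(columns=range(24), fill_value=0).astype(int)

print(observed.shape)
print(hour_dummies.shape)
```

On the full training set every hour is likely to occur, so the reindex step would probably change nothing there; it mainly guards against mismatched columns on small samples.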

Since we want to run many experiments on the training data, we split it further into a training subset and a validation subset, leaving the test dataset untouched.

In [79]:
training, validation = train_test_split(train_data, train_size=.60)
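The split above is random, so the score changes slightly from run to run. A possible variation, sketched here on a made-up table, fixes the seed with `random_state` and uses `stratify` to keep the class proportions the same in both parts. The `toy` frame and the choice of seed are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical table: 100 rows, 4 crime codes with 25 rows each.
toy = pd.DataFrame({
    'feature': np.arange(100),
    'crime': np.tile([0, 1, 2, 3], 25),
})

# random_state makes the split reproducible;
# stratify keeps each crime code equally represented in both parts.
training, validation = train_test_split(
    toy, train_size=0.60, random_state=0, stratify=toy['crime'])

print(len(training), len(validation))
```

Note that `stratify` requires every class to appear at least twice, which may not hold for very rare categories.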

Given that we have binarized the day and the district in which each incident happened, it makes sense to take the following columns as our primary features:

In [59]:
features = ['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday',
 'Wednesday', 'BAYVIEW', 'CENTRAL', 'INGLESIDE', 'MISSION',
 'NORTHERN', 'PARK', 'RICHMOND', 'SOUTHERN', 'TARAVAL', 'TENDERLOIN']

We can fit a Naive Bayes model and evaluate it with log_loss, the loss function that (multinomial) logistic regression minimizes. We can also fit a logistic regression model on the same features, since several independent variables together determine the type of crime.
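To make the metric concrete: log loss is the average negative log of the probability that the model assigned to the true class. The sketch below computes it by hand on three made-up predictions and compares the result with `log_loss`. The arrays `y_true` and `proba` are invented for this example.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true classes and predicted probabilities for 3 samples, 3 classes.
y_true = np.array([0, 2, 1])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# Pick the probability given to the true class of each sample,
# then average the negative logarithms.
p_true = proba[np.arange(len(y_true)), y_true]
manual = -np.mean(np.log(p_true))

# scikit-learn's implementation should agree with the manual value.
library = log_loss(y_true, proba, labels=[0, 1, 2])

print(manual, library)
```

Lower is better: a confident wrong prediction is punished heavily, while a perfect model would score 0.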

In [81]:
# training : 60% of training data
# validation: 40% of training data

# Naive Bayes
model = BernoulliNB()
model.fit(training[features], training['crime'])
predicted = np.array(model.predict_proba(validation[features]))
log_loss(validation['crime'], predicted) 
 
# #Logistic Regression
# model = LogisticRegression(C=.01)
# model.fit(training[features], training['crime'])
# predicted = np.array(model.predict_proba(validation[features]))
# log_loss(validation['crime'], predicted) 

2.6146195273756532
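One way to put this number in context is to compare it with an uninformed baseline that assigns the same probability to every category, which yields a log loss of ln(K) for K classes. The sketch below assumes K = 39; that count is an assumption and is not shown in the output above, so check it with `len(le_crime.classes_)` in the notebook.

```python
import numpy as np
from sklearn.metrics import log_loss

# ASSUMPTION: 39 crime categories; verify with len(le_crime.classes_).
n_classes = 39

# One hypothetical sample per class, each predicted with uniform probabilities.
y_true = np.arange(n_classes)
uniform = np.full((n_classes, n_classes), 1.0 / n_classes)

# The uniform baseline scores ln(n_classes) regardless of the true labels.
baseline = log_loss(y_true, uniform, labels=np.arange(n_classes))

print(baseline)
```

Under that assumption the baseline is about 3.66, so a score near 2.61 would mean the day and district features already carry some signal.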