# Table of contents
1. [Introduction](#introduction)
2. [Imports](#imports)
3. [Describe Dataset](#describe-dataset)
4. [Quick Preprocessing & Feature Engineering](#preprocessing)
5. [Feature Selection](#feature-selection)
6. [Training with XGBoost (Cross Validation)](#training)

# Introduction <a name="introduction"></a>

This is my solution ( *late submission* ) to the **San Francisco Crime Classification** competition.

I found this dataset very interesting for learning to deal with *moderately large* datasets. It contains *Spatial Coordinates*, a *Datetime* column and *cyclic features*.

In this Kernel, I want to share my approach to this problem. I will focus on **Feature Engineering** & **Prediction** using **XGBoost**.

( **I won't be going through the visualizations**; you can check [my Github Repo](https://github.com/hamzael1/kaggle-san-francisco-crime-classification) for those. )

# Imports <a name="imports"></a>

In [1]:
%matplotlib inline
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Describe Dataset <a name="describe-dataset"></a>

In [2]:
train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')

## Show Random rows:

In [3]:
# Show 5 random rows from dataset
train_df.sample(5)

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
712557,2005-04-08 10:00:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Friday,TENDERLOIN,NONE,100 Block of GOLDEN GATE AV,-122.413048,37.781912
698381,2005-06-18 10:47:00,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Saturday,PARK,"ARREST, BOOKED",400 Block of HUGO ST,-122.462638,37.765083
568220,2007-05-03 23:55:00,ASSAULT,BATTERY,Thursday,SOUTHERN,"ARREST, BOOKED",100 Block of 4TH ST,-122.403941,37.784301
75005,2014-05-15 22:04:00,ASSAULT,STALKING,Thursday,TENDERLOIN,NONE,400 Block of TURK ST,-122.416349,37.782557
288412,2011-06-02 22:00:00,MISSING PERSON,FOUND PERSON,Thursday,BAYVIEW,LOCATED,1400 Block of PHELPS ST,-122.394439,37.736444


In [4]:
test_df.sample(1)

Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y
777150,777150,2004-06-01 13:00:00,Tuesday,MISSION,100 Block of SHOTWELL ST,-122.416535,37.766041


## Show useful information (columns, types, number of rows)

- A few important observations:
    - We have **878049** observations of **9** variables.
    - We have a **'Dates'** column which contains the date and time of the occurrence of the crime, but it's stored as a String.
    - We have **spatial coordinates** ( X = Longitude, Y = Latitude ) of the place of the crime.
    - The Target column is **'Category'**, which is a Categorical Column ( 39 categories )
    - The **'DayOfWeek'** column is also Categorical ( 7 days )
    - The **'PdDistrict'** column is also Categorical ( 10 districts )
    - The **'Address'** column indicates whether the crime location was an intersection of two roads
    - The **'Resolution'** column will be dropped ( it is not present in the test set, so it can't help us with prediction )

In [5]:
print('Number of Categories: ', train_df.Category.nunique())
print('Number of PdDistricts: ', train_df.PdDistrict.nunique())
print('Number of DayOfWeeks: ', train_df.DayOfWeek.nunique())
print('_________________________________________________')
# Show some useful Information
train_df.info()

Number of Categories:  39
Number of PdDistricts:  10
Number of DayOfWeeks:  7
_________________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878049 entries, 0 to 878048
Data columns (total 9 columns):
Dates         878049 non-null object
Category      878049 non-null object
Descript      878049 non-null object
DayOfWeek     878049 non-null object
PdDistrict    878049 non-null object
Resolution    878049 non-null object
Address       878049 non-null object
X             878049 non-null float64
Y             878049 non-null float64
dtypes: float64(2), object(7)
memory usage: 60.3+ MB


# Quick Preprocessing & Feature Engineering <a name="preprocessing"></a>

## Drop the Resolution Column:

In [6]:
train_df = train_df.drop('Resolution', axis=1)
train_df.sample(1)

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Address,X,Y
300865,2011-03-25 21:04:00,SUSPICIOUS OCC,INVESTIGATIVE DETENTION,Friday,MISSION,1200 Block of 15TH ST,-122.413034,37.766793


## Parse the 'Dates' Column:

### The 'Dates' column is stored as strings. Parsing it to Datetime makes it easier to work with.

In [7]:
train_df.Dates.dtype

dtype('O')

### Check if there are any missing values or typos:

In [8]:
assert not train_df.Dates.isnull().any()
assert not test_df.Dates.isnull().any()

In [9]:
# Raw strings keep '\d' from being treated as an (invalid) string escape
assert train_df.Dates.str.match(r'\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d').all()
assert test_df.Dates.str.match(r'\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d').all()
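
As a small illustration of the format check above, here is a sketch on a few made-up values (the `dates` Series is hypothetical, not taken from the dataset). It assumes a pandas version that provides `Series.str.fullmatch`, which also rejects trailing characters that `str.match` would let through.

```python
import pandas as pd

# Hypothetical sample values; the last one is deliberately malformed
dates = pd.Series(['2005-04-08 10:00:00', '2014-05-15 22:04:00', '2014-5-15 22:04'])

# Same pattern as in the cell above, written with repetition counts
pattern = r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}'

# fullmatch requires the whole string to match, not just its beginning
is_valid = dates.str.fullmatch(pattern)

# Show the rows that would make the assertion fail
print(dates[~is_valid].tolist())
```

Listing the offending rows this way is usually more helpful than a bare failing `assert` when a typo does show up.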

### Now we proceed to parsing using the function `pandas.to_datetime` :
( We will also change the column name to 'Date' singular ) 

In [10]:
train_df['Date'] = pd.to_datetime(train_df.Dates)
test_df['Date'] = pd.to_datetime(test_df.Dates)

train_df = train_df.drop('Dates', axis=1)
test_df = test_df.drop('Dates', axis=1)
train_df.sample(1)

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Address,X,Y,Date
211275,VEHICLE THEFT,STOLEN AUTOMOBILE,Tuesday,CENTRAL,LEAVENWORTH ST / BAY ST,-122.418534,37.804998,2012-07-24 14:00:00


In [11]:
# Confirm that it was parsed to Datetime
train_df.Date.dtype

dtype('<M8[ns]')
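
Since the format was already validated, the parsing could also be done with an explicit `format` string. The sketch below is only an illustration on made-up values (`raw` is hypothetical); `errors='coerce'` is an optional choice that turns unparseable values into `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical values: one valid timestamp and one invalid string
raw = pd.Series(['2005-04-08 10:00:00', 'not a date'])

# An explicit format avoids per-value format inference;
# errors='coerce' maps anything that does not parse to NaT
parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S', errors='coerce')

print(parsed.isna().tolist())  # which rows failed to parse
print(parsed.iloc[0].hour)     # datetime accessors now work
```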

## Engineer a feature to indicate whether the crime was committed by day or by night:

In [12]:
train_df['IsDay'] = 0
train_df.loc[ (train_df.Date.dt.hour > 6) & (train_df.Date.dt.hour < 20), 'IsDay' ] = 1
test_df['IsDay'] = 0
test_df.loc[ (test_df.Date.dt.hour > 6) & (test_df.Date.dt.hour < 20), 'IsDay' ] = 1

train_df.sample(3)

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Address,X,Y,Date,IsDay
839038,OTHER OFFENSES,CONSPIRACY,Thursday,INGLESIDE,1500 Block of DOLORES ST,-122.424576,37.744228,2003-07-10 06:10:00,0
113463,SUSPICIOUS OCC,SUSPICIOUS OCCURRENCE,Friday,TENDERLOIN,300 Block of ELLIS ST,-122.411988,37.785023,2013-11-01 19:00:00,1
184109,OTHER OFFENSES,"DRIVERS LICENSE, SUSPENDED OR REVOKED",Friday,BAYVIEW,BAY SHORE BL / JERROLD AV,-122.403564,37.747761,2012-11-30 17:40:00,1
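
The same day/night rule (hours 7 to 19 inclusive count as day) can probably be written more compactly with `Series.between`. This is a sketch on a hypothetical `hours` Series, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical hours covering both edges of the day window
hours = pd.Series([0, 6, 7, 12, 19, 20, 23])

# between() is inclusive on both ends by default,
# so between(7, 19) is equivalent to (hour > 6) & (hour < 20)
is_day = hours.between(7, 19).astype(int)

print(is_day.tolist())
```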


## Create 'Hour', 'Month' and 'Year' columns and encode 'DayOfWeek'

### Encode 'DayOfWeek' to Integer:

In [13]:
days_to_int_dic = {
        'Monday': 1,
        'Tuesday': 2,
        'Wednesday': 3,
        'Thursday': 4,
        'Friday': 5,
        'Saturday': 6,
        'Sunday': 7,
}
train_df['DayOfWeek'] = train_df['DayOfWeek'].map(days_to_int_dic)
test_df ['DayOfWeek'] = test_df ['DayOfWeek'].map(days_to_int_dic)

train_df.DayOfWeek.unique()

array([3, 2, 1, 7, 6, 5, 4])
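
As a cross-check of this mapping, the day number can also be derived from a parsed datetime: pandas numbers Monday as 0 in `dt.dayofweek`, so adding 1 should give the same encoding. The sketch below uses three timestamps that appear in the sample rows shown earlier; the variable names are only illustrative.

```python
import pandas as pd

days_to_int = {
    'Monday': 1, 'Tuesday': 2, 'Wednesday': 3, 'Thursday': 4,
    'Friday': 5, 'Saturday': 6, 'Sunday': 7,
}

# Timestamps from the sample rows above (a Friday, a Saturday, a Thursday)
dates = pd.to_datetime(pd.Series([
    '2005-04-08 10:00:00', '2005-06-18 10:47:00', '2007-05-03 23:55:00',
]))

# Route 1: day name -> integer through the dictionary
from_names = dates.dt.day_name().map(days_to_int)

# Route 2: dt.dayofweek is 0 for Monday, so shift by one
from_dates = dates.dt.dayofweek + 1

print(from_names.tolist(), from_dates.tolist())
```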

### Create Hour, Month and Year Columns: 

In [14]:
train_df['Hour'] = train_df.Date.dt.hour
train_df['Month'] = train_df.Date.dt.month
train_df['Year'] = train_df.Date.dt.year
train_df['Year'] = train_df['Year'] - 2000 # Tree splits are unaffected by this shift; smaller numbers are just easier to read

test_df['Hour'] = test_df.Date.dt.hour
test_df['Month'] = test_df.Date.dt.month
test_df['Year'] = test_df.Date.dt.year
test_df['Year'] = test_df['Year'] - 2000 # Tree splits are unaffected by this shift; smaller numbers are just easier to read

train_df.sample(1)

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Address,X,Y,Date,IsDay,Hour,Month,Year
788317,FORGERY/COUNTERFEITING,"CHECKS, MAKE OR PASS FICTITIOUS",3,NORTHERN,1500 Block of POLK ST,-122.420714,37.791055,2004-03-17 17:40:00,1,17,3,4


### Deal with the cyclic nature of Hours, Days of Week and Months:

In [15]:
train_df['HourCos'] = np.cos((train_df['Hour']*2*np.pi)/24 )
train_df['DayOfWeekCos'] = np.cos((train_df['DayOfWeek']*2*np.pi)/7 )
train_df['MonthCos'] = np.cos((train_df['Month']*2*np.pi)/12 )

test_df['HourCos'] = np.cos((test_df['Hour']*2*np.pi)/24 )
test_df['DayOfWeekCos'] = np.cos((test_df['DayOfWeek']*2*np.pi)/7 )
test_df['MonthCos'] = np.cos((test_df['Month']*2*np.pi)/12 )

train_df.sample(1)

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Address,X,Y,Date,IsDay,Hour,Month,Year,HourCos,DayOfWeekCos,MonthCos
761703,BURGLARY,"BURGLARY OF APARTMENT HOUSE, UNLAWFUL ENTRY",4,PARK,100 Block of CASTRO ST,-122.435517,37.766932,2004-07-22 09:30:00,1,9,7,4,-0.707107,-0.900969,-0.866025
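
One caveat worth illustrating: a cosine alone cannot distinguish two points that sit symmetrically on the circle (for example 06:00 and 18:00). A common way around this is to pair it with a sine component. The sketch below is an assumption-labelled illustration on a hypothetical frame; the `HourSin` column is not part of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical hours, including the symmetric pair 6 and 18
df = pd.DataFrame({'Hour': [0, 6, 12, 18, 23]})

# Map the hour onto an angle around a 24-hour circle
angle = 2 * np.pi * df['Hour'] / 24

df['HourCos'] = np.cos(angle)  # same formula as in the cell above
df['HourSin'] = np.sin(angle)  # hypothetical extra column, not in the notebook

# Hours 6 and 18 share (almost exactly) the same cosine but have opposite sines
print(df.round(3))
```

With both components, every hour maps to a unique point on the unit circle, and 23:00 ends up close to 00:00.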


## Dummy Encoding of 'PdDistrict':

In [16]:
train_df = pd.get_dummies(train_df, columns=['PdDistrict'])
test_df  = pd.get_dummies(test_df,  columns=['PdDistrict'])
train_df.sample(2)

Unnamed: 0,Category,Descript,DayOfWeek,Address,X,Y,Date,IsDay,Hour,Month,Year,HourCos,DayOfWeekCos,MonthCos,PdDistrict_BAYVIEW,PdDistrict_CENTRAL,PdDistrict_INGLESIDE,PdDistrict_MISSION,PdDistrict_NORTHERN,PdDistrict_PARK,PdDistrict_RICHMOND,PdDistrict_SOUTHERN,PdDistrict_TARAVAL,PdDistrict_TENDERLOIN
611932,OTHER OFFENSES,PAROLE VIOLATION,4,RANDOLPH ST / VICTORIA ST,-122.465298,37.714287,2006-09-07 15:50:00,1,15,9,6,-0.707107,-0.900969,-1.83697e-16,0,0,0,0,0,0,0,0,1,0
161896,OTHER OFFENSES,"FRAUDULENT GAME OR TRICK, OBTAINING MONEY OR P...",5,400 Block of MANGELS AV,-122.447759,37.733128,2013-03-22 11:31:00,1,11,3,13,-0.965926,-0.222521,6.123234000000001e-17,0,0,1,0,0,0,0,0,0,0
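
Encoding train and test separately works here because both contain the same 10 districts. If a category were missing from one of them, the dummy columns would no longer line up. A minimal sketch of one way to guard against that, on hypothetical toy frames, is to reindex the test columns against the train columns:

```python
import pandas as pd

# Hypothetical toy frames: 'TARAVAL' never appears in the test data
train = pd.DataFrame({'PdDistrict': ['MISSION', 'PARK', 'TARAVAL']})
test = pd.DataFrame({'PdDistrict': ['MISSION', 'PARK']})

train_d = pd.get_dummies(train, columns=['PdDistrict'])
test_d = pd.get_dummies(test, columns=['PdDistrict'])

# Give the test frame exactly the train columns, filling missing dummies with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(test_d.columns))
```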


## Label Encoding of 'Category':

In [17]:
from sklearn.preprocessing import LabelEncoder

cat_le = LabelEncoder()
train_df['CategoryInt'] = pd.Series(cat_le.fit_transform(train_df.Category))
train_df.sample(5)
#cat_le.classes_

Unnamed: 0,Category,Descript,DayOfWeek,Address,X,Y,Date,IsDay,Hour,Month,Year,HourCos,DayOfWeekCos,MonthCos,PdDistrict_BAYVIEW,PdDistrict_CENTRAL,PdDistrict_INGLESIDE,PdDistrict_MISSION,PdDistrict_NORTHERN,PdDistrict_PARK,PdDistrict_RICHMOND,PdDistrict_SOUTHERN,PdDistrict_TARAVAL,PdDistrict_TENDERLOIN,CategoryInt
244418,SECONDARY CODES,DOMESTIC VIOLENCE,6,100 Block of POWELL ST,-122.407878,37.785968,2012-01-28 03:45:00,0,3,1,12,0.707107,0.62349,0.866025,0,0,0,0,0,0,0,0,0,1,27
877187,OTHER OFFENSES,"DRIVERS LICENSE, SUSPENDED OR REVOKED",3,CIRCULAR AV / FLOOD AV,-122.440664,37.729919,2003-01-08 03:30:00,0,3,1,3,0.707107,-0.900969,0.866025,0,0,1,0,0,0,0,0,0,0,21
317182,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,3,6200 Block of GEARY BL,-122.48636,37.780133,2010-12-29 17:00:00,1,17,12,10,-0.258819,-0.900969,1.0,0,0,0,0,0,0,1,0,0,0,16
466753,DRUG/NARCOTIC,"CONTROLLED SUBSTANCE VIOLATION, LOITERING FOR",3,TURK ST / HYDE ST,-122.415695,37.782585,2008-10-15 17:56:00,1,17,10,8,-0.258819,-0.900969,0.5,0,0,0,0,0,0,0,0,0,1,7
479101,FORGERY/COUNTERFEITING,"CHECKS, MAKE OR PASS FICTITIOUS",1,300 Block of BRANNAN ST,-122.392417,37.781655,2008-08-18 09:00:00,1,9,8,8,-0.707107,0.62349,-0.5,0,0,0,0,0,0,0,1,0,0,12
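
To make the encoding easier to read back, here is a short sketch on a handful of category names (the input list is made up). `LabelEncoder` stores the sorted unique labels in `classes_`, and `inverse_transform` maps integer codes back to names.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Hypothetical labels; the encoder sorts the unique values alphabetically
codes = le.fit_transform(['WARRANTS', 'ARSON', 'ASSAULT', 'ARSON'])

print(list(le.classes_))                  # position in this list == integer code
print(codes.tolist())                     # encoded labels
print(list(le.inverse_transform([2, 0]))) # codes back to category names
```

This ordering matters later: the probability columns predicted by the model follow the integer codes, so they line up with `classes_`.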


## Engineer a feature to indicate whether the crime happened at an intersection:

In [18]:
train_df['InIntersection'] = 1
train_df.loc[train_df.Address.str.contains('Block'), 'InIntersection'] = 0

test_df['InIntersection'] = 1
test_df.loc[test_df.Address.str.contains('Block'), 'InIntersection'] = 0
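
The rule above treats every address without the word 'Block' as an intersection. In the sample rows, intersections are written as two street names separated by '/', so a slash test is a possible alternative. The sketch below compares both rules on two addresses taken from the samples; whether they agree on the full dataset is not verified here.

```python
import pandas as pd

# Two address styles seen in the sample rows above
addresses = pd.Series(['100 Block of GOLDEN GATE AV', 'LEAVENWORTH ST / BAY ST'])

# Rule used in the notebook: no 'Block' in the address means intersection
in_intersection = (~addresses.str.contains('Block')).astype(int)

# Alternative rule (an assumption): a literal '/' separates two street names
has_slash = addresses.str.contains('/', regex=False).astype(int)

print(in_intersection.tolist(), has_slash.tolist())
```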

In [19]:
train_df.sample(10)

Unnamed: 0,Category,Descript,DayOfWeek,Address,X,Y,Date,IsDay,Hour,Month,Year,HourCos,DayOfWeekCos,MonthCos,PdDistrict_BAYVIEW,PdDistrict_CENTRAL,PdDistrict_INGLESIDE,PdDistrict_MISSION,PdDistrict_NORTHERN,PdDistrict_PARK,PdDistrict_RICHMOND,PdDistrict_SOUTHERN,PdDistrict_TARAVAL,PdDistrict_TENDERLOIN,CategoryInt,InIntersection
871079,WARRANTS,WARRANT ARREST,3,ALEMANY BL / CRYSTAL ST,-122.460623,37.710478,2003-02-05 17:45:00,1,17,2,3,-0.258819,-0.900969,0.5,0,0,0,0,0,0,0,0,1,0,37,1
353682,LARCENY/THEFT,GRAND THEFT OF PROPERTY,3,MCALLISTER ST / PIERCE ST,-122.435053,37.778212,2010-06-16 17:30:00,1,17,6,10,-0.258819,-0.900969,-1.0,0,0,0,0,0,1,0,0,0,0,16,1
479720,WARRANTS,ENROUTE TO OUTSIDE JURISDICTION,6,7TH ST / MINNA ST,-122.410406,37.778784,2008-08-09 16:18:00,1,16,8,8,-0.5,0.62349,-0.5,0,0,0,0,0,0,0,1,0,0,37,1
358165,WEAPON LAWS,POSS OF PROHIBITED WEAPON,5,500 Block of CAPP ST,-122.417956,37.75788,2010-05-21 18:50:00,1,18,5,10,-1.83697e-16,-0.222521,-0.866025,0,0,0,1,0,0,0,0,0,0,38,0
239897,OTHER OFFENSES,TAMPERING WITH A VEHICLE,4,1500 Block of COLE ST,-122.448955,37.760481,2012-02-23 00:13:00,0,0,2,12,1.0,-0.900969,0.5,0,0,0,0,0,1,0,0,0,0,21,0
874069,VEHICLE THEFT,"VEHICLE, RECOVERED, AUTO",3,3500 Block of GEARY BL,-122.457016,37.781216,2003-01-22 18:30:00,1,18,1,3,-1.83697e-16,-0.900969,0.866025,0,0,0,0,0,0,1,0,0,0,36,0
516208,DRUG/NARCOTIC,POSSESSION OF NARCOTICS PARAPHERNALIA,2,700 Block of GEARY ST,-122.415633,37.786359,2008-02-05 16:46:00,1,16,2,8,-0.5,-0.222521,0.5,0,1,0,0,0,0,0,0,0,0,7,0
112017,OTHER OFFENSES,"DRIVERS LICENSE, SUSPENDED OR REVOKED",2,VANNESS AV / GROVE ST,-122.419885,37.778251,2013-11-12 12:20:00,1,12,11,13,-1.0,-0.222521,0.866025,0,0,0,0,1,0,0,0,0,0,21,1
109276,ASSAULT,AGGRAVATED ASSAULT WITH A DEADLY WEAPON,1,18TH ST / NOE ST,-122.432795,37.76102,2013-11-25 23:20:00,0,23,11,13,0.9659258,0.62349,0.866025,0,0,0,1,0,0,0,0,0,0,1,1
813928,DRUG/NARCOTIC,POSSESSION OF METH-AMPHETAMINE,1,JACKSON ST / HYDE ST,-122.418116,37.794566,2003-11-10 17:30:00,1,17,11,3,-0.258819,0.62349,0.866025,0,1,0,0,0,0,0,0,0,0,7,1


# Feature Selection <a name="feature-selection"></a>

**Now let's get our dataset ready for training!**

In [20]:
train_df.columns

Index(['Category', 'Descript', 'DayOfWeek', 'Address', 'X', 'Y', 'Date',
       'IsDay', 'Hour', 'Month', 'Year', 'HourCos', 'DayOfWeekCos', 'MonthCos',
       'PdDistrict_BAYVIEW', 'PdDistrict_CENTRAL', 'PdDistrict_INGLESIDE',
       'PdDistrict_MISSION', 'PdDistrict_NORTHERN', 'PdDistrict_PARK',
       'PdDistrict_RICHMOND', 'PdDistrict_SOUTHERN', 'PdDistrict_TARAVAL',
       'PdDistrict_TENDERLOIN', 'CategoryInt', 'InIntersection'],
      dtype='object')

In [21]:
feature_cols = ['X', 'Y', 'IsDay', 'DayOfWeek', 'Month', 'Hour', 'Year', 'InIntersection',
                'PdDistrict_BAYVIEW', 'PdDistrict_CENTRAL', 'PdDistrict_INGLESIDE',
                'PdDistrict_MISSION', 'PdDistrict_NORTHERN', 'PdDistrict_PARK',
                'PdDistrict_RICHMOND', 'PdDistrict_SOUTHERN', 'PdDistrict_TARAVAL', 'PdDistrict_TENDERLOIN']
target_col = 'CategoryInt'

train_x = train_df[feature_cols]
train_y = train_df[target_col]

test_ids = test_df['Id']
test_x = test_df[feature_cols]

In [22]:
train_x.sample(1)

Unnamed: 0,X,Y,IsDay,DayOfWeek,Month,Hour,Year,InIntersection,PdDistrict_BAYVIEW,PdDistrict_CENTRAL,PdDistrict_INGLESIDE,PdDistrict_MISSION,PdDistrict_NORTHERN,PdDistrict_PARK,PdDistrict_RICHMOND,PdDistrict_SOUTHERN,PdDistrict_TARAVAL,PdDistrict_TENDERLOIN
566658,-122.465494,37.782889,0,2,5,0,7,1,0,0,0,0,0,0,1,0,0,0


In [23]:
test_x.sample(1)

Unnamed: 0,X,Y,IsDay,DayOfWeek,Month,Hour,Year,InIntersection,PdDistrict_BAYVIEW,PdDistrict_CENTRAL,PdDistrict_INGLESIDE,PdDistrict_MISSION,PdDistrict_NORTHERN,PdDistrict_PARK,PdDistrict_RICHMOND,PdDistrict_SOUTHERN,PdDistrict_TARAVAL,PdDistrict_TENDERLOIN
703329,-122.429913,37.735323,1,7,5,15,5,0,0,0,1,0,0,0,0,0,0,0


# Training with XGBoost (Cross Validation) <a name="training"></a>

In [24]:
type(train_x), type(train_y)

(pandas.core.frame.DataFrame, pandas.core.series.Series)

## Import XGBoost and create the DMatrices

In [25]:
import xgboost as xgb
train_xgb = xgb.DMatrix(train_x, label=train_y)
test_xgb  = xgb.DMatrix(test_x)

## Play with the parameters and do Cross-Validation

In [26]:
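params = {
    'max_depth': 4,  # the maximum depth of each tree
    'eta': 0.3,  # learning rate (step-size shrinkage applied at each boosting round)
    'silent': 1,  # logging mode - quiet (newer XGBoost releases use 'verbosity' instead)
    'objective': 'multi:softprob',  # multiclass objective that outputs one probability per class
    'num_class': 39,  # number of crime categories
}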

In [27]:
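CROSS_VAL = False
if CROSS_VAL:
    print('Doing Cross-validation ...')
    # num_boost_round=10 is the xgb.cv default, made explicit here; with so few rounds,
    # early_stopping_rounds=10 is unlikely to trigger unless num_boost_round is raised
    cv = xgb.cv(params, train_xgb, num_boost_round=10, nfold=3, early_stopping_rounds=10,
                metrics='mlogloss', verbose_eval=True)
    print(cv)  # a bare `cv` inside the if-block would not be displayed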
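
For intuition about the `mlogloss` metric used above, here is a minimal NumPy sketch of multiclass log loss. The function name `multiclass_log_loss` and the clipping constant are my own choices for illustration; XGBoost and the competition scorer may differ in details such as clipping or renormalisation.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    probs = np.clip(probs, eps, 1 - eps)            # avoid log(0)
    picked = probs[np.arange(len(y_true)), y_true]  # probability of the true class per row
    return float(-np.mean(np.log(picked)))

# Hypothetical 3-class example
y_true = np.array([0, 2, 1])
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.6, 0.2],
])
loss = multiclass_log_loss(y_true, probs)

# Baseline: predicting a uniform 1/39 for every class gives log(39)
uniform = np.full((3, 39), 1 / 39)
uniform_loss = multiclass_log_loss(y_true, uniform)

print(round(loss, 4), round(uniform_loss, 4))
```

The uniform baseline is a handy reference point: a cross-validated score well below log(39) suggests the features carry real signal.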

## Fit & Make the predictions

In [28]:
SUBMIT = not CROSS_VAL
if SUBMIT:
    print('Fitting Model ...')
    m = xgb.train(params, train_xgb, num_boost_round=10)
    res = m.predict(test_xgb)  # one row per test sample, one probability per class
    # Column order follows cat_le.classes_, which matches the integer labels used for training
    submission = pd.DataFrame(res, columns=cat_le.classes_)
    submission.insert(0, 'Id', test_ids)
    submission.to_csv('submission.csv', index=False)
    print('Done Outputting !')
    print(submission.sample(3))
else:
    print('NOT SUBMITTING')

Fitting Model ...
Done Outputting !
            Id     ARSON     ...       WARRANTS  WEAPON LAWS
565595  565595  0.005318     ...       0.046265     0.015714
320495  320495  0.006616     ...       0.027081     0.016835
138777  138777  0.004894     ...       0.061299     0.009119

[3 rows x 40 columns]
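
Before uploading, a quick sanity check of the submission layout can catch column or normalisation mistakes. The sketch below builds a small stand-in frame with random probabilities (the class subset and values are hypothetical, not model output) and checks that the first column is `Id` and that each row of probabilities sums to 1.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of class names and random stand-in probabilities
classes = ['ARSON', 'ASSAULT', 'WARRANTS']
rng = np.random.default_rng(0)
raw = rng.random((4, len(classes)))
probs = raw / raw.sum(axis=1, keepdims=True)  # normalise each row to sum to 1

submission = pd.DataFrame(probs, columns=classes)
submission.insert(0, 'Id', range(4))

# Layout checks: 'Id' first, then one column per class
columns_ok = list(submission.columns) == ['Id'] + classes
rows_ok = np.allclose(submission[classes].sum(axis=1), 1.0)

print(columns_ok, rows_ok)
```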
