# Table of contents
1. [Introduction](#introduction)
2. [Imports](#imports)
3. [Explore Dataset](#explore-dataset)
4. [Quick Preprocessing & Feature Engineering](#preprocessing)
5. [Feature Selection](#feature-selection)
6. [Training with XGBoost (Cross Validation)](#training)

# Introduction <a name="introduction"></a>

This is my solution ( *late submission* ) to the **San Francisco Crime Classification** competition.

I found this dataset interesting for learning to deal with *moderately large* data: it contains *Spatial Coordinates*, a *Datetime* column and *cyclic features*.

In this Kernel, I want to share my approach to this problem. I will focus on **Feature Engineering** & **Prediction** using **XGBoost**.

( **I won't be going through the visualizations**; you can check [my Github Repo](https://github.com/hamzael1/kaggle-san-francisco-crime-classification) for those. )

# Imports <a name="imports"></a>

In [1]:
%matplotlib inline
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Explore Dataset <a name="explore-dataset"></a>

In [2]:
train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')

## Show Random rows:

In [3]:
# Show 5 random rows from dataset
train_df.sample(5)

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
247537,2012-01-13 07:00:00,FORGERY/COUNTERFEITING,"CHECKS OR LEGAL INSTRUMENTS, UTTERING FORGED",Friday,TARAVAL,NONE,1200 Block of 44TH AV,-122.503912,37.763305
731953,2004-12-23 00:01:00,SEX OFFENSES FORCIBLE,SEXUAL BATTERY,Thursday,PARK,UNFOUNDED,0 Block of CASTRO ST,-122.435637,37.768169
74282,2014-05-17 16:00:00,VEHICLE THEFT,STOLEN AUTOMOBILE,Saturday,RICHMOND,NONE,2200 Block of BALBOA ST,-122.482861,37.77632
772349,2004-05-28 21:30:00,PROSTITUTION,SOLICITS LEWD ACT,Friday,MISSION,"ARREST, CITED",400 Block of HAMPSHIRE ST,-122.408431,37.763758
380823,2010-01-16 02:10:00,SUSPICIOUS OCC,SUSPICIOUS PERSON,Saturday,INGLESIDE,NONE,WOOL ST / CORTLAND AV,-122.417141,37.739122


In [4]:
test_df.sample(1)

Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y
418545,418545,2009-07-03 14:40:00,Friday,TENDERLOIN,400 Block of ELLIS ST,-122.413631,37.784805


## Show useful information (columns, types, number of rows)

- A few important observations:
    - We have 878049 Observations of 9 variables
    - We have a 'Dates' column which contains the date and time of the occurrence of the crime, but it's a String.
    - We have spatial coordinates of the place of the crime ( 'X' is the Longitude, 'Y' is the Latitude ).
    - The Target column is 'Category', which is a Categorical Column ( 39 categories )
    - The 'DayOfWeek' column is also Categorical ( 7 days )
    - The 'PdDistrict' column is also Categorical ( 10 districts  )
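    - The 'Address' column is text: either a street block ( e.g. '400 Block of ELLIS ST' ) or an intersection of two streets ( e.g. 'WOOL ST / CORTLAND AV' )
    - The 'Resolution' column will be dropped ( it is not present in the test set, so it can't help us with prediction )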

In [5]:
print('Number of Categories: ', train_df.Category.nunique())
print('Number of PdDistricts: ', train_df.PdDistrict.nunique())
print('Number of DayOfWeeks: ', train_df.DayOfWeek.nunique())
print('_________________________________________________')
# Show some useful Information
train_df.info()

Number of Categories:  39
Number of PdDistricts:  10
Number of DayOfWeeks:  7
_________________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878049 entries, 0 to 878048
Data columns (total 9 columns):
Dates         878049 non-null object
Category      878049 non-null object
Descript      878049 non-null object
DayOfWeek     878049 non-null object
PdDistrict    878049 non-null object
Resolution    878049 non-null object
Address       878049 non-null object
X             878049 non-null float64
Y             878049 non-null float64
dtypes: float64(2), object(7)
memory usage: 60.3+ MB


# Quick Preprocessing & Feature Engineering <a name="preprocessing"></a>

## Drop the Resolution Column:

In [6]:
train_df = train_df.drop('Resolution', axis=1)
train_df.sample(1)

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Address,X,Y
617445,2006-08-10 08:06:00,WARRANTS,WARRANT ARREST,Thursday,MISSION,400 Block of VALENCIA ST,-122.421994,37.765315


## Parse the 'Dates' Column:

### The 'Dates' column type is String. It will be easier to work with by parsing it to Datetime.

In [7]:
train_df.Dates.dtype

dtype('O')

### Check if there are any missing values or typos:

In [8]:
assert not train_df.Dates.isnull().any()
assert not test_df.Dates.isnull().any()

In [9]:
# Raw strings keep '\d' from being treated as a (invalid) string escape sequence
assert train_df.Dates.str.match(r'\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d').all()
assert test_df.Dates.str.match(r'\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d').all()
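
The regular expression only checks the shape of the string, not whether it is a real date. As a complementary check, here is a small sketch (on made-up strings, not the competition data) that lets `pandas.to_datetime` flag anything it cannot parse by using `errors='coerce'`:

```python
import pandas as pd

# Made-up examples: the second one matches the regex above but is not a real date
dates = pd.Series([
    '2012-01-13 07:00:00',
    '2012-13-45 07:00:00',
    'not a date',
])

# Unparseable values become NaT instead of raising an error
parsed = pd.to_datetime(dates, format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Keep the original strings of the rows that failed to parse
bad_rows = dates[parsed.isna()]
print(bad_rows.tolist())
```

An empty `bad_rows` on the real columns would mean every value is a valid timestamp.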

### Now we proceed to parsing using the function `pandas.to_datetime` :
( We will also change the column name to 'Date' singular ) 

In [10]:
train_df['Date'] = pd.to_datetime(train_df.Dates)
test_df['Date'] = pd.to_datetime(test_df.Dates)

train_df = train_df.drop('Dates', axis=1)
test_df = test_df.drop('Dates', axis=1)
train_df.sample(1)

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Address,X,Y,Date
749404,LARCENY/THEFT,PETTY THEFT FROM A BUILDING,Saturday,SOUTHERN,1300 Block of MARKET ST,-122.416819,37.776936,2004-09-18 17:20:00


In [11]:
# Confirm that it was parsed to Datetime
train_df.Date.dtype

dtype('<M8[ns]')

## Engineer a feature to indicate whether the crime was committed by day or by night :

In [12]:
train_df['IsDay'] = 0
train_df.loc[ (train_df.Date.dt.hour > 6) & (train_df.Date.dt.hour < 20), 'IsDay' ] = 1
test_df['IsDay'] = 0
test_df.loc[ (test_df.Date.dt.hour > 6) & (test_df.Date.dt.hour < 20), 'IsDay' ] = 1

train_df.sample(3)

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Address,X,Y,Date,IsDay
221833,ASSAULT,THREATS AGAINST LIFE,Tuesday,BAYVIEW,0 Block of DUKES CT,-122.386783,37.737003,2012-05-29 11:00:00,1
790550,DRUG/NARCOTIC,POSSESSION OF HEROIN,Friday,SOUTHERN,0 Block of 6TH ST,-122.409504,37.781526,2004-03-05 12:40:00,1
354565,BURGLARY,"BURGLARY,STORE UNDER CONSTRUCTION, FORCIBLE ENTRY",Monday,MISSION,300 Block of TREAT AV,-122.413446,37.764637,2010-06-14 12:00:00,1


## Create 'Month' and 'Year' columns and convert 'DayOfWeek' to Integer

### Transform 'DayOfWeek' to Integer:

In [13]:
days_to_int_dic = {
        'Monday': 1,
        'Tuesday': 2,
        'Wednesday': 3,
        'Thursday': 4,
        'Friday': 5,
        'Saturday': 6,
        'Sunday': 7,
}
train_df['DayOfWeek'] = train_df['DayOfWeek'].map(days_to_int_dic)
test_df ['DayOfWeek'] = test_df ['DayOfWeek'].map(days_to_int_dic)

train_df.DayOfWeek.unique()

array([3, 2, 1, 7, 6, 5, 4])

### Create Month & Year Columns: 

In [14]:
train_df['Month'] = train_df.Date.dt.month
train_df['Year'] = train_df.Date.dt.year
train_df['Year'] = train_df['Year'] - 2000 # 2003 -> 3, etc. A constant shift doesn't change tree splits; the values are just easier to read

test_df['Month'] = test_df.Date.dt.month
test_df['Year'] = test_df.Date.dt.year
test_df['Year'] = test_df['Year'] - 2000 # Same shift as for the training set

train_df.sample(1)

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Address,X,Y,Date,IsDay,Month,Year
527122,OTHER OFFENSES,PROBATION VIOLATION,6,SOUTHERN,1000 Block of MARKET ST,-122.411071,37.781751,2007-12-01 13:50:00,1,12,7


### Deal with the cyclic characteristic of Months and Days of Week:

In [15]:
train_df['DayOfWeekCos'] = np.cos((train_df['DayOfWeek']*2*np.pi)/7 )
train_df['MonthCos'] = np.cos((train_df['Month']*2*np.pi)/12 )

test_df['DayOfWeekCos'] = np.cos((test_df['DayOfWeek']*2*np.pi)/7 )
test_df['MonthCos'] = np.cos((test_df['Month']*2*np.pi)/12 )

train_df.sample(1)

Unnamed: 0,Category,Descript,DayOfWeek,PdDistrict,Address,X,Y,Date,IsDay,Month,Year,DayOfWeekCos,MonthCos
784574,DRUG/NARCOTIC,MAINTAINING PREMISE WHERE NARCOTICS ARE SOLD/USED,5,SOUTHERN,500 Block of MINNA ST,-122.408674,37.780057,2004-04-02 12:30:00,1,4,4,-0.222521,-0.5
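
One caveat with a cosine-only encoding is that two different days (or months) can share the same value: in the outputs of this notebook, day 5 and day 2 both map to -0.222521. A common variation, which is not what this notebook does, is to add the matching sine so that each position on the cycle gets a unique (cos, sin) pair. A minimal sketch on a toy frame (the `DayOfWeekSin` column is hypothetical):

```python
import numpy as np
import pandas as pd

# 1 = Monday ... 7 = Sunday, same convention as days_to_int_dic
days = pd.Series(np.arange(1, 8), name='DayOfWeek')
angle = 2 * np.pi * days / 7

encoded = pd.DataFrame({
    'DayOfWeek': days,
    'DayOfWeekCos': np.cos(angle),
    'DayOfWeekSin': np.sin(angle),  # hypothetical extra column
})

# With the cosine alone, Monday (1) and Saturday (6) are indistinguishable
cos_only_clash = np.isclose(encoded.loc[0, 'DayOfWeekCos'],
                            encoded.loc[5, 'DayOfWeekCos'])

# With both columns, every day has its own (cos, sin) pair
unique_pairs = (encoded[['DayOfWeekCos', 'DayOfWeekSin']]
                .round(6)
                .drop_duplicates()
                .shape[0])

print(cos_only_clash, unique_pairs)
```

The same idea applies to 'Month' with a period of 12.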


## Dummy Encoding of 'PdDistrict':

In [16]:
train_df = pd.get_dummies(train_df, columns=['PdDistrict'])
test_df  = pd.get_dummies(test_df,  columns=['PdDistrict'])
train_df.sample(2)

Unnamed: 0,Category,Descript,DayOfWeek,Address,X,Y,Date,IsDay,Month,Year,DayOfWeekCos,MonthCos,PdDistrict_BAYVIEW,PdDistrict_CENTRAL,PdDistrict_INGLESIDE,PdDistrict_MISSION,PdDistrict_NORTHERN,PdDistrict_PARK,PdDistrict_RICHMOND,PdDistrict_SOUTHERN,PdDistrict_TARAVAL,PdDistrict_TENDERLOIN
265820,SUSPICIOUS OCC,INVESTIGATIVE DETENTION,4,0 Block of PHELAN AV,-122.452428,37.725717,2011-10-06 19:53:00,1,10,11,-0.900969,0.5,0,0,1,0,0,0,0,0,0,0
824612,BURGLARY,"BURGLARY OF APARTMENT HOUSE, UNLAWFUL ENTRY",4,2900 Block of FOLSOM ST,-122.413778,37.750092,2003-09-18 16:40:00,1,9,3,-0.900969,-1.83697e-16,0,0,0,1,0,0,0,0,0,0
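
Calling `pd.get_dummies` on the train and test sets separately only gives identical columns when both contain the same set of districts. If a district were missing from one of them, the two frames would end up with different columns. A small sketch of one way to guard against that, on toy data, by reindexing the test columns against the training columns:

```python
import pandas as pd

# Toy data: 'TARAVAL' never appears in the toy test set
train = pd.DataFrame({'PdDistrict': ['MISSION', 'PARK', 'TARAVAL']})
test = pd.DataFrame({'PdDistrict': ['MISSION', 'PARK']})

train_d = pd.get_dummies(train, columns=['PdDistrict'])
test_d = pd.get_dummies(test, columns=['PdDistrict'])

# Force the test frame to have exactly the training columns;
# dummy columns missing from the test set are filled with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(test_d.columns))
```

In the real notebook the reindex would be restricted to the dummy columns, since the train frame also holds columns ( 'Category', 'Descript' ) that the test frame doesn't have.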


## Label Encoding of 'Category':

In [17]:
from sklearn.preprocessing import LabelEncoder

cat_le = LabelEncoder()
train_df['CategoryInt'] = pd.Series(cat_le.fit_transform(train_df.Category))
train_df.sample(5)
#cat_le.classes_

Unnamed: 0,Category,Descript,DayOfWeek,Address,X,Y,Date,IsDay,Month,Year,DayOfWeekCos,MonthCos,PdDistrict_BAYVIEW,PdDistrict_CENTRAL,PdDistrict_INGLESIDE,PdDistrict_MISSION,PdDistrict_NORTHERN,PdDistrict_PARK,PdDistrict_RICHMOND,PdDistrict_SOUTHERN,PdDistrict_TARAVAL,PdDistrict_TENDERLOIN,CategoryInt
538310,OTHER OFFENSES,CONSPIRACY,6,200 Block of 10TH ST,-122.413532,37.773455,2007-10-06 03:24:00,0,10,7,0.62349,0.5,0,0,0,0,0,0,0,1,0,0,21
498619,ASSAULT,AGGRAVATED ASSAULT WITH A KNIFE,6,MARKET ST / 5TH ST,-122.408068,37.783992,2008-05-03 00:28:00,0,5,8,0.62349,-0.8660254,0,0,0,0,0,0,0,1,0,0,1
559687,ASSAULT,AGGRAVATED ASSAULT WITH BODILY FORCE,6,200 Block of INTERSTATE80 HY,-122.365565,37.809671,2007-06-16 16:05:00,1,6,7,0.62349,-1.0,0,0,0,0,0,0,0,1,0,0,1
270224,OTHER OFFENSES,RESISTING ARREST,6,1600 Block of NEWHALL ST,-122.393571,37.734062,2011-09-10 22:39:00,0,9,11,0.62349,-1.83697e-16,1,0,0,0,0,0,0,0,0,0,21
163120,OTHER OFFENSES,RESISTING ARREST,2,MINNA ST / 6TH ST,-122.408163,37.780535,2013-03-19 15:50:00,1,3,13,-0.222521,6.123234000000001e-17,0,0,0,0,0,0,0,1,0,0,21
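
For reference, `LabelEncoder` assigns the integers following the sorted order of the class names, and keeps that order in `classes_`, which is what lets us name the probability columns later on. A tiny sketch on a handful of made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels, deliberately unsorted and with a repeat
labels = ['WARRANTS', 'ASSAULT', 'BURGLARY', 'ASSAULT']

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ holds the sorted unique labels; a code is an index into it
print(list(le.classes_))
print(codes.tolist())

# inverse_transform maps integer codes back to the original names
decoded = le.inverse_transform([0, 2])
print(list(decoded))
```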


# Feature Selection <a name="feature-selection"></a>

**Now let's get our dataset ready for training !**

In [18]:
train_df.columns

Index(['Category', 'Descript', 'DayOfWeek', 'Address', 'X', 'Y', 'Date',
       'IsDay', 'Month', 'Year', 'DayOfWeekCos', 'MonthCos',
       'PdDistrict_BAYVIEW', 'PdDistrict_CENTRAL', 'PdDistrict_INGLESIDE',
       'PdDistrict_MISSION', 'PdDistrict_NORTHERN', 'PdDistrict_PARK',
       'PdDistrict_RICHMOND', 'PdDistrict_SOUTHERN', 'PdDistrict_TARAVAL',
       'PdDistrict_TENDERLOIN', 'CategoryInt'],
      dtype='object')

In [19]:
feature_cols = ['X', 'Y', 'IsDay', 'DayOfWeekCos', 'MonthCos', 'Year', 
                'PdDistrict_BAYVIEW', 'PdDistrict_CENTRAL', 'PdDistrict_INGLESIDE',
                'PdDistrict_MISSION', 'PdDistrict_NORTHERN', 'PdDistrict_PARK',
                'PdDistrict_RICHMOND', 'PdDistrict_SOUTHERN', 'PdDistrict_TARAVAL', 'PdDistrict_TENDERLOIN']
target_col = 'CategoryInt'

train_x = pd.DataFrame(train_df[feature_cols])
train_y = train_df[target_col]

test_ids = test_df['Id']
test_x = test_df[feature_cols]

In [20]:
train_x.sample(1)

Unnamed: 0,X,Y,IsDay,DayOfWeekCos,MonthCos,Year,PdDistrict_BAYVIEW,PdDistrict_CENTRAL,PdDistrict_INGLESIDE,PdDistrict_MISSION,PdDistrict_NORTHERN,PdDistrict_PARK,PdDistrict_RICHMOND,PdDistrict_SOUTHERN,PdDistrict_TARAVAL,PdDistrict_TENDERLOIN
99636,-122.403405,37.775421,1,1.0,0.866025,14,0,0,0,0,0,0,0,1,0,0


In [21]:
test_x.sample(1)

Unnamed: 0,X,Y,IsDay,DayOfWeekCos,MonthCos,Year,PdDistrict_BAYVIEW,PdDistrict_CENTRAL,PdDistrict_INGLESIDE,PdDistrict_MISSION,PdDistrict_NORTHERN,PdDistrict_PARK,PdDistrict_RICHMOND,PdDistrict_SOUTHERN,PdDistrict_TARAVAL,PdDistrict_TENDERLOIN
352042,-122.412447,37.775634,1,1.0,-1.0,10,0,0,0,0,0,0,0,1,0,0


# Training with XGBoost (Cross Validation) <a name="training"></a>

In [22]:
type(train_x), type(train_y)

(pandas.core.frame.DataFrame, pandas.core.series.Series)

## Import XGBoost and create the DMatrices

In [23]:
import xgboost as xgb
train_xgb = xgb.DMatrix(train_x, label=train_y)
test_xgb  = xgb.DMatrix(test_x)

## Play with the parameters and do Cross-Validation

In [24]:
params = {
    'max_depth': 4,  # the maximum depth of each tree
    'eta': 0.3,  # learning rate: shrinks the contribution of each new tree
    'silent': 1,  # logging mode - quiet (recent XGBoost releases use 'verbosity' instead)
    'objective': 'multi:softprob',  # multiclass objective that outputs one probability per class
    'num_class': 39,
}

In [25]:
CROSS_VAL = False
if CROSS_VAL:
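    # num_boost_round is left at its default (10), which leaves early stopping (patience of 10 rounds) almost no room to act
    cv = xgb.cv(params, train_xgb, nfold=3, early_stopping_rounds=10, metrics='mlogloss', verbose_eval=True)
    print(cv)  # a bare `cv` inside the `if` block would not be displayed by the notebook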
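
To make the `mlogloss` metric more concrete, here is a small NumPy sketch of multiclass log loss as it is usually defined: the mean negative log of the probability given to the true class. It is an illustration on made-up numbers, not XGBoost's own implementation, and the `multiclass_log_loss` helper is hypothetical:

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    # Clip to avoid log(0) when a class gets probability exactly 0
    proba = np.clip(proba, eps, 1 - eps)
    # Pick, for each row, the probability predicted for its true class
    picked = proba[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(picked))

# Made-up example: 3 samples, 3 classes
y_true = np.array([0, 2, 1])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])
loss = multiclass_log_loss(y_true, proba)

# Baseline: predicting 1/39 for every class gives log(39), about 3.66
uniform_39 = multiclass_log_loss(np.array([0]), np.full((1, 39), 1 / 39))

print(loss, uniform_39)
```

The uniform baseline is a handy yardstick: a model scoring above log(39) on 39 classes is doing worse than guessing every class with equal probability.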

## Fit & Make the predictions

In [26]:
m = xgb.train(params, train_xgb, num_boost_round=5)  # train for 5 boosting rounds

In [27]:
SUBMIT = True
if SUBMIT:
    res = m.predict(test_xgb)
    # One probability column per class, named in the order used by the LabelEncoder
    submission = pd.DataFrame(res, columns=cat_le.classes_)
    submission.insert(0, 'Id', test_ids)
    submission.to_csv('submission.csv', index=False)
    print('Done Outputing !')
    print(submission.sample(3))

Done Outputing !
            Id     ARSON     ...       WARRANTS  WEAPON LAWS
542099  542099  0.012038     ...       0.041159     0.017528
849580  849580  0.012600     ...       0.051174     0.015876
39040    39040  0.011889     ...       0.037796     0.018295

[3 rows x 40 columns]
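
Before uploading, it can be worth running a few sanity checks on the submission frame. The sketch below uses a small made-up probability matrix and three stand-in class names instead of the real `res` and `cat_le.classes_`; the checks themselves ( shape, unique Ids, rows summing to 1 ) carry over unchanged:

```python
import numpy as np
import pandas as pd

# Stand-ins for the real predictions and class names (assumptions for this sketch)
rng = np.random.default_rng(0)
raw = rng.random((4, 3))
proba = raw / raw.sum(axis=1, keepdims=True)  # normalise each row to sum to 1
classes = ['ARSON', 'ASSAULT', 'WARRANTS']

sub = pd.DataFrame(proba, columns=classes)
sub.insert(0, 'Id', np.arange(4))

# One 'Id' column plus one column per class
assert sub.shape == (4, len(classes) + 1)
# Every Id appears exactly once
assert sub['Id'].is_unique
# softprob-style outputs should sum to 1 across the classes of each row
assert np.allclose(sub[classes].sum(axis=1), 1.0)

print(sub.shape)
```

With the real data, the expected shape would have 40 columns: 'Id' plus the 39 categories.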
