# San Francisco Crime Classification
https://www.kaggle.com/c/sf-crime
## Predict the category of crimes that occurred in the city by the bay

From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz.

Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.

From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.

# Basic Imports and Reads

In [1]:
import numpy as np
import pandas
import sklearn

FILE_TRAIN = 'train.csv'
FILE_TEST  = 'test.csv'
dt_orig = pandas.read_csv(FILE_TRAIN)
dt_test_orig = pandas.read_csv(FILE_TEST)

In [13]:
dt = dt_orig[:1000].copy()  # first 1000 rows only; copy so new columns can be added without warnings
dt_test = dt_test_orig

# Exploration of Data
Here we do a basic exploration of the types of columns, number of rows, and the type of data they contain.

In [14]:
dt

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541
5,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM UNLOCKED AUTO,Wednesday,INGLESIDE,NONE,0 Block of TEDDY AV,-122.403252,37.713431
6,2015-05-13 23:30:00,VEHICLE THEFT,STOLEN AUTOMOBILE,Wednesday,INGLESIDE,NONE,AVALON AV / PERU AV,-122.423327,37.725138
7,2015-05-13 23:30:00,VEHICLE THEFT,STOLEN AUTOMOBILE,Wednesday,BAYVIEW,NONE,KIRKWOOD AV / DONAHUE ST,-122.371274,37.727564
8,2015-05-13 23:00:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,RICHMOND,NONE,600 Block of 47TH AV,-122.508194,37.776601
9,2015-05-13 23:00:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,CENTRAL,NONE,JEFFERSON ST / LEAVENWORTH ST,-122.419088,37.807802


In [15]:
# Dataframe Info
dt.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 9 columns):
Dates         1000 non-null object
Category      1000 non-null object
Descript      1000 non-null object
DayOfWeek     1000 non-null object
PdDistrict    1000 non-null object
Resolution    1000 non-null object
Address       1000 non-null object
X             1000 non-null float64
Y             1000 non-null float64
dtypes: float64(2), object(7)
memory usage: 78.1+ KB
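
`info()` reports `Dates` as a plain `object` column, i.e. strings, so the time of the crime is not yet usable as a numeric feature. Below is a minimal sketch of one way to split a timestamp string into numeric parts, assuming the `YYYY-MM-DD HH:MM:SS` layout seen in the rows above. `time_features` is a hypothetical helper, not part of this notebook.

```python
from datetime import datetime

def time_features(stamp):
    """Split a 'YYYY-MM-DD HH:MM:SS' string into numeric features (hypothetical helper)."""
    parsed = datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S')
    return {
        'year': parsed.year,
        'month': parsed.month,
        'hour': parsed.hour,
        'weekday': parsed.weekday(),  # Monday is 0, Sunday is 6
    }

features = time_features('2015-05-13 23:53:00')
print(features['hour'], features['weekday'])
```

It could be applied per row with something like `dt.Dates.map(time_features)`. The model in this notebook uses only `X` and `Y`.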


In [16]:
# Types of Crimes
print(dt.Category.nunique())  # Number of unique categories
dt.groupby('Category').size().sort_values(ascending=False)

27


Category
LARCENY/THEFT                  286
OTHER OFFENSES                 119
NON-CRIMINAL                   109
ASSAULT                         77
VEHICLE THEFT                   69
BURGLARY                        50
VANDALISM                       45
WARRANTS                        43
MISSING PERSON                  34
SUSPICIOUS OCC                  31
DRUG/NARCOTIC                   29
ROBBERY                         28
FRAUD                           15
SECONDARY CODES                 14
WEAPON LAWS                     12
TRESPASS                         8
FORGERY/COUNTERFEITING           6
SEX OFFENSES FORCIBLE            5
STOLEN PROPERTY                  5
KIDNAPPING                       4
DRUNKENNESS                      3
PROSTITUTION                     2
RUNAWAY                          2
FAMILY OFFENSES                  1
DRIVING UNDER THE INFLUENCE      1
DISORDERLY CONDUCT               1
ARSON                            1
dtype: int64

In [17]:
# Convert categories into numerical classes (1-based, in order of first appearance)
cat_uniques = dt.Category.unique()
cat_to_num = {cat: num for (num, cat) in enumerate(cat_uniques, 1)}
num_to_cat = {num: cat for (cat, num) in cat_to_num.items()}
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning on a sliced frame
dt = dt.assign(CatClass=dt['Category'].map(cat_to_num).astype(int))

In [7]:
def convert(dt):
    return np.array([dt.X.values, dt.Y.values, dt.CatClass.values]).T

In [8]:
# convert the dataframe into an (n_rows, 3) array of X, Y, CatClass
xy = convert(dt)

In [9]:
# cut out validation set (40% of the rows)
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_valid, y_train, y_valid = train_test_split(xy[:, :2], xy[:, 2], test_size=0.4, random_state=0)

In [10]:
# train the model
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(X_train, y_train)

In [11]:
# score the results (mean accuracy on the validation set)
forest.score(X_valid, y_valid)

0.255
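
`score` returns mean accuracy, so 0.255 is easier to judge next to a trivial baseline. In the 1000-row subset, `LARCENY/THEFT` accounts for 286 rows, so always predicting it would be right about 28.6% of the time on the subset as a whole (the share in the 400-row validation split may differ). Below is a minimal sketch of that baseline, using a hypothetical `majority_baseline` helper and a toy label list rather than the real data.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / float(len(labels))

# Toy labels for illustration only, not taken from train.csv
toy_labels = ['LARCENY/THEFT'] * 3 + ['ASSAULT'] * 2 + ['WARRANTS']
best_label, baseline_accuracy = majority_baseline(toy_labels)
print(best_label, baseline_accuracy)
```

On the notebook's data this could be called as `majority_baseline(y_valid)` and compared with `forest.score(X_valid, y_valid)`.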

In [135]:
X_test = np.array([dt_test.X.values, dt_test.Y.values]).T
y_test = forest.predict(X_test)

In [148]:
print(y_test)

[  4.   2.  10. ...,   8.   3.   3.]


In [157]:
headers = 'Id,' + ','.join(sorted(cat_uniques)) + '\n'
f = open('y_test.txt', 'w')
f.write(headers)
for i in xrange(len(y_test)):
    arr = [0] * 39
    arr[int(y_test[i])] = 1
    f.write('%s,%s\n' % (i, ','.join(map(str, arr))))
f.close()
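
The competition page states that submissions are scored with multi-class logarithmic loss, which rewards calibrated probabilities and penalizes a confident but wrong 0/1 row heavily. Below is a minimal sketch of that metric for illustration. The helper name `multiclass_log_loss`, the clipping constant `eps`, and the category count of 39 are assumptions, and this is not the official scorer.

```python
import math

def multiclass_log_loss(true_indices, prob_rows, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each row."""
    total = 0.0
    for true_index, row in zip(true_indices, prob_rows):
        p = min(max(row[true_index], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(prob_rows)

n_classes = 39  # assumed number of crime categories in the full dataset
one_hot_wrong = [1] + [0] * (n_classes - 1)  # all mass on class 0
uniform = [1.0 / n_classes] * n_classes      # equal mass on every class

print(multiclass_log_loss([1], [one_hot_wrong]))  # true class is 1, prediction says 0
print(multiclass_log_loss([1], [uniform]))
```

In this sketch the wrong one-hot row costs about 34.5 while the uniform row costs about 3.66, which suggests that writing `forest.predict_proba(X_test)` rows instead of 0/1 indicators may score better. The columns of `predict_proba` follow `forest.classes_`, so they would still need reordering to match the header.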