# San Francisco Crime

For this hackathon, we'll be using a database of crimes reported in San Francisco over a number of years.

This is a supervised learning task, with the goal of predicting the *category* of crime a given report will fall into given the date, police district, address, and longitude/latitude of the report.

In [1]:
import pandas as pd


Our training data has the following structure:

In [2]:
train = pd.read_csv('../data/sfcrime-hackathon/hackathon_train.csv')
train.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2011-12-04 18:15:00,SUSPICIOUS OCC,SUSPICIOUS OCCURRENCE,Sunday,PARK,NONE,100 Block of BEULAH ST,-122.452331,37.767356
1,2009-01-11 19:57:00,WARRANTS,ENROUTE TO OUTSIDE JURISDICTION,Sunday,MISSION,"ARREST, BOOKED",18TH ST / CAPP ST,-122.418272,37.761903
2,2007-01-25 18:15:00,NON-CRIMINAL,"AIDED CASE, MENTAL DISTURBED",Thursday,CENTRAL,NONE,1200 Block of STOCKTON ST,-122.408521,37.797492
3,2012-01-10 08:55:00,ROBBERY,"ROBBERY, BODILY FORCE",Tuesday,NORTHERN,NONE,HAYES ST / FRANKLIN ST,-122.421333,37.77709
4,2014-05-27 12:25:00,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM OF VEHICLES",Tuesday,TENDERLOIN,NONE,JONES ST / TURK ST,-122.412414,37.783004


Our test set has the following structure:

In [3]:
test = pd.read_csv('../data/sfcrime-hackathon/hackathon_test.csv', index_col='Id')
test.head()

Id,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,2015-03-02 14:05:00,Monday,MISSION,200 Block of CHURCH ST,-122.428814,37.766808
1,2004-01-23 20:00:00,Friday,BAYVIEW,0 Block of CONKLING ST,-122.401829,37.735606
2,2014-01-12 00:01:00,Sunday,TENDERLOIN,300 Block of OFARRELL ST,-122.410509,37.786043
3,2005-08-28 00:05:00,Sunday,SOUTHERN,100 Block of 3RD ST,-122.400916,37.785457
4,2007-11-03 10:00:00,Saturday,CENTRAL,500 Block of UNION ST,-122.4086,37.80046


The goal is to predict the probability that a given report has a particular category. In order to do this, each team will submit a test results csv with an `Id` column and one column for each category of crime in the training set. The values in these columns will be predictions of the probability that a report falls into the given category.

In [4]:
predictions = pd.DataFrame(
    0,
    index=test.index,
    columns=['ARSON', 'ASSAULT', 'BAD CHECKS', 'BRIBERY', 'BURGLARY',
       'DISORDERLY CONDUCT', 'DRIVING UNDER THE INFLUENCE',
       'DRUG/NARCOTIC', 'DRUNKENNESS', 'EMBEZZLEMENT', 'EXTORTION',
       'FAMILY OFFENSES', 'FORGERY/COUNTERFEITING', 'FRAUD', 'GAMBLING',
       'KIDNAPPING', 'LARCENY/THEFT', 'LIQUOR LAWS', 'LOITERING',
       'MISSING PERSON', 'NON-CRIMINAL', 'OTHER OFFENSES',
       'PORNOGRAPHY/OBSCENE MAT', 'PROSTITUTION', 'RECOVERED VEHICLE',
       'ROBBERY', 'RUNAWAY', 'SECONDARY CODES', 'SEX OFFENSES FORCIBLE',
       'SEX OFFENSES NON FORCIBLE', 'STOLEN PROPERTY', 'SUICIDE',
       'SUSPICIOUS OCC', 'TREA', 'TRESPASS', 'VANDALISM', 'VEHICLE THEFT',
       'WARRANTS', 'WEAPON LAWS']
)
predictions.head()

Id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,EMBEZZLEMENT,...,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
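
As a slightly more informative starting point than a frame of zeros, the same submission layout can be filled with each category's relative frequency in the training set. The sketch below runs on tiny stand-in frames (`toy_train` and `toy_test` are hypothetical data, not the hackathon files), and `prior_predictions` is a hypothetical helper name; with the real data you would pass `train` and `test.index` instead.

```python
import pandas as pd


def prior_predictions(train_df, test_index):
    """Give every test row the training-set frequency of each category."""
    # Relative frequency of each category, with categories in sorted order.
    priors = train_df['Category'].value_counts(normalize=True).sort_index()
    # Repeat the same probability row for every Id in the test index.
    return pd.DataFrame(
        [priors.values] * len(test_index),
        index=test_index,
        columns=priors.index,
    )


# Hypothetical toy data standing in for the hackathon files.
toy_train = pd.DataFrame(
    {'Category': ['ASSAULT', 'ROBBERY', 'ASSAULT', 'ARSON']}
)
toy_test = pd.DataFrame(
    {'PdDistrict': ['MISSION', 'PARK']},
    index=pd.Index([0, 1], name='Id'),
)

toy_predictions = prior_predictions(toy_train, toy_test.index)
print(toy_predictions)
```

Each row of the result sums to 1, so it can be read directly as a probability distribution over the categories.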


# Scoring

The results you generate will be scored with log loss (scikit-learn's `log_loss`, the metric behind the `neg_log_loss` scorer; lower is better) against the known categories of the test set:

In [5]:
from sklearn.metrics import log_loss

In [6]:
test_results = pd.read_csv('../data/sfcrime-hackathon/hackathon_test_result.csv', index_col='Id')
test_results.head()

Id,Category
0,NON-CRIMINAL
1,LARCENY/THEFT
2,LARCENY/THEFT
3,NON-CRIMINAL
4,ASSAULT


In [7]:
# Baseline: the all-zero predictions frame from above. The score equals ln(39),
# the same as a uniform guess over the 39 categories; you should be able to beat this...
log_loss(test_results, predictions, labels=predictions.columns)

3.6635616461296463
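
To make the score concrete: log loss is the average negative log of the probability assigned to the true category, so a uniform guess over K categories scores exactly ln(K). The sketch below computes this by hand on toy labels. `manual_log_loss` is a hypothetical helper, and it deliberately skips the clipping and renormalization that scikit-learn's `log_loss` applies to its inputs.

```python
import math


def manual_log_loss(true_labels, predicted_rows):
    """Average negative log-probability assigned to each true label.

    predicted_rows is a list of dicts mapping category -> probability.
    Only the probability of the true label enters the sum, and it is
    assumed to be strictly positive.
    """
    total = 0.0
    for label, row in zip(true_labels, predicted_rows):
        total += -math.log(row[label])
    return total / len(true_labels)


# Uniform guess over 39 categories: every category gets probability 1/39.
# Only the entries for the true labels are listed, since the others are unused.
n_categories = 39
uniform_row = {'ASSAULT': 1 / n_categories, 'NON-CRIMINAL': 1 / n_categories}
uniform_score = manual_log_loss(
    ['ASSAULT', 'NON-CRIMINAL', 'ASSAULT'],
    [uniform_row] * 3,
)
print(uniform_score)  # math.log(39), about 3.6636

# A confident, correct prediction scores much lower (better) ...
confident_right = manual_log_loss(
    ['ASSAULT'], [{'ASSAULT': 0.99, 'ROBBERY': 0.01}]
)
# ... while a confident, wrong one scores worse than the uniform guess.
confident_wrong = manual_log_loss(
    ['ROBBERY'], [{'ASSAULT': 0.99, 'ROBBERY': 0.01}]
)
```

The uniform score matches the 3.6636 reported above, which suggests the all-zero frame was effectively treated as a uniform guess once scikit-learn clipped and normalized it.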

## Some Ideas on Where To Start:

* Do some EDA:
    * What is in the data?
    * Has it been well cleaned? (It has.)
    * Make some charts:
        * Are there clusters of crimes geographically?
        * What about by time?
* Pre-process the data...
    * Make sure the date fields are datetimes.
    * Consider using the `StandardScaler`.
    * Consider doing some feature engineering.
    * Try PCA.
* Test a few different models with simple parameters.
    * Does one seem to be working better than the others?
    * If so, consider tuning that one more.
    * Consider a grid search.
* Explain your results and justify your conclusions.
    * Create charts that highlight your model's performance.
    * Highlight the risks associated with using your model.
        * When is it wrong, and how is it wrong?
        * Go beyond top-line accuracy...
    * A system like this has legal and political risks; consider that in your analysis.
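
For the pre-processing bullets above, one possible starting point is to parse `Dates` into datetimes and derive a few calendar features, plus a one-hot encoding of the police district. This is only a sketch on two rows copied from the training preview; the helper name `add_time_features` and the new column names (`Year`, `Month`, `Hour`, `District_*`) are assumptions, not part of the dataset.

```python
import pandas as pd


def add_time_features(df):
    """Return a copy of df with Dates parsed and simple features added."""
    out = df.copy()
    # Parse the timestamp strings into real datetimes.
    out['Dates'] = pd.to_datetime(out['Dates'])
    # Calendar features that may capture yearly, seasonal, and daily patterns.
    out['Year'] = out['Dates'].dt.year
    out['Month'] = out['Dates'].dt.month
    out['Hour'] = out['Dates'].dt.hour
    # One-hot encode the police district so models receive numeric inputs.
    district_dummies = pd.get_dummies(out['PdDistrict'], prefix='District')
    return pd.concat([out, district_dummies], axis=1)


# Two rows taken from the training preview shown earlier.
sample = pd.DataFrame({
    'Dates': ['2011-12-04 18:15:00', '2009-01-11 19:57:00'],
    'PdDistrict': ['PARK', 'MISSION'],
})
features = add_time_features(sample)
print(features[['Year', 'Month', 'Hour']])
```

When loading the real files, passing `parse_dates=['Dates']` to `pd.read_csv` is an alternative way to get datetimes from the start.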
    
## Presentation Requirements:

* This is a learning exercise, not a product pitch:
    * Highlight things that went wrong, confused you, and didn't work.
    * Most learning happens when things don't work as you initially expected.
* *Do* make charts.
* *Do* show some code.
* *Don't* ONLY show code and charts; summarize your findings with some bullet points.
    * *Don't* assume your audience is 100% engineers: executives, product managers, and other colleagues may be interested in what you are prese