This notebook imports and explores the training dataset for the Kaggle SF Crime problem:
https://www.kaggle.com/c/sf-crime

Current Exploration:
- Count unique values for each variable
- View most common values

Future Exploration:
- XY plot of longitude/latitude, with circles sized by number of crimes (and colored by category?)
- Time series plots to see how often each category occurs over time
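
Both planned explorations could start from simple aggregations. The block below is a minimal sketch, not the notebook's method: `sample` and its rows are made up and only mimic the train.csv column layout (Dates, Category, X, Y), and `plot_exploration` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical rows that only mimic the train.csv column layout.
sample = pd.DataFrame({
    "Dates": ["2011-01-01 00:01:00", "2011-01-15 13:30:00",
              "2011-02-03 09:00:00", "2011-02-20 22:10:00"],
    "Category": ["LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT", "LARCENY/THEFT"],
    "X": [-122.40, -122.40, -122.42, -122.40],
    "Y": [37.77, 37.77, 37.78, 37.77],
})

# Crimes per (X, Y) location: one row per point; "n" can drive the circle size.
location_counts = sample.groupby(["X", "Y"]).size().reset_index(name="n")

# Crimes per month and category: rows are months, columns are categories.
sample["Month"] = pd.to_datetime(sample["Dates"]).dt.to_period("M")
monthly_counts = (sample.groupby(["Month", "Category"])
                        .size()
                        .unstack(fill_value=0))


def plot_exploration(location_counts, monthly_counts):
    """Draw a bubble map and per-category monthly lines (needs matplotlib)."""
    import matplotlib.pyplot as plt

    fig, (ax_map, ax_ts) = plt.subplots(1, 2, figsize=(12, 5))
    # Circle area grows with the number of crimes at each location.
    ax_map.scatter(location_counts["X"], location_counts["Y"],
                   s=20 * location_counts["n"], alpha=0.5)
    ax_map.set_xlabel("X (longitude)")
    ax_map.set_ylabel("Y (latitude)")
    # One line per category.
    monthly_counts.plot(ax=ax_ts)
    ax_ts.set_ylabel("Crimes per month")
    return fig
```

On the real data, the same two aggregations would be applied to `data` instead of `sample`. With more than 34000 distinct coordinates, the scatter plot would probably need a smaller size factor.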



Interesting Points:
- Most crime occurs on Friday, then Wednesday. The least occurs on Sunday.
- X (longitude) and Y (latitude) have the same number of distinct values (34243). They seem to be tied to a
  fixed set of locations since, despite having many significant figures, they can still be frequency counted
- 800 Block of BRYANT ST has more than 4x as many data points as any other address. It seems to match the most frequent X and Y
- "Other Offenses" is the second most common category (126182 rows)
- The timestamps with the most crime fall on New Year's Day, then on the first day of other months. Nearly all of the top 15 are stamped 00:01.
- Note: Strange max value of Y = 90 degrees (the max of X, -120.5, also looks out of range)
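
The point above about X and Y being linked to locations can be checked by comparing distinct counts. The block below is a rough sketch under stated assumptions: `sample`, its addresses and its coordinates are made up for illustration and are not taken from train.csv.

```python
import pandas as pd

# Hypothetical rows; addresses and coordinates are placeholders.
sample = pd.DataFrame({
    "Address": ["ADDR A", "ADDR A", "ADDR B", "ADDR C"],
    "X": [-122.40, -122.40, -122.42, -122.39],
    "Y": [37.77, 37.77, 37.78, 37.79],
})

n_x = sample["X"].nunique()
n_y = sample["Y"].nunique()
n_pairs = len(sample[["X", "Y"]].drop_duplicates())

# If all three counts match, every X value occurs with exactly one Y value
# and vice versa, so the coordinates behave like location identifiers.
one_to_one = (n_x == n_y == n_pairs)

# Number of distinct coordinate pairs recorded for each address.
coords_per_address = (sample.drop_duplicates(["Address", "X", "Y"])
                            .groupby("Address")
                            .size())
```

If `coords_per_address` is mostly 1 on the real data, each address maps to a single point, which would explain why the most frequent address and the most frequent X and Y line up.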


In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import csv
import datetime

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report


In [2]:
data = pd.read_csv("/Users/cjllop/Code/MIDS/MLearning/Final/Data/train.csv")



In [4]:
# Print data length and column names
print(len(data))
print(data.columns)


878049
Index([u'Dates', u'Category', u'Descript', u'DayOfWeek', u'PdDistrict', u'Resolution', u'Address', u'X', u'Y'], dtype='object')


In [17]:
# Count distinct values for each variable:
print("There are a total of %d rows." % len(data))

for var in data.columns:
    print("There are %d distinct %s values." % (data[var].nunique(), var))


There are a total of 878049 rows.
There are 389257 distinct Dates values.
There are 39 distinct Category values.
There are 879 distinct Descript values.
There are 7 distinct DayOfWeek values.
There are 10 distinct PdDistrict values.
There are 17 distinct Resolution values.
There are 23228 distinct Address values.
There are 34243 distinct X values.
There are 34243 distinct Y values.


In [22]:
# View all values of Category, PdDistrict, Resolution, DayOfWeek
variables = ["Category", "PdDistrict", "Resolution", "DayOfWeek"]
for col in variables:
    print("-------------------------------------------------------------------------")
    print("There are a total of %d distinct %s values, as follows: " % (data[col].nunique(), col))
    print(data[col].value_counts())
    print("")

-------------------------------------------------------------------------
There are a total of 39 distinct Category values, as follows: 
LARCENY/THEFT                  174900
OTHER OFFENSES                 126182
NON-CRIMINAL                    92304
ASSAULT                         76876
DRUG/NARCOTIC                   53971
VEHICLE THEFT                   53781
VANDALISM                       44725
WARRANTS                        42214
BURGLARY                        36755
SUSPICIOUS OCC                  31414
MISSING PERSON                  25989
ROBBERY                         23000
FRAUD                           16679
FORGERY/COUNTERFEITING          10609
SECONDARY CODES                  9985
WEAPON LAWS                      8555
PROSTITUTION                     7484
TRESPASS                         7326
STOLEN PROPERTY                  4540
SEX OFFENSES FORCIBLE            4388
DISORDERLY CONDUCT               4320
DRUNKENNESS                      4280
RECOVERED VEHICLE          
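
value_counts() returns values sorted by frequency, which is convenient for Category but makes DayOfWeek harder to read. One option is to reindex the counts into calendar order. This is a minimal sketch; `sample` is a made-up frame and `weekday_order` is a name introduced here.

```python
import pandas as pd

# Hypothetical rows that only mimic the DayOfWeek column.
sample = pd.DataFrame({
    "DayOfWeek": ["Friday", "Friday", "Wednesday",
                  "Sunday", "Friday", "Wednesday"],
})

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Counts in calendar order; days missing from the data get 0.
counts = sample["DayOfWeek"].value_counts().reindex(weekday_order, fill_value=0)

# Share of all rows that fall on each weekday.
shares = counts / counts.sum()
```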

In [23]:
# View top 15 values of Dates, Descript, Address, X, Y
variables = ["Dates", "Descript", "Address", "X", "Y"]
for col in variables:
    print("-------------------------------------------------------------------------")
    print("There are a total of %d distinct %s values. The top 15 are: " % (data[col].nunique(), col))
    print(data[col].value_counts().head(15))
    print("")


-------------------------------------------------------------------------
There are a total of 389257 distinct Dates values. The top 15 are: 
2011-01-01 00:01:00    185
2006-01-01 00:01:00    136
2012-01-01 00:01:00     94
2006-01-01 12:00:00     63
2007-06-01 00:01:00     61
2006-06-01 00:01:00     58
2010-06-01 00:01:00     56
2010-08-01 00:01:00     55
2008-04-01 00:01:00     53
2013-11-01 00:01:00     52
2008-11-01 00:01:00     51
2013-05-01 00:01:00     51
2006-07-01 00:01:00     51
2010-11-01 00:01:00     51
2005-06-01 00:01:00     50
dtype: int64

-------------------------------------------------------------------------
There are a total of 879 distinct Descript values. The top 15 are: 
GRAND THEFT FROM LOCKED AUTO                 60022
LOST PROPERTY                                31729
BATTERY                                      27441
STOLEN AUTOMOBILE                            26897
DRIVERS LICENSE, SUSPENDED OR REVOKED        26839
WARRANT ARREST                            
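
Most of the top Dates values above fall on the first day of a month at 00:01. To measure how common that pattern is overall, the Dates strings can be parsed into timestamps. This is a rough sketch, assuming the "YYYY-MM-DD HH:MM:SS" format seen in the output; the rows in `sample` are made up.

```python
import pandas as pd

# Hypothetical timestamps in the same string format as the Dates column.
sample = pd.DataFrame({
    "Dates": ["2011-01-01 00:01:00", "2011-01-01 00:01:00",
              "2011-03-14 18:45:00", "2012-06-01 12:00:00"],
})

ts = pd.to_datetime(sample["Dates"])

# Boolean masks for the two patterns seen in the top-15 list.
is_first_of_month = ts.dt.day == 1
is_one_past_midnight = (ts.dt.hour == 0) & (ts.dt.minute == 1)

# The mean of a boolean mask is the fraction of rows where it is True.
share_first_of_month = is_first_of_month.mean()
share_one_past_midnight = is_one_past_midnight.mean()
```

A share of first-of-month rows well above 1/30 on the real data would suggest these timestamps are not spread evenly across the month.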

In [5]:
# Describe floats
print(data.describe())

# Note: max Y = 90 degrees is not a plausible San Francisco latitude, and
# max X = -120.5 is also far from the other longitudes.

                   X              Y
count  878049.000000  878049.000000
mean     -122.422616      37.771020
std         0.030354       0.456893
min      -122.513642      37.707879
25%      -122.432952      37.752427
50%      -122.416420      37.775421
75%      -122.406959      37.784369
max      -120.500000      90.000000
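
The max row stands apart from the rest of the distribution, so it may be worth flagging rows that fall outside a bounding box. This is a minimal sketch: `X_RANGE` and `Y_RANGE` are assumed limits chosen loosely around the min and 75% values above, not an official city boundary, and `sample` is a made-up frame.

```python
import pandas as pd

# Hypothetical rows: two ordinary points and one like the max row above.
sample = pd.DataFrame({
    "X": [-122.4226, -122.4164, -120.5],
    "Y": [37.7710, 37.7754, 90.0],
})

# Assumed bounding box (illustrative limits only).
X_RANGE = (-122.6, -122.3)
Y_RANGE = (37.6, 37.9)

# between() is inclusive on both ends by default.
in_box = sample["X"].between(*X_RANGE) & sample["Y"].between(*Y_RANGE)

outliers = sample[~in_box]   # rows to inspect
cleaned = sample[in_box]     # rows with plausible coordinates
```

Inspecting `outliers` before dropping anything seems safer, since the other columns of those rows may still be usable.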
