## San Francisco Crime Prediction Challenge - Kaggle
### Team Members: Shanti Greene, Jing Xu, Abhishek Kumar


#### Data Description

This dataset contains incidents derived from the SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training set and test set alternate every week: weeks 1, 3, 5, 7, ... belong to the test set and weeks 2, 4, 6, 8, ... belong to the training set.
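
The alternating-week split can be illustrated with a short sketch. It assumes that weeks are counted from the first date in the data (1/1/2003); the exact week boundaries used by the competition are not stated here, so `assign_split` is a hypothetical helper rather than the organizers' procedure.

```python
import pandas as pd

def assign_split(dates, start="2003-01-01"):
    """Label each timestamp 'test' (odd week) or 'train' (even week).

    Assumption: week 1 starts on `start`; the real boundaries may differ.
    """
    dates = pd.to_datetime(pd.Series(dates))
    # 1-based week number counted from the start date
    week = (dates - pd.Timestamp(start)).dt.days // 7 + 1
    return week.mod(2).map({1: "test", 0: "train"})

labels = assign_split(["2003-01-01", "2003-01-08", "2003-01-15"])
print(labels.tolist())
```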

##### train.csv / test.csv

Data fields 

 - Dates - timestamp of the crime incident 
 - Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict. 
 - Descript - detailed description of the crime incident (only in train.csv) 
 - DayOfWeek - the day of the week 
 - PdDistrict - name of the Police Department District 
 - Resolution - how the crime incident was resolved (only in train.csv) 
 - Address - the approximate street address of the crime incident  
 - X - Longitude 
 - Y - Latitude
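
As a small illustration of how these fields relate, the sketch below parses `Dates` and checks that the weekday derived from the timestamp agrees with `DayOfWeek`. The two rows are copied from the head of train.csv shown later in this notebook; the `Hour` and `DayFromDates` columns are hypothetical names introduced only for this example.

```python
import pandas as pd

# Two rows in the layout described above
sample = pd.DataFrame({
    "Dates": ["2015-05-13 23:53:00", "2015-05-13 23:33:00"],
    "DayOfWeek": ["Wednesday", "Wednesday"],
})
sample["Dates"] = pd.to_datetime(sample["Dates"])

# hypothetical derived columns, not part of the original files
sample["Hour"] = sample["Dates"].dt.hour
sample["DayFromDates"] = sample["Dates"].dt.day_name()
print(sample)
```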
 
##### Submission data (sampleSubmission.csv)

You must submit a CSV file with the incident id, all candidate class names, and a probability for each class. The order of the rows does not matter. The file must have a header row.
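
A minimal sketch of assembling a file with this layout is given here. It assumes the id column is named `Id` and uses three class names as stand-ins for the full category list; the uniform probabilities are placeholders, not model output.

```python
import numpy as np
import pandas as pd

# Assumptions: the id column is called "Id" and these three class names
# stand in for the full list of categories.
classes = ["LARCENY/THEFT", "OTHER OFFENSES", "WARRANTS"]
n_incidents = 4

# placeholder prediction: the same uniform probability for every class
probs = np.full((n_incidents, len(classes)), 1.0 / len(classes))

submission = pd.DataFrame(probs, columns=classes)
submission.insert(0, "Id", np.arange(n_incidents))

# to_csv without a path returns the CSV text, header row included
csv_text = submission.to_csv(index=False)
print(csv_text)
```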


##### Evaluation criteria

Submissions are evaluated using the multi-class logarithmic loss. Each incident has been labeled with one true class. For each incident, you must submit a set of predicted probabilities (one for every class). The formula is then,

$$\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij}),$$

where $N$ is the number of incidents in the test set, $M$ is the number of class labels, $\log$ is the natural logarithm, $y_{ij}$ is 1 if observation $i$ is in class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that observation $i$ belongs to class $j$.

The submitted probabilities for a given incident are not required to sum to one because they are rescaled prior to being scored (each row is divided by the row sum). In order to avoid the extremes of the log function, predicted probabilities are replaced with $\max(\min(p, 1 - 10^{-15}), 10^{-15})$.
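
The metric described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the scoring code used by Kaggle: `multiclass_log_loss` is a hypothetical helper, true classes are given as integer indices, and rescaling is applied before clipping.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """y_true: integer class indices, shape (N,); probs: array of shape (N, M)."""
    probs = np.asarray(probs, dtype=float)
    # rescale each row so that it sums to one
    probs = probs / probs.sum(axis=1, keepdims=True)
    # keep probabilities away from 0 and 1 to avoid log(0)
    probs = np.clip(probs, eps, 1 - eps)
    n = probs.shape[0]
    # y_ij is 1 only for the true class, so only that term survives the inner sum
    return -np.log(probs[np.arange(n), y_true]).mean()

# the second row sums to 0.8 and is rescaled to [0.25, 0.75]
loss = multiclass_log_loss([0, 1], [[0.5, 0.5], [0.2, 0.6]])
print(loss)
```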




### Import Required Packages

In [1]:
import pandas as pd
import numpy as np
import os

# Force matplotlib to not use any Xwindows backend.
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
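# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split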
from sklearn.metrics import log_loss
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import HashingVectorizer

from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

### Import Data

In [2]:
# read train and test data files
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
#submission_df = pd.read_csv('sampleSubmission.csv')
#street_df = pd.read_csv('Street_Names.csv')

### Data Exploration

In [3]:
# show head of train_df
train_df.head(3)

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414


### Data Visualization, Crime Density

In [4]:
#Create Map of crimes
import matplotlib.pyplot as plt

mapdata = np.loadtxt("sf_map_copyright_openstreetmap_contributors.txt")
plt.imshow(mapdata, cmap = plt.get_cmap('gray'))
plt.savefig('map.png')
plt.show()

In [16]:
#shell commands need a leading "!" inside a notebook cell; this checks whether an X display is set
!echo $DISPLAY

In [5]:
#Crime density plots setup

import matplotlib.pyplot as plt
import seaborn as sns


#work on a copy so the extra columns do not leak into train_df
train_m = train_df.copy()
#Get rid of the bad lat/longs: rows outside the bounds become NaN and are dropped
train_m['Xok'] = train_m[train_m.X<-121].X
train_m['Yok'] = train_m[train_m.Y<40].Y
train_m = train_m.dropna()
#select 350,000 random rows (with replacement), it gets real slow with more rows
rows = np.random.choice(train_m.index.values, 350000)
#.loc selects by index label; .ix is deprecated
sampled_df = train_m.loc[rows]
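
The coordinate filter in the cell above relies on index alignment to write NaN into rows that fail the bounds check. An equivalent and more explicit form uses `Series.where`, sketched here on a toy frame. The coordinates in the last row are an illustrative out-of-range placeholder, not values quoted from the data.

```python
import pandas as pd

# toy frame: the last row carries an out-of-range placeholder coordinate
toy = pd.DataFrame({"X": [-122.4259, -122.4244, -120.5],
                    "Y": [37.7746, 37.8004, 90.0]})

# where() keeps values that satisfy the condition and writes NaN elsewhere
toy["Xok"] = toy["X"].where(toy["X"] < -121)
toy["Yok"] = toy["Y"].where(toy["Y"] < 40)

# rows with a NaN in any column are dropped
clean = toy.dropna()
print(len(toy), "->", len(clean))
```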


In [6]:
#Crime density plots

# Supplied map bounding box:
#    ll.lon     ll.lat   ur.lon     ur.lat
#    -122.52469 37.69862 -122.33663 37.82986
mapdata = np.loadtxt("sf_map_copyright_openstreetmap_contributors.txt")
asp = mapdata.shape[0] * 1.0 / mapdata.shape[1]
lon_lat_box = (-122.5247, -122.3366, 37.699, 37.8299)
clipsize = [[-122.5247, -122.3366],[ 37.699, 37.8299]]

#Seaborn FacetGrid, split by crime Category
#a 2-D KDE needs more observations than variables, so skip categories with fewer than 3 sampled points
counts = sampled_df.Category.value_counts()
plot_df = sampled_df[sampled_df.Category.isin(counts[counts > 2].index)]
g=sns.FacetGrid(plot_df, col="Category", col_wrap=6, size=5, aspect=1/asp)

#add the background map
for ax in g.axes:
    ax.imshow(mapdata, cmap=plt.get_cmap('gray'), extent=lon_lat_box, aspect=asp)
#add the density plot
g.map(sns.kdeplot, "Xok", "Yok", clip=clipsize, cmap="Reds", shade=False)
plt.savefig('./sf_crime_density1.png')

### Data Manipulation, Make X,Y Squares and assign midpoints

In [18]:
#bin the valid coordinates into an nparts x nparts grid and return a copy of
#the dataset with the midpoint of each incident's grid square appended
def make_geosquares (dataset, nparts):
    #work on a copy so the caller's frame is untouched (avoids SettingWithCopyWarning)
    dataset = dataset.copy()
    #Get min and max of the valid lats and longs
    Xcoord = np.asarray(dataset['Xok'])
    Ycoord = np.asarray(dataset['Yok'])
    Xmin, Xmax = np.amin(Xcoord), np.amax(Xcoord)
    Ymin, Ymax = np.amin(Ycoord), np.amax(Ycoord)
    #side lengths of one grid square
    Xdist = (Xmax - Xmin)/nparts
    Ydist = (Ymax - Ymin)/nparts
    #lower edge and center point of each square along both axes
    steps = np.arange(nparts)
    Xedge = Xmin + steps * Xdist
    Yedge = Ymin + steps * Ydist
    Xmid = Xedge + Xdist / 2
    Ymid = Yedge + Ydist / 2
    #digitize returns 1-based bin numbers; subtract 1 to index the midpoint arrays
    Xbin = np.digitize(Xcoord, Xedge, right=False)
    Ybin = np.digitize(Ycoord, Yedge, right=False)
    dataset['Xmidcoord'] = Xmid[Xbin - 1]
    dataset['Ymidcoord'] = Ymid[Ybin - 1]
    return dataset
train_sq = make_geosquares(train_m,20)
#info() prints its summary and returns None, so print() also emits "None"
print(train_sq.info())


<class 'pandas.core.frame.DataFrame'>
Int64Index: 877982 entries, 0 to 878048
Data columns (total 13 columns):
Dates         877982 non-null object
Category      877982 non-null object
Descript      877982 non-null object
DayOfWeek     877982 non-null object
PdDistrict    877982 non-null object
Resolution    877982 non-null object
Address       877982 non-null object
X             877982 non-null float64
Y             877982 non-null float64
Xok           877982 non-null float64
Yok           877982 non-null float64
Xmidcoord     877982 non-null float64
Ymidcoord     877982 non-null float64
dtypes: float64(6), object(7)
memory usage: 93.8+ MB
None
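
To make the binning step concrete, here is a one-dimensional sketch of the same idea on toy numbers. `grid_midpoints` is a hypothetical helper written for this illustration; it assumes equal-width bins spanning the minimum to the maximum of the input.

```python
import numpy as np

def grid_midpoints(coords, nparts):
    """Map each coordinate to the midpoint of its bin (nparts equal-width bins)."""
    coords = np.asarray(coords, dtype=float)
    lo, hi = coords.min(), coords.max()
    width = (hi - lo) / nparts
    edges = lo + np.arange(nparts) * width   # lower edge of each bin
    mids = edges + width / 2                 # center of each bin
    # digitize is 1-based; the maximum value falls into the last bin
    bins = np.digitize(coords, edges, right=False)
    return mids[bins - 1]

# range 0..10 split into two bins: [0, 5) -> 2.5 and [5, 10] -> 7.5
mid = grid_midpoints([0.0, 1.0, 4.9, 5.0, 10.0], nparts=2)
print(mid)
```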


### Make top crime squares

In [163]:
#find the top description and category by square
def top_crime_square(dataset):
    X = np.unique(np.asarray(dataset['Xmidcoord']))
    Y = np.unique(np.asarray(dataset['Ymidcoord']))
    #np.unique returns sorted values, so searchsorted gives each midpoint's position
    xi = np.searchsorted(X, np.asarray(dataset['Xmidcoord']))
    yi = np.searchsorted(Y, np.asarray(dataset['Ymidcoord']))
    dataset['XYIndex'] = xi * len(Y) + yi
    #one row per square: X midpoint, Y midpoint, square index
    aCoord = np.empty([(len(X)*len(Y)),3])
    k = 0
    for i in range(len(X)):
        for j in range(len(Y)):
            aCoord[k] = (X[i], Y[j], k)
            k += 1
    aCategory = []
    aDescription = []
    for i in range(len(aCoord)):
        square = dataset[dataset['XYIndex'] == aCoord[i][2]]
        top_cat = square.Category.value_counts()[:1].index
        top_desc = square.Descript.value_counts()[:1].index
        #squares without incidents get an empty string
        aCategory.append(top_cat[0] if len(top_cat) else '')
        aDescription.append(top_desc[0] if len(top_desc) else '')
    CrimeSquare = pd.DataFrame(data=aCoord,columns = ["X","Y","XYIndex"])
    CrimeSquare['Category'] = aCategory
    CrimeSquare['Description'] = aDescription
    return CrimeSquare

#map the description and category by square
def viz_crime_square(dataset):
    #show background map
    mapdata = np.loadtxt("./sf_map_copyright_openstreetmap_contributors.txt")
    asp = mapdata.shape[0] * 1.0 / mapdata.shape[1]
    lon_lat_box = (-122.5247, -122.3366, 37.699, 37.8299)
    clipsize = [[-122.5247, -122.3366],[ 37.699, 37.8299]]
    g = sns.PairGrid(dataset, x_vars="X", y_vars="Y", hue="Category")
    plt.imshow(mapdata, cmap=plt.get_cmap('gray'), extent=lon_lat_box, aspect=asp)
    g.map(plt.scatter)
    plt.savefig('./testmap.png')
    
#train_sq = make_geosquares(train_m,50)
#csquare = top_crime_square(train_sq)
viz_crime_square(csquare)
#csquare.to_csv('./vizme.csv')
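
The per-square lookup can also be expressed with a pandas `groupby`, which avoids filtering the full frame once per square. This is an alternative sketch on toy data, not the function used above; note that squares without any incidents simply do not appear in the result.

```python
import pandas as pd

toy = pd.DataFrame({
    "XYIndex":  [0, 0, 0, 1, 1],
    "Category": ["WARRANTS", "LARCENY/THEFT", "LARCENY/THEFT",
                 "OTHER OFFENSES", "OTHER OFFENSES"],
})

# value_counts sorts by frequency, so index[0] is the most frequent category
top = toy.groupby("XYIndex")["Category"].agg(lambda s: s.value_counts().index[0])
print(top)
```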

### Address Suffixes or maybe sufficies

In [17]:
#extract the street-type suffix (last token) of every address
def address_suffix (dataset):
    aSuffix = []
    for address in dataset['Address']:
        tokens = str(address).lower().rsplit(" ",1)
        suf = tokens[-1]
        #if the address ends in a dangling "/", fall back to the token before it
        if suf == "/":
            suf = tokens[0].rsplit(" ",1)[-1]
        aSuffix.append(suf)
    #assign by position so a non-default index cannot misalign the values
    dataset['Suffixes'] = aSuffix
    print(set(aSuffix))
    return dataset
x = address_suffix(train_df)
x.head(5)

set(['al', 'mar', 'ex', 'av', 'cr', 'ter', 'i-80', 'ct', 'rw', 'ln', 'tr', 'stwy', 'rd', 'way', 'pl', 'hy', 'palms', 'bl', 'park', 'wk', 'i-280', 'hwy', 'ferlinghetti', 'dr', 'wy', 'bufano', 'st', 'pz'])


Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y,Suffixes
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,st
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599,st
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414,st
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873,st
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541,st
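
For comparison, the last-token extraction can be written with pandas string methods. The sketch below uses three addresses from the output above; it does not handle the trailing "/" case that `address_suffix` covers, and `is_intersection` is a hypothetical extra feature, not something computed in this notebook.

```python
import pandas as pd

addresses = pd.Series(["OAK ST / LAGUNA ST",
                       "VANNESS AV / GREENWICH ST",
                       "1500 Block of LOMBARD ST"])

# last whitespace-separated token, lower-cased
suffix = addresses.str.lower().str.split().str[-1]

# hypothetical extra feature: intersections contain a "/"
is_intersection = addresses.str.contains("/", regex=False)

print(pd.DataFrame({"Address": addresses,
                    "Suffix": suffix,
                    "Intersection": is_intersection}))
```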
