## San-Franscisco Crime Predition Challenge - Kaggle
### Team Member : Shanti Greene, Jing Xu, Abhishek Kumar


#### Data Description

This dataset contains incidents derived from SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training set and test set rotate every week, meaning week 1,3,5,7... belong to test set, week 2,4,6,8 belong to training set. 

##### train.csv / test.csv

Data fields 

 - Dates - timestamp of the crime incident 
 - Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict. 
 - Descript - detailed description of the crime incident (only in train.csv) 
 - DayOfWeek - the day of the week 
 - PdDistrict - name of the Police Department District 
 - Resolution - how the crime incident was resolved (only in train.csv) 
 - Address - the approximate street address of the crime incident  
 - X - Longitude 
 - Y - Latitude
 
##### Submission data ( sampleSubmission.csv)

You must submit a csv file with the incident id, all candidate class names, and a probability for each class. The order of the rows does not matter. The file must have a header and should look like the following:


##### evaluation criteria

Submissions are evaluated using the multi-class logarithmic loss. Each incident has been labeled with one true class. For each incident, you must submit a set of predicted probabilities (one for every class). The formula is then,

logloss=−1/N∑i=1 to N ∑ j=1 to M yijlog(pij),

where N is the number of images in the test set, M is the number of class labels, log is the natural logarithm, yij is 1 if observation i is in class j and 0 otherwise, and pij is the predicted probability that observation i belongs to class j.

The submitted probabilities for a given incident are not required to sum to one because they are rescaled prior to being scored (each row is divided by the row sum). In order to avoid the extremes of the log function, predicted probabilities are replaced with max(min(p,1−10−15),10−15).




### Import Required Packages

In [1]:
import pandas as pd
import numpy as np
import os
import math
import gc
import gzip
import re
import matplotlib.pyplot as plt


from sklearn.preprocessing import LabelEncoder
from sklearn.cross_validation import train_test_split
from sklearn.metrics import log_loss
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import IncrementalPCA

from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

### Import Data

In [2]:
# read train and test data files
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
#submission_df = pd.read_csv('sampleSubmission.csv')
#street_df = pd.read_csv('Street_Names.csv')

### Data Exploration

In [3]:
# show head of train_df
train_df.head(5)

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [4]:
def GetPropertionOfCrimeCategoryPerPdDistrcit():
    # Propertion of Crime category per district
    unique_categories = train_df['Category'].unique()
    df_prop_crime_per_pdDisrict = []
    for district in train_df['PdDistrict'].unique():   
            current_district_crime_row = []
            current_district_crime_row.append(district)
            total_crime_count = len(train_df[train_df['PdDistrict'] == district])       
            for category in unique_categories:
                current_category_crime_count =  len(train_df[(train_df['PdDistrict'] == district) & (train_df['Category'] == category)])
                proportion_category_crime_count = current_category_crime_count / float(total_crime_count)
                current_district_crime_row.append(proportion_category_crime_count)
            df_prop_crime_per_pdDisrict.append(current_district_crime_row)
    columns = ['PdDistrict'] + list('dist_' + unique_categories)
    df_prop_crime_per_pdDisrict =  pd.DataFrame(df_prop_crime_per_pdDisrict, columns=columns)
    return df_prop_crime_per_pdDisrict


df_prop_crime_per_pdDisrict = GetPropertionOfCrimeCategoryPerPdDistrcit()
df_prop_crime_per_pdDisrict.head(5)

Unnamed: 0,PdDistrict,dist_WARRANTS,dist_OTHER OFFENSES,dist_LARCENY/THEFT,dist_VEHICLE THEFT,dist_VANDALISM,dist_NON-CRIMINAL,dist_ROBBERY,dist_ASSAULT,dist_WEAPON LAWS,...,dist_EMBEZZLEMENT,dist_SUICIDE,dist_LOITERING,dist_SEX OFFENSES NON FORCIBLE,dist_EXTORTION,dist_GAMBLING,dist_BAD CHECKS,dist_TREA,dist_RECOVERED VEHICLE,dist_PORNOGRAPHY/OBSCENE MAT
0,NORTHERN,0.043677,0.116177,0.2719,0.059746,0.051322,0.09725,0.025072,0.078996,0.007493,...,0.001244,0.000636,0.001833,8.5e-05,0.000228,9.5e-05,0.000513,9e-06,0.002593,4.7e-05
1,PARK,0.047006,0.125403,0.185468,0.080364,0.052988,0.120151,0.019407,0.071279,0.007239,...,0.001014,0.000406,0.000466,0.000122,0.000162,2e-05,0.000304,0.0,0.002373,0.0
2,INGLESIDE,0.032063,0.167455,0.129824,0.113641,0.068159,0.086917,0.035361,0.108225,0.014332,...,0.000989,0.000824,0.00033,0.000279,0.000368,0.000203,0.000406,0.0,0.008396,0.0
3,BAYVIEW,0.048328,0.190683,0.113149,0.080721,0.05989,0.068198,0.030359,0.110219,0.018416,...,0.001118,0.000414,0.000559,0.000246,0.000145,0.000324,0.00038,3.4e-05,0.008219,2.2e-05
4,RICHMOND,0.022341,0.124577,0.218828,0.091066,0.07034,0.127054,0.017408,0.070827,0.007233,...,0.000951,0.000929,0.000177,0.000221,0.000509,8.8e-05,0.000686,0.0,0.002765,2.2e-05


In [None]:
# Check its performance. Not working well. Use better technique
def GetPropertionOfCrimeCategoryPerAddress():
    # Propertion of Crime category per address
    df_prop_crime_per_address = []
    for address in train_df['Address'].unique():   
            current_address_crime_row = []
            current_address_crime_row.append(address)
            total_crime_count = len(train_df[train_df['Address'] == address])       
            for category in unique_categories:
                current_category_crime_count =  len(train_df[(train_df['Address'] == address) & (train_df['Category'] == category)])
                proportion_category_crime_count = current_category_crime_count / float(total_crime_count)
                current_address_crime_row.append(proportion_category_crime_count)
            df_prop_crime_per_address.append(current_address_crime_row)
    columns = ['Address'] + list('address_' + unique_categories)
    df_prop_crime_per_address =  pd.DataFrame(df_prop_crime_per_address, columns=columns)   
    return df_prop_crime_per_address

  

In [3]:
train_df['loc'] = train_df['X'].map(lambda x: str(round(x,2))) +   train_df['Y'].map(lambda y: ' {0}'.format(str(round(y,2))))
test_df['loc'] = test_df['X'].map(lambda x: str(round(x,2))) +   test_df['Y'].map(lambda y: ' {0}'.format(str(round(y,2))))
unique_cordinates =train_df['loc'].unique().tolist()
# check how to handle one extra unique cordinates
#unique_cordinates = list(set(train_df['loc'].unique().tolist() + test_df['loc'].unique().tolist()))

print len(unique_cordinates)

139


In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 878049 entries, 0 to 878048
Data columns (total 10 columns):
Dates         878049 non-null object
Category      878049 non-null object
Descript      878049 non-null object
DayOfWeek     878049 non-null object
PdDistrict    878049 non-null object
Resolution    878049 non-null object
Address       878049 non-null object
X             878049 non-null float64
Y             878049 non-null float64
loc           878049 non-null object
dtypes: float64(2), object(8)
memory usage: 73.7+ MB


In [7]:
# show head of test_df
test_df.head(5)

Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y,loc
0,0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051,-122.4 37.74
1,1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432,-122.39 37.73
2,2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212,-122.43 37.79
3,3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412,-122.44 37.72
4,4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412,-122.44 37.72


In [8]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 884262 entries, 0 to 884261
Data columns (total 8 columns):
Id            884262 non-null int64
Dates         884262 non-null object
DayOfWeek     884262 non-null object
PdDistrict    884262 non-null object
Address       884262 non-null object
X             884262 non-null float64
Y             884262 non-null float64
loc           884262 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 60.7+ MB


In [10]:
#show info for submission_df
#print submission_df.info()

### Feature Extraction

In [4]:
unique_categories = train_df['Category'].unique()
print unique_categories

# creating features for district and dayofWeek
# for discrict
Unique_PdDistricts = train_df['PdDistrict'].unique()
PdDistrict_features = ['PdDistrict_{0}'.format(district) for district in Unique_PdDistricts]
print PdDistrict_features
# for days of week
Unique_DayOfWeek = train_df['DayOfWeek'].unique()
DayOfWeek_features = ['DayOfWeek_{0}'.format(day) for day in Unique_DayOfWeek]
# for month
month_features = ['Month_{0}'.format(month) for month in range(1,13)]
print month_features
# for year
year_features = ['year_{0}'.format(year) for year in range(2003,2016)]
print year_features
# for season 
season_features = ['season_winter','season_spring','season_summer']
print season_features
#for cordinates
cordinate_features = ['cord_{0}'.format(cord) for cord in unique_cordinates]
print cordinate_features

# for category
category_encoder = LabelEncoder()
category_encoder.fit(unique_categories) 
print category_encoder.classes_

num_address_features = 100
address_vectorizer = HashingVectorizer(decode_error='ignore', n_features=num_address_features,non_negative=True)
address_features = ['address_{0}'.format(i) for i in range(num_address_features)]
print address_features

['WARRANTS' 'OTHER OFFENSES' 'LARCENY/THEFT' 'VEHICLE THEFT' 'VANDALISM'
 'NON-CRIMINAL' 'ROBBERY' 'ASSAULT' 'WEAPON LAWS' 'BURGLARY'
 'SUSPICIOUS OCC' 'DRUNKENNESS' 'FORGERY/COUNTERFEITING' 'DRUG/NARCOTIC'
 'STOLEN PROPERTY' 'SECONDARY CODES' 'TRESPASS' 'MISSING PERSON' 'FRAUD'
 'KIDNAPPING' 'RUNAWAY' 'DRIVING UNDER THE INFLUENCE'
 'SEX OFFENSES FORCIBLE' 'PROSTITUTION' 'DISORDERLY CONDUCT' 'ARSON'
 'FAMILY OFFENSES' 'LIQUOR LAWS' 'BRIBERY' 'EMBEZZLEMENT' 'SUICIDE'
 'LOITERING' 'SEX OFFENSES NON FORCIBLE' 'EXTORTION' 'GAMBLING'
 'BAD CHECKS' 'TREA' 'RECOVERED VEHICLE' 'PORNOGRAPHY/OBSCENE MAT']
['PdDistrict_NORTHERN', 'PdDistrict_PARK', 'PdDistrict_INGLESIDE', 'PdDistrict_BAYVIEW', 'PdDistrict_RICHMOND', 'PdDistrict_CENTRAL', 'PdDistrict_TARAVAL', 'PdDistrict_TENDERLOIN', 'PdDistrict_MISSION', 'PdDistrict_SOUTHERN']
['Month_1', 'Month_2', 'Month_3', 'Month_4', 'Month_5', 'Month_6', 'Month_7', 'Month_8', 'Month_9', 'Month_10', 'Month_11', 'Month_12']
['year_2003', 'year_2004', 'year_20

In [5]:
# Processing Address Feature
def removePunctuation(text):
    '''
    function to remove punctuations
    '''
    # create a regex for punctuations
    punct = re.compile(r'([^A-Za-z0-9 ])')
    # replace the punctuation with empty space
    return punct.sub(" ", text)

# removing punctuation from address values
train_df['Address'] = train_df['Address'].map(lambda x : removePunctuation(x))
test_df['Address'] = test_df['Address'].map(lambda x : removePunctuation(x))


# count vectorizer for address
vectorizer = CountVectorizer(stop_words='english')
train_address_features = vectorizer.fit_transform(train_df['Address'].values)
test_adddress_features = vectorizer.transform(test_df['Address'].values)
print len(vectorizer.vocabulary_)

2130


In [6]:
def ProcessData(df, datatype):
    # set the correct type for dates column
    df['Dates'] = df['Dates'].astype('datetime64[ns]')   
    # adding Features
    df = addFeatures(df)
       
    # drop columns not needed now
    if datatype == 'train':
        df = df.drop(['Descript','Resolution','Address', 'Dates','PdDistrict','DayOfWeek'], axis=1)
    if datatype == 'test':
        df = df.drop(['Address','Dates','PdDistrict','DayOfWeek'], axis=1)
    return df
    
def addFeatures(df):   
    #df  = processPdDiscrictCrimePropertion(df)   # not working well.
    df = processPdDiscrict(df)  
    df = processDayOfWeek(df)
    df = processCordinates(df)
    #df = processAddress(df) # not working well. use some some other technique to extract features
    df = addHourOfCrime(df)
    df = addMonthOfCrime(df)
    df = addYearOfCrime(df)
    df = addSeasonOfCrime(df)
    return df
    
def processPdDiscrictCrimePropertion(df):    
    df  = pd.merge(df, df_prop_crime_per_pdDisrict, on='PdDistrict',how='left') 
    return df


def processPdDiscrict(df):   
    new_PdDistrict_df = pd.get_dummies(df['PdDistrict'], prefix='PdDistrict')   
    new_PdDistrict_df = new_PdDistrict_df[PdDistrict_features]
    df  = pd.concat([df, new_PdDistrict_df], axis=1)   
    return df


def processCordinates(df):   
    new_cord_df = pd.get_dummies(df['loc'], prefix='cord')   
    new_cord_df = new_cord_df[cordinate_features]
    df  = pd.concat([df, new_cord_df], axis=1)   
    return df

def processDayOfWeek(df):   
    new_DayOfWeek_df = pd.get_dummies(df['DayOfWeek'], prefix='DayOfWeek')   
    new_DayOfWeek_df = new_DayOfWeek_df[DayOfWeek_features]
    df  = pd.concat([df, new_DayOfWeek_df], axis=1)    
    return df
    
def addHourOfCrime(df):      
    df['HourOfCrime'] = df['Dates'].map(lambda d: d.hour + d.minute / 60.)  
    return df

def addMonthOfCrime(df):      
    #df['MonthOfCrime'] = df['Dates'].map(lambda d: d.month)  
    new_month_df = pd.get_dummies(df['Dates'].map(lambda d: d.month) , prefix='Month')   
    new_month_df = new_month_df[month_features]
    df  = pd.concat([df, new_month_df], axis=1)    
    return df

def addYearOfCrime(df):          
    new_year_df = pd.get_dummies(df['Dates'].map(lambda d: d.year) , prefix='year')   
    new_year_df = new_year_df[year_features]
    df  = pd.concat([df, new_year_df], axis=1)    
    return df

# March - June "Spring", July - October "Summer", November - February "Winter"
def addSeasonOfCrime(df):        
    new_month_df = pd.get_dummies(df['Dates'].map(lambda d: GetSeason(d.month)) , prefix='season')   
    new_month_df = new_month_df[season_features]   
    df  = pd.concat([df, new_month_df], axis=1)    
    return df
def GetSeason(month):
    if month == 3 or month == 4 or month == 5 or month == 6:
        return 'spring'
    if month == 7 or month == 8 or month == 9 or month == 10:
        return 'summer'
    if month == 11 or month == 12 or month == 1 or month == 2:
        return 'winter'

def processAddress(df):   
    address_hashed = address_vectorizer.fit_transform(df['Address'])
    new_address_df = pd.SparseDataFrame([ pd.SparseSeries(address_hashed[i].toarray().ravel()) 
                                   for i in np.arange(address_hashed.shape[0]) ])
    new_address_df.columns = address_features
    df  = pd.concat([df, new_address_df], axis=1)  
    return df

In [7]:
final_train_df = ProcessData(train_df, 'train')
final_train_df['Category'] = category_encoder.transform(final_train_df['Category'])  
final_test_df = ProcessData(test_df, 'test')

In [8]:
# removing variables from the memory which are not required further
del train_df
del test_df
gc.collect() # garbage collection

58

In [9]:
print final_train_df.info()
print final_test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 878049 entries, 0 to 878048
Columns: 189 entries, Category to season_summer
dtypes: float64(187), int64(1), object(1)
memory usage: 1.2+ GB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 884262 entries, 0 to 884261
Columns: 189 entries, Id to season_summer
dtypes: float64(187), int64(1), object(1)
memory usage: 1.3+ GB
None


### Creating Final Data 

In [19]:
#input features
to_be_scaled_features = []
#to_be_scaled_features +=  list('dist_' + unique_categories)
to_be_scaled_features +=  ['HourOfCrime']
no_scale_features = []
no_scale_features += PdDistrict_features 
no_scale_features += DayOfWeek_features 
no_scale_features += month_features 
no_scale_features += year_features 
no_scale_features += cordinate_features
no_scale_features += season_features
inputFeatures = []
inputFeatures = to_be_scaled_features + no_scale_features
#output features
ouptutFeatures = ['Category']

In [20]:
len(inputFeatures)

185

In [21]:
X = final_train_df[inputFeatures].values
Y = final_train_df[ouptutFeatures].values


In [12]:
# removing variables from the memory which are not required further
del final_train_df
gc.collect() # garbage collection

21

In [22]:
# append count vectorizer features for address
X = np.concatenate((X,  train_address_features.toarray()), axis=1)
print X.shape

(878049L, 2315L)


In [23]:
#  scaling data
scaler = StandardScaler()
#scaler = MinMaxScaler()
X[:,:len(to_be_scaled_features)] = scaler.fit_transform(X[:,:len(to_be_scaled_features)])
#X = scaler.fit_transform(X)

In [28]:
# removing variables from the memory which are not required further
del train_address_features
gc.collect() # garbage collection

0

In [24]:
def split_dataset(X,Y, train_size= 0.8):
    shuffle = np.random.permutation(np.arange(X.shape[0]))
    X, Y = X[shuffle], Y[shuffle]
    train_index = int(math.floor(X.shape[0] * train_size))    
    train_data, train_labels = X[:train_index], Y[:train_index]
    dev_data, dev_labels = X[train_index + 1:], Y[train_index + 1:]
    return train_data, dev_data, train_labels, dev_labels


In [25]:
train_data, dev_data, train_labels, dev_labels = split_dataset(X,Y, train_size = 0.8)
print train_data.shape, train_labels.shape
print dev_data.shape, dev_labels.shape
print len(np.unique(train_labels)), len(np.unique(dev_labels))

(702439L, 2315L) (702439L, 1L)
(175609L, 2315L) (175609L, 1L)
39 39


In [31]:
# removing variables from the memory which are not required further
del X
del Y
gc.collect() # garbage collection

0

### Model Building and Training

In [26]:
# batch training of random forest
def getRows(rows, data, labels):
    return data[rows], labels[rows].ravel()

def iter_minibatches(chunksize, data, labels):
    numtrainingpoints = len(data)     
    chunkstartmarker = 0
    while chunkstartmarker < numtrainingpoints:
        start = chunkstartmarker
        if start + chunksize < numtrainingpoints:
            end = chunkstartmarker + chunksize
        else:
            end = numtrainingpoints
        chunkrows = range(start,end)       
        X_chunk, y_chunk = getRows(chunkrows, data, labels)        
        yield X_chunk, y_chunk
        chunkstartmarker += chunksize       
        
def train_model(data, labels):
    model = batch_SGDClassifier(data, labels)
    #model = MultinomialNBClassifier(data, labels)
    #model = batch_RandomForest(data, labels)
    return model
    
def batch_RandomForest(data, labels):
    model = RandomForestClassifier(n_estimators=100,criterion='entropy',max_depth=5)
    model.fit(data, labels)
    return model  
    
def MultinomialNBClassifier(data, labels):
    model = MultinomialNB()
    model.fit(data,labels)
    return model

def batch_SGDClassifier(data,labels):
    chunk_size = 1000
    batcherator = iter_minibatches(chunk_size, data, labels)
    model = SGDClassifier(n_jobs=-1,alpha=0.00005,n_iter = 50, loss='log')
    # Train model on each chunk
    chunk_count = 1
    for X_chunk, y_chunk in batcherator:
               
        model.partial_fit(X_chunk, y_chunk, classes=np.array(range(len(category_encoder.classes_))))
        score = model.score(dev_data, dev_labels)
        probs = model.predict_proba(dev_data)
        log_loss_value = log_loss(dev_labels, probs)
        print 'training using chunk : {0} validation score : {1:.4f} log-loss : {2:.5f}'.format(chunk_count,score,log_loss_value) 
        chunk_count += 1
    return model  

In [27]:
model = train_model(train_data, train_labels.ravel())

# round up the results
probs = np.around(model.predict_proba(dev_data), decimals=5)
print 'Final log loss score : {0:.4f} accuracy : {1:.4f}'.format(log_loss(dev_labels, probs), model.score(dev_data, dev_labels))

training using chunk : 1 validation score : 0.1422 log-loss : 24.48373
training using chunk : 2 validation score : 0.1137 log-loss : 24.30079
training using chunk : 3 validation score : 0.1578 log-loss : 20.48776
training using chunk : 4 validation score : 0.1046 log-loss : 22.26809
training using chunk : 5 validation score : 0.1102 log-loss : 19.66606
training using chunk : 6 validation score : 0.1012 log-loss : 17.37252
training using chunk : 7 validation score : 0.1349 log-loss : 14.40994
training using chunk : 8 validation score : 0.1345 log-loss : 14.03948
training using chunk : 9 validation score : 0.1391 log-loss : 13.09857
training using chunk : 10 validation score : 0.1073 log-loss : 12.00916
training using chunk : 11 validation score : 0.1286 log-loss : 9.13713
training using chunk : 12 validation score : 0.1398 log-loss : 10.41530
training using chunk : 13 validation score : 0.1293 log-loss : 9.61346
training using chunk : 14 validation score : 0.1409 log-loss : 8.19500
trai

In [None]:
#clf = LogisticRegression(C=100.0)
#clf = GBC(n_estimators=10, max_depth=5,verbose=1)
#clf = MultinomialNB()

#clf = RandomForestClassifier(n_estimators=100)
#clf = SGDClassifier(fit_intercept=False, shuffle=True, n_jobs=-1,alpha=0.000005,n_iter = 50, loss='log', penalty ='l2')
   
#clf.fit(train_data, train_labels)
#print clf.predict_proba(dev_data).shape
# round up the results
#probs = np.around(clf.predict_proba(dev_data), decimals=5)
#print 'log loss score : {0:.4f} accuracy : {1:.4f}'.format(log_loss(dev_labels, probs), clf.score(dev_data, dev_labels))

### Evaluation

In [28]:
del train_data
del train_labels
gc.collect()

14

In [30]:
# train on complete data
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]
model = train_model(X,Y.ravel())


training using chunk : 1 validation score : 0.1130 log-loss : 26.15099
training using chunk : 2 validation score : 0.1533 log-loss : 21.82010
training using chunk : 3 validation score : 0.1379 log-loss : 23.44579
training using chunk : 4 validation score : 0.0882 log-loss : 21.63701
training using chunk : 5 validation score : 0.1496 log-loss : 20.35951
training using chunk : 6 validation score : 0.0995 log-loss : 18.50244
training using chunk : 7 validation score : 0.1725 log-loss : 13.00564
training using chunk : 8 validation score : 0.1548 log-loss : 12.63484
training using chunk : 9 validation score : 0.1388 log-loss : 12.17220
training using chunk : 10 validation score : 0.1139 log-loss : 11.70642
training using chunk : 11 validation score : 0.1586 log-loss : 9.29018
training using chunk : 12 validation score : 0.1550 log-loss : 10.14502
training using chunk : 13 validation score : 0.1116 log-loss : 9.31017
training using chunk : 14 validation score : 0.1448 log-loss : 8.29448
trai

In [32]:
#del X
#del Y
#del final_train_df
#del dev_data
#del train_data
gc.collect()

1535

In [33]:
# preparate test data
test_data = final_test_df[inputFeatures].values
test_data[:,:len(to_be_scaled_features)] = scaler.transform(test_data[:,:len(to_be_scaled_features)])
test_data = np.concatenate((test_data,  test_adddress_features.toarray()), axis=1)
print test_data.shape

(884262L, 2315L)


In [34]:
#evaluate final probability
submit_output = model.predict_proba(test_data)
# round up the results
submit_output = np.around(submit_output, decimals=10)
print submit_output.shape

(884262L, 39L)


### Submission Preperation

In [35]:
result = np.c_[final_test_df['Id'].astype(int), submit_output.astype(float)]
print result.shape
outputColumns =  ['Id'] + list( category_encoder.classes_)
df_result = pd.DataFrame(result, columns=outputColumns)
df_result['Id'] = df_result['Id'].astype(int)
print df_result.info()

(884262L, 40L)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 884262 entries, 0 to 884261
Data columns (total 40 columns):
Id                             884262 non-null int32
ARSON                          884262 non-null float64
ASSAULT                        884262 non-null float64
BAD CHECKS                     884262 non-null float64
BRIBERY                        884262 non-null float64
BURGLARY                       884262 non-null float64
DISORDERLY CONDUCT             884262 non-null float64
DRIVING UNDER THE INFLUENCE    884262 non-null float64
DRUG/NARCOTIC                  884262 non-null float64
DRUNKENNESS                    884262 non-null float64
EMBEZZLEMENT                   884262 non-null float64
EXTORTION                      884262 non-null float64
FAMILY OFFENSES                884262 non-null float64
FORGERY/COUNTERFEITING         884262 non-null float64
FRAUD                          884262 non-null float64
GAMBLING                       884262 non-null floa

In [36]:
df_result.to_csv('08-SFCrimeMIDSChallengerTeam.csv', index=False,header=True)