## San Francisco Crime Prediction Challenge - Kaggle
### Team Members: Shanti Greene, Jing Xu, Abhishek Kumar


#### Data Description

This dataset contains incidents derived from the SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training set and test set alternate by week: weeks 1, 3, 5, 7, ... belong to the test set, and weeks 2, 4, 6, 8, ... belong to the training set.

##### train.csv / test.csv

Data fields 

 - Dates - timestamp of the crime incident 
 - Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict. 
 - Descript - detailed description of the crime incident (only in train.csv) 
 - DayOfWeek - the day of the week 
 - PdDistrict - name of the Police Department District 
 - Resolution - how the crime incident was resolved (only in train.csv) 
 - Address - the approximate street address of the crime incident  
 - X - Longitude 
 - Y - Latitude
 
##### Submission data (sampleSubmission.csv)

You must submit a csv file with the incident id, all candidate class names, and a probability for each class. The order of the rows does not matter. The file must have a header and should look like the following:


##### Evaluation criteria

Submissions are evaluated using the multi-class logarithmic loss. Each incident has been labeled with one true class. For each incident, you must submit a set of predicted probabilities (one for every class). The formula is then,

$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij}),$$

where $N$ is the number of incidents in the test set, $M$ is the number of class labels, $\log$ is the natural logarithm, $y_{ij}$ is 1 if observation $i$ is in class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that observation $i$ belongs to class $j$.

The submitted probabilities for a given incident are not required to sum to one because they are rescaled prior to being scored (each row is divided by the row sum). In order to avoid the extremes of the log function, predicted probabilities are replaced with $\max(\min(p, 1 - 10^{-15}), 10^{-15})$.
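The scoring rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not Kaggle's scorer: the function name `multiclass_log_loss` and its argument layout (integer class indices plus one probability list per incident) are assumptions made for the example.

```python
import math

def multiclass_log_loss(true_labels, prob_rows, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    true_labels: integer class index for each incident (assumed layout).
    prob_rows:   one list of per-class scores for each incident.
    """
    total = 0.0
    for label, row in zip(true_labels, prob_rows):
        # Rescale the row so it sums to one, as described above.
        p = row[label] / float(sum(row))
        # Clip away from 0 and 1 to avoid the extremes of the log function.
        p = max(min(p, 1 - eps), eps)
        total += math.log(p)
    return -total / len(true_labels)
```

Only the probability given to the true class contributes to the sum, because $y_{ij}$ is zero for every other class. A uniform guess over two classes therefore scores $\ln 2$, and a confident wrong answer is capped at $-\ln(10^{-15})$ by the clipping.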




### Import Required Packages

In [1]:
import pandas as pd
import numpy as np
import os
import math
import gc
import gzip
import re
import matplotlib.pyplot as plt
%matplotlib inline

from scipy.optimize import minimize
from sklearn.preprocessing import LabelEncoder
from sklearn.cross_validation import train_test_split, StratifiedShuffleSplit
from sklearn.metrics import log_loss, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA, IncrementalPCA


from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

#from nolearn.dbn import DBN

### Import Data

In [3]:
# read train and test data files
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
#submission_df = pd.read_csv('sampleSubmission.csv')
#street_df = pd.read_csv('Street_Names.csv')

### Data Exploration

In [3]:
# show head of train_df
train_df.head(5)

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


In [5]:
train_df['loc'] = train_df['X'].map(lambda x: str(round(x,2))) +   train_df['Y'].map(lambda y: ' {0}'.format(str(round(y,2))))
test_df['loc'] = test_df['X'].map(lambda x: str(round(x,2))) +   test_df['Y'].map(lambda y: ' {0}'.format(str(round(y,2))))
unique_cordinates =train_df['loc'].unique().tolist()
# check how to handle one extra unique coordinate
#unique_cordinates = list(set(train_df['loc'].unique().tolist() + test_df['loc'].unique().tolist()))

print len(unique_cordinates)

139
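The `loc` column above groups incidents into grid cells by rounding longitude and latitude to two decimals (0.01 degrees of latitude is roughly 1 km). A small standalone sketch of the same keying, with `loc_key` as a hypothetical helper name:

```python
def loc_key(x, y, decimals=2):
    """Build a 'longitude latitude' grid-cell key by rounding both values."""
    return str(round(x, decimals)) + ' ' + str(round(y, decimals))

def unique_cells(points, decimals=2):
    """Return the sorted set of grid-cell keys for (x, y) pairs."""
    return sorted(set(loc_key(x, y, decimals) for x, y in points))
```

For instance, `loc_key(-122.399588, 37.735051)` gives `'-122.4 37.74'`, matching the first row of the `test_df` output in this notebook. A coarser or finer grid only needs a different `decimals` value.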


In [5]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 878049 entries, 0 to 878048
Data columns (total 10 columns):
Dates         878049 non-null object
Category      878049 non-null object
Descript      878049 non-null object
DayOfWeek     878049 non-null object
PdDistrict    878049 non-null object
Resolution    878049 non-null object
Address       878049 non-null object
X             878049 non-null float64
Y             878049 non-null float64
loc           878049 non-null object
dtypes: float64(2), object(8)
memory usage: 73.7+ MB


In [6]:
# show head of test_df
test_df.head(5)

Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y,loc
0,0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051,-122.4 37.74
1,1,2015-05-10 23:51:00,Sunday,BAYVIEW,3RD ST / REVERE AV,-122.391523,37.732432,-122.39 37.73
2,2,2015-05-10 23:50:00,Sunday,NORTHERN,2000 Block of GOUGH ST,-122.426002,37.792212,-122.43 37.79
3,3,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412,-122.44 37.72
4,4,2015-05-10 23:45:00,Sunday,INGLESIDE,4700 Block of MISSION ST,-122.437394,37.721412,-122.44 37.72


In [6]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 884262 entries, 0 to 884261
Data columns (total 8 columns):
Id            884262 non-null int64
Dates         884262 non-null object
DayOfWeek     884262 non-null object
PdDistrict    884262 non-null object
Address       884262 non-null object
X             884262 non-null float64
Y             884262 non-null float64
loc           884262 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 60.7+ MB


In [10]:
#show info for submission_df
#print submission_df.info()

### Feature Extraction

In [6]:
unique_categories = train_df['Category'].unique()
print unique_categories

# creating features for district and day of week
# for district
Unique_PdDistricts = train_df['PdDistrict'].unique()
PdDistrict_features = ['PdDistrict_{0}'.format(district) for district in Unique_PdDistricts]
print PdDistrict_features
# for days of week
Unique_DayOfWeek = train_df['DayOfWeek'].unique()
DayOfWeek_features = ['DayOfWeek_{0}'.format(day) for day in Unique_DayOfWeek]
# for month
month_features = ['Month_{0}'.format(month) for month in range(1,13)]
print month_features
# for year
year_features = ['year_{0}'.format(year) for year in range(2003,2016)]
print year_features
# for season 
season_features = ['season_winter','season_spring','season_summer']
print season_features
# for coordinates
cordinate_features = ['cord_{0}'.format(cord) for cord in unique_cordinates]
print cordinate_features

# for category
category_encoder = LabelEncoder()
category_encoder.fit(unique_categories) 
print category_encoder.classes_

num_address_features = 100
address_vectorizer = HashingVectorizer(decode_error='ignore', n_features=num_address_features,non_negative=True)
address_features = ['address_{0}'.format(i) for i in range(num_address_features)]
print address_features

['WARRANTS' 'OTHER OFFENSES' 'LARCENY/THEFT' 'VEHICLE THEFT' 'VANDALISM'
 'NON-CRIMINAL' 'ROBBERY' 'ASSAULT' 'WEAPON LAWS' 'BURGLARY'
 'SUSPICIOUS OCC' 'DRUNKENNESS' 'FORGERY/COUNTERFEITING' 'DRUG/NARCOTIC'
 'STOLEN PROPERTY' 'SECONDARY CODES' 'TRESPASS' 'MISSING PERSON' 'FRAUD'
 'KIDNAPPING' 'RUNAWAY' 'DRIVING UNDER THE INFLUENCE'
 'SEX OFFENSES FORCIBLE' 'PROSTITUTION' 'DISORDERLY CONDUCT' 'ARSON'
 'FAMILY OFFENSES' 'LIQUOR LAWS' 'BRIBERY' 'EMBEZZLEMENT' 'SUICIDE'
 'LOITERING' 'SEX OFFENSES NON FORCIBLE' 'EXTORTION' 'GAMBLING'
 'BAD CHECKS' 'TREA' 'RECOVERED VEHICLE' 'PORNOGRAPHY/OBSCENE MAT']
['PdDistrict_NORTHERN', 'PdDistrict_PARK', 'PdDistrict_INGLESIDE', 'PdDistrict_BAYVIEW', 'PdDistrict_RICHMOND', 'PdDistrict_CENTRAL', 'PdDistrict_TARAVAL', 'PdDistrict_TENDERLOIN', 'PdDistrict_MISSION', 'PdDistrict_SOUTHERN']
['Month_1', 'Month_2', 'Month_3', 'Month_4', 'Month_5', 'Month_6', 'Month_7', 'Month_8', 'Month_9', 'Month_10', 'Month_11', 'Month_12']
['year_2003', 'year_2004', 'year_20

In [7]:
# Processing Address Feature
def removePunctuation(text):
    '''
    function to remove punctuations
    '''
    # create a regex for punctuations
    punct = re.compile(r'([^A-Za-z0-9 ])')
    # replace the punctuation with empty space
    return punct.sub(" ", text)

# removing punctuation from address values
train_df['Address'] = train_df['Address'].map(lambda x : removePunctuation(x))
test_df['Address'] = test_df['Address'].map(lambda x : removePunctuation(x))


# count vectorizer for address
vectorizer = CountVectorizer(stop_words='english')
train_address_features = vectorizer.fit_transform(train_df['Address'].values)
test_adddress_features = vectorizer.transform(test_df['Address'].values)
print len(vectorizer.vocabulary_)

2130
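To see what the address vectorizer is counting, the following sketch tokenizes addresses with the standard library only. It is an approximation under stated assumptions: it reuses the punctuation regex from the cell above and the default `CountVectorizer` token pattern (two or more word characters, lowercased), but it does not apply the English stop-word list, so words such as "of" are kept here. The names `address_tokens` and `build_vocabulary` are hypothetical.

```python
import re
from collections import Counter

PUNCT = re.compile(r'[^A-Za-z0-9 ]')
TOKEN = re.compile(r'\b\w\w+\b')  # tokens of two or more word characters

def remove_punctuation(text):
    """Replace every non-alphanumeric, non-space character with a space."""
    return PUNCT.sub(' ', text)

def address_tokens(address):
    """Lowercase the cleaned address and split it into tokens."""
    return TOKEN.findall(remove_punctuation(address).lower())

def build_vocabulary(addresses):
    """Count how often each token occurs across all addresses."""
    counts = Counter()
    for address in addresses:
        counts.update(address_tokens(address))
    return counts
```

Intersections such as `OAK ST / LAGUNA ST` lose the slash and contribute both street names, so the resulting bag-of-words columns mostly identify individual streets and block numbers.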


In [8]:
def ProcessData(df, datatype):
    # set the correct type for dates column
    df['Dates'] = df['Dates'].astype('datetime64[ns]')   
    # adding Features
    df = addFeatures(df)
       
    # drop columns not needed now
    if datatype == 'train':
        df = df.drop(['Descript','Resolution','Address', 'Dates','PdDistrict','DayOfWeek'], axis=1)
    if datatype == 'test':
        df = df.drop(['Address','Dates','PdDistrict','DayOfWeek'], axis=1)
    return df
    
def addFeatures(df):   
    #df  = processPdDiscrictCrimePropertion(df)   # not working well.
    df = processPdDiscrict(df)  
    df = processDayOfWeek(df)
    df = processCordinates(df)
    #df = processAddress(df) # not working well; use some other technique to extract features
    df = addHourOfCrime(df)
    df = addMonthOfCrime(df)
    df = addYearOfCrime(df)
    df = addSeasonOfCrime(df)
    return df
    
def processPdDiscrictCrimePropertion(df):    
    df  = pd.merge(df, df_prop_crime_per_pdDisrict, on='PdDistrict',how='left') 
    return df


def processPdDiscrict(df):   
    new_PdDistrict_df = pd.get_dummies(df['PdDistrict'], prefix='PdDistrict')   
    new_PdDistrict_df = new_PdDistrict_df[PdDistrict_features]
    df  = pd.concat([df, new_PdDistrict_df], axis=1)   
    return df


def processCordinates(df):   
    new_cord_df = pd.get_dummies(df['loc'], prefix='cord')   
    new_cord_df = new_cord_df[cordinate_features]
    df  = pd.concat([df, new_cord_df], axis=1)   
    return df

def processDayOfWeek(df):   
    new_DayOfWeek_df = pd.get_dummies(df['DayOfWeek'], prefix='DayOfWeek')   
    new_DayOfWeek_df = new_DayOfWeek_df[DayOfWeek_features]
    df  = pd.concat([df, new_DayOfWeek_df], axis=1)    
    return df
    
def addHourOfCrime(df):      
    df['HourOfCrime'] = df['Dates'].map(lambda d: d.hour + d.minute / 60.)  
    return df

def addMonthOfCrime(df):      
    #df['MonthOfCrime'] = df['Dates'].map(lambda d: d.month)  
    new_month_df = pd.get_dummies(df['Dates'].map(lambda d: d.month) , prefix='Month')   
    new_month_df = new_month_df[month_features]
    df  = pd.concat([df, new_month_df], axis=1)    
    return df

def addYearOfCrime(df):          
    new_year_df = pd.get_dummies(df['Dates'].map(lambda d: d.year) , prefix='year')   
    new_year_df = new_year_df[year_features]
    df  = pd.concat([df, new_year_df], axis=1)    
    return df

# March - June "Spring", July - October "Summer", November - February "Winter"
def addSeasonOfCrime(df):        
    new_month_df = pd.get_dummies(df['Dates'].map(lambda d: GetSeason(d.month)) , prefix='season')   
    new_month_df = new_month_df[season_features]   
    df  = pd.concat([df, new_month_df], axis=1)    
    return df
def GetSeason(month):
    if month == 3 or month == 4 or month == 5 or month == 6:
        return 'spring'
    if month == 7 or month == 8 or month == 9 or month == 10:
        return 'summer'
    if month == 11 or month == 12 or month == 1 or month == 2:
        return 'winter'

def processAddress(df):   
    address_hashed = address_vectorizer.fit_transform(df['Address'])
    new_address_df = pd.SparseDataFrame([ pd.SparseSeries(address_hashed[i].toarray().ravel()) 
                                   for i in np.arange(address_hashed.shape[0]) ])
    new_address_df.columns = address_features
    df  = pd.concat([df, new_address_df], axis=1)  
    return df
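The `process*` helpers above one-hot encode a column and then select a fixed list of feature columns, which can raise a `KeyError` in pandas when one of the expected dummy columns is missing from the frame being processed. The sketch below shows the same idea in plain Python with a fixed category list, where unseen values simply become all-zero rows. It is an illustration only, and `one_hot_rows` is a hypothetical helper, not part of the notebook.

```python
def one_hot_rows(values, categories, prefix):
    """One-hot encode `values` against a fixed, ordered category list.

    Returns (column_names, rows). Values not in `categories` map to an
    all-zero row, and categories absent from `values` still get a column.
    """
    columns = ['{0}_{1}'.format(prefix, c) for c in categories]
    position = dict((c, i) for i, c in enumerate(categories))
    rows = []
    for value in values:
        row = [0] * len(categories)
        if value in position:
            row[position[value]] = 1
        rows.append(row)
    return columns, rows
```

Fixing the category list from the training data keeps the train and test matrices aligned column for column, which the later `np.concatenate` and PCA steps rely on.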

In [9]:
final_train_df = ProcessData(train_df, 'train')
final_train_df['Category'] = category_encoder.transform(final_train_df['Category'])  
final_test_df = ProcessData(test_df, 'test')

In [10]:
# removing variables from the memory which are not required further
del train_df
del test_df
gc.collect() # garbage collection

58

In [11]:
print final_train_df.info()
print final_test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 878049 entries, 0 to 878048
Columns: 189 entries, Category to season_summer
dtypes: float64(187), int64(1), object(1)
memory usage: 1.2+ GB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 884262 entries, 0 to 884261
Columns: 189 entries, Id to season_summer
dtypes: float64(187), int64(1), object(1)
memory usage: 1.3+ GB
None


### Creating Final Data 

In [12]:
#input features
to_be_scaled_features = []
#to_be_scaled_features +=  list('dist_' + unique_categories)
to_be_scaled_features +=  ['HourOfCrime']
no_scale_features = []
no_scale_features += PdDistrict_features 
no_scale_features += DayOfWeek_features 
no_scale_features += month_features 
no_scale_features += year_features 
no_scale_features += cordinate_features
no_scale_features += season_features
inputFeatures = []
inputFeatures = to_be_scaled_features + no_scale_features
#output features
ouptutFeatures = ['Category']

In [13]:
len(inputFeatures)

185

In [14]:
X = final_train_df[inputFeatures].values
Y = final_train_df[ouptutFeatures].values


In [15]:
# removing variables from the memory which are not required further
del final_train_df
gc.collect() # garbage collection

1556

In [16]:
# append count vectorizer features for address
X = np.concatenate((X,  train_address_features.toarray()), axis=1)
print X.shape

(878049L, 2315L)


In [17]:
#  scaling data
#scaler = StandardScaler()
#scaler = MinMaxScaler()
#X[:,:len(to_be_scaled_features)] = scaler.fit_transform(X[:,:len(to_be_scaled_features)])
#X = scaler.fit_transform(X)

In [18]:
# removing variables from the memory which are not required further
del train_address_features
gc.collect() # garbage collection

0

In [19]:
sss = StratifiedShuffleSplit(Y, test_size=0.1, random_state=1234)
for train_index, dev_index in sss:
    break
train_data, train_labels = X[train_index], Y[train_index]
dev_data, dev_labels = X[dev_index], Y[dev_index]
print train_data.shape, train_labels.shape
print dev_data.shape, dev_labels.shape
print len(np.unique(train_labels)), len(np.unique(dev_labels))

(790240L, 2315L) (790240L, 1L)
(87809L, 2315L) (87809L, 1L)
39 39
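The split above holds out 10% of the rows while keeping all 39 classes in both parts. The sketch below shows one simple way to get that property in plain Python. It is not scikit-learn's algorithm: it shuffles each class separately and moves a fixed fraction (at least one row) of each class to the dev set, and the name `stratified_split` is an assumption for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.1, seed=1234):
    """Split row indices so each class keeps roughly `test_size` in dev."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    train_idx, dev_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        # At least one row per class goes to dev, so a class with a
        # single row would end up in dev only.
        n_dev = max(1, int(round(len(idx) * test_size)))
        dev_idx.extend(idx[:n_dev])
        train_idx.extend(idx[n_dev:])
    return sorted(train_idx), sorted(dev_idx)
```

Stratifying matters here because some categories are very rare, and a purely random 10% sample could leave a class out of the dev set and distort the log loss estimate.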


In [20]:
# removing variables from the memory which are not required further
del X
del Y
gc.collect() # garbage collection

0

### PCA

In [27]:
components = []
explained_ratios = []
for comp in np.arange(30, 100, 10):
    print 'trying comp size : {0}'.format(comp)
    pca = IncrementalPCA(n_components=comp, batch_size=10000)
    pca.fit(train_data)
    components.append(comp)
    explained_ratios.append(np.sum(pca.explained_variance_ratio_))
plt.plot(components, explained_ratios)

In [23]:
components = 100
#pca = PCA(n_components=components)
pca = IncrementalPCA(n_components=components, batch_size=10000)
train_data_pca = pca.fit_transform(train_data)

In [24]:
print pca.explained_variance_ratio_
print 'total variance explained : {0}'.format(np.sum(pca.explained_variance_ratio_))

[  8.42627339e-01   1.18047915e-02   8.34601375e-03   7.98179114e-03
   5.29984866e-03   3.86414566e-03   3.30574706e-03   3.01521592e-03
   2.92441048e-03   2.86589255e-03   2.80843957e-03   2.78513063e-03
   2.72858837e-03   2.63179886e-03   2.49305265e-03   2.14708310e-03
   1.93103110e-03   1.77059677e-03   1.74034975e-03   1.72590181e-03
   1.71983951e-03   1.68585271e-03   1.66742113e-03   1.65798816e-03
   1.64849376e-03   1.63561835e-03   1.62488574e-03   1.60702396e-03
   1.59886162e-03   1.59247584e-03   1.57405273e-03   1.56450404e-03
   1.55625601e-03   1.53846056e-03   1.52698548e-03   1.50560765e-03
   1.50343076e-03   1.47823736e-03   1.45087280e-03   1.38703333e-03
   1.25429521e-03   1.15032512e-03   1.08640197e-03   9.98748435e-04
   9.50303664e-04   8.85554634e-04   8.51514635e-04   8.23859778e-04
   8.11904694e-04   7.65114661e-04   7.30254000e-04   6.79943461e-04
   6.25586081e-04   5.98456905e-04   5.73737628e-04   5.67204310e-04
   5.24750847e-04   4.97112682e-04

In [25]:
dev_data_pca = pca.transform(dev_data)

In [26]:
print train_data.shape, train_data_pca.shape
print dev_data.shape, dev_data_pca.shape

(790240L, 2315L) (790240L, 100L)
(87809L, 2315L) (87809L, 100L)
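Instead of fixing the number of components by hand, one option is to pick the smallest number whose cumulative explained variance reaches a target. The sketch below is a hedged illustration of that rule. `components_for_variance` is a hypothetical helper, and it would be fed something like `pca.explained_variance_ratio_` from the cells above.

```python
def components_for_variance(explained_ratios, target):
    """Smallest number of leading components whose ratios sum to `target`.

    Returns None if the target is never reached.
    """
    cumulative = 0.0
    for k, ratio in enumerate(explained_ratios, 1):
        cumulative += ratio
        if cumulative >= target:
            return k
    return None
```

In the output above the first component alone explains about 84% of the variance, so a modest target is reached almost immediately and later components add very little each.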


### Model Building and Training

In [27]:
# trying a random forest on the PCA features
rfc = RandomForestClassifier(n_estimators=50, max_depth=15, random_state=1337, n_jobs=-1, verbose=1)
rfc.fit(train_data_pca, train_labels)
print 'rfc  LogLoss {0}'.format(log_loss(dev_labels, rfc.predict_proba(dev_data_pca)))

[Parallel(n_jobs=-1)]: Done   1 out of  50 | elapsed:   27.4s remaining: 22.4min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  1.7min finished
[Parallel(n_jobs=16)]: Done   1 out of  44 | elapsed:    0.1s remaining:    8.6s
[Parallel(n_jobs=16)]: Done  50 out of  50 | elapsed:    0.5s finished


rfc  LogLoss 2.37112734818


In [27]:
# trying deep belief networks
clf = DBN(
    [train_data_pca.shape[1], 300, len(category_encoder.classes_)],
    learn_rates=0.3,
    learn_rate_decays=0.9,
    epochs=10,
    verbose=1,
    )

clf.fit(train_data_pca, train_labels.ravel())
preds = clf.predict(dev_data_pca)
print classification_report(dev_labels, preds)
print log_loss(dev_labels, clf.predict_proba(dev_data_pca))

[DBN] fitting X.shape=(790240L, 60L)
[DBN] layers [60L, 300, 39]
[DBN] Fine-tune...
Epoch 1:
  loss 7645049.8974
  err  0.804093352636
  (0:00:55)
Epoch 2:
  loss 2.688498898
  err  0.802472260468
  (0:00:54)
Epoch 3:
  loss 2.68930282237
  err  0.802965801409
  (0:00:55)
Epoch 4:
  loss 2.68784684317
  err  0.802564641208
  (0:00:55)
Epoch 5:
  loss 2.68635195572
  err  0.801383939418
  (0:00:55)
Epoch 6:
  loss 2.68664861494
  err  0.801957206204
  (0:00:55)
Epoch 7:
  loss 2.68577188873
  err  0.802114126913
  (0:00:55)
Epoch 8:
  loss 2.68567634986
  err  0.801400390783
  (0:00:56)
Epoch 9:
  loss 2.68414372489
  err  0.801675002025
  (0:00:57)
Epoch 10:
  loss 2.68425798099
  err  0.800958734915
  (0:00:55)
             precision    recall  f1-score   support

          0       0.00      0.00      0.00       151
          1       0.00      0.00      0.00      7688
          2       0.00      0.00      0.00        41
          3       0.00      0.00      0.00        29
          4       0.00      0.00      0.00      3676
          5       0.00      0.00      0.00       432
          6       0.00      0.00      0.00       227
          7       0.00      0.00      0.00      5397
          8       0.00      0.00      0.00       428
          9       0.00      0.00      0.00       117
         10       0.00      0.00      0.00        26
         11       0.00      0.00      0.00        49
         12       0.00      0.00      0.00      1061
         13       0.00      0.00      0.00      1668
         14       0.00      0.00      0.00        15
         15       0.00      0.

  'precision', 'predicted', average, warn_for)


In [None]:
# trying support vector machine models
from sklearn.svm import SVC
C = 1.0
svc = SVC(kernel='linear', C=C, probability =True).fit(train_data_pca, train_labels.ravel())
print 'linear svc : log loss = {0}'.format(log_loss(dev_labels, svc.predict_proba(dev_data_pca)))

rbf_svc = SVC(kernel='rbf', C=C, probability=True).fit(train_data_pca, train_labels.ravel())
print 'rbf svc : log loss = {0}'.format(log_loss(dev_labels, rbf_svc.predict_proba(dev_data_pca)))

poly_svc = SVC(kernel='poly', C=C, probability=True).fit(train_data_pca, train_labels.ravel())
print 'poly svc : log loss = {0}'.format(log_loss(dev_labels, poly_svc.predict_proba(dev_data_pca)))


In [31]:
clfs = []
predictions = []

# batch training of random forest
def getRows(rows, data, labels):
    return data[rows], labels[rows].ravel()

def iter_minibatches(chunksize, data, labels):
    numtrainingpoints = len(data)     
    chunkstartmarker = 0
    while chunkstartmarker < numtrainingpoints:
        start = chunkstartmarker
        if start + chunksize < numtrainingpoints:
            end = chunkstartmarker + chunksize
        else:
            end = numtrainingpoints
        chunkrows = range(start,end)       
        X_chunk, y_chunk = getRows(chunkrows, data, labels)        
        yield X_chunk, y_chunk
        chunkstartmarker += chunksize       
        
def train_model(train_full_data, train_full_labels, train_pca_data, dev_full_data, dev_full_labels, dev_pca_data):
    #model = batch_SGDClassifier(data, labels)
    #model = MultinomialNBClassifier(data, labels)
    #model = batch_RandomForest(data, labels)
    #return model
    models, weights = ensemble_training(train_full_data, train_full_labels, train_pca_data, dev_full_data, dev_full_labels, dev_pca_data)
    return models, weights

def ensemble_training(train_full_data, train_full_labels, train_pca_data, dev_full_data, dev_full_labels, dev_pca_data):
    
    model_modes = []
    # batch mode models - using partial_fit - use all features
    # model 1 - batch SGDClassifier
    #batch_sgd_1 = batch_SGDClassifier(train_full_data, train_full_labels,dev_full_data, dev_full_labels, 3213)
    #print 'model 1 : Batch SGD -1 LogLoss {0}'.format(log_loss(dev_full_labels, batch_sgd_1.predict_proba(dev_full_data)))
    #clfs.append(batch_sgd_1)
    #model_modes.append('Full')
    
    # model 2 - batch SGDClassifier
    #batch_sgd_2 = batch_SGDClassifier(train_full_data, train_full_labels,dev_full_data, dev_full_labels, 4321)
    #print 'model 2 : Batch SGD -2 LogLoss {0}'.format(log_loss(dev_full_labels, batch_sgd_2.predict_proba(dev_full_data)))
    #clfs.append(batch_sgd_2)
    #model_modes.append('Full')
    
    # model 3 - Random Forest
    rfc_1 = RandomForestClassifier(n_estimators=50, max_depth=15, random_state=4141, n_jobs=-1, verbose=1)
    rfc_1.fit(train_pca_data, train_full_labels)
    print 'model 3 : RFC -1 LogLoss {0}'.format(log_loss(dev_full_labels, rfc_1.predict_proba(dev_pca_data)))
    clfs.append(rfc_1)
    model_modes.append('PCA')
    
    # model 4 - Random Forest
    rfc_2 = RandomForestClassifier(n_estimators=50, max_depth=15, random_state=1337, n_jobs=-1, verbose=1)
    rfc_2.fit(train_pca_data, train_full_labels)
    print 'model 4 : RFC - 2  LogLoss {0}'.format(log_loss(dev_full_labels, rfc_2.predict_proba(dev_pca_data)))
    clfs.append(rfc_2) 
    model_modes.append('PCA')
    
    # model 3 - Random Forest
    #rfc_3 = RandomForestClassifier(n_estimators=50, max_depth=15, random_state=8765, n_jobs=-1)
    #rfc_3.fit(train_pca_data, train_full_labels)
    #print 'model 3 : RFC -3 LogLoss {0}'.format(log_loss(dev_full_labels, rfc_3.predict_proba(dev_pca_data)))
    #clfs.append(rfc_3)
    #model_modes.append('PCA')
    
    # model 4 - Random Forest
    #rfc_4 = RandomForestClassifier(n_estimators=50, max_depth=15, random_state=5469, n_jobs=-1, verbose=1)
    #rfc_4.fit(train_pca_data, train_full_labels)
    #print 'model 4 : RFC - 4  LogLoss {0}'.format(log_loss(dev_full_labels, rfc_4.predict_proba(dev_pca_data)))
    #clfs.append(rfc_4) 
    #model_modes.append('PCA')
    
    
     # model 5 - Gradient Boost
    #gbc_1 = GBC(n_estimators=50, max_depth=15, random_state=3421, verbose=1)
    #gbc_1.fit(train_pca_data, train_full_labels)
    #print 'GBC -1 LogLoss {0}'.format(log_loss(dev_labels, gbc_1.predict_proba(dev_pca_data)))
    #clfs.append(gbc_1)
    #model_modes.append('PCA')
    
     # model 6 - Gradient Boost
    #gbc_2 = GBC(n_estimators=50, max_depth=15, random_state=7651, verbose=1)
    #gbc_2.fit(train_pca_data, train_full_labels)
    #print 'GBC -2 LogLoss {0}'.format(log_loss(dev_labels, gbc_2.predict_proba(dev_pca_data)))
    #clfs.append(gbc_2)
    #model_modes.append('PCA')
    
    
    # model 5 - Logistic Regression
    #lr_1 = LogisticRegression(C=100)
    #lr_1.fit(train_pca_data, train_full_labels)
    #print 'LR -1 LogLoss {0}'.format(log_loss(dev_labels, lr_1.predict_proba(dev_pca_data)))
    #clfs.append(lr_1)
    #model_modes.append('PCA')
    
    ### finding the optimum weights    
    for index in range(len(clfs)):
        clf = clfs[index]
        model_mode = model_modes[index]
        if model_mode == 'Full':
            predictions.append(clf.predict_proba(dev_full_data))
        if model_mode == 'PCA':
            predictions.append(clf.predict_proba(dev_pca_data))
    
    # the optimizer needs a starting value; right now we choose 0.5 for all weights
    # it is better to choose many random starting points and run minimize a few times
    starting_values = [0.5]*len(predictions)

    #adding constraints  and a different solver as suggested by user 16universe
    #https://kaggle2.blob.core.windows.net/forum-message-attachments/75655/2393/otto%20model%20weights.pdf?sv=2012-02-12&se=2015-05-03T21%3A22%3A17Z&sr=b&sp=r&sig=rkeA7EJC%2BiQ%2FJ%2BcMpcA4lYQLFh6ubNqs2XAkGtFsAv0%3D
    cons = ({'type':'eq','fun':lambda w: 1-sum(w)})
    #our weights are bound between 0 and 1
    bounds = [(0,1)]*len(predictions)

    res = minimize(log_loss_func, starting_values, method='SLSQP', bounds=bounds, constraints=cons)

    print 'Ensemble Score: {0}'.format(res['fun'])
    print 'Best Weights: {0}'.format(res['x'])
    return clfs, res['x']
    
def log_loss_func(weights):
    ''' scipy minimize will pass the weights as a numpy array '''
    final_prediction = 0
    for weight, prediction in zip(weights, predictions):
            final_prediction += weight*prediction

    return log_loss(dev_labels, final_prediction)

def batch_RandomForest(data, labels):
    model = RandomForestClassifier(n_estimators=100,criterion='entropy',max_depth=5)
    model.fit(data, labels)
    return model  
    
def MultinomialNBClassifier(data, labels):
    model = MultinomialNB()
    model.fit(data,labels)
    return model

def batch_SGDClassifier(data , labels, test_data, test_labels,random_state):
    chunk_size = 1000
    batcherator = iter_minibatches(chunk_size, data, labels)
    model = SGDClassifier(n_jobs=-1,alpha=0.00005,n_iter = 50, loss='log',random_state=random_state)
    # Train model on each chunk
    chunk_count = 1
    for X_chunk, y_chunk in batcherator:               
        model.partial_fit(X_chunk, y_chunk, classes=np.array(range(len(category_encoder.classes_))))
        score = model.score(test_data, test_labels)
        probs = model.predict_proba(test_data)
        log_loss_value = log_loss(test_labels, probs)
        print 'training using chunk : {0} validation score : {1:.4f} log-loss : {2:.5f}'.format(chunk_count,score,log_loss_value) 
        chunk_count += 1
    return model  
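The weight search in `ensemble_training` blends each model's class probabilities and minimizes dev-set log loss over the weights. The sketch below illustrates the same objective for two models with a plain grid search in place of SLSQP. It is a swapped-in, simplified technique for illustration, and the names `blend`, `mean_log_loss`, and `best_two_model_weight` are hypothetical.

```python
import math

def blend(prediction_sets, weights):
    """Weighted sum of per-model probability rows."""
    n_rows = len(prediction_sets[0])
    n_classes = len(prediction_sets[0][0])
    blended = [[0.0] * n_classes for _ in range(n_rows)]
    for preds, w in zip(prediction_sets, weights):
        for i, row in enumerate(preds):
            for j, p in enumerate(row):
                blended[i][j] += w * p
    return blended

def mean_log_loss(labels, rows, eps=1e-15):
    """Mean negative log-probability of the true class, with clipping."""
    total = 0.0
    for label, row in zip(labels, rows):
        total += math.log(max(min(row[label], 1 - eps), eps))
    return -total / len(labels)

def best_two_model_weight(labels, preds_a, preds_b, steps=10):
    """Try weights w, 1 - w on a grid and return (best_w, best_loss)."""
    best = None
    for i in range(steps + 1):
        w = i / float(steps)
        loss = mean_log_loss(labels, blend([preds_a, preds_b], [w, 1 - w]))
        if best is None or loss < best[1]:
            best = (w, loss)
    return best
```

Because the weights are non-negative and sum to one, each blended row is still a valid probability distribution. With more than two models a grid becomes expensive, which is one reason to use a constrained optimizer as the notebook does.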

In [None]:
models, weights = train_model(train_data, train_labels.ravel(), train_data_pca, dev_data, dev_labels.ravel(), dev_data_pca )

[Parallel(n_jobs=-1)]: Done   1 out of  50 | elapsed:   20.3s remaining: 16.6min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  1.2min finished
[Parallel(n_jobs=16)]: Done   1 out of  46 | elapsed:    0.1s remaining:   10.5s
[Parallel(n_jobs=16)]: Done  50 out of  50 | elapsed:    0.5s finished
[Parallel(n_jobs=-1)]: Done   1 out of  50 | elapsed:   19.4s remaining: 15.9min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  1.2min finished
[Parallel(n_jobs=16)]: Done   1 out of  50 | elapsed:    0.1s remaining:    9.8s
[Parallel(n_jobs=16)]: Done  50 out of  50 | elapsed:    0.5s finished


model 3 : RFC -1 LogLoss 2.38681878209
model 4 : RFC - 2  LogLoss 2.3871812751
      Iter       Train Loss   Remaining Time 

In [None]:
#clf = LogisticRegression(C=100.0)
#clf = GBC(n_estimators=10, max_depth=5,verbose=1)
#clf = MultinomialNB()

#clf = RandomForestClassifier(n_estimators=100)
#clf = SGDClassifier(fit_intercept=False, shuffle=True, n_jobs=-1,alpha=0.000005,n_iter = 50, loss='log', penalty ='l2')
   
#clf.fit(train_data, train_labels)
#print clf.predict_proba(dev_data).shape
# round up the results
#probs = np.around(clf.predict_proba(dev_data), decimals=5)
#print 'log loss score : {0:.4f} accuracy : {1:.4f}'.format(log_loss(dev_labels, probs), clf.score(dev_data, dev_labels))

### Evaluation

In [28]:
# prepare test data
test_data = final_test_df[inputFeatures].values
#test_data[:,:len(to_be_scaled_features)] = scaler.transform(test_data[:,:len(to_be_scaled_features)])
test_data = np.concatenate((test_data,  test_adddress_features.toarray()), axis=1)
test_data_pca = pca.transform(test_data)
print test_data.shape, test_data_pca.shape

(884262L, 2315L) (884262L, 100L)


In [42]:
# compute test output
# weighted sum of each model's class probabilities
#final_predictions = []
final_prob = np.zeros((test_data_pca.shape[0], len(category_encoder.classes_)))
for index in range(len(models)):
    model = models[index]
    #model_mode = model_modes[index]
    #if model_mode == 'Full':
    #    pass
    #    #final_predictions.append(model.predict_proba(test_data))
    #if model_mode == 'PCA':
    probs = model.predict_proba(test_data_pca)
    probs = probs * weights[index]
    final_prob = final_prob + probs
print final_prob.shape
       

(884262L, 39L)


In [30]:

# round the results to 10 decimal places
submit_output = np.around(final_prob, decimals=10)
print submit_output.shape

(884262L, 39L)


### Submission Preparation

In [31]:
result = np.c_[final_test_df['Id'].astype(int), submit_output.astype(float)]
print result.shape
outputColumns =  ['Id'] + list( category_encoder.classes_)
df_result = pd.DataFrame(result, columns=outputColumns)
df_result['Id'] = df_result['Id'].astype(int)
print df_result.info()

(884262L, 40L)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 884262 entries, 0 to 884261
Data columns (total 40 columns):
Id                             884262 non-null int32
ARSON                          884262 non-null float64
ASSAULT                        884262 non-null float64
BAD CHECKS                     884262 non-null float64
BRIBERY                        884262 non-null float64
BURGLARY                       884262 non-null float64
DISORDERLY CONDUCT             884262 non-null float64
DRIVING UNDER THE INFLUENCE    884262 non-null float64
DRUG/NARCOTIC                  884262 non-null float64
DRUNKENNESS                    884262 non-null float64
EMBEZZLEMENT                   884262 non-null float64
EXTORTION                      884262 non-null float64
FAMILY OFFENSES                884262 non-null float64
FORGERY/COUNTERFEITING         884262 non-null float64
FRAUD                          884262 non-null float64
GAMBLING                       884262 non-null floa

In [32]:
df_result.to_csv('10-SFCrimeMIDSChallengerTeam.csv', index=False,header=True)
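For reference, the submission layout (an `Id` column followed by one probability column per class, with a header row) can also be written with the standard `csv` module. This is a minimal sketch written for Python 3, unlike the Python 2 cells in this notebook. `write_submission` and its arguments are hypothetical, and the probabilities in the usage lines are made-up placeholders.

```python
import csv
import io

def write_submission(stream, ids, class_names, prob_rows, decimals=10):
    """Write a header plus one row per incident: Id, then class probabilities."""
    writer = csv.writer(stream, lineterminator='\n')
    writer.writerow(['Id'] + list(class_names))
    for incident_id, row in zip(ids, prob_rows):
        writer.writerow([incident_id] + [round(p, decimals) for p in row])

# Usage with placeholder values, writing to an in-memory buffer.
buffer = io.StringIO()
write_submission(buffer, [0, 1], ['ARSON', 'ASSAULT'], [[0.25, 0.75], [0.5, 0.5]])
submission_text = buffer.getvalue()
```

Rounding keeps the file smaller. Since the scorer rescales each row by its sum, small rounding errors in the row totals should not affect the result.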