# Predict the Criminals

## 01 : Frame the Problem

#### Problem Link: https://www.hackerearth.com/problem/machine-learning/predict-the-criminal/description/

There has been a surge in crimes committed in recent years, making crime a top cause of concern for law enforcement. If we are able to estimate whether someone is going to commit a crime in the future, we can take precautions and be prepared. You are given a dataset containing answers to various questions concerning the professional and private lives of several people. A few of them have been arrested for various small and large crimes in the past. Use the given data to predict if the people in the test data will commit a crime. The train data consists of 45718 rows, while the test data consists of 11430 rows. Given this, we have to predict whether a person is a criminal or not.

### Importing the Libraries

In [0]:
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.model_selection import cross_validate  #to include metrics for evaluation
from sklearn.model_selection import GridSearchCV #to use gridsearchcv
from sklearn.model_selection import train_test_split
import matplotlib.pylab as plt
%matplotlib inline

## 02 and 03 : Obtain Data and Analyse The Data

In [0]:
!wget https://www.dropbox.com/s/oyaehyvu7yx57pe/criminal_train.csv

In [0]:
ls -l

In [0]:
crim_data=pd.read_csv('criminal_train.csv')
crim_data.head()

In [0]:
crim_data.info()

## 04 :  Prepare the Data

In [0]:
#Splitting the data for training and testing

In [0]:
X_train, X_test, y_train, y_test = train_test_split(crim_data.drop('Criminal',axis=1), 
                                                    crim_data['Criminal'], test_size=0.30, 
                                                    random_state=101)

In [0]:
train=pd.concat([X_train,y_train],axis=1)

In [0]:
#function to estimate the best value of n_estimators and fit the model with the given data

In [0]:
def modelfit(alg, dtrain, predictors,useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    
    if useTrainCV:
        #to get the parameters of xgboost
        xgb_param = alg.get_xgb_params() 
        
        #to convert into a datastructure internally used by xgboost for training efficiency 
        # and speed
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        
        #xgb.cv is used to find the number of estimators required for the parameters 
        # which are set
        cvresult = xgb.cv(xgb_param, xgtrain, 
                          num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
                        metrics='auc', early_stopping_rounds=early_stopping_rounds)
        
        #setting the n_estimators parameter using set_params
        alg.set_params(n_estimators=cvresult.shape[0])
        
        print(alg.get_xgb_params())
    
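    #Fit the algorithm on the data. eval_metric is not passed to fit(): no eval_set is
    # given, so it would not be evaluated, and recent xgboost releases reject it here
    alg.fit(dtrain[predictors], dtrain[target])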
    
    return alg
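
The effect of `early_stopping_rounds` on the number of rounds kept can be illustrated without XGBoost itself. The following is a minimal sketch, assuming a list of per-round mean validation AUC values (the numbers are made up) and a hypothetical helper `rounds_kept`. It mimics the idea of stopping once the metric has not improved for a given number of rounds and keeping the rounds up to the best one; it is not XGBoost's actual implementation.

```python
def rounds_kept(scores, patience):
    """Return how many boosting rounds to keep, given per-round validation scores.

    Stops once `patience` rounds have passed without beating the best score so far
    (higher is better, as with AUC) and keeps the rounds up to the best one.
    """
    best_score = float("-inf")
    best_round = -1
    for i, score in enumerate(scores):
        if score > best_score:
            best_score, best_round = score, i
        elif i - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round + 1


# Made-up AUC values: the score stops improving after the fourth round.
auc_per_round = [0.80, 0.85, 0.88, 0.90, 0.89, 0.90, 0.895, 0.93]
n_estimators = rounds_kept(auc_per_round, patience=3)
print(n_estimators)
```

Note that the last value (0.93) is never reached in this toy run, which is the trade-off of a small patience: a later improvement can be missed.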

In [0]:
#function to get the accuracy of the model on the test data given the features considered

In [0]:
def get_accuracy(alg,predictors):
    #predicted labels and positive-class probabilities on the held-out test split
    test_predictions = alg.predict(X_test[predictors])
    test_predprob = alg.predict_proba(X_test[predictors])[:,1]
    print ("\nModel Report")
    print ("Accuracy : %.4g" % metrics.accuracy_score(y_test.values, 
                                                      test_predictions))
    print ("AUC Score (Test): %f" % metrics.roc_auc_score(y_test.values, 
                                                          test_predprob))
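
The report prints both accuracy and ROC AUC, and the two can disagree. The sketch below uses made-up labels and scores and two hypothetical helpers, `roc_auc` and `accuracy`, to show why: AUC is the fraction of (positive, negative) pairs in which the positive receives the higher score, so it depends on the ranking of the scores rather than on a fixed threshold. The notebook itself relies on `metrics.roc_auc_score`; this pairwise version is only for illustration.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


def accuracy(y_true, y_score, threshold=0.5):
    """Share of labels matched after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


# Made-up, imbalanced example: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.30, 0.35, 0.40]

acc = accuracy(y_true, y_score)
auc = roc_auc(y_true, y_score)
print(acc, auc)
```

Here every score falls below 0.5, so both positives are misclassified even though they are ranked above all negatives: the ranking is perfect while the thresholded predictions are not.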

In [0]:
#function to get the feature importances based on the model fit

In [0]:
def get_feature_importances(alg):
    #to get the feature importances based on xgboost we use fscore,
    # read through the public get_booster() accessor
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
    print(feat_imp)
    
    #this shows the feature importances on a bar chart
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')
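
The features dropped in the later sections are picked by reading this importance output. One way to make that step repeatable is sketched below, assuming the importances are available as a dict like the one `get_fscore()` returns, which leaves out features that were never used in a split. The helper name `least_important` is hypothetical, and the scores are invented; only the feature names come from this notebook.

```python
def least_important(all_features, scores, k):
    """Return the k features with the lowest importance scores.

    Features missing from `scores` are treated as having importance 0.
    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(all_features, key=lambda f: (scores.get(f, 0), f))
    return ranked[:k]


# Invented scores; HLTINNOS is absent, as an unused feature would be.
features = ["IRWELMOS", "MAIIN102", "IIPINC3", "HLTINNOS", "TOOLONG"]
fscore = {"IRWELMOS": 3, "MAIIN102": 40, "IIPINC3": 12, "TOOLONG": 1}

to_drop = least_important(features, fscore, k=2)
print(to_drop)
```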

In [0]:
target = 'Criminal'
IDcol = 'PERID'

## 05 : Model Selection

In [0]:
#to return the XGBClassifier object based on the given hyperparameter values

In [0]:
!pip install xgboost

In [0]:
def XgbClass(learning_rate =0.1,n_estimators=1000,max_depth=5,min_child_weight=1,
             gamma=0,subsample=0.8,colsample_bytree=0.8):
    xgb1 = XGBClassifier(learning_rate=learning_rate,
                         n_estimators=n_estimators,
                         max_depth=max_depth,
                         min_child_weight=min_child_weight,
                         gamma=gamma,
                         subsample=subsample,
                         colsample_bytree=colsample_bytree)
    return xgb1

In [0]:
#Function to return the list of predictors

In [0]:
#returns the columns of train that are not in the list l, i.e. the predictors to keep
def drop_features(l):
    return [x for x in train.columns if x not in l]

### First Prediction : Use of initial parameters and without feature engineering

In [0]:
from xgboost import XGBClassifier
import xgboost as xgb

In [0]:
predictors = drop_features([target, IDcol])
xgb1=XgbClass()
#modelfit already fits xgb1 and returns it, so no separate fit call is needed
first_model=modelfit(xgb1, train, predictors)

In [0]:
get_accuracy(first_model,predictors)

In [0]:
get_feature_importances(first_model)

### Second Prediction : Using initial parameters and removing the features of least importance

In [0]:
#model after removing the features of least importance

In [0]:
dropl=['IRWELMOS','MAIIN102','IIPINC3','HLTINNOS','IIHH65_2','TOOLONG']

In [0]:
dropl_first=dropl+[target,IDcol]

In [0]:
#these are the initial parameters before tuning
predictors = drop_features(dropl_first)
xgb1 = XgbClass()
second_model=modelfit(xgb1, train, predictors)

In [0]:
get_accuracy(second_model,predictors)

In [0]:
get_feature_importances(second_model)

### Third Prediction : Again removing the features of least importance

In [0]:
#further features of least importance, removed on top of dropl
extra_drop=['IRMCDCHP','HLCLAST','IIKI17_2','IRFAMPMT','IRFSTAMP','ANYHLTI2','IIFAMSVC']
dropl1=dropl+extra_drop
dropl_second=dropl_first+extra_drop

In [0]:
predictors=drop_features(dropl_second)

In [0]:
xgb1=XgbClass()
third_model1=modelfit(xgb1,train,predictors)

In [0]:
get_accuracy(third_model1,predictors)

## 06 : Predict on New Cases

In [0]:
#Function to drop the given features from the test data and store the predictions in the required csv file

In [0]:
!wget https://www.dropbox.com/s/1jbq922kv3mwi4r/criminal_test.csv

In [0]:
def RunTestAndSaveResults(features,filename,model):
    df1=pd.read_csv('criminal_test.csv')
    #drop the same features that were removed before training
    df1=df1.drop(list(features),axis=1)
    predict=model.predict(df1.drop('PERID',axis=1))
    #one row per PERID with its predicted label
    data=pd.DataFrame({'PERID':df1['PERID'],'Criminal':predict})
    data.to_csv(filename,index=False)

In [0]:
RunTestAndSaveResults([],'result.csv',first_model)

In [0]:
#This model gives high accuracy since we removed the features of least importance

In [0]:

dropl

In [0]:
RunTestAndSaveResults(dropl,'result2.csv',second_model)

In [0]:
#When features are removed again and again, overfitting takes place and the accuracy decreases

In [0]:
RunTestAndSaveResults(dropl1,'result3.csv',third_model1)

## 07 : Tuning

In [0]:
#tune max_depth and min_child_weight

In [0]:
predictors = drop_features(dropl_first)
predictors

In [0]:
param_test1 = {
 'max_depth':list(range(5,10,1)),
 'min_child_weight':list(range(5,10,1))
}
#iid and grid_scores_ were removed from scikit-learn; cv_results_ holds the per-combination scores
gsearch1 = GridSearchCV(estimator=XgbClass(n_estimators=48),param_grid =param_test1,
                        scoring='roc_auc',n_jobs=-1, cv=5, verbose=3)
gsearch1.fit(train[predictors],train[target])
gsearch1.cv_results_['mean_test_score'], gsearch1.best_params_, gsearch1.best_score_
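
The cost of a search like this follows from simple arithmetic: every combination of parameter values is fitted once per fold. The sketch below reproduces that count with the standard library only, using the same value ranges as `param_test1`; the variable names are illustrative.

```python
from itertools import product

param_grid = {
    "max_depth": list(range(5, 10, 1)),
    "min_child_weight": list(range(5, 10, 1)),
}
cv_folds = 5

# Every combination of parameter values that an exhaustive grid search tries.
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(combinations) * cv_folds

print(len(combinations), total_fits)
```

With `refit=True`, the scikit-learn default, `GridSearchCV` performs one additional fit of the best combination on the full training data.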

In [0]:
#if the best parameters lie on the edge of the grid, we run GridSearchCV again
#over values one less and one more than the best parameters
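
The rule in the comment above can be sketched as a small function. The helper name `refine_if_on_edge` is hypothetical, and the sketch assumes integer-valued parameters with a step of 1; the grid actually used in the next cell covers a somewhat wider range.

```python
def refine_if_on_edge(values, best, step=1):
    """If `best` is the smallest or largest value tried, return new candidates
    centred on it; otherwise return the sorted original values."""
    values = sorted(values)
    if best in (values[0], values[-1]):
        return [best - step, best, best + step]
    return values


tried = list(range(5, 10))  # 5, 6, 7, 8, 9
on_edge = refine_if_on_edge(tried, best=5)
inside = refine_if_on_edge(tried, best=7)
print(on_edge, inside)
```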

In [0]:
param_test2 = {
 'max_depth':[6,7,8,9],
 'min_child_weight':[2,3,4,5]
}
#iid and grid_scores_ were removed from scikit-learn; cv_results_ holds the per-combination scores
gsearch2 = GridSearchCV(estimator=XgbClass(n_estimators=48),param_grid =param_test2,
                        scoring='roc_auc',n_jobs=-1, cv=5)
gsearch2.fit(train[predictors],train[target])
gsearch2.cv_results_['mean_test_score'], gsearch2.best_params_, gsearch2.best_score_

In [0]:
xgb1 = XgbClass(max_depth=8,min_child_weight=4)
model=modelfit(xgb1, train, predictors)

In [0]:
get_accuracy(model,predictors)

In [0]:
#to tune gamma

In [0]:
param_test3 = {
 'gamma':[i/10.0 for i in range(0,8)]
}
#iid and grid_scores_ were removed from scikit-learn; cv_results_ holds the per-combination scores
gsearch3=GridSearchCV(estimator=XgbClass(n_estimators=48,max_depth=7,min_child_weight=5),
                      param_grid =param_test3,scoring='roc_auc',n_jobs=-1, cv=5)
gsearch3.fit(train[predictors],train[target])
gsearch3.cv_results_['mean_test_score'], gsearch3.best_params_, gsearch3.best_score_

In [0]:
xgb1 = XgbClass(max_depth=7,min_child_weight=5,gamma=0)
model=modelfit(xgb1, train, predictors)

In [0]:
get_accuracy(model,predictors)

In [0]:
#to tune subsample and colsample_bytree
param_test4 = {
 'subsample':[i/10.0 for i in range(6,10)],
 'colsample_bytree':[i/10.0 for i in range(6,10)]
}
#iid and grid_scores_ were removed from scikit-learn; cv_results_ holds the per-combination scores
gsearch4=GridSearchCV(estimator=XgbClass(n_estimators=48,max_depth=7,
                                         min_child_weight=5,gamma=0),
                      param_grid =param_test4,scoring='roc_auc',n_jobs=-1, cv=5)
gsearch4.fit(train[predictors],train[target])
gsearch4.cv_results_['mean_test_score'], gsearch4.best_params_, gsearch4.best_score_

In [0]:
xgb1 = XgbClass(max_depth=8,min_child_weight=4,gamma=0.4,subsample=0.8,colsample_bytree=0.6)
model=modelfit(xgb1, train, predictors)

In [0]:
get_accuracy(model,predictors)

In [0]:
#dropl = dropl + ['HLCLAST', 'IIFAMSVC', 'IIKI17_2', 'ANYHLTI2', 'IRFAMPMT', 'IRFSTAMP', 'IRMCDCHP']
RunTestAndSaveResults(dropl,'final_result.csv',model)

In [0]:
ls -l