<center><h1>San Francisco Crime Classification - Predictions</h1></center>

### `import` Packages

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pickle
import random
import os
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler

In [2]:
project_path = '/content/drive/MyDrive/AAIC/SCS-1/sf_crime_classification/'

### Test Data (Reading)

* In the featurization notebook, I already fit the transformers on the training data and used them to transform the test data.

* I do not fit anything on the test data here, since that would introduce data leakage.

* For more details, see my [jovian notebook](https://jovian.ai/msameeruddin/01-cs1-featurization).
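
To make the leakage point concrete, here is a minimal plain-Python sketch of what "fit on train, transform test" amounts to for a single feature. The numbers and the helper names `fit_standardizer` and `transform` are made up for illustration; the notebook itself relies on scikit-learn's `StandardScaler`, which standardizes with the mean and population standard deviation learned during `fit`.

```python
from statistics import fmean, pstdev

def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from the given data."""
    return fmean(train_values), pstdev(train_values)

def transform(values, mean, std):
    """Standardize values with statistics learned elsewhere (no re-fitting)."""
    return [(v - mean) / std for v in values]

# Toy single-feature data (hypothetical numbers).
train_feature = [1.0, 2.0, 3.0, 4.0, 5.0]
test_feature = [10.0, 20.0]

# Correct: the statistics come from the training data only.
mean, std = fit_standardizer(train_feature)
test_scaled = transform(test_feature, mean, std)

# Leaky: re-fitting on the test data hides the shift between train and test.
leaky_mean, leaky_std = fit_standardizer(test_feature)
test_leaky = transform(test_feature, leaky_mean, leaky_std)

print(test_scaled)
print(test_leaky)
```

With the training statistics, the test values stay far from zero, which is what a model trained on the scaled training data should see. Re-fitting on the test data centers them around zero and discards that information.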

In [3]:
train = pd.read_csv(filepath_or_buffer=project_path + 'csv_files/train_data_features.csv')
data = pd.read_csv(filepath_or_buffer=project_path + 'csv_files/test_data_features.csv')

In [4]:
data.drop(columns=['id'], inplace=True)

In [5]:
def preprocess_data(X_train, X_new):
    scaler = StandardScaler()
    scaler.fit(X_train)
    X = scaler.transform(X_new)
    return X

In [6]:
data = preprocess_data(X_train=train, X_new=data)

In [7]:
data.shape

(884262, 130)

### Make Predictions

In [8]:
labels = [
    'ARSON',
    'ASSAULT',
    'BAD CHECKS',
    'BRIBERY',
    'BURGLARY',
    'DISORDERLY CONDUCT',
    'DRIVING UNDER THE INFLUENCE',
    'DRUG/NARCOTIC',
    'DRUNKENNESS',
    'EMBEZZLEMENT',
    'EXTORTION',
    'FAMILY OFFENSES',
    'FORGERY/COUNTERFEITING',
    'FRAUD',
    'GAMBLING',
    'KIDNAPPING',
    'LARCENY/THEFT',
    'LIQUOR LAWS',
    'LOITERING',
    'MISSING PERSON',
    'NON-CRIMINAL',
    'OTHER OFFENSES',
    'PORNOGRAPHY/OBSCENE MAT',
    'PROSTITUTION',
    'RECOVERED VEHICLE',
    'ROBBERY',
    'RUNAWAY',
    'SECONDARY CODES',
    'SEX OFFENSES FORCIBLE',
    'SEX OFFENSES NON FORCIBLE',
    'STOLEN PROPERTY',
    'SUICIDE',
    'SUSPICIOUS OCC',
    'TREA',
    'TRESPASS',
    'VANDALISM',
    'VEHICLE THEFT',
    'WARRANTS',
    'WEAPON LAWS'
 ]

In [9]:
def make_predictions(X, model_name='xgboost', labels=labels):
    """Load cached predictions for `model_name`, or compute and save them."""
    model_path = project_path + 'models/'

    if model_name == 'logistic_regression':
        model_path = model_path + 'log_reg_classifier.pkl'
    elif model_name == 'decision_tree':
        model_path = model_path + 'decision_tree_classifier.pkl'
    elif model_name == 'random_forest':
        model_path = model_path + 'random_forest_classifier.pkl'
    else:
        # Any other name falls back to the XGBoost model.
        # X.columns = ['f{}'.format(i) for i in range(X.shape[1])]
        model_path = model_path + 'xgboost_multi_classifier.pkl'

    preds_file = model_name + '_submissions.csv'
    predictions_path = project_path + 'predictions/' + preds_file

    # Reuse predictions saved by an earlier run instead of recomputing them.
    if os.path.isfile(predictions_path):
        print('Predictions already exists.')
        print('Model name : ', model_name)
        return pd.read_csv(filepath_or_buffer=predictions_path)

    # A context manager closes the pickle file once the model is loaded.
    with open(model_path, 'rb') as model_file:
        model = pickle.load(model_file)
    probas = model.predict_proba(X)
    print(probas.shape)

    # One row per test sample: its Id followed by the class probabilities.
    ids = np.arange(X.shape[0]).reshape(-1, 1)
    pred_data = np.hstack((ids, probas))
    pred_df = pd.DataFrame(data=pred_data, columns=['Id'] + labels)
    pred_df['Id'] = pred_df['Id'].astype(int)
    pred_df.to_csv(path_or_buf=predictions_path, index=False)
    return pred_df
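
Before uploading a file produced this way, it can help to sanity-check it. The sketch below is one possible check, not part of the original notebook: the function name `check_submission`, the tolerance, and the three-class toy data are all assumptions made for illustration. It verifies that the header is `Id` followed by the labels and that each row holds valid probabilities summing to roughly 1, as `predict_proba` output should.

```python
import csv
import io
import math

def check_submission(csv_text, expected_labels, tol=1e-3):
    """Return a list of problems found in a submission-style CSV (empty if none)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    problems = []
    if header != ['Id'] + list(expected_labels):
        problems.append('header does not match Id + labels')
    for row in reader:
        probs = [float(v) for v in row[1:]]
        if len(probs) != len(expected_labels):
            problems.append(f'row {row[0]}: wrong number of columns')
        elif any(p < 0 or p > 1 for p in probs):
            problems.append(f'row {row[0]}: probability outside [0, 1]')
        elif not math.isclose(sum(probs), 1.0, abs_tol=tol):
            problems.append(f'row {row[0]}: probabilities sum to {sum(probs):.4f}')
    return problems

# Tiny hypothetical example with three classes instead of the full label list.
toy_labels = ['ARSON', 'ASSAULT', 'BURGLARY']
toy_csv = "Id,ARSON,ASSAULT,BURGLARY\n0,0.2,0.5,0.3\n1,0.6,0.6,0.1\n"
print(check_submission(toy_csv, toy_labels))
```

In this toy input the second row is deliberately wrong (its probabilities add up to more than 1), so the function reports one problem for it. For a real file, the text could be read from the saved `*_submissions.csv` and checked against `labels`.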

### Logistic Regression Predictions

In [10]:
model_name = 'logistic_regression'
lr_preds = make_predictions(X=data, model_name=model_name)

Predictions already exists.
Model name :  logistic_regression


In [11]:
lr_preds.head(2)

Unnamed: 0,Id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,EMBEZZLEMENT,EXTORTION,FAMILY OFFENSES,FORGERY/COUNTERFEITING,FRAUD,GAMBLING,KIDNAPPING,LARCENY/THEFT,LIQUOR LAWS,LOITERING,MISSING PERSON,NON-CRIMINAL,OTHER OFFENSES,PORNOGRAPHY/OBSCENE MAT,PROSTITUTION,RECOVERED VEHICLE,ROBBERY,RUNAWAY,SECONDARY CODES,SEX OFFENSES FORCIBLE,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0,0.00681,0.121123,0.00018,0.000434,0.032989,0.004379,0.007138,0.056211,0.008563,0.00025,0.000263,0.000647,0.000421,0.002368,0.000162,0.004104,0.131456,0.002034,0.000361,0.022498,0.099188,0.167898,4.5e-05,0.001278,0.010929,0.035367,0.001685,0.033758,0.004977,0.000183,0.00701,0.000686,0.041331,1.9e-05,0.008786,0.059841,0.038194,0.058017,0.028418
1,1,0.003566,0.09115,0.00011,0.000325,0.00405,0.003555,0.02242,0.081024,0.010173,0.000125,0.000206,0.000298,0.00017,0.002016,0.000145,0.002961,0.132604,0.002216,0.000777,0.004111,0.090481,0.248452,2.1e-05,0.002583,0.004854,0.04764,0.000233,0.016677,0.003736,0.000146,0.006381,0.000309,0.030777,1e-05,0.003536,0.045989,0.040382,0.065869,0.029923


In [12]:
# ! kaggle competitions submit -c sf-crime -f logistic_regression_submissions.csv -m "Implemented logistic regression model."

![image](https://user-images.githubusercontent.com/63333753/150188034-3ae9e0cd-5c7c-4fc6-ba1d-c07fb1586f8e.png)

### Decision Tree Predictions

In [13]:
model_name = 'decision_tree'
dt_preds = make_predictions(X=data, model_name=model_name)

Predictions already exists.
Model name :  decision_tree


In [14]:
dt_preds.head(2)

Unnamed: 0,Id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,EMBEZZLEMENT,EXTORTION,FAMILY OFFENSES,FORGERY/COUNTERFEITING,FRAUD,GAMBLING,KIDNAPPING,LARCENY/THEFT,LIQUOR LAWS,LOITERING,MISSING PERSON,NON-CRIMINAL,OTHER OFFENSES,PORNOGRAPHY/OBSCENE MAT,PROSTITUTION,RECOVERED VEHICLE,ROBBERY,RUNAWAY,SECONDARY CODES,SEX OFFENSES FORCIBLE,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0,0.002118,0.165765,0.000461,0.00035,0.029653,0.004367,0.002851,0.040854,0.004421,0.001254,0.00031,0.000598,0.009633,0.014851,0.000172,0.00302,0.115277,0.0022,0.001348,0.025965,0.115913,0.138277,3.2e-05,0.003947,0.010242,0.042982,0.001952,0.018854,0.004948,0.000184,0.006122,0.000619,0.055769,1.5e-05,0.007769,0.056521,0.045942,0.045315,0.019132
1,1,0.001423,0.058543,0.000369,0.000297,0.01748,0.003436,0.002075,0.043398,0.003123,0.000978,0.000248,0.000413,0.007671,0.012043,0.000138,0.002068,0.075626,0.001623,0.001081,0.017545,0.05927,0.47337,2.5e-05,0.003165,0.002508,0.027776,0.001524,0.00861,0.003879,0.000147,0.004296,0.000472,0.034465,1.2e-05,0.005606,0.028453,0.029831,0.055203,0.011809


In [15]:
# ! kaggle competitions submit -c sf-crime -f decision_tree_submissions.csv -m "Implemented decision tree model."

![image](https://user-images.githubusercontent.com/63333753/150188343-47a07823-8cb2-422c-87f1-09944beeb677.png)

### Random Forest Predictions

In [16]:
model_name = 'random_forest'
rf_preds = make_predictions(X=data, model_name=model_name)

Predictions already exists.
Model name :  random_forest


In [17]:
rf_preds.head(2)

Unnamed: 0,Id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,EMBEZZLEMENT,EXTORTION,FAMILY OFFENSES,FORGERY/COUNTERFEITING,FRAUD,GAMBLING,KIDNAPPING,LARCENY/THEFT,LIQUOR LAWS,LOITERING,MISSING PERSON,NON-CRIMINAL,OTHER OFFENSES,PORNOGRAPHY/OBSCENE MAT,PROSTITUTION,RECOVERED VEHICLE,ROBBERY,RUNAWAY,SECONDARY CODES,SEX OFFENSES FORCIBLE,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0,0.003593,0.181447,0.000345,0.0005,0.030004,0.003864,0.001677,0.035899,0.00345,0.001217,0.000295,0.000742,0.008759,0.010899,0.000192,0.005359,0.094183,0.001951,0.000603,0.051863,0.082983,0.12179,3.1e-05,0.003323,0.018768,0.029049,0.001819,0.025757,0.006029,0.00025,0.004727,0.000614,0.057898,1.4e-05,0.006791,0.070578,0.070346,0.037222,0.025171
1,1,0.001868,0.054511,0.000139,0.000339,0.00602,0.002849,0.00246,0.057245,0.002609,0.000726,0.000139,0.000329,0.002629,0.004945,0.000238,0.001489,0.056113,0.001897,0.000664,0.012243,0.036934,0.52328,2.6e-05,0.002733,0.002601,0.040813,0.000986,0.006967,0.001823,0.000117,0.003362,0.000412,0.024255,1.2e-05,0.002805,0.016721,0.024543,0.079048,0.02311


In [18]:
# ! kaggle competitions submit -c sf-crime -f random_forest_submissions.csv -m "Implemented random forest model."

![image](https://user-images.githubusercontent.com/63333753/150188699-3f2f8ff7-c68e-4559-a38b-556bbf03f890.png)

### XGBoost Predictions

In [19]:
model_name = 'xgboost'
xgb_preds = make_predictions(X=data, model_name=model_name)

Predictions already exists.
Model name :  xgboost


In [20]:
xgb_preds.head(2)

Unnamed: 0,Id,ARSON,ASSAULT,BAD CHECKS,BRIBERY,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,EMBEZZLEMENT,EXTORTION,FAMILY OFFENSES,FORGERY/COUNTERFEITING,FRAUD,GAMBLING,KIDNAPPING,LARCENY/THEFT,LIQUOR LAWS,LOITERING,MISSING PERSON,NON-CRIMINAL,OTHER OFFENSES,PORNOGRAPHY/OBSCENE MAT,PROSTITUTION,RECOVERED VEHICLE,ROBBERY,RUNAWAY,SECONDARY CODES,SEX OFFENSES FORCIBLE,SEX OFFENSES NON FORCIBLE,STOLEN PROPERTY,SUICIDE,SUSPICIOUS OCC,TREA,TRESPASS,VANDALISM,VEHICLE THEFT,WARRANTS,WEAPON LAWS
0,0,0.001974,0.111186,2.2e-05,0.00053,0.015734,0.00153,0.003977,0.038742,0.007223,0.000116,9.5e-05,0.00064,0.001022,0.006808,9.8e-05,0.003456,0.062435,0.002896,0.000441,0.021759,0.131335,0.134631,5.9e-05,8.6e-05,5.3e-05,0.046069,0.000571,0.022667,0.00357,5.7e-05,0.005097,0.000273,0.064468,4.3e-05,0.014797,0.0489,0.17527,0.046669,0.024703
1,1,0.000696,0.076578,1.6e-05,9.8e-05,0.001488,0.00154,0.00811,0.058337,0.005225,0.000103,4.2e-05,0.000107,0.001259,0.003226,0.000113,0.002175,0.016201,0.003607,0.000116,0.00456,0.03611,0.529499,1.3e-05,6.5e-05,3.9e-05,0.038179,4.3e-05,0.009555,0.001105,3.5e-05,0.00411,9.9e-05,0.03598,1.2e-05,0.000421,0.011684,0.014565,0.119701,0.015187


In [21]:
# ! kaggle competitions submit -c sf-crime -f xgboost_submissions.csv -m "Implemented xgboost classifier model."

![image](https://user-images.githubusercontent.com/63333753/150189166-fc2264f6-150d-411a-9eb4-d10e3d7513d8.png)

### Submitting the Predictions

* Below are the scores obtained after submitting the predictions on Kaggle.

    ![image](https://user-images.githubusercontent.com/63333753/150190219-48522f48-1680-4e02-9b1f-4b27fd751b7d.png)

* From the above scores, we can see that the XGBoost model's score is much lower than the others. The competition metric is multi-class log loss, so lower is better.
    - XGBoost also performed well on the held-out split of the training data that was set aside while training the model (shown below).

    ![image](https://user-images.githubusercontent.com/63333753/150172794-51abed32-eada-491f-9c4d-ba28fabf9fbb.png)

* When I looked through the top Kaggle kernels, it seemed that they were using deep learning models (which take care of featurization and other nitty-gritty details).

    - This is definitely something that I would like to experiment with in future improvements.

* **I ranked 103rd out of 2332 teams, which places me in approximately the top 5% of Kagglers who worked on this problem statement.**
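
For context on the scores discussed above: the competition is evaluated with multi-class logarithmic loss, where lower is better. The sketch below computes it in plain Python on made-up predictions. The function name `multiclass_log_loss` and the clipping constant `eps` are illustrative assumptions, not Kaggle's exact implementation.

```python
import math

def multiclass_log_loss(true_indices, probas, eps=1e-15):
    """Mean negative log of the probability assigned to each row's true class."""
    total = 0.0
    for true_idx, row in zip(true_indices, probas):
        p = min(max(row[true_idx], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(true_indices)

# Hypothetical example: 3 samples, 3 classes.
y_true = [0, 2, 1]
confident = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.1, 0.8, 0.1]]
uniform = [[1 / 3] * 3 for _ in range(3)]

print(multiclass_log_loss(y_true, confident))
print(multiclass_log_loss(y_true, uniform))
```

A model that puts most of its probability on the correct class gets a lower loss than one that spreads probability evenly, which is why a smaller leaderboard score indicates better predictions.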