# MLP Project - crime-cast-forecasting-crime-categories

<details>
    <summary>Level1 Viva Task List</summary>
    <ul>
        <li> - [x] The dataset is appropriately loaded and stored into corresponding variables.</li>
        <li> - [ ] Exploratory data analysis
            <ul>
                <li> - [ ] Visualizing key statistics and relationships in the data</li>
                <li> - [ ] Correctly identify the feature types.</li>
            </ul>
        </li>
        <li> - [ ] Detecting missing data and imputation, if required.</li>
        <li> - [ ] The dataset is appropriately preprocessed,
            <ul>
                <li> - [ ] scaling numerical features.</li>
                <li> - [ ] encoding categorical features.</li>
            </ul>
        </li>
        <li> - [ ] Appropriate usage of pipelines if any.</li>
        <li> - [ ] Feature engineering/extraction.</li>
        <li> - [ ] Hyperparameter tuning of the model to optimize its performance.</li>
        <li> - [ ] Code should be clean, well-structured and appropriately commented.</li>
        <li> - [ ] Highlight important ideas learnt from the dataset and/or model.</li>
        <li> - [ ] Compare at least 3 best models of all the models experimented, analyze their results and provide insights on the model's performance.</li>
        <li> - [ ] Make submission with the model with the best score on training data.</li>
    </ul>
</details>


### Data Loading

In [340]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)

train_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')

### EDA

In [341]:
# Copy the test set so that adding a placeholder target does not modify test_data itself
temp_df = test_data.copy()
temp_df['Crime_Category'] = None
df = pd.concat([train_data, temp_df], axis=0)
df.reset_index(drop=True, inplace=True)
columns_to_drop = []
df

Unnamed: 0,Location,Cross_Street,Latitude,Longitude,Date_Reported,Date_Occurred,Time_Occurred,Area_ID,Area_Name,Reporting_District_no,Part 1-2,Modus_Operandi,Victim_Age,Victim_Sex,Victim_Descent,Premise_Code,Premise_Description,Weapon_Used_Code,Weapon_Description,Status,Status_Description,Crime_Category
0,4500 CARPENTER AV,,34.1522,-118.3910,03/09/2020 12:00:00 AM,03/06/2020 12:00:00 AM,1800.0,15.0,N Hollywood,1563.0,1.0,0385,75.0,M,W,101.0,STREET,,,IC,Invest Cont,Property Crimes
1,45TH ST,ALAMEDA ST,34.0028,-118.2391,02/27/2020 12:00:00 AM,02/27/2020 12:00:00 AM,1345.0,13.0,Newton,1367.0,1.0,0906 0352 0371 0446 1822 0344 0416 0417,41.0,M,H,216.0,SWAP MEET,400.0,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",IC,Invest Cont,Property Crimes
2,600 E MARTIN LUTHER KING JR BL,,34.0111,-118.2653,08/21/2020 12:00:00 AM,08/21/2020 12:00:00 AM,605.0,13.0,Newton,1343.0,2.0,0329 1202,67.0,M,B,501.0,SINGLE FAMILY DWELLING,,,IC,Invest Cont,Property Crimes
3,14900 ORO GRANDE ST,,34.2953,-118.4590,11/08/2020 12:00:00 AM,11/06/2020 12:00:00 AM,1800.0,19.0,Mission,1924.0,1.0,0329 1300,61.0,M,H,101.0,STREET,,,IC,Invest Cont,Property Crimes
4,7100 S VERMONT AV,,33.9787,-118.2918,02/25/2020 12:00:00 AM,02/25/2020 12:00:00 AM,1130.0,12.0,77th Street,1245.0,1.0,0416 0945 1822 0400 0417 0344,0.0,X,X,401.0,MINI-MART,400.0,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",IC,Invest Cont,Property Crimes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,4600 MASCOT ST,,34.0409,-118.3408,06/05/2020 12:00:00 AM,06/04/2020 12:00:00 AM,2100.0,7.0,Wilshire,775.0,1.0,,0.0,,,101.0,STREET,,,IC,Invest Cont,
24996,2200 E 7TH ST,,34.0347,-118.2253,12/02/2020 12:00:00 AM,11/25/2020 12:00:00 AM,1530.0,4.0,Hollenbeck,471.0,1.0,1300 0325,0.0,X,X,116.0,OTHER/OUTSIDE,,,IC,Invest Cont,
24997,LANGDON AV,TUPPER ST,34.2392,-118.4698,11/21/2020 12:00:00 AM,11/21/2020 12:00:00 AM,2100.0,19.0,Mission,1961.0,2.0,0913 1817 0416,38.0,M,H,101.0,STREET,400.0,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",AA,Adult Arrest,
24998,400 E 5TH ST,,34.0453,-118.2443,03/01/2020 12:00:00 AM,02/29/2020 12:00:00 AM,2335.0,1.0,Central,147.0,2.0,0416,41.0,M,B,502.0,"MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)",500.0,UNKNOWN WEAPON/OTHER WEAPON,AO,Adult Other,


In [342]:
df.isna().mean() * 100

Location                  0.000
Cross_Street             82.968
Latitude                  0.000
Longitude                 0.000
Date_Reported             0.000
Date_Occurred             0.000
Time_Occurred             0.000
Area_ID                   0.000
Area_Name                 0.000
Reporting_District_no     0.000
Part 1-2                  0.000
Modus_Operandi           13.700
Victim_Age                0.000
Victim_Sex               13.068
Victim_Descent           13.068
Premise_Code              0.000
Premise_Description       0.024
Weapon_Used_Code         63.272
Weapon_Description       63.272
Status                    0.000
Status_Description        0.000
Crime_Category           20.000
dtype: float64
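
The 20% missing rate for `Crime_Category` comes from the test rows, whose target was set to `None` before concatenation. For the feature columns, the list of columns to drop can be derived from a threshold instead of being typed by hand. The following is a minimal sketch on a made-up toy frame; the helper name `columns_above_missing_threshold` and the 0.5 cutoff are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the combined data (values are made up).
toy = pd.DataFrame({
    "Cross_Street": [np.nan, "ALAMEDA ST", np.nan, np.nan],
    "Victim_Sex": ["M", "M", np.nan, "X"],
    "Victim_Age": [75.0, 41.0, 67.0, 0.0],
})

def columns_above_missing_threshold(frame, threshold=0.5):
    """Return the names of columns whose missing fraction exceeds `threshold`."""
    missing_fraction = frame.isna().mean()
    return missing_fraction[missing_fraction > threshold].index.tolist()

sparse_cols = columns_above_missing_threshold(toy, threshold=0.5)
print(sparse_cols)
```

On the real data the returned list could be passed to `df.drop(columns=...)`, after excluding the target column if needed.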

In [343]:
# dropping columns with more than 50% missing data
df.drop(columns=['Cross_Street','Weapon_Used_Code','Weapon_Description'],inplace=True)
columns_to_drop.extend(['Cross_Street','Weapon_Used_Code','Weapon_Description'])
df.shape

(25000, 19)

In [344]:
# dropping description columns that duplicate their ID/code columns (Area_ID, Premise_Code, Status)
df.drop(columns=['Area_Name','Premise_Description','Status_Description'],inplace=True)
columns_to_drop.extend(['Area_Name','Premise_Description','Status_Description'])
df.shape

(25000, 16)

In [345]:
df.nunique()

Location                 14456
Latitude                  3814
Longitude                 3780
Date_Reported              874
Date_Occurred              366
Time_Occurred             1079
Area_ID                     21
Reporting_District_no     1124
Part 1-2                     2
Modus_Operandi           12797
Victim_Age                 100
Victim_Sex                   4
Victim_Descent              18
Premise_Code               226
Status                       5
Crime_Category               6
dtype: int64

In [346]:
print(df.groupby('Part 1-2')['Crime_Category'].nunique())

Part 1-2
1.0    4
2.0    6
Name: Crime_Category, dtype: int64
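
The `nunique` output says how many categories occur in each `Part 1-2` group, but not which ones or how often. A cross-tabulation shows that directly. This is a sketch on toy data; the category labels `A`, `B`, `C` are placeholders, not the real class names.

```python
import pandas as pd

# Toy data: labels "A", "B", "C" are placeholders for the real crime categories.
toy = pd.DataFrame({
    "Part 1-2": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
    "Crime_Category": ["A", "A", "B", "B", "C", "C"],
})

# Raw counts of each category within each Part 1-2 value.
counts = pd.crosstab(toy["Part 1-2"], toy["Crime_Category"])

# Row-normalised shares: each row sums to 1.
shares = pd.crosstab(toy["Part 1-2"], toy["Crime_Category"], normalize="index")

print(counts)
print(shares.round(2))
```

A zero in `counts` marks a category that never appears for that `Part 1-2` value, which is the kind of relationship the `nunique` output hints at.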


### Preprocessing

In [347]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

In [348]:
from sklearn.base import BaseEstimator, TransformerMixin
class DropColumns(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.drop(columns=self.columns)
    
class FlattenAndCountVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.vectorizer = CountVectorizer()

    def fit(self, X, y=None):
        # Ensure X is a 1D array of strings
        if isinstance(X, pd.DataFrame):
            X = X.iloc[:, 0].values.flatten()
        else:
            X = X.flatten()
        self.vectorizer.fit(X)
        return self

    def transform(self, X):
        # Ensure X is a 1D array of strings
        if isinstance(X, pd.DataFrame):
            X = X.iloc[:, 0].values.flatten()
        else:
            X = X.flatten()
        X_vec = self.vectorizer.transform(X)
        return pd.DataFrame(X_vec.toarray(), columns=self.vectorizer.get_feature_names_out())

In [349]:
X = train_data.drop('Crime_Category',axis=1)
y = train_data['Crime_Category']
X.head()

Unnamed: 0,Location,Cross_Street,Latitude,Longitude,Date_Reported,Date_Occurred,Time_Occurred,Area_ID,Area_Name,Reporting_District_no,Part 1-2,Modus_Operandi,Victim_Age,Victim_Sex,Victim_Descent,Premise_Code,Premise_Description,Weapon_Used_Code,Weapon_Description,Status,Status_Description
0,4500 CARPENTER AV,,34.1522,-118.391,03/09/2020 12:00:00 AM,03/06/2020 12:00:00 AM,1800.0,15.0,N Hollywood,1563.0,1.0,0385,75.0,M,W,101.0,STREET,,,IC,Invest Cont
1,45TH ST,ALAMEDA ST,34.0028,-118.2391,02/27/2020 12:00:00 AM,02/27/2020 12:00:00 AM,1345.0,13.0,Newton,1367.0,1.0,0906 0352 0371 0446 1822 0344 0416 0417,41.0,M,H,216.0,SWAP MEET,400.0,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",IC,Invest Cont
2,600 E MARTIN LUTHER KING JR BL,,34.0111,-118.2653,08/21/2020 12:00:00 AM,08/21/2020 12:00:00 AM,605.0,13.0,Newton,1343.0,2.0,0329 1202,67.0,M,B,501.0,SINGLE FAMILY DWELLING,,,IC,Invest Cont
3,14900 ORO GRANDE ST,,34.2953,-118.459,11/08/2020 12:00:00 AM,11/06/2020 12:00:00 AM,1800.0,19.0,Mission,1924.0,1.0,0329 1300,61.0,M,H,101.0,STREET,,,IC,Invest Cont
4,7100 S VERMONT AV,,33.9787,-118.2918,02/25/2020 12:00:00 AM,02/25/2020 12:00:00 AM,1130.0,12.0,77th Street,1245.0,1.0,0416 0945 1822 0400 0417 0344,0.0,X,X,401.0,MINI-MART,400.0,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",IC,Invest Cont


In [350]:
columns_to_drop = ['Location','Cross_Street','Latitude','Longitude','Date_Reported','Date_Occurred','Area_Name','Premise_Description','Weapon_Description','Status_Description']
X[columns_to_drop].shape

(20000, 10)

In [351]:
num_cols = ['Victim_Age','Time_Occurred']
cat_cols = ['Area_ID','Reporting_District_no','Victim_Sex','Victim_Descent','Premise_Code','Status','Part 1-2','Weapon_Used_Code']
mulLable_cols = ['Modus_Operandi']

In [352]:
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
mulLable_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='')),
    ('vectorizer', FlattenAndCountVectorizer()),
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols),
        ('cat', cat_transformer, cat_cols),
        ('mlb', mulLable_transformer, mulLable_cols)
    ]
)
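
The sketch below runs a reduced version of this preprocessor on a four-row toy frame to show how the three branches combine into one feature matrix. It is an illustration under assumptions, not the notebook's exact setup: it passes the text column to `ColumnTransformer` as a single string (so `CountVectorizer` receives a 1D series) and fills missing codes with pandas beforehand, instead of using the custom `FlattenAndCountVectorizer`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one column per branch (values are illustrative).
toy = pd.DataFrame({
    "Victim_Age": [75.0, 41.0, np.nan, 61.0],
    "Victim_Sex": ["M", "M", np.nan, "X"],
    "Modus_Operandi": ["0385", "0329 1202", None, "0329 1300"],
})
# Fill missing code strings up front so the vectorizer only sees strings.
toy["Modus_Operandi"] = toy["Modus_Operandi"].fillna("")

toy_num = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
toy_cat = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", toy_num, ["Victim_Age"]),
    ("cat", toy_cat, ["Victim_Sex"]),
    # A bare string (not a list) selects a 1D column, which text vectorizers expect.
    ("mo", CountVectorizer(), "Modus_Operandi"),
])

features = toy_preprocessor.fit_transform(toy)
print(features.shape)
```

The output has one scaled numeric column, one column per observed `Victim_Sex` value, and one column per distinct modus operandi code.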

In [353]:
model_pipeline = Pipeline(steps=[
    ('drop_columns', DropColumns(columns=columns_to_drop)),
    ('preprocessor', preprocessor),
])
model_pipeline

In [354]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [355]:
model_pipeline.fit(X_train)
X_train_trans = model_pipeline.transform(X_train)
X_val_trans = model_pipeline.transform(X_val)

### Feature Engineering
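
The notebook currently drops `Date_Reported` and `Date_Occurred` without deriving anything from them. One possible direction, sketched here on two rows copied from the sample output, is to extract calendar features and the reporting delay. The new column names and the reading of `Time_Occurred` as a 24-hour `HHMM` value are assumptions, not something the notebook states.

```python
import pandas as pd

# Two rows copied from the sample output above.
toy = pd.DataFrame({
    "Date_Reported": ["03/09/2020 12:00:00 AM", "02/27/2020 12:00:00 AM"],
    "Date_Occurred": ["03/06/2020 12:00:00 AM", "02/27/2020 12:00:00 AM"],
    "Time_Occurred": [1800.0, 1345.0],
})

def add_time_features(frame):
    """Return a copy with hypothetical date/time features added."""
    out = frame.copy()
    fmt = "%m/%d/%Y %I:%M:%S %p"
    reported = pd.to_datetime(out["Date_Reported"], format=fmt)
    occurred = pd.to_datetime(out["Date_Occurred"], format=fmt)
    # Days between the incident and the report.
    out["Report_Delay_Days"] = (reported - occurred).dt.days
    out["Occurred_Month"] = occurred.dt.month
    out["Occurred_DayOfWeek"] = occurred.dt.dayofweek  # Monday = 0
    # Assumes Time_Occurred encodes HHMM, e.g. 1345.0 means 13:45.
    out["Occurred_Hour"] = (out["Time_Occurred"] // 100).astype(int)
    return out

engineered = add_time_features(toy)
print(engineered[["Report_Delay_Days", "Occurred_Month", "Occurred_DayOfWeek", "Occurred_Hour"]])
```

If features like these were adopted, they would need to be added to `num_cols` or `cat_cols` and the source date columns kept out of `columns_to_drop` until after extraction.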

### Hyperparameter tuning

In [356]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.metrics import accuracy_score

In [357]:
models = {
    'RandomForest': RandomForestClassifier(),
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'SVC': SVC(),
    'KNN': KNeighborsClassifier(),
    # 'NaiveBayes': GaussianNB(),
    'DecisionTreeClassifier': DecisionTreeClassifier(),
    'GradientBoostingClassifier': GradientBoostingClassifier(),
}
best_model = None
best_accuracy = 0
best_model_name = ""
for model_name, model in models.items():
    model.fit(X_train_trans, y_train)
    val_predictions = model.predict(X_val_trans)
    accuracy = accuracy_score(y_val, val_predictions)
    print(f"Validation Accuracy for {model_name}: {accuracy}\n")
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = model
        best_model_name = model_name
print(f"Best Model: {best_model_name}")
print(f"Best Model Accuracy: {best_accuracy}")

Validation Accuracy for RandomForest: 0.791

Validation Accuracy for LogisticRegression: 0.75325
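
A single hold-out split gives one accuracy number per model, which can shift with the random seed. As a hedged sketch of a more stable comparison, the block below scores a few classifiers with 5-fold cross-validation and reports mean and standard deviation. It uses synthetic data from `make_classification` as a stand-in for `X_train_trans` and `y_train`, so the numbers it prints say nothing about the crime dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the transformed training data (not the crime data).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}

cv_summary = {}
for name, estimator in candidates.items():
    # With a classifier and an integer cv, scikit-learn uses stratified folds.
    scores = cross_val_score(estimator, X_toy, y_toy, cv=5, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing the spread across folds alongside the mean helps judge whether a gap between two models is larger than the fold-to-fold noise.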



#### Hyperparameter tuning with grid search

In [None]:
# from sklearn.svm import SVC
# from sklearn.model_selection import GridSearchCV
# from sklearn.metrics import accuracy_score

# param_grid = {
#     'C': [0.1, 1, 10, 100],
#     'gamma': ['scale', 'auto'],
#     'kernel': ['linear', 'rbf', 'poly', 'sigmoid']
# }

# # Perform grid search with SVC
# grid_search = GridSearchCV(estimator=SVC(), param_grid=param_grid, cv=2, n_jobs=-1, verbose=2)
# grid_search.fit(X_train_trans, y_train)

# # Get the best model
# best_model = grid_search.best_estimator_

# print('best params: ', grid_search.best_params_)

# # Make predictions on the validation data
# val_predictions = best_model.predict(X_val_trans)

# # Evaluate the best model
# accuracy = accuracy_score(y_val, val_predictions)
# print("Best Model Validation Accuracy:", accuracy)
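
The commented-out search above tunes `SVC` on features that were transformed once, before cross-validation. An alternative, sketched below on synthetic data, wraps the scaler and the classifier in one `Pipeline` and searches over it, so preprocessing is refit inside each fold. The step names `scaler` and `clf` and the reduced grid are choices made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the crime data), kept small so the search is quick.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

svc_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", SVC()),
])

# Pipeline parameters are addressed as "<step name>__<parameter>".
param_grid = {
    "clf__C": [0.1, 1, 10],
    "clf__kernel": ["linear", "rbf"],
}

search = GridSearchCV(svc_pipeline, param_grid=param_grid, cv=3, scoring="accuracy")
search.fit(X_toy, y_toy)

print("best params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the whole pipeline refit on all the data passed to `fit`, so it can predict on raw (untransformed) inputs.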

### Submission

In [None]:
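# # model_pipeline only preprocesses, so transform the test set and predict with the fitted best_model
# test_predictions = best_model.predict(model_pipeline.transform(test_data))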

# submission = pd.DataFrame(
#     {
#         "ID": test_data.index,
#         "Crime_Category": test_predictions,
#     }
# )
# submission.to_csv("submission.csv", index=False)
# print(submission.shape)
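
In the notebook, `model_pipeline` contains only preprocessing steps, so it has no `predict` method of its own. One way to get a single object that goes from raw rows to predictions is to append the classifier as the last pipeline step. The sketch below shows this on a small made-up frame; the label `Other` is a placeholder, and the `ID`/`Crime_Category` layout follows the commented cell above rather than a confirmed submission format.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; "Other" is a placeholder label, not a real class name.
toy_train = pd.DataFrame({
    "Victim_Age": [75.0, 41.0, 67.0, 61.0, 0.0, 38.0],
    "Status": ["IC", "IC", "IC", "IC", "AA", "AO"],
    "Crime_Category": ["Property Crimes"] * 3 + ["Other"] * 3,
})
toy_test = pd.DataFrame({"Victim_Age": [70.0, 30.0], "Status": ["IC", "AA"]})

full_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["Victim_Age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Status"]),
    ])),
    # The estimator is the final step, so the pipeline exposes predict().
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
full_pipeline.fit(toy_train.drop(columns="Crime_Category"), toy_train["Crime_Category"])

toy_submission = pd.DataFrame({
    "ID": toy_test.index,
    "Crime_Category": full_pipeline.predict(toy_test),
})
print(toy_submission.shape)
```

With this layout, the same fitted object handles imputation, encoding, and prediction for the test set, which avoids applying a transform step by hand before submission.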