# **San Francisco Crime Project**

- **Author:** Muhammad Jawad [@mjawad17]()
- **Description:** Data analysis, exploration, visualization, and data mining on crime in SF
- **Original dataset:** [SF Gov Crime dataset](https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-Historical-2003/tmnf-yvry/about_data)
- **Kaggle dataset:** [Kaggle SF Crime](https://www.kaggle.com/competitions/sf-crime/overview)

<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 20px; background-color: #f9f9f9; font-family: Arial, sans-serif; color: #333;">
    <h2 style="color: #4CAF50;">Author Overview</h2>
    <!-- Round image centered -->
    <img src="https://avatars.githubusercontent.com/u/77524488?s=400&u=5ee60100c5daf1eb876be2bc80aaa0e9e85969c3&v=4" alt="Author Image" style="border-radius: 12%; width: 200px; height: 200px; margin-bottom: 20px; border: 2px solid green; display: block; margin-left: auto; margin-right: auto;">
    <p>I am <strong>Muhammad Jawad</strong>, a passionate data analyst dedicated to leveraging data to drive meaningful insights and support decision-making. With a strong foundation in computer science, I continuously seek to enhance my skills in analytical thinking and data-driven solutions. 📊</p>
    <h3 style="color: #4CAF50;">What Do I Know?</h3>
    <p>I excel at extracting valuable insights from data using Python, SQL, and data visualization tools like Power BI and Tableau. My experience includes analyzing insurance and medical data, as well as performing exploratory data analysis to uncover trends and patterns. I'm committed to improving my statistical analysis and reporting abilities to solve complex problems effectively. 📚</p>
    <h3 style="color: #4CAF50;">What Am I Doing Right Now?</h3>
    <p>Currently, I am focused on expanding my knowledge in data science, particularly in machine learning and advanced analytics. I am eager to apply my theoretical knowledge to real-world projects that challenge me and contribute to organizational success. 🎯</p>
    <h3 style="color: #4CAF50;">My Goal:</h3>
    <p>To utilize my data analysis and data science skills to create value for companies by supporting their growth objectives. I am enthusiastic about learning new concepts and sharing my experiences with others. If you’re interested in discussing projects, collaborating, or exchanging ideas, I would love to connect!</p>
</div>

# Table of Contents
- Introduction
    - SF Crime Dataset
- Basic Preparation
    - Import libraries
    - Load data
- Data Exploration/Analysis Extension
- Data Preprocessing
    - Data Imputation/Removal
    - Feature Engineering
    - Feature Encoding
- Build Machine Learning Models
    - Train different baseline models
    - Analyze results
- Model Selection
- Hyperparameter tuning
- Train Model with optimal hyperparameters
- Feature Selection
    - Feature Importance
    - Feature Removal
- Train Final Model
- Model Evaluation
- Summary
<!-- - Kaggle Submission -->
- Conclusion

# Introduction

## SF Crime Dataset

This dataset includes information about crime incidents reported by the **San Francisco Police Department (SFPD)**. It covers data from _January 1, 2003, to May 13, 2015_.

The dataset is divided into **two groups**, a training set and a test set, which rotate weekly: odd-numbered weeks (1st, 3rd, 5th, 7th, ...) belong to the test set, and even-numbered weeks (2nd, 4th, 6th, 8th, ...) belong to the training set.
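The weekly rotation can be illustrated with a short sketch. It assumes weeks are counted in consecutive 7-day blocks starting from the first date covered by the dataset; the exact week boundaries used for the competition are not stated here, so `week_number` and `assign_split` are hypothetical helpers, not the actual splitting logic.

```python
from datetime import date

# Assumption: week 1 starts on the first day covered by the dataset.
START = date(2003, 1, 1)

def week_number(day: date, start: date = START) -> int:
    """Return the 1-based index of the 7-day block that contains `day`."""
    return (day - start).days // 7 + 1

def assign_split(day: date) -> str:
    """Odd weeks go to the test set, even weeks to the training set."""
    return "test" if week_number(day) % 2 == 1 else "train"

for d in (date(2003, 1, 1), date(2003, 1, 8), date(2003, 1, 15)):
    print(d, week_number(d), assign_split(d))
```

One practical consequence of this layout is that training and test rows are interleaved in time, so a model never has to extrapolate far beyond the period it was trained on.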

The main **objective** of this dataset is to predict the category of crime that took place in San Francisco based on the available information.

### Data Fields
- **Dates** - timestamp of the crime incident
- **Category** - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
- **Descript** - detailed description of the crime incident (only in train.csv)
- **DayOfWeek** - the day of the week
- **PdDistrict** - name of the Police Department District
- **Resolution** - how the crime incident was resolved (only in train.csv)
- **Address** - the approximate street address of the crime incident
- **X** - Longitude
- **Y** - Latitude

---

In this Jupyter notebook, I will take you through the entire process of creating a machine learning model using the open-source San Francisco Crime dataset. This will be a step-by-step journey that includes several important stages.

First, I will explore and analyze the data to understand its structure and contents. This is a crucial step that helps identify patterns and insights within the data. Next, I will preprocess the data, which is a significant part of this project. This step involves cleaning the data and performing feature engineering to create useful variables for the model.

After preparing the data, I will try out different machine learning algorithms to see which one works best for this dataset. I will determine the most effective model and then fine-tune its hyperparameters to improve its performance. Finally, I will evaluate the chosen model using a metric called multiclass log loss to assess how well it predicts the categories of crime.
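For reference, multiclass log loss averages the negative log of the probability the model assigned to each sample's true class, so confident wrong predictions are penalized heavily. The sketch below is a plain-Python illustration with made-up probabilities; `multiclass_log_loss` is a hypothetical helper, and in the notebook itself the imported `sklearn.metrics.log_loss` would normally be used instead.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of true class indices
    y_prob: list of per-class probability lists, one per sample
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip to avoid log(0) when a model is confidently wrong.
        p = min(max(probs[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# Toy example with three classes (values are illustrative only).
y_true = [0, 2]
y_prob = [[0.7, 0.2, 0.1],
          [0.1, 0.3, 0.6]]
print(multiclass_log_loss(y_true, y_prob))
```

A model that predicts a uniform distribution over K classes scores ln(K) under this metric, which makes a convenient baseline when reading the results later on.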

Since this project is based on an older Kaggle competition, I will avoid looking for external resources or past Kaggle notebooks. My goal is to enhance my coding skills for an end-to-end data science project and to become more familiar with Python data science libraries. I also hope to uncover interesting insights and discover cool patterns while working with this dataset. So, let’s get started!

# Basic Preparation

## Import Libraries

In [1]:
pip install scikit-optimize

Note: you may need to restart the kernel to use updated packages.


In [2]:
__author__ = "Muhammad Jawad (https://github.com/mj-awad17)"

# linear algebra
import numpy as np

# data manipulation
import pandas as pd

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import style
%matplotlib inline
style.use('ggplot')

# preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder

# machine learning algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression, SGDClassifier, Perceptron
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
import xgboost as xgb

# model evaluation metrics
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_score

# model selection and tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedGroupKFold
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

# clustering
from sklearn.cluster import KMeans

# mathematical functions
import math

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

## Load Data

In [4]:
train = pd.read_csv('../../All/train.csv')
test = pd.read_csv('../../All/test.csv')

In [6]:
# info of train data
print("Train Data Info:")
print(train.info())

Train Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878049 entries, 0 to 878048
Data columns (total 9 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   Dates       878049 non-null  object 
 1   Category    878049 non-null  object 
 2   Descript    878049 non-null  object 
 3   DayOfWeek   878049 non-null  object 
 4   PdDistrict  878049 non-null  object 
 5   Resolution  878049 non-null  object 
 6   Address     878049 non-null  object 
 7   X           878049 non-null  float64
 8   Y           878049 non-null  float64
dtypes: float64(2), object(7)
memory usage: 60.3+ MB
None


# Data Exploration & Analysis Extension

- Complete data exploration & visualizations are done in jupyter notebook: [sf-crime-data-exploration.ipynb](sf-crime-exploration.ipynb)
- This dataset suffers from **imbalanced** classes (TREA has 6 occurrences while LARCENY/THEFT has 174,900 occurrences)
    - There are several ways to deal with **imbalanced** classes, such as:
        - Changing the performance metric (avoid accuracy; use a confusion matrix, precision, recall, F1 score, or ROC curves)
        - Resampling the dataset (oversample under-represented classes and undersample over-represented classes)
        - Trying ML algorithms that can handle imbalanced classes
            - Tree-based models (Random Forests/XGBoost) often perform well on imbalanced classes (due to their splitting rules)
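Another common option, alongside the ones listed above, is to weight each class inversely to its frequency so that rare categories count for more during training. The sketch below computes such weights by hand using the heuristic n_samples / (n_classes * class_count), which is the formula scikit-learn documents for `class_weight='balanced'`. The toy labels and the helper name `balanced_class_weights` are illustrative assumptions and are not part of this notebook.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Toy label list: one frequent class and one rare class (made-up data).
labels = ["LARCENY/THEFT"] * 8 + ["TREA"] * 2
weights = balanced_class_weights(labels)
print(weights)
```

With real counts as skewed as 6 versus 174,900, weights like these become extreme, so it may be worth checking whether they actually improve log loss before adopting them.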

In [9]:
train.head()

| | Dates | Category | Descript | DayOfWeek | PdDistrict | Resolution | Address | X | Y |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-05-13 23:53:00 | WARRANTS | WARRANT ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
| 1 | 2015-05-13 23:53:00 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.425892 | 37.774599 |
| 2 | 2015-05-13 23:33:00 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | VANNESS AV / GREENWICH ST | -122.424363 | 37.800414 |
| 3 | 2015-05-13 23:30:00 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Wednesday | NORTHERN | NONE | 1500 Block of LOMBARD ST | -122.426995 | 37.800873 |
| 4 | 2015-05-13 23:30:00 | LARCENY/THEFT | GRAND THEFT FROM LOCKED AUTO | Wednesday | PARK | NONE | 100 Block of BRODERICK ST | -122.438738 | 37.771541 |


In [13]:
train.columns.values

array(['Dates', 'Category', 'Descript', 'DayOfWeek', 'PdDistrict',
       'Resolution', 'Address', 'X', 'Y'], dtype=object)

### Things We Observed So Far:
- 878,049 instances in training set (or recorded crime instances in SF)
- 9 columns (8 potential features + 1 label (Category))

**Data types:**

- 2 columns with float values
- 7 objects

There are no null (NaN) values (Yaha!)

In [21]:
# Count number of observations for each crime 
train['Category'].value_counts()

Category
LARCENY/THEFT                  174900
OTHER OFFENSES                 126182
NON-CRIMINAL                    92304
ASSAULT                         76876
DRUG/NARCOTIC                   53971
VEHICLE THEFT                   53781
VANDALISM                       44725
WARRANTS                        42214
BURGLARY                        36755
SUSPICIOUS OCC                  31414
MISSING PERSON                  25989
ROBBERY                         23000
FRAUD                           16679
FORGERY/COUNTERFEITING          10609
SECONDARY CODES                  9985
WEAPON LAWS                      8555
PROSTITUTION                     7484
TRESPASS                         7326
STOLEN PROPERTY                  4540
SEX OFFENSES FORCIBLE            4388
DISORDERLY CONDUCT               4320
DRUNKENNESS                      4280
RECOVERED VEHICLE                3138
KIDNAPPING                       2341
DRIVING UNDER THE INFLUENCE      2268
RUNAWAY                          1946
LIQ

In [22]:
# Count number of observations for each PdDistrict
train['PdDistrict'].value_counts()

PdDistrict
SOUTHERN      157182
MISSION       119908
NORTHERN      105296
BAYVIEW        89431
CENTRAL        85460
TENDERLOIN     81809
INGLESIDE      78845
TARAVAL        65596
PARK           49313
RICHMOND       45209
Name: count, dtype: int64

In [23]:
# Count number of observations for each DayOfWeek
train['DayOfWeek'].value_counts()

DayOfWeek
Friday       133734
Wednesday    129211
Saturday     126810
Thursday     125038
Tuesday      124965
Monday       121584
Sunday       116707
Name: count, dtype: int64

In [25]:
# Count number of observations for each Resolution
train['Resolution'].value_counts()

Resolution
NONE                                      526790
ARREST, BOOKED                            206403
ARREST, CITED                              77004
LOCATED                                    17101
PSYCHOPATHIC CASE                          14534
UNFOUNDED                                   9585
JUVENILE BOOKED                             5564
COMPLAINANT REFUSES TO PROSECUTE            3976
DISTRICT ATTORNEY REFUSES TO PROSECUTE      3934
NOT PROSECUTED                              3714
JUVENILE CITED                              3332
PROSECUTED BY OUTSIDE AGENCY                2504
EXCEPTIONAL CLEARANCE                       1530
JUVENILE ADMONISHED                         1455
JUVENILE DIVERTED                            355
CLEARED-CONTACT JUVENILE FOR MORE INFO       217
PROSECUTED FOR LESSER OFFENSE                 51
Name: count, dtype: int64

In [27]:
# describe the train data
train.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| X | 878049.0 | -122.422616 | 0.030354 | -122.513642 | -122.432952 | -122.41642 | -122.406959 | -120.5 |
| Y | 878049.0 | 37.77102 | 0.456893 | 37.707879 | 37.752427 | 37.775421 | 37.784369 | 90.0 |


> The maximum values reveal invalid coordinates: a latitude of 90 and a longitude of -120.5 do not correspond to any location in San Francisco. We must fix these values before using X and Y as features.
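One possible way to handle these rows is sketched below: flag coordinates that fall outside a rough bounding box for San Francisco and replace them with the median coordinate of the same police district. The bounding-box limits and the helper name `fix_coordinates` are assumptions made for illustration, and the toy frame is made-up data rather than the real training set.

```python
import numpy as np
import pandas as pd

# Rough bounding box for San Francisco (an assumption; adjust as needed).
LON_MIN, LON_MAX = -122.52, -122.35
LAT_MIN, LAT_MAX = 37.70, 37.83

def fix_coordinates(df: pd.DataFrame) -> pd.DataFrame:
    """Replace out-of-range X/Y values with the median of the same PdDistrict."""
    df = df.copy()
    invalid = ~(df["X"].between(LON_MIN, LON_MAX) & df["Y"].between(LAT_MIN, LAT_MAX))
    # Mark invalid coordinates as missing so they do not distort the medians.
    df.loc[invalid, ["X", "Y"]] = np.nan
    for col in ("X", "Y"):
        district_median = df.groupby("PdDistrict")[col].transform("median")
        df[col] = df[col].fillna(district_median)
    return df

# Tiny made-up frame with one placeholder row like the one seen in describe().
toy = pd.DataFrame({
    "PdDistrict": ["NORTHERN", "NORTHERN", "NORTHERN"],
    "X": [-122.425892, -122.424363, -120.5],
    "Y": [37.774599, 37.800414, 90.0],
})
print(fix_coordinates(toy))
```

If every row of a district were invalid, the district median would itself be missing and those rows would stay as NaN, so dropping such rows is a reasonable fallback.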

# Data Preprocessing

- Data cleaning
    - imputation or removal of outlier values
- Feature Engineering (Feature Creation)
- Feature Encoding
    - Integer encode or label encode ordinal categorical features that maintain order (Year, Business Quarter, Block/Street Number)
    - Usually:
        - One hot encode nominal categorical features (DayOfWeek, PdDistrict, StreetType, Category) 
            - mainly for logistic regression
        - However, tree-based models such as Random Forests and boosting algorithms are generally much less sensitive to how nominal categorical features are encoded, so we simply integer encode these features.
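The difference between the two encodings can be seen on a tiny made-up sample. This sketch uses pandas (`pd.get_dummies` and categorical codes) as a stand-in for the `OneHotEncoder` and `LabelEncoder` classes imported earlier, so it illustrates the idea rather than the exact calls used later in this notebook.

```python
import pandas as pd

# Made-up sample of two nominal columns from the dataset.
toy = pd.DataFrame({
    "DayOfWeek": ["Wednesday", "Friday", "Wednesday"],
    "PdDistrict": ["NORTHERN", "PARK", "NORTHERN"],
})

# One-hot encoding: one binary column per category value.
one_hot = pd.get_dummies(toy, columns=["DayOfWeek", "PdDistrict"])

# Integer encoding: each category value is mapped to an arbitrary integer code.
integer_encoded = toy.copy()
for col in ("DayOfWeek", "PdDistrict"):
    integer_encoded[col] = integer_encoded[col].astype("category").cat.codes

print(one_hot.columns.tolist())
print(integer_encoded)
```

One-hot encoding widens the frame by one column per category value, while integer encoding keeps a single column but introduces an arbitrary ordering, which matters for linear models and much less for tree splits.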

## Data Imputation/Removal

## Feature Engineering


## Feature Encoding