# **San Francisco Crime Project**

- **Author:** Muhammad Jawad [@mj-awad17](https://github.com/mj-awad17)
- **Description:** Data analysis, exploration, visualization, and data mining on crime in SF
- **Original dataset:** [SF Gov Crime dataset](https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-Historical-2003/tmnf-yvry/about_data)
- **Kaggle dataset:** [Kaggle SF Crime](https://www.kaggle.com/competitions/sf-crime/overview)

<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 20px; background-color: #f9f9f9; font-family: Arial, sans-serif; color: #333;">
    <h2 style="color: #4CAF50;">Author Overview</h2>
    <!-- Round image centered -->
    <img src="https://avatars.githubusercontent.com/u/77524488?s=400&u=5ee60100c5daf1eb876be2bc80aaa0e9e85969c3&v=4" alt="Author Image" style="border-radius: 12%; width: 200px; height: 200px; margin-bottom: 20px; border: 2px solid green; display: block; margin-left: auto; margin-right: auto;">
    <p>I am <strong>Muhammad Jawad</strong>, a passionate data analyst dedicated to leveraging data to drive meaningful insights and support decision-making. With a strong foundation in computer science, I continuously seek to enhance my skills in analytical thinking and data-driven solutions. 📊</p>
    <h3 style="color: #4CAF50;">What Do I Know?</h3>
    <p>I excel at extracting valuable insights from data using Python, SQL, and data visualization tools like Power BI and Tableau. My experience includes analyzing insurance and medical data, as well as performing exploratory data analysis to uncover trends and patterns. I'm committed to improving my statistical analysis and reporting abilities to solve complex problems effectively. 📚</p>
    <h3 style="color: #4CAF50;">What Am I Doing Right Now?</h3>
    <p>Currently, I am focused on expanding my knowledge in data science, particularly in machine learning and advanced analytics. I am eager to apply my theoretical knowledge to real-world projects that challenge me and contribute to organizational success. 🎯</p>
    <h3 style="color: #4CAF50;">My Goal:</h3>
    <p>To utilize my data analysis and data science skills to create value for companies by supporting their growth objectives. I am enthusiastic about learning new concepts and sharing my experiences with others. If you’re interested in discussing projects, collaborating, or exchanging ideas, I would love to connect!</p>
</div>

# Table of Contents
- Introduction
    - SF Crime Dataset
- Basic Preparation
    - Import libraries
    - Load data
- Data Exploration/Analysis Extension
- Data Preprocessing
    - Data Imputation/Removal
    - Feature Engineering
    - Feature Encoding
- Build Machine Learning Models
    - Train different baseline models
    - Analyze results
- Model Selection
- Hyperparameter tuning
- Train Model with optimal hyperparameters
- Feature Selection
    - Feature Importance
    - Feature Removal
- Train Final Model
- Model Evaluation
- Summary
<!-- - Kaggle Submission -->
- Conclusion

# Introduction

## SF Crime Dataset

This dataset includes information about crime incidents reported by the **San Francisco Police Department (SFPD)**. It covers data from _January 1, 2003, to May 13, 2015_.

The dataset is split into a **training set** and a **test set** that alternate by week: odd-numbered weeks (1, 3, 5, 7, ...) belong to the test set, and even-numbered weeks (2, 4, 6, 8, ...) belong to the training set.
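
The weekly rotation can be illustrated with a small sketch. It assumes that "week" means the ISO calendar week number, which the dataset description does not state explicitly, and it uses two hypothetical timestamps rather than real rows. The `sample` frame and the `week` and `split` columns are illustrative names, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample timestamps; real values would come from the `Dates` column.
sample = pd.DataFrame({
    "Dates": pd.to_datetime([
        "2015-05-13 23:53:00",  # a Wednesday in ISO week 20
        "2015-05-10 23:59:00",  # a Sunday in ISO week 19
    ])
})

# Assumption: the rotation is based on the ISO calendar week number.
sample["week"] = sample["Dates"].dt.isocalendar().week.astype(int)

# Odd weeks go to the test set, even weeks go to the training set.
sample["split"] = np.where(sample["week"] % 2 == 1, "test", "train")

print(sample)
```

If the competition numbered weeks differently (for example, counting from the first day in the data), only the line that computes `week` would need to change.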

The **objective** is to predict the category of each crime incident in San Francisco from the other available fields.

### Data Fields
- **Dates** - timestamp of the crime incident
- **Category** - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
- **Descript** - detailed description of the crime incident (only in train.csv)
- **DayOfWeek** - the day of the week
- **PdDistrict** - name of the Police Department District
- **Resolution** - how the crime incident was resolved (only in train.csv)
- **Address** - the approximate street address of the crime incident
- **X** - Longitude
- **Y** - Latitude

---

In this Jupyter notebook, I will take you through the entire process of creating a machine learning model using the open-source San Francisco Crime dataset. This will be a step-by-step journey that includes several important stages.

First, I will explore and analyze the data to understand its structure and contents. This is a crucial step that helps identify patterns and insights within the data. Next, I will preprocess the data, which is a significant part of this project. This step involves cleaning the data and performing feature engineering to create useful variables for the model.

After preparing the data, I will try out different machine learning algorithms to see which one works best for this dataset. I will select the most effective model and then fine-tune its hyperparameters to improve its performance. Finally, I will evaluate the chosen model with multiclass log loss, a metric that scores the predicted class probabilities (lower is better) and heavily penalizes confident wrong predictions.
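
To make the evaluation metric concrete, here is a minimal sketch of multiclass log loss: the average negative log of the probability the model assigns to the true class of each sample. The helper name `multiclass_log_loss` and the toy arrays are hypothetical and only for illustration; in the notebook itself the metric comes from `sklearn.metrics.log_loss`.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    # Clip to avoid log(0), then renormalize so each row still sums to 1.
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)
    y_true = np.asarray(y_true)
    # Pick, for every row, the probability predicted for its true class.
    true_class_proba = proba[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(true_class_proba)))

# Toy example with 3 samples and 3 classes (class indices 0, 1, 2).
y_true = np.array([0, 1, 2])
uniform = np.full((3, 3), 1 / 3)          # no information: equal probabilities
confident = np.array([[0.8, 0.1, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.1, 0.1, 0.8]])   # mostly correct and confident

uniform_loss = multiclass_log_loss(y_true, uniform)
confident_loss = multiclass_log_loss(y_true, confident)
print(uniform_loss, confident_loss)
```

A model that predicts equal probabilities for all K classes scores log(K), which gives a useful reference point: any trained model should achieve a lower value than that.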

Since this project is based on an older Kaggle competition, I will avoid looking for external resources or past Kaggle notebooks. My goal is to enhance my coding skills for an end-to-end data science project and to become more familiar with Python data science libraries. I also hope to uncover interesting insights and discover cool patterns while working with this dataset. So, let’s get started!

# Basic Preparation

## Import Libraries

In [None]:
__author__ = "Muhammad Jawad (https://github.com/mj-awad17)"

# linear algebra
import numpy as np

# data manipulation
import pandas as pd

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import style
%matplotlib inline
style.use('ggplot')

# preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder

# machine learning algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression, SGDClassifier, Perceptron
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
import xgboost as xgb

# model evaluation metrics
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_score

# model selection and tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedGroupKFold
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

# clustering
from sklearn.cluster import KMeans

# mathematical functions
import math

# ignore warnings
import warnings
warnings.filterwarnings("ignore")