# Crime Rate Classification - San Francisco

## Abstract
### 1. Problem Statement:
The main objective of this project is to apply Big Data technologies to machine learning. We work with the San Francisco Crime Classification dataset obtained from Kaggle. The goal is to develop a system that classifies crime descriptions into categories, which would help the authorities assign officers to incidents based on the report.

### 2. Solution:
There are many possible approaches to this problem; ours works directly with the crime dataset. We will train a model on the 39 predefined categories, test its accuracy, and deploy it into production. Given a new crime description, the system should assign it to one of the 39 categories. To solve this multi-class text classification problem, we will use various feature extraction techniques along with different supervised machine learning algorithms in PySpark.
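
As a rough illustration of the description-to-category setup described above, the sketch below chains a TF-IDF feature extractor with a linear classifier. It uses scikit-learn rather than PySpark to stay self-contained, and the short `descriptions` and `labels` lists are made-up placeholders, not rows from the Kaggle dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy examples (not taken from the Kaggle data).
descriptions = [
    "stolen bicycle from garage",
    "stolen wallet on bus",
    "car window smashed",
    "graffiti sprayed on wall",
]
labels = ["LARCENY/THEFT", "LARCENY/THEFT", "VANDALISM", "VANDALISM"]

# Feature extraction and classifier chained into a single estimator.
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
text_clf.fit(descriptions, labels)

# Assign a new description to one of the known categories.
predicted = text_clf.predict(["stolen phone"])
```

With the real data, the same structure would be trained on all 39 categories instead of the two placeholder labels used here.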

### 3. Project Goals:
We will train several models to classify crimes and compare their accuracy. This comparative analysis should show which model works best for this kind of dataset and problem.
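
One way such a comparison could be organized is sketched below. The synthetic data from `make_classification` is only a stand-in for the featurized crime data, and the two candidate models are arbitrary examples rather than the models this project settles on.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the featurized crime data (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# Candidate models, chosen only for illustration.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_model = max(scores, key=scores.get)
```

Scoring every candidate with the same folds and the same metric keeps the comparison fair.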

### 4. References:
- [Kaggle](https://www.kaggle.com/datasets/kaggle/san-francisco-crime-classification)
- [ResearchGate](https://www.researchgate.net/publication/347219439_Crime_Rate_Prediction_Using_Machine_Learning_and_Data_Mining)
- [IEEE Xplore](https://ieeexplore.ieee.org/document/9170731)
- [NCBI](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8529125/)

## Exploratory Data Analysis

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
! pip install geopandas --quiet
! pip install --upgrade plotly --quiet
! pip install mpu --quiet
! pip install visualization --quiet

In [None]:
import warnings
warnings.filterwarnings('ignore')

import json
import os
import pickle
import pandas as pd
import numpy as np
import geopandas as gpd
import plotly.graph_objects as go
import plotly.express as px

from mpu import haversine_distance
from plotly.subplots import make_subplots
from tqdm import tqdm
from difflib import SequenceMatcher

from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfVectorizer
)

from visualizer import (
    MapScatter,
    MapChoropleth,
    OccurrencePlotter,
    CategoryOccurrencePlotter
)

In [None]:
project_path = '/content/drive/MyDrive/AAIC/SCS-1/sf_crime_classification/'

In [None]:
train_sf_df = pd.read_csv(filepath_or_buffer=project_path + 'csv_files/train.csv')
test_sf_df = pd.read_csv(filepath_or_buffer=project_path + 'csv_files/test.csv')
sf_pd = gpd.read_file(filename=project_path + 'shp_files/sf-police-districts/sf-police-districts.shp')

In [None]:
train_sf_df.shape, test_sf_df.shape

In [None]:
train_sf_df.head(2)

In [None]:
test_sf_df.head(2)

In [None]:
train_cols_renamed = ['time', 'category', 'description', 'weekday', 'police_dept', 
                      'resolution', 'address', 'longitude', 'latitude']
train_sf_df.columns = train_cols_renamed

test_cols_renamed = ['id', 'time', 'weekday', 'police_dept', 'address', 'longitude', 'latitude']
test_sf_df.columns = test_cols_renamed

In [None]:
train_sf_df.drop(columns=['description', 'resolution'], inplace=True)

In [None]:
train_sf_df.head(2)

In [None]:
test_sf_df.head(2)

In [None]:
train_sf_df.dtypes

In [None]:
test_sf_df.dtypes

In [None]:
def extract_date(time):
    """Extract date from time"""
    return time.split(' ')[0]

def extract_year(date):
    """Extract year from date"""
    return int(date.split('-')[0])

def extract_month(date):
    """Extract month from date"""
    return int(date.split('-')[1])

def extract_day(date):
    """Extract day from date"""
    return int(date.split('-')[2])

def extract_hour(time):
    """Extract hour from time"""
    date, hms = time.split(' ')
    return int(hms.split(':')[0])

def extract_minute(time):
    """Extract minute from time"""
    date, hms = time.split(' ')
    return int(hms.split(':')[1])

def extract_season(month):
    """Determine season from month"""
    if month in [4, 5, 6]:
        return 'summer'
    elif month in [7, 8, 9]:
        return 'rainy'
    elif month in [10, 11, 12]:
        return 'winter'
    return 'spring'

def extract_hour_type(hour):
    """Determine hour type from hour"""
    if (hour >= 4) and (hour < 12):
        return 'morning'
    elif (hour >= 12) and (hour < 15):
        return 'noon'
    elif (hour >= 15) and (hour < 18):
        return 'evening'
    elif (hour >= 18) and (hour < 22):
        return 'night'
    return 'mid-night'

def extract_time_period(hour):
    """Determine the time period from hour"""
    if hour in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]:
        return 'am'
    return 'pm'
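
The string-splitting helpers above could also be expressed with pandas' datetime accessor, which parses the column once and derives all parts from it. The sketch below is one possible alternative, not the notebook's method; `add_temporal_features` is a hypothetical name and the two timestamps are illustrative values.

```python
import pandas as pd

def add_temporal_features(df, time_column='time'):
    """Return a copy of df with date parts derived via the .dt accessor."""
    out = df.copy()
    ts = pd.to_datetime(out[time_column])
    out['year'] = ts.dt.year
    out['month'] = ts.dt.month
    out['day'] = ts.dt.day
    out['hour'] = ts.dt.hour
    out['minute'] = ts.dt.minute
    # Hours 0-11 are 'am', 12-23 are 'pm', matching extract_time_period.
    out['time_period'] = ts.dt.hour.lt(12).map({True: 'am', False: 'pm'})
    return out

# Illustrative timestamps in the same 'YYYY-MM-DD HH:MM:SS' layout.
demo = add_temporal_features(
    pd.DataFrame({'time': ['2015-05-13 23:53:00', '2003-01-06 00:01:00']})
)
```

The season and hour-type buckets could be added the same way by mapping `month` and `hour` through the existing `extract_season` and `extract_hour_type` functions.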

In [None]:
def title_text(text):
    """Title the text"""
    if isinstance(text, str):
        text = text.title()
        return text
    return text

In [None]:
def extract_address_type(addr):
    """Extract the address type: 'Cross' for intersections, else the last token (e.g. street suffix)"""
    if ' / ' in addr:
        return 'Cross'
    addr_sep = addr.split(' ')
    addr_type = addr_sep[-1]
    return addr_type
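
To make the rule above concrete, the sketch below applies the same logic to a whole column at once using pandas string methods. `address_type_series` is a hypothetical helper and the three addresses are made-up examples in the dataset's general style.

```python
import pandas as pd

def address_type_series(addresses):
    """Vectorized variant: 'Cross' for 'A / B' addresses, else the last token."""
    is_cross = addresses.str.contains(' / ', regex=False)
    last_token = addresses.str.split(' ').str[-1]
    # Keep the last token for ordinary addresses, replace intersections with 'Cross'.
    return last_token.where(~is_cross, 'Cross')

demo = address_type_series(pd.Series([
    'OAK ST / LAGUNA ST',
    '1500 Block of LOMBARD ST',
    '800 Block of BRYANT AV',
]))
```

Here `demo` holds `Cross`, `ST`, and `AV` for the three inputs.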

In [None]:
def write_temporal_address_features(df, path):
    """Add temporal and address-type features, title-case the text, and write the result to CSV"""
    
    ### Adding temporal features
    df['date'] = df['time'].apply(func=extract_date)
    df['year'] = df['date'].apply(func=extract_year)
    df['month'] = df['date'].apply(func=extract_month)
    df['day'] = df['date'].apply(func=extract_day)
    df['hour'] = df['time'].apply(func=extract_hour)
    df['minute'] = df['time'].apply(func=extract_minute)
    df['season'] = df['month'].apply(func=extract_season)
    df['hour_type'] = df['hour'].apply(func=extract_hour_type)
    df['time_period'] = df['hour'].apply(func=extract_time_period)
    
    ### Adding address type
    df['address_type'] = df['address'].apply(func=extract_address_type)
    
    ### Text titling
    df = df.applymap(func=title_text)
    
    ### Writing
    df.to_csv(path_or_buf=path, index=None)
    
    return True

In [None]:
train_cleaned_path = project_path + 'csv_files/train_time_address_cleaned.csv'
test_cleaned_path = project_path + 'csv_files/test_time_address_cleaned.csv'

# Regenerate both files unless both already exist
if not (os.path.isfile(train_cleaned_path) and os.path.isfile(test_cleaned_path)):
    # Training
    write_temporal_address_features(df=train_sf_df, path=train_cleaned_path)
    # Test
    write_temporal_address_features(df=test_sf_df, path=test_cleaned_path)
else:
    print("Data already exists in the directory.")

# Reload in both cases so the in-memory frames match the title-cased CSVs
train_sf_df = pd.read_csv(filepath_or_buffer=train_cleaned_path)
test_sf_df = pd.read_csv(filepath_or_buffer=test_cleaned_path)

In [None]:
train_sf_df.head(2)

In [None]:
test_sf_df.head(2)

In [None]:
train_sf_df[['latitude', 'longitude']].describe()

In [None]:
test_sf_df[['latitude', 'longitude']].describe()

In [None]:
def plot_column_distribution(df, column):
    """Plot the distribution of the column from dataframe"""
    
    column_val_df = df[column].value_counts().to_frame().reset_index()
    column_val_df.columns = [column, 'count']
    
    fig = px.bar(data_frame=column_val_df, x=column, y='count')
    fig.update_layout(
        autosize=True,
        height=600,
        hovermode='closest',
        showlegend=True,
        margin=dict(l=10, r=10, t=30, b=0)
    )
    
    fig.show()
    return None

In [None]:
plot_column_distribution(df=train_sf_df, column='category')

In [None]:
plot_column_distribution(df=train_sf_df, column='address_type')

In [None]:
plot_column_distribution(df=train_sf_df, column='police_dept')

In [None]:
plot_column_distribution(df=train_sf_df, column='year')

In [None]:
plot_column_distribution(df=train_sf_df, column='month')

In [None]:
plot_column_distribution(df=train_sf_df, column='weekday')

In [None]:
plot_column_distribution(df=train_sf_df, column='hour')

In [None]:
plot_column_distribution(df=train_sf_df, column='minute')

In [None]:
plot_column_distribution(df=train_sf_df, column='season')

In [None]:
plot_column_distribution(df=train_sf_df, column='time_period')

In [None]:
plot_column_distribution(df=train_sf_df, column='hour_type')

In [None]:
oviz = OccurrencePlotter(df=train_sf_df)

In [None]:
oviz.plot_crime_occurrences(police_dept='Southern')

In [None]:
oviz.plot_crime_occurrences_by_year(year=2003, police_dept='Southern')

In [None]:
oviz.plot_crime_occurrences_by_month(year=2003, month=1, police_dept='Southern')

In [None]:
# oviz.plot_crime_occurrences_by_day(year=2005, month=1, day=10)

In [None]:
mviz = MapScatter(df=train_sf_df)

In [None]:
mviz.map_crimes(police_dept='Richmond')

In [None]:
mviz.map_crimes_by_year(year=2015, police_dept='Richmond')

In [None]:
mviz.map_crimes_by_month(year=2003, month=2, police_dept='Richmond')

In [None]:
# mviz.map_crimes_by_day(year=2003, month=2, day=6)

In [None]:
mciz = MapChoropleth(df=train_sf_df, gdf=sf_pd)

In [None]:
# mciz.map_crimes()

In [None]:
mciz.map_crimes_by_year(year=2015)

In [None]:
mciz.map_crimes_by_month(year=2015, month=3)

In [None]:
# mciz.map_crimes_by_day(year=2015, month=3, day=3)

In [None]:
# cop = CategoryOccurrencePlotter(df=train_sf_df)

In [None]:
# cop.plot_crime_occurrences_by_month()

In [None]:
# cop.plot_crime_occurrences_by_weekday()

In [None]:
# cop.plot_crime_occurrences_by_hour()

In [None]:
def make_subplots_of_categories_by_year(year, df, top=12):
    """Density map subplots of the most frequent crime categories in the given year"""
    
    # San Francisco coordinates
    clat = 37.773972
    clon = -122.431297
    
    # select the `top` most frequent categories
    sf_ = df[df['year'] == year]
    category_vc = sf_['category'].value_counts().to_frame()
    categories = category_vc.index.to_list()[:top]
    
    # subplots grid
    nrows = 4; ncols = 3
    fig = make_subplots(
        rows=nrows, cols=ncols, subplot_titles=categories,
        specs=[[{"type" : "mapbox"} for i in range(ncols)] for j in range(nrows)]
    )

    r = 1; c = 1
    for name in categories:
        group = sf_[sf_['category'] == name]
        if (c > ncols):
            r += 1
            if (r > nrows): break
            c = 1
        f = go.Densitymapbox(lat=group['latitude'], lon=group['longitude'], radius=1)
        fig.add_trace(trace=f, row=r, col=c)
        c += 1
    
    fig.update_layout(
        # autosize=True,
        title=year,
        height=1000, hovermode='closest', showlegend=False,
        margin=dict(l=0, r=0, t=60, b=0)
    )

    fig.update_mapboxes(
        center=dict(lat=clat, lon=clon),
        bearing=0, pitch=0, zoom=10,
        style='carto-positron'
    )
    
    fig.show()
    return None

In [None]:
# make_subplots_of_categories_by_year(year=2003, df=train_sf_df)

In [None]:
def split_categories_numericals(df):
    """Identify the categorical and numerical columns separately"""
    cols = list(df.columns)
    num_cols = list(df.select_dtypes(include=['number', 'bool']).columns)
    # Preserve the original column order so the encoded output is deterministic
    cate_cols = [col for col in cols if col not in num_cols]
    return cate_cols, num_cols

In [None]:
ignore_columns = ['category', 'time', 'address', 'date']

def extract_feature_dummies(df, column):
    """One-Hot-Encoding using Pandas"""
    col_df = df[column]
    return pd.get_dummies(data=col_df)

def encode_multiple_columns(df, ignore_columns=ignore_columns):
    """Encode multiple columns and concatenate them column-wise"""
    cate_cols, num_cols = split_categories_numericals(df=df)
    
    multi_feature_dummies = [df[num_cols]]
    for i in cate_cols:
        if i not in ignore_columns:
            d = extract_feature_dummies(df=df, column=i)
            multi_feature_dummies.append(d)

    encoded_data = pd.concat(multi_feature_dummies, axis=1)
    
    return encoded_data
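
One caveat with `pd.get_dummies` is that the resulting columns depend on the values present, so train and test frames can end up with different columns. The sketch below shows one common way to align them; the tiny `train` and `test` frames are made-up examples.

```python
import pandas as pd

# Made-up frames where the test set contains a category unseen in training.
train = pd.DataFrame({'weekday': ['Monday', 'Tuesday', 'Monday']})
test = pd.DataFrame({'weekday': ['Tuesday', 'Sunday']})

train_dummies = pd.get_dummies(train['weekday'])

# Reindex the test dummies to the training columns: categories unseen in
# training are dropped, and categories missing from test are filled with 0.
test_dummies = pd.get_dummies(test['weekday']).reindex(
    columns=train_dummies.columns, fill_value=0
)
```

After alignment both frames expose the same columns in the same order, which is what a fitted model expects at prediction time.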

In [None]:
encoded_data = encode_multiple_columns(df=train_sf_df)

In [None]:
sf_pstations_tourists = {
    "sfpd"                : [37.7725, -122.3894],
    "ingleside"           : [37.7247, -122.4463],
    "central"             : [37.7986, -122.4101],
    "northern"            : [37.7802, -122.4324],
    "mission"             : [37.7628, -122.4220],
    "tenderloin"          : [37.7838, -122.4129],
    "taraval"             : [37.7437, -122.4815],
    "sfpd park"           : [37.7678, -122.4552],
    "bayview"             : [37.7298, -122.3977],
    "kma438 sfpd"         : [37.7725, -122.3894],
    "richmond"            : [37.7801, -122.4644],
    "police commission"   : [37.7725, -122.3894],
    "juvenile"            : [37.7632, -122.4220],
    "southern"            : [37.6556, -122.4366],
    "sfpd pistol range"   : [37.7200, -122.4996],
    "sfpd public affairs" : [37.7754, -122.4039],
    "broadmoor"           : [37.6927, -122.4748],
    #################
    "napa wine country"      : [38.2975, -122.2869],
    "sonoma wine country"    : [38.2919, -122.4580],
    "muir woods"             : [37.8970, -122.5811],
    "golden gate"            : [37.8199, -122.4783],
    "yosemite national park" : [37.865101, -119.538330],
}

In [None]:
def get_distance(ij):
    """Get distance from two coordinates"""
    i = ij[0]
    j = ij[1]
    distance = haversine_distance(origin=i, destination=j)
    return distance

def extract_spatial_distance_feature(df, lat_column, lon_column, pname, pcoords):
    """Compute the distance between pcoords and all the feature values"""
    lat_vals = df[lat_column].to_list()
    lon_vals = df[lon_column].to_list()
    
    df_coords = list(zip(lat_vals, lon_vals))
    pcoords_df_coords_combines = zip([pcoords] * len(df), df_coords)
    
    f = pd.DataFrame()
    distances = list(map(get_distance, pcoords_df_coords_combines))
    f[pname] = distances
    
    return f
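
For reference, the great-circle distance that the helper above delegates to `haversine_distance` can be computed for a whole column at once with NumPy. The sketch below is an illustrative alternative; `haversine_km` and the 6371 km Earth radius are assumptions of this example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant

def haversine_km(origin, lat, lon):
    """Great-circle distance in km from one (lat, lon) origin to arrays of points."""
    lat1, lon1 = np.radians(origin)
    lat2 = np.radians(np.asarray(lat, dtype=float))
    lon2 = np.radians(np.asarray(lon, dtype=float))
    # Haversine formula: a is the squared half-chord length between the points.
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# One degree of longitude along the equator is roughly 111.19 km.
distances = haversine_km((0.0, 0.0), [0.0, 0.0], [0.0, 1.0])
```

Operating on arrays avoids building a Python list of coordinate pairs per landmark, which may matter for a dataset of this size.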

In [None]:
def extract_spatial_distance_multi_features(df, lat_column, lon_column, stations=sf_pstations_tourists):
    """Compute the spatial distance to each station or landmark and concatenate the results column-wise"""
    sfeatures = []
    
    for pname, pcoords in stations.items():
        print(pname, pcoords)
        sf = extract_spatial_distance_feature(df, lat_column, lon_column, pname, pcoords)
        sfeatures.append(sf)
    
    spatial_distances = pd.concat(sfeatures, axis=1)
    return spatial_distances

In [None]:
sd_features = extract_spatial_distance_multi_features(df=train_sf_df, lat_column='latitude', lon_column='longitude')

In [None]:
def lat_lon_sum(ll):
    """Return the sum of lat and lon"""
    lat = ll[0]
    lon = ll[1]
    return lat + lon

def lat_lon_diff(ll):
    """Return the difference lon - lat"""
    lat = ll[0]
    lon = ll[1]
    return lon - lat

def lat_lon_sum_square(ll):
    """Return the square of sum of lat and lon"""
    lat = ll[0]
    lon = ll[1]
    return (lat + lon) ** 2

def lat_lon_diff_square(ll):
    """Return the square of diff of lat and lon"""
    lat = ll[0]
    lon = ll[1]
    return (lat - lon) ** 2

def lat_lon_sum_sqrt(ll):
    """Return the sqrt of the sum of squares of lat and lon"""
    lat = ll[0]
    lon = ll[1]
    return (lat**2 + lon**2) ** (1 / 2)

def lat_lon_diff_sqrt(ll):
    """Return the sqrt of the difference of squares, lon**2 - lat**2"""
    lat = ll[0]
    lon = ll[1]
    return (lon**2 - lat**2) ** (1 / 2)
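
The next cell calls `features_by_lat_lon`, which is not defined anywhere in this notebook. The sketch below is an assumed reconstruction, not the original helper: it computes the same six quantities as the `lat_lon_*` functions above in vectorized form, and the output column names are hypothetical.

```python
import numpy as np
import pandas as pd

def features_by_lat_lon(df, lat_column, lon_column):
    """Assumed reconstruction: one column per lat_lon_* helper."""
    lat = df[lat_column].to_numpy(dtype=float)
    lon = df[lon_column].to_numpy(dtype=float)
    return pd.DataFrame({
        'lat_lon_sum': lat + lon,
        'lat_lon_diff': lon - lat,
        'lat_lon_sum_square': (lat + lon) ** 2,
        'lat_lon_diff_square': (lat - lon) ** 2,
        'lat_lon_sum_sqrt': np.sqrt(lat ** 2 + lon ** 2),
        'lat_lon_diff_sqrt': np.sqrt(lon ** 2 - lat ** 2),
    }, index=df.index)

# Small made-up coordinates chosen so the results are easy to check by hand.
demo = features_by_lat_lon(
    pd.DataFrame({'latitude': [3.0], 'longitude': [-5.0]}),
    lat_column='latitude', lon_column='longitude',
)
```

For San Francisco coordinates the longitude magnitude exceeds the latitude magnitude, so the value under the last square root stays positive.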

In [None]:
sll_features = features_by_lat_lon(df=train_sf_df, lat_column='latitude', lon_column='longitude')

In [None]:
def create_bow_vectorizer(df, column, target='category', write_vect=True, kbest=20):
    """We should only fit on training data to avoid data leakage"""

    model_name = 'vect_bow_{}.pkl'.format(column)
    print(model_name)
    df_col_val = df[column]

    model_path = project_path + 'models/' + model_name
    if not os.path.isfile(model_path):
        vect = CountVectorizer()
        vect.fit(raw_documents=df_col_val)
        with open(model_path, "wb") as f:
            pickle.dump(vect, f)
    else:
        print("Model already exists in the directory.")
        with open(model_path, "rb") as f:
            vect = pickle.load(f)
    df_col_features = vect.transform(raw_documents=df_col_val)

    if kbest:
        fs = SelectKBest(k=kbest)
        fs.fit(df_col_features, df[target])
        df_col_features = fs.transform(df_col_features)
    
    return pd.DataFrame(df_col_features.toarray())
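
The docstring's point about data leakage can be made explicit: the vectorizer and the feature selector are fitted on training rows only and then reused unchanged on held-out rows. The sketch below illustrates this with made-up addresses and labels. It passes `chi2` as the score function, which is a choice made for this example, whereas the function above relies on the `SelectKBest` default.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Made-up training and held-out addresses with placeholder labels.
train_addr = [
    'oak st / laguna st',
    '1500 block of lombard st',
    '800 block of bryant st',
    'oak st / fell st',
]
train_y = ['A', 'B', 'B', 'A']
test_addr = ['100 block of oak st', 'unknown rd']

# The vocabulary is learned from the training text only.
vect = CountVectorizer()
X_train = vect.fit_transform(train_addr)

# The selector is also fitted on training data only.
selector = SelectKBest(score_func=chi2, k=3).fit(X_train, train_y)

X_train_sel = selector.transform(X_train)
# Held-out text reuses the fitted objects; unseen tokens are ignored.
X_test_sel = selector.transform(vect.transform(test_addr))
```

Both matrices end up with the same three selected columns, so a model trained on one can score the other.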

In [None]:
train_address_bow = create_bow_vectorizer(df=train_sf_df, column='address')

In [None]:
def create_tfidf_vectorizer(df, column, target='category', write_vect=True, kbest=20):
    """We should only fit on training data to avoid data leakage"""

    model_name = 'vect_tfidf_{}.pkl'.format(column)
    print(model_name)
    df_col_val = df[column]

    model_path = project_path + 'models/' + model_name
    if not os.path.isfile(model_path):
        vect = TfidfVectorizer()
        vect.fit(raw_documents=df_col_val)
        with open(model_path, "wb") as f:
            pickle.dump(vect, f)
    else:
        print("Model already exists in the directory.")
        with open(model_path, "rb") as f:
            vect = pickle.load(f)
    df_col_features = vect.transform(raw_documents=df_col_val)

    if kbest:
        fs = SelectKBest(k=kbest)
        fs.fit(df_col_features, df[target])
        df_col_features = fs.transform(df_col_features)
    
    return pd.DataFrame(df_col_features.toarray())

In [None]:
train_address_tfidf = create_tfidf_vectorizer(df=train_sf_df, column='address')

In [None]:
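# Prefix the vectorizer outputs: both frames use integer column labels 0..k-1,
# which would otherwise produce duplicate, non-string column names.
train_sf_df_featurized = pd.concat(
    [
        encoded_data, sd_features, sll_features,
        train_address_bow.add_prefix('address_bow_'),
        train_address_tfidf.add_prefix('address_tfidf_'),
    ],
    axis=1
)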
train_sf_df_featurized['category'] = train_sf_df['category']

In [None]:
train_sf_df_featurized.shape

In [None]:
def divide_by_stratification(df, target):
    """Return a stratified 10% sample of the data (the remaining 90% is discarded)"""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.9, stratify=y, random_state=42)
    return X_train, y_train
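
The effect of `stratify` is easiest to see on a small imbalanced example. The sketch below uses made-up labels with an 80/20 class ratio and keeps 10% of the rows, mirroring the split above; the variable names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up labels with an 80/20 class imbalance.
labels = pd.Series(['A'] * 80 + ['B'] * 20)
features = pd.DataFrame({'x': range(len(labels))})

# Keep 10% of the rows; stratify=labels preserves the 80/20 class ratio.
X_small, _, y_small, _ = train_test_split(
    features, labels, train_size=0.1, stratify=labels, random_state=42
)
class_counts = y_small.value_counts().to_dict()
```

The 10-row sample contains 8 rows of `A` and 2 of `B`, the same proportions as the full data.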

In [None]:
def plot_tsne(X, y, k=20, perplexity=50):
    """TSNE plot after reducing the dimensionality to k best features"""
    
    fs = SelectKBest(k=k)
    fs.fit(X, y)
    X = fs.transform(X)
    print(X.shape)
    
    X = StandardScaler().fit_transform(X)
    tsne = TSNE(n_components=2, random_state=0, perplexity=perplexity)
    projections = tsne.fit_transform(X)
    
    fig = px.scatter(projections, x=0, y=1, color=y)
    fig.update_layout(
        autosize=True,
        height=600,
        hovermode='closest',
        showlegend=True,
        margin=dict(l=10, r=10, t=30, b=0)
    )
    fig.show()

    return None

In [None]:
X_train, y_train = divide_by_stratification(df=train_sf_df_featurized, target='category')

In [None]:
plot_tsne(X=X_train, y=y_train)

In [None]:
def segregate_only_top(df, column, n=12, randomize=True):
    """Considering only top crimes and randomizing the data"""
    top_n = df[column].value_counts().index.to_list()[:n]
    
    df_vals = []
    for i in top_n:
        df_vals.append(df[df[column] == i])
    
    df = pd.concat(df_vals, axis=0)
    if randomize:
        df = df.sample(frac=1).reset_index(drop=True)

    return df
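
The same filtering can be written without the explicit loop by using a boolean mask. The sketch below is an alternative under that assumption; `keep_top_categories` and its `seed` parameter are hypothetical additions, with the seed making the shuffle reproducible.

```python
import pandas as pd

def keep_top_categories(df, column, n=12, seed=None):
    """Keep rows whose value is among the n most frequent, then shuffle."""
    top_n = df[column].value_counts().index[:n]
    subset = df[df[column].isin(top_n)]
    return subset.sample(frac=1, random_state=seed).reset_index(drop=True)

# Made-up categories: 'a' and 'b' are the two most frequent.
demo = keep_top_categories(
    pd.DataFrame({'category': ['a', 'a', 'a', 'b', 'b', 'c']}),
    column='category', n=2, seed=0,
)
```

Rows with the rare category `c` are dropped and the remaining five rows come back in shuffled order.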

In [None]:
data = segregate_only_top(df=train_sf_df_featurized, column='category')
X_train, y_train = divide_by_stratification(df=data, target='category')

In [None]:
plot_tsne(X=X_train, y=y_train)

In [None]:
data = segregate_only_top(df=train_sf_df_featurized, column='category', n=5)
X_train, y_train = divide_by_stratification(df=data, target='category')

In [None]:
plot_tsne(X=X_train, y=y_train)