# San Francisco Crime

## Motivation

From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz. Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.

### Overview

From Sunset to SOMA, and Marina to Excelsior, this dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.

### Approach

We will apply a full Data Science Development life cycle composed of the following steps:

- Data Wrangling to perform all the necessary actions to clean the dataset.
- Feature Engineering to create additional variables from the existing ones.
- Data Normalization and Data Transformation for preparing the dataset for the learning algorithms.
- Training / Testing data creation to evaluate the performance of our model.
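
The last two steps above can be sketched in plain pandas and NumPy. This is only an illustration under stated assumptions: the `features` table is synthetic, the 80/20 split ratio is arbitrary, and z-score standardization stands in for whichever normalization is eventually chosen.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric feature table standing in for the prepared dataset
rng = np.random.default_rng(42)
features = pd.DataFrame({
    "x": rng.normal(-122.42, 0.03, size=100),
    "y": rng.normal(37.77, 0.02, size=100),
    "hour": rng.integers(0, 24, size=100),
})

# Shuffle the row positions and hold out 20% of the rows for evaluation
shuffled = rng.permutation(len(features))
n_holdout = int(0.2 * len(features))
holdout = features.iloc[shuffled[:n_holdout]]
training = features.iloc[shuffled[n_holdout:]]

# Standardize with statistics learned from the training rows only,
# so nothing about the held-out rows leaks into the preprocessing
mean, std = training.mean(), training.std()
training_scaled = (training - mean) / std
holdout_scaled = (holdout - mean) / std
```

Computing `mean` and `std` on the training rows and reusing them for the held-out rows keeps the evaluation honest.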


## Data Wrangling

### Loading the data

```python
# Core imports
import pandas as pd
import numpy as np

# Yeast imports
from yeast import Recipe
from yeast.steps import *
from yeast.transformers import *
from yeast.selectors import *
```

```python
train = pd.read_csv('../../data/sf_train.csv')
test = pd.read_csv('../../data/sf_test.csv')
```

### The cleaning recipe

```python
recipe = Recipe([
    # Normalize all column names
    CleanColumnNamesStep('snake'),
    # The dataset contains 2323 duplicate rows; drop them from the training set only
    DropDuplicateRowsStep(role='training'),
    # Some geolocation points are misplaced:
    # mark the outlying coordinates as missing, then impute the column mean
    MutateStep({
        'x': MapValues({-120.5: np.nan}),
        'y': MapValues({90: np.nan})
    }),
    MeanImputeStep(['x', 'y']),
    # Extract some features from the date:
    CastStep({'dates': 'datetime'}),
    MutateStep({
        'year': DateYear('dates'),
        'quarter': DateQuarter('dates'),
        'month': DateMonth('dates'),
        'week': DateWeek('dates'),
        'day': DateDay('dates'),
        'hour': DateHour('dates'),
        'minute': DateMinute('dates'),
        'dow': DateDayOfWeek('dates'),
        'doy': DateDayOfYear('dates')
    }),
    # Calculate the tenure: days(date - min(date)):
    MutateStep({
        'tenure': lambda df: (df['dates'] - df['dates'].min()).dt.days
    }),
    # Is the address a block (rather than an intersection)?
    MutateStep({
        'is_block': StrContains('block', column='address', case=False)
    }),
    # Convert the category (target) into a numerical feature:
    # OrdinalScoreStep('category'),
    # Drop irrelevant columns
    DropColumnsStep(['dates', 'day_of_week']),
    # Cast the numerical features
    CastStep({
        'is_block': 'integer'  # True and False to 1 and 0
    }),
    # Keep only numerical features
    SelectStep(AllNumeric()),
]).prepare(train)
```
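
For readers unfamiliar with yeast, the core transformations in the recipe can be approximated in plain pandas. The sketch below runs on made-up rows and is an assumption about equivalent behavior, not yeast's implementation; in particular, pandas numbers the days of the week from Monday = 0, which may differ from `DateDayOfWeek`.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like the cleaned columns (all values are made up)
df = pd.DataFrame({
    "dates": pd.to_datetime([
        "2015-05-13 23:53:00",
        "2015-05-10 08:15:00",
        "2015-01-01 00:01:00",
    ]),
    "x": [-122.42, -120.5, -122.40],
    "y": [37.77, 90.0, 37.75],
    "address": [
        "800 Block of BRYANT ST",
        "OAK ST / LAGUNA ST",
        "1500 Block of LOMBARD ST",
    ],
})

# Mark the sentinel coordinates as missing, then fill with the column mean
df["x"] = df["x"].replace(-120.5, np.nan)
df["y"] = df["y"].replace(90.0, np.nan)
df[["x", "y"]] = df[["x", "y"]].fillna(df[["x", "y"]].mean())

# A few of the date-derived features
df["year"] = df["dates"].dt.year
df["hour"] = df["dates"].dt.hour
df["dow"] = df["dates"].dt.dayofweek  # Monday = 0 in pandas

# Tenure: whole days elapsed since the earliest timestamp in the frame
df["tenure"] = (df["dates"] - df["dates"].min()).dt.days

# Flag addresses that mention a block, case-insensitively, as 1/0
df["is_block"] = df["address"].str.contains("block", case=False).astype(int)
```

Note that the tenure is measured from the earliest date in whichever frame is passed in, so the reference date can differ between the training and test sets.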

```python
baked_train = recipe.bake(train)
baked_test = recipe.bake(test, role="testing")
```
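
The prepare/bake split follows a familiar two-phase pattern: parameters are learned once from the training data and then applied unchanged to any other split. The class below is a hypothetical, minimal illustration of that idea for mean imputation; `MeanImputer` and its attributes are invented for this sketch and are not part of yeast.

```python
import numpy as np
import pandas as pd


class MeanImputer:
    """Hypothetical two-phase step: learn means once, reuse them on any split."""

    def __init__(self, columns):
        self.columns = columns
        self.means_ = None

    def prepare(self, df):
        # Learn the fill values from the training data only
        self.means_ = df[self.columns].mean()
        return self

    def bake(self, df):
        # Apply the stored training means; the input frame is not modified
        out = df.copy()
        out[self.columns] = out[self.columns].fillna(self.means_)
        return out


train_df = pd.DataFrame({"x": [-122.40, np.nan, -122.44]})
test_df = pd.DataFrame({"x": [np.nan, -122.41]})

imputer = MeanImputer(["x"]).prepare(train_df)
baked_train_df = imputer.bake(train_df)
baked_test_df = imputer.bake(test_df)
```

Here the missing value in `test_df` is filled with the mean computed from `train_df`, not with a statistic of the test data.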

```python
baked_train.sample(5).T
```

|  | 726598 | 144656 | 604682 | 680623 | 608770 |
|---|---|---|---|---|---|
| x | -122.474 | -122.435 | -122.429 | -122.403 | -122.42 |
| y | 37.76 | 37.7249 | 37.7818 | 37.7754 | 37.7393 |
| year | 2005.0 | 2013.0 | 2006.0 | 2005.0 | 2006.0 |
| quarter | 1.0 | 2.0 | 4.0 | 3.0 | 3.0 |
| month | 1.0 | 6.0 | 10.0 | 9.0 | 9.0 |
| week | 4.0 | 24.0 | 42.0 | 38.0 | 38.0 |
| day | 27.0 | 13.0 | 17.0 | 21.0 | 22.0 |
| hour | 7.0 | 19.0 | 10.0 | 17.0 | 17.0 |
| minute | 0.0 | 35.0 | 51.0 | 30.0 | 35.0 |
| dow | 3.0 | 3.0 | 1.0 | 2.0 | 4.0 |


## Links & Resources

- [SF-Crime Analysis & Prediction by @yannisp](https://www.kaggle.com/yannisp/sf-crime-analysis-prediction)