# Day 2 - Hackathon Objectives
- We'll be hacking on the SF Crime Classification dataset from Kaggle. 
- Some parts we'll go through guided/together
- Some parts you'll be on your own, in the wild
- It's gonna be great!

## The Data
First, let's read in the data.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("../data/sf_crime_truncated.csv")

In [3]:
df.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2009-06-11 13:45:00,OTHER OFFENSES,CONSPIRACY,Thursday,TARAVAL,JUVENILE BOOKED,19TH AV / OCEAN AV,-122.474954,37.732456
1,2005-10-17 12:00:00,ASSAULT,THREATENING PHONE CALL(S),Monday,TARAVAL,NONE,1500 Block of SLOAT BL,-122.489714,37.73395
2,2012-09-20 20:30:00,NON-CRIMINAL,LOST PROPERTY,Thursday,MISSION,NONE,1800 Block of FOLSOM ST,-122.415605,37.767718
3,2006-03-25 15:28:00,SECONDARY CODES,DOMESTIC VIOLENCE,Saturday,RICHMOND,"ARREST, BOOKED",800 Block of 28TH AV,-122.487534,37.773336
4,2013-10-01 00:33:00,WARRANTS,ENROUTE TO PAROLE OFFICER,Tuesday,MISSION,"ARREST, BOOKED",1200 Block of CHURCH ST,-122.427465,37.751296


# Exploratory Data Analysis
Let's see what's in our dataset.

In [4]:
df.describe()

Unnamed: 0,X,Y
count,20000.0,20000.0
mean,-122.422499,37.772153
std,0.031782,0.522885
min,-122.513642,37.708154
25%,-122.43322,37.752239
50%,-122.416349,37.775421
75%,-122.406841,37.784401
max,-120.5,90.0


## Sanity Check
San Francisco sits at roughly latitude 37.77493 and longitude -122.41942, so the maximum values above (`X` = -120.5, `Y` = 90.0) tell us there is at least one outlier in the data.

In [5]:
df[df['Y'] > 38]

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
1652,2003-05-02 01:00:00,SEX OFFENSES FORCIBLE,"FORCIBLE RAPE, BODILY FORCE",Friday,SOUTHERN,COMPLAINANT REFUSES TO PROSECUTE,3RD ST / JAMES LICK FREEWAY HY,-120.5,90.0
16336,2005-09-09 00:03:00,ASSAULT,BATTERY,Friday,TENDERLOIN,NONE,ELLIS ST / 5THSTNORTH ST,-120.5,90.0


Let's exclude these points from our dataset.
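
A minimal sketch of one way to drop those rows, assuming the same latitude cutoff of 38 used in the check above. The small `toy` frame is a stand-in built from a few of the values shown earlier, not the real file, and `toy_clean` is just an illustrative name.

```python
import pandas as pd

# Stand-in for the real data: two plausible points and one placeholder row
# with the (-120.5, 90.0) coordinates seen in the outliers above.
toy = pd.DataFrame({
    "Category": ["ASSAULT", "WARRANTS", "ASSAULT"],
    "X": [-122.489714, -122.427465, -120.5],
    "Y": [37.73395, 37.751296, 90.0],
})

# Keep only rows whose latitude is plausible for San Francisco.
toy_clean = toy[toy["Y"] <= 38].reset_index(drop=True)

print(len(toy), "->", len(toy_clean))
```

On the real data the same filter would be applied to `df` and the result assigned back to it.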

Let's take a look at our other columns and dtypes.

In [6]:
df.columns

Index(['Dates', 'Category', 'Descript', 'DayOfWeek', 'PdDistrict',
       'Resolution', 'Address', 'X', 'Y'],
      dtype='object')

In [7]:
df.dtypes

Dates          object
Category       object
Descript       object
DayOfWeek      object
PdDistrict     object
Resolution     object
Address        object
X             float64
Y             float64
dtype: object

## Data Type Conversions
Looks like `Dates` was read in as `object` (plain strings), so we'll need to convert it to a datetime type.

In [8]:
df['Dates'] = pd.to_datetime(df['Dates'])

Also, let's convert the days of the week to the numbers 1-7 instead of strings.

## Exercise
Do this in pairs or on your own.

Create a function to do the conversion and apply it to the column.
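
If you get stuck, here is one possible shape for the solution. It is only a sketch: the numbering (Monday = 1 through Sunday = 7) is an assumption, and `DAY_TO_NUMBER`, `day_to_number` and `DayOfWeekNum` are illustrative names rather than anything from the dataset.

```python
import pandas as pd

# Assumed numbering: Monday = 1 ... Sunday = 7.
DAY_TO_NUMBER = {
    "Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4,
    "Friday": 5, "Saturday": 6, "Sunday": 7,
}

def day_to_number(day_name):
    """Return 1-7 for a day name such as 'Thursday'."""
    return DAY_TO_NUMBER[day_name]

# Tiny stand-in for the DayOfWeek column.
days = pd.DataFrame({"DayOfWeek": ["Thursday", "Monday", "Saturday"]})
days["DayOfWeekNum"] = days["DayOfWeek"].apply(day_to_number)

print(days)
```

Writing the result to a new column keeps the original strings around, which is handy for labelling plots later.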

In [9]:
df.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2009-06-11 13:45:00,OTHER OFFENSES,CONSPIRACY,Thursday,TARAVAL,JUVENILE BOOKED,19TH AV / OCEAN AV,-122.474954,37.732456
1,2005-10-17 12:00:00,ASSAULT,THREATENING PHONE CALL(S),Monday,TARAVAL,NONE,1500 Block of SLOAT BL,-122.489714,37.73395
2,2012-09-20 20:30:00,NON-CRIMINAL,LOST PROPERTY,Thursday,MISSION,NONE,1800 Block of FOLSOM ST,-122.415605,37.767718
3,2006-03-25 15:28:00,SECONDARY CODES,DOMESTIC VIOLENCE,Saturday,RICHMOND,"ARREST, BOOKED",800 Block of 28TH AV,-122.487534,37.773336
4,2013-10-01 00:33:00,WARRANTS,ENROUTE TO PAROLE OFFICER,Tuesday,MISSION,"ARREST, BOOKED",1200 Block of CHURCH ST,-122.427465,37.751296


## Brainstorm
Form groups of 3-4 and brainstorm all the questions you might ask of this data. What do you want to find out from it?

## Analysis Exercise
Let's break up into small groups and take a look at these questions. Don't just answer them: see if you can also create a visualization to represent each answer.

1. How many instances of each crime are there in total? (Bonus: create a bar graph! Just the top 20)
2. How does the volume of crime vary by day of week? (Hint: use pivot tables)
3. What about crime vs district?


What can we conclude from each of these questions?

## 1. Crime Category Counts
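
As a starting point, `value_counts` gives the number of rows per category, sorted from most to least frequent. The sketch below uses a made-up handful of category labels in place of the real column.

```python
import pandas as pd

# Made-up stand-in for the Category column.
toy = pd.DataFrame({
    "Category": ["ASSAULT", "WARRANTS", "ASSAULT",
                 "NON-CRIMINAL", "ASSAULT", "WARRANTS"],
})

# Count rows per category; the result is sorted in descending order.
counts = toy["Category"].value_counts()

# Keep only the most frequent categories (use head(20) on the real data).
top = counts.head(2)
print(top)
```

With matplotlib installed, calling `top.plot(kind="bar")` on the result should draw the bar graph.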

## 2. Crime Volume for Each Day of the Week
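
Following the pivot-table hint, one way to get counts per day is to add a helper column of ones and sum it. This is a sketch on made-up rows; the helper column name `n` is an arbitrary choice.

```python
import pandas as pd

# Made-up stand-in rows.
toy = pd.DataFrame({
    "DayOfWeek": ["Monday", "Monday", "Friday", "Friday", "Friday"],
    "Category": ["ASSAULT", "WARRANTS", "ASSAULT", "ASSAULT", "WARRANTS"],
})

# One row per day, one column per category, cells hold the row counts.
by_day = toy.assign(n=1).pivot_table(
    index="DayOfWeek", columns="Category",
    values="n", aggfunc="sum", fill_value=0,
)

# Total volume per day, across all categories.
total_by_day = by_day.sum(axis=1)
print(by_day)
print(total_by_day)
```

Note that the index is sorted alphabetically, so you may want to reorder the days before plotting.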

## 3. Crime per District
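
For two categorical columns, `pd.crosstab` is a compact alternative to a pivot table. A sketch on made-up rows:

```python
import pandas as pd

# Made-up stand-in rows.
toy = pd.DataFrame({
    "PdDistrict": ["MISSION", "MISSION", "TARAVAL", "MISSION"],
    "Category": ["ASSAULT", "WARRANTS", "ASSAULT", "ASSAULT"],
})

# Rows are districts, columns are crime categories, cells are counts.
by_district = pd.crosstab(toy["PdDistrict"], toy["Category"])

# Total number of incidents per district.
district_totals = by_district.sum(axis=1)
print(by_district)
```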

## `In The Wild` EDA Exercise
In your groups, go back to the questions you brainstormed earlier, pick the ones you still think would be meaningful or insightful, and hack away to get the answers. We'll come back together to share what we found and proceed from there.

## kNN Model Hacking Exercise
- In your groups, go ahead and build and tune a kNN model on **just the numerical** fields we have so far. 
- What's the best choice of k? How are your results?
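
One way to compare values of k is to cross-validate a model for each candidate and keep the best mean score. The sketch below runs on synthetic coordinates (two made-up clusters, one per class), so the numbers it prints say nothing about the real data; the candidate list `(1, 3, 5, 7)` is also arbitrary.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the X/Y columns: two clusters of 30 points each.
coords = np.vstack([
    rng.normal(loc=[-122.42, 37.77], scale=0.005, size=(30, 2)),
    rng.normal(loc=[-122.48, 37.73], scale=0.005, size=(30, 2)),
])
labels = np.array(["ASSAULT"] * 30 + ["WARRANTS"] * 30)

# Mean 5-fold cross-validated accuracy for each candidate k.
scores = {}
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, coords, labels, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```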

## Try other models from your toolkit
- How does RandomForest do?
- How about SVC?
- Any others from sklearn?
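
Because sklearn estimators share the same `fit`/`predict` interface, trying several of them can be a short loop. This is a sketch on synthetic coordinates with untuned, mostly default settings; `n_estimators=50` is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the X/Y columns: two clusters of 30 points each.
coords = np.vstack([
    rng.normal(loc=[-122.42, 37.77], scale=0.005, size=(30, 2)),
    rng.normal(loc=[-122.48, 37.73], scale=0.005, size=(30, 2)),
])
labels = np.array(["ASSAULT"] * 30 + ["WARRANTS"] * 30)

models = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVC": SVC(),
}

# Mean 5-fold cross-validated accuracy per model.
results = {
    name: cross_val_score(model, coords, labels, cv=5).mean()
    for name, model in models.items()
}
print(results)
```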

## Feature Engineering Exercises
Let's make use of our non-numerical columns. There are many strategies for doing this:

- One Hot Encoding
- Label Encoding
- Binary Encoding
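
To make the first two options concrete, here is a sketch on a few district names. It uses `pd.get_dummies` for one-hot encoding and sklearn's `LabelEncoder` for label encoding; binary encoding is not shown because it usually relies on a third-party package.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A few district names taken from the rows shown earlier.
toy = pd.DataFrame({"PdDistrict": ["TARAVAL", "MISSION", "RICHMOND", "MISSION"]})

# One-hot encoding: one indicator column per distinct value.
one_hot = pd.get_dummies(toy["PdDistrict"], prefix="PdDistrict")

# Label encoding: each distinct value becomes an integer,
# assigned in sorted order of the values.
label_codes = LabelEncoder().fit_transform(toy["PdDistrict"])

print(one_hot.columns.tolist())
print(label_codes)
```

Label codes imply an ordering (here MISSION < RICHMOND < TARAVAL) that distance-based models such as kNN will take literally, which is worth keeping in mind during the discussion. The sklearn documentation also describes `LabelEncoder` as intended for target labels rather than input features.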

## Thinking Exercise
Google/read about these different methods of encoding and discuss with your group. Which one or ones do you think make sense for which columns in our dataset?

## Let's Hack on Our Dates Column Exercise
- Think of different ways you could incorporate the dates data into the model.
- Then hack on it with your group and see what the results are!
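
Once the column has a datetime dtype, the `.dt` accessor exposes its components, which is one simple way to turn dates into model features. A sketch on two timestamps from the rows shown earlier; the new column names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Dates": pd.to_datetime(["2009-06-11 13:45:00", "2013-10-01 00:33:00"]),
})

# Pull individual components out of each timestamp.
toy["Year"] = toy["Dates"].dt.year
toy["Month"] = toy["Dates"].dt.month
toy["Hour"] = toy["Dates"].dt.hour

# dayofweek is 0 for Monday through 6 for Sunday.
toy["IsWeekend"] = toy["Dates"].dt.dayofweek >= 5

print(toy)
```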

## Data Scaling Exercise
Sometimes the scale of our columns matters, depending on the model we use. Let's see what the situation is in our data.

Use the `scale` function from the `preprocessing` module in `sklearn`, then re-run your models and see how that affects performance.
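
A minimal sketch of what `scale` does, using three coordinate pairs from the rows shown earlier: each column is shifted to mean 0 and rescaled to unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import scale

# Three (X, Y) pairs from the first rows of the dataset.
coords = np.array([
    [-122.474954, 37.732456],
    [-122.489714, 37.733950],
    [-122.415605, 37.767718],
])

# Standardize each column independently.
scaled = scale(coords)

print(scaled.mean(axis=0))  # close to 0 for both columns
print(scaled.std(axis=0))   # close to 1 for both columns
```

Keep in mind that `scale` uses statistics from whatever array you pass in. If you scale before splitting into train and test sets, the test rows influence the scaling, so a `StandardScaler` fitted on the training split only is a common alternative.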

In [10]:
df.describe()

Unnamed: 0,X,Y
count,20000.0,20000.0
mean,-122.422499,37.772153
std,0.031782,0.522885
min,-122.513642,37.708154
25%,-122.43322,37.752239
50%,-122.416349,37.775421
75%,-122.406841,37.784401
max,-120.5,90.0


## Working with Text Data Exercise
Let's talk about vectorization and then incorporate address data into our model...

Some options:
- `HashingVectorizer`
- `TfidfVectorizer`

Incorporate these additional features and go through the tuning process for your models again!
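
As a sketch of the vectorization step, `TfidfVectorizer` with its default settings turns each address into a sparse row of token weights. The addresses below come from the rows shown earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

addresses = [
    "19TH AV / OCEAN AV",
    "1500 Block of SLOAT BL",
    "1800 Block of FOLSOM ST",
    "800 Block of 28TH AV",
]

# Default settings: lowercases the text and keeps tokens of 2+ characters.
vectorizer = TfidfVectorizer()
address_features = vectorizer.fit_transform(addresses)

# One row per address, one column per distinct token.
print(address_features.shape)
print(sorted(vectorizer.vocabulary_))
```

The result is a sparse matrix, so combining it with the numerical columns typically means stacking them side by side, for example with `scipy.sparse.hstack`.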

## Bonus: Different scoring methods
Let's look at negative log loss. Uh oh!
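
One reason a log-loss score can look alarming is that it punishes confident wrong probabilities far more than hesitant ones. The sketch below uses made-up predicted probabilities for two classes to show the gap; when scoring with `cross_val_score`, the corresponding scorer name is `"neg_log_loss"`.

```python
from sklearn.metrics import log_loss

class_labels = ["ASSAULT", "WARRANTS"]
y_true = ["ASSAULT", "WARRANTS", "ASSAULT"]

# Made-up probabilities; columns follow the order of class_labels.
# Every row puts 0.99 on the wrong class.
confident_wrong = [[0.01, 0.99], [0.99, 0.01], [0.01, 0.99]]
# Every row puts 0.6 on the right class.
hedged_right = [[0.6, 0.4], [0.4, 0.6], [0.6, 0.4]]

loss_confident_wrong = log_loss(y_true, confident_wrong, labels=class_labels)
loss_hedged_right = log_loss(y_true, hedged_right, labels=class_labels)

print(loss_confident_wrong, loss_hedged_right)
```

A model that outputs hard 0/1 probabilities, which kNN does when all neighbours agree, is exposed to exactly this kind of penalty whenever it is wrong.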

## Sklearn tools for more productivity
Let's use some sklearn shortcuts so we can iterate faster, test many models and configurations in one go, and analyze our results at a finer grain.

- `GridSearchCV`
- `classification_report`
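
A sketch of both tools together, assuming sklearn's grid search class `GridSearchCV` and a kNN model. The data is synthetic (two made-up clusters) and the parameter grid is an arbitrary example.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the X/Y columns: two clusters of 30 points each.
coords = np.vstack([
    rng.normal(loc=[-122.42, 37.77], scale=0.005, size=(30, 2)),
    rng.normal(loc=[-122.48, 37.73], scale=0.005, size=(30, 2)),
])
labels = np.array(["ASSAULT"] * 30 + ["WARRANTS"] * 30)

X_train, X_test, y_train, y_test = train_test_split(
    coords, labels, test_size=0.25, random_state=0, stratify=labels
)

# Try every combination in the grid with 3-fold cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]},
    cv=3,
)
search.fit(X_train, y_train)

# Per-class precision, recall and F1 on the held-out rows.
report = classification_report(y_test, search.predict(X_test))
print(search.best_params_)
print(report)
```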

## Class simplification / Business-Logic Optimization
What if we take the crime categories and combine the lower-frequency crimes into a single "OTHER" category?
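
A sketch of that idea on a made-up column: count each category, then relabel anything below a threshold. The threshold `min_count = 2` and the column name `CategorySimplified` are hypothetical choices for illustration.

```python
import pandas as pd

# Made-up stand-in for the Category column.
toy = pd.DataFrame({
    "Category": ["ASSAULT"] * 3 + ["WARRANTS"] * 2 + ["ARSON", "BRIBERY"],
})

min_count = 2  # hypothetical cutoff; tune this on the real data

# Categories that appear fewer than min_count times.
counts = toy["Category"].value_counts()
rare = counts[counts < min_count].index

# Keep frequent categories as they are, relabel the rare ones.
toy["CategorySimplified"] = toy["Category"].where(
    ~toy["Category"].isin(rare), "OTHER"
)

print(toy["CategorySimplified"].value_counts())
```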

## Expanding our feature set with an external data set (the weather!)
Add in a weather dataset and see what impact that has.
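
The join itself might look like the sketch below. Everything about the weather table is hypothetical: the column names `Date` and `MaxTempC` and the temperature values are placeholders, not real measurements, and a real weather file will have its own layout.

```python
import pandas as pd

crimes = pd.DataFrame({
    "Dates": pd.to_datetime(["2009-06-11 13:45:00", "2013-10-01 00:33:00"]),
    "Category": ["OTHER OFFENSES", "WARRANTS"],
})

# Hypothetical daily weather table with placeholder values.
weather = pd.DataFrame({
    "Date": pd.to_datetime(["2009-06-11", "2013-10-01"]),
    "MaxTempC": [18.0, 24.0],
})

# Strip the time of day so each incident can be matched to a calendar day.
crimes["Date"] = crimes["Dates"].dt.normalize()

# Left join keeps every incident, even on days with no weather record.
merged = crimes.merge(weather, on="Date", how="left")
print(merged)
```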

## Brainstorm
In your groups, brainstorm: what else can you do to expand the feature set or tweak the models to perform better?

Let's discuss and share together.

Then let's try those approaches!

## Presentations
Let's have each group present their best models, their approaches, and any Aha! moments or things they learned along the way.