## Creating the data for classification

With the following code, you can create the data for classification yourself.
You can decide which features to one-hot encode and which crime types to use.
Try the code if you want to test larger parts of the original data set.

### 1. Read in your cleaned data

In [1]:
import pandas as pd

df = pd.read_csv("clean_crime.csv")

### 2. One-hot encode or drop categorical data

In [2]:
# list the crime types you want to classify
crime_classification = [
    "Ballistics", 
    "License Violation",
    "Liquor Violation",
    "Fire Related Reports"] 

# keep only the rows that belong to the wanted crime types
df_ml = df[df.OFFENSE_CODE_GROUP.isin(crime_classification)]

# drop the columns we will not use as features
df_ml = df_ml.drop(
    columns=[
        "INCIDENT_NUMBER",
        "OCCURRED_ON_DATE",
        "OFFENSE_DESCRIPTION",
        "OFFENSE_CODE",
        "STREET",
    ]
)

# one-hot encode the categorical columns
df_ml_hot = pd.get_dummies(df_ml, columns=["DAY_OF_WEEK","DISTRICT","REPORTING_AREA", "SHOOTING"])

# make sure all dtypes are numerical (or boolean), except for the y-labels
df_ml_hot.dtypes

OFFENSE_CODE_GROUP     object
YEAR                    int64
MONTH                   int64
HOUR                    int64
Lat                   float64
                       ...   
REPORTING_AREA_97        bool
REPORTING_AREA_98        bool
REPORTING_AREA_99        bool
SHOOTING_N               bool
SHOOTING_Y               bool
Length: 774, dtype: object
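
With 774 columns, the truncated `dtypes` output is hard to check by eye. A minimal sketch of a programmatic check follows, run on a small made-up frame. The helper name `non_numeric_columns`, the column `SOME_TEXT_COLUMN`, and all toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def non_numeric_columns(frame, label_column):
    """Return feature columns whose dtype is neither numeric nor boolean."""
    features = frame.drop(columns=[label_column])
    return [
        name
        for name, dtype in features.dtypes.items()
        if not (is_numeric_dtype(dtype) or is_bool_dtype(dtype))
    ]


# Toy stand-in for df_ml (values are invented for illustration)
toy = pd.DataFrame({
    "OFFENSE_CODE_GROUP": ["Ballistics", "Liquor Violation", "Ballistics"],
    "HOUR": [1, 14, 23],
    "DAY_OF_WEEK": ["Monday", "Friday", "Monday"],
    "SOME_TEXT_COLUMN": ["a", "b", "c"],  # hypothetical column left un-encoded
})
toy_hot = pd.get_dummies(toy, columns=["DAY_OF_WEEK"])

# Any name listed here still needs to be encoded or dropped
leftover = non_numeric_columns(toy_hot, "OFFENSE_CODE_GROUP")
print(leftover)
```

On the real data, calling `non_numeric_columns(df_ml_hot, "OFFENSE_CODE_GROUP")` should return an empty list if every feature column was either dropped or encoded.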

### 3. Shuffle and split one-hot-encoded data into training set and test set

In [3]:
# shuffle the rows
df_ml_hot = df_ml_hot.sample(frac=1)

# split 80% into training set
split_ratio = 0.8
train_set = df_ml_hot[:int(split_ratio*len(df_ml_hot.index))]

# split the rest into testing set
test_set = df_ml_hot[int(split_ratio*len(df_ml_hot.index)):]

len(train_set.index), len(test_set.index)

(4328, 1083)
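
The shuffle above uses no fixed seed, so each run produces a different split, and rare crime types may end up unevenly distributed between the two sets. One possible alternative, sketched here on invented toy data, samples 80% of each class with a fixed seed. The seed value and the toy frame are assumptions for illustration, and the approach relies on the frame having a unique index.

```python
import pandas as pd

# Toy stand-in for df_ml_hot (values are invented for illustration)
toy = pd.DataFrame({
    "OFFENSE_CODE_GROUP": ["Ballistics"] * 10 + ["Liquor Violation"] * 5,
    "HOUR": range(15),
})

split_ratio = 0.8
seed = 42  # arbitrary fixed seed, so reruns give the same split

# Sample 80% of each class, then use the remaining rows as the test set
train_set = toy.groupby("OFFENSE_CODE_GROUP", group_keys=False).sample(
    frac=split_ratio, random_state=seed
)
test_set = toy.drop(train_set.index)  # requires a unique index

# Shuffle so the rows are no longer grouped by class
train_set = train_set.sample(frac=1, random_state=seed)
test_set = test_set.sample(frac=1, random_state=seed)

print(train_set["OFFENSE_CODE_GROUP"].value_counts().to_dict())
print(test_set["OFFENSE_CODE_GROUP"].value_counts().to_dict())
```

Both sets keep roughly the same class proportions as the full data, which makes accuracy on the test set easier to interpret when some crime types are much rarer than others.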

### 4. Split training and test set into samples (X) and sample labels (y)

In [4]:
# remove the OFFENSE_CODE_GROUP column from the training samples (X_train) and use it as y_train
X_train = train_set.drop("OFFENSE_CODE_GROUP", axis=1) 
y_train = train_set["OFFENSE_CODE_GROUP"]

assert len(X_train.index) == len(y_train.index)

# remove the OFFENSE_CODE_GROUP column from the testing samples (X_test) and use it as y_test
X_test = test_set.drop("OFFENSE_CODE_GROUP", axis=1)
y_test = test_set["OFFENSE_CODE_GROUP"]

assert len(X_test.index) == len(y_test.index)

### 5. Save the files  

In [5]:
X_train.to_csv("crime_training_data.csv", index=False)
y_train.to_csv("crime_training_labels.csv", index=False)
X_test.to_csv("crime_test_data.csv", index=False)
y_test.to_csv("crime_test_labels.csv", index=False)
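
Because the files are written with `index=False`, samples and labels are matched only by row order. A minimal sketch of reading one pair back and checking that they still line up is shown below. It uses invented toy data and a temporary directory instead of the real files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for X_train and y_train (values are invented for illustration)
X_train = pd.DataFrame({"HOUR": [1, 14, 23], "SHOOTING_Y": [False, True, False]})
y_train = pd.Series(
    ["Ballistics", "Liquor Violation", "Ballistics"], name="OFFENSE_CODE_GROUP"
)

with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "crime_training_data.csv"
    labels_path = Path(tmp) / "crime_training_labels.csv"
    X_train.to_csv(data_path, index=False)
    y_train.to_csv(labels_path, index=False)

    X_loaded = pd.read_csv(data_path)
    # The label file has a single column; select it to get a Series back
    y_loaded = pd.read_csv(labels_path)["OFFENSE_CODE_GROUP"]

# Row counts must match, since alignment depends on row order alone
assert len(X_loaded) == len(y_loaded)
print(X_loaded.dtypes)
```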