# Final Project USC Data Science Bootcamp
## LA Crime data analysis and prediction using TensorFlow 

#### Maaike Rutten,  June 2019

For this project I used the crime data published on the City of LA website, covering 2010 to the present.
The dataset reflects incidents of crime in the City of Los Angeles.
The data is transcribed from original crime reports that were typed on paper, so it may contain
some inaccuracies.

This notebook combines several technologies and frameworks to read in the Los Angeles crime data and, using machine learning, predict the probability of crimes in certain age groups.

I downloaded the crime data as a CSV file and loaded it in a Jupyter notebook. 

For every individual crime, the dataset contains the following fields:
- DR Number   	
- Date Reported   
- Date Occurred   	
- Time Occurred   	
- Area ID   	
- Area Name   	
- Reporting District   	
- Crime Code   	
- Crime Code Description   	
- MO Codes   	
- Victim Age   	
- Victim Sex   	
- Victim Descent   	
- Premise Code   	
- Premise Description   	
- Weapon Used Code   	
- Weapon Description   	
- Status Code   	
- Status Description   	
- Crime Code 1   	
- Crime Code 2   	
- Crime Code 3   	
- Crime Code 4   	
- Address   	
- Cross Street   	
- Location
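
As a rough illustration of the loading step, the sketch below reads a tiny in-memory CSV with pandas. The two rows are made-up placeholder values, not records from the LA dataset, and only a handful of the fields listed above are included. With the real download you would pass the path of the CSV file to `pandas.read_csv` instead of the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical sample rows for illustration only (not real crime records).
sample_csv = io.StringIO(
    "DR Number,Date Occurred,Time Occurred,Area Name,Victim Age,Victim Sex\n"
    "1,01/01/2010,1330,Central,34,F\n"
    "2,01/02/2010,45,Hollywood,52,M\n"
)

crime_df = pd.read_csv(sample_csv)

# Replace spaces in the column names so they are easier to use as feature keys.
crime_df.columns = [name.replace(" ", "_") for name in crime_df.columns]

print(crime_df.dtypes)
print(crime_df.head())
```

Renaming the columns is a design choice made here for convenience; it mirrors the underscore-style keys such as `Victim_Age` used later in the notebook.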


In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
import tensorflow.feature_column as fc

import os
import sys

import matplotlib.pyplot as plt
from IPython.display import clear_output

In [None]:
tf.enable_eager_execution()

## Download the TensorFlow wide and deep implementation

The wide and deep model from the TensorFlow `models` repository will be used.

In [None]:
! pip install requests
! git clone --depth 1 https://github.com/tensorflow/models

Add the cloned repository to the Python path:

In [None]:
models_path = os.path.join(os.getcwd(), 'models')

sys.path.append(models_path)
print(sys.path)

Connect to the dataset:

In [None]:
from official.wide_deep import lacrime_dataset
from official.wide_deep import lacrime_main



Export the path so that external Python processes can find the package:

In [None]:
# Shell equivalent: export PYTHONPATH=${PYTHONPATH}:"$(pwd)/models"
# When running from Python, set `os.environ` or the subprocess will not see the directory.

if "PYTHONPATH" in os.environ:
  os.environ['PYTHONPATH'] += os.pathsep +  models_path
else:
  os.environ['PYTHONPATH'] = models_path

Run the model:


In [None]:
!python -m official.wide_deep.lacrime_main --model_type=wide --train_epochs=2

## Read the LA Crime data

In [None]:
train_file = "lacrime.data"
test_file = "lacrime.test"

In [None]:
import pandas

train_df = pandas.read_csv(train_file, header = None, names = lacrime_dataset._CSV_COLUMNS)
test_df = pandas.read_csv(test_file, header = None, names = lacrime_dataset._CSV_COLUMNS)

train_df.head()



## Converting Data into Tensors



In [None]:
ds = lacrime_dataset.input_fn(train_file, num_epochs=5, shuffle=True, batch_size=10)

for feature_batch, label_batch in ds.take(1):
  print('Feature keys:', list(feature_batch.keys())[:5])
  print()
  print('Age batch   :', feature_batch['Victim_Age'])
  print()
  print('Label batch :', label_batch )

Because `Estimators` expect an `input_fn` that takes no arguments, we typically wrap a configurable input function into an object with the expected signature. For this notebook, configure `train_inpf` to iterate over the data twice:

In [None]:
import functools

train_inpf = functools.partial(lacrime_dataset.input_fn, train_file, num_epochs=2, shuffle=True, batch_size=64)
test_inpf = functools.partial(lacrime_dataset.input_fn, test_file, num_epochs=1, shuffle=False, batch_size=64)
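
To make the wrapping concrete, here is a small stand-alone sketch of what `functools.partial` does. `toy_input_fn` is a hypothetical stand-in for `lacrime_dataset.input_fn`; it only returns the arguments it receives so the binding is easy to see.

```python
import functools

def toy_input_fn(data_file, num_epochs, shuffle, batch_size):
    """Hypothetical stand-in for a configurable input function."""
    return {"data_file": data_file, "num_epochs": num_epochs,
            "shuffle": shuffle, "batch_size": batch_size}

# Bind every argument up front; the result can be called with no arguments,
# which is the signature an Estimator expects from its input_fn.
toy_train_inpf = functools.partial(
    toy_input_fn, "lacrime.data", num_epochs=2, shuffle=True, batch_size=64)

config = toy_train_inpf()
print(config)
```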

#### Numeric columns

- Victim age

In [None]:
age = fc.numeric_column('Victim_Age')

The model will use the `feature_column` definitions to build the model input. You can inspect the resulting output using the `input_layer` function:

In [None]:
fc.input_layer(feature_batch, [age]).numpy()

The following will train and evaluate a model using only the `age` feature:

In [None]:
classifier = tf.estimator.LinearClassifier(feature_columns=[age])
classifier.train(train_inpf)
result = classifier.evaluate(test_inpf)

clear_output()  # used for display in notebook
print(result)

Similarly, we can define a `NumericColumn` for each continuous feature column
that we want to use in the model:

In [None]:
timeocurred_num = tf.feature_column.numeric_column('Time_Occurred')

my_numeric_columns = [age,timeocurred_num]

fc.input_layer(feature_batch, my_numeric_columns).numpy()

You could retrain a model on these features by changing the `feature_columns` argument to the constructor:

In [None]:
classifier = tf.estimator.LinearClassifier(feature_columns=my_numeric_columns)
classifier.train(train_inpf)

result = classifier.evaluate(test_inpf)

clear_output()

for key,value in sorted(result.items()):
  print('%s: %s' % (key, value))

#### Categorical columns

Victim descent takes one of a fixed list of possible values, so it can be represented with a vocabulary list:

In [None]:
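# The key must match the column name in lacrime_dataset._CSV_COLUMNS;
# 'Victim_Descent' follows the naming used by 'Victim_Age' and 'Area_Name'.
descent = fc.categorical_column_with_vocabulary_list(
    'Victim_Descent',
    ['O', 'B', 'H', 'W', 'X'])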

Run the input layer with both the `age` and `descent` columns:

In [None]:
fc.input_layer(feature_batch, [age, fc.indicator_column(descent)])
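
The effect of wrapping a vocabulary-list column in an indicator column can be sketched in plain Python. This is only an approximation of the idea, not TensorFlow's implementation; in particular, the choice to encode an unknown value as all zeros is an assumption about the default settings.

```python
VOCABULARY = ['O', 'B', 'H', 'W', 'X']

def one_hot(value, vocabulary=VOCABULARY):
    """Return a one-hot list for value; all zeros if it is not in the vocabulary."""
    return [1.0 if value == entry else 0.0 for entry in vocabulary]

encoded_h = one_hot('H')        # 'H' is the third vocabulary entry
encoded_unknown = one_hot('A')  # 'A' is not in the vocabulary

print(encoded_h)
print(encoded_unknown)
```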

Area name with `categorical_column_with_hash_bucket`:

In [None]:
areaname = tf.feature_column.categorical_column_with_hash_bucket(
    'Area_Name', hash_bucket_size=1000)

Each possible value in the `Area_Name` column is hashed to an integer ID as it is encountered in training.
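
A minimal sketch of the hashing idea, under the assumption that any stable hash is good enough for illustration: CRC32 is used here only because it ships with the standard library. TensorFlow applies its own hash function, so the IDs it produces will differ from these. The area names are example strings.

```python
import zlib

HASH_BUCKET_SIZE = 1000

def to_bucket(value, num_buckets=HASH_BUCKET_SIZE):
    """Map a string to a bucket ID in [0, num_buckets) using CRC32."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets

area_names = ["Central", "Hollywood", "Central"]
bucket_ids = [to_bucket(name) for name in area_names]
print(bucket_ids)
```

The same string always lands in the same bucket, while two different strings can collide; a larger `hash_bucket_size` makes collisions less likely.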

In [None]:
for item in feature_batch['Area_Name'].numpy():
    print(item.decode())

Run the input layer:

In [None]:
areaname_result = fc.input_layer(feature_batch, [fc.indicator_column(areaname)])

areaname_result.numpy().shape

It's easier to see the actual results if we take the `tf.argmax` over the `hash_bucket_size` dimension:

In [None]:
tf.argmax(areaname_result, axis=1).numpy()
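
The same step can be illustrated with NumPy on a toy array. The bucket count of 8 and the bucket IDs are hypothetical values picked to keep the output readable; `np.argmax` plays the role of `tf.argmax`.

```python
import numpy as np

num_buckets = 8  # small hypothetical size so the rows are easy to read
bucket_ids = np.array([3, 0, 5])

# Build one one-hot row per example, as an indicator column would.
one_hot_rows = np.eye(num_buckets)[bucket_ids]

# argmax over the bucket dimension recovers the bucket ID of each row.
recovered = np.argmax(one_hot_rows, axis=1)

print(one_hot_rows)
print(recovered)
```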

### Derived feature columns

#### Make Continuous Features Categorical through Bucketization

Bucketize the ages to create age groups:

In [None]:
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])

In [None]:
fc.input_layer(feature_batch, [age, age_buckets]).numpy()
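
To see which bucket a given age falls into, the boundaries can be applied by hand. This sketch uses `bisect` from the standard library and assumes each bucket includes its lower boundary and excludes its upper one; the sample ages are arbitrary.

```python
import bisect

BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def age_bucket(age, boundaries=BOUNDARIES):
    """Return the bucket index: 0 for age < 18, 1 for 18 <= age < 25, and so on."""
    return bisect.bisect_right(boundaries, age)

ages = [10, 18, 24, 25, 64, 65, 90]
bucket_indices = [age_bucket(a) for a in ages]
print(bucket_indices)  # [0, 1, 1, 2, 9, 10, 10]
```

Ten boundaries give eleven buckets, so the bucketized column contributes an eleven-element one-hot vector per example.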

## Define the logistic regression model



In [None]:
import tempfile

# Feature columns defined in the cells above. `Victim_Sex` is left out
# because it is used as the label in the prediction cell below.
base_columns = [
    timeocurred_num, descent, areaname,
    age_buckets,
]

model = tf.estimator.LinearClassifier(
    model_dir=tempfile.mkdtemp(),
    feature_columns=base_columns,
    optimizer=tf.train.FtrlOptimizer(learning_rate=0.1))
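
For a two-class problem, a linear classifier of this kind amounts to logistic regression: a weighted sum of the encoded features is passed through a sigmoid to give a probability. The sketch below shows that calculation with made-up weights and a hypothetical three-element feature vector; the numbers are not taken from the trained model.

```python
import math

def sigmoid(z):
    """Squash a real-valued logit into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Logistic regression: sigmoid of the bias plus the weighted feature sum."""
    logit = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(logit)

# Hypothetical one-hot feature vector and made-up weights.
features = [0.0, 1.0, 0.0]
weights = [0.4, -1.2, 0.7]
bias = 0.2

probability = predict_probability(features, weights, bias)
predicted_class_id = int(probability > 0.5)
print(probability, predicted_class_id)
```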



## Train and evaluate the model



In [None]:
train_inpf = functools.partial(lacrime_dataset.input_fn, train_file,
                               num_epochs=40, shuffle=True, batch_size=64)

model.train(train_inpf)

clear_output()  

Evaluate the accuracy of the model by predicting the labels of the test set:

In [None]:
results = model.evaluate(test_inpf)

clear_output()

for key,value in sorted(results.items()):
  print('%s: %0.2f' % (key, value))

Check how the model's predictions compare with the real labels:

In [None]:
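import numpy as np

def easy_input_function(df, label_key, num_epochs, shuffle, batch_size):
  # Minimal helper (an assumption; its definition is not shown in this
  # notebook): builds a tf.data.Dataset of (features, label) pairs
  # from a pandas DataFrame.
  label = df[label_key]
  ds = tf.data.Dataset.from_tensor_slices((dict(df), label))
  if shuffle:
    ds = ds.shuffle(10000)
  return ds.batch(batch_size).repeat(num_epochs)

predict_df = test_df[:20].copy()

pred_iter = model.predict(
    lambda: easy_input_function(predict_df, label_key='Victim_Sex',
                                num_epochs=1, shuffle=False, batch_size=10))

# Assumes class ID 0 maps to 'M' and 1 to 'F'; this must match the label
# encoding used by lacrime_dataset.
classes = np.array(['M', 'F'])
pred_class_id = []

for pred_dict in pred_iter:
  # 'class_ids' is a length-1 array; keep the scalar so the result is 1-D.
  pred_class_id.append(pred_dict['class_ids'][0])

predict_df['predicted_class'] = classes[np.array(pred_class_id)]
predict_df['correct'] = predict_df['predicted_class'] == predict_df['Victim_Sex']

clear_output()

predict_df[['Victim_Sex', 'predicted_class', 'correct']]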
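
The `correct` column can be summarised into a single accuracy figure by taking its mean, since `True` counts as 1 and `False` as 0. The sketch below uses a hypothetical five-row table with invented labels and predictions, not output from the model above.

```python
import pandas as pd

# Hypothetical labels and predictions, for illustration only.
toy_df = pd.DataFrame({
    "Victim_Sex":      ["M", "F", "F", "M", "F"],
    "predicted_class": ["M", "F", "M", "M", "M"],
})
toy_df["correct"] = toy_df["predicted_class"] == toy_df["Victim_Sex"]

# Fraction of rows where the prediction matches the label.
accuracy = toy_df["correct"].mean()
print("accuracy: %0.2f" % accuracy)  # accuracy: 0.60
```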