# **Predicting U.S. Crime Rates**

## **Recurrent Neural Network with Long Short-Term Memory**

---

**PLEASE NOTE: To run this notebook, you may need to upload it, along with the following dataset, to [Google Colab](https://colab.research.google.com/). From what we can tell, an incompatibility between NumPy and TensorFlow prevents the script from running in Jupyter.**

* Upload [predictors_and_targets.csv](../data/model_inputs/predictors_and_targets.csv) to the Colab runtime. The `pd.read_csv()` call below is configured to read the CSV from the root runtime upload folder. This notebook writes the model output to CSV files, which can be downloaded from the same runtime folder.

---

In this notebook, we'll train a Long Short-Term Memory (LSTM) Recurrent Neural Network in a fashion inspired by [this TensorFlow tutorial](https://www.tensorflow.org/tutorials/structured_data/time_series#setup).

Start by importing a few important libraries:

In [1]:
import os
import datetime

import IPython
import IPython.display
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf

from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# set a random seed:
tf.random.set_seed(42)

Import and format the data:

In [2]:
df = pd.read_csv('predictors_and_targets.csv')
to_drop = ['violent_crime', 'homicide', 'rape',
       'robbery', 'aggravated_assault', 'property_crime', 'burglary',
       'larceny', 'motor_vehicle_theft', 'arson', 'homicide_1000', 'rape_1000', 
       'robbery_1000', 'aggravated_assault_1000','burglary_1000', 'larceny_1000',
       'motor_vehicle_theft_1000', 'arson_1000', 'ag_Unknown']
df.drop(columns=to_drop,inplace=True)

#to datetime format inspired by user Zero: https://stackoverflow.com/a/46658244
years = df.copy()
years = list(years['year'].unique())
years.append(years[-1]+1)
years = pd.to_datetime(np.array(years),format="%Y")

#set the years to a datetime format:
df['year'] = pd.to_datetime(df['year'], format = '%Y')

## Modeling Goals
* Build a model that can produce predictions for a specified crime rate in each state for every year after the first observed year. Do this for two crime categories:
    * `violent_crime_1000`, the number of violent crimes committed per thousand population in a state and
    * `property_crime_1000`, the number of property crimes committed per thousand population in a state.
* Compare the results of the model to a baseline for evaluation
  * Metrics used will be mean absolute error (MAE), root mean squared error (RMSE), and R2 (the coefficient of determination)
* Append the predictions to a time series of crime rates
  * Flag the predictions as 'forecast' and the observed as 'historical'
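
To make the three metrics concrete, here is a small plain-Python sketch of how each one is computed. The helper names (`mae`, `rmse`, `r2`) and the sample rates are hypothetical and only for illustration; the notebook itself relies on `sklearn.metrics` for scoring.

```python
import math

def mae(true, pred):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)

def rmse(true, pred):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))

def r2(true, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1 - ss_res / ss_tot

# Hypothetical crime rates per 1,000 residents (not taken from the dataset).
observed = [4.0, 5.0, 6.0, 5.0]
predicted = [4.5, 5.0, 5.0, 5.5]

print(mae(observed, predicted))   # 0.5
print(rmse(observed, predicted))  # sqrt(0.375), roughly 0.612
print(r2(observed, predicted))    # 0.25
```

Lower MAE and RMSE are better, while R2 is better the closer it is to 1. An R2 below 0 means the predictions are worse than always guessing the mean.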

---

Modeling steps that must happen for each state:

1. Extract the state dataframe
1. Set the year column to be the index
1. Define the X and Y
1. Set aside just the **index** and **crime rate** columns in another dataframe, with a new field which labels the observations as 'historical'. Add a new row for the forecast period, with values to be filled later.
1. Scale the data
1. Make predictions using the baseline model, add to scoring matrix
1. Make predictions using the LSTM model, add to scoring matrix
1. Append predictions to historical+forecast dataframe
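
The scaling step is worth a quick illustration, since predictions have to be un-scaled before they are appended to the historical series. The sketch below standardizes a hypothetical series and then reverses the transformation. The function names and values are assumptions for illustration; the notebook uses `StandardScaler`, which standardizes with the population standard deviation.

```python
import statistics

def standardize(values):
    """Return z-scores plus the mean and population std used to compute them."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (ddof=0)
    return [(v - mean) / std for v in values], mean, std

def unscale(scaled, mean, std):
    """Reverse the standardization: z * std + mean."""
    return [z * std + mean for z in scaled]

# Hypothetical crime rates per 1,000 residents (not taken from the dataset).
rates = [3.0, 4.0, 5.0, 8.0]
scaled, mean, std = standardize(rates)
restored = unscale(scaled, mean, std)
```

The round trip is exact only when the same mean and standard deviation are used in both directions. Note that pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so it needs `ddof=0` to match.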

We will run the above for each state plus D.C. We'll evaluate the model by comparing the mean MAE, RMSE, and R2 of the LSTM predictions across all states against the mean MAE, RMSE, and R2 of the baseline model's predictions across all states.
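
The baseline is not spelled out in the text, but the modeling loop later in this notebook pairs each year's rate with the previous year's rate, which amounts to a "persistence" baseline. A minimal sketch of that alignment, with a hypothetical function name and made-up values:

```python
def persistence_baseline(series):
    """Pair each observation (from the second onward) with the one before it."""
    true = series[1:]       # years 2..N
    baseline = series[:-1]  # years 1..N-1, used as the "prediction"
    return true, baseline

# Hypothetical yearly crime rates for one state (not taken from the dataset).
rates = [4.0, 4.5, 4.2, 3.9]
true, baseline = persistence_baseline(rates)

# Score the baseline the same way the LSTM is scored, here with MAE only.
baseline_mae = sum(abs(t - b) for t, b in zip(true, baseline)) / len(true)
```

Because the first observed year has no previous year to copy from, predictions and scores start at the second year.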

First, we'll need to define a helper class that generates windows of consecutive time steps. It is heavily inspired by the [TensorFlow Time Series tutorial](https://www.tensorflow.org/tutorials/structured_data/time_series#setup).
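
Before the full class, it may help to see the index bookkeeping on its own. The sketch below reproduces only the window arithmetic in plain Python. The function name `window_indices` is hypothetical, and the small widths are chosen for readability rather than taken from the notebook.

```python
def window_indices(input_width, label_width, shift):
    """Return the total window size plus the input and label positions in it."""
    total_window_size = input_width + shift
    positions = list(range(total_window_size))
    input_indices = positions[:input_width]
    label_start = total_window_size - label_width
    label_indices = positions[label_start:]
    return total_window_size, input_indices, label_indices

# A "wide" window: every input step has a label one step ahead of it.
total, inputs, labels = window_indices(input_width=3, label_width=3, shift=1)
print(total, inputs, labels)  # 4 [0, 1, 2] [1, 2, 3]

# A single-step window: three inputs predict only the next step.
_, single_inputs, single_labels = window_indices(input_width=3, label_width=1, shift=1)
print(single_inputs, single_labels)  # [0, 1, 2] [3]
```

With `shift=1` and equal input and label widths, the labels are simply the inputs moved one year forward, which is the setup used for training below.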

In [3]:
class WindowGenerator():
    def __init__(self, input_width, label_width, shift,
                 train_df,
                 label_columns=None):
        # Store the raw data.
        self.train_df = train_df

        # Work out the label column indices.
        self.label_columns = label_columns
        if label_columns is not None:
            self.label_columns_indices = {name: i for i, name in
                                        enumerate(label_columns)}
        self.column_indices = {name: i for i, name in
                           enumerate(train_df.columns)}

        # Work out the window parameters.
        self.input_width = input_width
        self.label_width = label_width
        self.shift = shift

        self.total_window_size = input_width + shift

        self.input_slice = slice(0, input_width)
        self.input_indices = np.arange(self.total_window_size)[self.input_slice]

        self.label_start = self.total_window_size - self.label_width
        self.labels_slice = slice(self.label_start, None)
        self.label_indices = np.arange(self.total_window_size)[self.labels_slice]

    def __repr__(self):
        return '\n'.join([
            f'Total window size: {self.total_window_size}',
            f'Input indices: {self.input_indices}',
            f'Label indices: {self.label_indices}',
            f'Label column name(s): {self.label_columns}'])
    
    def split_window(self, features):
        inputs = features[:, self.input_slice, :]
        labels = features[:, self.labels_slice, :]
        if self.label_columns is not None:
            labels = tf.stack(
                [labels[:, :, self.column_indices[name]] for name in self.label_columns],
                axis=-1)

        # Slicing doesn't preserve static shape information, so set the shapes
        # manually. This way the `tf.data.Datasets` are easier to inspect.
        inputs.set_shape([None, self.input_width, None])
        labels.set_shape([None, self.label_width, None])

        return inputs, labels


    def plot(self, model=None, plot_col='violent_crime_1000', max_subplots=3):
        inputs, labels = self.example
        plt.figure(figsize=(12, 8))
        plot_col_index = self.column_indices[plot_col]
        max_n = min(max_subplots, len(inputs))
        for n in range(max_n):
            plt.subplot(max_n, 1, n+1)
            plt.ylabel(f'{plot_col} [normed]')
            plt.plot(self.input_indices, inputs[n, :, plot_col_index],
                 label='Inputs', marker='.', zorder=-10)

            if self.label_columns:
                label_col_index = self.label_columns_indices.get(plot_col, None)
            else:
                label_col_index = plot_col_index

            if label_col_index is None:
                continue

            plt.scatter(self.label_indices, labels[n, :, label_col_index],
                    edgecolors='k', label='Labels', c='#2ca02c', s=64)
            if model is not None:
                predictions = model(inputs)
                plt.scatter(self.label_indices, predictions[n, :, label_col_index],
                            marker='X', edgecolors='k', label='Predictions',
                            c='#ff7f0e', s=64)

            if n == 0:
                plt.legend()

        plt.xlabel('Year')
    
    def make_dataset(self, data):
        data = np.array(data, dtype=np.float32)
        ds = tf.keras.preprocessing.timeseries_dataset_from_array(
            data=data,
            targets=None,
            sequence_length=self.total_window_size,
            sequence_stride=1,
            shuffle=True,
            batch_size=41,)

        ds = ds.map(self.split_window)

        return ds

    
    @property
    def train(self):
        return self.make_dataset(self.train_df)

    @property
    def example(self):
        """Get and cache an example batch of `inputs, labels` for plotting."""
        result = getattr(self, '_example', None)
        if result is None:
            # No example batch was found, so get one from the `.train` dataset
            result = next(iter(self.train))
            # And cache it for next time
            self._example = result
        return result

#define a helper that compiles a model and fits it on a window's training data:
MAX_EPOCHS = 20

def compile_and_fit(model, window, patience=2):
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                      patience=patience,
                                                      mode='min')

    model.compile(loss=tf.losses.MeanSquaredError(),
                  optimizer=tf.optimizers.Adam(),
                  metrics=[tf.metrics.MeanAbsoluteError()])

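    # No separate validation split is defined, so `val_loss` (monitored by
    # early stopping) is computed on the training window itself.
    history = model.fit(window.train, epochs=MAX_EPOCHS,
                        validation_data=window.train,
                        callbacks=[early_stopping], verbose=0)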
    return history
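
The `patience` argument controls how long training continues without improvement in the monitored loss. The following is a simplified plain-Python sketch of that rule, not Keras's actual implementation (which has further options such as `min_delta` and `restore_best_weights`). The function name and loss values are hypothetical.

```python
def early_stopping_epoch(losses, patience=2):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            # Improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: stop once `patience` such epochs have accumulated.
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical monitored losses, one per epoch.
val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.6]
stopped = early_stopping_epoch(val_losses, patience=2)
print(stopped)  # 4: epochs 3 and 4 both failed to beat the best loss of 0.7
```

In this example the run never reaches the final loss of 0.6, which shows the trade-off of a small `patience`: training stops sooner but can miss a later improvement.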

In [5]:
#define our states:
states = df.copy()
states = states['state_abbr'].unique()

#define dataframes to hold new things:
target_forecast = pd.DataFrame(None,columns=['year','state','measure_type','violent_crime_1000','property_crime_1000'])
print(f'shape of target_forecast: {target_forecast.shape} ({list(target_forecast.columns)})') # will add to this as we loop through modeling the states

scores = pd.DataFrame(None,columns=['state','crime_rate','baseline_MAE','baseline_RMSE','baseline_R2','LSTM_MAE','LSTM_RMSE','LSTM_R2'])
print(f'shape of scores: {scores.shape, scores.columns} ({list(scores.columns)})') # will add to this as we loop through modeling

rate_preds = pd.DataFrame(None,columns=['year','state','crime_rate','prediction'])
print(f'shape of rate_preds: {rate_preds.shape, rate_preds.columns} ({list(rate_preds.columns)})')

def get_scores(state, crime, true, baseline, pred):
    state_scores = {x:None for x in scores.columns}
    state_scores['state'] = state
    state_scores['crime_rate'] = crime
    # RMSE is computed as sqrt(MSE); the `squared=False` argument is deprecated
    # and later removed in newer scikit-learn releases.
    state_scores['baseline_MAE'] = metrics.mean_absolute_error(true, baseline)
    state_scores['baseline_RMSE'] = np.sqrt(metrics.mean_squared_error(true, baseline))
    state_scores['baseline_R2'] = metrics.r2_score(true, baseline)
    state_scores['LSTM_MAE'] = metrics.mean_absolute_error(true, pred)
    state_scores['LSTM_RMSE'] = np.sqrt(metrics.mean_squared_error(true, pred))
    state_scores['LSTM_R2'] = metrics.r2_score(true, pred)
    return state_scores # must be a dict with keys matching the scores columns


shape of target_forecast: (0, 5) (['year', 'state', 'measure_type', 'violent_crime_1000', 'property_crime_1000'])
shape of scores: ((0, 8), Index(['state', 'crime_rate', 'baseline_MAE', 'baseline_RMSE', 'baseline_R2',
       'LSTM_MAE', 'LSTM_RMSE', 'LSTM_R2'],
      dtype='object')) (['state', 'crime_rate', 'baseline_MAE', 'baseline_RMSE', 'baseline_R2', 'LSTM_MAE', 'LSTM_RMSE', 'LSTM_R2'])
shape of rate_preds: ((0, 4), Index(['year', 'state', 'crime_rate', 'prediction'], dtype='object')) (['year', 'state', 'crime_rate', 'prediction'])


In [6]:
for state in states:
    state_crime = df.copy()
    state_crime = state_crime[state_crime['state_abbr']==state]
    state_crime.drop(columns=['state_abbr'],inplace=True)
    state_crime.set_index('year',inplace=True)

    #save the mean and standard deviation to unscale predictions later
    #(ddof=0 matches the population standard deviation used by StandardScaler):
    means = {'violent_crime_1000': state_crime['violent_crime_1000'].mean(),
             'property_crime_1000': state_crime['property_crime_1000'].mean()}
    sdevs = {'violent_crime_1000': state_crime['violent_crime_1000'].std(ddof=0),
             'property_crime_1000': state_crime['property_crime_1000'].std(ddof=0)}
    
    sc = StandardScaler()

    state_crime = pd.DataFrame(sc.fit_transform(state_crime), index=state_crime.index,columns=state_crime.columns)

    forecasted = [years[-1],state, 'forecast']
    
    for crime in ['violent_crime_1000','property_crime_1000']:
        state_crime_filtered = state_crime[['population', crime, 'avg_unemployment_rate', 'avg_CPI','ag_Democrat', 'ag_Mixed', 'ag_Republican']].copy()
        
        #set the baseline: each year's rate is predicted to equal the previous year's (scaled) rate
        crime_baseline_y = state_crime_filtered[crime][:-1]
        
        #define a wide window:
        wide_window = WindowGenerator(
            input_width=41, 
            label_width=41, 
            shift=1,
            train_df=state_crime_filtered,
            label_columns=[crime])

        #define our LSTM model:
        lstm_model = tf.keras.models.Sequential([
            # Shape [batch, time, features] => [batch, time, lstm_units]
            tf.keras.layers.LSTM(41, return_sequences=True),
            # Shape => [batch, time, features]
            tf.keras.layers.Dense(units=1)
        ])
        
        #fit:
        history = compile_and_fit(lstm_model, wide_window)
        
        #the model is now trained; build a 42-step window with no shift to make a set of predictions:
        predictive_input = WindowGenerator(
            input_width=42, 
            label_width=42, 
            shift=0,
            train_df=state_crime_filtered,
            label_columns=[crime])
        
        #get predictions:
        preds = lstm_model.predict(predictive_input.example[0])[0]
        
        #clear memory:
        del history
        
        #get scores (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):
        scores = pd.concat(
            [scores,
             pd.DataFrame([get_scores(state=state,
                                      crime=crime,
                                      true=state_crime_filtered[crime][1:],
                                      baseline=crime_baseline_y,
                                      pred=preds[:-1])])],
            ignore_index=True)
        
        #un-scale the preds so they are closer to their unscaled historical values:
        preds = preds*sdevs[crime]+means[crime]
        
        #append rate predictions:
        rates = pd.DataFrame(data={'year': years[1:],
                                   'state': state,
                                   'crime_rate': crime,
                                   'prediction': [x[0] for x in preds]})
        rate_preds = pd.concat([rate_preds, rates], ignore_index=True)
        
        #append the forecast (the final prediction) for this crime rate:
        forecasted.append(preds[-1][0])
    
    #put the violent and property crime forecast into the forecasted dataframe
    to_forecast = {x: None for x in target_forecast.columns}
    for k, v in zip(to_forecast.keys(),forecasted):
        to_forecast[k] = v
    target_forecast = pd.concat([target_forecast, pd.DataFrame([to_forecast])], ignore_index=True)



In [7]:
#concatenate the historical and forecast dataframes together:
historical = df.copy()
historical = historical[['year','state_abbr','violent_crime_1000','property_crime_1000']]
historical['measure_type'] = 'historical'
historical = historical[['year', 'state_abbr','measure_type', 'violent_crime_1000', 'property_crime_1000']]
target_forecast.rename(columns={'state': 'state_abbr'}, inplace=True) 
target_forecast = pd.concat([historical,target_forecast],axis=0).sort_values(['state_abbr','year']).reset_index(drop=True)

In [8]:
#write the forecasts, scores, and predictions to CSV files:
target_forecast.to_csv('target_forecast.csv',index=False)
scores.to_csv('scores.csv', index=False)
rate_preds.to_csv('predictions.csv',index=False)