##Boston 311 Machine Learning models, version 2

In this notebook we are going to organize our data cleaning into idempotent functions. Idempotent means we can run a function over and over on the same input and get the same result every time. In the last notebook, some cells could not be re-run on their own because the column they processed had already been deleted.
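To make the idea concrete, here is a minimal sketch of the failure mode described above next to an idempotent alternative. The function names and the two-row toy dataframe are made up for illustration and do not come from the first notebook.

```python
import pandas as pd

def add_event_destructive(df):
    # Not safe to re-run: the source column is deleted on the first call,
    # so a second call on the same dataframe raises KeyError.
    df['event'] = df['closed_dt'].notnull().astype(int)
    del df['closed_dt']
    return df

def add_event_idempotent(df):
    # Safe to re-run: works on a copy and never modifies the input.
    out = df.copy()
    out['event'] = out['closed_dt'].notnull().astype(int)
    return out.drop(columns=['closed_dt'])

def make_toy():
    # Hypothetical data: one closed case and one still-open case
    return pd.DataFrame({'closed_dt': pd.to_datetime(['2022-01-01', None])})

toy_safe = make_toy()
first = add_event_idempotent(toy_safe)
second = add_event_idempotent(toy_safe)
pd.testing.assert_frame_equal(first, second)  # same input, same output

toy_destructive = make_toy()
add_event_destructive(toy_destructive)
try:
    add_event_destructive(toy_destructive)
    rerun_failed = False
except KeyError:
    rerun_failed = True
print('destructive version failed on re-run:', rerun_failed)
```

The idempotent version pays for an extra copy of the dataframe, which is usually a fair trade for cells that can be re-run in any order.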

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler

%matplotlib inline

#first of course we must import the necessary modules

##Writing our data cleaning functions

To get started writing our data cleaning function, we will start by only using the data from 2022, to make testing our function faster.

In [None]:
df2022 = pd.read_csv("https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/81a7b022-f8fc-4da5-80e4-b160058ca207/download/tmph4izx_fb.csv",
                            parse_dates=['open_dt', 'target_dt', 'closed_dt'])

Let's copy all of the data-cleaning code from the first notebook:



```
#code for cleaning for logistic regression
data['survival_time'] = data['closed_dt'] - data['open_dt']
data['event'] = data['closed_dt'].notnull().astype(int)
data['ward_number'] = data['ward'].str.extract(r'0*(\d+)')

cols_to_keep = ['case_enquiry_id', 'survival_time', 'event', 'subject', 'reason', 'department', 'source', 'ward_number']
clean_data = data[cols_to_keep].copy()

clean_data = pd.get_dummies(clean_data, columns=['subject', 'reason', 'department', 'source', 'ward_number'])

#code for cleaning for linear regression
clean_data_survival_mask = clean_data["survival_time"].notnull()
clean_data_survival = clean_data[clean_data_survival_mask].copy()
clean_data_survival['survival_time_hours'] = clean_data_survival['survival_time'].apply(lambda x: x.total_seconds()/3600)

#code for splitting cleaned data into features and targets:

#logistic regression:
X = clean_data.drop(['event', 'survival_time'], axis=1)
y = clean_data['event']

#linear regression
X = clean_data_survival.drop(['survival_time_hours', 'survival_time', 'event'], axis=1) # drop the target and event columns
y = clean_data_survival['survival_time_hours']
```

We have four different sets of code here:
1. The initial cleaning code that prepared for logistic regression to predict the Closed/Not Closed label
2. Further data cleaning that removed all Not Closed records to prepare for linear regression on the survival_time label, computed from "open_dt" and "closed_dt" features
3. Code for splitting the data into feature and label dataframes for logistic regression
4. Code for splitting the data into feature and label dataframes for linear regression
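To see what the derived columns look like, the sketch below runs the same cleaning steps on a three-row toy dataframe. The dates and ward strings are invented for illustration (the real `ward` column may use other formats), and `expand=False` is used here only so that `str.extract` returns a Series.

```python
import pandas as pd

# Hypothetical rows: two closed cases and one still-open case
toy = pd.DataFrame({
    'open_dt': pd.to_datetime(['2022-01-01 08:00', '2022-01-02 09:30', '2022-01-03 12:00']),
    'closed_dt': pd.to_datetime(['2022-01-01 20:00', None, '2022-01-05 12:00']),
    'ward': ['Ward 3', '03', '10'],
})

# Time between opening and closing; NaT when the case is still open
toy['survival_time'] = toy['closed_dt'] - toy['open_dt']

# 1 if the case was closed, 0 otherwise
toy['event'] = toy['closed_dt'].notnull().astype(int)

# Pull out the ward digits and drop leading zeros ('Ward 3' and '03' both become '3')
toy['ward_number'] = toy['ward'].str.extract(r'0*(\d+)', expand=False)

# For linear regression, keep only closed cases and convert the duration to hours
closed = toy[toy['survival_time'].notnull()].copy()
closed['survival_time_hours'] = closed['survival_time'].dt.total_seconds() / 3600

print(toy[['event', 'ward_number']])
print(closed['survival_time_hours'])
```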

We could make anywhere from one to four data cleaning functions. But what will our ideal workflow be in the future? Probably:

1. Clean and split data for logistic regression
2. Clean and split data for linear regression

Long term we will probably want two models: one to tell us if a case is likely to be closed, and another to estimate how long that will take. The question remains whether we want to completely decouple the data cleaning for linear regression from the cleaning for logistic regression.

While right now we are testing the same feature encoding on both logistic and linear regression, in the future we will probably want to use either different features or different machine learning algorithms for our two models. However, how we clean and how we split our data for each model are closely connected, so for now we will not write separate clean and split functions.

We will therefore write two functions that take the original dataframe and return the feature dataframe and the label dataframe as a tuple.

In [None]:
#These functions originally did not have the WRONG in the name, but it was added later for clarity. 
#If you continue reading you will see why, or maybe you can spot it right away

def clean_and_split_for_logistic_WRONG(data) :
  data['survival_time'] = data['closed_dt'] - data['open_dt']
  data['event'] = data['closed_dt'].notnull().astype(int)
  data['ward_number'] = data['ward'].str.extract(r'0*(\d+)')

  cols_to_keep = ['case_enquiry_id', 'survival_time', 'event', 'subject', 'reason', 'department', 'source', 'ward_number']
  clean_data = data[cols_to_keep].copy()

  clean_data = pd.get_dummies(clean_data, columns=['subject', 'reason', 'department', 'source', 'ward_number'])
  X = clean_data.drop(['event', 'survival_time'], axis=1)
  y = clean_data['event']

  return X, y

def clean_and_split_for_linear_WRONG(data) :
  data['survival_time'] = data['closed_dt'] - data['open_dt']
  data['event'] = data['closed_dt'].notnull().astype(int)
  data['ward_number'] = data['ward'].str.extract(r'0*(\d+)')

  cols_to_keep = ['case_enquiry_id', 'survival_time', 'event', 'subject', 'reason', 'department', 'source', 'ward_number']
  clean_data = data[cols_to_keep].copy()

  clean_data = pd.get_dummies(clean_data, columns=['subject', 'reason', 'department', 'source', 'ward_number'])
  clean_data_survival_mask = clean_data["survival_time"].notnull()
  clean_data_survival = clean_data[clean_data_survival_mask].copy()
  clean_data_survival['survival_time_hours'] = clean_data_survival['survival_time'].apply(lambda x: x.total_seconds()/3600)

  X = clean_data_survival.drop(['survival_time_hours', 'survival_time', 'event'], axis=1) 
  y = clean_data_survival['survival_time_hours']
  
  return X, y

Let's test our new functions on the 2022 data

In [None]:
logistic_X, logistic_y = clean_and_split_for_logistic_WRONG(df2022)

In [None]:
linear_X, linear_y = clean_and_split_for_linear_WRONG(df2022)

In [None]:
logistic_X.describe()

Unnamed: 0,case_enquiry_id,subject_Animal Control,subject_Boston Police Department,subject_Boston Water & Sewer Commission,subject_Inspectional Services,subject_Mayor's 24 Hour Hotline,subject_Neighborhood Services,subject_Parks & Recreation Department,subject_Property Management,subject_Public Works Department,...,ward_number_20,ward_number_21,ward_number_22,ward_number_3,ward_number_4,ward_number_5,ward_number_6,ward_number_7,ward_number_8,ward_number_9
count,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0,...,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0
mean,101004400000.0,0.014552,0.002533,0.005309,0.069658,0.038342,8e-05,0.060393,0.010487,0.527564,...,0.047188,0.03983,0.043108,0.099305,0.047835,0.08949,0.055593,0.050534,0.03287,0.032462
std,143681.0,0.119753,0.050267,0.072666,0.25457,0.19202,0.008916,0.238213,0.101868,0.499241,...,0.212041,0.195561,0.203101,0.299072,0.213417,0.285451,0.229135,0.219045,0.178298,0.177224
min,101004100000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,101004200000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,101004400000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,101004500000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,101004600000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [None]:
logistic_y.describe()

count    276723.000000
mean          0.894400
std           0.307326
min           0.000000
25%           1.000000
50%           1.000000
75%           1.000000
max           1.000000
Name: event, dtype: float64

In [None]:
linear_X.describe()

Unnamed: 0,case_enquiry_id,subject_Animal Control,subject_Boston Police Department,subject_Boston Water & Sewer Commission,subject_Inspectional Services,subject_Mayor's 24 Hour Hotline,subject_Neighborhood Services,subject_Parks & Recreation Department,subject_Property Management,subject_Public Works Department,...,ward_number_20,ward_number_21,ward_number_22,ward_number_3,ward_number_4,ward_number_5,ward_number_6,ward_number_7,ward_number_8,ward_number_9
count,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0,...,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0
mean,101004400000.0,0.014036,0.000259,0.002364,0.06585,0.038452,7.3e-05,0.049757,0.007681,0.531489,...,0.043778,0.040654,0.043923,0.09886,0.04585,0.088848,0.056868,0.052266,0.033127,0.033075
std,143590.0,0.117641,0.016079,0.04856,0.248021,0.192286,0.008528,0.217444,0.087303,0.499008,...,0.2046,0.197489,0.204924,0.298475,0.209161,0.284525,0.231592,0.222564,0.178969,0.178832
min,101004100000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,101004200000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,101004400000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,101004500000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,101004600000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [None]:
linear_y.describe()

count    247501.000000
mean        132.223485
std         502.381809
min           0.000833
25%           1.604167
50%          12.028333
75%          50.717222
max        8633.708333
Name: survival_time_hours, dtype: float64

There is an obvious mistake, which I am going to leave here for the sake of the learning process: the case_enquiry_id is being included in the feature sets! This is a large, arbitrary integer, not a meaningful feature, and it very likely hurt our models in the last notebook. Let's copy our functions again below:

In [None]:
def clean_and_split_for_logistic(data):
  # work on a copy so the caller's dataframe is left untouched
  data = data.copy()

  # derive the labels and a normalized ward number
  data['survival_time'] = data['closed_dt'] - data['open_dt']
  data['event'] = data['closed_dt'].notnull().astype(int)
  data['ward_number'] = data['ward'].str.extract(r'0*(\d+)')

  cols_to_keep = ['case_enquiry_id', 'survival_time', 'event', 'subject', 'reason', 'department', 'source', 'ward_number']
  clean_data = data[cols_to_keep].copy()

  # one-hot encode the categorical columns
  clean_data = pd.get_dummies(clean_data, columns=['subject', 'reason', 'department', 'source', 'ward_number'])

  #fix this line to also drop the case_enquiry_id
  X = clean_data.drop(['case_enquiry_id', 'event', 'survival_time'], axis=1)
  y = clean_data['event']

  return X, y

def clean_and_split_for_linear(data):
  # work on a copy so the caller's dataframe is left untouched
  data = data.copy()

  # derive the labels and a normalized ward number
  data['survival_time'] = data['closed_dt'] - data['open_dt']
  data['event'] = data['closed_dt'].notnull().astype(int)
  data['ward_number'] = data['ward'].str.extract(r'0*(\d+)')

  cols_to_keep = ['case_enquiry_id', 'survival_time', 'event', 'subject', 'reason', 'department', 'source', 'ward_number']
  clean_data = data[cols_to_keep].copy()

  # one-hot encode the categorical columns
  clean_data = pd.get_dummies(clean_data, columns=['subject', 'reason', 'department', 'source', 'ward_number'])

  # keep only closed cases and express the duration in hours
  clean_data_survival_mask = clean_data["survival_time"].notnull()
  clean_data_survival = clean_data[clean_data_survival_mask].copy()
  clean_data_survival['survival_time_hours'] = clean_data_survival['survival_time'].dt.total_seconds() / 3600

  #fix this line to also drop the case_enquiry_id
  X = clean_data_survival.drop(['case_enquiry_id', 'survival_time_hours', 'survival_time', 'event'], axis=1)
  y = clean_data_survival['survival_time_hours']

  return X, y
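A quick guard could have caught the identifier column before training. The sketch below flags any feature column in which every row has a distinct value. The helper name `identifier_like_columns` and the toy dataframe are assumptions for illustration. The heuristic is deliberately crude: a continuous feature with no repeated values would be flagged too.

```python
import pandas as pd

def identifier_like_columns(X):
    """Return the names of columns in which every row has a distinct value."""
    n_rows = len(X)
    return [col for col in X.columns if X[col].nunique() == n_rows]

# Hypothetical feature frame that still contains the case id
toy_X = pd.DataFrame({
    'case_enquiry_id': [101004100001, 101004100002, 101004100003, 101004100004],
    'subject_Animal Control': [0, 1, 0, 0],
    'ward_number_3': [1, 0, 0, 1],
})

suspects = identifier_like_columns(toy_X)
print(suspects)
```

Running a check like this on `logistic_X` and `linear_X` right after splitting would make the mistake visible without reading the whole `describe()` table.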

Let's test our functions again:

In [None]:
logistic_X, logistic_y = clean_and_split_for_logistic(df2022)

In [None]:
linear_X, linear_y = clean_and_split_for_linear(df2022)

In [None]:
logistic_X.describe()

Unnamed: 0,subject_Animal Control,subject_Boston Police Department,subject_Boston Water & Sewer Commission,subject_Inspectional Services,subject_Mayor's 24 Hour Hotline,subject_Neighborhood Services,subject_Parks & Recreation Department,subject_Property Management,subject_Public Works Department,subject_Transportation - Traffic Division,...,ward_number_20,ward_number_21,ward_number_22,ward_number_3,ward_number_4,ward_number_5,ward_number_6,ward_number_7,ward_number_8,ward_number_9
count,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0,...,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0,276723.0
mean,0.014552,0.002533,0.005309,0.069658,0.038342,8e-05,0.060393,0.010487,0.527564,0.271083,...,0.047188,0.03983,0.043108,0.099305,0.047835,0.08949,0.055593,0.050534,0.03287,0.032462
std,0.119753,0.050267,0.072666,0.25457,0.19202,0.008916,0.238213,0.101868,0.499241,0.44452,...,0.212041,0.195561,0.203101,0.299072,0.213417,0.285451,0.229135,0.219045,0.178298,0.177224
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [None]:
logistic_y.describe()

count    276723.000000
mean          0.894400
std           0.307326
min           0.000000
25%           1.000000
50%           1.000000
75%           1.000000
max           1.000000
Name: event, dtype: float64

In [None]:
linear_X.describe()

Unnamed: 0,subject_Animal Control,subject_Boston Police Department,subject_Boston Water & Sewer Commission,subject_Inspectional Services,subject_Mayor's 24 Hour Hotline,subject_Neighborhood Services,subject_Parks & Recreation Department,subject_Property Management,subject_Public Works Department,subject_Transportation - Traffic Division,...,ward_number_20,ward_number_21,ward_number_22,ward_number_3,ward_number_4,ward_number_5,ward_number_6,ward_number_7,ward_number_8,ward_number_9
count,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0,...,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0,247501.0
mean,0.014036,0.000259,0.002364,0.06585,0.038452,7.3e-05,0.049757,0.007681,0.531489,0.290039,...,0.043778,0.040654,0.043923,0.09886,0.04585,0.088848,0.056868,0.052266,0.033127,0.033075
std,0.117641,0.016079,0.04856,0.248021,0.192286,0.008528,0.217444,0.087303,0.499008,0.453781,...,0.2046,0.197489,0.204924,0.298475,0.209161,0.284525,0.231592,0.222564,0.178969,0.178832
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [None]:
linear_y.describe()

count    247501.000000
mean        132.223485
std         502.381809
min           0.000833
25%           1.604167
50%          12.028333
75%          50.717222
max        8633.708333
Name: survival_time_hours, dtype: float64

That looks better. Let's try training our models to see if we get improved results. First we will train logistic regression, then linear regression:

In [None]:
#Train a logistic regression model

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(logistic_X, logistic_y, test_size=0.2, random_state=42)

# Build model
model_logistic = keras.Sequential([
    keras.layers.Dense(units=1, input_shape=(X_train.shape[1],), activation='sigmoid')
])

# Compile model
model_logistic.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train model
model_logistic.fit(X_train, y_train, epochs=10, batch_size=32)

# Evaluate model
test_loss, test_acc = model_logistic.evaluate(X_test, y_test)

print('Test accuracy:', test_acc)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.9145541787147522


The loss has gone down dramatically from the last notebook, but the accuracy has not improved by much. A likely explanation for the loss is that the unscaled case ID, a value on the order of 10^11, dominated the input to the sigmoid. Since the case ID column was likely uncorrelated with the label, its removal may not have had a significant impact on the prediction accuracy of our model. The other 99 features may have been sufficient to predict the label.
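One way to put the test accuracy in context is to compare it with a majority-class baseline. `logistic_y.describe()` above reports a mean of 0.8944, so a model that always predicts "closed" would already be right about 89.4% of the time on the full label set (the rate in the test split may differ slightly). The helper name `majority_class_accuracy` and the toy labels below are illustrative assumptions.

```python
import pandas as pd

def majority_class_accuracy(y):
    """Accuracy of a model that always predicts the most common 0/1 label."""
    positive_rate = y.mean()
    return max(positive_rate, 1 - positive_rate)

# Hypothetical labels: nine closed cases and one open case
toy_y = pd.Series([1] * 9 + [0])
baseline = majority_class_accuracy(toy_y)
print('majority-class baseline:', baseline)

# In the notebook this could be called as majority_class_accuracy(y_test)
# and compared against test_acc.
```

Against that baseline, the reported test accuracy of about 0.915 is roughly two percentage points better than always guessing "closed".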

Now let's train our linear regression model:

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(linear_X) # scale the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, linear_y, test_size=0.2, random_state=42)

# define the model architecture
model_linear = keras.Sequential([
    keras.layers.Dense(units=1, input_dim=X_train.shape[1])
])

# compile the model
model_linear.compile(optimizer='adam', loss='mean_squared_error')

# train the model
model_linear.fit(X_train, y_train, epochs=50, batch_size=32, verbose=1)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x7f7d0bdfb670>

For our linear regression model, the loss dropped quickly, but it did not reach the same level as in the last notebook. This could be random variation, or it could be that the sequential case ID actually had a little bit of correlation with the time it takes to close a case.
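The cell above only reports the training loss, so comparisons between notebooks are easier if we also measure error on the held-out split. The sketch below computes three common error measures with NumPy. The helper name `regression_errors` and the toy numbers are assumptions, and the commented-out call shows how it might be applied to `model_linear`.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MSE, RMSE and MAE for two equally long sequences of values."""
    y_true = np.asarray(y_true, dtype=float)
    # Keras returns predictions with shape (n, 1), so flatten them first
    y_pred = np.asarray(y_pred, dtype=float).reshape(-1)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    return {
        'mse': mse,
        'rmse': float(np.sqrt(mse)),
        'mae': float(np.mean(np.abs(residuals))),
    }

# Toy values in hours, with predictions shaped like Keras output
errors = regression_errors([10.0, 20.0, 30.0], [[12.0], [18.0], [33.0]])
print(errors)

# In the notebook this could be called as:
# regression_errors(y_test, model_linear.predict(X_test))
```

RMSE and MAE are in hours, which makes them easier to read than the squared-hours loss that Keras prints during training.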

In the next notebook, we will work more on data cleaning. Outliers in our data may be producing noise: in `linear_y.describe()` the median time to close is about 12 hours, while the maximum is over 8,600 hours.
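As a first look at how heavy the tail is, one option is Tukey's fence (the third quartile plus 1.5 times the interquartile range). This is a common rule of thumb, not necessarily the approach the next notebook will take. Applied to the quartiles printed above, it gives a fence of about 124 hours. The helper name `upper_fence` and the toy series are illustrative assumptions.

```python
import pandas as pd

def upper_fence(values, k=1.5):
    """Tukey's upper fence: Q3 + k * (Q3 - Q1)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)

# Hypothetical closing times in hours, with one extreme case
toy_hours = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1000.0])

fence = upper_fence(toy_hours)
share_above = (toy_hours > fence).mean()
print('upper fence:', fence)
print('share of cases above the fence:', share_above)

# In the notebook this could be called as upper_fence(linear_y)
```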