# Boston 311 v4 - List all to-dos and questions, and finally train our models on all the data

The last three notebooks provided interesting insights into our problem, but the fun of Machine Learning is running models on big data, right? So let's list all of our to-dos, and then put them aside for a moment to see how our data cleaning functions and models do on the combined 311 data from 2011-2023.

## Questions and To-Dos from v1:

1. Train the models on all the historical 311 data
2. Add more features
3. Clean up the data by removing outliers
4. Deal with the feature columns that differ between the 2022 and 2023 data: some categorical feature values are missing from one year or the other, resulting in one-hot encoded column mismatches.
5. Develop some heuristics to see if our Machine Learning model can actually do better than some obvious correlations.
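
To make item 4 concrete, here is a minimal sketch of the one-hot mismatch and one possible way to align the columns. The two tiny frames and their `source` values are illustrative stand-ins, not rows from the real files, and `reindex` is only one option, not necessarily the approach this project ends up using.

```python
import pandas as pd

# Toy stand-ins for two yearly files whose categorical values differ
df_2022 = pd.DataFrame({"source": ["Constituent Call", "Citizens Connect App"]})
df_2023 = pd.DataFrame({"source": ["Constituent Call", "City Worker App"]})

X_2022 = pd.get_dummies(df_2022, columns=["source"])
X_2023 = pd.get_dummies(df_2023, columns=["source"])

# Encoded separately, the two frames disagree on their columns
mismatched = set(X_2022.columns) ^ set(X_2023.columns)
print(sorted(mismatched))

# Force the 2023 frame onto the 2022 columns: categories absent in 2023
# are filled with 0, and columns unknown to 2022 are dropped
X_2023_aligned = X_2023.reindex(columns=X_2022.columns, fill_value=0)
print(list(X_2023_aligned.columns))
```

Concatenating all years before calling `pd.get_dummies` avoids the mismatch altogether.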

Questions to answer:
1. Can we find some basic commonality between open cases?
2. When and how is the target date set? How about the overdue flag?
3. Do cases autoclose after a certain time?
4. Do cases carry over from year to year? If so, do they keep the same case_enquiry_id? (probably they do, but it would be good to confirm)

## Questions and To-Dos from v3:

To-Dos:

1. Look at the currently available Android app to see which values the user can select, and which categories might be assigned by the 311 agents after receiving a new case.
2. Compare a basic model that uses only the department value as a feature to our more complex models, as a heuristic for whether additional features actually improve predictions.

We have done a bit of work on some of these. v1.TD3 was addressed a bit in v3. v1.TD4 will be dealt with by combining all our data together. v1.TD5 is partially addressed by v3.TD2, where we hypothesize that a model using only the department value as a feature might be a good heuristic for whether other features add noise or useful predictive signal.
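
As a rough illustration of the v3.TD2 idea, here is a minimal sketch of a department-only baseline that predicts each case's resolution time as the mean for its department. The toy rows, the hour values, and the helper name `department_baseline` are invented for illustration; only the `department` column name comes from the 311 data.

```python
import pandas as pd

def department_baseline(train, test, target="survival_time_hours"):
    """Predict the target for each test row as the training mean of its department."""
    dept_means = train.groupby("department")[target].mean()
    overall_mean = train[target].mean()
    # Departments never seen in training fall back to the overall training mean
    return test["department"].map(dept_means).fillna(overall_mean)

# Toy data: department codes and hours are placeholders
train = pd.DataFrame({
    "department": ["PWDx", "PWDx", "BTDT", "BTDT"],
    "survival_time_hours": [10.0, 30.0, 100.0, 200.0],
})
test = pd.DataFrame({
    "department": ["PWDx", "BTDT", "INFO"],
    "survival_time_hours": [25.0, 140.0, 90.0],
})

baseline_pred = department_baseline(train, test)
baseline_mse = ((test["survival_time_hours"] - baseline_pred) ** 2).mean()
print(baseline_pred.tolist(), baseline_mse)
```

A more complex model is only earning its extra features if its error on held-out data is clearly lower than this kind of baseline.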

For now we are going to address v1.TD1 and combine all the data.

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import tensorflow as tf
import glob
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler
from datetime import datetime

from IPython.display import display

%matplotlib inline

#first of course we must import the necessary modules

In [21]:
def clean_and_split_for_logistic(data) :
  data['survival_time'] = data['closed_dt'] - data['open_dt']
  data['event'] = data['closed_dt'].notnull().astype(int)
  data['ward_number'] = data['ward'].str.extract(r'0*(\d+)')

  cols_to_keep = ['case_enquiry_id', 'survival_time', 'event', 'subject', 'reason', 'department', 'source', 'ward_number']
  clean_data = data[cols_to_keep].copy()

  clean_data = pd.get_dummies(clean_data, columns=['subject', 'reason', 'department', 'source', 'ward_number'])

  # drop the ID and the target columns so that only features remain in X
  X = clean_data.drop(['case_enquiry_id','event', 'survival_time'], axis=1)
  y = clean_data['event']

  return X, y

def clean_and_split_for_linear(data) :
  data['survival_time'] = data['closed_dt'] - data['open_dt']
  data['event'] = data['closed_dt'].notnull().astype(int)
  data['ward_number'] = data['ward'].str.extract(r'0*(\d+)')

  cols_to_keep = ['case_enquiry_id', 'survival_time', 'event', 'subject', 'reason', 'department', 'source', 'ward_number']
  clean_data = data[cols_to_keep].copy()

  clean_data = pd.get_dummies(clean_data, columns=['subject', 'reason', 'department', 'source', 'ward_number'])
  clean_data_survival_mask = clean_data["survival_time"].notnull()
  clean_data_survival = clean_data[clean_data_survival_mask].copy()
  clean_data_survival['survival_time_hours'] = clean_data_survival['survival_time'].dt.total_seconds() / 3600

  # drop the ID and the target columns so that only features remain in X
  X = clean_data_survival.drop(['case_enquiry_id','survival_time_hours', 'survival_time', 'event'], axis=1)
  y = clean_data_survival['survival_time_hours']
  
  return X, y

We imported our libraries and defined our data cleaning functions. Now let's load all the data and see whether we have duplicated case_enquiry_id values and how we will deal with them. We would probably want to drop the earlier records and keep the later ones. It will be interesting to see if any cases remained open long enough to be included in more than two of the year-based datasets. I loaded these files into Google Colaboratory by downloading them from data.boston.gov and uploading them manually. Here is the link to all the data sets:

https://data.boston.gov/dataset/311-service-requests
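
If duplicated IDs do turn up, one way to keep only the later record could look like the sketch below. The two toy frames stand in for yearly files, and the `source_file` column is a hypothetical helper added here to remember which file a row came from; it is not part of the 311 data.

```python
import pandas as pd

# Toy frames standing in for two yearly files; the real files have many more columns
df_2021 = pd.DataFrame({
    "case_enquiry_id": [1, 2],
    "open_dt": ["2021-03-01", "2021-12-30"],
    "closed_dt": ["2021-03-02", None],
    "source_file": ["2021.csv", "2021.csv"],  # hypothetical helper column
})
df_2022 = pd.DataFrame({
    "case_enquiry_id": [2, 3],
    "open_dt": ["2021-12-30", "2022-01-05"],
    "closed_dt": ["2022-01-03", "2022-01-06"],
    "source_file": ["2022.csv", "2022.csv"],
})

combined = pd.concat([df_2021, df_2022], ignore_index=True)

# Sort so the record from the later file comes last, then keep that one per ID
deduped = (
    combined.sort_values(["case_enquiry_id", "source_file"])
    .drop_duplicates(subset="case_enquiry_id", keep="last")
    .reset_index(drop=True)
)
print(deduped)
```

In this toy example case 2 appears in both files, and only its 2022 record, which carries the closed date, survives.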

In [7]:

# Get a list of all CSV files in the directory
all_files = glob.glob("*.csv")

# Create an empty list to store the dataframes
dfs = []

# Loop through the files and load them into dataframes
for file in all_files:
  df = pd.read_csv(file)
  dfs.append(df)


In [8]:
#check that the files all have the same number of columns, and the same names
for i in range(len(dfs)):
  if dfs[i].shape[1] != dfs[0].shape[1]:
    print('Error: File', i, 'does not have the same number of columns as File 0')
  else:
    print('File', i, 'has same number of columns as File 0')
  if not dfs[i].columns.equals(dfs[0].columns):
    print('Error: File', i, 'does not have the same column names and order as File 0')
  else:
    print('File', i, 'has the same column name and order as File 0')

File 0 has same number of columns as File 0
File 0 has the same column name and order as File 0
File 1 has same number of columns as File 0
File 1 has the same column name and order as File 0
File 2 has same number of columns as File 0
File 2 has the same column name and order as File 0
File 3 has same number of columns as File 0
File 3 has the same column name and order as File 0
File 4 has same number of columns as File 0
File 4 has the same column name and order as File 0
File 5 has same number of columns as File 0
File 5 has the same column name and order as File 0
File 6 has same number of columns as File 0
File 6 has the same column name and order as File 0
File 7 has same number of columns as File 0
File 7 has the same column name and order as File 0
File 8 has same number of columns as File 0
File 8 has the same column name and order as File 0
File 9 has same number of columns as File 0
File 9 has the same column name and order as File 0
File 10 has same number of columns as Fi

In [None]:
# Concatenate the dataframes into a single dataframe
df_all = pd.concat(dfs, ignore_index=True)

In [11]:
id_counts = df_all['case_enquiry_id'].value_counts()
id_filter = df_all['case_enquiry_id'].isin(id_counts[id_counts > 1].index)
display(df_all[id_filter])

Unnamed: 0,case_enquiry_id,open_dt,target_dt,closed_dt,ontime,case_status,closure_reason,case_title,subject,reason,...,police_district,neighborhood,neighborhood_services_district,ward,precinct,location_street_name,location_zipcode,latitude,longitude,source


Interestingly, although we expected the data might have duplicate 'case_enquiry_id' values because of cases that spanned multiple years, the filter returns no rows. It looks like the data from past years may have been generated from a central database. Is it continually updated? We can answer that question by checking whether any cases have open and closed dates in different years.

In [13]:
# Convert the 'open_dt' and 'closed_dt' columns to datetime
df_all['open_dt'] = pd.to_datetime(df_all['open_dt'])
df_all['closed_dt'] = pd.to_datetime(df_all['closed_dt'])

# Create a new column to hold the year of each open date
df_all['open_year'] = df_all['open_dt'].dt.year

# Filter the DataFrame to only include cases where the closed_dt year is greater than the open_dt year
df_across_years = df_all[(df_all['closed_dt'].dt.year > df_all['open_dt'].dt.year)]


In [16]:
df_across_years.describe()

Unnamed: 0,case_enquiry_id,location_zipcode,latitude,longitude,open_year
count,40245.0,32455.0,40244.0,40244.0,40245.0
mean,101001000000.0,2126.978678,42.328398,-71.082523,2013.281153
std,641496.7,17.592728,0.034287,0.035802,2.010371
min,101000300000.0,2108.0,42.2327,-71.181,2011.0
25%,101000500000.0,2119.0,42.2986,-71.1027,2012.0
50%,101000900000.0,2126.0,42.3364,-71.0734,2013.0
75%,101001200000.0,2130.0,42.3579,-71.0587,2014.0
max,101004600000.0,2467.0,42.3933,-70.9963,2022.0


We indeed have many records that were opened in one year and closed in another. Looking at the individual data sets, we can see that each data set contains cases opened in that year, but the closed date might be in the following year. This is a helpful aspect of our data. That means we are ready to clean our data and train our models.

In [22]:
logistic_X, logistic_y = clean_and_split_for_logistic(df_all)

In [29]:
linear_X, linear_y = clean_and_split_for_linear(df_all)

In [24]:
#Train a logistic regression model

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(logistic_X, logistic_y, test_size=0.2, random_state=42)

# Build model
model_logistic = keras.Sequential([
    keras.layers.Dense(units=1, input_shape=(X_train.shape[1],), activation='sigmoid')
])

# Compile model
model_logistic.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train model
model_logistic.fit(X_train, y_train, epochs=10, batch_size=32)

# Evaluate model
test_loss, test_acc = model_logistic.evaluate(X_test, y_test)

print('Test accuracy:', test_acc)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.9402193427085876


Our model has better accuracy than when we only used the 2022 data. Unfortunately, we did not set aside a validation data set, so we can't watch for overfitting during training or tell whether the model is actually better. We should add a validation set in our next notebook.
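
One quick sanity check for an accuracy figure like this is to compare it with the accuracy of always predicting the most common class. The sketch below uses invented toy labels, and the helper name `majority_class_accuracy` is made up for illustration.

```python
import numpy as np

def majority_class_accuracy(y):
    """Accuracy obtained by always predicting the most frequent label in y."""
    y = np.asarray(y)
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / len(y)

# Toy labels: 1 = closed, 0 = still open (the proportions are invented)
y_toy = np.array([1] * 9 + [0] * 1)
baseline_acc = majority_class_accuracy(y_toy)
print(baseline_acc)
```

With the notebook's variables this would be `majority_class_accuracy(y_test)`. If that number is close to the test accuracy, the model has learned little beyond the class balance.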

For now, let's just add a validation set to our linear regression model, since that will take more time to train, and is a more important model. 

Let's also add code to record the start and end times of our training and tell us how long the training took.

In [34]:
#Train a linear regression model

start_time = datetime.now()
print("Starting Training at {}".format(start_time))

scaler = StandardScaler()
X_scaled = scaler.fit_transform(linear_X) # scale the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, linear_y, test_size=0.2, random_state=42)

# split the data again to create a validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# define the model architecture
model_linear = keras.Sequential([
    keras.layers.Dense(units=1, input_dim=X_train.shape[1])
])

# compile the model
model_linear.compile(optimizer='adam', loss='mean_squared_error')

# train the model
model_linear.fit(X_train, y_train, epochs=50, batch_size=32, verbose=1, validation_data=(X_val, y_val))

end_time = datetime.now()
total_time = (end_time - start_time)
print("Ending Training at {}".format(end_time))
print("Training took {}".format(total_time))

Starting Training at 2023-05-04 18:16:48.813984
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
Ending Training at 2023-05-04 19:08:22.215391
Training took 0:51:33.401407
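
One thing to revisit in a later notebook: the cell above fits the `StandardScaler` on all rows before splitting, so the validation and test rows influence the scaling statistics. A minimal sketch of the alternative order, using synthetic arrays in place of `linear_X` and `linear_y`, might look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and targets standing in for linear_X and linear_y
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.normal(size=100)

# Split first, so the held-out rows never influence the scaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from training rows only
X_val_scaled = scaler.transform(X_val)          # reuse the training statistics
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_val_scaled.shape, X_test_scaled.shape)
```

Keeping the fitted `scaler` around also matters later, because any new case has to be scaled with the same statistics before it is passed to the model.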


In [36]:
model_linear.save("model_linear.h5")

In [37]:
model_logistic.save("model_logistic.h5")

We save these models so that we can start building a website and API to make them available to the public.