#Boston 311 v5 - Changing Logistic Regression Labels

In our previous notebook we tried updating our linear regression model to only use data from cases that were closed within a certain amount of time. This led to a model with a much higher loss on the validation set than on the training set, in contrast to our previous model, whose validation loss was lower than its training loss.

Another method we could use with the linear regression model is to assume that all open cases might be closed tomorrow, and assign them a time to close based on that.
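
As a rough illustration of that idea, here is a minimal sketch on a toy table. The helper name `impute_open_cases`, the `as_of` date, and the `imputed` column are assumptions made for this example and are not part of the notebook's pipeline; only the `open_dt` and `closed_dt` column names come from the 311 data.

```python
import pandas as pd

def impute_open_cases(df, as_of):
    """Hypothetical helper: give open cases a closing time one day after `as_of`."""
    out = df.copy()
    out["open_dt"] = pd.to_datetime(out["open_dt"])
    out["closed_dt"] = pd.to_datetime(out["closed_dt"])

    # Assumption: every open case closes "tomorrow" relative to the as_of date.
    assumed_close = pd.Timestamp(as_of) + pd.Timedelta(days=1)

    # Remember which rows were imputed so they can be inspected or excluded later.
    out["imputed"] = out["closed_dt"].isna()
    out["closed_dt"] = out["closed_dt"].fillna(assumed_close)

    out["survival_time_hours"] = (
        (out["closed_dt"] - out["open_dt"]).dt.total_seconds() / 3600
    )
    return out

# Toy records (made-up values): one closed case and one still-open case.
toy = pd.DataFrame({
    "case_enquiry_id": [1, 2],
    "open_dt": ["2023-05-01 08:00:00", "2023-05-08 12:00:00"],
    "closed_dt": ["2023-05-01 20:00:00", None],
})
result = impute_open_cases(toy, as_of="2023-05-09")
print(result[["case_enquiry_id", "survival_time_hours", "imputed"]])
```

Keeping the `imputed` flag makes it possible to compare a model trained with and without the assumed closing times.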

For the logistic regression model, we have two important factors:
1. We need to make sure we are not training or validating our model on the cases we want to predict, i.e. the recently opened cases that still have a chance of being closed within the near future. 
2. We can try predicting whether a case will be closed within a certain amount of time, rather than whether it is closed at any point. In other words, we can apply the time limit we used for the linear regression model in the last notebook to our logistic regression model.
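
The two points above can be sketched on a toy table as follows. This is only an illustration under stated assumptions: the helper name `label_closed_within`, the `as_of` date, and the toy rows are hypothetical, and the 31-day window is just one possible choice of time limit.

```python
import pandas as pd

def label_closed_within(df, window, as_of):
    """Hypothetical helper: label cases closed within `window`, and set aside
    open cases that are too recent to have a settled label."""
    out = df.copy()
    out["open_dt"] = pd.to_datetime(out["open_dt"])
    out["closed_dt"] = pd.to_datetime(out["closed_dt"])
    survival = out["closed_dt"] - out["open_dt"]

    # 1 only if the case was closed within the window. Comparisons against a
    # missing closing time evaluate to False, so open cases get 0 here.
    out["event"] = (survival <= window).astype(int)

    # Open cases younger than the window could still close in time, so they
    # are the cases we want to predict rather than train or validate on.
    cutoff = pd.Timestamp(as_of) - window
    undecided = out["closed_dt"].isna() & (out["open_dt"] >= cutoff)
    return out[~undecided], out[undecided]

toy = pd.DataFrame({
    "case_enquiry_id": [1, 2, 3, 4],
    "open_dt": ["2023-01-01", "2023-01-01", "2023-02-01", "2023-05-01"],
    "closed_dt": ["2023-01-05", "2023-03-15", None, None],
})
train, to_predict = label_closed_within(
    toy, window=pd.Timedelta(days=31), as_of="2023-05-09"
)
print(train[["case_enquiry_id", "event"]])
print(to_predict["case_enquiry_id"].tolist())
```

In this toy data, case 3 has been open for longer than the window, so it stays in the training data with a label of 0, while case 4 is held out.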

Lastly, the last notebook crashed a few times due to RAM requirements. We can reshape our data cleaning routines to conserve memory by using fewer data copying commands this time.
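
One further way to reduce memory, beyond making fewer copies, is to load only the columns the cleaning functions use and to store repeated text values as categories. The sketch below is an assumption about how that could look, not what this notebook does: `load_slim`, `NEEDED_COLS`, and `CATEGORICAL_COLS` are hypothetical names, and the toy CSV stands in for one of the yearly files.

```python
import io
import pandas as pd

# Columns the cleaning functions in this notebook read; everything else is skipped.
NEEDED_COLS = ["case_enquiry_id", "open_dt", "closed_dt", "subject",
               "reason", "department", "source", "ward"]
CATEGORICAL_COLS = ["subject", "reason", "department", "source", "ward"]

def load_slim(source):
    """Hypothetical helper: read one year of records, keeping only NEEDED_COLS."""
    return pd.read_csv(
        source,
        usecols=NEEDED_COLS,
        # Categories store each distinct string once instead of once per row.
        dtype={col: "category" for col in CATEGORICAL_COLS},
        parse_dates=["open_dt", "closed_dt"],
    )

# Toy CSV standing in for one of the yearly files (values are made up).
toy_csv = io.StringIO(
    "case_enquiry_id,open_dt,closed_dt,subject,reason,department,source,ward,latitude\n"
    "1,2023-01-01 08:00:00,2023-01-02 09:30:00,Public Works Department,"
    "Street Cleaning,PWDx,Citizens Connect App,Ward 3,42.35\n"
    "2,2023-01-03 10:00:00,,Transportation - Traffic Division,"
    "Signs & Signals,BTDT,Constituent Call,03,42.36\n"
)
slim = load_slim(toy_csv)
print(slim.dtypes)
```

One caveat: when yearly files contain different sets of category values, `pd.concat` may fall back to plain object columns, so the categorical columns might need to be re-cast after concatenation.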

Below are our open questions and to-dos consolidated from the last notebook. Moving forward we will probably keep this list at the top of each notebook.

##Questions and To-Dos from v4:

2. Add more features
3. Clean up the data by removing outliers
6. Look at the currently available Android app and see which values are available to the user to select, and which categories might be assigned by the 311 agents after receiving a new case.
7. Compare a basic model which only uses the department value as a feature to our more complex models, as a heuristic for whether additional features actually improve predictions.
8. Moving forward, compare our model predictions with the target date assigned by 311 to see which performs better.

Questions to answer:
1. Can we find some basic commonality between open cases?
2. When and how is the target date set? How about the overdue flag?
3. Do cases autoclose after a certain time?


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import tensorflow as tf
import glob
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler
from datetime import datetime

from IPython.display import display

%matplotlib inline

#first of course we must import the necessary modules

Let's add some code to load the records directly from their storage so we don't have to upload them to Google Colaboratory each time. So far the URLs appear to be constant.

In [None]:
url_2023 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/e6013a93-1321-4f2a-bf91-8d8a02f1e62f/download/tmp9g_820k8.csv"
url_2022 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/81a7b022-f8fc-4da5-80e4-b160058ca207/download/tmph4izx_fb.csv"
url_2021 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/f53ebccd-bc61-49f9-83db-625f209c95f5/download/tmppgq9965_.csv"
url_2020 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/6ff6a6fd-3141-4440-a880-6f60a37fe789/download/script_105774672_20210108153400_combine.csv"
url_2019 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/ea2e4696-4a2d-429c-9807-d02eb92e0222/download/311_service_requests_2019.csv"
url_2018 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/2be28d90-3a90-4af1-a3f6-f28c1e25880a/download/311_service_requests_2018.csv"
url_2017 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/30022137-709d-465e-baae-ca155b51927d/download/311_service_requests_2017.csv"
url_2016 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/b7ea6b1b-3ca4-4c5b-9713-6dc1db52379a/download/311_service_requests_2016.csv"
url_2015 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/c9509ab4-6f6d-4b97-979a-0cf2a10c922b/download/311_service_requests_2015.csv"
url_2014 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/bdae89c8-d4ce-40e9-a6e1-a5203953a2e0/download/311_service_requests_2014.csv"
url_2013 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/407c5cd0-f764-4a41-adf8-054ff535049e/download/311_service_requests_2013.csv"
url_2012 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/382e10d9-1864-40ba-bef6-4eea3c75463c/download/311_service_requests_2012.csv"
url_2011 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/94b499d9-712a-4d2a-b790-7ceec5c9c4b1/download/311_service_requests_2011.csv"


Let's refactor the code to make fewer copies of the dataframe. Let's also add "scenario" functionality so we can keep the data cleaning procedures we tried in previous notebooks.

In [None]:
def clean_and_split_for_logistic(myData, scenario) :

  data = myData.copy()
  # Convert the 'open_dt' and 'closed_dt' columns to datetime
  data['open_dt'] = pd.to_datetime(data['open_dt'])
  data['closed_dt'] = pd.to_datetime(data['closed_dt'])
  data['survival_time'] = data['closed_dt'] - data['open_dt']
  data['event'] = data['closed_dt'].notnull().astype(int)
  data['ward_number'] = data['ward'].str.extract(r'0*(\d+)')

  

  cols_to_keep = ['case_enquiry_id', 'survival_time', 'event', 'subject', 'reason', 'department', 'source', 'ward_number']
  cols_to_drop = [
 'open_dt',
 'target_dt',
 'closed_dt',
 'ontime',
 'case_status',
 'closure_reason',
 'case_title',
 'type',
 'queue',
 'submittedphoto',
 'closedphoto',
 'location',
 'fire_district',
 'pwd_district',
 'city_council_district',
 'police_district',
 'neighborhood',
 'neighborhood_services_district',
 'ward',
 'precinct',
 'location_street_name',
 'location_zipcode',
 'latitude',
 'longitude']

  #scenarios
  #scenario 1: drop any open cases from the last month, and switch the event value for any cases that took longer than a month to close
  if (scenario == 1) :
    # Convert the date string to a pandas Timestamp object
    cutoff_date = pd.Timestamp('2023-04-09')

    # Filter the DataFrame to include only rows where event is 1 or open_dt is before the cutoff date
    data = data[(data['event'] == 1) | (data['open_dt'] < cutoff_date)]

    #switch the event value for any cases that took longer than a month (31 days) to close
    delta = pd.Timedelta(days=31)
    data.loc[(data['event'] == 1) & (data['survival_time'] > delta), 'event'] = 0


  data = data.drop(cols_to_drop, axis=1)

  data = pd.get_dummies(data, columns=['subject', 'reason', 'department', 'source', 'ward_number'])



  #drop the case identifier and the label-related columns from the features
  X = data.drop(['case_enquiry_id','event', 'survival_time'], axis=1)
  y = data['event']

  return X, y

def clean_and_split_for_linear(myData, scenario) :

  data = myData.copy()
  # Convert the 'open_dt' and 'closed_dt' columns to datetime
  data['open_dt'] = pd.to_datetime(data['open_dt'])
  data['closed_dt'] = pd.to_datetime(data['closed_dt'])
  data['survival_time'] = data['closed_dt'] - data['open_dt']
  data['event'] = data['closed_dt'].notnull().astype(int)
  data['ward_number'] = data['ward'].str.extract(r'0*(\d+)')

  cols_to_keep = ['case_enquiry_id', 'survival_time', 'event', 'subject', 'reason', 'department', 'source', 'ward_number']
  cols_to_drop = [
 'open_dt',
 'target_dt',
 'closed_dt',
 'ontime',
 'case_status',
 'closure_reason',
 'case_title',
 'type',
 'queue',
 'submittedphoto',
 'closedphoto',
 'location',
 'fire_district',
 'pwd_district',
 'city_council_district',
 'police_district',
 'neighborhood',
 'neighborhood_services_district',
 'ward',
 'precinct',
 'location_street_name',
 'location_zipcode',
 'latitude',
 'longitude']

  data = data.drop(cols_to_drop, axis=1)

  data = pd.get_dummies(data, columns=['subject', 'reason', 'department', 'source', 'ward_number'])
  data_survival_mask = data["survival_time"].notnull()
  clean_data = data[data_survival_mask].copy()
  clean_data['survival_time_hours'] = clean_data['survival_time'].dt.total_seconds() / 3600

  #add scenarios

  #scenario 1: remove records if the case took more than a month to close or time to close is negative
  if (scenario == 1) :
    clean_data = clean_data[(clean_data['survival_time_hours'] >= 0) & (clean_data['survival_time_hours'] <= 744)]
  
  #scenario 2: remove records just if the time to close is negative 
  if (scenario == 2) :
    clean_data = clean_data[(clean_data['survival_time_hours'] >= 0)]

  #drop the case identifier and the target-related columns from the features
  X = clean_data.drop(['case_enquiry_id','survival_time_hours', 'survival_time', 'event'], axis=1) 
  y = clean_data['survival_time_hours']
  
  return X, y

Here is the link to all the data sets:

https://data.boston.gov/dataset/311-service-requests

In [None]:

# Build a list of the CSV file URLs defined above
all_files = [url_2023, url_2022, url_2021, url_2020, url_2019, url_2018, url_2017, url_2016, url_2015, url_2014, url_2013, url_2012, url_2011]

# Create an empty list to store the dataframes
dfs = []

# Loop through the files and load them into dataframes
for file in all_files:
  df = pd.read_csv(file)
  dfs.append(df)


In [None]:
#check that the files all have the same number of columns, and the same names
for i in range(len(dfs)):
  if dfs[i].shape[1] != dfs[0].shape[1]:
    print('Error: File', i, 'does not have the same number of columns as File 0')
  else:
    print('File', i, 'has same number of columns as File 0')
  if not dfs[i].columns.equals(dfs[0].columns):
    print('Error: File', i, 'does not have the same column names and order as File 0')
  else:
    print('File', i, 'has the same column name and order as File 0')

File 0 has same number of columns as File 0
File 0 has the same column name and order as File 0
File 1 has same number of columns as File 0
File 1 has the same column name and order as File 0
File 2 has same number of columns as File 0
File 2 has the same column name and order as File 0
File 3 has same number of columns as File 0
File 3 has the same column name and order as File 0
File 4 has same number of columns as File 0
File 4 has the same column name and order as File 0
File 5 has same number of columns as File 0
File 5 has the same column name and order as File 0
File 6 has same number of columns as File 0
File 6 has the same column name and order as File 0
File 7 has same number of columns as File 0
File 7 has the same column name and order as File 0
File 8 has same number of columns as File 0
File 8 has the same column name and order as File 0
File 9 has same number of columns as File 0
File 9 has the same column name and order as File 0
File 10 has same number of columns as Fi

In [None]:
# Concatenate the dataframes into a single dataframe
df_all = pd.concat(dfs, ignore_index=True)

In [None]:
#save ram by deleting the dfs variable
del dfs

In [None]:
df_all.columns.tolist()

['case_enquiry_id',
 'open_dt',
 'target_dt',
 'closed_dt',
 'ontime',
 'case_status',
 'closure_reason',
 'case_title',
 'subject',
 'reason',
 'type',
 'queue',
 'department',
 'submittedphoto',
 'closedphoto',
 'location',
 'fire_district',
 'pwd_district',
 'city_council_district',
 'police_district',
 'neighborhood',
 'neighborhood_services_district',
 'ward',
 'precinct',
 'location_street_name',
 'location_zipcode',
 'latitude',
 'longitude',
 'source']

In [None]:
logistic_X, logistic_y = clean_and_split_for_logistic(df_all, 1)

In [None]:
linear_X, linear_y = clean_and_split_for_linear(df_all, 2)

In [36]:
linear_X.describe()



Unnamed: 0,subject_Animal Control,subject_Boston Police Department,subject_Boston Water & Sewer Commission,subject_CRM Application,subject_Consumer Affairs & Licensing,subject_Disability Department,subject_Inspectional Services,subject_Mayor's 24 Hour Hotline,subject_Neighborhood Services,subject_Parks & Recreation Department,...,ward_number_20,ward_number_21,ward_number_22,ward_number_3,ward_number_4,ward_number_5,ward_number_6,ward_number_7,ward_number_8,ward_number_9
count,2325113.0,2325113.0,2325113.0,2325113.0,2325113.0,2325113.0,2325113.0,2325113.0,2325113.0,2325113.0,...,2325113.0,2325113.0,2325113.0,2325113.0,2325113.0,2325113.0,2325113.0,2325113.0,2325113.0,2325113.0
mean,0.005866812,0.0001621427,0.00397185,4.300866e-07,3.440693e-06,0.0009083429,0.07542085,0.03842265,4.859979e-05,0.05733227,...,0.05820921,0.040306,0.04248224,0.08972897,0.04273599,0.08104767,0.05123837,0.04257384,0.0341149,0.03039981
std,0.07637012,0.0127325,0.06289735,0.0006558099,0.001854908,0.03012504,0.2640693,0.1922144,0.006971187,0.2324765,...,0.2341387,0.196676,0.2016867,0.2857931,0.2022613,0.2729084,0.2204836,0.2018944,0.1815243,0.1716848
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [37]:
linear_y.describe()

count    2.325113e+06
mean     3.958667e+02
std      1.789195e+03
min      0.000000e+00
25%      1.416944e+00
50%      1.710167e+01
75%      1.176658e+02
max      6.989429e+04
Name: survival_time_hours, dtype: float64

In [38]:
#Train a logistic regression model

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(logistic_X, logistic_y, test_size=0.2, random_state=42)

# Build model
model_logistic = keras.Sequential([
    keras.layers.Dense(units=1, input_shape=(X_train.shape[1],), activation='sigmoid')
])

# Compile model
model_logistic.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train model
model_logistic.fit(X_train, y_train, epochs=10, batch_size=32)

# Evaluate model
test_loss, test_acc = model_logistic.evaluate(X_test, y_test)

print('Test accuracy:', test_acc)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.8653748631477356
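
To put a test accuracy like this in context, to-do 7 from the list at the top suggests comparing against a basic model that only uses the department value. One very simple stand-in for such a model is to predict, for each department, whichever label was most common in the training data. The sketch below is an assumption about how that baseline could be computed, not something run in this notebook: the helper name `department_baseline_accuracy` and the toy rows are hypothetical, while `department` and `event` are the column names used above.

```python
import pandas as pd

def department_baseline_accuracy(train, test):
    """Hypothetical baseline: predict the most common label per department."""
    # Majority label (0 or 1) for each department seen in the training data.
    majority = train.groupby("department")["event"].agg(
        lambda s: int(s.mean() >= 0.5)
    )
    # Fallback for departments that never appear in the training data.
    overall = int(train["event"].mean() >= 0.5)

    preds = test["department"].map(majority).fillna(overall).astype(int)
    return float((preds == test["event"]).mean())

# Toy data with made-up labels.
toy_train = pd.DataFrame({
    "department": ["PWDx", "PWDx", "PWDx", "ISD", "ISD", "ISD"],
    "event":      [1,      1,      0,      0,     0,     1],
})
toy_test = pd.DataFrame({
    "department": ["PWDx", "ISD", "PWDx", "BTDT"],
    "event":      [1,      0,     0,      1],
})
baseline_acc = department_baseline_accuracy(toy_train, toy_test)
print("Baseline accuracy:", baseline_acc)
```

If the full model's accuracy is not clearly above a baseline like this on the same split, the additional features may not be contributing much.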


Let's delete the df_all variable to save RAM before we train our linear regression model.

In [39]:
del df_all

In [40]:
#Train a linear regression model

start_time = datetime.now()
print("Starting Training at {}".format(start_time))

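# split first so the scaler is fit on the training data only
X_train, X_test, y_train, y_test = train_test_split(linear_X, linear_y, test_size=0.2, random_state=42)

# split the data again to create a validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# scale the data using statistics from the training set, then apply them to the other sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)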

# define the model architecture
model_linear = keras.Sequential([
    keras.layers.Dense(units=1, input_dim=X_train.shape[1])
])

# compile the model
model_linear.compile(optimizer='adam', loss='mean_squared_error')

# train the model
model_linear.fit(X_train, y_train, epochs=50, batch_size=32, verbose=1, validation_data=(X_val, y_val))

end_time = datetime.now()
total_time = (end_time - start_time)
print("Ending Training at {}".format(end_time))
print("Training took {}".format(total_time))

Starting Training at 2023-05-09 16:59:26.027579
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
Ending Training at 2023-05-09 18:18:17.358963
Training took 1:18:51.331384


In [41]:
model_linear.save("model_linear.h5")

In [42]:
model_logistic.save("model_logistic.h5")

We save these models so that we can start building a website and API to make them available to the public.