#Boston 311 v15 - Refactoring a single class into multiple classes


The Boston311Model class had become unwieldy: it contained a lot of repeated code, and adding new model types was complicated and error-prone. It made sense to create a subclass for each model type. This simplifies the base Boston311Model class and makes it easier to troubleshoot individual model types and to add new ones.
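As a rough illustration of this refactoring pattern, and not the package's actual code, the sketch below shows a base class that owns the shared pipeline while each subclass supplies only its model-specific step. All names here (`BaseModel`, `EventModel`, `SurvivalModel`, `build_target`) are hypothetical.

```python
from abc import ABC, abstractmethod


class BaseModel(ABC):
    """Hypothetical base class: holds the configuration and the shared pipeline."""

    def __init__(self, feature_columns, scenario=None):
        self.feature_columns = list(feature_columns)
        self.scenario = scenario or {}

    def run_pipeline(self, rows):
        # Shared steps live here once instead of being repeated per model type.
        cleaned = [r for r in rows if all(c in r for c in self.feature_columns)]
        return [self.build_target(r) for r in cleaned]

    @abstractmethod
    def build_target(self, row):
        """Each model type defines its own target."""


class EventModel(BaseModel):
    def build_target(self, row):
        # Binary target: 1 if the case has closed, 0 otherwise.
        return 1 if row.get("closed_dt") else 0


class SurvivalModel(BaseModel):
    def build_target(self, row):
        # Duration target: hours the case stayed open (None while still open).
        return row.get("survival_time_hours")


rows = [
    {"type": "Pothole", "queue": "PWD", "closed_dt": "2023-05-01", "survival_time_hours": 30.0},
    {"type": "Tree", "queue": "PARK", "closed_dt": None, "survival_time_hours": None},
    {"queue": "PWD"},  # missing 'type', dropped by the shared cleaning step
]

event_targets = EventModel(["type", "queue"]).run_pipeline(rows)
survival_targets = SurvivalModel(["type", "queue"]).run_pipeline(rows)
print(event_targets, survival_targets)
```

With this layout, adding a new model type means writing one small subclass rather than adding another branch to a shared class.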

##Questions and To-Dos:

Below are our open questions and to-dos, consolidated from the last notebook. Moving forward, we will probably keep this list at the top of each notebook.

2. Add more features.
3. Clean up the data by removing outliers.
6. Look at the currently available Android app to see which values the user can select, and which categories might be assigned by 311 agents after they receive a new case.
7. Compare a basic model that uses only the department value as a feature against our more complex models, as a check on whether additional features actually improve predictions.
8. Moving forward, compare our model predictions with the target date assigned by 311 to see which performs better.

Questions to answer:
1. Can we find some basic commonality between open cases?
2. When and how is the target date set? How about the overdue flag?
3. Do cases autoclose after a certain time?

###Conclusions from Boston311_v9, copied from below:
We now have a problem: our data cleaning functions drop the case_enquiry_id before returning the data. That is good for training, but it means we can't match prediction results back to the original cases.

Our immediate next task for prediction should be to refactor the data cleaning and splitting functions to make predicting cases possible with a particular model.
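One way to approach this, sketched here under assumptions rather than taken from the package, is to set the ID column aside before training and re-attach it to the predictions afterwards. The helper names `split_ids_and_features` and `attach_predictions` are hypothetical, and the approach relies on row order staying unchanged between the split and the prediction.

```python
import pandas as pd


def split_ids_and_features(df, id_column="case_enquiry_id"):
    """Return (ids, features) so the ID never reaches the model but is not lost."""
    ids = df[id_column].reset_index(drop=True)
    features = df.drop(columns=[id_column]).reset_index(drop=True)
    return ids, features


def attach_predictions(ids, predictions, name="event_prediction"):
    """Re-attach predictions to case IDs; relies on row order being unchanged."""
    return pd.DataFrame({ids.name: ids, name: list(predictions)})


cases = pd.DataFrame({
    "case_enquiry_id": [101, 102, 103],
    "type": ["Pothole", "Tree", "Pothole"],
})
ids, features = split_ids_and_features(cases)

# Stand-in for a real model: predicts 1 for potholes, 0 otherwise.
fake_predictions = (features["type"] == "Pothole").astype(int)

result = attach_predictions(ids, fake_predictions)
print(result)
```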

###Conclusions from Boston311_v8:

We got some variable results on these models. Now that we have several scenarios, we might want to come up with ways to compare the performance of these models easily.

We are reaching the limits of Google Colaboratory. When we train our models, we might try deleting the test data after splitting it to save RAM. If we want to keep it, we can write it to a file and download it before deleting it.

Additionally, we want to delete any intermediate data frames created during training before starting the next training run. The best way to do that is to put our data splitting and training inside functions, so that when a function completes, its variables go out of scope and the RAM they used is freed. Anything that needs to be kept can be saved to files.
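The idea of letting function scope release intermediate data can be sketched as follows. This is a minimal illustration with plain lists standing in for data frames; `train_and_summarize` is a hypothetical name, not a function from the package.

```python
import gc


def train_and_summarize(n_rows):
    """Build large intermediates locally and return only a small summary."""
    raw = list(range(n_rows))                 # stand-in for the raw data
    features = [x * 2 for x in raw]           # stand-in for engineered features
    summary = {"rows": len(features), "total": sum(features)}
    # 'raw' and 'features' become unreachable when the function returns.
    return summary


summary = train_and_summarize(100_000)
gc.collect()  # optional: also reclaim any objects caught in reference cycles
print(summary)
```

Only the small `summary` dictionary survives the call; anything larger that must be kept would be written to a file inside the function before it returns.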

##Conclusions from this notebook:

The URLs for the data sets change; at least the 2023 one does.

##Install the package from GitHub using pip

In [None]:
! pip install lifelines

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting lifelines
  Downloading lifelines-0.27.7-py3-none-any.whl (409 kB)
Collecting autograd-gamma>=0.3 (from lifelines)
  Downloading autograd-gamma-0.5.0.tar.gz (4.0 kB)
  Preparing metadata (setup.py) ... done
Collecting formulaic>=0.2.2 (from lifelines)
  Downloading formulaic-0.6.1-py3-none-any.whl (82 kB)
Collecting astor>=0.8 (from formulaic>=0.2.2->lifelines)
  Downloading astor-0.8.1-py2.py3-none-any.whl (27 kB)
Collecting interface-meta>=1.2.0 (from formulaic>=0.2.2->lifelines)
  Downloading interface_meta-1.3.0-py3-none-any.whl (14 kB)
Building wheels for collected packages: autograd-gamma
  Building wheel for autograd-gamma

In [1]:
! pip install git+https://github.com/mindfulcoder49/Boston_311.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/mindfulcoder49/Boston_311.git
  Cloning https://github.com/mindfulcoder49/Boston_311.git to /tmp/pip-req-build-yus_6b1i
  Running command git clone --filter=blob:none --quiet https://github.com/mindfulcoder49/Boston_311.git /tmp/pip-req-build-yus_6b1i
  Resolved https://github.com/mindfulcoder49/Boston_311.git to commit bfd109f9de94b30cb8e4c2871f9e16f8196e0fea
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: boston311
  Building wheel for boston311 (pyproject.toml) ... done
  Created wheel for boston311: filename=boston311-0.0.2-py3-none-any.whl size=16799 sha256=24500e7954952bedb9ff629f7c3bcd5f15839a96765f08510bbfb9187f0d9d57
  Stored in directory: /tmp/pip-ephem-wheel-cache-fvjmqrq6

##Import the Boston311 model subclasses

In [2]:
from boston311 import Boston311LogReg, Boston311EventDecTree, Boston311SurvDecTree

##Define several models

In [3]:
linear_tree_model = Boston311SurvDecTree.Boston311SurvDecTree(train_date_range={'start':'2022-01-01','end':'2023-04-22'},
                            predict_date_range={'start':'2023-04-23','end':'2023-05-23'},
                            feature_columns=['type','queue'],
                            scenario={'dropColumnValues': {'source':['City Worker App', 'Employee Generated']},
                                      'survivalTimeMin':0,
                                      'survivalTimeFill':'2023-05-22'})

In [4]:
logistic_model = Boston311LogReg.Boston311LogReg(train_date_range={'start':'2022-01-01','end':'2023-04-22'},
                            predict_date_range={'start':'2023-04-23','end':'2023-05-23'},
                            feature_columns=['type', 'queue'],
                            scenario={'dropColumnValues': {'source':['City Worker App', 'Employee Generated']},
                                      'survivalTimeMin':86400})

In [5]:
logistic_tree_model = Boston311EventDecTree.Boston311EventDecTree(train_date_range={'start':'2022-01-01','end':'2023-04-22'},
                            predict_date_range={'start':'2023-04-23','end':'2023-05-23'},
                            feature_columns=['type', 'queue'],
                            scenario={'dropColumnValues': {'source':['City Worker App', 'Employee Generated']},
                                      'survivalTimeMin':0})

##Train several models

In [6]:
logistic_tree_model.run_pipeline()

Files with different number of columns from File 0:  []
Files with same number of columns as File 0:  [0, 1]
Files with different column order from File 0:  []
Files with same column order as File 0:  [0, 1]
Starting Training at 2023-05-23 19:50:01.912246
Testing accuracy: 0.9478974807230044
Ending Training at 2023-05-23 19:50:40.820457
Training took 0:00:38.908211


In [7]:
import gc
gc.collect()

23

In [8]:
logistic_tree_prediction = logistic_tree_model.predict()

Files with different number of columns from File 0:  []
Files with same number of columns as File 0:  [0]
Files with different column order from File 0:  []
Files with same column order as File 0:  [0]


In [9]:
logistic_tree_prediction['event_prediction'].value_counts()

1    3091
0    1491
Name: event_prediction, dtype: int64

In [10]:
logistic_tree_prediction[logistic_tree_prediction['event_prediction'] == 0].head()

Unnamed: 0,case_enquiry_id,open_dt,target_dt,closed_dt,ontime,case_status,closure_reason,case_title,subject,reason,...,location_street_name,location_zipcode,latitude,longitude,source,survival_time,event,ward_number,survival_time_hours,event_prediction
10049,101004801993,2023-04-27 09:44:00,2023-04-28 09:44:22,NaT,OVERDUE,Open,,Sidewalk Repair (Make Safe),Public Works Department,Highway Maintenance,...,6 Harlow St,2125.0,42.3188,-71.0718,Constituent Call,NaT,0,13,,0
14240,101004819305,2023-05-10 07:58:00,2023-05-11 08:30:00,NaT,OVERDUE,Open,,Request for Pothole Repair,Public Works Department,Highway Maintenance,...,219 Bellevue St,2132.0,42.2782,-71.1499,Citizens Connect App,NaT,0,20,,0
14246,101004819647,2023-05-10 10:13:00,2023-05-15 10:13:15,NaT,OVERDUE,Open,,Contractor Complaints,Public Works Department,Highway Maintenance,...,43 Sycamore St,2131.0,42.283,-71.1266,Constituent Call,NaT,0,19,,0
14866,101004817927,2023-05-09 10:40:00,2023-05-10 10:40:37,NaT,OVERDUE,Open,,Parking Enforcement,Transportation - Traffic Division,Enforcement & Abandoned Vehicles,...,110 Belgrade Ave,2131.0,42.2861,-71.1349,Citizens Connect App,NaT,0,20,,0
15025,101004796294,2023-04-23 09:03:00,2023-04-25 08:30:00,NaT,OVERDUE,Open,,Request for Pothole Repair,Public Works Department,Highway Maintenance,...,1672-1672R Washington St,2118.0,42.3373,-71.0753,Citizens Connect App,NaT,0,8,,0


In [11]:
logistic_tree_model.save('.','logtree','logtreeproperties')

In [None]:
from google.colab import files
files.download('logtree.pkl')
files.download('logtreeproperties.json')

In [12]:
logistic_model.run_pipeline()

Files with different number of columns from File 0:  []
Files with same number of columns as File 0:  [0, 1]
Files with different column order from File 0:  []
Files with same column order as File 0:  [0, 1]
Starting Training at 2023-05-23 19:51:21.354773
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.8917638659477234
Ending Training at 2023-05-23 19:52:25.927190
Training took 0:01:04.572417


In [13]:
logistic_model.save('.','logreg','logregproperties')

In [None]:
files.download('logreg.h5')
files.download('logregproperties.json')

In [14]:
logistic_prediction = logistic_model.predict()

Files with different number of columns from File 0:  []
Files with same number of columns as File 0:  [0]
Files with different column order from File 0:  []
Files with same column order as File 0:  [0]


In [15]:
logistic_prediction['event_prediction'].value_counts()

0.999999    334
0.994016    294
0.079691    241
0.506344    238
0.780878    231
           ... 
0.082213      1
0.258326      1
0.119026      1
0.988522      1
0.994917      1
Name: event_prediction, Length: 449, dtype: int64

In [16]:
logistic_prediction[logistic_prediction['event_prediction'] < .5].shape[0]

1692
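The logistic model returns probabilities while the decision tree returns 0/1 labels, so the two counts above (1692 cases below 0.5 versus 1491 zeros from the tree) can only be compared case by case after thresholding. The sketch below uses made-up values and hypothetical column names (`tree_label`, `logistic_prob`) to show one way to line the two up; the 0.5 cutoff mirrors the cell above.

```python
import pandas as pd

# Made-up values standing in for the two prediction columns.
preds = pd.DataFrame({
    "case_enquiry_id": [1, 2, 3, 4, 5],
    "tree_label": [1, 0, 1, 0, 1],
    "logistic_prob": [0.99, 0.08, 0.43, 0.51, 0.78],
})

# Convert probabilities to hard labels with the same 0.5 cutoff used above.
preds["logistic_label"] = (preds["logistic_prob"] >= 0.5).astype(int)

# Cross-tabulate to see where the two models agree and disagree.
agreement = pd.crosstab(preds["tree_label"], preds["logistic_label"])
disagreements = preds[preds["tree_label"] != preds["logistic_label"]]

print(agreement)
print(disagreements["case_enquiry_id"].tolist())
```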

In [17]:
linear_tree_model.run_pipeline()

Files with different number of columns from File 0:  []
Files with same number of columns as File 0:  [0, 1]
Files with different column order from File 0:  []
Files with same column order as File 0:  [0, 1]
Starting Training at 2023-05-23 19:53:05.176405
Testing accuracy: 0.7205850216963756
Ending Training at 2023-05-23 19:53:15.953322
Training took 0:00:10.776917


In [18]:
linear_prediction = linear_tree_model.predict()

Files with different number of columns from File 0:  []
Files with same number of columns as File 0:  [0]
Files with different column order from File 0:  []
Files with same column order as File 0:  [0]


In [19]:
linear_prediction.head(20)

Unnamed: 0,case_enquiry_id,open_dt,target_dt,closed_dt,ontime,case_status,closure_reason,case_title,subject,reason,...,location_street_name,location_zipcode,latitude,longitude,source,survival_time,event,ward_number,survival_time_hours,survival_prediction
9423,101004842971,2023-05-19 12:14:00,,NaT,ONTIME,Open,,Mattress Pickup,Public Works Department,Sanitation,...,54 Hamilton St,2136.0,42.234,-71.132,Constituent Call,NaT,0,18,,1-7 days
9583,101004842977,2023-05-19 12:15:00,,NaT,ONTIME,Open,,Mattress Pickup,Public Works Department,Sanitation,...,151 Wrentham St,2124.0,42.2889,-71.0562,Constituent Call,NaT,0,16,,1-7 days
9612,101004842983,2023-05-19 12:18:00,,NaT,ONTIME,Open,,Mattress Pickup,Public Works Department,Sanitation,...,78 Ballou Ave,2124.0,42.2856,-71.0822,Constituent Call,NaT,0,14,,1-7 days
10049,101004801993,2023-04-27 09:44:00,2023-04-28 09:44:22,NaT,OVERDUE,Open,,Sidewalk Repair (Make Safe),Public Works Department,Highway Maintenance,...,6 Harlow St,2125.0,42.3188,-71.0718,Constituent Call,NaT,0,13,,1-12 months
10668,101004844212,2023-05-20 11:19:00,2025-05-09 11:19:40,NaT,ONTIME,Open,,Contractors Complaint,Inspectional Services,Building,...,32 Fayette St,2116.0,42.3486,-71.0677,Constituent Call,NaT,0,5,,1-12 months
11354,101004844217,2023-05-20 11:22:00,,NaT,ONTIME,Open,,Schedule a Bulk Item Pickup,Public Works Department,Sanitation,...,59 Newfield St,2132.0,42.2906,-71.1667,Constituent Call,NaT,0,20,,1-7 days
11355,101004844403,2023-05-20 14:31:00,2023-06-13 08:30:00,NaT,ONTIME,Open,,Request for Recycling Cart,Public Works Department,Recycling,...,23 Cordis St,2129.0,42.3755,-71.0626,Constituent Call,NaT,0,2,,0-24 hours
12628,101004807799,2023-05-01 23:53:27,2024-04-30 23:53:28,NaT,ONTIME,Open,,New Tree Requests,Parks & Recreation Department,Trees,...,15 Morey Rd,2132.0,42.2942,-71.1404,Citizens Connect App,NaT,0,20,,1-12 months
12750,101004815615,2023-05-08 09:16:38,2023-05-15 09:16:39,NaT,OVERDUE,Open,,Equipment Repair,Parks & Recreation Department,Park Maintenance & Safety,...,49R Imrie Rd,2134.0,42.3594,-71.0587,Citizens Connect App,NaT,0,21,,1-7 days
14240,101004819305,2023-05-10 07:58:00,2023-05-11 08:30:00,NaT,OVERDUE,Open,,Request for Pothole Repair,Public Works Department,Highway Maintenance,...,219 Bellevue St,2132.0,42.2782,-71.1499,Citizens Connect App,NaT,0,20,,over a year


In [20]:
linear_prediction.shape[0]

4582

In [21]:
logistic_prediction.shape[0]

4582

In [22]:
logistic_tree_prediction.shape[0]

4582

##Join the tables

In [None]:
merged_df = (
    logistic_tree_prediction
    .merge(logistic_prediction[['case_enquiry_id', 'event_prediction']],
           on='case_enquiry_id', how='outer')
    .merge(linear_prediction[['case_enquiry_id', 'survival_prediction']],
           on='case_enquiry_id', how='outer')
)

In [None]:
merged_df.shape[0]

4583
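The merged table has 4583 rows although each input has 4582. Two possible causes, neither verified here, are case IDs that appear in one table but not in another, or IDs that are repeated within a table. The sketch below uses made-up frames to show how pandas can surface both: `duplicated()` counts repeated IDs, `indicator=True` marks which side each row came from, and `suffixes` replaces the default `_x` / `_y` column names with clearer ones.

```python
import pandas as pd

# Made-up frames: same length, but one ID differs between them.
left = pd.DataFrame({"case_enquiry_id": [1, 2, 3], "event_prediction": [1, 0, 1]})
right = pd.DataFrame({"case_enquiry_id": [1, 2, 4], "event_prediction": [0.9, 0.1, 0.6]})

# Check for repeated IDs before merging.
dup_count = int(left["case_enquiry_id"].duplicated().sum()
                + right["case_enquiry_id"].duplicated().sum())

merged = left.merge(
    right,
    on="case_enquiry_id",
    how="outer",
    suffixes=("_tree", "_logistic"),  # clearer than the default _x / _y
    indicator=True,                   # adds a _merge column: left_only / right_only / both
)

unmatched = merged[merged["_merge"] != "both"]
print(len(merged), dup_count)
print(unmatched[["case_enquiry_id", "_merge"]])
```

In this toy case the outer merge yields four rows from two three-row inputs, which is the same kind of growth seen above.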

In [None]:
merged_df.head()

Unnamed: 0,case_enquiry_id,open_dt,target_dt,closed_dt,ontime,case_status,closure_reason,case_title,subject,reason,...,latitude,longitude,source,survival_time,event,ward_number,survival_time_hours,event_prediction_x,event_prediction_y,survival_prediction
0,101004808258,2023-05-02 10:00:15,2024-05-01 10:00:17,NaT,ONTIME,Open,,New Tree Requests,Parks & Recreation Department,Trees,...,42.3594,-71.0587,Citizens Connect App,NaT,0,5,,1,0.435624,1-12 months
1,101004797913,2023-04-24 15:10:00,2023-07-23 15:10:44,NaT,ONTIME,Open,,SCH5/2Cross Metering - Sub-Metering,Inspectional Services,Housing,...,42.365,-71.0569,Constituent Call,NaT,0,3,,1,0.836357,1-7 days
2,101004805089,2023-04-29 13:41:00,2023-05-08 08:30:00,NaT,OVERDUE,Open,,Equipment Repair: Beethoven School Play Area -...,Parks & Recreation Department,Park Maintenance & Safety,...,42.2639,-71.1553,Citizens Connect App,NaT,0,20,,1,0.844345,1-7 days
3,101004805096,2023-04-29 13:49:00,2023-05-02 08:30:00,NaT,OVERDUE,Open,,Sidewalk Repair (Make Safe),Public Works Department,Highway Maintenance,...,42.3589,-71.0706,Citizens Connect App,NaT,0,5,,0,0.072754,1-12 months
4,101004811328,2023-05-04 11:07:00,2023-05-05 11:07:29,NaT,OVERDUE,Open,,Sidewalk Repair (Make Safe),Public Works Department,Highway Maintenance,...,42.342,-71.0702,Constituent Call,NaT,0,3,,0,0.085065,1-12 months


##Save the prediction data

In [None]:
merged_df.to_csv('predictions.csv', index=False)

In [None]:
files.download('predictions.csv')

In [None]:
import gc
gc.collect()

100

##Send to a remote MySQL database

In [None]:
pip install mysql-connector-python sqlalchemy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting mysql-connector-python
  Downloading mysql_connector_python-8.0.33-cp310-cp310-manylinux1_x86_64.whl (27.4 MB)
Installing collected packages: mysql-connector-python
Successfully installed mysql-connector-python-8.0.33


In [None]:
from sqlalchemy import create_engine
import pandas as pd

# Create an engine that connects to a MySQL database
# Replace 'username', 'password', 'hostname', 'dbname' with your actual credentials
engine = create_engine('mysql+mysqlconnector://username:password@hostname/dbname')

# Write the DataFrame to the 'predictions' table, replacing the table if it already exists
merged_df.to_sql('predictions', con=engine, if_exists='replace', index=False)

4583
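Rather than pasting credentials into the notebook, the connection URL could be assembled from environment variables. The sketch below is one possible approach, not part of the original notebook: the variable names (`DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_NAME`) and the helper `build_mysql_url` are assumptions, and the password is URL-encoded so that special characters do not break the URL.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(env=os.environ):
    """Compose a SQLAlchemy-style MySQL URL from environment variables."""
    user = env["DB_USER"]
    # URL-encode the password so characters such as '@' or '/' do not break the URL.
    password = quote_plus(env["DB_PASSWORD"])
    host = env["DB_HOST"]
    name = env["DB_NAME"]
    return f"mysql+mysqlconnector://{user}:{password}@{host}/{name}"


# Example with a plain dict standing in for os.environ.
example_env = {
    "DB_USER": "analyst",
    "DB_PASSWORD": "p@ss/word",
    "DB_HOST": "db.example.com",
    "DB_NAME": "boston311",
}
url = build_mysql_url(example_env)
print(url)
```

The resulting string can be passed to `create_engine` in place of the hard-coded URL.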