#Boston 311 v10 - Using Our Code as a Python Package

Our immediate concern is our data loading and cleaning functions. They are unstructured and unfocused, and create a few problems for us:

1. Our feature matrix changes depending on our input data set. This is because one-hot-encoding categorical variables creates a different set of feature columns depending on the categories included.

2. The scenarios are hard-coded and hard to remember. It would be better to have an intuitive way to specify our scenarios.

3. Our data loading automatically loads all of the data from all of the years. It might be useful to be able to specify years. It might be even more useful to be able to specify a range of datetimes.

4. Our data cleaning and splitting happen together, and they drop the case_enquiry_id, our unique case identifier, so we can't easily match our predictions with our original data records.

Problems 1 and 4 in particular lead to sticky challenges. If we want to make predictions with a model on data from the last month, we need to clean and split it with the entire dataset, hoping that no new categories have been added in the last month for any of the categorical columns we are one-hot encoding, and then we can't easily match the results back to our original data.
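To make problem 1 concrete, here is a minimal sketch using a toy data frame (the rows and the `source` values are made up for illustration, not taken from the real data). It one-hot encodes the full set and a "last month" subset independently, the way our current cleaning code effectively does, and shows that the two feature matrices end up with different columns.

```python
import pandas as pd

# Toy stand-in for the 311 data; the values are illustrative only.
full = pd.DataFrame({
    "case_enquiry_id": [1, 2, 3, 4],
    "source": ["Constituent Call", "Citizens Connect App",
               "Constituent Call", "Self Service"],
})
# Pretend the last two cases are "the last month".
last_month = full[full["case_enquiry_id"] >= 3]

# One-hot encode each set independently.
full_X = pd.get_dummies(full[["source"]])
recent_X = pd.get_dummies(last_month[["source"]])

print(list(full_X.columns))    # three source_* columns
print(list(recent_X.columns))  # only two: one category never appears
```

A model trained on `full_X` expects three input columns, so `recent_X` cannot be fed to it without first being aligned to the training columns.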

What changes can we make to fix these problems with our pipeline?

From my research, it looks like scikit-learn has a whole set of modules to handle these pipeline problems, and a mature and experienced data scientist could probably solve this problem quickly using those tools. However, in the interests of preserving the work I've done so far and not getting stuck wrestling with high-level scikit-learn modules, let's do this the hard way. This will also let us keep an eye on RAM usage, since that's a problem for us on the limited resources of Google Colab.

Let's create a class 'Boston311Model' that keeps track of the parameters of our training. It will have properties:

- `model`: our model once trained
- `feature_columns`: a list of our feature columns
- `feature_dict`: a dictionary whose keys are the names of our feature columns and whose values are lists of all the possible values for each column
- `train_date_range`: a dict with keys "start" and "end" and datetime values
- `predict_date_range`: a dict with keys "start" and "end" and datetime values
- `scenario`: our scenario data, maybe a list, maybe a dict, depending on how we recode our scenarios
- `model_type`: our type of model (linear, logistic, etc.)

The functions will be:
When you create the object, you will specify the feature_columns, the model type, the scenario, the train_date_range, and the predict_date_range

load_data() - this will use the start_date and end_date. It will return a dataframe

enhance_data( data ) - this will enhance the data according to our needs

clean_data() - this will drop any columns not in feature_columns, create the feature_dict, and one-hot encode the training data

split_data( data ) - this takes data that is ready for training and splits it into an id series, a feature matrix, and a label series

train_model( X, y) - this trains the model and returns the model object

clean_data_for_prediction( data ) - this will drop any columns not in feature_columns, and one-hot encode the data for prediction, using the feature_columns and feature_dict to ensure the cleaned data has exactly the columns this model was trained on.

predict() - this will load the data based on the predict_date_range, call clean_data_for_prediction, call split_data, use the model to predict the label, then use the id series to join the predictions with the original data, returning a data frame.
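The role of feature_dict in clean_data and clean_data_for_prediction can be sketched as follows. This is only an illustration of the idea, not the package's actual code: the helper names `build_feature_dict` and `one_hot_with_dict` and the toy values are assumptions. The trick is to declare each column as a pandas `Categorical` with the training categories, so `get_dummies` always emits the same columns.

```python
import pandas as pd

def build_feature_dict(data, feature_columns):
    """Record every category seen in the training data, per feature column."""
    return {col: sorted(data[col].dropna().unique()) for col in feature_columns}

def one_hot_with_dict(data, feature_dict):
    """One-hot encode so the output columns depend only on feature_dict."""
    encoded = data[list(feature_dict)].copy()
    for col, categories in feature_dict.items():
        # Values not in the training categories become NaN, which
        # get_dummies encodes as all zeros for that column group.
        encoded[col] = pd.Categorical(encoded[col], categories=categories)
    return pd.get_dummies(encoded, columns=list(feature_dict))

# Toy training and prediction sets (illustrative values only).
train = pd.DataFrame({"source": ["Constituent Call", "Self Service"],
                      "department": ["PWDx", "BTDT"]})
new = pd.DataFrame({"source": ["Self Service", "Brand New App"],
                    "department": ["PWDx", "PWDx"]})

feature_dict = build_feature_dict(train, ["source", "department"])
train_X = one_hot_with_dict(train, feature_dict)
new_X = one_hot_with_dict(new, feature_dict)

print(list(train_X.columns) == list(new_X.columns))
```

With this approach a category that first appears after training (here "Brand New App") does not add a column; the row simply has zeros in every `source_*` column. Whether that is the right treatment for unseen categories is a modeling decision we would still need to make.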

##Questions and To-Dos:

Below are our open questions and to-dos, consolidated from the last notebook. Moving forward we will probably keep this list at the top of each notebook.

2. Add more features
3. Clean up the data by removing outliers
6. Look at the currently available Android app and see which values the user can select, and which categories might be assigned by the 311 agents after receiving a new case.
7. Compare a basic model which only uses the department value as a feature to our more complex models, as a heuristic for whether additional features actually improve predictions.
8. Moving forward, compare our model predictions with the target date assigned by 311 to see which performs better.

Questions to answer:
1. Can we find some basic commonality between open cases?
2. When and how is the target date set? How about the overdue flag?
3. Do cases autoclose after a certain time?

###Conclusions from Boston311_v9, copied from below:
We have a problem now though, which is that our data cleaning functions drop the case_enquiry_id before returning the data, which is good for training, but it means we can't match up prediction results with the original cases. 

Our immediate next task for prediction should be to refactor the data cleaning and splitting functions to make predicting cases possible with a particular model.

###Conclusions from Boston311_v8:

We got some variable results on these models. Now that we have several scenarios, we might want to come up with ways to compare the performance of these models easily.

We are reaching the limits of the capabilities of Google Colaboratory. When we train our models, we might want to try deleting the test data after splitting it so we can save RAM. If we want to keep it, we can write it to a file and download it before deleting it.

Additionally, we want to delete any intermediary data frames created during training before doing the next training. The best way to do that will be to put our data splitting and training inside functions so when the functions complete the variables go out of scope and the RAM they used is freed. Anything that needs to be kept can be saved to files.
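As a rough sketch of that idea (the function name and the stand-in data are hypothetical), the large intermediate object below only exists inside the function, and only a small summary is returned. Once the function returns, nothing references the intermediate, so CPython is free to reclaim it; how quickly that memory is handed back to the operating system is not something this sketch can promise.

```python
import gc

def train_and_summarize(n_rows):
    """Build large intermediates locally and return only a small result."""
    # Stand-in for a big intermediate data frame.
    intermediate = [float(i) for i in range(n_rows)]
    summary = {"rows": len(intermediate), "mean": sum(intermediate) / n_rows}
    # 'intermediate' is local, so it becomes unreachable on return.
    return summary

result = train_and_summarize(100_000)
gc.collect()  # optional: also clean up any leftover reference cycles
print(result)
```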


##Install the package from github using pip

In [1]:
! pip install git+https://github.com/mindfulcoder49/Boston_311.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/mindfulcoder49/Boston_311.git
  Cloning https://github.com/mindfulcoder49/Boston_311.git to /tmp/pip-req-build-uges679e
  Running command git clone --filter=blob:none --quiet https://github.com/mindfulcoder49/Boston_311.git /tmp/pip-req-build-uges679e
  Resolved https://github.com/mindfulcoder49/Boston_311.git to commit b8dd8ace373b7d38bfee9d506e2077fba6faa2cb
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: boston311
  Building wheel for boston311 (pyproject.toml) ... done
  Created wheel for boston311: filename=boston311-0.0.2-py3-none-any.whl size=6678 sha256=681f373b0c2a8971628335f4c333dc26db4ea2b4c48dd2480d82fecffa574956
  Stored in directory: /tmp/pip-ephem-wheel-cache-x81ghcsz/

##Import the Boston311Model class

In [2]:
from boston311 import Boston311Model

##Train a Model with a specific scenario

Our new Boston311Model class has a more flexible way of specifying scenarios:

The `scenario` parameter is a dict.

Valid keys and values for all algorithms:

- `dropColumnValues`:
    - Value: a dict of column names and lists of values to drop.
    - Example: `{'source':['City Worker App', 'Employee Generated']}`

- `keepColumnValues`:
    - Value: a dict of column names and lists of values to keep, all others being dropped.
    - Example: `{'source':['Constituent Call']}`

- `dropOpen`:
    - Drops all open cases after a certain date.
    - Value: datestring.
    - Example: `'2023-05-13'`

- `survivalTimeMin`:
    - Drops all closed cases where survival time is less than a given number of seconds.
    - Value: int, a number of seconds.
    - Example: `3600`

- `survivalTimeMax`:
    - Drops all closed cases where survival time is more than a given number of seconds.
    - Value: int, a number of seconds.
    - Example: `2678400`

Algorithms to be implemented later:

- `survivalTimeFill`:
    - Fills `survival_time` and `survival_time_hours` for open cases as though they were closed on a given date.
    - Value: datestring.
    - Example: `'2023-05-14'`
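To illustrate how a scenario dict like this could drive row filtering, here is a minimal sketch. It is not the package's implementation: `apply_scenario` is a hypothetical helper, and it assumes the data has an `event` column (1 for closed, 0 for open), a `survival_time` column in seconds, and an `open_dt` datetime column. It also assumes one reading of `dropOpen`, namely dropping open cases that were opened after the given date.

```python
import pandas as pd

def apply_scenario(data, scenario):
    """Filter rows by a scenario dict (hypothetical helper)."""
    # Drop rows whose column value is in the listed values.
    for col, values in scenario.get("dropColumnValues", {}).items():
        data = data[~data[col].isin(values)]
    # Keep only rows whose column value is in the listed values.
    for col, values in scenario.get("keepColumnValues", {}).items():
        data = data[data[col].isin(values)]
    # Assumed reading of dropOpen: drop open cases opened after the cutoff.
    if "dropOpen" in scenario:
        cutoff = pd.to_datetime(scenario["dropOpen"])
        opened_late = (data["event"] == 0) & (data["open_dt"] > cutoff)
        data = data[~opened_late]
    # Survival-time limits apply to closed cases only.
    if "survivalTimeMin" in scenario:
        too_short = (data["event"] == 1) & (data["survival_time"] < scenario["survivalTimeMin"])
        data = data[~too_short]
    if "survivalTimeMax" in scenario:
        too_long = (data["event"] == 1) & (data["survival_time"] > scenario["survivalTimeMax"])
        data = data[~too_long]
    return data

# Toy cases (illustrative values only).
cases = pd.DataFrame({
    "case_enquiry_id": [1, 2, 3, 4, 5, 6],
    "source": ["Constituent Call", "City Worker App", "Constituent Call",
               "Self Service", "Constituent Call", "Constituent Call"],
    "open_dt": pd.to_datetime(["2023-01-05", "2023-02-01", "2023-03-10",
                               "2023-04-20", "2023-05-01", "2023-04-01"]),
    "event": [1, 1, 1, 0, 1, 0],
    "survival_time": [7200, 7200, 60, float("nan"), 5_000_000, float("nan")],
})
scenario = {"dropColumnValues": {"source": ["City Worker App", "Employee Generated"]},
            "dropOpen": "2023-04-13",
            "survivalTimeMin": 300,
            "survivalTimeMax": 2678400}

kept = apply_scenario(cases, scenario)
print(kept["case_enquiry_id"].tolist())
```

In this toy run, case 2 is dropped by its source, case 3 closed too quickly, case 4 is a recently opened open case, and case 5 took longer than 31 days, leaving cases 1 and 6.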





In [3]:
my311Model = Boston311Model(train_date_range={'start':'2011-01-01','end':'2013-12-31'},
                            model_type='logistic',
                            feature_columns=['subject', 'reason', 'department', 'source', 'ward_number','type'],
                            scenario={'dropColumnValues': {'source':['City Worker App', 'Employee Generated']},
                                      'dropOpen': '2023-04-13',
                                      'survivalTimeMin':300,
                                      'survivalTimeMax':2678400})

In [4]:
data = my311Model.load_data()

Files with different number of columns from File 0:  []
Files with same number of columns as File 0:  [0, 1, 2]
Files with different column order from File 0:  []
Files with same column order as File 0:  [0, 1, 2]


In [5]:
data = my311Model.enhance_data(data)

In [6]:
data = my311Model.clean_data(data)

In [7]:
data.head()

Unnamed: 0,event,survival_time_hours,subject_Animal Control,subject_Boston Police Department,subject_Boston Water & Sewer Commission,subject_Inspectional Services,subject_Mayor's 24 Hour Hotline,subject_Neighborhood Services,subject_Parks & Recreation Department,subject_Property Management,...,type_Utility Casting Repair,type_Valet Parking Problems,type_Walk-In Service Inquiry,type_Water in Gas - High Priority,type_Watermain Break,type_Work Hours-Loud Noise Complaints,type_Work w/out Permit,type_Working Beyond Hours,type_Yardwaste Asian Longhorned Beetle Affected Area,type_Zoning
1,1,3.903333,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,517.165833,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,3.001944,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,1,101.550833,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,1,98.781667,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
X, y = my311Model.split_data(data)

In [9]:
my311Model.train_model( X, y )

Starting Training at 2023-05-14 23:11:15.358873
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.9488019347190857
Ending Training at 2023-05-14 23:12:41.557495
Training took 0:01:26.198622


In [3]:
my311Model = Boston311Model(train_date_range={'start':'2015-01-01','end':'2023-12-31'},
                            model_type='logistic',
                            feature_columns=['subject', 'reason', 'department', 'source', 'ward_number','type'],
                            scenario={'dropColumnValues': {'source':['City Worker App', 'Employee Generated']},
                                      'dropOpen': '2023-04-13',
                                      'survivalTimeMin':300,
                                      'survivalTimeMax':2678400})

In [4]:
my311Model.run_pipeline()

  df = pd.read_csv(file)


Files with different number of columns from File 0:  []
Files with same number of columns as File 0:  [0, 1, 2, 3, 4, 5, 6, 7, 8]
Files with different column order from File 0:  []
Files with same column order as File 0:  [0, 1, 2, 3, 4, 5, 6, 7, 8]
Starting Training at 2023-05-15 01:51:41.019156
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Test accuracy: 0.9246886968612671
Ending Training at 2023-05-15 01:58:36.058045
Training took 0:06:55.038889


In [15]:
my311Model.feature_dict

{'subject': ["Mayor's 24 Hour Hotline",
  'Public Works Department',
  'Animal Control',
  'Inspectional Services',
  'Transportation - Traffic Division',
  'Property Management',
  'Parks & Recreation Department',
  'Disability Department',
  'Boston Water & Sewer Commission',
  'Boston Police Department',
  'Neighborhood Services',
  'Veterans',
  'Consumer Affairs & Licensing'],
 'reason': ['Notification',
  'Street Lights',
  'Animal Issues',
  'Street Cleaning',
  'Housing',
  'Enforcement & Abandoned Vehicles',
  'Sanitation',
  'Signs & Signals',
  'Graffiti',
  'Employee & General Comments',
  'Highway Maintenance',
  'Recycling',
  'Park Maintenance & Safety',
  'Building',
  'Health',
  'Administrative & General Requests',
  'Environmental Services',
  'Disability',
  'Sidewalk Cover / Manhole',
  'Fire Hydrant',
  'Operations',
  'Catchbasin',
  'Programs',
  'Trees',
  'Weights and Measures',
  'Office of The Parking Clerk',
  'Traffic Management & Engineering',
  'Cemetery

##Problems:

- load_data_from_urls can't handle train start and end dates before 2011 or after 2023
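One possible way to handle this, sketched under the assumption that the data comes as one file per calendar year from 2011 through 2023, is to clamp the requested range to the years that exist before building the list of files. The helper `years_to_load` and its parameters are hypothetical and not part of the package.

```python
from datetime import date

def years_to_load(date_range, first_year=2011, last_year=2023):
    """Return the available years that overlap a {'start', 'end'} date range."""
    start = date.fromisoformat(date_range["start"]).year
    end = date.fromisoformat(date_range["end"]).year
    if start > end:
        raise ValueError("start must not be after end")
    # Clamp to the years we actually have files for.
    lo = max(start, first_year)
    hi = min(end, last_year)
    return list(range(lo, hi + 1))  # empty if the range misses every year

print(years_to_load({"start": "2009-06-01", "end": "2012-03-01"}))
print(years_to_load({"start": "2022-01-01", "end": "2030-01-01"}))
```

Rows outside the exact start and end dates would still need to be filtered after loading, since this only decides which yearly files to read.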




##Predict the Outcome of Open Cases

The scenario we ran used `dropOpen` with the date 2023-04-13, which dropped the open cases from roughly the last month, so those are the cases we should predict.

We have a problem now though, which is that our data cleaning functions drop the case_enquiry_id before returning the data, which is good for training, but it means we can't match up prediction results with the original cases. 

Our immediate next task for prediction should be to refactor the data cleaning and splitting functions to make predicting cases possible with a particular model.
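A minimal sketch of what that refactor could look like is below. The helper names `split_with_ids` and `attach_predictions`, the toy rows, and the choice of `event` as the label column are assumptions for illustration; the prediction values are hard-coded stand-ins for real model output.

```python
import pandas as pd

def split_with_ids(data, id_column="case_enquiry_id", label_column="event"):
    """Split cleaned data into an id series, a feature matrix, and a label series."""
    ids = data[id_column]
    y = data[label_column]
    X = data.drop(columns=[id_column, label_column])
    return ids, X, y

def attach_predictions(original, ids, predictions):
    """Join predictions back onto the original records by case_enquiry_id."""
    pred = pd.DataFrame({"case_enquiry_id": ids.to_numpy(),
                         "prediction": predictions})
    return original.merge(pred, on="case_enquiry_id", how="left")

# Toy data (illustrative values only).
original = pd.DataFrame({"case_enquiry_id": [101, 102, 103, 104],
                         "type": ["Pothole", "Graffiti", "Pothole", "Zoning"]})
cleaned = pd.DataFrame({"case_enquiry_id": [101, 102, 103],
                        "event": [1, 0, 1],
                        "type_Graffiti": [0, 1, 0],
                        "type_Pothole": [1, 0, 1]})

ids, X, y = split_with_ids(cleaned)
fake_predictions = [0.9, 0.2, 0.7]  # stand-in for the model's output on X
result = attach_predictions(original, ids, fake_predictions)
print(result)
```

The left merge keeps every original record, so a case that was filtered out during cleaning (case 104 here) shows up with a missing prediction rather than disappearing.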


##Save the model

In [None]:
from google.colab import files

my311Model.model.save("model_log_1234.h5")
files.download('model_log_1234.h5')


In [13]:
import gc
gc.collect()

2333