#Boston 311 v8 - Creating More Data Cleaning Scenarios and Models

In the last notebook we created unit tests for our data cleaning functions. We also did a small amount of data analysis that led us to some ideas for more data cleaning scenarios:

There are 212 unique results for the type column, which is a lot, but not that many for our 2 million cases. This might be a defining and predictive categorical variable worth including as a feature.

All but one case with a negative survival time was generated by a city worker or an employee. From some ad hoc case inspection, it looks like the public works department uses this database to track filling in potholes even if they haven't been reported by a constituent. That information would probably not be relevant to a model that's supposed to predict how long it takes to address constituent concerns. 

Lastly, it might be worth considering dropping cases that are closed too soon after being opened, since our model might be most useful for predicting how long it takes to close a case that doesn't have a quick resolution.

These considerations can lead to new scenarios for model training to compare to previous ones.

Additionally, it might be nice to refactor the scenario functionality to accept a list of integers and separate the data cleaning options so that they can be combined in different ways more easily.
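To picture the list-based design, here is a minimal sketch on a toy data frame. Each integer switches on one independent filter, and a membership test (`in`) lets any subset be combined. The helper `apply_scenarios` and the toy data are hypothetical and only illustrate the idea; they are not the notebook's cleaning functions.

```python
import pandas as pd

def apply_scenarios(df, scenario):
    """Apply each requested filter independently (hypothetical helper)."""
    out = df.copy()
    # Option 3: drop cases created by city staff
    if 3 in scenario:
        out = out[~out["source"].isin(["Employee Generated", "City Worker App"])]
    # Option 4: drop cases that were closed in under an hour
    if 4 in scenario:
        out = out[out["survival_time_hours"] >= 1]
    return out

toy = pd.DataFrame({
    "source": ["Constituent Call", "City Worker App", "Constituent Call"],
    "survival_time_hours": [0.5, 30.0, 12.0],
})

none_applied = apply_scenarios(toy, [0])   # no filter matches, all rows kept
only_staff = apply_scenarios(toy, [3])     # staff-generated row removed
both = apply_scenarios(toy, [3, 4])        # both filters combined
```

Because every option is guarded by its own `if`, adding a new option should not change the behavior of existing combinations.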

Below are our open questions and to-dos, consolidated from the last notebook. Moving forward, we will probably keep this list at the top of each notebook.

##Questions and To-Dos:

2. Add more features
3. Clean up the data by removing outliers
6. Look at the currently available Android app to see which values the user can select, and which categories might be assigned by the 311 agents after receiving a new case.
7. Compare a basic model that uses only the department value as a feature to our more complex models, as a heuristic for whether additional features actually improve predictions.
8. Moving forward, compare our model predictions with the target date assigned by 311 to see which performs better.
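
One possible way to approach to-do 7 is a group-mean baseline: predict each case's time to close as the mean for its department in the training data. The sketch below uses made-up numbers, and the choice of mean absolute error as the comparison metric is an assumption, not something the notebook has settled on.

```python
import pandas as pd

train = pd.DataFrame({
    "department": ["PWDx", "PWDx", "BTDT", "BTDT"],
    "survival_time_hours": [10.0, 30.0, 2.0, 4.0],
})
test = pd.DataFrame({
    "department": ["PWDx", "BTDT", "ISD"],
    "survival_time_hours": [25.0, 5.0, 40.0],
})

# Baseline "model": mean hours-to-close per department in the training data
dept_means = train.groupby("department")["survival_time_hours"].mean()
overall_mean = train["survival_time_hours"].mean()

# Departments unseen in training fall back to the overall training mean
baseline_pred = test["department"].map(dept_means).fillna(overall_mean)

# Mean absolute error of the baseline, to compare against richer models
baseline_mae = (test["survival_time_hours"] - baseline_pred).abs().mean()
```

If a more complex model cannot beat this number on the same test split, the extra features are probably not helping.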

Questions to answer:
1. Can we find some basic commonality between open cases?
2. When and how is the target date set? How about the overdue flag?
3. Do cases autoclose after a certain time?

##Conclusions from this notebook, copied from the end:

We got some variable results on these models. Now that we have several scenarios, we might want to come up with ways to compare the performance of these models easily.

We are reaching the limits of the capabilities of Google Colaboratory. When we train our models, we might want to try deleting the test data after splitting it so we can save RAM. If we want to keep it, we can write it to a file and download it before deleting it.

Additionally, we want to delete any intermediary data frames created during training before doing the next training. The best way to do that will be to put our data splitting and training inside functions so when the functions complete the variables go out of scope and the RAM they used is freed. Anything that needs to be kept can be saved to files.
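
As a rough sketch of that idea, the function below does the split, the fit, and the scoring internally and returns only a small dictionary. The function name, the NumPy-based split, and the least-squares fit are all stand-ins chosen for illustration; the real notebooks use their own models.

```python
import gc
import numpy as np

def train_and_score(X, y, test_fraction=0.2, seed=0):
    """Split, fit, and score inside one function; return only small results."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]

    # These intermediate arrays exist only while the function is running
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # Ordinary least squares as a stand-in for the real model
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    mse = float(np.mean((X_test @ coef - y_test) ** 2))
    return {"n_train": len(train_idx), "n_test": n_test, "mse": mse}

X = np.arange(20, dtype=float).reshape(10, 2)
y = X @ np.array([1.0, 2.0])

result = train_and_score(X, y)
# The split arrays are no longer referenced; gc.collect() asks Python to reclaim them now
gc.collect()
```

Memory is only released once nothing else refers to the data, so anything returned from the function, or still held in a notebook variable, stays in RAM.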


##Import libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import tensorflow as tf
import glob
import pprint
from google.colab import files
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler
from datetime import datetime

from IPython.display import display

%matplotlib inline

#first of course we must import the necessary modules

##Load all data source URLs into variables
Let's add some code to load the records directly from their storage so we don't have to upload them to Google Colaboratory each time. So far the URLs appear to be constant.

In [None]:
url_2023 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/e6013a93-1321-4f2a-bf91-8d8a02f1e62f/download/tmp9g_820k8.csv"
url_2022 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/81a7b022-f8fc-4da5-80e4-b160058ca207/download/tmph4izx_fb.csv"
url_2021 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/f53ebccd-bc61-49f9-83db-625f209c95f5/download/tmppgq9965_.csv"
url_2020 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/6ff6a6fd-3141-4440-a880-6f60a37fe789/download/script_105774672_20210108153400_combine.csv"
url_2019 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/ea2e4696-4a2d-429c-9807-d02eb92e0222/download/311_service_requests_2019.csv"
url_2018 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/2be28d90-3a90-4af1-a3f6-f28c1e25880a/download/311_service_requests_2018.csv"
url_2017 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/30022137-709d-465e-baae-ca155b51927d/download/311_service_requests_2017.csv"
url_2016 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/b7ea6b1b-3ca4-4c5b-9713-6dc1db52379a/download/311_service_requests_2016.csv"
url_2015 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/c9509ab4-6f6d-4b97-979a-0cf2a10c922b/download/311_service_requests_2015.csv"
url_2014 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/bdae89c8-d4ce-40e9-a6e1-a5203953a2e0/download/311_service_requests_2014.csv"
url_2013 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/407c5cd0-f764-4a41-adf8-054ff535049e/download/311_service_requests_2013.csv"
url_2012 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/382e10d9-1864-40ba-bef6-4eea3c75463c/download/311_service_requests_2012.csv"
url_2011 = "https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/94b499d9-712a-4d2a-b790-7ceec5c9c4b1/download/311_service_requests_2011.csv"
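
With the URLs in variables, the yearly files could be read and stacked roughly as follows. `pd.read_csv` accepts a URL directly; the helper `load_years` and the fake in-memory files are hypothetical, and are used here so that the sketch runs without network access.

```python
import io
import pandas as pd

def load_years(urls_by_year, reader=pd.read_csv):
    """Read one CSV per year and stack them into a single DataFrame."""
    frames = [reader(url) for _, url in sorted(urls_by_year.items())]
    return pd.concat(frames, ignore_index=True)

# Offline stand-ins for the real downloads (made-up rows)
fake_files = {
    "url_a": "case_enquiry_id,open_dt\n1,2011-07-01 01:32:33\n",
    "url_b": "case_enquiry_id,open_dt\n2,2012-01-01 00:05:00\n3,2012-01-01 00:21:00\n",
}

def fake_reader(url):
    return pd.read_csv(io.StringIO(fake_files[url]))

all_cases = load_years({2011: "url_a", 2012: "url_b"}, reader=fake_reader)
# With the real data this might look like: load_years({2022: url_2022, 2023: url_2023})
```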


##Data Cleaning Functions

Let's update the scenario functionality to take a list of integers that specify which scenarios to run, and add scenarios for the data cleaning we described above. We will need to update our unit tests accordingly.

In [None]:
def clean_and_split_for_logistic(myData, scenario) :

  data = myData.copy()
  # Convert the 'open_dt' and 'closed_dt' columns to datetime
  data['open_dt'] = pd.to_datetime(data['open_dt'])
  data['closed_dt'] = pd.to_datetime(data['closed_dt'])
  data['survival_time'] = data['closed_dt'] - data['open_dt']
  data['event'] = data['closed_dt'].notnull().astype(int)
  data['ward_number'] = data['ward'].str.extract(r'0*(\d+)')

  # Columns to keep for modeling, and columns to drop before one hot encoding

  cols_to_keep = ['case_enquiry_id', 'survival_time', 'event', 'subject', 'reason', 'department', 'source', 'ward_number']
  cols_to_drop = [
 'open_dt',
 'target_dt',
 'closed_dt',
 'ontime',
 'case_status',
 'closure_reason',
 'case_title',
 'type',
 'queue',
 'submittedphoto',
 'closedphoto',
 'location',
 'fire_district',
 'pwd_district',
 'city_council_district',
 'police_district',
 'neighborhood',
 'neighborhood_services_district',
 'ward',
 'precinct',
 'location_street_name',
 'location_zipcode',
 'latitude',
 'longitude']

  #scenarios
  #scenario 0: no outlier adjustments
  
  #scenario 1: drop any open cases from the last month
  if 1 in scenario :
    # Convert the date string to a pandas Timestamp object
    cutoff_date = pd.Timestamp('2023-04-09')

    # Filter the DataFrame to include only rows where event is 1 or open_dt is before the cutoff date
    data = data[(data['event'] == 1) | (data['open_dt'] < cutoff_date)]


  #scenario 2: switch the event value for any cases that took longer than a month to close
  if 2 in scenario :
    delta = pd.Timedelta(seconds=2678400)
    data.loc[(data['event'] == 1) & (data['survival_time'] > delta), 'event'] = 0

  #scenario 3: Remove all records where source is "Employee Generated" or "City Worker App"
  if 3 in scenario :
    data = data[~data['source'].isin(["Employee Generated", "City Worker App"])]

  #scenario 4: Remove all closed records where survival time is less than an hour
  if 4 in scenario :
    delta = pd.Timedelta(seconds=3600)
    # keep open cases, plus closed cases that took at least an hour
    data = data[(data['event'] == 0) | (data['survival_time'] >= delta)]


  dummy_list = ['subject', 'reason', 'department', 'source', 'ward_number']

  #scenario 5: Add type as a one hot encoded categorical variable
  if 5 in scenario :
    dummy_list.append('type')
    cols_to_drop.remove('type')



  data = data.drop(cols_to_drop, axis=1)

  data = pd.get_dummies(data, columns=dummy_list)



  #drop the case_enquiry_id and the target columns so only features remain in X
  X = data.drop(['case_enquiry_id','event', 'survival_time'], axis=1)
  y = data['event']

  return X, y

def clean_and_split_for_linear(myData, scenario) :

  data = myData.copy()
  # Convert the 'open_dt' and 'closed_dt' columns to datetime
  data['open_dt'] = pd.to_datetime(data['open_dt'])
  data['closed_dt'] = pd.to_datetime(data['closed_dt'])
  data['survival_time'] = data['closed_dt'] - data['open_dt']
  data['event'] = data['closed_dt'].notnull().astype(int)
  data['ward_number'] = data['ward'].str.extract(r'0*(\d+)')

  cols_to_keep = ['case_enquiry_id', 'survival_time', 'event', 'subject', 'reason', 'department', 'source', 'ward_number']
  cols_to_drop = [
 'open_dt',
 'target_dt',
 'closed_dt',
 'ontime',
 'case_status',
 'closure_reason',
 'case_title',
 'type',
 'queue',
 'submittedphoto',
 'closedphoto',
 'location',
 'fire_district',
 'pwd_district',
 'city_council_district',
 'police_district',
 'neighborhood',
 'neighborhood_services_district',
 'ward',
 'precinct',
 'location_street_name',
 'location_zipcode',
 'latitude',
 'longitude']

  #scenario 3: Remove all records where source is "Employee Generated" or "City Worker App"
  if 3 in scenario :
    data = data[~data['source'].isin(["Employee Generated", "City Worker App"])]

  dummy_list = ['subject', 'reason', 'department', 'source', 'ward_number']
  
  #scenario 5: Add type as a one hot encoded categorical variable
  if 5 in scenario :
    dummy_list.append('type')
    cols_to_drop.remove('type')



  data = data.drop(cols_to_drop, axis=1)

  data = pd.get_dummies(data, columns=dummy_list)

  data_survival_mask = data["survival_time"].notnull()
  clean_data = data[data_survival_mask].copy()
  clean_data['survival_time_hours'] = clean_data['survival_time'].apply(lambda x: x.total_seconds()/3600)

  #add scenarios
  #scenario 0: no outlier adjustments

  #scenario 1: remove records if the case took more than a month to close
  if 1 in scenario :
    clean_data = clean_data[(clean_data['survival_time_hours'] <= 744)]
  
  #scenario 2: remove records just if the time to close is negative 
  if 2 in scenario :
    clean_data = clean_data[(clean_data['survival_time_hours'] >= 0)]



  #scenario 4: Remove all records where survival time is less than an hour
  if 4 in scenario :
    clean_data = clean_data[(clean_data['survival_time_hours'] >= 1)]

  #scenario 5 (type as a one hot encoded variable) is handled above, before get_dummies

  #drop the case_enquiry_id and the target columns so only features remain in X
  X = clean_data.drop(['case_enquiry_id','survival_time_hours', 'survival_time', 'event'], axis=1) 
  y = clean_data['survival_time_hours']
  
  return X, y

##Unit Testing

Let's add some unit tests. The easiest way to create unit tests here is to pull some records from our data, and modify a few so our data covers all of the scenarios in our data cleaning functions. Let's outline the scenarios here:

In the last notebook our scenario parameter was an integer, and a single scenario could bundle several options. In this notebook we refactored the scenario parameter into a list of integers and separated unrelated data cleaning options for better control. We updated the test function below accordingly.

For logistic cleaning, we split the old scenario 1 into scenarios 1 and 2. For linear cleaning, we simplified scenario 1, so the old scenario 1 is now the combination of scenarios 1 and 2. In both cases, the output previously obtained from scenario 1 is now obtained by running scenario [1, 2].

##Logistic cleaning:
###Scenario 0:
No outlier removal. The y output is a series of 0s and 1s indicating whether a case is open or closed, with 0 marking open and 1 marking closed. The X dataframe should contain only dummied columns for the 'subject', 'reason', 'department', 'source', and 'ward_number' columns.
###Scenario 1:
Drop any open cases from the last month.

###Scenario 2:
Switch the event value for any cases that took longer than a month to close.

##Linear cleaning:
###Scenario 0:
No outlier removal. The y output is a series of floats corresponding to the number of hours between the case open date and the case close date. All open cases are dropped.
###Scenario 1:
Remove records if the case took more than a month to close.
###Scenario 2:
Remove records only if the time to close is negative.

##How to add a new scenario and create a new test
When we add a scenario, we will want to check that it didn't break the other scenarios, and then add a new test. To do that, we will:

1. Add the scenario code to the data cleaning function.
2. Run the current unit test function.
3. Run the new scenario on the test sample data.
4. Inspect the output visually to see if the scenario is working. 
5. Print and copy the verified output into our testing function.
6. Add the scenario call and assertions to the testing function.
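
Steps 5 and 6 above can be sketched with `assert_frame_equal` from `pandas.testing`. The cleaning function `toy_clean` and its sample data are hypothetical stand-ins, kept tiny so the pattern is easy to see; the real test function works the same way on the hardcoded 311 sample.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def toy_clean(df, scenario):
    """Hypothetical stand-in for a cleaning function with one new scenario."""
    out = df.copy()
    # New scenario 4: drop cases closed in under an hour
    if 4 in scenario:
        out = out[out["survival_time_hours"] >= 1]
    return out

def test_toy_clean():
    sample = pd.DataFrame({"survival_time_hours": [0.25, 5.0, 48.0]})

    # Existing behavior: scenario [0] leaves the data untouched
    assert_frame_equal(toy_clean(sample, [0]), sample)

    # Step 5: output that was inspected by hand, then hardcoded
    # (filtering keeps the original row labels 1 and 2)
    expected_4 = pd.DataFrame({"survival_time_hours": [5.0, 48.0]}, index=[1, 2])

    # Step 6: the new scenario call and its assertion
    assert_frame_equal(toy_clean(sample, [4]), expected_4)
    return True

test_passed = test_toy_clean()
```

`assert_frame_equal` raises an `AssertionError` describing the first mismatch, so a failing scenario is easy to locate.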

##Pull data and print it
We have to pull data, modify it to encompass all scenarios, and then inspect the current output for errors. Once we establish the functions currently work as expected, we can save hardcoded copies of the data to create unit tests to run after future changes to the functions.

Some of the code has been removed to simplify adding new unit tests now that we have hardcoded test data. To see how we originally built the unit tests, see Boston311_v7.

In [None]:
test_data_2022 = pd.DataFrame({'case_enquiry_id': [101004125189,
                     101004161747,
                     101004149944,
                     101004113302,
                     101004122704,
                     101004122479,
                     101004113310,
                     101004113311,
                     101004113328,
                     101004113550],
 'case_status': ['Open',
                 'Closed',
                 'Open',
                 'Closed',
                 'Open',
                 'Open',
                 'Closed',
                 'Closed',
                 'Open',
                 'Closed'],
 'case_title': ['Illegal Rooming House',
                'PublicWorks: Complaint',
                'Space Savers',
                'Parking Enforcement',
                'DISPATCHED Heat - Excessive  Insufficient',
                'Generic Noise Disturbance',
                'Parking Enforcement',
                'General Lighting Request',
                'Loud Parties/Music/People',
                'Requests for Street Cleaning'],
 'city_council_district': ['4', ' ', '4', '2', '7', '8', '3', '6', '1', '2'],
 'closed_dt': [np.nan,
               '2021-02-02 11:45:47',
               np.nan,
               '2022-01-03 00:13:17',
               np.nan,
               np.nan,
               '2022-01-03 00:13:02',
               '2022-04-02 13:01:14',
               np.nan,
               '2022-05-03 05:59:20'],
 'closedphoto': [np.nan,
                 np.nan,
                 np.nan,
                 np.nan,
                 np.nan,
                 np.nan,
                 np.nan,
                 np.nan,
                 np.nan,
                 'https://spot-boston-res.cloudinary.com/image/upload/v1641207557/boston/production/o0vkrv9zckukp8httr7g.jpg'],
 'closure_reason': [' ',
                    'Case Closed Case Noted    ',
                    ' ',
                    'Case Closed. Closed date : 2022-01-03 00:13:17.393 Case '
                    'Resolved CLEAR ',
                    ' ',
                    ' ',
                    'Case Closed. Closed date : 2022-01-03 00:13:02.72 Case '
                    'Resolved CLEAR ',
                    'Case Closed. Closed date : Sat Apr 02 13:01:14 EDT 2022 '
                    'Noted ',
                    ' ',
                    'Case Closed. Closed date : Mon Jan 03 05:59:20 EST 2022 '
                    'Noted 3 bags of trash collected at intersection of '
                    'Dartmouth and Warren at 5:56 a.m. on Monday 1/3/22. We '
                    'will return on next scheduled trash day. '],
 'department': ['ISD',
                'PWDx',
                'PWDx',
                'BTDT',
                'ISD',
                'INFO',
                'BTDT',
                'PWDx',
                'INFO',
                'PWDx'],
 'fire_district': ['8', ' ', '8', '6', '9', '3', '8', '9', '3', '4'],
 'latitude': [42.2896,
              42.3594,
              42.2876,
              42.3594,
              42.311,
              42.3657,
              42.291,
              42.3594,
              42.3669,
              42.3594],
 'location': ['27 Lithgow St  Dorchester  MA  02124',
              ' ',
              '492 Harvard St  Dorchester  MA  02124',
              'INTERSECTION of Seaport Blvd & Sleeper St  Boston  MA  ',
              '15 Crawford St  Dorchester  MA  02121',
              '50-150 Causeway St  Boston  MA  02114',
              '16 Frost Ave  Dorchester  MA  02122',
              'INTERSECTION of Boylston St & Moraine St  Jamaica Plain  MA  ',
              '194 Salem St  Boston  MA  02113',
              'INTERSECTION of Warren Ave & Dartmouth St  Boston  MA  '],
 'location_street_name': ['27 Lithgow St',
                          np.nan,
                          '492 Harvard St',
                          'INTERSECTION Seaport Blvd & Sleeper St',
                          '15 Crawford St',
                          '50-150 Causeway St',
                          '16 Frost Ave',
                          'INTERSECTION Boylston St & Moraine St',
                          '194 Salem St',
                          'INTERSECTION Warren Ave & Dartmouth St'],
 'location_zipcode': [2124.0,
                      np.nan,
                      2124.0,
                      np.nan,
                      2121.0,
                      2114.0,
                      2122.0,
                      np.nan,
                      2113.0,
                      np.nan],
 'longitude': [-71.0701,
               -71.0587,
               -71.0936,
               -71.0587,
               -71.0841,
               -71.0617,
               -71.0503,
               -71.0587,
               -71.0546,
               -71.0587],
 'neighborhood': ['Dorchester',
                  ' ',
                  'Greater Mattapan',
                  'South Boston / South Boston Waterfront',
                  'Roxbury',
                  'Boston',
                  'Dorchester',
                  'Jamaica Plain',
                  'Downtown / Financial District',
                  'South End'],
 'neighborhood_services_district': ['8',
                                    ' ',
                                    '9',
                                    '5',
                                    '13',
                                    '3',
                                    '7',
                                    '11',
                                    '3',
                                    '6'],
 'ontime': ['OVERDUE',
            'ONTIME',
            'ONTIME',
            'ONTIME',
            'OVERDUE',
            'ONTIME',
            'ONTIME',
            'OVERDUE',
            'ONTIME',
            'ONTIME'],
 'open_dt': ['2023-05-09 12:59:00',
             '2022-02-02 11:42:49',
             '2022-01-28 19:36:00',
             '2022-01-01 00:36:24',
             '2022-01-11 09:47:00',
             '2022-01-10 21:49:00',
             '2022-01-01 01:13:52',
             '2022-01-01 01:14:39',
             '2022-01-01 03:08:00',
             '2022-01-01 13:51:00'],
 'police_district': ['C11',
                     ' ',
                     'B3',
                     'C6',
                     'B2',
                     'A1',
                     'C11',
                     'E13',
                     'A1',
                     'D4'],
 'precinct': ['1706',
              ' ',
              '1411',
              '0601',
              '1202',
              ' ',
              '1607',
              '1903',
              '0302',
              '0401'],
 'pwd_district': ['07', ' ', '07', '05', '10B', '1B', '07', '02', '1B', '1C'],
 'queue': ['ISD_Housing (INTERNAL)',
           'PWDx_General Comments',
           'PWDx_Space Saver Removal',
           'BTDT_Parking Enforcement',
           'ISD_Housing (INTERNAL)',
           'INFO01_GenericeFormforOtherServiceRequestTypes',
           'BTDT_Parking Enforcement',
           'PWDx_Street Light_General Lighting Request',
           'INFO01_GenericeFormforOtherServiceRequestTypes',
           'PWDx_Missed Trash\\Recycling\\Yard Waste\\Bulk Item'],
 'reason': ['Building',
            'Employee & General Comments',
            'Sanitation',
            'Enforcement & Abandoned Vehicles',
            'Housing',
            'Generic Noise Disturbance',
            'Enforcement & Abandoned Vehicles',
            'Street Lights',
            'Noise Disturbance',
            'Street Cleaning'],
 'source': ['Constituent Call',
            'Constituent Call',
            'Constituent Call',
            'Citizens Connect App',
            'Constituent Call',
            'Constituent Call',
            'Citizens Connect App',
            'City Worker App',
            'Constituent Call',
            'Citizens Connect App'],
 'subject': ['Inspectional Services',
             "Mayor's 24 Hour Hotline",
             'Public Works Department',
             'Transportation - Traffic Division',
             'Inspectional Services',
             "Mayor's 24 Hour Hotline",
             'Transportation - Traffic Division',
             'Public Works Department',
             'Boston Police Department',
             'Public Works Department'],
 'submittedphoto': [np.nan,
                    np.nan,
                    np.nan,
                    'https://311.boston.gov/media/boston/report/photos/61cfe84b05bbcf180c293ece/photo_20220101_003547.jpg',
                    np.nan,
                    np.nan,
                    'https://311.boston.gov/media/boston/report/photos/61cff11805bbcf180c2944b1/report.jpg',
                    np.nan,
                    np.nan,
                    'https://311.boston.gov/media/boston/report/photos/61d0a2af05bbcf180c2993e3/report.jpg'],
 'target_dt': ['2022-01-20 12:59:39',
               '2022-02-16 11:42:49',
               np.nan,
               '2022-01-04 08:30:00',
               '2022-02-10 09:47:22',
               np.nan,
               '2022-01-04 08:30:00',
               '2022-02-15 01:14:45',
               np.nan,
               '2022-01-04 08:30:00'],
 'type': ['Illegal Rooming House',
          'General Comments For a Program or Policy',
          'Space Savers',
          'Parking Enforcement',
          'Heat - Excessive  Insufficient',
          'Undefined Noise Disturbance',
          'Parking Enforcement',
          'General Lighting Request',
          'Loud Parties/Music/People',
          'Requests for Street Cleaning'],
 'ward': ['Ward 17',
          ' ',
          'Ward 14',
          '6',
          'Ward 12',
          '03',
          'Ward 16',
          '19',
          'Ward 3',
          '4']})

##Run the data cleaning on our test data

Now we can run the test data through data cleaning and inspect the outputs visually to check that they are correct. It would be nice to write logic to check the data, but then we'd need tests for the tests, which becomes a never-ending problem.

In [None]:
# scenario must be a list of integers; passing a bare 0 would fail on `1 in scenario`
logistic_test_X_0, logistic_test_y_0 = clean_and_split_for_logistic(test_data_2022, [0])


linear_test_X_0, linear_test_y_0 = clean_and_split_for_linear(test_data_2022, [0])


logistic_test_Xy_0 = logistic_test_X_0.copy()

linear_test_Xy_0 = linear_test_X_0.copy()


logistic_test_Xy_0['event'] = logistic_test_y_0


linear_test_Xy_0['survival_time_hours'] = linear_test_y_0


In [None]:
test_data_2022.head(10)

Unnamed: 0,case_enquiry_id,case_status,case_title,city_council_district,closed_dt,closedphoto,closure_reason,department,fire_district,latitude,...,precinct,pwd_district,queue,reason,source,subject,submittedphoto,target_dt,type,ward
0,101004125189,Open,Illegal Rooming House,4.0,,,,ISD,8.0,42.2896,...,1706.0,07,ISD_Housing (INTERNAL),Building,Constituent Call,Inspectional Services,,2022-01-20 12:59:39,Illegal Rooming House,Ward 17
1,101004161747,Closed,PublicWorks: Complaint,,2021-02-02 11:45:47,,Case Closed Case Noted,PWDx,,42.3594,...,,,PWDx_General Comments,Employee & General Comments,Constituent Call,Mayor's 24 Hour Hotline,,2022-02-16 11:42:49,General Comments For a Program or Policy,
2,101004149944,Open,Space Savers,4.0,,,,PWDx,8.0,42.2876,...,1411.0,07,PWDx_Space Saver Removal,Sanitation,Constituent Call,Public Works Department,,,Space Savers,Ward 14
3,101004113302,Closed,Parking Enforcement,2.0,2022-01-03 00:13:17,,Case Closed. Closed date : 2022-01-03 00:13:17...,BTDT,6.0,42.3594,...,601.0,05,BTDT_Parking Enforcement,Enforcement & Abandoned Vehicles,Citizens Connect App,Transportation - Traffic Division,https://311.boston.gov/media/boston/report/pho...,2022-01-04 08:30:00,Parking Enforcement,6
4,101004122704,Open,DISPATCHED Heat - Excessive Insufficient,7.0,,,,ISD,9.0,42.311,...,1202.0,10B,ISD_Housing (INTERNAL),Housing,Constituent Call,Inspectional Services,,2022-02-10 09:47:22,Heat - Excessive Insufficient,Ward 12
5,101004122479,Open,Generic Noise Disturbance,8.0,,,,INFO,3.0,42.3657,...,,1B,INFO01_GenericeFormforOtherServiceRequestTypes,Generic Noise Disturbance,Constituent Call,Mayor's 24 Hour Hotline,,,Undefined Noise Disturbance,03
6,101004113310,Closed,Parking Enforcement,3.0,2022-01-03 00:13:02,,Case Closed. Closed date : 2022-01-03 00:13:02...,BTDT,8.0,42.291,...,1607.0,07,BTDT_Parking Enforcement,Enforcement & Abandoned Vehicles,Citizens Connect App,Transportation - Traffic Division,https://311.boston.gov/media/boston/report/pho...,2022-01-04 08:30:00,Parking Enforcement,Ward 16
7,101004113311,Closed,General Lighting Request,6.0,2022-04-02 13:01:14,,Case Closed. Closed date : Sat Apr 02 13:01:14...,PWDx,9.0,42.3594,...,1903.0,02,PWDx_Street Light_General Lighting Request,Street Lights,City Worker App,Public Works Department,,2022-02-15 01:14:45,General Lighting Request,19
8,101004113328,Open,Loud Parties/Music/People,1.0,,,,INFO,3.0,42.3669,...,302.0,1B,INFO01_GenericeFormforOtherServiceRequestTypes,Noise Disturbance,Constituent Call,Boston Police Department,,,Loud Parties/Music/People,Ward 3
9,101004113550,Closed,Requests for Street Cleaning,2.0,2022-05-03 05:59:20,https://spot-boston-res.cloudinary.com/image/u...,Case Closed. Closed date : Mon Jan 03 05:59:20...,PWDx,4.0,42.3594,...,401.0,1C,PWDx_Missed Trash\Recycling\Yard Waste\Bulk Item,Street Cleaning,Citizens Connect App,Public Works Department,https://311.boston.gov/media/boston/report/pho...,2022-01-04 08:30:00,Requests for Street Cleaning,4


In [None]:
logistic_test_Xy_0.head(10)

In [None]:
linear_test_Xy_0.head(10)

##Print out cleaned test data to hardcode into unit tests

Now that we've inspected the cleaned data for all scenarios to ensure the output is what we expected, we can print it out for hardcoding.
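
Before hardcoding, it may help to see what survives the round trip through `to_dict`. In the small sketch below (toy values, hypothetical variable names), the data frame is rebuilt from its printed form with a fresh 0-based index, while the series keeps its original index as dictionary keys.

```python
import ast
import pprint
import pandas as pd

cleaned_X = pd.DataFrame(
    {"department_ISD": [1, 0], "department_PWDx": [0, 1]}, index=[3, 7]
)
cleaned_y = pd.Series({3: 1, 7: 0}, name="event")

# orient='list' keeps the column values but drops the index
x_literal = pprint.pformat(cleaned_X.to_dict(orient="list"))
# Series.to_dict() keeps the index as the dictionary keys
y_literal = pprint.pformat(cleaned_y.to_dict())

# Rebuilding from the printed text mimics pasting it into a test function
rebuilt_X = pd.DataFrame(ast.literal_eval(x_literal))
rebuilt_y = pd.Series(ast.literal_eval(y_literal), name="event")
```

If a scenario removes rows, the cleaned frame's index will have gaps that the hardcoded copy does not, so a comparison may need `reset_index(drop=True)` on the cleaned output first.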

In [None]:
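# The scenario 1 and 2 outputs are not created in the cells above. Following the refactoring
# note, we assume the old scenario 1 maps to [1, 2] and the old linear scenario 2 maps to [2].
logistic_test_X_1, logistic_test_y_1 = clean_and_split_for_logistic(test_data_2022, [1, 2])
linear_test_X_1, linear_test_y_1 = clean_and_split_for_linear(test_data_2022, [1, 2])
linear_test_X_2, linear_test_y_2 = clean_and_split_for_linear(test_data_2022, [2])

print("tlogistic_test_X_0 = pd.DataFrame(", end='')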
pprint.pprint(logistic_test_X_0.to_dict(orient='list'))
print(')')

print("tlogistic_test_X_1 = pd.DataFrame(", end='')
pprint.pprint(logistic_test_X_1.to_dict(orient='list'))
print(')')

print("tlinear_test_X_0 = pd.DataFrame(", end='')
pprint.pprint(linear_test_X_0.to_dict(orient='list'))
print(')')

print("tlinear_test_X_1 = pd.DataFrame(", end='')
pprint.pprint(linear_test_X_1.to_dict(orient='list'))
print(')')

print("tlinear_test_X_2 = pd.DataFrame(", end='')
pprint.pprint(linear_test_X_2.to_dict(orient='list'))
print(')')

print("tlogistic_test_y_0 = pd.Series(", end='')
pprint.pprint(logistic_test_y_0.to_dict())
print(')')

print("tlogistic_test_y_1 = pd.Series(", end='')
pprint.pprint(logistic_test_y_1.to_dict())
print(')')

print("tlinear_test_y_0 = pd.Series(", end='')
pprint.pprint(linear_test_y_0.to_dict())
print(')')

print("tlinear_test_y_1 = pd.Series(", end='')
pprint.pprint(linear_test_y_1.to_dict())
print(')')

print("tlinear_test_y_2 = pd.Series(", end='')
pprint.pprint(linear_test_y_2.to_dict())
print(')')


tlogistic_test_X_0 = pd.DataFrame({'department_BTDT': [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
 'department_INFO': [0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
 'department_ISD': [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 'department_PWDx': [0, 1, 1, 0, 0, 0, 0, 1, 0, 1],
 'reason_Building': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'reason_Employee & General Comments': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
 'reason_Enforcement & Abandoned Vehicles': [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
 'reason_Generic Noise Disturbance': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 'reason_Housing': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 'reason_Noise Disturbance': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
 'reason_Sanitation': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
 'reason_Street Cleaning': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
 'reason_Street Lights': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
 'source_Citizens Connect App': [0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
 'source_City Worker App': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
 'source_Constituent Call': [1, 1, 1, 0, 1, 1, 0, 0, 1, 0],
 'subject_Boston Police De

##Here's our self contained unit test function:


In [None]:
from pandas.testing import assert_frame_equal, assert_series_equal

def test_data_clean_functions() :
  #set up the test data
  test_data_2022 = pd.DataFrame({'case_enquiry_id': [101004125189,
                     101004161747,
                     101004149944,
                     101004113302,
                     101004122704,
                     101004122479,
                     101004113310,
                     101004113311,
                     101004113328,
                     101004113550],
 'case_status': ['Open',
                 'Closed',
                 'Open',
                 'Closed',
                 'Open',
                 'Open',
                 'Closed',
                 'Closed',
                 'Open',
                 'Closed'],
 'case_title': ['Illegal Rooming House',
                'PublicWorks: Complaint',
                'Space Savers',
                'Parking Enforcement',
                'DISPATCHED Heat - Excessive  Insufficient',
                'Generic Noise Disturbance',
                'Parking Enforcement',
                'General Lighting Request',
                'Loud Parties/Music/People',
                'Requests for Street Cleaning'],
 'city_council_district': ['4', ' ', '4', '2', '7', '8', '3', '6', '1', '2'],
 'closed_dt': [np.nan,
               '2021-02-02 11:45:47',
               np.nan,
               '2022-01-03 00:13:17',
               np.nan,
               np.nan,
               '2022-01-03 00:13:02',
               '2022-04-02 13:01:14',
               np.nan,
               '2022-05-03 05:59:20'],
 'closedphoto': [np.nan,
                 np.nan,
                 np.nan,
                 np.nan,
                 np.nan,
                 np.nan,
                 np.nan,
                 np.nan,
                 np.nan,
                 'https://spot-boston-res.cloudinary.com/image/upload/v1641207557/boston/production/o0vkrv9zckukp8httr7g.jpg'],
 'closure_reason': [' ',
                    'Case Closed Case Noted    ',
                    ' ',
                    'Case Closed. Closed date : 2022-01-03 00:13:17.393 Case '
                    'Resolved CLEAR ',
                    ' ',
                    ' ',
                    'Case Closed. Closed date : 2022-01-03 00:13:02.72 Case '
                    'Resolved CLEAR ',
                    'Case Closed. Closed date : Sat Apr 02 13:01:14 EDT 2022 '
                    'Noted ',
                    ' ',
                    'Case Closed. Closed date : Mon Jan 03 05:59:20 EST 2022 '
                    'Noted 3 bags of trash collected at intersection of '
                    'Dartmouth and Warren at 5:56 a.m. on Monday 1/3/22. We '
                    'will return on next scheduled trash day. '],
 'department': ['ISD',
                'PWDx',
                'PWDx',
                'BTDT',
                'ISD',
                'INFO',
                'BTDT',
                'PWDx',
                'INFO',
                'PWDx'],
 'fire_district': ['8', ' ', '8', '6', '9', '3', '8', '9', '3', '4'],
 'latitude': [42.2896,
              42.3594,
              42.2876,
              42.3594,
              42.311,
              42.3657,
              42.291,
              42.3594,
              42.3669,
              42.3594],
 'location': ['27 Lithgow St  Dorchester  MA  02124',
              ' ',
              '492 Harvard St  Dorchester  MA  02124',
              'INTERSECTION of Seaport Blvd & Sleeper St  Boston  MA  ',
              '15 Crawford St  Dorchester  MA  02121',
              '50-150 Causeway St  Boston  MA  02114',
              '16 Frost Ave  Dorchester  MA  02122',
              'INTERSECTION of Boylston St & Moraine St  Jamaica Plain  MA  ',
              '194 Salem St  Boston  MA  02113',
              'INTERSECTION of Warren Ave & Dartmouth St  Boston  MA  '],
 'location_street_name': ['27 Lithgow St',
                          np.nan,
                          '492 Harvard St',
                          'INTERSECTION Seaport Blvd & Sleeper St',
                          '15 Crawford St',
                          '50-150 Causeway St',
                          '16 Frost Ave',
                          'INTERSECTION Boylston St & Moraine St',
                          '194 Salem St',
                          'INTERSECTION Warren Ave & Dartmouth St'],
 'location_zipcode': [2124.0,
                      np.nan,
                      2124.0,
                      np.nan,
                      2121.0,
                      2114.0,
                      2122.0,
                      np.nan,
                      2113.0,
                      np.nan],
 'longitude': [-71.0701,
               -71.0587,
               -71.0936,
               -71.0587,
               -71.0841,
               -71.0617,
               -71.0503,
               -71.0587,
               -71.0546,
               -71.0587],
 'neighborhood': ['Dorchester',
                  ' ',
                  'Greater Mattapan',
                  'South Boston / South Boston Waterfront',
                  'Roxbury',
                  'Boston',
                  'Dorchester',
                  'Jamaica Plain',
                  'Downtown / Financial District',
                  'South End'],
 'neighborhood_services_district': ['8',
                                    ' ',
                                    '9',
                                    '5',
                                    '13',
                                    '3',
                                    '7',
                                    '11',
                                    '3',
                                    '6'],
 'ontime': ['OVERDUE',
            'ONTIME',
            'ONTIME',
            'ONTIME',
            'OVERDUE',
            'ONTIME',
            'ONTIME',
            'OVERDUE',
            'ONTIME',
            'ONTIME'],
 'open_dt': ['2023-05-09 12:59:00',
             '2022-02-02 11:42:49',
             '2022-01-28 19:36:00',
             '2022-01-01 00:36:24',
             '2022-01-11 09:47:00',
             '2022-01-10 21:49:00',
             '2022-01-01 01:13:52',
             '2022-01-01 01:14:39',
             '2022-01-01 03:08:00',
             '2022-01-01 13:51:00'],
 'police_district': ['C11',
                     ' ',
                     'B3',
                     'C6',
                     'B2',
                     'A1',
                     'C11',
                     'E13',
                     'A1',
                     'D4'],
 'precinct': ['1706',
              ' ',
              '1411',
              '0601',
              '1202',
              ' ',
              '1607',
              '1903',
              '0302',
              '0401'],
 'pwd_district': ['07', ' ', '07', '05', '10B', '1B', '07', '02', '1B', '1C'],
 'queue': ['ISD_Housing (INTERNAL)',
           'PWDx_General Comments',
           'PWDx_Space Saver Removal',
           'BTDT_Parking Enforcement',
           'ISD_Housing (INTERNAL)',
           'INFO01_GenericeFormforOtherServiceRequestTypes',
           'BTDT_Parking Enforcement',
           'PWDx_Street Light_General Lighting Request',
           'INFO01_GenericeFormforOtherServiceRequestTypes',
           'PWDx_Missed Trash\\Recycling\\Yard Waste\\Bulk Item'],
 'reason': ['Building',
            'Employee & General Comments',
            'Sanitation',
            'Enforcement & Abandoned Vehicles',
            'Housing',
            'Generic Noise Disturbance',
            'Enforcement & Abandoned Vehicles',
            'Street Lights',
            'Noise Disturbance',
            'Street Cleaning'],
 'source': ['Constituent Call',
            'Constituent Call',
            'Constituent Call',
            'Citizens Connect App',
            'Constituent Call',
            'Constituent Call',
            'Citizens Connect App',
            'City Worker App',
            'Constituent Call',
            'Citizens Connect App'],
 'subject': ['Inspectional Services',
             "Mayor's 24 Hour Hotline",
             'Public Works Department',
             'Transportation - Traffic Division',
             'Inspectional Services',
             "Mayor's 24 Hour Hotline",
             'Transportation - Traffic Division',
             'Public Works Department',
             'Boston Police Department',
             'Public Works Department'],
 'submittedphoto': [np.nan,
                    np.nan,
                    np.nan,
                    'https://311.boston.gov/media/boston/report/photos/61cfe84b05bbcf180c293ece/photo_20220101_003547.jpg',
                    np.nan,
                    np.nan,
                    'https://311.boston.gov/media/boston/report/photos/61cff11805bbcf180c2944b1/report.jpg',
                    np.nan,
                    np.nan,
                    'https://311.boston.gov/media/boston/report/photos/61d0a2af05bbcf180c2993e3/report.jpg'],
 'target_dt': ['2022-01-20 12:59:39',
               '2022-02-16 11:42:49',
               np.nan,
               '2022-01-04 08:30:00',
               '2022-02-10 09:47:22',
               np.nan,
               '2022-01-04 08:30:00',
               '2022-02-15 01:14:45',
               np.nan,
               '2022-01-04 08:30:00'],
 'type': ['Illegal Rooming House',
          'General Comments For a Program or Policy',
          'Space Savers',
          'Parking Enforcement',
          'Heat - Excessive  Insufficient',
          'Undefined Noise Disturbance',
          'Parking Enforcement',
          'General Lighting Request',
          'Loud Parties/Music/People',
          'Requests for Street Cleaning'],
 'ward': ['Ward 17',
          ' ',
          'Ward 14',
          '6',
          'Ward 12',
          '03',
          'Ward 16',
          '19',
          'Ward 3',
          '4']})

  #define the expected output
  tlogistic_test_X_0 = pd.DataFrame({'department_BTDT': [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
  'department_INFO': [0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
  'department_ISD': [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
  'department_PWDx': [0, 1, 1, 0, 0, 0, 0, 1, 0, 1],
  'reason_Building': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'reason_Employee & General Comments': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
  'reason_Enforcement & Abandoned Vehicles': [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
  'reason_Generic Noise Disturbance': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
  'reason_Housing': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
  'reason_Noise Disturbance': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
  'reason_Sanitation': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
  'reason_Street Cleaning': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
  'reason_Street Lights': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
  'source_Citizens Connect App': [0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
  'source_City Worker App': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
  'source_Constituent Call': [1, 1, 1, 0, 1, 1, 0, 0, 1, 0],
  'subject_Boston Police Department': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
  'subject_Inspectional Services': [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
  "subject_Mayor's 24 Hour Hotline": [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
  'subject_Public Works Department': [0, 0, 1, 0, 0, 0, 0, 1, 0, 1],
  'subject_Transportation - Traffic Division': [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
  'ward_number_12': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
  'ward_number_14': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
  'ward_number_16': [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
  'ward_number_17': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'ward_number_19': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
  'ward_number_3': [0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
  'ward_number_4': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
  'ward_number_6': [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]}
  )
  tlogistic_test_X_1 = pd.DataFrame({'department_BTDT': [0, 0, 1, 0, 0, 1, 0, 0, 0],
  'department_INFO': [0, 0, 0, 0, 1, 0, 0, 1, 0],
  'department_ISD': [0, 0, 0, 1, 0, 0, 0, 0, 0],
  'department_PWDx': [1, 1, 0, 0, 0, 0, 1, 0, 1],
  'reason_Employee & General Comments': [1, 0, 0, 0, 0, 0, 0, 0, 0],
  'reason_Enforcement & Abandoned Vehicles': [0, 0, 1, 0, 0, 1, 0, 0, 0],
  'reason_Generic Noise Disturbance': [0, 0, 0, 0, 1, 0, 0, 0, 0],
  'reason_Housing': [0, 0, 0, 1, 0, 0, 0, 0, 0],
  'reason_Noise Disturbance': [0, 0, 0, 0, 0, 0, 0, 1, 0],
  'reason_Sanitation': [0, 1, 0, 0, 0, 0, 0, 0, 0],
  'reason_Street Cleaning': [0, 0, 0, 0, 0, 0, 0, 0, 1],
  'reason_Street Lights': [0, 0, 0, 0, 0, 0, 1, 0, 0],
  'source_Citizens Connect App': [0, 0, 1, 0, 0, 1, 0, 0, 1],
  'source_City Worker App': [0, 0, 0, 0, 0, 0, 1, 0, 0],
  'source_Constituent Call': [1, 1, 0, 1, 1, 0, 0, 1, 0],
  'subject_Boston Police Department': [0, 0, 0, 0, 0, 0, 0, 1, 0],
  'subject_Inspectional Services': [0, 0, 0, 1, 0, 0, 0, 0, 0],
  "subject_Mayor's 24 Hour Hotline": [1, 0, 0, 0, 1, 0, 0, 0, 0],
  'subject_Public Works Department': [0, 1, 0, 0, 0, 0, 1, 0, 1],
  'subject_Transportation - Traffic Division': [0, 0, 1, 0, 0, 1, 0, 0, 0],
  'ward_number_12': [0, 0, 0, 1, 0, 0, 0, 0, 0],
  'ward_number_14': [0, 1, 0, 0, 0, 0, 0, 0, 0],
  'ward_number_16': [0, 0, 0, 0, 0, 1, 0, 0, 0],
  'ward_number_19': [0, 0, 0, 0, 0, 0, 1, 0, 0],
  'ward_number_3': [0, 0, 0, 0, 1, 0, 0, 1, 0],
  'ward_number_4': [0, 0, 0, 0, 0, 0, 0, 0, 1],
  'ward_number_6': [0, 0, 1, 0, 0, 0, 0, 0, 0]}
  )
  tlinear_test_X_0 = pd.DataFrame({'department_BTDT': [0, 1, 1, 0, 0],
  'department_INFO': [0, 0, 0, 0, 0],
  'department_ISD': [0, 0, 0, 0, 0],
  'department_PWDx': [1, 0, 0, 1, 1],
  'reason_Building': [0, 0, 0, 0, 0],
  'reason_Employee & General Comments': [1, 0, 0, 0, 0],
  'reason_Enforcement & Abandoned Vehicles': [0, 1, 1, 0, 0],
  'reason_Generic Noise Disturbance': [0, 0, 0, 0, 0],
  'reason_Housing': [0, 0, 0, 0, 0],
  'reason_Noise Disturbance': [0, 0, 0, 0, 0],
  'reason_Sanitation': [0, 0, 0, 0, 0],
  'reason_Street Cleaning': [0, 0, 0, 0, 1],
  'reason_Street Lights': [0, 0, 0, 1, 0],
  'source_Citizens Connect App': [0, 1, 1, 0, 1],
  'source_City Worker App': [0, 0, 0, 1, 0],
  'source_Constituent Call': [1, 0, 0, 0, 0],
  'subject_Boston Police Department': [0, 0, 0, 0, 0],
  'subject_Inspectional Services': [0, 0, 0, 0, 0],
  "subject_Mayor's 24 Hour Hotline": [1, 0, 0, 0, 0],
  'subject_Public Works Department': [0, 0, 0, 1, 1],
  'subject_Transportation - Traffic Division': [0, 1, 1, 0, 0],
  'ward_number_12': [0, 0, 0, 0, 0],
  'ward_number_14': [0, 0, 0, 0, 0],
  'ward_number_16': [0, 0, 1, 0, 0],
  'ward_number_17': [0, 0, 0, 0, 0],
  'ward_number_19': [0, 0, 0, 1, 0],
  'ward_number_3': [0, 0, 0, 0, 0],
  'ward_number_4': [0, 0, 0, 0, 1],
  'ward_number_6': [0, 1, 0, 0, 0]}
  )
  tlinear_test_X_1 = pd.DataFrame({'department_BTDT': [1, 1],
  'department_INFO': [0, 0],
  'department_ISD': [0, 0],
  'department_PWDx': [0, 0],
  'reason_Building': [0, 0],
  'reason_Employee & General Comments': [0, 0],
  'reason_Enforcement & Abandoned Vehicles': [1, 1],
  'reason_Generic Noise Disturbance': [0, 0],
  'reason_Housing': [0, 0],
  'reason_Noise Disturbance': [0, 0],
  'reason_Sanitation': [0, 0],
  'reason_Street Cleaning': [0, 0],
  'reason_Street Lights': [0, 0],
  'source_Citizens Connect App': [1, 1],
  'source_City Worker App': [0, 0],
  'source_Constituent Call': [0, 0],
  'subject_Boston Police Department': [0, 0],
  'subject_Inspectional Services': [0, 0],
  "subject_Mayor's 24 Hour Hotline": [0, 0],
  'subject_Public Works Department': [0, 0],
  'subject_Transportation - Traffic Division': [1, 1],
  'ward_number_12': [0, 0],
  'ward_number_14': [0, 0],
  'ward_number_16': [0, 1],
  'ward_number_17': [0, 0],
  'ward_number_19': [0, 0],
  'ward_number_3': [0, 0],
  'ward_number_4': [0, 0],
  'ward_number_6': [1, 0]}
  )
  tlinear_test_X_2 = pd.DataFrame({'department_BTDT': [1, 1, 0, 0],
  'department_INFO': [0, 0, 0, 0],
  'department_ISD': [0, 0, 0, 0],
  'department_PWDx': [0, 0, 1, 1],
  'reason_Building': [0, 0, 0, 0],
  'reason_Employee & General Comments': [0, 0, 0, 0],
  'reason_Enforcement & Abandoned Vehicles': [1, 1, 0, 0],
  'reason_Generic Noise Disturbance': [0, 0, 0, 0],
  'reason_Housing': [0, 0, 0, 0],
  'reason_Noise Disturbance': [0, 0, 0, 0],
  'reason_Sanitation': [0, 0, 0, 0],
  'reason_Street Cleaning': [0, 0, 0, 1],
  'reason_Street Lights': [0, 0, 1, 0],
  'source_Citizens Connect App': [1, 1, 0, 1],
  'source_City Worker App': [0, 0, 1, 0],
  'source_Constituent Call': [0, 0, 0, 0],
  'subject_Boston Police Department': [0, 0, 0, 0],
  'subject_Inspectional Services': [0, 0, 0, 0],
  "subject_Mayor's 24 Hour Hotline": [0, 0, 0, 0],
  'subject_Public Works Department': [0, 0, 1, 1],
  'subject_Transportation - Traffic Division': [1, 1, 0, 0],
  'ward_number_12': [0, 0, 0, 0],
  'ward_number_14': [0, 0, 0, 0],
  'ward_number_16': [0, 1, 0, 0],
  'ward_number_17': [0, 0, 0, 0],
  'ward_number_19': [0, 0, 1, 0],
  'ward_number_3': [0, 0, 0, 0],
  'ward_number_4': [0, 0, 0, 1],
  'ward_number_6': [1, 0, 0, 0]}
  )
  tlogistic_test_y_0 = pd.Series({0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 0, 6: 1, 7: 1, 8: 0, 9: 1}
  )
  tlogistic_test_y_1 = pd.Series({1: 1, 2: 0, 3: 1, 4: 0, 5: 0, 6: 1, 7: 0, 8: 0, 9: 0}
  )
  tlinear_test_y_0 = pd.Series({1: -8759.950555555555,
  3: 47.61472222222222,
  6: 46.986111111111114,
  7: 2195.776388888889,
  9: 2920.1388888888887}
  )
  tlinear_test_y_1 = pd.Series({3: 47.61472222222222, 6: 46.986111111111114}
  )
  tlinear_test_y_2 = pd.Series({3: 47.61472222222222,
  6: 46.986111111111114,
  7: 2195.776388888889,
  9: 2920.1388888888887}
  )

  #call the function with the test data
  logistic_test_X_0, logistic_test_y_0 = clean_and_split_for_logistic(test_data_2022, [0])
  logistic_test_X_1, logistic_test_y_1 = clean_and_split_for_logistic(test_data_2022, [1, 2])

  linear_test_X_0, linear_test_y_0 = clean_and_split_for_linear(test_data_2022, [0])
  linear_test_X_1, linear_test_y_1 = clean_and_split_for_linear(test_data_2022, [1, 2])
  linear_test_X_2, linear_test_y_2 = clean_and_split_for_linear(test_data_2022, [2])


  #check if the function output matches the expected output when reindexed

  test_data = [
      (logistic_test_X_0, tlogistic_test_X_0),
      (logistic_test_X_1, tlogistic_test_X_1),
      (linear_test_X_0, tlinear_test_X_0),
      (linear_test_X_1, tlinear_test_X_1),
      (linear_test_X_2, tlinear_test_X_2),
      (logistic_test_y_0, tlogistic_test_y_0),
      (logistic_test_y_1, tlogistic_test_y_1),
      (linear_test_y_0, tlinear_test_y_0),
      (linear_test_y_1, tlinear_test_y_1),
      (linear_test_y_2, tlinear_test_y_2)
  ]

  for data, expected in test_data:
        if isinstance(data, pd.DataFrame):
            # Sort rows and columns so ordering differences do not cause failures
            data = data.sort_index(axis=0).sort_index(axis=1)
            expected = expected.sort_index(axis=0).sort_index(axis=1)
            # Reset the index so the row labels do not have to match
            data = data.reset_index(drop=True)
            expected = expected.reset_index(drop=True)
            # Compare the DataFrames and assert that they are equal
            assert_frame_equal(data, expected, check_dtype=False)
        elif isinstance(data, pd.Series):
            # Drop the Series name so it matches the unnamed expected Series
            data = data.rename(None)
            # Compare the Series and assert that they are equal
            assert_series_equal(data, expected, check_dtype=False)



##Call Unit Test Function

In [None]:
test_data_clean_functions()

##Ingest all the data 

Here is the link to all the data sets:

https://data.boston.gov/dataset/311-service-requests

In [None]:

# List the yearly CSV sources to load, 2023 back to 2011
all_files = [url_2023, url_2022, url_2021, url_2020, url_2019, url_2018, url_2017, url_2016, url_2015, url_2014, url_2013, url_2012, url_2011]

# Create an empty list to store the dataframes
dfs = []

# Loop through the files and load them into dataframes
for file in all_files:
  df = pd.read_csv(file)
  dfs.append(df)



In [None]:
#check that the files all have the same number of columns, and the same names
same_list_num_col = []
diff_list_num_col = []
same_list_order_col = []
diff_list_order_col = []

for i in range(len(dfs)):

  if dfs[i].shape[1] != dfs[0].shape[1]:
    #print('Error: File', i, 'does not have the same number of columns as File 0')
    diff_list_num_col.append(i)
  else:
    #print('File', i, 'has same number of columns as File 0')
    same_list_num_col.append(i)
  if not dfs[i].columns.equals(dfs[0].columns):
    #print('Error: File', i, 'does not have the same column names and order as File 0')
    diff_list_order_col.append(i)
  else:
    #print('File', i, 'has the same column name and order as File 0')
    same_list_order_col.append(i)

print("Files with different number of columns from File 0: ", diff_list_num_col)
print("Files with same number of columns as File 0: ", same_list_num_col)
print("Files with different column order from File 0: ", diff_list_order_col)
print("Files with same column order as File 0: ", same_list_order_col)

Files with different number of columns from File 0:  []
Files with same number of columns as File 0:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
Files with different column order from File 0:  []
Files with same column order as File 0:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]


In [None]:
# Concatenate the dataframes into a single dataframe
df_all = pd.concat(dfs, ignore_index=True)

In [None]:
#save RAM by deleting the dfs variable
del dfs

##Examine our source data for patterns to inform scenarios

##Clean and split our data for training models on scenarios 1, 2, 3, 4

In [None]:
logistic_X, logistic_y = clean_and_split_for_logistic(df_all, [1, 2, 3, 4])

In [None]:
linear_X, linear_y = clean_and_split_for_linear(df_all, [1, 2, 3, 4])

##Train Models with Early Stopping

This time we add early stopping to our model training, monitoring the validation loss, and we add a validation split to our logistic regression model.
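To make the `patience` setting concrete, here is a minimal pure-Python sketch of the rule that early stopping applies: training stops once the validation loss has failed to improve on its best value for `patience` consecutive epochs. This is a simplified illustration, not the Keras implementation, and the function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops early."""
    best = float("inf")
    wait = 0  # consecutive epochs without an improvement on the best loss
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: the best value is reached at epoch 2,
# then two epochs in a row fail to beat it.
hypothetical_val_losses = [10.0, 9.0, 9.5, 9.2, 8.0]
print(early_stopping_epoch(hypothetical_val_losses, patience=2))  # 4
```

With `patience=2`, a run that stops at epoch 4 reached its best validation loss at epoch 2. Note that the fifth value in the list is never seen, which is the trade-off of a small patience.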

In [None]:
#Train a logistic regression model

start_time = datetime.now()
print("Starting Training at {}".format(start_time))

# Split into training, validation, and test sets
X_train, X_test, y_train, y_test = train_test_split(logistic_X, logistic_y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Build model
model_log_1234 = keras.Sequential([
    keras.layers.Dense(units=1, input_shape=(X_train.shape[1],), activation='sigmoid')
])

# Compile model
model_log_1234.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Define early stopping callback
early_stopping = keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)

# Train model with early stopping
model_log_1234.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val), callbacks=[early_stopping])

# Evaluate model
test_loss, test_acc = model_log_1234.evaluate(X_test, y_test)

print('Test accuracy:', test_acc)

end_time = datetime.now()
total_time = (end_time - start_time)
print("Ending Training at {}".format(end_time))
print("Training took {}".format(total_time))

Starting Training at 2023-05-11 17:03:46.369439
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.8684390783309937
Ending Training at 2023-05-11 17:07:12.247633
Training took 0:03:25.878194


In [None]:
model_log_1234.save("model_log_1234.h5")
files.download('model_log_1234.h5')

In [None]:
#Train a linear regression model

start_time = datetime.now()
print("Starting Training at {}".format(start_time))

scaler = StandardScaler()
X_scaled = scaler.fit_transform(linear_X) # scale the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, linear_y, test_size=0.2, random_state=42)

# split the data again to create a validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# define the model architecture
model_lin_1234 = keras.Sequential([
    keras.layers.Dense(units=1, input_dim=X_train.shape[1])
])

# compile the model
model_lin_1234.compile(optimizer='adam', loss='mean_squared_error')

# train the model
# we are adding early stopping based on the validation loss
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, verbose=1, mode='min')
history = model_lin_1234.fit(X_train, y_train, epochs=50, batch_size=32, verbose=1, validation_data=(X_val, y_val), callbacks=[early_stop])

end_time = datetime.now()
total_time = (end_time - start_time)
print("Ending Training at {}".format(end_time))
print("Training took {}".format(total_time))

Starting Training at 2023-05-11 17:07:12.258215
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 4: early stopping
Ending Training at 2023-05-11 17:10:08.728977
Training took 0:02:56.470762


In [None]:
model_lin_1234.save("model_lin_1234.h5")
files.download('model_lin_1234.h5')

##Clean and split our data for training models on scenarios 1, 2, 3, 4, 5

In [None]:
logistic_X, logistic_y = clean_and_split_for_logistic(df_all, [1, 2, 3, 4, 5])

In [None]:
linear_X, linear_y = clean_and_split_for_linear(df_all, [1, 2, 3, 4, 5])

In [None]:
#delete df_all to save RAM
del df_all

##Train Models with Early Stopping

We train the same two models as before on the new scenario set, again with early stopping based on the validation loss and a validation split for the logistic regression model.

In [None]:
#Train a logistic regression model

start_time = datetime.now()
print("Starting Training at {}".format(start_time))

# Split into training, validation, and test sets
X_train, X_test, y_train, y_test = train_test_split(logistic_X, logistic_y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Build model
model_log_12345 = keras.Sequential([
    keras.layers.Dense(units=1, input_shape=(X_train.shape[1],), activation='sigmoid')
])

# Compile model
model_log_12345.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Define early stopping callback
early_stopping = keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)

# Train model with early stopping
model_log_12345.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val), callbacks=[early_stopping])

# Evaluate model
test_loss, test_acc = model_log_12345.evaluate(X_test, y_test)

print('Test accuracy:', test_acc)

end_time = datetime.now()
total_time = (end_time - start_time)
print("Ending Training at {}".format(end_time))
print("Training took {}".format(total_time))

Starting Training at 2023-05-11 17:11:10.537063
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Test accuracy: 0.911194384098053
Ending Training at 2023-05-11 17:14:40.230726
Training took 0:03:29.693663


In [None]:
model_log_12345.save("model_log_12345.h5")
files.download('model_log_12345.h5')

In [None]:
#Train a linear regression model

start_time = datetime.now()
print("Starting Training at {}".format(start_time))

scaler = StandardScaler()
X_scaled = scaler.fit_transform(linear_X) # scale the data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, linear_y, test_size=0.2, random_state=42)

# split the data again to create a validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# define the model architecture
model_lin_12345 = keras.Sequential([
    keras.layers.Dense(units=1, input_dim=X_train.shape[1])
])

# compile the model
model_lin_12345.compile(optimizer='adam', loss='mean_squared_error')

# train the model
# we are adding early stopping based on the validation loss
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, verbose=1, mode='min')
history = model_lin_12345.fit(X_train, y_train, epochs=50, batch_size=32, verbose=1, validation_data=(X_val, y_val), callbacks=[early_stop])

end_time = datetime.now()
total_time = (end_time - start_time)
print("Ending Training at {}".format(end_time))
print("Training took {}".format(total_time))

Starting Training at 2023-05-11 17:21:37.680961
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 3: early stopping
Ending Training at 2023-05-11 17:23:55.004931
Training took 0:02:17.323970


In [None]:
model_lin_12345.save("model_lin_12345.h5")
files.download('model_lin_12345.h5')

In [None]:
import gc
gc.collect()

1576

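Results varied across these models: the logistic model's test accuracy was about 0.868 on scenarios 1 through 4 and about 0.911 with scenario 5 added, while the linear models stopped early after only three or four epochs. Now that we have several scenarios, we need an easy way to compare the performance of these models.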
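One possible starting point is to collect each run's metrics in a small table and sort it. The sketch below uses only the two logistic test accuracies printed above; the `rank_models` helper and the record layout are hypothetical.

```python
# Test accuracies printed by the two logistic training runs in this notebook.
results = [
    {"model": "model_log_1234", "scenarios": [1, 2, 3, 4], "test_accuracy": 0.8684390783309937},
    {"model": "model_log_12345", "scenarios": [1, 2, 3, 4, 5], "test_accuracy": 0.911194384098053},
]


def rank_models(results, metric="test_accuracy", higher_is_better=True):
    """Return the result records sorted from best to worst on the given metric."""
    return sorted(results, key=lambda r: r[metric], reverse=higher_is_better)


ranked = rank_models(results)
for row in ranked:
    print(f"{row['model']:<16} scenarios={row['scenarios']} accuracy={row['test_accuracy']:.4f}")
```

One caveat: each scenario set filters the data differently, so these accuracies come from different test sets and are not strictly comparable. Evaluating every model on one shared held-out set would give a fairer ranking.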

We are also reaching the memory limits of Google Colaboratory. When we train our models, we could delete the test data right after splitting it to save RAM. If we want to keep it, we can write it to a file and download it before deleting it.

Additionally, we want to delete any intermediate data frames created during one training run before starting the next. A clean way to do that is to put our data splitting and training inside functions, so that when a function returns its local variables go out of scope and the RAM they used can be freed. Anything that needs to be kept can be saved to files.
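The pattern might look like the sketch below, which uses only the standard library so that it runs anywhere. Everything in it is a stand-in: `Split` plays the role of a large data frame, the "training" step is a placeholder computation, and `train_in_scope` is a hypothetical name. The weak reference is returned only so we can observe that the training split is released once the function returns.

```python
import gc
import pickle
import tempfile
import weakref
from pathlib import Path


class Split:
    """Stand-in for a large train or test data frame (hypothetical)."""

    def __init__(self, rows):
        self.rows = rows


def train_in_scope(rows, test_path):
    """Split the data, save the test split to disk, 'train', and return only a small summary."""
    cut = int(len(rows) * 0.8)
    train, test = Split(rows[:cut]), Split(rows[cut:])

    # Persist the test split so it can be reloaded later, then drop it from memory.
    with open(test_path, "wb") as fh:
        pickle.dump(test.rows, fh)
    del test

    # Placeholder for model fitting: compute a simple statistic on the training rows.
    summary = {"n_train": len(train.rows), "mean": sum(train.rows) / len(train.rows)}

    # The weak reference does not keep `train` alive; it only lets the caller check on it.
    return summary, weakref.ref(train)


test_path = Path(tempfile.mkdtemp()) / "test_split.pkl"
summary, train_ref = train_in_scope(list(range(1000)), test_path)
gc.collect()

print(summary)
print("train split released:", train_ref() is None)
```

In CPython the local `train` object is released as soon as the function returns, because nothing outside the function still refers to it. Whether that memory goes back to the operating system right away depends on the allocator, so the saving in Colab should be measured rather than assumed.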