In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler

from IPython.display import display

%matplotlib inline

#first of course we must import the necessary modules

#Boston 311 v3 - Exploring Outliers and Adjusting Data Cleaning Functions

What type of outliers do we have in our data? Let's refer back to the first notebook and take a look at the graphs of our features.

1. subject - 8 of our 10 subject categories have very few cases compared to the other 2
2. reason - Of the more than 40 reason categories, about half have very few records associated with them
3. department - Of our 16 department categories, about half have very few records associated with them
4. source - We have five source categories, but the vast majority of our data is isolated to two of them
5. ward_number - Our ward data is actually fairly normally distributed

Of our five features, only the ward has no underrepresented categories.

For our labels, the binary label of "Open" or "Closed" shows that the vast majority of cases are eventually closed. Additionally, we don't know whether the cases that are still open in this data set are later closed in the 2023 data set.
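To put a number on how lopsided the label is, a small helper like the one below can report counts and shares per label value. This is a minimal sketch: `label_balance` is a hypothetical helper name, and the toy frame stands in for `df2022` with made-up rows.

```python
import pandas as pd

def label_balance(frame: pd.DataFrame, label_col: str = "case_status") -> pd.DataFrame:
    """Return the count and share of each label value, most common first."""
    counts = frame[label_col].value_counts(dropna=False)
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})

# Toy stand-in for df2022 (made-up rows, not real 311 data)
toy = pd.DataFrame({"case_status": ["Closed"] * 9 + ["Open"]})
balance = label_balance(toy)
print(balance)
```

On the real data the call would be `label_balance(df2022)`, assuming the label column is `case_status` as in the output shown later in this notebook.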

Let's count these minority categories and see what's in them:

In [None]:
df2022 = pd.read_csv("https://data.boston.gov/dataset/8048697b-ad64-4bfc-b090-ee00169f2323/resource/81a7b022-f8fc-4da5-80e4-b160058ca207/download/tmph4izx_fb.csv",
                            parse_dates=['open_dt', 'target_dt', 'closed_dt'])

In [None]:
subject_counts = df2022['subject'].value_counts()
reason_counts = df2022['reason'].value_counts()
department_counts = df2022['department'].value_counts()
source_counts = df2022['source'].value_counts()

In [None]:
#minority subjects:
df2022['subject'].value_counts()


Public Works Department              145989
Transportation - Traffic Division     75015
Inspectional Services                 19276
Parks & Recreation Department         16712
Mayor's 24 Hour Hotline               10610
Animal Control                         4027
Property Management                    2902
Boston Water & Sewer Commission        1469
Boston Police Department                701
Neighborhood Services                    22
Name: subject, dtype: int64

In [None]:
df2022['reason'].value_counts()

Enforcement & Abandoned Vehicles     62656
Street Cleaning                      40582
Code Enforcement                     30933
Sanitation                           29538
Highway Maintenance                  26682
Signs & Signals                      10839
Recycling                             8944
Trees                                 8358
Street Lights                         8224
Park Maintenance & Safety             8052
Housing                               7116
Needle Program                        6845
Building                              6065
Environmental Services                4764
Animal Issues                         4027
Graffiti                              2902
Administrative & General Requests     2075
Employee & General Comments           1835
Health                                1283
Abandoned Bicycle                     1057
Noise Disturbance                      701
Traffic Management & Engineering       646
Notification                           616
Catchbasin 

In [None]:
df2022['department'].value_counts()

PWDx    136719
BTDT     74856
ISD      18473
PARK     15697
INFO     14643
GEN_      6872
BWSC      4935
PROP      3079
ANML       736
BPD_       298
BHA_       147
BPS_       140
ONS_        79
DND_        27
DISB        19
ECON         3
Name: department, dtype: int64

In [None]:
df2022['source'].value_counts()

Citizens Connect App    140829
Constituent Call         97507
City Worker App          24618
Self Service              8756
Employee Generated        5013
Name: source, dtype: int64

The graphs were a little deceptive for some of these categories. 

Subjects - The smallest category is Neighborhood Services, with 22 records, but the next smallest is Boston Police Department, with 701 records. It doesn't seem like a good idea to remove the BPD records, but we can look at the Neighborhood Services records:

In [None]:
display(df2022[df2022['subject'] == "Neighborhood Services"])

Unnamed: 0,case_enquiry_id,open_dt,target_dt,closed_dt,ontime,case_status,closure_reason,case_title,subject,reason,...,police_district,neighborhood,neighborhood_services_district,ward,precinct,location_street_name,location_zipcode,latitude,longitude,source
926,101004126469,2022-01-14 16:42:00,NaT,2022-02-14 11:42:58,ONTIME,Closed,Case Closed. Closed date : 2022-02-14 11:42:58...,Dumpster & Loading Noise Disturbances,Neighborhood Services,Neighborhood Services Issues,...,A1,Back Bay,6,Ward 5,501,200 Stuart St,2116.0,42.3505,-71.0676,Constituent Call
12728,101004290312,2022-05-13 05:49:00,NaT,2022-05-23 17:37:39,ONTIME,Closed,Case Closed. Closed date : 2022-05-23 17:37:39...,Dumpster & Loading Noise Disturbances,Neighborhood Services,Neighborhood Services Issues,...,D14,Allston / Brighton,14,Ward 21,2103,89 Gardner St,2134.0,42.3533,-71.1259,Constituent Call
35510,101004148796,2022-01-27 19:51:00,NaT,2022-05-17 17:13:06,ONTIME,Closed,Case Closed. Closed date : 2022-05-17 17:13:06...,Dumpster & Loading Noise Disturbances,Neighborhood Services,Neighborhood Services Issues,...,A1,Downtown / Financial District,4,Ward 3,308,2 Winter Pl,2111.0,42.3555,-71.0614,Constituent Call
42381,101004123940,2022-01-12 11:53:00,NaT,2022-02-14 11:29:06,ONTIME,Closed,Case Closed. Closed date : 2022-02-14 11:29:06...,Dumpster & Loading Noise Disturbances,Neighborhood Services,Neighborhood Services Issues,...,C11,Dorchester,8,Ward 15,1503,75 Coleman St,2125.0,42.3076,-71.0672,Constituent Call
46088,101004116461,2022-01-05 04:52:00,NaT,2022-02-14 11:37:58,ONTIME,Closed,Case Closed. Closed date : 2022-02-14 11:37:58...,Dumpster & Loading Noise Disturbances,Neighborhood Services,Neighborhood Services Issues,...,E18,Hyde Park,10,18,1820,INTERSECTION Industrial Dr & Milton St,,42.3594,-71.0587,Constituent Call
49124,101004164889,2022-02-04 04:53:38,NaT,2022-02-08 12:41:34,ONTIME,Closed,Case Closed. Closed date : 2022-02-08 12:41:34...,Dumpster & Loading Noise Disturbances,Neighborhood Services,Neighborhood Services Issues,...,C6,South Boston / South Boston Waterfront,5,6,601,INTERSECTION W Second St & W Third St,,42.3594,-71.0587,Constituent Call
56524,101004163038,2022-02-02 22:14:00,NaT,2022-02-15 14:09:03,ONTIME,Closed,Case Closed. Closed date : 2022-02-15 14:09:03...,Dumpster & Loading Noise Disturbances,Neighborhood Services,Neighborhood Services Issues,...,B2,Roxbury,13,Ward 8,807,46 Blue Hill Ave,2119.0,42.3235,-71.0761,Constituent Call
75488,101004227021,2022-03-20 12:35:00,NaT,2022-04-13 10:51:40,ONTIME,Closed,Case Closed. Closed date : 2022-04-13 10:51:40...,Dumpster & Loading Noise Disturbances,Neighborhood Services,Neighborhood Services Issues,...,A1,Downtown / Financial District,3,3,304,INTERSECTION Thacher St & Endicott St,,42.3594,-71.0587,Constituent Call
81422,101004257519,2022-04-15 04:24:00,NaT,2022-04-22 18:04:48,ONTIME,Closed,Case Closed. Closed date : 2022-04-22 18:04:48...,Dumpster & Loading Noise Disturbances,Neighborhood Services,Neighborhood Services Issues,...,D4,South End,6,Ward 8,802,81 E Brookline St,2118.0,42.3375,-71.0702,Constituent Call
121951,101004169889,2022-02-08 04:54:00,NaT,2022-02-14 13:48:14,ONTIME,Closed,Case Closed. Closed date : 2022-02-14 13:48:14...,Dumpster & Loading Noise Disturbances,Neighborhood Services,Neighborhood Services Issues,...,C11,Dorchester,8,15,1503,278 Bowdoin St,2125.0,42.3076,-71.0668,Constituent Call




These 22 cases are all dumpster and loading noise complaints related to private trash pickup, and almost all of them mention that, according to law, the state cannot do anything about private trash pickup times. Interestingly, they all have the same case_title. Is it possible that case_title is a meaningful, controlled category? Let's count its distinct values:

In [None]:
# number of distinct case titles (NaN excluded)
df2022['case_title'].nunique()

6575

With 6575 distinct values, case_title is probably not a good categorical variable for us right now.
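The same cardinality check can be run across several candidate columns at once. The sketch below uses a hypothetical helper, `cardinality_report`, and a toy frame with made-up values. It reports the number of distinct values, the share of rows taken by the most common value, and how many values appear only once.

```python
import pandas as pd

def cardinality_report(frame: pd.DataFrame, columns) -> pd.DataFrame:
    """Summarize how many distinct values each column has and how concentrated they are."""
    rows = []
    for col in columns:
        counts = frame[col].value_counts()  # NaN is excluded by default
        rows.append({
            "column": col,
            "n_unique": len(counts),
            "top_share": counts.iloc[0] / len(frame) if len(counts) else 0.0,
            "n_singletons": int((counts == 1).sum()),
        })
    return pd.DataFrame(rows).set_index("column")

# Toy stand-in for df2022 (made-up values)
toy = pd.DataFrame({
    "case_title": ["a", "b", "c", "d", "a"],
    "source": ["app", "app", "call", "app", "call"],
})
report = cardinality_report(toy, ["case_title", "source"])
print(report)
```

A column with thousands of distinct values and many singletons, as case_title appears to be, would need grouping or text processing before it could serve as a categorical feature.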

What about the cases where the reason has fewer than 25 records?

In [None]:
reason_filter = df2022['reason'].isin(reason_counts[reason_counts < 25].index)
display(df2022[reason_filter])


Unnamed: 0,case_enquiry_id,open_dt,target_dt,closed_dt,ontime,case_status,closure_reason,case_title,subject,reason,...,police_district,neighborhood,neighborhood_services_district,ward,precinct,location_street_name,location_zipcode,latitude,longitude,source
926,101004126469,2022-01-14 16:42:00,NaT,2022-02-14 11:42:58,ONTIME,Closed,Case Closed. Closed date : 2022-02-14 11:42:58...,Dumpster & Loading Noise Disturbances,Neighborhood Services,Neighborhood Services Issues,...,A1,Back Bay,6,Ward 5,0501,200 Stuart St,2116.0,42.3505,-71.0676,Constituent Call
3165,101004150433,2022-01-29 14:00:00,NaT,NaT,ONTIME,Open,,Fire Department,Mayor's 24 Hour Hotline,Fire Department,...,C11,Dorchester,8,Ward 15,1502,3 Davidson Ave,2121.0,42.3064,-71.0710,Constituent Call
12728,101004290312,2022-05-13 05:49:00,NaT,2022-05-23 17:37:39,ONTIME,Closed,Case Closed. Closed date : 2022-05-23 17:37:39...,Dumpster & Loading Noise Disturbances,Neighborhood Services,Neighborhood Services Issues,...,D14,Allston / Brighton,14,Ward 21,2103,89 Gardner St,2134.0,42.3533,-71.1259,Constituent Call
19321,101004387857,2022-07-13 23:28:00,NaT,NaT,ONTIME,Open,,Aircraft Noise Disturbance,Mayor's 24 Hour Hotline,Massport,...,A15,Charlestown,2,Ward 2,0206,32 Mead St,2129.0,42.3803,-71.0684,Constituent Call
19608,101004400706,2022-07-24 21:32:00,2022-08-08 08:30:00,NaT,OVERDUE,Open,,Valet Parking Problems,Transportation - Traffic Division,Valet,...,A1,Downtown / Financial District,3,Ward 3,0303,10 Garden Court St,2113.0,42.3644,-71.0532,Constituent Call
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
258838,101004365592,2022-06-25 13:25:15,NaT,2022-06-25 13:29:59,ONTIME,Closed,Case Closed. Closed date : 2022-06-25 13:29:59...,Phone Bank Service Inquiry,Transportation - Traffic Division,Office of The Parking Clerk,...,,,,,,,,42.3594,-71.0587,Constituent Call
266092,101004464909,2022-09-02 12:34:00,NaT,NaT,ONTIME,Open,,Billing Complaint,Boston Water & Sewer Commission,Billing,...,C11,Dorchester,7,Ward 13,1309,99 Cushing Ave,2125.0,42.3134,-71.0630,Constituent Call
271889,101004545274,2022-10-31 14:47:37,NaT,2022-11-01 06:15:57,ONTIME,Closed,Case Closed. Closed date : 2022-11-01 06:15:57...,Bridge Maintenance,Public Works Department,Bridge Maintenance,...,A1,Beacon Hill,14,Ward 5,0503,53 Chestnut St,2108.0,42.3574,-71.0689,Constituent Call
273937,101004581723,2022-11-25 10:12:40,NaT,2022-11-25 10:13:56,ONTIME,Closed,Case Closed. Closed date : 2022-11-25 10:13:56...,Phone Bank Service Inquiry,Transportation - Traffic Division,Office of The Parking Clerk,...,,,,,,,,42.3594,-71.0587,Constituent Call




This set of records includes all the Neighborhood Services records we just looked at; the rest are mostly open and contain little information. They may have been routed to other departments that never report back to 311 to have the case closed. The remaining reason categories are:



```
Fire Department                         12
Office of The Parking Clerk              9
Bridge Maintenance                       7
Billing                                  7
Massport                                 6
Valet                                    4
Alert Boston                             3
MBTA                                     1
```
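The impression that these rare-reason records are mostly open can be checked with a cross-tabulation of reason against case status. The following is a sketch only: `rare_category_status` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd

def rare_category_status(frame, column, threshold, status_col="case_status"):
    """Cross-tabulate status for the values of `column` that occur fewer than `threshold` times."""
    counts = frame[column].value_counts()
    rare_values = counts[counts < threshold].index
    rare_rows = frame[frame[column].isin(rare_values)]
    return pd.crosstab(rare_rows[column], rare_rows[status_col])

# Toy stand-in for df2022 (made-up rows)
toy = pd.DataFrame({
    "reason": ["Sanitation"] * 5 + ["Valet", "Valet", "MBTA"],
    "case_status": ["Closed"] * 5 + ["Open", "Closed", "Open"],
})
table = rare_category_status(toy, "reason", threshold=3)
print(table)
```

With the real data, `rare_category_status(df2022, 'reason', 25)` would show the open and closed counts for each minority reason in one table.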

Some of our category values might be obsolete. Since our goal is to predict time to close and whether a case will be closed, it would be a good idea to look at the currently available Android app and see which values a user can select and which categories might instead be assigned by 311 agents after receiving a new case. Let's add this to the to-dos at the end of this notebook.
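Once a list of currently selectable values has been collected, comparing it with the data is a simple set operation. In the sketch below, `split_by_current_options` is a hypothetical helper and `hypothetical_app_reasons` is a made-up list for illustration, not the app's real menu.

```python
import pandas as pd

def split_by_current_options(frame, column, current_options):
    """Compare values observed in the data with a list of currently offered options."""
    observed = set(frame[column].dropna().unique())
    current = set(current_options)
    return {
        "still_offered": sorted(observed & current),
        "possibly_obsolete": sorted(observed - current),
        "never_observed": sorted(current - observed),
    }

# Toy stand-in for df2022 and a made-up option list (both are assumptions)
toy = pd.DataFrame({"reason": ["Sanitation", "Valet", "MBTA", "Sanitation"]})
hypothetical_app_reasons = ["Sanitation", "Street Lights"]
result = split_by_current_options(toy, "reason", hypothetical_app_reasons)
print(result)
```

Values that land in `possibly_obsolete` are only candidates: they may be assigned by agents rather than chosen by users, which is the distinction this to-do is meant to clarify.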

For now let's keep looking at our minority category values, continuing with any department category value with fewer than 30 records:



In [None]:
department_filter = df2022['department'].isin(department_counts[department_counts < 30].index)
display(df2022[department_filter])

Unnamed: 0,case_enquiry_id,open_dt,target_dt,closed_dt,ontime,case_status,closure_reason,case_title,subject,reason,...,police_district,neighborhood,neighborhood_services_district,ward,precinct,location_street_name,location_zipcode,latitude,longitude,source
546,101004114962,2022-01-03 14:21:00,2022-01-04 14:21:27,2022-05-05 11:27:24,OVERDUE,Closed,Case Closed. Closed date : 2022-05-05 11:27:24...,Parking Enforcement,Transportation - Traffic Division,Enforcement & Abandoned Vehicles,...,C6,South Boston / South Boston Waterfront,5.0,Ward 7,704.0,284 W Fifth St,2127.0,42.335,-71.0483,Citizens Connect App
10877,101004271503,2022-04-27 12:14:00,2023-04-27 12:14:02,NaT,ONTIME,Open,,Tree Maintenance Requests,Parks & Recreation Department,Trees,...,E18,Hyde Park,10.0,Ward 18,1808.0,57 Collins St,2136.0,42.2714,-71.117,Constituent Call
15393,101004338627,2022-06-12 09:30:00,2022-06-20 08:30:00,NaT,OVERDUE,Open,,Ground Maintenance: --Not in list-- - BPRD,Parks & Recreation Department,Park Maintenance & Safety,...,B2,Roxbury,13.0,12,1207.0,INTERSECTION Sonoma St & Maple St,,42.3594,-71.0587,Constituent Call
21213,101004421186,2022-08-09 18:02:00,NaT,2022-08-16 11:03:36,ONTIME,Closed,Case Closed. Closed date : 2022-08-16 11:03:36...,Transportation General Request,Transportation - Traffic Division,Administrative & General Requests,...,C11,Dorchester,7.0,Ward 16,1605.0,47 Houghton St,2122.0,42.2948,-71.0516,Constituent Call
21632,101004426281,2022-08-13 14:40:00,NaT,2022-09-07 09:36:56,ONTIME,Closed,Case Closed. Closed date : 2022-09-07 09:36:56...,Abandoned Bicycle,Mayor's 24 Hour Hotline,Abandoned Bicycle,...,A7,East Boston,1.0,1,110.0,INTERSECTION Neptune Rd & Bennington St,,42.3594,-71.0587,Citizens Connect App
22250,101004444094,2022-08-19 10:24:00,2022-09-18 10:24:56,2022-09-09 09:54:03,ONTIME,Closed,Case Closed. Closed date : 2022-09-09 09:54:03...,New Sign Crosswalk or Pavement Marking,Transportation - Traffic Division,Signs & Signals,...,D4,Back Bay,6.0,4,402.0,INTERSECTION Ring Rd & Boylston St,,42.3594,-71.0587,Constituent Call
27681,101004518642,2022-10-06 15:25:00,NaT,2022-11-07 15:01:01,ONTIME,Closed,Case Closed. Closed date : 2022-11-07 15:01:01...,Transportation General Request,Transportation - Traffic Division,Administrative & General Requests,...,A15,Charlestown,2.0,Ward 2,207.0,57 Baldwin St,2129.0,42.3819,-71.0702,Constituent Call
33910,101004128101,2022-01-17 09:52:00,2023-01-17 09:52:43,2022-01-19 16:57:43,ONTIME,Closed,Case Closed. Closed date : 2022-01-19 16:57:43...,Tree Maintenance Requests,Parks & Recreation Department,Trees,...,B2,Roxbury,13.0,Ward 11,1102.0,126 Thornton St,2119.0,42.3228,-71.0914,Citizens Connect App
34205,101004592444,2022-12-06 08:35:00,2022-12-07 08:35:31,2022-12-06 14:20:39,ONTIME,Closed,Case Closed. Closed date : 2022-12-06 14:20:39...,Tree Emergencies,Parks & Recreation Department,Trees,...,C11,Dorchester,8.0,Ward 15,1504.0,26 Ronan St,2125.0,42.311,-71.0662,Constituent Call
34207,101004592454,2022-12-06 08:38:00,2023-12-06 08:38:41,2022-12-07 10:27:35,ONTIME,Closed,Case Closed. Closed date : 2022-12-07 10:27:35...,Tree Maintenance Requests,Parks & Recreation Department,Trees,...,C11,Dorchester,8.0,Ward 15,1504.0,26 Ronan St,2125.0,42.311,-71.0662,Constituent Call




These cases appear to fall into three meaningful groups, one per department involved. This suggests that department is a very meaningful categorical variable and that its minority values matter as well. The three groups are disability-related requests, which we would definitely want to keep; tree maintenance requests, another important category; and requests related to outdoor dining, which were routed to an "ECON" department. We might want to compare a basic model that uses only the department value as a feature against our more complex models, as a baseline for judging whether additional features actually improve predictions. Let's add this to our to-dos as well.

To-Dos:

1. Look at the currently available Android app to see which values a user can select and which categories might be assigned by 311 agents after receiving a new case.
2. Compare a basic model that uses only the department value as a feature against our more complex models, as a baseline for judging whether additional features actually improve predictions.
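As a starting point for the second to-do, one very simple department-only baseline predicts the most common case status seen for each department. This is a sketch under assumptions, not the model used elsewhere in this project: the helper names are hypothetical, the toy rows are made up, and a per-group majority vote stands in for whatever basic model is eventually chosen.

```python
import pandas as pd

def fit_group_majority(train, feature, label):
    """Learn the most common label per feature value, plus a global fallback."""
    global_majority = train[label].mode().iloc[0]
    per_group = train.groupby(feature)[label].agg(lambda s: s.mode().iloc[0])
    return per_group, global_majority

def predict_group_majority(frame, feature, per_group, fallback):
    """Predict the learned label per feature value; unseen values get the fallback."""
    return frame[feature].map(per_group).fillna(fallback)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float((y_true.to_numpy() == y_pred.to_numpy()).mean())

# Toy train/test split (made-up rows, not real 311 data)
train = pd.DataFrame({
    "department": ["PWDx"] * 4 + ["ECON"] * 3,
    "case_status": ["Closed"] * 4 + ["Open", "Open", "Closed"],
})
test = pd.DataFrame({
    "department": ["PWDx", "ECON", "BHA_"],
    "case_status": ["Closed", "Open", "Closed"],
})

per_group, fallback = fit_group_majority(train, "department", "case_status")
baseline_pred = predict_group_majority(test, "department", per_group, fallback)
baseline_acc = accuracy(test["case_status"], baseline_pred)

# Reference point: always predict the overall majority class
majority_pred = pd.Series([fallback] * len(test), index=test.index)
majority_acc = accuracy(test["case_status"], majority_pred)
print(baseline_acc, majority_acc)
```

A more complex model would need to beat both numbers on held-out data before its extra features could be said to help. Because the label is so imbalanced, a metric such as F1 is probably worth reporting alongside accuracy.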