This notebook displays end-to-end machine learning process: 

1) data cleaning, 

2) data processing & feature engineering, 

3) model building 

The majority of work involves extensive data cleaning and engineering:

- various datasets (2015-2019)
- geodata and time series 
- RequestType is used as the dependent variable to setup for multi-class classification task
- where(zipcodes/districts/neighborhoods) vs when(days difference/date categories) vs what RequestType
- final clean dataframe is sorted by ascending CreatedDate
- LGBMClassifier is used and tested on validation set 
- only around 0.06 in multi-logloss for this first round of testing 
- about 0.02 small difference between train and validation sets; thus, no overfitting issue
- tested with CatboostClassifier as well; however, due to slow running time and memory issue, only included LGBM here.

For data analysis part, please refer to my previous work as well:

https://github.com/hackforla/311-data/blob/dev/dataAnalysis/minaAnalysis_311data.ipynb


In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import auc,roc_auc_score
from catboost import CatBoostClassifier,Pool, cv
from sklearn.preprocessing import LabelEncoder
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_curve, auc, classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, KFold, StratifiedKFold

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
from datetime import time
import gc
sns.set()
%matplotlib inline
import matplotlib.gridspec as gridspec
import geopandas
from geopandas import GeoDataFrame
from shapely.geometry import Point

import warnings
warnings.filterwarnings("ignore")

In [3]:
df_19=pd.read_csv('../input/datafolder/MyLA311_Service_Request_Data_2019.csv')
df_15=pd.read_csv('../input/la3111-data/MyLA311_Service_Request_Data_2015.csv')
df_16=pd.read_csv('../input/la3111-data/MyLA311_Service_Request_Data_2016.csv')
df_17=pd.read_csv('../input/la3111-data/MyLA311_Service_Request_Data_2017.csv')
df_18=pd.read_csv('../input/la3111-data/MyLA311_Service_Request_Data_2018.csv')
print(df_15.shape, df_16.shape, df_17.shape, df_18.shape, df_19.shape)

(237305, 33) (952486, 33) (1131558, 33) (1210075, 33) (1308093, 34)


In [4]:
df_19.drop(['CreatedByUserOrganization'], axis=1, inplace=True)
print(df_19.shape)

(1308093, 33)


In [5]:
df = pd.concat([df_15, df_16, df_17, df_18, df_19])
df.drop(['HouseNumber', #'Latitude', 'Longitude'
        ], axis=1, inplace=True)
print(df.shape)

(4839517, 32)


In [6]:
del [[df_15, df_16, df_17, df_18, df_19]]
gc.collect()

4

1) Data Cleaning

In [7]:
#1
df = df[df['ServiceDate'].notnull()]
df = df[df['ClosedDate'].notnull()]
print(df.shape)

(4576273, 32)


In [8]:
#2
df['CreatedDate'] = pd.to_datetime(df['CreatedDate'])

In [9]:
df['UpdatedDate'] = pd.to_datetime(df['UpdatedDate'])

In [10]:
df['ServiceDate'] = pd.to_datetime(df['ServiceDate'])

In [11]:
df['ClosedDate'] = pd.to_datetime(df['ClosedDate'])

In [12]:
df['service']= (df['ServiceDate']).astype('str').str.slice(stop=11)

In [13]:
#3
df = df[~df['service'].str.startswith(('1900', '2109', '2107', '2027', '2026', '2022',
                                       '2007', '2012', '2013', '2014',
                                     '2020-12', '2020-11', '2020-10', '2020-09',
                                     '2020-08', '2020-07', '2020-06', '2020-05', '2020-04',
                                     '2015-01', '2015-02', '2015-03', '2015-04',
                                     '2015-05', '2015-06', '2015-07', '2015-08-05'
                                    ))]
print(df.shape)

(4575985, 33)


In [14]:
#4
df['ClosedCreatedDiff'] = df['ClosedDate'] - df['CreatedDate']
df['ServiceCreatedDiff'] = df['ServiceDate'] - df['CreatedDate']
df['ClosedServiceDiff'] = df['ClosedDate'] - df['ServiceDate']

df['UpdatedCreatedDiff'] = df['UpdatedDate'] - df['CreatedDate']
df['UpdatedServiceDiff'] = df['UpdatedDate'] - df['ServiceDate']
df['UpdatedClosedDiff'] = df['UpdatedDate'] - df['ClosedDate']

In [15]:
#5
def ddiff2days(ddiff):
    if not pd.isnull(ddiff):
        return pd.Timedelta.total_seconds(ddiff)/(24.*3600)
    else:
        return np.NaN

In [16]:
#6
cols = ['ClosedCreatedDiff', 'ServiceCreatedDiff', 'ClosedServiceDiff', 
   'UpdatedCreatedDiff', 'UpdatedServiceDiff', 'UpdatedClosedDiff']
for c in cols:
    df[c] = df[c].apply(ddiff2days)
    df[c] = df[c].astype('float32')

In [17]:
#7
df = df[df['ServiceCreatedDiff'] > 0.00]
df = df[df['ClosedServiceDiff'] > 0.00]
print(df.shape)

(3549977, 39)


2) Data Processing & Feature Engineering

In [18]:
#8
df['Address'] = df['Address'].str.lower()
df['Address'] = df['Address'].str.replace(',','')
df = df[df['Address'].notnull()]
df['Address'] = df['Address'].str.replace('1  ', '1 ')
df['Address'] = df['Address'].str.replace('1/2 ', '')
df['Address'] = df['Address'].str.replace('at ', '')
print(df.shape)

(3549967, 39)


In [19]:
#9
import re
df['st_1'] = df['Address'].str.extract('([a-zA-Z ]+)', expand=False).str.strip()
df['st_1'] = (df['st_1'].replace('','Unknown')).astype('object')
df['First'] = df['st_1'].str.split('\s+').str[0]
df['Third'] = df['st_1'].str.split('\s+').str[-1]
df['zip'] = df['Address'].str.extract(r'(\d{5}\-?\d{0,4})')
df['zip'] = df['zip'].str.slice(stop=5)
df.drop(['ZipCode', 'service'], axis=1, inplace=True)

In [20]:
#10
cols = ['Direction', 'StreetName', 'Suffix',
        'MobileOS', 'AssignTo', 'ApproximateAddress',
        'Location', 'TBMColumn', 'APC', 
        'CDMember', 'NCName', 'PolicePrecinct']
for c in cols:
    df[c].fillna('unknown', inplace=True)
    df[c] = df[c].astype('object')

df['NC'].fillna(999.0, inplace=True)
df['TBMPage'].fillna(999.0, inplace=True)
df['CD'].fillna(99.0, inplace=True)
df['TBMRow'].fillna(99.0, inplace=True)
df['zip'].fillna('99999', inplace=True)

cols = ['NC', 'TBMPage', 'CD', 'TBMRow', 'zip']
for c in cols:
    df[c] = df[c].astype('int64')

In [21]:
#11
df['NCName'] = df['NCName'].str.lower()
df['NCName'] = df['NCName'].astype('str').str.replace('nc', '').str.replace('cc', '').str.replace('ndc', '').str.replace('p.i.c.o.', 'pico union')
df['TBMColumn'] = df['TBMColumn'].astype('str').str.replace('unknown','A').str.replace('I','A')
df['assign'] = df['AssignTo'].str.extract('([a-zA-Z ]+)', expand=False).str.strip()
df['assign'] = df['assign'].astype('str').str.replace('SSC Residential', 'unknown').str.replace('LSD', 'unknown')
df.drop(['AssignTo'], axis=1, inplace=True)

In [22]:
#12
df['Direction'] = (df['Direction'].map({
'W': 'W', 'N': 'N', 'S': 'S', 'E': 'E', 'unknown': 'W', 'WEST': 'W', 'R': 'W', 'NORTH': 'N',
'SHELDON': 'W', 'EXPOSITION': 'W', 'BURBANK': 'W', 'SOUTH': 'S',
'ELYSIAN': 'W', 'CORBIN': 'W', 'VAN': 'W', 'VICTORY': 'W',
'HOMER': 'W', 'ARLETA': 'W', 'WORTH': 'W', 'SAN': 'W',
'ALLEGHENY': 'W', 'TOBERMAN': 'W', 'SANTA': 'W', 'FALLBROOK': 'W',
'UNIVERSAL': 'W', 'CLEAR': 'W', 'RESEDA': 'W',
})).astype('object')

In [23]:
#13
df['Suffix'] = (df['Suffix'].map({
'AVE': 'ave', 'ST': 'st', 'BLVD': 'blvd', 'unknown': 'unknown', 'DR': 'dr', 'PL': 'pl',
'WAY': 'wy', 'ROAD': 'rd', 'LANE': 'ln', 'TER': 'ter', 'CT': 'ct', 'AV': 'ave',
'CIR': 'ct', 'HWY': 'wy', 'RD': 'rd', 'WALK': 'wy', 'TR': 'ter', 'GRN': 'ct',        
'LN': 'ln', 'WY': 'wy', 'MALL': 'wy', 'CL': 'ct', 'PARK': 'wy', 'WK': 'wy',
'PKWY': 'wy', 'CK': 'ct', 'VISTA': 'unknown', 'HILL': 'unknown', 'PASS': 'unknown', 'VIS': 'unknown',
'COVE': 'unknown', 'CYN': 'unknown', 'ROW': 'unknown', 'PT': 'unknown', 'AL': 'unknown', 'SQ': 'unknown',
'PZ': 'unknown', 'PASEO': 'unknown', 'RDG': 'unknown', 'VIEW': 'unknown', 'VIA': 'unknown', 'SP': 'unknown',
'HL': 'unknown', 'VW': 'unknown', 'CV': 'unknown'
})).astype('object')

In [24]:
#14
action = []
#For each row in the column,
for row in df['ActionTaken']:
    if row == 'SR Created':
        action.append('SR Created')
    else:
        action.append('Others')

df['action'] = action
df['action'] = (df['action']).astype('object')
df.drop(['ActionTaken'], axis=1, inplace=True)

In [25]:
#15
own = []
for row in df['Owner']:
    if row == 'RAP':
        own.append('BOS')
    else:
        own.append(row)
df['own'] = own
df['own'] = (df['own']).astype('object')
df.drop(['Owner'], axis=1, inplace=True)

In [26]:
#16
source= []
for row in df['RequestSource']:
    if row == 'Call':
        source.append('Call')
    elif row == 'Mobile App':
        source.append('Mobile App')
    elif row == 'Self Service':
        source.append('Self Service')
    elif row == 'Email':
        source.append('Email')
    else:
        source.append('Mobile App')
        
df['source'] = source
df['source'] = (df['source']).astype('object')
df.drop(['RequestSource', 'Address', # 'NCName', 'own', 'action'
        ], axis=1, inplace=True)

In [27]:
#17
df['created_year'] = (df['CreatedDate'].dt.year).astype('int16')
df['created_month'] = (df['CreatedDate'].dt.month).astype('int16')
df['created_hour'] = (df['CreatedDate'].dt.hour).astype('int16')
df['created_week'] = (df['CreatedDate'].dt.week).astype('int16')
df['created_dayofweek'] = (df['CreatedDate'].dt.dayofweek).astype('int16')
df['created_weekend'] = (np.where(np.logical_or(df['created_dayofweek'] == 5,
                                               df['created_dayofweek'] == 6), 1, 0)).astype('int16')
df['service_year'] = (df['ServiceDate'].dt.year).astype('int16')
df['service_month'] = (df['ServiceDate'].dt.month).astype('int16')
df['service_week'] = (df['ServiceDate'].dt.week).astype('int16')

df['closed_year'] = (df['ClosedDate'].dt.year).astype('int16')
df['closed_month'] = (df['ClosedDate'].dt.month).astype('int16')
df['closed_week'] = (df['ClosedDate'].dt.week).astype('int16')

In [28]:
#18
df['oversixmonths_closedcreated'] = (np.where(df['ClosedCreatedDiff']> 180.00, 1, 0)).astype('int8')
df['overthreemonths_closedservice'] = (np.where(df['ClosedServiceDiff']> 120.00, 1, 0)).astype('int8')
df['overthreemonths_servicecreated'] = (np.where(df['ServiceCreatedDiff']> 120.00, 1, 0)).astype('int8')
df['oversixmonths_updatedcreated'] = (np.where(df['UpdatedCreatedDiff']> 180.00, 1, 0)).astype('int8')
df['overthreemonths_updatedservice'] = (np.where(df['UpdatedServiceDiff']> 120.00, 1, 0)).astype('int8')

In [29]:
#19
cols = ['ClosedCreatedDiff', 'ServiceCreatedDiff', 'ClosedServiceDiff', 
        'UpdatedCreatedDiff', 
       ]
for c in cols:
    df['max_zipcd_'+c]=(df.groupby(['zip', 'CD'])[c].transform('max')).astype('float32')
    df['max_zip_'+c]=(df.groupby(['zip', 'CD', 'RequestType'])[c].transform('max')).astype('float32')
    df['max_zipsourequest_'+c]=(df.groupby(['zip', 'RequestType'])[c].transform('max')).astype('float32')
    
    df['max_zipassign_'+c]=(df.groupby(['zip', 'assign'])[c].transform('max')).astype('float32')
    df['max_st1assign_'+c]=(df.groupby(['st_1', 'assign'])[c].transform('max')).astype('float32')
    df['max_firstassign_'+c]=(df.groupby(['First', 'assign'])[c].transform('max')).astype('float32')
    df['max_thirdassign_'+c]=(df.groupby(['Third', 'assign'])[c].transform('max')).astype('float32')
    df['max_Locationassign_'+c]=(df.groupby(['Location', 'assign'])[c].transform('max')).astype('float32')
    
    df['max_st1cd_'+c]=(df.groupby(['st_1', 'CD'])[c].transform('max')).astype('float32')
    df['max_firstcd_'+c]=(df.groupby(['First', 'CD'])[c].transform('max')).astype('float32')
    df['max_thirdcd_'+c]=(df.groupby(['Third', 'CD'])[c].transform('max')).astype('float32')
    df['max_Locationcd_'+c]=(df.groupby(['Location', 'CD'])[c].transform('max')).astype('float32')
    
    df['max_st1type_'+c]=(df.groupby(['st_1', 'RequestType'])[c].transform('max')).astype('float32')
    df['max_firsttype_'+c]=(df.groupby(['First', 'RequestType'])[c].transform('max')).astype('float32')
    df['max_thirdtype_'+c]=(df.groupby(['Third', 'RequestType'])[c].transform('max')).astype('float32')
    df['max_Locationtype_'+c]=(df.groupby(['Location', 'RequestType'])[c].transform('max')).astype('float32')
    
    df['max_zipnc_'+c]=(df.groupby(['zip', 'NC'])[c].transform('max')).astype('float32')
    df['max_st1nc_'+c]=(df.groupby(['st_1', 'NC'])[c].transform('max')).astype('float32')
    df['max_firstnc_'+c]=(df.groupby(['First', 'NC'])[c].transform('max')).astype('float32')
    df['max_thirdnc_'+c]=(df.groupby(['Third', 'NC'])[c].transform('max')).astype('float32')
    df['max_Locationnc_'+c]=(df.groupby(['Location', 'NC'])[c].transform('max')).astype('float32')
    
    df['max_zipapc_'+c]=(df.groupby(['zip', 'APC'])[c].transform('max')).astype('float32')
    df['max_st1apc_'+c]=(df.groupby(['st_1', 'APC'])[c].transform('max')).astype('float32')
    df['max_firstapc_'+c]=(df.groupby(['First', 'APC'])[c].transform('max')).astype('float32')
    df['max_thirdapc_'+c]=(df.groupby(['Third', 'APC'])[c].transform('max')).astype('float32')
    df['max_Locationapc_'+c]=(df.groupby(['Location', 'APC'])[c].transform('max')).astype('float32')
    
df.drop(['created_hour', 'created_dayofweek', 'created_weekend',
'created_week', 'service_week', 'closed_week', 
], axis=1, inplace=True)

In [30]:
#20
df['source_type'] = df['source'].astype('str')+'_'+df['RequestType'].astype('str')
df['typesource_os'] = df['RequestType'].astype('str')+'_'+df['source'].astype('str')+df['MobileOS'].astype('str')
df['verify_approx'] = df['AddressVerified'].astype('str')+'_'+df['ApproximateAddress'].astype('str')
df['page_row_column'] = df['TBMPage'].astype('str')+'_'+df['TBMRow'].astype('str')+'_'+df['TBMColumn'].astype('str')

df['pc_apc'] = df['PolicePrecinct'].astype('str')+'_'+df['APC'].astype('str')
df['cd_member'] = df['CD'].astype('str')+'_'+df['CDMember'].astype('str')
df['cd_assign'] = df['CD'].astype('str')+'_'+df['assign'].astype('str')
df['cd_nc_pc'] = df['CD'].astype('str')+'_'+df['NC'].astype('str')

In [31]:
#21
cols = ['ClosedCreatedDiff', 'ServiceCreatedDiff', 'ClosedServiceDiff', 
        'UpdatedCreatedDiff', #'UpdatedClosedDiff', 
        #'UpdatedServiceDiff'
       ]
for c in cols:
    df['count_'+c] = (df[c].map(df[c].value_counts())).astype('int32')
    
del cols
gc.collect()

12

In [32]:
df.drop(['max_zipnc_ClosedCreatedDiff', 'max_st1nc_ClosedCreatedDiff', 
         'max_firstnc_ClosedCreatedDiff', 'max_thirdnc_ClosedCreatedDiff', 'max_Locationnc_ClosedCreatedDiff',
         'max_zipnc_ClosedServiceDiff', 'max_st1nc_ClosedServiceDiff', 
         'max_firstnc_ClosedServiceDiff', 'max_thirdnc_ClosedServiceDiff', 'max_Locationnc_ClosedServiceDiff',
         'max_zipnc_ServiceCreatedDiff', 'max_st1nc_ServiceCreatedDiff', 
         'max_firstnc_ServiceCreatedDiff', 'max_thirdnc_ServiceCreatedDiff', 'max_Locationnc_ServiceCreatedDiff',
        'max_zipapc_ClosedCreatedDiff', 'max_st1apc_ClosedCreatedDiff', 
         'max_firstapc_ClosedCreatedDiff', 'max_thirdapc_ClosedCreatedDiff', 'max_Locationapc_ClosedCreatedDiff',
         'max_zipapc_ClosedServiceDiff', 'max_st1apc_ClosedServiceDiff', 
         'max_firstapc_ClosedServiceDiff', 'max_thirdapc_ClosedServiceDiff', 'max_Locationapc_ClosedServiceDiff',
         'max_zipapc_ServiceCreatedDiff', 'max_st1apc_ServiceCreatedDiff', 
         'max_firstapc_ServiceCreatedDiff', 'max_thirdapc_ServiceCreatedDiff', 'max_Locationapc_ServiceCreatedDiff',        
        'count_ClosedCreatedDiff', 'count_ServiceCreatedDiff', 'count_ClosedServiceDiff', 
        'count_UpdatedCreatedDiff',], axis=1, inplace=True)

In [33]:
#22
df['Status'] = (df['Status'].map({
'Closed': 'closed',
'Cancelled': 'cancelled',
'Referred Out': 'Open',
'Forward': 'open',
'Open': 'open',
'Pending': 'open'
})).astype('object')

In [34]:
#23
cols = [c for c in df.columns if df[c].dtypes=='object']
for c in cols:
    le = LabelEncoder()
    df[c] = (le.fit_transform(df[c])).astype('int32')

3) Model Building 

In [35]:
#24
df.sort_values(by=['CreatedDate'], inplace=True, ascending=True)

In [36]:
#25
y = pd.DataFrame(df['Status'])
X = df.drop(['CreatedDate',
             'ServiceDate', 'UpdatedDate', 'ClosedDate', 'Status', 
             'NCName', 'action', 'own', 
             'StreetName', 
             #'Suffix', 
             #'First', 'Third',
             'Latitude', 'Longitude',
             #'pc_apc', 'cd_nc_pc',
             #'page_row_column', 
             #'TBMPage', 'TBMColumn', 'TBMRow'
            ], axis=1)
print(X.shape, df.shape, y.shape)

#del df
gc.collect()

(3549967, 122) (3549967, 133) (3549967, 1)


0

In [37]:
#26
X_train = X.iloc[:2549967]
X_val = X.iloc[2549967:]
y_train = y.iloc[:2549967]
y_val = y.iloc[2549967:]
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

(2549967, 122) (1000000, 122) (2549967, 1) (1000000, 1)


In [38]:
lgb_clf = LGBMClassifier(
                      objective='multiclass',
                      n_estimators=4000,
                      learning_rate=0.01,
                      feature_fraction=0.2,
                      bagging_fraction=0.2,
                      min_data_in_leaf=13,
                      max_depth=-1,
                      num_leaves=20,
                      early_stopping_rounds=100,
                      bagging_freq=5,
                      random_state=42,
                     )

lgb_clf.fit(X_train, y_train,
      eval_set = [(X_train, y_train),(X_val, y_val.values)],
      eval_metric = 'multiclass', 
      early_stopping_rounds = 100,
      verbose = 100
    )
lgb_pred = lgb_clf.predict_proba(X_val)[:, -1]

Training until validation scores don't improve for 100 rounds
[100]	training's multi_logloss: 0.0548404	valid_1's multi_logloss: 0.0758155
[200]	training's multi_logloss: 0.0490642	valid_1's multi_logloss: 0.0689052
[300]	training's multi_logloss: 0.046885	valid_1's multi_logloss: 0.0666342
[400]	training's multi_logloss: 0.0457289	valid_1's multi_logloss: 0.0656381
[500]	training's multi_logloss: 0.0448872	valid_1's multi_logloss: 0.0650859
[600]	training's multi_logloss: 0.0442919	valid_1's multi_logloss: 0.0647865
[700]	training's multi_logloss: 0.0438255	valid_1's multi_logloss: 0.064601
[800]	training's multi_logloss: 0.0434253	valid_1's multi_logloss: 0.0644372
[900]	training's multi_logloss: 0.0431106	valid_1's multi_logloss: 0.0642786
[1000]	training's multi_logloss: 0.0428335	valid_1's multi_logloss: 0.0641462
[1100]	training's multi_logloss: 0.0425762	valid_1's multi_logloss: 0.0640373
[1200]	training's multi_logloss: 0.042357	valid_1's multi_logloss: 0.0638922
[1300]	trainin