## Business Understanding

### Business Problem
An urban Emergency Medical Services (EMS) agency wants to identify factors it can manipulate to improve operational efficiency. The agency responds to approximately 130,000 911 calls for service each year. Over the last several years, the agency has introduced new resources, such as basic life support (BLS) ambulances and single-resource 'fly' cars, in an attempt to provide the right resources to the right patients at the right time. Currently, the agency's primary operational metric is Response Time Compliance (RTC). The RTC goal is a 90th percentile response time to emergent calls of 9 minutes or less.
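
To make the RTC definition concrete, here is a minimal sketch of how the metric could be computed for one period. It assumes response times are recorded in seconds (so the 9-minute goal is 540 s); the function name `rtc_summary` and the sample values are hypothetical and are not taken from the agency's data.

```python
import numpy as np

RTC_TARGET_SECONDS = 9 * 60  # 9-minute goal, assuming times are in seconds


def rtc_summary(response_times_s):
    """Return the 90th-percentile response time and whether it meets the goal."""
    p90 = float(np.percentile(response_times_s, 90))
    return p90, p90 <= RTC_TARGET_SECONDS


# Hypothetical emergent response times (seconds) for one period
sample = [240, 310, 365, 402, 450, 480, 495, 510, 530, 700]
p90, compliant = rtc_summary(sample)
print(f"90th percentile: {p90:.1f} s, compliant: {compliant}")
```

Note that a single slow response can push the 90th percentile over the target even when the other nine calls are well under it, which is why the tail of the distribution matters more than the mean for this metric.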

### Objective
- **Goal:** Analyze data to identify the factors that have the most impact on RTC. Create a viable model that provides explanatory insight into the operational factors that affect RTC.

- **Key Questions:**

  1. What factors influence RTC?
  2. Can a model accurately identify cases of compliance and non-compliance?

- **Success Criteria:**

  1. Identify modifiable features that can be adjusted to improve RTC.
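
The second key question implies a binary target. One way to frame it, sketched here under assumptions, is to label each hourly row as compliant when its emergent 90th-percentile response time is at or below the 9-minute goal. The column name `percentile_90_response_emergent` comes from the dataset previewed below; the unit (seconds), the 540 s threshold, and the new column name `rtc_compliant` are assumptions made for illustration.

```python
import pandas as pd

RTC_TARGET_SECONDS = 9 * 60  # assumes the response columns are in seconds

# Values copied from the first five preview rows of the dataset
hourly = pd.DataFrame({
    "percentile_90_response_emergent": [439.2, 522.0, 511.0, 347.4, 577.5],
})

# 1 = hour met the RTC goal, 0 = hour did not (hypothetical target column)
hourly["rtc_compliant"] = (
    hourly["percentile_90_response_emergent"] <= RTC_TARGET_SECONDS
).astype(int)

# Class balance matters when judging whether a model is "accurate"
class_share = hourly["rtc_compliant"].value_counts(normalize=True)
print(class_share)
```

If compliant hours heavily outnumber non-compliant ones in the full data, plain accuracy would be a weak success measure, and recall or F1 on the non-compliant class would likely be more informative.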

## Data Understanding

In [22]:
# Data Manipulation and Analysis
import pandas as pd
import numpy as np

# # Data Visualization
# import matplotlib.pyplot as plt
# import seaborn as sns

# # Statistical Analysis
# import statsmodels.api as sm
# from scipy.stats import ttest_ind, kstest, mannwhitneyu, norm

# # Machine Learning Models
# from sklearn.linear_model import LogisticRegression, LinearRegression
# from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Preprocessing
from sklearn.impute import KNNImputer
# from sklearn.preprocessing import StandardScaler

# # Model Evaluation
# from sklearn.metrics import (
#     accuracy_score,
#     classification_report,
#     confusion_matrix,
#     roc_auc_score,
#     roc_curve,
#     f1_score,
#     recall_score,
#     mean_squared_error,
#     mean_absolute_error,
#     r2_score,
#     auc
# )

# # Class Weight Evaluation
# from sklearn.utils.class_weight import compute_class_weight

# # Model Selection and Cross-Validation
# from sklearn.model_selection import (
#     GridSearchCV,
#     RandomizedSearchCV,
#     train_test_split,
#     StratifiedKFold,
#     cross_val_score
# )

# # Pipelines
# from sklearn.pipeline import Pipeline

In [12]:
# Data loading and preview

data = pd.read_csv('data/ems_ops.csv')
data.head()

Unnamed: 0,year,month,day,day_of_week,hour,incident_count,dist_mean,perc_from_hosp,base_ed_divert,system_overload,weather_status,RTC,emergent_responses,non_emergent_responses,mean_response_all,mean_response_emergent,mean_response_non_emergent,percentile_90_response_all,percentile_90_response_emergent,percentile_90_response_non_emergent,bls_ambulances,satellite_ambulances,als_ambulances,fly_cars,total_cars,non_emergent_transports,emergent_transports,chute_times_all,chute_times_non_emergent,chute_times_emergent,temperature,rain,snowfall
0,2014,11,1,Sat,0,13,1.58586,0.153846,0,0,0,1.0,7,4.0,380.909091,267.571429,579.25,576.0,439.2,759.4,0.0,0.0,0.0,0.0,0.0,6.0,1.0,,,,12.2545,0.0,0.0
1,2014,11,1,Sat,1,24,3.213414,0.166667,0,0,0,0.9375,16,5.0,407.571429,326.4375,667.2,725.0,522.0,817.6,0.0,0.0,0.0,0.0,0.0,14.0,,,,,10.3545,0.0,0.0
2,2014,11,1,Sat,2,26,2.709078,0.269231,1,1,0,0.904762,21,5.0,389.153846,355.047619,532.4,699.0,511.0,1007.0,0.0,0.0,0.0,0.0,0.0,16.0,2.0,,,,8.7045,0.0,0.0
3,2014,11,1,Sat,3,8,3.461524,0.375,1,1,0,1.0,3,4.0,396.285714,298.666667,469.5,624.0,347.4,688.5,0.0,0.0,0.0,0.0,0.0,14.0,,,,,7.2545,0.0,0.0
4,2014,11,1,Sat,4,6,1.970146,0.0,1,1,0,0.833333,6,0.0,332.833333,332.833333,0.0,577.5,577.5,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,,,,5.5045,0.0,0.0


In [13]:
# Data overview
# Set the option to display all columns
pd.set_option('display.max_columns', None)

data.describe()

Unnamed: 0,year,month,day,hour,incident_count,dist_mean,perc_from_hosp,base_ed_divert,system_overload,weather_status,RTC,emergent_responses,non_emergent_responses,mean_response_all,mean_response_emergent,mean_response_non_emergent,percentile_90_response_all,percentile_90_response_emergent,percentile_90_response_non_emergent,bls_ambulances,satellite_ambulances,als_ambulances,fly_cars,total_cars,non_emergent_transports,emergent_transports,chute_times_all,chute_times_non_emergent,chute_times_emergent,temperature,rain,snowfall
count,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,87088.0,85726.0,28370.0,51483.0,45528.0,51237.0,87088.0,87088.0,87088.0
mean,2019.340185,6.523964,15.732569,11.544817,12.329242,3.87792,0.143623,0.117008,0.24907,0.006752,0.882147,7.742766,3.068632,434.538436,358.594131,578.353522,687.013872,508.327457,751.153516,0.130614,0.860627,13.29503,0.076501,14.362772,7.479085,1.307367,105.773613,122.724327,99.096213,9.601541,0.033006,0.012297
std,2.896795,3.448693,8.802954,6.914466,5.203629,27.1293,0.120634,0.321432,0.432477,0.081892,0.153357,3.729647,1.979666,97.73144,86.116725,273.71423,205.475725,146.660822,364.075128,0.44915,0.978303,5.596428,0.267326,5.838694,3.783527,0.617571,30.097419,71.723473,29.612193,11.994113,0.238435,0.088175
min,2014.0,1.0,1.0,0.0,1.0,0.008578,0.0,0.0,0.0,0.0,0.0,1.0,0.0,5.538462,1.0,0.0,6.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,5.0,1.13,1.696,-36.645,0.0,0.0
25%,2017.0,4.0,8.0,6.0,8.0,2.112666,0.055556,0.0,0.0,0.0,0.8,5.0,2.0,366.75,301.727273,438.5,538.9,412.5,545.3,0.0,0.0,8.758889,0.0,9.546389,5.0,1.0,87.5,82.44655,81.0,0.2045,0.0,0.0
50%,2019.0,7.0,16.0,12.0,12.0,2.58179,0.130435,0.0,0.0,0.0,0.923077,7.0,3.0,426.0,348.75,580.75,657.5,487.0,753.0,0.0,1.0,14.0,0.0,15.0,7.0,1.0,101.857143,109.333333,95.4414,9.655001,0.0,0.0
75%,2022.0,10.0,23.0,18.0,16.0,3.157333,0.214286,0.0,0.0,0.0,1.0,10.0,4.0,492.538462,403.960526,735.0,804.025,578.8,983.5,0.0,1.729167,18.0,0.0,19.108681,10.0,1.0,119.156316,144.827333,112.5,18.6045,0.0,0.0
max,2024.0,12.0,31.0,23.0,88.0,2623.5661,1.0,1.0,1.0,1.0,1.0,27.0,15.0,1541.0,1732.0,1801.0,1795.0,1742.5,1801.0,6.0,5.487778,37.828611,2.0,45.579722,26.0,11.0,971.0,3599.87,944.440333,38.305,11.4,3.5


In [14]:
# Column dtypes and non-null counts
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87088 entries, 0 to 87087
Data columns (total 33 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   year                                 87088 non-null  int64  
 1   month                                87088 non-null  int64  
 2   day                                  87088 non-null  int64  
 3   day_of_week                          87088 non-null  object 
 4   hour                                 87088 non-null  int64  
 5   incident_count                       87088 non-null  int64  
 6   dist_mean                            87088 non-null  float64
 7   perc_from_hosp                       87088 non-null  float64
 8   base_ed_divert                       87088 non-null  int64  
 9   system_overload                      87088 non-null  int64  
 10  weather_status                       87088 non-null  int64  
 11  RTC                         

## Data Preparation

In [15]:
# Count how many chute_times_emergent values are null
data['chute_times_emergent'].isnull().value_counts()

chute_times_emergent
False    51237
True     35851
Name: count, dtype: int64
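
The check above covers a single column. A small helper like the one below can summarize missingness for every column at once; this is a sketch on toy data, and `missing_summary` is a hypothetical name rather than part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and share of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame(
        {"n_missing": counts, "share_missing": counts / len(df)}
    )
    # Only report columns that actually have gaps
    return summary[summary["n_missing"] > 0].sort_values(
        "n_missing", ascending=False
    )


# Toy frame using column names from the dataset; the values are made up
toy = pd.DataFrame({
    "chute_times_emergent": [np.nan, np.nan, 95.0, 101.5],
    "emergent_transports": [1.0, np.nan, 2.0, 1.0],
    "hour": [0, 1, 2, 3],
})
report = missing_summary(toy)
print(report)
```

On the real data this would be called as `missing_summary(data)`.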

In [16]:
# Find the first index where chute_times_all isn't null
first_valid_index = data['chute_times_all'].first_valid_index()
print(first_valid_index)

# Most of the early data has no chute times. That's too much to impute, so drop those rows.
# .copy() avoids a SettingWithCopyWarning on the column assignments that follow.
data = data.loc[first_valid_index:].copy()

35199
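
Trimming at the first valid index does not remove every gap. From the counts shown earlier, 87,088 - 35,199 = 51,889 rows remain, while `chute_times_all` has 51,483 non-null values, so roughly 406 rows after the cutoff are still missing that value. The toy sketch below shows the same pattern; the values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"chute_times_all": [np.nan, np.nan, 110.0, np.nan, 98.0]}
)

first_valid = toy["chute_times_all"].first_valid_index()
trimmed = toy.loc[first_valid:].copy()  # .loc slices by label, inclusive

# Gaps that occur after the first valid row survive the trim
remaining_missing = int(trimmed["chute_times_all"].isna().sum())
print(first_valid, len(trimmed), remaining_missing)
```

Those leftover gaps are what the imputation step has to handle.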


In [17]:
# Missing values

# Transport counts: treat null as 0 transports
data['non_emergent_transports'] = data['non_emergent_transports'].fillna(0)
data['emergent_transports'] = data['emergent_transports'].fillna(0)
# Chute times: impute with KNN
knn_imputer = KNNImputer(n_neighbors=5) # Mean of the 5 most similar rows that have the value
chute_times = ['chute_times_all', 'chute_times_non_emergent', 'chute_times_emergent']
data[chute_times] = knn_imputer.fit_transform(data[chute_times])
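
To see what `KNNImputer` does with these three columns, here is a small sketch on made-up values. It uses `n_neighbors=2` only to keep the toy example readable; the notebook itself uses 5. Because only the chute-time columns are passed in, similarity is judged on whichever of those columns are present in both rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up chute times (seconds); row 2 is deliberately far from the others
toy = pd.DataFrame({
    "chute_times_all": [100.0, 102.0, 300.0, 98.0],
    "chute_times_non_emergent": [120.0, np.nan, 310.0, 118.0],
    "chute_times_emergent": [95.0, 97.0, 290.0, np.nan],
})

imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(filled)
```

Each missing cell is filled with the mean of that column over the nearest rows that have it, so the distant row 2 does not influence the imputed values here. One caveat worth checking on the real data: a row missing all three chute times has no defined distance to any other row, in which case scikit-learn falls back to the column average.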

In [18]:
# total_cars has a minimum of 0, which should never happen; drop rows where total_cars < 1
data = data[data['total_cars'] >= 1]

In [19]:
# dist_mean: mean + 1 std is roughly the diameter of the city
# Calculate the upper limit
k = 1
mean = data['dist_mean'].mean()
std_dev = data['dist_mean'].std()
upper_limit = mean + k * std_dev

# Filter the dataset
data = data[data['dist_mean'] <= upper_limit]
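
The cutoff above is computed from the same column it filters, so extreme values inflate both the mean and the standard deviation. The sketch below wraps the rule in a hypothetical helper, `filter_upper`, and runs it on made-up distances so the resulting limit can be inspected before rows are dropped.

```python
import pandas as pd


def filter_upper(df, column, k=1.0):
    """Keep rows where column <= mean + k * std; also return the limit used."""
    upper = df[column].mean() + k * df[column].std()
    return df[df[column] <= upper].copy(), upper


# Made-up distances with one implausible value
toy = pd.DataFrame({"dist_mean": [2.0, 2.5, 3.0, 2.8, 2.2, 250.0]})
kept, upper = filter_upper(toy, "dist_mean", k=1)
print(f"upper limit: {upper:.2f}, rows kept: {len(kept)} of {len(toy)}")
```

Printing the limit makes it easy to confirm that it matches the domain knowledge behind the rule (here, the approximate diameter of the city) rather than being driven by the outliers themselves.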

In [20]:
# Going to drop features that are derivatives of others or otherwise not needed
data = data.drop(columns=['mean_response_all',
                          'mean_response_emergent',
                          'mean_response_non_emergent',
                          'percentile_90_response_all',
                          'RTC',
                          'incident_count',
                          'system_overload'])

In [23]:
# Feature engineering

# tx_per_ambulance: Transports per transporting car
data['tx_per_ambulance'] = np.where(
    (data['als_ambulances'] + data['bls_ambulances'] + data['satellite_ambulances']) == 0,
    0,
    (data['non_emergent_transports'] + data['emergent_transports']) / (data['als_ambulances'] + data['bls_ambulances'] + data['satellite_ambulances'])
)

# resp_per_ambulance: Responses per responding unit
data['resp_per_ambulance'] = np.where(
    data['total_cars'] == 0,
    0,
    (data['emergent_responses'] + data['non_emergent_responses']) / data['total_cars']
)

# als_resources_per_emergent_response: ALS resources per emergent response
data['als_resources_per_emergent_response'] = np.where(
    data['emergent_responses'] == 0,
    0,
    (data['als_ambulances'] + data['satellite_ambulances'] + data['fly_cars']) / data['emergent_responses']
)

# Lags and diffs
data['emergent_responses_lag1'] = data['emergent_responses'].shift(1)
data['non_emergent_responses_lag1'] = data['non_emergent_responses'].shift(1)
data['base_ed_divert_lag1'] = data['base_ed_divert'].shift(1)
data['non_emergent_transports_lag1'] = data['non_emergent_transports'].shift(1)
data['emergent_transports_lag1'] = data['emergent_transports'].shift(1)
data['percentile_90_response_emergent_lag1'] = data['percentile_90_response_emergent'].shift(1)
data['percentile_90_response_emergent_lag2'] = data['percentile_90_response_emergent'].shift(2)

data['total_cars_dif1'] = data['total_cars'].diff()
data['als_ambulances_dif1'] = data['als_ambulances'].diff()
data['bls_ambulances_dif1'] = data['bls_ambulances'].diff()
data['satellite_ambulances_dif1'] = data['satellite_ambulances'].diff()
data['fly_cars_dif1'] = data['fly_cars'].diff()

# Testing features
data['is_peak'] = data['hour'].between(7, 21).astype(int) # Flag peak hours (7 through 21, inclusive)
data['overload_flag'] = (data['emergent_responses'] + data['non_emergent_responses']) > data['total_cars'] # Flag if more calls than cars
data['emergent_ratio'] = np.where(
    (data['emergent_responses'] + data['non_emergent_responses']) == 0,
    0,  # Set ratio to 0 when there are no responses
    data['emergent_responses'] / (data['emergent_responses'] + data['non_emergent_responses'])
) # Proportion of emergent vs. non-emergent calls
data['rolling_emergent_avg'] = data['emergent_responses'].rolling(7).mean() # 7-row rolling mean of emergent responses (7 hours when rows are consecutive)

# Drop NaNs created by the lag, diff, and rolling features
data = data.dropna()
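
Because rows were removed in earlier steps, `shift(1)` returns the previous row, which is not always the previous hour. The sketch below, on made-up data, rebuilds a timestamp from the `year`, `month`, `day`, and `hour` columns and keeps a lag only when the preceding row is exactly one hour earlier. The `_strict` column name is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2021, 2021, 2021, 2021],
    "month": [1, 1, 1, 1],
    "day": [1, 1, 1, 1],
    "hour": [0, 1, 3, 4],  # hour 2 is missing, as if it had been filtered out
    "emergent_responses": [5, 7, 6, 9],
})

# Assemble a timestamp from the component columns
ts = pd.to_datetime(toy[["year", "month", "day", "hour"]])

toy["emergent_responses_lag1"] = toy["emergent_responses"].shift(1)

# True only where the previous row is the immediately preceding hour
contiguous = ts.diff() == pd.Timedelta(hours=1)
toy["emergent_responses_lag1_strict"] = toy["emergent_responses_lag1"].where(
    contiguous
)
print(toy[["hour", "emergent_responses_lag1", "emergent_responses_lag1_strict"]])
```

Whether the stricter version is worth the extra rows lost to `dropna()` depends on how many gaps the filtered data actually contains; counting `(~contiguous).sum()` on the real data would answer that.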

In [24]:
# Handle encoding
data = pd.get_dummies(data, columns=['day_of_week'], drop_first=True)
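
With `drop_first=True`, pandas drops the first category in sorted order, and that category becomes the baseline the other indicators are compared against. The sketch below shows this on a toy frame; only `Sat` appears in the data preview, so the other day abbreviations are assumed.

```python
import pandas as pd

toy = pd.DataFrame({
    "day_of_week": ["Sat", "Sun", "Mon", "Fri"],  # abbreviations other than Sat are assumed
    "hour": [0, 1, 2, 3],
})

encoded = pd.get_dummies(toy, columns=["day_of_week"], drop_first=True)
print(list(encoded.columns))
```

Here `Fri` sorts first, so it has no indicator column and is represented by all day-of-week indicators being zero. This matters when reading model coefficients, since each day's effect is then relative to the dropped day.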

## Exploratory Data Analysis (EDA)

In [25]:
# Checking for any remaining null values
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 51531 entries, 35205 to 87087
Data columns (total 50 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   year                                  51531 non-null  int64  
 1   month                                 51531 non-null  int64  
 2   day                                   51531 non-null  int64  
 3   hour                                  51531 non-null  int64  
 4   dist_mean                             51531 non-null  float64
 5   perc_from_hosp                        51531 non-null  float64
 6   base_ed_divert                        51531 non-null  int64  
 7   weather_status                        51531 non-null  int64  
 8   emergent_responses                    51531 non-null  int64  
 9   non_emergent_responses                51531 non-null  float64
 10  percentile_90_response_emergent       51531 non-null  float64
 11  percentile_90_re

In [26]:
# One last inspection
data.describe()

Unnamed: 0,year,month,day,hour,dist_mean,perc_from_hosp,base_ed_divert,weather_status,emergent_responses,non_emergent_responses,percentile_90_response_emergent,percentile_90_response_non_emergent,bls_ambulances,satellite_ambulances,als_ambulances,fly_cars,total_cars,non_emergent_transports,emergent_transports,chute_times_all,chute_times_non_emergent,chute_times_emergent,temperature,rain,snowfall,tx_per_ambulance,resp_per_ambulance,als_resources_per_emergent_response,emergent_responses_lag1,non_emergent_responses_lag1,base_ed_divert_lag1,non_emergent_transports_lag1,emergent_transports_lag1,percentile_90_response_emergent_lag1,percentile_90_response_emergent_lag2,total_cars_dif1,als_ambulances_dif1,bls_ambulances_dif1,satellite_ambulances_dif1,fly_cars_dif1,is_peak,emergent_ratio,rolling_emergent_avg
count,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0,51531.0
mean,2021.36054,6.482642,15.777047,11.541499,2.753897,0.144996,0.147465,0.011275,8.028449,2.946692,531.587728,748.914199,0.190294,1.448881,13.097726,0.119147,14.856048,8.216976,0.530671,105.755024,120.239122,99.151086,9.622149,0.028507,0.011871,0.724722,0.900917,2.293268,8.028391,2.946556,0.147465,8.216899,0.53071,531.588114,531.591285,0.000324,0.000305,-3.016266e-20,1.9e-05,-1.7235809999999998e-20,0.627933,0.728896,8.028213
std,1.7298,3.439255,8.811233,6.920089,0.905737,0.120347,0.354572,0.105583,3.812046,1.945405,155.617769,373.194988,0.550964,0.872457,5.946367,0.316515,6.283984,3.668753,0.774099,29.908328,68.591918,29.636565,12.210261,0.233575,0.089333,0.558215,0.666003,1.855665,3.812054,1.945297,0.354572,3.668758,0.774122,155.617652,155.615503,2.091173,2.022183,0.1562025,0.400075,0.1558567,0.483361,0.159997,2.463328
min,2018.0,1.0,1.0,0.0,0.008578,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,5.0,1.13,1.696,-36.645,0.0,0.0,0.0,0.049086,0.1,1.0,0.0,0.0,0.0,0.0,1.0,1.0,-19.115556,-16.615,-2.616389,-2.593333,-1.731111,0.0,0.1,1.285714
25%,2020.0,4.0,8.0,6.0,2.139314,0.058824,0.0,0.0,5.0,1.0,428.0,541.6,0.0,1.0,8.446111,0.0,9.699028,5.0,0.0,87.625,82.0,81.101417,-0.245,0.0,0.0,0.421525,0.531357,1.265215,5.0,1.0,0.0,5.0,0.0,428.0,428.0,-1.729722,-1.624028,0.0,0.0,0.0,0.0,0.625,6.0
50%,2021.0,6.0,16.0,12.0,2.611918,0.133333,0.0,0.0,8.0,3.0,507.5,754.3,0.0,1.22,13.746111,0.0,15.661389,8.0,0.0,102.083333,107.6192,95.7,9.455,0.0,0.0,0.592593,0.735198,1.845301,8.0,3.0,0.0,8.0,0.0,507.5,507.5,0.253056,0.223611,0.0,0.0,0.0,1.0,0.75,8.0
75%,2023.0,9.0,23.0,18.0,3.210157,0.214286,0.0,0.0,11.0,4.0,607.95,988.5,0.0,2.0,18.0,0.0,20.068333,11.0,1.0,118.932045,140.72125,112.375,18.905,0.0,0.0,0.843948,1.034047,2.715663,11.0,4.0,0.0,11.0,1.0,607.95,607.95,1.624444,1.597361,0.0,0.0,0.0,1.0,0.833333,9.857143
max,2024.0,12.0,31.0,23.0,32.215901,1.0,1.0,1.0,27.0,14.0,1742.5,1801.0,6.0,5.487778,37.828611,2.0,45.579722,26.0,11.0,971.0,3599.87,944.440333,38.105,11.4,3.5,17.647059,13.0,25.0,27.0,14.0,1.0,26.0,11.0,1742.5,1742.5,21.8425,18.499444,3.343056,2.979167,1.4775,1.0,1.0,16.285714


## Modeling

## Evaluation

## Recommendations