# Hospital Wait Time Forecast

In this notebook we use Azure AutoML to forecast the average wait time of patients in each city

$*****$ Important – Do not use in production, for demonstration purposes only – please review the legal notices before continuing $*****$ 

## Legal Notices 

This presentation, demonstration, and demonstration model are for informational purposes only. Microsoft makes no warranties, express or implied, in this presentation demonstration, and demonstration model. Nothing in this presentation, demonstration, or demonstration model modifies any of the terms and conditions of Microsoft’s written and signed agreements. This is not an offer and applicable terms and the information provided is subject to revision and may be changed at any time by Microsoft.

This presentation, demonstration, and/or demonstration model do not give you or your organization any license to any patents, trademarks, copyrights, or other intellectual property covering the subject matter in this presentation, demonstration, and demonstration model.

The information contained in this presentation, demonstration and demonstration model represent the current view of Microsoft on the issues discussed as of the date of presentation and/or demonstration, and the duration of your access to the demonstration model. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of presentation and/or demonstration and for the duration of your access to the demonstration model.

No Microsoft technology, nor any of its component technologies, including the demonstration model, is intended or made available: (1) as a medical device; (2) for the diagnosis of disease or other conditions, or in the cure, mitigation, treatment or prevention of a disease or other conditions; or (3) as a substitute for the professional clinical advice, opinion, or judgment of a treating healthcare professional. Partners or customers are responsible for ensuring the regulatory compliance of any solution they build using Microsoft technologies.

© 2020 Microsoft Corporation. All rights reserved

## Setting up the workspace

In [1]:
import azureml.core
import pandas as pd
import azureml.core
from azureml.core import Workspace, Datastore, Dataset, Experiment
from azureml.data import DataType
from azureml.data.datapath import DataPath
from azureml.core.compute import AmlCompute
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
import os

print("SDK Version:", azureml.core.VERSION)

from azureml.core import Workspace, Datastore, Dataset
ws = Workspace.from_config()
ws

SDK Version: 1.48.0


Workspace.create(name='mlw-healthcare2-prod', subscription_id='506e86fc-853c-4557-a6e5-ad72114efd2b', resource_group='rg-healthcare2-prod')

#### Create new datastore for Datasets

In [2]:
import GlobalVariables

In [3]:
from azureml.core import Datastore

blob_datastore_name=GlobalVariables.WAIT_TIME_DATASTORE_NAME # Name of the datastore in workspace 
container_name=GlobalVariables.GLOBAL_CONTAINER_NAME
account_name=GlobalVariables.STORAGE_ACCOUNT_NAME
account_key=GlobalVariables.STORAGE_ACCOUNT_KEY # Storage account access key

blob_datastore = Datastore.register_azure_blob_container(workspace=ws, 
                                                         datastore_name=blob_datastore_name, 
                                                         container_name=container_name, 
                                                         account_name=account_name,
                                                         account_key=account_key)

dstore = Datastore.get(ws, datastore_name=blob_datastore_name)
dstore

{
  "name": "wait_time_prediction_store",
  "container_name": "predictiveanalytics",
  "account_name": "sthealthcare2prod",
  "protocol": "https",
  "endpoint": "core.windows.net"
}

In [4]:
from azureml.data.datapath import DataPath
filepath = GlobalVariables.WAIT_TIME_INPUT_FILE_NAME
print(filepath)

# Set the path to the storage account containing the file
datastore_path = [DataPath(dstore, filepath)]
patientdataset = Dataset.Tabular.from_delimited_files(path=datastore_path)
patientdataset.take(5).to_pandas_dataframe()

/pbiPatientPredictiveSet.csv


Unnamed: 0,encounter_id,hospital_id,department_id,city,patient_id,patient_age,risk_level,acute_type,patient_category,doctor_id,...,drug_cost,hospital_expense,follow_up,readmitted_patient,payment_type,date,month,year,disease,reason_for_readmission
0,21059,1,1,Los Angeles,738311d9-2f2c-11eb-aa27-70b5e8b8edbb,61,5,Acute,InPatient,9542,...,840,6300,0,0,Medicaid,2016-06-28 17:46:00,Jun,2016,,radiotherapy
1,2305342,2,6,Chicago,0e930c2e-2f31-11eb-8d13-70b5e8b8edbb,59,2,Non Acute,InPatient,3127,...,754,6074,0,0,Medicaid,2019-06-18 20:47:00,Jun,2019,,alzheimer
2,426911,1,1,Los Angeles,c7ae99aa-2f2c-11eb-88b4-70b5e8b8edbb,79,1,Non Acute,InPatient,7261,...,1017,7037,0,0,Private Insurance,2017-08-10 23:47:00,Aug,2017,,radiotherapy
3,797146,1,2,Los Angeles,4a63175e-2f2d-11eb-bf71-70b5e8b8edbb,14,2,Non Acute,InPatient,11029,...,691,6069,0,0,Medicare,2019-11-04 02:59:00,Nov,2019,,scoliosis
4,2847178,21,7,Miami,79430da9-2f32-11eb-87ff-70b5e8b8edbb,50,4,Acute,InPatient,12480,...,751,5659,1,0,Private Insurance,2016-06-11 17:28:00,Jun,2016,,flu


#### Convert to Pandas DataFrame to do data preparation

In [5]:
patient_df = patientdataset.to_pandas_dataframe()
patient_df.head()

Unnamed: 0,encounter_id,hospital_id,department_id,city,patient_id,patient_age,risk_level,acute_type,patient_category,doctor_id,...,drug_cost,hospital_expense,follow_up,readmitted_patient,payment_type,date,month,year,disease,reason_for_readmission
0,21059,1,1,Los Angeles,738311d9-2f2c-11eb-aa27-70b5e8b8edbb,61,5,Acute,InPatient,9542,...,840,6300,0,0,Medicaid,2016-06-28 17:46:00,Jun,2016,,radiotherapy
1,2305342,2,6,Chicago,0e930c2e-2f31-11eb-8d13-70b5e8b8edbb,59,2,Non Acute,InPatient,3127,...,754,6074,0,0,Medicaid,2019-06-18 20:47:00,Jun,2019,,alzheimer
2,426911,1,1,Los Angeles,c7ae99aa-2f2c-11eb-88b4-70b5e8b8edbb,79,1,Non Acute,InPatient,7261,...,1017,7037,0,0,Private Insurance,2017-08-10 23:47:00,Aug,2017,,radiotherapy
3,797146,1,2,Los Angeles,4a63175e-2f2d-11eb-bf71-70b5e8b8edbb,14,2,Non Acute,InPatient,11029,...,691,6069,0,0,Medicare,2019-11-04 02:59:00,Nov,2019,,scoliosis
4,2847178,21,7,Miami,79430da9-2f32-11eb-87ff-70b5e8b8edbb,50,4,Acute,InPatient,12480,...,751,5659,1,0,Private Insurance,2016-06-11 17:28:00,Jun,2016,,flu


In [6]:
# View info to see what the column names and types are
patient_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4369400 entries, 0 to 4369399
Data columns (total 25 columns):
 #   Column                  Dtype         
---  ------                  -----         
 0   encounter_id            int64         
 1   hospital_id             int64         
 2   department_id           int64         
 3   city                    object        
 4   patient_id              object        
 5   patient_age             int64         
 6   risk_level              int64         
 7   acute_type              object        
 8   patient_category        object        
 9   doctor_id               int64         
 10  length_of_stay          int64         
 11  wait_time               int64         
 12  type_of_stay            object        
 13  treatment_cost          int64         
 14  claim_cost              int64         
 15  drug_cost               int64         
 16  hospital_expense        int64         
 17  follow_up               int64         
 18  re

## Data Preparation for AutoML

In [7]:
timeseries_df = patient_df[['city','date', 'wait_time']]
timeseries_df

Unnamed: 0,city,date,wait_time
0,Los Angeles,2016-06-28 17:46:00,31
1,Chicago,2019-06-18 20:47:00,38
2,Los Angeles,2017-08-10 23:47:00,35
3,Los Angeles,2019-11-04 02:59:00,42
4,Miami,2016-06-11 17:28:00,50
...,...,...,...
4369395,Miami,2018-11-11 20:06:00,41
4369396,Miami,2018-11-14 15:48:00,44
4369397,Miami,2018-11-07 05:06:00,43
4369398,Miami,2018-11-02 07:03:00,44


#### Remove time dimension from the date column

In [8]:
timeseries_df['date'] = pd.to_datetime(timeseries_df['date'].dt.date)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  timeseries_df['date'] = pd.to_datetime(timeseries_df['date'].dt.date)


In [9]:
timeseries_df.head()

Unnamed: 0,city,date,wait_time
0,Los Angeles,2016-06-28,31
1,Chicago,2019-06-18,38
2,Los Angeles,2017-08-10,35
3,Los Angeles,2019-11-04,42
4,Miami,2016-06-11,50


In [10]:
timeseries_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4369400 entries, 0 to 4369399
Data columns (total 3 columns):
 #   Column     Dtype         
---  ------     -----         
 0   city       object        
 1   date       datetime64[ns]
 2   wait_time  int64         
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 100.0+ MB


In [11]:
timeseries_df_grouped = timeseries_df.groupby(['city','date'])['wait_time'].mean().reset_index()
timeseries_df_grouped = timeseries_df_grouped.sort_values(['city','date']).reset_index(drop=True)
timeseries_df_grouped.head()

Unnamed: 0,city,date,wait_time
0,Anchorage,2015-12-17,37.0
1,Anchorage,2015-12-18,46.5
2,Anchorage,2015-12-19,41.666667
3,Anchorage,2015-12-20,39.0
4,Anchorage,2015-12-21,43.0


## Split Data based on Cities

In [12]:
city_wise_dfs = {}

cities = list(timeseries_df_grouped['city'].unique())
for city in cities:
    city_df = timeseries_df_grouped[timeseries_df_grouped['city'] == city]
    city_wise_dfs[city] = city_df[['date', 'wait_time']]
    
city_wise_dfs['Honolulu'].head()

Unnamed: 0,date,wait_time
3618,2015-12-16,36.0
3619,2015-12-17,39.25
3620,2015-12-18,45.0
3621,2015-12-19,37.727273
3622,2015-12-20,40.666667


## Prepare Training and Testing set

Since we plan on predicting whether patients would be readmitted in October, November or December, we split the training and testing data based on the date

#### Split data based on time

In [13]:
date_cutoff = pd.to_datetime('2020-10-01')

all_train_dfs = {}
for city, df in city_wise_dfs.items():
    train_df = df[df['date'] < date_cutoff]
    all_train_dfs[city] = train_df

all_train_dfs[city].head()

Unnamed: 0,date,wait_time
7238,2015-12-16,43.0
7239,2015-12-17,42.111111
7240,2015-12-18,40.444444
7241,2015-12-19,41.052632
7242,2015-12-20,40.5


In [14]:
all_test_dfs = {}
for city, df in city_wise_dfs.items():
    test_df = df[df['date'] >= date_cutoff]
    all_test_dfs[city] = test_df
    
all_test_dfs['Honolulu'].head()

Unnamed: 0,date,wait_time
5369,2020-10-01,42.799257
5370,2020-10-02,41.12
5371,2020-10-03,41.098425
5372,2020-10-04,42.310484
5373,2020-10-05,42.034483


#### Upload training and testing set to the Storage Account

In [15]:
import os

local_data_folder = 'wait_time_data/'
if not os.path.exists(local_data_folder):
    os.mkdir(local_data_folder)

base_train_file = 'wait_time_data_train_'
base_test_file = 'wait_time_data_test_'

local_files = []
for city, train_df in all_train_dfs.items():
    city_without_spaces = '-'.join(city.split(' '))
  
    # Save train file
    train_file = base_train_file + city_without_spaces + '.csv'
    train_df.to_csv(local_data_folder + train_file, index=False)
    local_files.append(local_data_folder + train_file)
    
    # Save test file
    test_file = base_test_file + city_without_spaces + '.csv'
    test_df = all_test_dfs[city]
    test_df.to_csv(local_data_folder + test_file, index=False)
    local_files.append(local_data_folder + test_file)


In [16]:
# Upload the data
print(local_files)

dstore.upload_files(
    files = local_files,
    relative_root = local_data_folder,
    target_path = '/',
    overwrite=True,
    show_progress=True
)

['wait_time_data/wait_time_data_train_Anchorage.csv', 'wait_time_data/wait_time_data_test_Anchorage.csv', 'wait_time_data/wait_time_data_train_Chicago.csv', 'wait_time_data/wait_time_data_test_Chicago.csv', 'wait_time_data/wait_time_data_train_Honolulu.csv', 'wait_time_data/wait_time_data_test_Honolulu.csv', 'wait_time_data/wait_time_data_train_Los-Angeles.csv', 'wait_time_data/wait_time_data_test_Los-Angeles.csv', 'wait_time_data/wait_time_data_train_Miami.csv', 'wait_time_data/wait_time_data_test_Miami.csv']
Uploading an estimated of 10 files
Uploading wait_time_data/wait_time_data_train_Anchorage.csv
Uploaded wait_time_data/wait_time_data_train_Anchorage.csv, 1 files out of an estimated total of 10
Uploading wait_time_data/wait_time_data_test_Anchorage.csv
Uploaded wait_time_data/wait_time_data_test_Anchorage.csv, 2 files out of an estimated total of 10
Uploading wait_time_data/wait_time_data_train_Chicago.csv
Uploaded wait_time_data/wait_time_data_train_Chicago.csv, 3 files out of 

$AZUREML_DATAREFERENCE_wait_time_prediction_store

"datastore.upload_files" is deprecated after version 1.0.69. Please use "FileDatasetFactory.upload_directory" instead. See Dataset API change notice at https://aka.ms/dataset-deprecation.


### Set up AutoML Experiment

#### Set the Data Types for each column. 
This needs to be done explicitly since some ID columns are automatically inferred as integers, when they should be treated as strings

In [17]:
from azureml.data import DataType

data_types = {
    'wait_time': DataType.to_long(),
    'date': DataType.to_datetime("%Y-%m-%d"),
}

print(len(data_types))

2


In [18]:
all_train_dfs.keys()


dict_keys(['Anchorage', 'Chicago', 'Honolulu', 'Los Angeles', 'Miami'])

#### Load Training data from Storage Blob as a TabularDataSet

In [19]:
all_train_datasets = {}
for city in all_train_dfs.keys():
    filepath = base_train_file + city_without_spaces + '.csv'

    datastore_path = [DataPath(dstore, filepath)]
    traindataset = Dataset.Tabular.from_delimited_files(path=datastore_path)
    traindataset.to_pandas_dataframe().info()
    all_train_datasets[city] = traindataset
    


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1751 entries, 0 to 1750
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date       1751 non-null   datetime64[ns]
 1   wait_time  1751 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 27.5 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1751 entries, 0 to 1750
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date       1751 non-null   datetime64[ns]
 1   wait_time  1751 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 27.5 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1751 entries, 0 to 1750
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date       1751 non-null   datetime64[ns]
 1   wait_time  1751 non-null   float64      

In [20]:
y_variable = "wait_time"

#### Setup Computer Instances

In [21]:
from azureml.core.compute import AmlCompute

compute = AmlCompute(ws, "health-cluster")

#### Configure the AutoML model and run it

In [22]:
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig

for city, traindataset in all_train_datasets.items():
    city_without_spaces = '-'.join(city.split(' '))
    experiment_name = 'Waittime-Forecasting-Experiment_' + city_without_spaces
    experiment = Experiment(ws, experiment_name)

    automl_config = AutoMLConfig(task = 'forecasting',
                         debug_log = 'automl_errors.log',
                         iteration_timeout_minutes = 15,
                         n_cross_validations=3,
                         experiment_timeout_minutes = 15,
                         label_column_name=y_variable,
                         time_column_name='date',
                         enable_early_stopping=True,
                         compute_target = compute,
                         training_data = traindataset,
                         model_explainability=True)

    training_run = experiment.submit(automl_config, show_output = True)

Submitting remote run.
No run_configuration provided, running on health-cluster with default configuration
Running on remote compute: health-cluster


Experiment,Id,Type,Status,Details Page,Docs Page
Waittime-Forecasting-Experiment_Anchorage,AutoML_77a6e913-2dd7-4256-9a91-2a55e6e74220,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation



Current status: ModelSelection. Beginning model selection.

********************************************************************************************
DATA GUARDRAILS: 

TYPE:         Time Series ID detection
STATUS:       PASSED
DESCRIPTION:  The data set was analyzed, and no duplicate time index were detected.
              Learn more about time-series forecasting configurations: https://aka.ms/AutomatedMLForecastingConfiguration

********************************************************************************************

TYPE:         Short series handling
STATUS:       PASSED
DESCRIPTION:  Automated ML detected enough data points for each series in the input data to continue with training.
              Learn more about short series handling: https://aka.ms/AutomatedMLShortSeriesHandling

********************************************************************************************

TYPE:         Frequency detection
STATUS:       PASSED
DESCRIPTION:  The time series was analyzed,

Experiment,Id,Type,Status,Details Page,Docs Page
Waittime-Forecasting-Experiment_Chicago,AutoML_0e222fce-c9d3-4d6e-95c4-0c3c1b70f621,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation



Current status: ModelSelection. Beginning model selection.

********************************************************************************************
DATA GUARDRAILS: 

TYPE:         Time Series ID detection
STATUS:       PASSED
DESCRIPTION:  The data set was analyzed, and no duplicate time index were detected.
              Learn more about time-series forecasting configurations: https://aka.ms/AutomatedMLForecastingConfiguration

********************************************************************************************

TYPE:         Short series handling
STATUS:       PASSED
DESCRIPTION:  Automated ML detected enough data points for each series in the input data to continue with training.
              Learn more about short series handling: https://aka.ms/AutomatedMLShortSeriesHandling

********************************************************************************************

TYPE:         Frequency detection
STATUS:       PASSED
DESCRIPTION:  The time series was analyzed,

Experiment,Id,Type,Status,Details Page,Docs Page
Waittime-Forecasting-Experiment_Honolulu,AutoML_9974eb2c-a84d-40b9-8919-733cab3ec8c5,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation



Current status: ModelSelection. Beginning model selection.

********************************************************************************************
DATA GUARDRAILS: 

TYPE:         Time Series ID detection
STATUS:       PASSED
DESCRIPTION:  The data set was analyzed, and no duplicate time index were detected.
              Learn more about time-series forecasting configurations: https://aka.ms/AutomatedMLForecastingConfiguration

********************************************************************************************

TYPE:         Short series handling
STATUS:       PASSED
DESCRIPTION:  Automated ML detected enough data points for each series in the input data to continue with training.
              Learn more about short series handling: https://aka.ms/AutomatedMLShortSeriesHandling

********************************************************************************************

TYPE:         Frequency detection
STATUS:       PASSED
DESCRIPTION:  The time series was analyzed,

Experiment,Id,Type,Status,Details Page,Docs Page
Waittime-Forecasting-Experiment_Los-Angeles,AutoML_c9d3dc5f-65ce-4259-8795-24cd413833dd,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation



Current status: ModelSelection. Beginning model selection.

********************************************************************************************
DATA GUARDRAILS: 

TYPE:         Time Series ID detection
STATUS:       PASSED
DESCRIPTION:  The data set was analyzed, and no duplicate time index were detected.
              Learn more about time-series forecasting configurations: https://aka.ms/AutomatedMLForecastingConfiguration

********************************************************************************************

TYPE:         Short series handling
STATUS:       PASSED
DESCRIPTION:  Automated ML detected enough data points for each series in the input data to continue with training.
              Learn more about short series handling: https://aka.ms/AutomatedMLShortSeriesHandling

********************************************************************************************

TYPE:         Frequency detection
STATUS:       PASSED
DESCRIPTION:  The time series was analyzed,

Experiment,Id,Type,Status,Details Page,Docs Page
Waittime-Forecasting-Experiment_Miami,AutoML_5116e1a1-1d01-4446-8cad-57f985c7e117,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation



Current status: ModelSelection. Beginning model selection.

********************************************************************************************
DATA GUARDRAILS: 

TYPE:         Time Series ID detection
STATUS:       PASSED
DESCRIPTION:  The data set was analyzed, and no duplicate time index were detected.
              Learn more about time-series forecasting configurations: https://aka.ms/AutomatedMLForecastingConfiguration

********************************************************************************************

TYPE:         Short series handling
STATUS:       PASSED
DESCRIPTION:  Automated ML detected enough data points for each series in the input data to continue with training.
              Learn more about short series handling: https://aka.ms/AutomatedMLShortSeriesHandling

********************************************************************************************

TYPE:         Frequency detection
STATUS:       PASSED
DESCRIPTION:  The time series was analyzed,

#### Retrieve model to predict the test set

In [25]:
autoMLRunIds = {
    'Miami': 'AutoML_5116e1a1-1d01-4446-8cad-57f985c7e117',
    'Los Angeles': 'AutoML_c9d3dc5f-65ce-4259-8795-24cd413833dd',
    'Honolulu': 'AutoML_9974eb2c-a84d-40b9-8919-733cab3ec8c5',
    'Chicago': 'AutoML_0e222fce-c9d3-4d6e-95c4-0c3c1b70f621',
    'Anchorage': 'AutoML_77a6e913-2dd7-4256-9a91-2a55e6e74220',    
}

In [26]:
from azureml.train.automl.run import AutoMLRun

all_automl_runs = {}
for city, autoMLRunId in autoMLRunIds.items():
    city_without_spaces = '-'.join(city.split(' '))
    experiment_name = 'Waittime-Forecasting-Experiment_' + city_without_spaces

    experiment = Experiment(workspace = ws, name = experiment_name)
    automl_run = AutoMLRun(experiment, autoMLRunId, outputs = None)
    display(automl_run)
    all_automl_runs[city] = automl_run

Experiment,Id,Type,Status,Details Page,Docs Page
Waittime-Forecasting-Experiment_Miami,AutoML_5116e1a1-1d01-4446-8cad-57f985c7e117,automl,Completed,Link to Azure Machine Learning studio,Link to Documentation


Experiment,Id,Type,Status,Details Page,Docs Page
Waittime-Forecasting-Experiment_Los-Angeles,AutoML_c9d3dc5f-65ce-4259-8795-24cd413833dd,automl,Completed,Link to Azure Machine Learning studio,Link to Documentation


Experiment,Id,Type,Status,Details Page,Docs Page
Waittime-Forecasting-Experiment_Honolulu,AutoML_9974eb2c-a84d-40b9-8919-733cab3ec8c5,automl,Completed,Link to Azure Machine Learning studio,Link to Documentation


Experiment,Id,Type,Status,Details Page,Docs Page
Waittime-Forecasting-Experiment_Chicago,AutoML_0e222fce-c9d3-4d6e-95c4-0c3c1b70f621,automl,Completed,Link to Azure Machine Learning studio,Link to Documentation


Experiment,Id,Type,Status,Details Page,Docs Page
Waittime-Forecasting-Experiment_Anchorage,AutoML_77a6e913-2dd7-4256-9a91-2a55e6e74220,automl,Completed,Link to Azure Machine Learning studio,Link to Documentation


In [27]:
all_models = {}

for city, automl_run in all_automl_runs.items():
    best_run, fitted_model = automl_run.get_output()
    # print(fitted_model.steps)
    model_name = best_run.properties['model_name']
    print(model_name)
    all_models[city] = fitted_model

AutoML5116e1a111
AutoMLc9d3dc5f61
AutoML9974eb2ca1
AutoML0e222fcec13
AutoML77a6e91321


In [28]:
all_models['Honolulu']

ForecastingPipelineWrapper(pipeline=Pipeline(memory=None,
                                             steps=[('timeseriestransformer',
                                                     TimeSeriesTransformer(country_or_region=None, drop_column_names=[], featurization_config=FeaturizationConfig(blocked_transformers=None, column_purposes=None, dataset_language=None, prediction_transform_type=None, transformer_params=None), force_time_index_features=Non...
                                                     SeasonalNaive(timeseries_param_dict={'time_column_name': 'date', 'grain_column_names': None, 'target_column_name': 'wait_time', 'drop_column_names': [], 'overwrite_columns': True, 'dropna': False, 'transform_dictionary': {'min': '_automl_target_col', 'max': '_automl_target_col', 'mean': '_automl_target_col'}, 'max_horizon': 1, 'origin_time_colname': 'origin', 'country_or_region': None, 'n_cross_validations': 3, 'short_series_handling': True, 'max_cores_per_iteration': 1, 'feature_l

#### Upload predictions to storage account

The test_df also contains the y_variable which needs to be dropped

In [29]:
X_test_df = pd.DataFrame({'date': pd.date_range(start='2020-10-01', end='2020-12-31')})
X_test_df

Unnamed: 0,date
0,2020-10-01
1,2020-10-02
2,2020-10-03
3,2020-10-04
4,2020-10-05
...,...
87,2020-12-27
88,2020-12-28
89,2020-12-29
90,2020-12-30


In [31]:
all_predictions = {}
for city, fitted_model in all_models.items():
    predictions = fitted_model.forecast(X_test_df)
    display(predictions)
    all_predictions[city] = predictions

(array([41.59753086, 40.74111675, 40.5495283 , 41.03846154, 41.84827586,
        38.55882353, 41.41666667, 41.59753086, 40.74111675, 40.5495283 ,
        41.03846154, 41.84827586, 38.55882353, 41.41666667, 41.59753086,
        40.74111675, 40.5495283 , 41.03846154, 41.84827586, 38.55882353,
        41.41666667, 41.59753086, 40.74111675, 40.5495283 , 41.03846154,
        41.84827586, 38.55882353, 41.41666667, 41.59753086, 40.74111675,
        40.5495283 , 41.03846154, 41.84827586, 38.55882353, 41.41666667,
        41.59753086, 40.74111675, 40.5495283 , 41.03846154, 41.84827586,
        38.55882353, 41.41666667, 41.59753086, 40.74111675, 40.5495283 ,
        41.03846154, 41.84827586, 38.55882353, 41.41666667, 41.59753086,
        40.74111675, 40.5495283 , 41.03846154, 41.84827586, 38.55882353,
        41.41666667, 41.59753086, 40.74111675, 40.5495283 , 41.03846154,
        41.84827586, 38.55882353, 41.41666667, 41.59753086, 40.74111675,
        40.5495283 , 41.03846154, 41.84827586, 38.5

(array([41.59753086, 40.74111675, 40.5495283 , 41.03846154, 41.84827586,
        38.55882353, 41.41666667, 41.59753086, 40.74111675, 40.5495283 ,
        41.03846154, 41.84827586, 38.55882353, 41.41666667, 41.59753086,
        40.74111675, 40.5495283 , 41.03846154, 41.84827586, 38.55882353,
        41.41666667, 41.59753086, 40.74111675, 40.5495283 , 41.03846154,
        41.84827586, 38.55882353, 41.41666667, 41.59753086, 40.74111675,
        40.5495283 , 41.03846154, 41.84827586, 38.55882353, 41.41666667,
        41.59753086, 40.74111675, 40.5495283 , 41.03846154, 41.84827586,
        38.55882353, 41.41666667, 41.59753086, 40.74111675, 40.5495283 ,
        41.03846154, 41.84827586, 38.55882353, 41.41666667, 41.59753086,
        40.74111675, 40.5495283 , 41.03846154, 41.84827586, 38.55882353,
        41.41666667, 41.59753086, 40.74111675, 40.5495283 , 41.03846154,
        41.84827586, 38.55882353, 41.41666667, 41.59753086, 40.74111675,
        40.5495283 , 41.03846154, 41.84827586, 38.5

(array([39.50277501, 39.44715901, 39.49383374, 39.58764592, 39.62967161,
        39.5886591 , 39.58191926, 39.5839655 , 39.5290955 , 39.56534011,
        39.57522795, 39.68361674, 39.64012458, 39.65640139, 39.69760473,
        39.74888407, 39.75531225, 39.75531225, 39.75799218, 39.74041153,
        39.68696064, 39.73163377, 39.7608042 , 39.72160895, 39.84643256,
        39.76739269, 39.71595438, 39.78547181, 40.08535402, 40.29676234,
        40.35799691, 38.91534774, 38.7440626 , 38.75466041, 38.65130793,
        38.66717056, 38.83854903, 38.77400059, 38.72276974, 38.76794473,
        38.77854254, 38.64263648, 38.63699261, 38.70333745, 38.70333745,
        38.72889023, 38.89086791, 38.90613519, 38.81560581, 38.8862595 ,
        38.97043509, 38.9563225 , 38.96365927, 39.02230009, 39.06189335,
        38.98099025, 38.93229951, 39.16980839, 39.34398352, 40.55680627,
        40.47905207, 40.8755953 , 40.82140348, 40.74430117, 40.76312628,
        40.9216368 , 40.95990889, 40.99870593, 40.9

In [33]:
predicted_dfs = {}

for city, predictions in all_predictions.items():
    df = X_test_df.copy()
    df['wait_time'] = predictions[0]
    display(df)
    predicted_dfs[city] = df

Unnamed: 0,date,wait_time
0,2020-10-01,41.60
1,2020-10-02,40.74
2,2020-10-03,40.55
3,2020-10-04,41.04
4,2020-10-05,41.85
...,...,...
87,2020-12-27,41.04
88,2020-12-28,41.85
89,2020-12-29,38.56
90,2020-12-30,41.42


Unnamed: 0,date,wait_time
0,2020-10-01,41.60
1,2020-10-02,40.74
2,2020-10-03,40.55
3,2020-10-04,41.04
4,2020-10-05,41.85
...,...,...
87,2020-12-27,41.04
88,2020-12-28,41.85
89,2020-12-29,38.56
90,2020-12-30,41.42


Unnamed: 0,date,wait_time
0,2020-10-01,41.60
1,2020-10-02,40.74
2,2020-10-03,40.55
3,2020-10-04,41.04
4,2020-10-05,41.85
...,...,...
87,2020-12-27,41.04
88,2020-12-28,41.85
89,2020-12-29,38.56
90,2020-12-30,41.42


Unnamed: 0,date,wait_time
0,2020-10-01,39.50
1,2020-10-02,39.45
2,2020-10-03,39.49
3,2020-10-04,39.59
4,2020-10-05,39.63
...,...,...
87,2020-12-27,40.69
88,2020-12-28,40.72
89,2020-12-29,40.59
90,2020-12-30,40.69


Unnamed: 0,date,wait_time
0,2020-10-01,41.60
1,2020-10-02,40.74
2,2020-10-03,40.55
3,2020-10-04,41.04
4,2020-10-05,41.85
...,...,...
87,2020-12-27,41.04
88,2020-12-28,41.85
89,2020-12-29,38.56
90,2020-12-30,41.42


#### Upload predictions to storage account

In [34]:
final_dfs = []

for city, predicted_df in predicted_dfs.items():
    print(city)
    train_df = all_train_dfs[city]
    final_df = pd.concat([train_df, predicted_df])
    city_list = [city]*len(final_df)
    final_df['city'] = city_list
    display(final_df)
    final_dfs.append(final_df)

Miami


Unnamed: 0,date,wait_time,city
7238,2015-12-16,43.00,Miami
7239,2015-12-17,42.11,Miami
7240,2015-12-18,40.44,Miami
7241,2015-12-19,41.05,Miami
7242,2015-12-20,40.50,Miami
...,...,...,...
87,2020-12-27,41.04,Miami
88,2020-12-28,41.85,Miami
89,2020-12-29,38.56,Miami
90,2020-12-30,41.42,Miami


Los Angeles


Unnamed: 0,date,wait_time,city
5428,2015-12-16,39.00,Los Angeles
5429,2015-12-17,42.50,Los Angeles
5430,2015-12-18,41.00,Los Angeles
5431,2015-12-19,40.38,Los Angeles
5432,2015-12-20,38.80,Los Angeles
...,...,...,...
87,2020-12-27,41.04,Los Angeles
88,2020-12-28,41.85,Los Angeles
89,2020-12-29,38.56,Los Angeles
90,2020-12-30,41.42,Los Angeles


Honolulu


Unnamed: 0,date,wait_time,city
3618,2015-12-16,36.00,Honolulu
3619,2015-12-17,39.25,Honolulu
3620,2015-12-18,45.00,Honolulu
3621,2015-12-19,37.73,Honolulu
3622,2015-12-20,40.67,Honolulu
...,...,...,...
87,2020-12-27,41.04,Honolulu
88,2020-12-28,41.85,Honolulu
89,2020-12-29,38.56,Honolulu
90,2020-12-30,41.42,Honolulu


Chicago


Unnamed: 0,date,wait_time,city
1808,2015-12-16,45.33,Chicago
1809,2015-12-17,41.33,Chicago
1810,2015-12-18,42.75,Chicago
1811,2015-12-19,42.07,Chicago
1812,2015-12-20,42.62,Chicago
...,...,...,...
87,2020-12-27,40.69,Chicago
88,2020-12-28,40.72,Chicago
89,2020-12-29,40.59,Chicago
90,2020-12-30,40.69,Chicago


Anchorage


Unnamed: 0,date,wait_time,city
0,2015-12-17,37.00,Anchorage
1,2015-12-18,46.50,Anchorage
2,2015-12-19,41.67,Anchorage
3,2015-12-20,39.00,Anchorage
4,2015-12-21,43.00,Anchorage
...,...,...,...
87,2020-12-27,41.04,Anchorage
88,2020-12-28,41.85,Anchorage
89,2020-12-29,38.56,Anchorage
90,2020-12-30,41.42,Anchorage


In [35]:
full_df = pd.concat(final_dfs)
print(full_df.shape)
full_df.to_csv(local_data_folder+'wait_time_forecasted.csv',index=False)

(9214, 3)


In [36]:
# Upload the data
local_files = [local_data_folder + 'wait_time_forecasted.csv']
print(local_files)

dstore.upload_files(
    files = local_files,
    relative_root = local_data_folder,
    target_path = '/',
    overwrite=True,
    show_progress=True
)

['wait_time_data/wait_time_forecasted.csv']
Uploading an estimated of 1 files
Uploading wait_time_data/wait_time_forecasted.csv
Uploaded wait_time_data/wait_time_forecasted.csv, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_wait_time_prediction_store