<p style="font-size:25px; color:black;"><u><i><b>Predicting the number of customers likely to visit different departments in a store</b></i></u></p>
<p style="font-size:16px; color:#117d30;">
    Time series forecasting is the use of a model to predict future values based on previously observed values.
In this case, the AutoML feature of Azure Synapse uses more than 25 time series forecasting machine learning algorithms to predict how many customers are likely to visit different departments in a store.
</p>
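
<p style="font-size:16px; color:#117d30;">
As a minimal illustration of predicting future values from previously observed ones, the sketch below uses a seasonal-naive baseline that repeats the value seen one week earlier. It is not one of the AutoML models; the counts and the helper name <code>seasonal_naive_forecast</code> are made up for this example.
</p>

```python
import pandas as pd

def seasonal_naive_forecast(history: pd.Series, horizon: int, season_length: int = 7) -> pd.Series:
    """Forecast each future day with the value observed one season earlier."""
    last_season = history.iloc[-season_length:].to_numpy()
    # Repeat the last observed season until the horizon is covered
    values = [last_season[i % season_length] for i in range(horizon)]
    future_index = pd.date_range(history.index[-1] + pd.Timedelta(days=1), periods=horizon, freq="D")
    return pd.Series(values, index=future_index, name=history.name)

# Hypothetical daily visit counts for two weeks
history = pd.Series(
    [12, 15, 14, 18, 20, 30, 28, 11, 16, 13, 19, 21, 31, 27],
    index=pd.date_range("2015-11-01", periods=14, freq="D"),
    name="Accessories_count",
)
forecast = seasonal_naive_forecast(history, horizon=3)
print(forecast)
```

<p style="font-size:16px; color:#117d30;">
A baseline like this is a useful yardstick: a trained model should at least beat it on held-out data.
</p>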
<p style="font-size:15px; color:#117d30;">
Note:
</p>
<p style="font-size:15px; color:#117d30;">
 The default language of this notebook is Scala; cells that start with %%pyspark run Python, and the two languages interoperate.
</p>

<p style="font-size:15px; color:#117d30;">
    <u> Abstract: </u>
</p>
<p style="font-size:16px; color:#117d30;">
1) Ingest data from the Azure Synapse Data Storage account using PySpark.
</p>
<p style="font-size:16px; color:#117d30;">
2) Exploratory Data Analysis 
</p>
<p style="font-size:15px; color:#117d30;">
3) Train more than 25 time series forecasting machine learning algorithms.
</p>
<p style="font-size:15px; color:#117d30;">
4) Predict the number of customers likely to visit different departments in a store by choosing the best-performing machine learning algorithm.
</p>


## Introduction
<p style="font-size:16px; color:#117d30;">


### This notebook shows how to:
<p style="font-size:16px; color:#117d30;">
1. Create an experiment using an existing workspace

<p style="font-size:16px; color:#117d30;">
2. Configure AutoML using 'AutoMLConfig'

<p style="font-size:16px; color:#117d30;">
3. Train the model 

<p style="font-size:16px; color:#117d30;">
4. Explore the engineered features and results

<p style="font-size:16px; color:#117d30;">
5. Configure and remotely run AutoML for a time-series model with lag and rolling window features

<p style="font-size:16px; color:#117d30;">
6. Run and explore the forecast

<p style="font-size:16px; color:#117d30;">
7. Register the model
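
<p style="font-size:16px; color:#117d30;">
The lag and rolling window features mentioned in step 5 can be pictured with plain pandas. This is only a sketch of the idea, not AutoML's own feature engineering; the counts and the column names <code>lag_1</code> and <code>rolling_mean_3</code> are hypothetical.
</p>

```python
import pandas as pd

# Hypothetical daily counts; the column name mirrors the notebook's data
visits = pd.DataFrame({
    "Date": pd.date_range("2015-11-01", periods=6, freq="D"),
    "Accessories_count": [10, 12, 14, 16, 18, 20],
})

# Lag feature: the value observed one day earlier
visits["lag_1"] = visits["Accessories_count"].shift(1)
# Rolling window feature: mean of the previous three days
# (shifted first so the current day's value is not included)
visits["rolling_mean_3"] = visits["Accessories_count"].shift(1).rolling(window=3).mean()
print(visits)
```

<p style="font-size:16px; color:#117d30;">
The first rows contain missing values because no earlier observations exist for them yet.
</p>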





### Importing required libraries: azureml, pandas, pyspark, and other supporting libraries.



In [3]:
%%pyspark
from azureml.train.automl import AutoMLConfig
# from azureml.widgets import RunDetails
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl.run import AutoMLRun
from sklearn.metrics import mean_squared_error
import math
from pyspark.sql.window import Window
from azureml.core.webservice import AciWebservice
from azureml.core.model import InferenceConfig
from azureml.core.model import Model
from azureml.core.webservice import Webservice
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.environment import Environment

import pandas as pd 
import datetime
import matplotlib.pyplot as plt
import numpy as np 
import seaborn as sns 
import azureml.train.automl.runtime
import logging
import os, tempfile
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark import SparkContext
os.environ['AZURE_SERVICE']="Microsoft.ProjectArcadia"

## *Connecting to Azure Synapse Data Warehouse*
<p style="font-size:16px; color:#117d30;">
    A connection to Azure Synapse Data Warehouse is initiated and the required data is ingested for processing.
    The warehouse is connected with a single line of code. To generate it, point to the actions of a table, click on a new notebook, and then click on "Load to DataFrame".  </p>
   <p style="font-size:16px; color:#117d30;"> After the necessary details are provided, the required data is loaded as a Spark DataFrame.
Registering that DataFrame as a temporary view in Scala then makes it available to the Python cells.
</p>


In [None]:
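val df = spark.read.sqlanalytics("AzureSynapseDW.dbo.department_visit_customer")
// Register a temp view so the Python cells can access the DataFrame created in Scala.
// createOrReplaceTempView lets this cell be re-run without a "view already exists" error.
df.createOrReplaceTempView("df")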

In [None]:

display(df)

# Exploratory Data Analysis

<p style="font-size:16px; color:#117d30;">
The goal of performing exploratory data analysis is to understand the underlying patterns and correlations among features in the data. 


In [None]:
%%pyspark
# Load the DataFrame that the Scala cell registered as temp view "df"
df = sqlContext.table("df")
department_visit_data = df.select("*").toPandas()

# Parse the dates once and derive the calendar fields from the parsed values
visit_dates = pd.to_datetime(department_visit_data['Date'])
department_visit_data['Date'] = visit_dates.dt.strftime('%m/%d/%y')
department_visit_data['Month'] = visit_dates.dt.strftime('%m')
department_visit_data['DayOfMonth'] = visit_dates.dt.strftime('%d')
department_visit_data['Year'] = visit_dates.dt.strftime('%y')
department_visit_data['DayOfWeek'] = visit_dates.dt.strftime('%a')

# Convert the department count columns to numeric types
count_columns = ['Accessories_count','Entertainment_count','Gaming','Kids','Mens','Phone_and_GPS','Womens']
department_visit_data[count_columns] = department_visit_data[count_columns].apply(pd.to_numeric)

#display(department_visit_data)
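
<p style="font-size:16px; color:#117d30;">
One quick way to look at correlations among features is a pairwise correlation matrix. The sketch below uses a small made-up sample with three of the department columns; it is an illustration, not output from the actual data.
</p>

```python
import pandas as pd

# Hypothetical sample with three of the department columns
sample = pd.DataFrame({
    "Gaming": [5, 7, 9, 11, 13],
    "Kids": [14, 18, 22, 26, 30],
    "Mens": [20, 18, 16, 14, 12],
})

# Pearson correlation between every pair of columns
correlations = sample.corr()
print(correlations.round(2))
```

<p style="font-size:16px; color:#117d30;">
Values near 1 or -1 indicate departments whose traffic moves together or in opposite directions; values near 0 indicate little linear relationship.
</p>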

##  Deriving insights from customer visits data  

<p style="font-size:16px; color:#117d30;">
1. Heat Map: The intensity of the color indicates the number of customers visiting the department on that particular day of the week. It gives a quick view of how traffic is distributed across days and departments. From the graph, we can infer that more customers visit the Entertainment department on Wednesdays, Thursdays and Fridays, and that foot traffic in the Phone_and_GPS department is lower on Mondays and Fridays.


In [None]:
%%pyspark
# Sum the visits per weekday for each department (groupby sorts the weekday labels alphabetically)
department_columns = ['Mens','Womens','Kids','Gaming','Entertainment_count','Accessories_count','Phone_and_GPS']
df_dow = department_visit_data.groupby('DayOfWeek')[department_columns].sum()

# Shorten the two column names used as heat map labels
df_dow = df_dow.rename(columns={'Entertainment_count': 'Entertainment', 'Accessories_count': 'Accessories'})


df_dow.head(10)

sns.set()
plt.rcParams['font.size'] = 20
bg_color = (0.88,0.85,0.95)
plt.rcParams['figure.facecolor'] = bg_color
plt.rcParams['axes.facecolor'] = bg_color
fig, ax = plt.subplots(1)
cmap = sns.diverging_palette(10, 150, n=2, as_cmap=True)
#cmap = sns.color_palette("hls", 3)

p = sns.heatmap(df_dow,
                cmap=cmap,
                annot=True,
                fmt="d",
                annot_kws={'size':16},
                ax=ax)
plt.xlabel('Category')
plt.ylabel('Day Of Week')
ax.set_ylim((0,7))
plt.text(5,7.4, "Heat Map", fontsize = 25, color='Black', fontstyle='italic')
 
plt.show()
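
<p style="font-size:16px; color:#117d30;">
Because the weekday labels are text, grouping sorts them alphabetically (Fri, Mon, Sat, ...), so the heat map rows do not follow the calendar week. A possible adjustment, sketched here on made-up totals and assuming English weekday abbreviations, is to reindex the rows before plotting. The names <code>df_dow_example</code> and <code>weekday_order</code> are hypothetical.
</p>

```python
import pandas as pd

# Hypothetical weekday totals in the alphabetical order produced by groupby
df_dow_example = pd.DataFrame(
    {"Mens": [40, 25, 55, 60, 35, 30, 28]},
    index=["Fri", "Mon", "Sat", "Sun", "Thu", "Tue", "Wed"],
)

weekday_order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
# reindex rearranges the rows to follow the given label order
df_dow_ordered = df_dow_example.reindex(weekday_order)
print(df_dow_ordered)
```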


# Data Manipulation  
<p style="font-size:16px; color:#117d30;">
1. Converting date to a specific format and making date fields relevant for prediction.

<p style="font-size:16px; color:#117d30;">
2. Converting the data type of the columns to numeric before being passed as input to the model.


In [None]:
%%pyspark
department_visit_data = df.select("*").toPandas()
department_visit_data['Date'] = pd.to_datetime(department_visit_data['Date']).dt.strftime('%Y-%m-%d')

department_visit_data[['Accessories_count','Entertainment_count','Gaming','Kids','Mens','Phone_and_GPS','Womens']] = department_visit_data[['Accessories_count','Entertainment_count','Gaming','Kids','Mens','Phone_and_GPS','Womens']].apply(pd.to_numeric)

grouped_data = department_visit_data.groupby('Date', as_index=False).sum()

display(grouped_data)
# Number of rows (one per date) in the aggregated data
total_rows = len(grouped_data)
print(total_rows)

In [None]:
%%pyspark
accessories_data = grouped_data[['Date','Accessories_count']]
display(accessories_data)

In [None]:
%%pyspark
# Number of rows available for the Accessories forecast
total_rows = len(accessories_data)
print(total_rows)

## Split data into train and test set


In [11]:
%%pyspark
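train_data = pd.DataFrame()
test_data = pd.DataFrame()

# Chronological split: the first 55 rows train the model, the remaining rows are held out for testing.
# If there are 55 rows or fewer, both frames stay empty.
if accessories_data.shape[0] > 55:
    # Copy the slices so later in-place changes do not touch accessories_data
    train_data = accessories_data[:55].copy()
    test_data = accessories_data[55:].copy()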
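
<p style="font-size:16px; color:#117d30;">
For time series, the split has to respect time order so that the model is never trained on dates later than those it is tested on. A small helper along these lines makes that explicit; <code>chronological_split</code> and the sample frame are hypothetical and only sketch the idea.
</p>

```python
import pandas as pd

def chronological_split(frame: pd.DataFrame, train_size: int):
    """Return (train, test) where train holds the earliest train_size rows."""
    if len(frame) <= train_size:
        raise ValueError(f"Need more than {train_size} rows, got {len(frame)}")
    # ISO-formatted date strings sort in calendar order
    ordered = frame.sort_values("Date").reset_index(drop=True)
    return ordered.iloc[:train_size].copy(), ordered.iloc[train_size:].copy()

# Hypothetical data: 60 consecutive days
example = pd.DataFrame({
    "Date": pd.date_range("2015-10-01", periods=60, freq="D").strftime("%Y-%m-%d").tolist(),
    "Accessories_count": list(range(60)),
})
train_example, test_example = chronological_split(example, train_size=55)
print(len(train_example), len(test_example))
```

<p style="font-size:16px; color:#117d30;">
Raising an error when there are too few rows avoids silently passing an empty training set to the experiment.
</p>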

In [None]:
%%pyspark
display(test_data)

## Train

<p style="font-size:16px; color:#117d30;">
1. Instantiate an AutoMLConfig object. 
<p style="font-size:16px; color:#117d30;">
2. The configuration below defines the settings and data used to run the experiment. 


## Set AutoML Configuration Parameters

<p style="font-size:16px; color:#117d30;">
    The forecast horizon is the number of periods into the future that the model should predict. 

<p style="font-size:16px; color:#117d30;">
    It is generally recommended that users set forecast horizons to fewer than 100 time periods.

<p style="font-size:16px; color:#117d30;">
    Furthermore, AutoML's memory use and computation time increase in proportion to the length of the horizon, so consider carefully how this value is set.

<p style="font-size:16px; color:#117d30;">
    If a long horizon forecast really is necessary, consider aggregating the series to a coarser time scale.
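
<p style="font-size:16px; color:#117d30;">
Aggregating to a coarser time scale can be sketched with pandas <code>resample</code>. In this made-up example, four weeks of daily counts become four weekly totals, so a 28-step daily horizon turns into a 4-step weekly one. The data values are hypothetical.
</p>

```python
import pandas as pd

# Hypothetical daily series covering four full weeks (Monday 2015-11-02 to Sunday 2015-11-29)
daily = pd.DataFrame({
    "Date": pd.date_range("2015-11-02", periods=28, freq="D"),
    "Accessories_count": [10] * 28,
})

# Sum the daily counts into weeks ending on Sunday
weekly = (
    daily.set_index("Date")["Accessories_count"]
    .resample("W-SUN")
    .sum()
    .reset_index()
)
print(weekly)
```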


In [13]:
%%pyspark
automl_settings = {
   'time_column_name':'Date',
   'max_horizon': 25
}

In [14]:
%%pyspark
automl_config = AutoMLConfig(
                            # Forecasting task for time-series data
                            task='forecasting',
                            # Measure for evaluating the performance of the models
                            primary_metric='normalized_root_mean_squared_error',
                            # Maximum amount of time in minutes that the experiment can take before it terminates
                            experiment_timeout_minutes=15,
                            # Flag to enable early termination if the score is not improving in the short term
                            enable_early_stopping=True,
                            training_data=train_data,
                            label_column_name='Accessories_count',
                            # Rolling origin validation is used to split time series in a temporally consistent way
                            n_cross_validations=4,
                            enable_ensembling=False,
                            verbosity=logging.INFO,
                            **automl_settings)
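
<p style="font-size:16px; color:#117d30;">
Rolling origin validation can be pictured as a series of folds in which the training window grows and the validation window always lies after it in time. The generator below is a simplified sketch of that idea; the fold boundaries that AutoML actually uses may differ, and <code>rolling_origin_splits</code> and the validation horizon of 5 rows are assumptions made for illustration.
</p>

```python
def rolling_origin_splits(n_rows: int, n_splits: int, horizon: int):
    """Yield (train_indices, validation_indices) with an expanding training window."""
    for fold in range(n_splits):
        # The origin moves forward by one horizon per fold; the last fold ends at n_rows
        origin = n_rows - (n_splits - fold) * horizon
        if origin <= 0:
            raise ValueError("Not enough rows for the requested splits and horizon")
        yield list(range(origin)), list(range(origin, origin + horizon))

# 55 training rows, 4 folds, hypothetical validation horizon of 5 rows
for train_idx, valid_idx in rolling_origin_splits(n_rows=55, n_splits=4, horizon=5):
    print(f"train rows 0-{train_idx[-1]}, validate rows {valid_idx[0]}-{valid_idx[-1]}")
```

<p style="font-size:16px; color:#117d30;">
Unlike shuffled k-fold cross-validation, no fold validates on rows that precede its training data.
</p>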

In [15]:
%%pyspark
subscription_id='49d66a68-7c00-43c3-93ae-602ee60e1eb6'
resource_group='CDP-VISION-DEMO-RG'
workspace_name='Auto-ML-2'
ws = Workspace(subscription_id = subscription_id, resource_group = resource_group, workspace_name = workspace_name)
ws.write_config()
ws = Workspace.from_config()
experiment = Experiment(ws, "Department_Visit_Count_20thFeb")

## Run The Experiment
<p style="font-size:16px; color:#117d30;">
Automated ML runs more than 25 machine learning algorithms and ranks them according to performance.


In [None]:
%%pyspark
local_run = experiment.submit(automl_config, show_output=True)

## Retrieve the Best Model


In [17]:
%%pyspark
best_run, fitted_model = local_run.get_output()

In [None]:
%%pyspark
print(best_run)

In [None]:
%%pyspark
print(fitted_model)

# Evaluate the Model Performance
<p style="font-size:16px; color:#117d30;">Root Mean Squared Error (RMSE) and a mean absolute percentage error are used here for evaluation.</p>


In [20]:
%%pyspark
test_labels = test_data.pop("Accessories_count").values
predict_labels = fitted_model.predict(test_data)
actual_labels = test_labels.flatten()

In [None]:
%%pyspark
rmse = math.sqrt(mean_squared_error(actual_labels,predict_labels))
rmse
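
<p style="font-size:16px; color:#117d30;">
RMSE is the square root of the average squared difference between actual and predicted values. The worked example below spells out the calculation on made-up numbers so that each step is visible.
</p>

```python
import math
import numpy as np

# Hypothetical actual and predicted counts
actual = np.array([10.0, 12.0, 14.0, 16.0])
predicted = np.array([11.0, 12.0, 12.0, 19.0])

# Square the errors, average them, then take the square root
squared_errors = (actual - predicted) ** 2
rmse_manual = math.sqrt(squared_errors.mean())
print(rmse_manual)
```

<p style="font-size:16px; color:#117d30;">
Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones.
</p>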

In [None]:
%%pyspark
sum_actuals = sum_errors = 0

for actual_val, predict_val in zip(actual_labels, predict_labels):
    # Absolute error for this observation
    abs_error = abs(actual_val - predict_val)

    sum_errors = sum_errors + abs_error
    sum_actuals = sum_actuals + actual_val

# Note: this divides the total absolute error by the total of the actuals
# (a volume-weighted variant of MAPE), not the mean of per-observation percentage errors.
mean_abs_percent_error = sum_errors / sum_actuals
print("Model MAPE:")
print(mean_abs_percent_error)
print()
print("Model Accuracy:")
print(1 - mean_abs_percent_error)
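
<p style="font-size:16px; color:#117d30;">
The cell above divides the sum of absolute errors by the sum of actual values. That ratio of sums generally differs from the mean of the per-observation percentage errors, which is the more common definition of MAPE. The comparison below uses made-up numbers to show the difference; the variable names are hypothetical.
</p>

```python
import numpy as np

# Hypothetical actual and predicted counts
actual = np.array([10.0, 20.0, 40.0])
predicted = np.array([12.0, 18.0, 40.0])

abs_errors = np.abs(actual - predicted)
# Ratio of sums: total absolute error divided by total actual volume
ratio_of_sums = abs_errors.sum() / actual.sum()
# Mean of ratios: average of the per-observation percentage errors
mean_of_ratios = (abs_errors / actual).mean()
print(ratio_of_sums, mean_of_ratios)
```

<p style="font-size:16px; color:#117d30;">
The ratio of sums gives more weight to high-traffic days and stays defined when an individual actual value is zero, while the mean of ratios treats every day equally.
</p>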

In [None]:
%%pyspark
predicted_value = fitted_model.predict(test_data)
actual_value = test_labels.flatten() 
actual_value = actual_value.tolist()
predicted_value = predicted_value.tolist()
output_df= pd.DataFrame({'actual_value':actual_value,'predicted_value':predicted_value})
output_df['Error_Rate %'] = 100*((output_df['actual_value']-output_df['predicted_value'])/(output_df['actual_value']))
display(output_df)

In [None]:
%%pyspark
future_date =  pd.date_range(start='2015-12-1', end='2015-12-5')
future_data = pd.DataFrame({'Date':future_date, 'Accessories_count':0})
display(future_data)

In [None]:
%%pyspark
future_data['Date'] = pd.to_datetime(future_data['Date']).dt.strftime('%Y-%m-%d')
display(future_data)

## Making future predictions using the best-performing model


In [None]:
%%pyspark
future_value = fitted_model.predict(future_data)
temp_df =  pd.DataFrame({'Accessories_count':future_value})
temp_df['Accessories_count'] = temp_df['Accessories_count'].round(0)
future_data['Accessories_Customer_count'] = temp_df['Accessories_count']
future_data.drop(columns=['Accessories_count'], inplace=True)
future_data['Date'] = pd.to_datetime(future_data['Date']).dt.strftime('%Y-%m-%d')
display(future_data)

In [27]:
%%pyspark
output = spark.createDataFrame(future_data)

## **Registering Model**


In [None]:
%%pyspark
#register model
#model_name = "my_model_20th"
description = "Forecast Model"
tags = None
model = local_run.register_model(description = description, tags = tags)
local_run.model_id

In [29]:
%%pyspark
# Download the scoring script and conda environment file produced by the best run
script_file_name = 'inference/score.py'
conda_env_file_name = 'inference/env.yml'
best_run.download_file('outputs/scoring_file_v_1_0_0.py', script_file_name)
best_run.download_file('outputs/conda_env_v_1_0_0.yml', conda_env_file_name)

## Checking the progress of the experiment in Azure portal
<p style="font-size:16px; color:#117d30;">The following command retrieves the URL of the run in the Azure portal.</p>



In [None]:
%%pyspark
print(local_run.get_portal_url())

## Consuming REST EndPoint API


In [None]:
%%pyspark
import urllib.request
import urllib.error  # needed for the HTTPError handler below
import json
import os
import ssl
import pprint

def allowSelfSignedHttps(allowed):
    # bypass the server certificate verification on client side
    if allowed and not os.environ.get('PYTHONHTTPSVERIFY', '') and getattr(ssl, '_create_unverified_context', None):
        ssl._create_default_https_context = ssl._create_unverified_context

allowSelfSignedHttps(True) # this line is needed if you use self-signed certificate in your scoring service.

data = {
    "Inputs": {
          "WebServiceInput0":
          [
              {
                    'Date': "2020-04-20T00:00:00Z",
                    'Accessories_count': "18",
                    'Entertainment_count': "28",
                    'Gaming': "5",
                    'Kids': "14",
                    'Mens': "14",
                    'Phone_and_GPS': "36",
                    'Womens': "30",
              },
          ],
    },
    "GlobalParameters":  {
    }
}

body = str.encode(json.dumps(data))

url = 'http://13.68.210.52:80/api/v1/service/retail-realtime-inference/score'
api_key = 'wRx4zkptjIjKq41jPFmw9UO3LfZ7CLE6' # Replace this with the API key for the web service
headers = {'Content-Type':'application/json', 'Authorization':('Bearer '+ api_key)}

req = urllib.request.Request(url, body, headers)

try:
    response = urllib.request.urlopen(req)

    result = response.read()
    #print(result)
    pprint.pprint(json.loads(result))
except urllib.error.HTTPError as error:
    print("The request failed with status code: " + str(error.code))

    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(error.info())
    print(json.loads(error.read().decode("utf8", 'ignore')))
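
<p style="font-size:16px; color:#117d30;">
A possible variation, sketched below, reads the endpoint URL and API key from environment variables instead of embedding them in the notebook. The helper <code>build_scoring_request</code> and the variable names <code>SCORING_URL</code> and <code>SCORING_API_KEY</code> are hypothetical; the payload shape follows the cell above. The sketch only builds the request and does not send it.
</p>

```python
import json
import os
import urllib.request

def build_scoring_request(url: str, api_key: str, rows: list) -> urllib.request.Request:
    """Build a POST request using the payload shape of the scoring cell."""
    payload = {"Inputs": {"WebServiceInput0": rows}, "GlobalParameters": {}}
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    # Supplying a request body makes urllib issue a POST
    return urllib.request.Request(url, json.dumps(payload).encode("utf-8"), headers)

# SCORING_URL and SCORING_API_KEY are hypothetical environment variable names;
# the fallback values are placeholders, not a real endpoint or key
url = os.environ.get("SCORING_URL", "http://localhost/score")
api_key = os.environ.get("SCORING_API_KEY", "placeholder-key")
req = build_scoring_request(url, api_key, [{"Date": "2020-04-20T00:00:00Z", "Accessories_count": "18"}])
print(req.get_method(), req.full_url)
```

<p style="font-size:16px; color:#117d30;">
The resulting request can be passed to <code>urllib.request.urlopen</code> in the same way as in the cell above.
</p>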
