## Introduction

This notebook is self-guided and broken into a number of steps that illustrate the process of developing an ML model.

In [None]:
!pip install prophet
import pyspark.sql.functions as F
from pyspark.sql.types import *

In [None]:
# configuration for downloads and stock symbol to analyze

# url to source tar file
FULL_URL = "https://fabricrealtimelab.blob.core.windows.net/public/AbboCost_Stock_History/stockhistory-2023-2024.tgz?sp=r&st=2023-11-26T23:59:09Z&se=2027-11-27T07:59:09Z&spr=https&sv=2022-11-02&sr=b&sig=70w%2BT6ZVGpdTd6YJr%2FzPhKUFk9JYJ2ezu6%2BBBr9ahxc%3D"
# lakehouse location -- assumes default lakehouse
LAKEHOUSE_FOLDER = "/lakehouse/default"

# filename and data folders
TAR_FILE_NAME = "stockhistory-2023-2024.tgz"
DATA_FOLDER = "Files/stockhistory/raw"

TAR_FILE_PATH = f"{LAKEHOUSE_FOLDER}/{DATA_FOLDER}/tar/"
CSV_FILE_PATH = f"{LAKEHOUSE_FOLDER}/{DATA_FOLDER}/csv/"

# specify the stock symbol to analyze - WHO, WHAT, IDK, WHY, BCUZ, TMRW, TDY, IDGD
# recommended: WHO, BCUZ, IDGD

# STOCK_SYMBOL = "IDGD"
# STOCK_SYMBOL = "BCUZ"
STOCK_SYMBOL = "WHO"

## Step 1: Download the stock history source data files

Normally, this data would be ingested continuously into our lakehouse. In the interest of time, we've generated enough data to train from and do some experimentation. This was generated with the same engine (https://aka.ms/fabricrealtimelab); the stock generator is largely random and will vary by installation, but there are certain trends that should be present.

The cell will check for the data and only download/extract if the data does not exist.

In [None]:
import os

if not os.path.exists(LAKEHOUSE_FOLDER):
    # add a lakehouse if the notebook has no default lakehouse
    # a new notebook will not link to any lakehouse by default
    raise FileNotFoundError(
        "Lakehouse not found, please add a lakehouse for the notebook."
    )
else:
    # verify whether or not the required files are already in the lakehouse, and if not, download and unzip
    if not os.path.exists(f"{TAR_FILE_PATH}{TAR_FILE_NAME}"):
        os.makedirs(TAR_FILE_PATH, exist_ok=True)
        os.system(f"wget '{FULL_URL}' -O {TAR_FILE_PATH}{TAR_FILE_NAME}")

        #todo: better file checking
        os.makedirs(CSV_FILE_PATH, exist_ok=True)
        os.system(f"tar -zxvf {TAR_FILE_PATH}{TAR_FILE_NAME} -C {CSV_FILE_PATH}")
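
The `#todo: better file checking` note could be addressed along the following lines. This is a minimal sketch using only the standard library; the helper name `ensure_extracted` and the `.part` temporary suffix are assumptions for illustration and are not part of the lab. It skips the download when the archive already exists, checks that the file is a readable tar archive, and extracts only when the target folder is empty.

```python
import os
import tarfile
import urllib.request


def ensure_extracted(url, tar_path, extract_dir):
    """Download and extract the archive only if it is not already present.

    Returns True if a download happened, False if the archive was reused.
    """
    downloaded = False
    if not os.path.exists(tar_path):
        os.makedirs(os.path.dirname(tar_path), exist_ok=True)
        # download to a temporary name so a partial file is never mistaken
        # for a complete archive on the next run
        tmp_path = tar_path + ".part"
        urllib.request.urlretrieve(url, tmp_path)
        os.replace(tmp_path, tar_path)
        downloaded = True

    # fail early if the file is not a readable tar archive
    if not tarfile.is_tarfile(tar_path):
        raise ValueError(f"{tar_path} is not a valid tar archive")

    os.makedirs(extract_dir, exist_ok=True)
    # extract only when the target folder is empty
    if not os.listdir(extract_dir):
        with tarfile.open(tar_path, "r:gz") as archive:
            archive.extractall(extract_dir)
    return downloaded
```

In this notebook the call would look like `ensure_extracted(FULL_URL, f"{TAR_FILE_PATH}{TAR_FILE_NAME}", CSV_FILE_PATH)`.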

## Step 2: Read the CSV files into a DataFrame

In [None]:
# read the CSV files, {year}/{month}/{day}.csv

df_stocks = (
    spark.read.format("csv")
    .option("header", "true")
    .load(f"{DATA_FOLDER}/csv/*/*/*.csv")
)

df_stocks.tail(8)

In [None]:
# remove all but specified stock symbol
# individual models can be built for each stock

df_stocks = df_stocks.select("*").where(
    'symbol == "' + STOCK_SYMBOL + '"'
)

df_stocks.tail(4)

In [None]:
# strictly speaking, we don't need to sort the dataframe, 
# but it can help for exploration of the data 

df_stocks = df_stocks.sort("timestamp")
df_stocks.tail(4)

In [None]:
# include only historical data when building model

import datetime

currentdate = datetime.datetime.utcnow()
currentdate = currentdate.replace(hour=0, minute=0, second=0, microsecond=0)

# to manually specify a cutoff date in the data, specify the date below:
# currentdate = "2023-11-27 00:00:00"

df_stocks_history = df_stocks.select("*").where(
    'timestamp < "' + str(currentdate) + '"')

df_stocks_history.tail(4)
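
Because the CSV files are read without a schema, `timestamp` is a string column, so the filter above compares strings rather than dates. The sketch below illustrates why that still gives the right answer, assuming the timestamps are zero-padded and follow the `%Y-%m-%d %H:%M:%S` layout; the sample values are hypothetical.

```python
from datetime import datetime

FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout in the CSV files

# hypothetical sample timestamps around a midnight cutoff
timestamps = [
    "2023-11-26 23:59:00",
    "2023-11-27 00:00:00",
    "2023-11-27 00:01:00",
]
cutoff = datetime(2023, 11, 27)

# string comparison against str(cutoff), mirroring the where() filter
by_string = [ts for ts in timestamps if ts < str(cutoff)]

# the same filter after parsing each value to a datetime object
by_datetime = [ts for ts in timestamps if datetime.strptime(ts, FORMAT) < cutoff]

print(by_string)                 # ['2023-11-26 23:59:00']
print(by_string == by_datetime)  # True
```

If the source data used a different layout (for example, day-first dates), the string comparison would no longer match chronological order and the column should be cast to a timestamp first.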

In [None]:
# convert to a pandas dataframe, and rename the columns to 'ds' and 'y'  
# for time and label/outcome columns

import pandas as pd

dfstocks_pd = df_stocks_history.toPandas()

# rename the columns as expected by Prophet (ds and y)
dfstocks_pd = dfstocks_pd.rename(columns={'timestamp': 'ds', 'price': 'y'})

# the CSV was read without a schema, so price arrives as a string
dfstocks_pd['y'] = pd.to_numeric(dfstocks_pd['y'])

# verify max/min timestamps in the dataframe, as the tail/head data may not be in order
print('Min: ', dfstocks_pd['ds'].min())
print('Max: ', dfstocks_pd['ds'].max())

## Step 3: Train the model
To develop the model, we'll use [Prophet](https://facebook.github.io/prophet/), developed by Facebook's Core Data Science team. Prophet is designed for forecasting time series data and needs only a timestamp column and a value column, which makes it a good starting point: it keeps feature engineering and the number of tunable variables to a minimum.

In [None]:
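# Prophet variables

changepoint_prior_scale = 0.05  # flexibility of the trend; larger values allow more trend changes
changepoint_range = 0.95        # proportion of the history in which trend changepoints are placed
seasonality_prior_scale = 10    # strength of the seasonality components
weekly_seasonality = 5          # number of Fourier terms for the weekly seasonality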

In [None]:
from prophet import Prophet
from prophet.plot import add_changepoints_to_plot

m = Prophet(changepoint_prior_scale = changepoint_prior_scale, 
    changepoint_range = changepoint_range, 
    seasonality_prior_scale = seasonality_prior_scale,
    weekly_seasonality=weekly_seasonality)
m.fit(dfstocks_pd)
future = m.make_future_dataframe(periods=60*24*7, freq='min', include_history = False)
forecast = m.predict(future)
fig = m.plot(forecast)

In [None]:
fig = m.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), m, forecast)

In [None]:
fig2 = m.plot_components(forecast)

In [None]:
# show the first few rows of the forecast
# yhat is predicted value of y at the given ds

forecast.head()

In [None]:
import plotly.express as px

fig = px.line(forecast, x='ds', y='yhat')
fig.update_layout(title="Trend of predicted (yhat) over time", showlegend=True)
fig.show()

## Step 4: Cross Validate

Cross validation automates the process of separating training and test data. It lets us measure accuracy at many points in time while still including all of the data in our final model.

Three parameters control the process. The `initial` parameter sets the size of the first training window, and the model trained on that window forecasts over the specified `horizon`. Each subsequent validation moves the cutoff by the specified `period`.

So, if we wanted to validate the most recent 2 weeks, we'd set `initial` to the number of days in our training set minus 14 days, then specify a `horizon` and `period` of 7 days each. This results in 2 validation forecasts: one for last week, and one for the week prior.

More information on cross validation is available in the [Prophet docs](https://facebook.github.io/prophet/docs/diagnostics.html).
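
The relationship between `initial`, `period`, and `horizon` can be sketched with plain date arithmetic. This is a simplified illustration, not Prophet's implementation: the helper `sketch_cutoffs` and the 60-day date range are hypothetical, and the sketch ignores details such as gaps in the data.

```python
from datetime import datetime, timedelta


def sketch_cutoffs(min_date, max_date, initial, period, horizon):
    """Return the cutoff dates, oldest first, for a rolling-origin validation."""
    cutoffs = []
    cutoff = max_date - horizon  # the last forecast ends at max_date
    # step backwards while at least `initial` of training data remains
    while cutoff >= min_date + initial:
        cutoffs.append(cutoff)
        cutoff -= period
    return sorted(cutoffs)


# hypothetical 60-day history, validating the most recent 2 weeks
min_date = datetime(2023, 10, 1)
max_date = datetime(2023, 11, 30)
num_days = (max_date - min_date).days  # 60

week = timedelta(days=7)
cutoffs = sketch_cutoffs(
    min_date,
    max_date,
    initial=timedelta(days=num_days - 14),
    period=week,
    horizon=week,
)

for cutoff in cutoffs:
    print(f"train up to {cutoff:%Y-%m-%d}, forecast through {cutoff + week:%Y-%m-%d}")
# train up to 2023-11-16, forecast through 2023-11-23
# train up to 2023-11-23, forecast through 2023-11-30
```

With these settings the sketch yields two cutoffs, matching the two validation forecasts described above.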

In [None]:
# calculate the number of days to validate

import pandas as pd

# pd.to_datetime accepts the timestamp strings directly and avoids
# shadowing the datetime module imported in other cells
minDate = pd.to_datetime(dfstocks_pd['ds'].min())
maxDate = pd.to_datetime(dfstocks_pd['ds'].max())

numDays = (maxDate - minDate).days
numDaysToValidate = numDays - 14

print(numDaysToValidate)

In [None]:
from prophet.diagnostics import cross_validation
from prophet.diagnostics import performance_metrics

df_cv = cross_validation(m, initial=f"{numDaysToValidate} days", period="7 days", horizon="7 days")

In [None]:
df_cv.tail(4)

In [None]:
# generate metrics using the default rolling window (10%)

from prophet.diagnostics import performance_metrics
df_p = performance_metrics(df_cv)
df_p.head()

In [None]:
# generate metrics using all data (100%)

from prophet.diagnostics import performance_metrics
df_p = performance_metrics(df_cv, rolling_window=1)
df_p.head()

In [None]:
# get the statistics for storing in the model
# this becomes part of the model's metadata

mse = df_p['mse'][0]
mae = df_p['mae'][0]
rmse = df_p['rmse'][0]
mape = df_p['mape'][0]

print('mse:', mse)
print('mae:', mae)
print('rmse:', rmse)
print('mape:', mape)
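
For reference, the four metrics can be written out by hand. The sketch below uses the standard definitions on a few hypothetical values; the helper name `forecast_errors` is an assumption for illustration. Note that `mape` is expressed here as a fraction rather than a percentage.

```python
import math


def forecast_errors(actual, predicted):
    """Compute mse, mae, rmse, and mape for paired actual/predicted values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n      # mean squared error
    mae = sum(abs(e) for e in errors) / n     # mean absolute error
    rmse = math.sqrt(mse)                     # root mean squared error
    # mean absolute percent error, as a fraction; undefined if any actual is 0
    mape = sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"mse": mse, "mae": mae, "rmse": rmse, "mape": mape}


# hypothetical prices and predictions
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 380.0]

for name, value in forecast_errors(actual, predicted).items():
    print(f"{name}: {value:.4f}")
```

Because `mse` squares each error, a single large miss raises it far more than it raises `mae`, which is why the two are usually read together.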

In [None]:
# plot the mean absolute percent error (mape)

from prophet.plot import plot_cross_validation_metric
fig = plot_cross_validation_metric(df_cv, metric='mape')

In [None]:
# plot the root mean squared error (rmse)

from prophet.plot import plot_cross_validation_metric
fig = plot_cross_validation_metric(df_cv, metric='rmse')

In [None]:
# routine for testing multiple combinations of parameters

# by default we won't run this, but it shows how to test multiple
# parameter combinations and find the best-performing one

# more info at:
# https://facebook.github.io/prophet/docs/diagnostics.html

import itertools
import numpy as np
import pandas as pd

param_grid = {  
    'changepoint_prior_scale': [0.001, 0.1, 0.5],
    'seasonality_prior_scale': [0.01, 0.1, 1.0, 10.0, 15],
    'weekly_seasonality': [3, 5]
}

test_parameters = False # change to True to run series of tests. 

if test_parameters:
    daysToValidate = numDays - 7

    # generate all combinations of parameters
    all_params = [dict(zip(param_grid.keys(), v)) for v in itertools.product(*param_grid.values())]
    rmses = []  # store the RMSE for each parameter combination here

    # use cross validation to evaluate all parameters
    for params in all_params:
        
        m = Prophet(**params).fit(dfstocks_pd)  # fit model with given params
        df_cv = cross_validation(m, initial=f"{daysToValidate} days", period="7 days", horizon="7 days")
        df_p = performance_metrics(df_cv, rolling_window=1)
        rmses.append(df_p['rmse'].values[0])

    # find the best parameters (lowest RMSE)
    tuning_results = pd.DataFrame(all_params)
    tuning_results['rmse'] = rmses
    print(tuning_results)
    print('Best parameters:', all_params[np.argmin(rmses)])
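
To get a feel for the size of the search before enabling it, the grid expansion and the selection step can be tried on their own. In this sketch the scoring function is a placeholder standing in for the cross-validated RMSE; its values are made up purely so the selection logic has something to work on, and the helper `pick_best` is hypothetical.

```python
import itertools

param_grid = {
    'changepoint_prior_scale': [0.001, 0.1, 0.5],
    'seasonality_prior_scale': [0.01, 0.1, 1.0, 10.0, 15],
    'weekly_seasonality': [3, 5],
}

# every combination of the values above: 3 * 5 * 2 = 30 models to fit
all_params = [
    dict(zip(param_grid.keys(), values))
    for values in itertools.product(*param_grid.values())
]
print(len(all_params))  # 30


def pick_best(params_list, scores):
    """Return the parameter set with the lowest score, plus that score."""
    best_index = min(range(len(scores)), key=scores.__getitem__)
    return params_list[best_index], scores[best_index]


# placeholder scores, NOT real RMSE values
placeholder_scores = [
    abs(p['changepoint_prior_scale'] - 0.1)
    + abs(p['seasonality_prior_scale'] - 1.0)
    + p['weekly_seasonality']
    for p in all_params
]

best_params, best_score = pick_best(all_params, placeholder_scores)
print(best_params)
```

Each combination requires a full fit plus cross validation, so the run time grows with the product of the list lengths in `param_grid`.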

## Step 5: Log and load model with MLflow

MLflow assists with managing ML workflows. We can create a new experiment for each stock (as an example) and then add each run to the experiment. We can also log all of the parameters and metrics with each run, allowing us to see and compare different models. This is part of our operationalizing process. When the run is logged in the experiment, the run is given a URI that can be used to load the model later; however, it's also possible to interact with MLflow visually or programmatically to load/inspect models.

In [None]:
# setup mlflow with an experiment

import mlflow

EXPERIMENT_NAME = STOCK_SYMBOL + "-stock-prediction"
mlflow.set_experiment(EXPERIMENT_NAME)


In [None]:
from mlflow.models.signature import infer_signature

model_name = f"{EXPERIMENT_NAME}-model"
with mlflow.start_run() as run:
    mlflow.autolog()

    mlflow.prophet.log_model(m, model_name, registered_model_name=model_name,
        signature=infer_signature(future, forecast))

    # log every parameter passed to Prophet so runs can be compared
    mlflow.log_params({
        "changepoint_prior_scale": changepoint_prior_scale,
        "changepoint_range": changepoint_range,
        "seasonality_prior_scale": seasonality_prior_scale,
        "weekly_seasonality": weekly_seasonality,
    })

    # log the cross-validation metrics computed in Step 4
    mlflow.log_metrics({"mse": mse, "mae": mae, "rmse": rmse, "mape": mape})

    model_uri = f"runs:/{run.info.run_id}/{model_name}"

    print("Model saved in run %s" % run.info.run_id)
    print(f"Model URI: {model_uri}")

## Step 6: Load the model and generate predictions

In this step, we'll load the model from MLflow and create a new prediction for the next week. 

Because this is a simulation for demo purposes, we already have the future data (or at least one possible version of it, as the generator is highly random). This lets us compare the model's predictions with the actual values: we'll combine the predicted dataset with the actual dataset and plot both.

In [None]:
import mlflow

loaded_model = mlflow.prophet.load_model(model_uri)

In [None]:
# establish begin/end dates for prediction

import datetime

currentdate = datetime.datetime.utcnow()
currentdate = currentdate.replace(hour=0, minute=0, second=0, microsecond=0)
enddate = currentdate + datetime.timedelta(days=7)

print(f'Beginning of forecast: {currentdate}')
print(f'End of forecast: {enddate}')

In [None]:
# load all of the 'future' data -- 
# will be used to compare prediction

df_stocks_future = df_stocks.select("*").where(
    'timestamp >= "' + str(currentdate) + '" and ' +
    'timestamp < "' + str(enddate) + '"')

df_stocks_future.tail(4)

In [None]:
import pandas as pd

# create a new dataframe to hold the predictions
# copy the timestamp from the future dataframe for convenience
# this new dataframe should only have timestamps for our prediction, 
# and will be labelled as 'ds'

df_stocks_future_pd = df_stocks_future.toPandas()
dfstocks_predict = df_stocks_future_pd[['timestamp']].copy()

# rename timestamp to ds as expected by prophet
dfstocks_predict = dfstocks_predict.rename(columns={'timestamp': 'ds'})

print('Min: ', dfstocks_predict['ds'].min())
print('Max: ', dfstocks_predict['ds'].max())
dfstocks_predict.head()

In [None]:
# optionally, can use make_future_dataframe in Prophet to make a suitable df

test_df = loaded_model.make_future_dataframe(periods=60*24*7, freq='min', include_history = False)
test_df.head()

In [None]:
# predict by passing in the dataframe with timestamps to forecast

forecast = loaded_model.predict(dfstocks_predict)
forecast.head()

In [None]:
# combine forecast and df_stocks_future_pd

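df_stocks_future_pd['timestamp'] = pd.to_datetime(df_stocks_future_pd['timestamp'])
# the CSV was read without a schema, so convert price from string to a number
df_stocks_future_pd['price'] = pd.to_numeric(df_stocks_future_pd['price'])
forecast['ds'] = pd.to_datetime(forecast['ds'])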

metric_df = forecast.set_index('ds')[['yhat']].join(df_stocks_future_pd.set_index('timestamp').price).reset_index()
metric_df.head()

In [None]:
import plotly.express as px
import plotly.graph_objects as go

metric_df['ds'] = pd.to_datetime(metric_df['ds'])

fig = go.Figure()
fig.add_trace(go.Scatter(x=metric_df['ds'], y=metric_df['price'], name='Actual', line=dict(color='blue', width=1)))
fig.add_trace(go.Scatter(x=metric_df['ds'], y=metric_df['yhat'], name='Predicted', line=dict(color='red', width=3)))

fig.update_layout(title="Predicted vs Actual", showlegend=True)
fig.show()

In [None]:
# optionally, compute metrics using sklearn to see how they compare
# to the cross-validation metrics from Step 4

from sklearn.metrics import mean_squared_error, mean_absolute_error

future_mse = round(mean_squared_error(metric_df.price, metric_df.yhat), 3)
future_mae = round(mean_absolute_error(metric_df.price, metric_df.yhat), 3)
# take the square root directly; the `squared` argument is deprecated
# and no longer available in recent scikit-learn releases
future_rmse = round(mean_squared_error(metric_df.price, metric_df.yhat) ** 0.5, 3)

print(f'mse: {future_mse} (Model: {mse})')
print(f'mae: {future_mae} (Model: {mae})')
print(f'rmse: {future_rmse} (Model: {rmse})')

## Additional MLflow
MLflow can also be queried programmatically to load and inspect models. The example below lists all experiments and the runs within each experiment.

In [None]:
import mlflow
import pandas as pd

experiments = mlflow.search_experiments()
for exp in experiments:
    print(f'{exp.name} ({exp.experiment_id})')
    runs_df = mlflow.search_runs(exp.experiment_id)
    display(runs_df)