# Time Series Forecasting in ML.NET and Azure ML notebooks

In this sample, you will learn how to run time series forecasting in a Jupyter notebook. We will read in data from a CSV file, do some exploratory plots, fit a regression model, and then fit a more sophisticated Singular Spectrum Analysis (SSA) forecaster.

## Prerequisites

### Setting up a C# Notebook in Azure Machine Learning

>Note: These instructions only apply if you intend to run this notebook in Azure Machine Learning. You can also run this notebook on your local machine by following the [instructions at the dotnet interactive GitHub repo](https://github.com/dotnet/interactive#how-to-install-net-interactive).

1. Go to ml.azure.com. Select your subscription and machine learning workspace.
1. Open the "Notebooks" tab on the left-hand side of the page.
1. Create a compute instance if you have not already, or select an existing one from the dropdown menu.
1. Open a notebook file with an extension of .ipynb.
1. From the dropdown menu in the top right, choose "JupyterLab."
1. Open a new terminal window within JupyterLab.
1. Follow the instructions [here](https://docs.microsoft.com/en-us/dotnet/core/install/linux-package-manager-ubuntu-1604) to register the Microsoft package signing key and package repository, then install .NET Core 3.1.
1. Install dotnet interactive by running `dotnet tool install -g --add-source "https://dotnet.myget.org/F/dotnet-try/api/v3/index.json" dotnet-interactive`
1. Create a symlink between the installed location of dotnet interactive and your local bin directory: `sudo ln -s /home/azureuser/.dotnet/tools/dotnet-interactive /usr/local/bin/dotnet-interactive`
1. Set your dotnet root directory: `export DOTNET_ROOT=$(dirname $(realpath $(which dotnet)))`
1. Install the jupyter kernel: `dotnet interactive jupyter install`
1. Verify the installation by running `jupyter kernelspec list`. You should see ".net-fsharp" and ".net-csharp" listed as kernels.

### Clone this repository

If you are running this notebook in Azure ML integrated notebooks, you will already have the Git command line utility installed. First, go to the correct directory by running `cd /Users/<user_name>`. Then run `git clone git@github.com:gvashishtha/time-series-mlnet.git`. You should then see this notebook on the left-hand side of your screen.

### Install MKL

If you are running ML.NET for the first time on an Ubuntu machine (like Azure Machine Learning notebooks), please [follow these instructions](https://docs.microsoft.com/dotnet/machine-learning/how-to-guides/install-extra-dependencies#linux) to download the required dependencies.

## Reference required assemblies

These assemblies will be used later on in the notebook, so let's just install them now.

In [169]:
// ML.NET Nuget packages installation
#r "nuget:Microsoft.ML,1.5.0-preview2"
#r "nuget:Microsoft.ML.Mkl.Components,1.5.0-preview2"
#r "nuget:Microsoft.ML.TimeSeries,1.5.0-preview2"
    
//Install XPlot package
#r "nuget:XPlot.Plotly,2.0.0"

// Install data analysis package
#r "nuget:Microsoft.Data.Analysis,0.2.0"
    

## Define data classes

These data classes define `TemperaturePoint`, which stores individual records from our input data, `TemperatureDate`, which holds the same record with its date parsed into a `DateTime`, and `TemperatureParsed`, which is the result of some preprocessing that will happen later.

In [170]:
using Microsoft.ML;
using Microsoft.ML.Data;
public class TemperaturePoint
{
    [LoadColumn(0)]
    public string Date;

    [LoadColumn(1)]
    [ColumnName("Label")]
    public float MinTemp;

}

public class TemperatureDate
{
    public DateTime Date;
    public float MinTemp;
}

public class TemperatureParsed
{
    public float MinTemp;
    public int Month;
    public int Year;
    public int Day;
    public DateTime Date;
    public float DaysSinceStart;
    public float Cos;
}

## Initial data exploration

Let's plot two years' (730 days) worth of the data and also get some details like the number of rows and the minimum date. As you can see, the data start on January 1, 1981, and temperatures follow a sinusoidal pattern, peaking near the end of January and reaching a minimum in late June. Presumably we are looking at temperatures for somewhere in the southern hemisphere.

We can also display some of the data as a table.

In [171]:
using XPlot.Plotly;

MLContext mlContext = new MLContext(seed: 0);
IDataView trainDataView = mlContext.Data.LoadFromTextFile<TemperaturePoint>("daily-minimum-temperatures-in-me.csv", hasHeader: true, separatorChar: ',');

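// Extract the earliest date for later processing. Parse each string before taking
// the minimum: Min() on the raw strings would compare them alphabetically, not chronologically.
IEnumerable<string> dateColumn = trainDataView.GetColumn<string>("Date").ToList();
DateTime minDate = dateColumn.Select(d => DateTime.Parse(d)).Min();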

// Convert dates from strings to datetimes
Action<TemperaturePoint, TemperatureDate> mapping = (input, output) =>
    {
        output.MinTemp = input.MinTemp;
        output.Date = DateTime.Parse(input.Date);
    };

var estimator = mlContext.Transforms.CustomMapping(mapping, null)
    .AppendCacheCheckpoint(mlContext);


var model = estimator.Fit(trainDataView);
var initialData = model.Transform(trainDataView);

//Extract some data into arrays for plotting:
 
int numberOfRows = 730;
float[] temps = initialData.GetColumn<float>("MinTemp").Take(numberOfRows).ToArray();
DateTime[] dates = initialData.GetColumn<DateTime>("Date").Take(numberOfRows).ToArray();

Graph.Scattergl[] scatters = {
    new Graph.Scattergl()
    {
        x = dates,
        y = temps
    }
};
var chart = Chart.Plot(
    scatters
);

chart.Width = 600;
chart.Height = 600;
display(chart);

int totalRows = dateColumn.Count();
display(minDate);
display(totalRows);

public static List<TemperaturePoint> Head(MLContext mlContext, IDataView dataView, int numberOfRows = 4)
{
    string msg = string.Format("DataView: Showing {0} rows with the columns", numberOfRows.ToString());
    display(msg);
          
    var rows = mlContext.Data.CreateEnumerable<TemperaturePoint>(dataView, reuseRowObject: false)
                    .Take(numberOfRows)
                    .ToList();
    
    return rows;
}

display(h4("Showing a few rows from training DataView:"));

var fewRows = Head(mlContext, trainDataView, 60);
display(fewRows);

DataView: Showing 60 rows with the columns

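| index | Date | MinTemp |
| --- | --- | --- |
| 0 | 1/1/1981 | 20.7 |
| 1 | 1/2/1981 | 17.9 |
| 2 | 1/3/1981 | 18.8 |
| 3 | 1/4/1981 | 14.6 |
| 4 | 1/5/1981 | 15.8 |
| 5 | 1/6/1981 | 15.8 |
| 6 | 1/7/1981 | 15.8 |
| 7 | 1/8/1981 | 17.4 |
| 8 | 1/9/1981 | 21.8 |
| 9 | 1/10/1981 | 20.0 |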


## Feature engineering

The most striking visual feature of the data is its sinusoidal shape. Let's use that idea to construct our first model. If the data truly follow a cosine wave, they can be modeled by the function $A\cdot\cos(B\cdot(t-C))+D$, where $A$ is the amplitude of the wave, $2\pi/B$ is the period of the wave, $C$ is the location of a maximum in the wave, and $D$ is the location of the midpoint of the wave. Later on, we will fit a linear regression model. Linear regressions of one variable fit models of the form $y=mx+b$, meaning that if we provide the values of $B$ and $C$ above and use $x=\cos(B\cdot(t-C))$ as the input feature, the linear regression model will figure out $A$ and $D$ for us.

Luckily, the plot and our own common sense give us enough information to determine $B$ and $C$ just by looking. Specifically, we know that temperatures vary in a yearly pattern, meaning the wave will have a period of 365 days, so $B=\dfrac{2\pi}{365}$. And we can see from the table and graph that the peak of the wave occurs around January 31st, which is 30 days after the start of the data, meaning we can set $C=30$.
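To make the reduction to linear regression concrete, here is the substitution written out with the values chosen above, where $t$ is the number of days since the start of the data. The symbols $x_t$ and $\hat{y}_t$ are notation introduced here for illustration; $x_t$ corresponds to the `Cos` feature computed in the next cell.

$$
x_t = \cos\!\left(\frac{2\pi}{365}\,(t - 30)\right), \qquad \hat{y}_t = A\,x_t + D
$$

Compared with $y = mx + b$, the amplitude $A$ plays the role of the slope $m$ and the midpoint $D$ plays the role of the intercept $b$, so both can be estimated by a one-variable linear regression on $x_t$.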

Putting this all together, let's use another [CustomMapping transform](https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.custommappingcatalog.custommapping?view=ml-dotnet) to compute this engineered feature.

In [135]:
using System;
using System.Collections.Generic;

Action<TemperaturePoint, TemperatureParsed> mapping = (input, output) =>
    {
        output.MinTemp = input.MinTemp;
        
        DateTime result = DateTime.Parse(input.Date);
        output.Day = result.Day;
        output.Month = result.Month;
        output.Year = result.Year;
        output.Date = result;
        output.DaysSinceStart = (result-minDate).Days;
        output.Cos = (float) Math.Cos( (double) ((((2 * Math.PI)/365) * (output.DaysSinceStart-30))));

    };

var estimator = mlContext.Transforms.CustomMapping(mapping, null)
                .Append(mlContext.Transforms.Concatenate(outputColumnName: "DaysSince",
                                                         inputColumnNames: new[] { "DaysSinceStart" }))
                .Append(mlContext.Transforms.Concatenate(outputColumnName: "CosVector",
                                                         inputColumnNames: new[] { "Cos" }))
                .AppendCacheCheckpoint(mlContext);


var model = estimator.Fit(trainDataView);
var transformedData = model.Transform(trainDataView);


//Extract some data into arrays for plotting
int numberOfRows = 730;
float[] temps = transformedData.GetColumn<float>("MinTemp").Take(numberOfRows).ToArray();
DateTime[] dates = transformedData.GetColumn<DateTime>("Date").Take(numberOfRows).ToArray();
float[] cos = transformedData.GetColumn<float>("Cos").Take(numberOfRows).ToArray();

Graph.Scattergl[] scatters = {
    new Graph.Scattergl()
    {
        x = dates,
        y = temps,
        name = "Original data"
    },
    new Graph.Scattergl()
    {
        x = dates,
        y = cos,
        name = "Cosine function"
    }
};
var chart = Chart.Plot(
    scatters
);

chart.Width = 600;
chart.Height = 600;
display(chart);


## Train-test split

As expected, that cosine wave is looking pretty good! It's sitting at the wrong "height" on our graph (parameter $D$ above) and its amplitude is too small (parameter $A$ above), but we will take care of those later. Now let's talk about our train-test split.

With machine learning tasks, it's important to reserve some of your data as "test" data. If you train your model on all the data that's available, you run the risk of "overfitting," or predicting trends really well in the data you have seen, but not generalizing well to new data. We want to be sure that when we put this model in the real world, it will be robust and useful for predicting future temperatures. 

A standard rule of thumb is to use 80% of data as training data and 20% as test data. For some machine learning tasks, you can simply randomly split the data into groups of 80% for training and 20% for testing. For time series data, however, that's not a valid approach, because it could result in a situation where some of the training data points are from later time points than the test data points. In this case, you have what is known as "leakage," where the model is aware of the future before it happens. All is not lost, however! We need to simply draw the cut line at a specific point in time, treating all data after that point as test data, and all prior data as training data.
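The idea of a single cut line can be sketched on a plain array before applying it to the `IDataView`. This is a minimal illustration, not the notebook's actual split: `ChronologicalSplit` is a hypothetical helper (it is not part of ML.NET), and the input is simply the first ten temperatures from the table shown earlier, assumed to be ordered oldest first.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not an ML.NET API): split an already time-ordered
// series at one cut point, so every test point comes after every training point.
static (T[] Train, T[] Test) ChronologicalSplit<T>(T[] ordered, double trainFraction)
{
    int numTrain = (int)(trainFraction * ordered.Length);
    return (ordered.Take(numTrain).ToArray(), ordered.Skip(numTrain).ToArray());
}

// First ten daily minimum temperatures from the table above, oldest first.
float[] series = { 20.7f, 17.9f, 18.8f, 14.6f, 15.8f, 15.8f, 15.8f, 17.4f, 21.8f, 20.0f };

var split = ChronologicalSplit(series, 0.8);
Console.WriteLine($"{split.Train.Length} training points, {split.Test.Length} test points");
```

Because the split is by position rather than by random sampling, the model never sees a value that is later in time than anything in the test set.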

In [173]:
int numTrain = (int) (0.8 * totalRows);
display(numTrain);
IDataView trainData = mlContext.Data.FilterRowsByColumn(transformedData, "DaysSinceStart", upperBound: numTrain);
IDataView testData = mlContext.Data.FilterRowsByColumn(transformedData, "DaysSinceStart", lowerBound: numTrain);

## Fitting a regression model

OK, now that we've done our train-test split, let's actually fit a model! The code below fits an [ordinary least squares (OLS) model](https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.mklcomponentscatalog.ols?view=ml-dotnet#definition) to the training data, plots the model predictions next to the test data, and outputs some evaluation metrics.


In [174]:
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

var forecastingPipeline = mlContext.Regression.Trainers.Ols(
    labelColumnName: "MinTemp", 
    featureColumnName: "CosVector");
    
var forecaster = forecastingPipeline.Fit(trainData);

// Use trained model to make inferences on test data
IDataView testDataPredictions = forecaster.Transform(testData);

PlotPredictions(testDataPredictions, outputColumns: new string[] { "Score" } );

// Extract model metrics: Mean Absolute Error and Root Mean Squared Error
RegressionMetrics trainedModelMetrics = mlContext.Regression.Evaluate(
    testDataPredictions,
    labelColumnName: "MinTemp",
    scoreColumnName: "Score");

double rmse = trainedModelMetrics.RootMeanSquaredError;
double mae = trainedModelMetrics.MeanAbsoluteError;
display($"Mean Absolute Error:{mae:F3}\n Root Mean Squared Error: {rmse:F3}");

// Helper function for plotting predictions next to actual data
static void PlotPredictions(
        IDataView testDataPredictions,
        string[] outputColumns,
        bool isVector = false) {
    int numberOfRows = 730;
    float[] temps = testDataPredictions.GetColumn<float>("MinTemp").Take(numberOfRows).ToArray();
    DateTime[] dates = testDataPredictions.GetColumn<DateTime>("Date").Take(numberOfRows).ToArray();
    float[][] predictions = new float[outputColumns.Length][];

    for (int j = 0; j < outputColumns.Length; j++) {
        if (isVector) {
            var tmp = testDataPredictions.GetColumn<float[]>(outputColumns[j]).Take(numberOfRows).ToArray();
            predictions[j] = tmp[0];
        }
        else {
            predictions[j] = testDataPredictions.GetColumn<float>(outputColumns[j]).Take(numberOfRows).ToArray();

        }
    }

    Graph.Scattergl[] scatters = new Graph.Scattergl[outputColumns.Length + 1];
    
    for (int i = 0; i < outputColumns.Length + 1; i++) {
        scatters[i] = new Graph.Scattergl();
    }
    scatters[0].x = dates;
    scatters[0].y = temps;
    scatters[0].name = "Actual";
    

    for (int j = 0; j < outputColumns.Length; j++) {
        int scat_ind = j + 1;
        scatters[scat_ind].x = dates;
        scatters[scat_ind].y = predictions[j];
        scatters[scat_ind].name = outputColumns[j];
        scatters[scat_ind].fill = "tonexty";
    }

    var chart = Chart.Plot(scatters);
    chart.Width = 600;
    chart.Height = 600;
    display(chart);
}

Mean Absolute Error:1.997
 Root Mean Squared Error: 2.574

## Use the SSA Forecasting Transformer

The regression model gave us plausible results, but unfortunately a simple cosine wave wouldn't work for any data where the period of the wave is changing, or where there is an upward or downward trend over time (for example, due to climate change, temperature data might trend upwards).

ML.NET has a built-in method called the SSA Forecasting Transformer ([read more about it here](https://docs.microsoft.com/dotnet/api/microsoft.ml.timeseriescatalog.forecastbyssa?view=ml-dotnet)) that uses linear algebra to decompose a time series into components such as trend, seasonality, and noise, and then forecasts from those components. Let's try that out and see if we can get lower Mean Absolute Error and lower Root Mean Squared Error than we got above. Read more about these metrics at the [bike rental forecasting tutorial](https://docs.microsoft.com/dotnet/machine-learning/tutorials/time-series-demand-forecasting#evaluate-the-model).
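For reference, these are the standard definitions of the two metrics, where $y_i$ is the actual value, $\hat{y}_i$ is the forecast, and $n$ is the number of test points (the symbols are notation introduced here for illustration):

$$
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \qquad \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

Both are expressed in the same units as the temperatures. Because RMSE squares each error before averaging, it penalizes large misses more heavily than MAE does.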

### A note on the choice of parameters

`windowSize` needs to cover the largest seasonality that you are interested in modeling. For example, if the input time series is known to have weekly and monthly (30-day) seasonalities and it is sampled daily, then make sure that `windowSize` > 30. If the same data also exhibits annual seasonality (365-day) but you are _not_ interested in modeling that, then `windowSize` does _not_ need to be greater than 365. For numerical stability, make sure that 2 < `windowSize` < `trainSize`/2. Try experimenting with the window size below and see what happens.

`seriesLength` simply needs to be set to some value that is at least `windowSize+1`, as it represents the number of points to keep in a buffer when doing prediction.

`trainSize` should always be the length of your training data, or if you are unable to calculate this, then the largest number that your performance constraints will allow for, and at least 2 times your `windowSize`.

`horizon` should be the number of points you intend to generate when calling the `Predict()` method of your model.

`confidenceLevel` controls the width of the prediction interval. As the confidence level goes down, the distance between the lower and upper bounds gets narrower. There is a tradeoff here: you will get a narrower range of predictions, but you will be less sure about those bounds. If you set a confidence level very close to 1, you will likely get lower and upper bounds that diverge widely.
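The rules of thumb above can be checked before building the pipeline. The sketch below is only an illustration: `CheckSsaParameters` is a hypothetical helper (it is not an ML.NET API), and the training size of 2920 rows is an assumed figure for the example, so substitute your own `numTrain`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not an ML.NET API): encodes the rules of thumb from the
// notes above and returns one message for each rule that is violated.
static List<string> CheckSsaParameters(int windowSize, int seriesLength, int trainSize)
{
    var problems = new List<string>();
    if (windowSize <= 2)
        problems.Add("windowSize should be greater than 2");
    if (2 * windowSize >= trainSize)
        problems.Add("windowSize should be less than trainSize / 2");
    if (seriesLength < windowSize + 1)
        problems.Add("seriesLength should be at least windowSize + 1");
    return problems;
}

// Window and series length used in this notebook, with an assumed training size of 2920.
List<string> ok = CheckSsaParameters(windowSize: 365, seriesLength: 366, trainSize: 2920);

// A deliberately bad configuration: too little training data and too short a buffer.
List<string> bad = CheckSsaParameters(windowSize: 365, seriesLength: 300, trainSize: 700);

Console.WriteLine($"Valid configuration: {ok.Count} problems");
Console.WriteLine($"Invalid configuration: {string.Join("; ", bad)}");
```

Returning a list of messages rather than throwing on the first failure makes it easier to see every violated rule at once while experimenting with the window size.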

In [175]:
using Microsoft.ML.Transforms.TimeSeries;

// See the note on the choice of parameters above
var forecastingPipeline = mlContext.Forecasting.ForecastBySsa(
    outputColumnName: "ForecastTemp",
    inputColumnName: "MinTemp",
    windowSize: 365,
    seriesLength: 366,
    trainSize: numTrain,
    horizon: (totalRows-numTrain),
    confidenceLevel: 0.95f,
    confidenceLowerBoundColumn: "LowerBoundTemp",
    confidenceUpperBoundColumn: "UpperBoundTemp");

var forecaster = forecastingPipeline.Fit(trainData);

// Use trained model to make inferences on test data
IDataView testDataPredictions = forecaster.Transform(testData);
string[] seriesNames = new string[] { "UpperBoundTemp", "ForecastTemp", "LowerBoundTemp" };
PlotPredictions(
    testDataPredictions,
    outputColumns: seriesNames,
    isVector: true);

Evaluate(testData, forecaster, mlContext);

static void Evaluate(IDataView testData, ITransformer model, MLContext mlContext)
{
    IDataView predictions = model.Transform(testData);
    IEnumerable<float> actual =
    mlContext.Data.CreateEnumerable<TemperatureParsed>(testData, true)
        .Select(observed => observed.MinTemp);
    IEnumerable<float> forecast =
    mlContext.Data.CreateEnumerable<ModelOutput>(predictions, true)
        .Select(prediction => prediction.ForecastTemp[0]);
    
    var metrics = actual.Zip(forecast, (actualValue, forecastValue) => actualValue - forecastValue);
    var MAE = metrics.Average(error => Math.Abs(error)); // Mean Absolute Error
    var RMSE = Math.Sqrt(metrics.Average(error => Math.Pow(error, 2))); // Root Mean Squared Error
    
    Console.WriteLine("Evaluation Metrics");
    Console.WriteLine("---------------------");
    Console.WriteLine($"Mean Absolute Error: {MAE:F3}");
    Console.WriteLine($"Root Mean Squared Error: {RMSE:F3}\n");
}

public class ModelOutput
{
    public float[] ForecastTemp { get; set; }

    public float[] LowerBoundTemp { get; set; }

    public float[] UpperBoundTemp { get; set; }
}

Evaluation Metrics
---------------------
Mean Absolute Error: 1.963
Root Mean Squared Error: 2.491



## Train model on all data, predict future values

Now that we are happy with the model's performance, let's train it on _all_ available data so we can use it to make predictions. We don't get any metrics this time, because we are predicting values for time points we haven't yet seen, but the forecast looks plausible.

In [176]:
var entireForecaster = forecastingPipeline.Fit(transformedData);

// Build the prediction engine from the model trained on all data
var forecastEngine = entireForecaster.CreateTimeSeriesEngine<TemperatureParsed, ModelOutput>(mlContext);

ModelOutput forecast = forecastEngine.Predict();


int numberOfRows = 730;
float[] temps = testDataPredictions.GetColumn<float>("MinTemp").Take(numberOfRows).ToArray();
DateTime[] dates = testDataPredictions.GetColumn<DateTime>("Date").Take(numberOfRows).ToArray();

DateTime[] newDates = new DateTime[forecast.ForecastTemp.Length];
newDates[0] = dates[dates.Length-1].AddDays(1);

for (int i = 1; i < forecast.ForecastTemp.Length; i++) {
    newDates[i] = newDates[i-1].AddDays(1);
}

Graph.Scattergl[] scatters = {
    new Graph.Scattergl() {
        x = newDates,
        y = forecast.UpperBoundTemp,
        fill = "tonexty",
        name = "Upper bound"
    },
    new Graph.Scattergl() {
        x = newDates,
        y = forecast.ForecastTemp,
        fill = "tonexty",
        name = "Forecast"
    },
    new Graph.Scattergl() {
        x = newDates,
        y = forecast.LowerBoundTemp,
        fill = "tonexty",
        name = "Lower bound"
    }
};


var chart = Chart.Plot(scatters);
chart.Width = 600;
chart.Height = 600;
display(chart);

## Conclusion and next steps

In this notebook, you learned how to do time series forecasting in ML.NET with Jupyter notebooks. We initially used linear regression with an engineered feature, and then obtained slightly lower error on the test data by relying on ML.NET's SSA forecaster (Mean Absolute Error of 1.963 versus 1.997, Root Mean Squared Error of 2.491 versus 2.574).

To learn more about C# and Jupyter Notebooks, [check out this GitHub repo](https://github.com/dotnet/interactive#how-to-install-net-interactive).

To see another example of using ML.NET in Jupyter, [check out this blog](https://devblogs.microsoft.com/cesardelatorre/using-ml-net-in-jupyter-notebooks/).

To learn about using DataFrames in C#, [check out this blog](https://devblogs.microsoft.com/dotnet/an-introduction-to-dataframe/).

To get started with Model Builder in Visual Studio, [try this getting started tutorial](https://dotnet.microsoft.com/learn/ml-dotnet/get-started-tutorial/intro).