## Linear Regression
(With comments and code from Nicolas Vandeput's book "Data Science for Supply Chain Forecasting")

We will transform automobile sales data into a pivot table. We start by importing the data from a CSV file URL, then combine the 'Year' and 'Month' columns into a 'Period' label of the form YYYY-MM, which sorts chronologically and gives cleaner, more meaningful column names. The core function, import_data, reads the file at file_path and returns a DataFrame indexed by 'Make', with one column per month and the quantities summed for each make and month. This pivot table format is convenient for further analysis and visualization of sales trends.

#### Data Transformation: Pivot Table Creation

In [1]:
import pandas as pd

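# Source file: new car sales in Norway, by make.
DATA_URL = "https://supchains.com/wp-content/uploads/2021/07/norway_new_car_sales_by_make1.csv"

def import_data(file_path=DATA_URL):
    """Load the sales CSV and pivot it: one row per make, one column per month."""
    try:
        data = pd.read_csv(file_path)
        # Build a sortable 'YYYY-MM' label from the Year and Month columns.
        data['Period'] = data['Year'].astype(str) + '-' + data['Month'].astype(str).str.zfill(2)
        # Sum quantities per make and period; missing combinations become 0.
        df = pd.pivot_table(data=data, values=['Quantity'], index='Make', columns='Period', aggfunc='sum', fill_value=0)
        return df
    except Exception as e:
        # Report the problem and let the caller check for None.
        print(f"An error occurred: {e}")
        return None

df = import_data()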

if df is not None:
    print(df.head())

             Quantity                                                          \
Period        2007-01 2007-02 2007-03 2007-04 2007-05 2007-06 2007-07 2007-08   
Make                                                                            
Alfa Romeo         16       9      21      20      17      21      14      12   
Aston Martin        0       0       1       0       4       3       3       0   
Audi              599     498     682     556     630     498     562     590   
BMW               352     335     365     360     431     477     403     348   
Bentley             0       0       0       0       0       1       0       0   

                              ...                                          \
Period       2007-09 2007-10  ... 2016-04 2016-05 2016-06 2016-07 2016-08   
Make                          ...                                           
Alfa Romeo        15      10  ...       3       1       2       1       6   
Aston Martin       0       0  ...       0  
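
To see the transformation without downloading anything, here is a minimal sketch on a made-up table. The `toy` DataFrame and its values are assumptions for illustration only. Unlike import_data, this sketch also converts the label to a true monthly pandas Period and passes `values` as a string, which keeps the columns flat instead of adding a 'Quantity' level on top.

```python
import pandas as pd

# Toy long-format data standing in for the real CSV (values are made up).
toy = pd.DataFrame({
    "Year":     [2007, 2007, 2007, 2007, 2007],
    "Month":    [1, 1, 2, 2, 2],
    "Make":     ["Audi", "BMW", "Audi", "BMW", "BMW"],
    "Quantity": [5, 3, 7, 2, 4],
})

# Build a 'YYYY-MM' label, parse it, and convert it to a monthly Period.
label = toy["Year"].astype(str) + "-" + toy["Month"].astype(str).str.zfill(2)
toy["Period"] = pd.to_datetime(label, format="%Y-%m").dt.to_period("M")

# One row per make, one column per month, quantities summed within each cell.
toy_pivot = pd.pivot_table(
    toy, values="Quantity", index="Make", columns="Period",
    aggfunc="sum", fill_value=0,
)
print(toy_pivot)
```

The two BMW rows for February 2007 are summed into a single cell, which is what `aggfunc='sum'` does on the real data as well.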

#### Training and Test Sets Creation

Now that we have our dataset with the proper formatting, we can create a function datasets() to populate a training and a test set.

In [2]:
import numpy as np

def datasets(df, X_len=12, y_len=1, test_loops=12):
    D = df.values
    rows, periods = D.shape
    
    # Training set creation
    loops = periods + 1 - X_len - y_len
    train = []
    for col in range(loops):
        train.append(D[:, col:col + X_len + y_len])
    train = np.vstack(train)
    X_train, Y_train = np.split(train, [-y_len], axis=1)
    
    # Test set creation
    if test_loops > 0:
        X_train, X_test = np.split(X_train, [-rows * test_loops], axis=0)
        Y_train, Y_test = np.split(Y_train, [-rows * test_loops], axis=0)
    else:
        X_test = D[:, -X_len:]
        Y_test = np.full((X_test.shape[0], y_len), np.nan)
    
    # Formatting required for scikit-learn
    if y_len == 1:
        Y_train = Y_train.ravel()
        Y_test = Y_test.ravel()
        
    return X_train, Y_train, X_test, Y_test
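
The sliding-window logic is easier to follow on a tiny example. The sketch below repeats the training-set step of datasets() on a made-up 2 x 6 array, with X_len=3 and y_len=1 chosen only to keep the output small.

```python
import numpy as np

# Toy demand history: 2 products (rows) x 6 periods (columns).
D = np.array([[1, 2, 3, 4, 5, 6],
              [10, 20, 30, 40, 50, 60]])
X_len, y_len = 3, 1

# Number of window positions that fit in the history: 6 + 1 - 3 - 1 = 3.
loops = D.shape[1] + 1 - X_len - y_len

# Slide the window one period at a time and stack the slices vertically.
windows = np.vstack([D[:, col:col + X_len + y_len] for col in range(loops)])

# The last y_len columns are the targets, the rest are the inputs.
X, Y = np.split(windows, [-y_len], axis=1)
print(X.shape, Y.shape)  # (6, 3) (6, 1)
print(X[0], Y[0])        # [1 2 3] [4]
print(X[-2:])            # the most recent window of each product
```

Because the windows are stacked in chronological order, the last `rows * test_loops` rows are the most recent ones, which is why datasets() cuts the test set from the end.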

#### We can now easily call our new function datasets(df) as well as import_data().

We obtain the datasets needed to train our machine learning algorithm (X_train and Y_train) and those needed to test it (X_test and Y_test).

In [3]:
# Import data
df = import_data()

# Create training and test sets using the datasets function
X_train, Y_train, X_test, Y_test = datasets(df, X_len=12, y_len=1, test_loops=12)

Note: We can change y_len if we want to forecast multiple periods at once.
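
As a rough illustration of that note, the sketch below builds windows with y_len=2 on a made-up array and fits a linear regression on them. The toy values and the names `model` and `pred` are assumptions for the example. With y_len greater than 1 the targets stay two-dimensional, and scikit-learn's LinearRegression accepts such a multi-column target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy demand history: 2 products x 8 periods (values are made up).
D = np.array([[1, 2, 3, 4, 5, 6, 7, 8],
              [10, 20, 30, 40, 50, 60, 70, 80]], dtype=float)
X_len, y_len = 3, 2

# Same windowing as before, but each window now ends with two target periods.
loops = D.shape[1] + 1 - X_len - y_len
windows = np.vstack([D[:, col:col + X_len + y_len] for col in range(loops)])
X, Y = np.split(windows, [-y_len], axis=1)
print(X.shape, Y.shape)  # (8, 3) (8, 2)

# One model predicts both future periods at once.
model = LinearRegression().fit(X, Y)
pred = model.predict(D[:, -X_len:])
print(pred.shape)        # (2, 2): one row per product, one column per period
```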

#### Let’s create a linear regression forecast benchmark. 
We want to have an indication of what a simple model could do in order to compare its accuracy against our more complex models.

In [4]:
from sklearn.linear_model import LinearRegression

# Ensure the number of samples matches in X_train and Y_train
if X_train.shape[0] != Y_train.shape[0]:
    print("Number of samples in X_train and Y_train do not match!")
else:
    # Create and train the Linear Regression model
    reg = LinearRegression() 
    reg.fit(X_train, Y_train)
    
    # Generate predictions for the training and test sets
    Y_train_pred = reg.predict(X_train)
    Y_test_pred = reg.predict(X_test)

In [5]:
# Import the LinearRegression class from sklearn's linear_model module.
from sklearn.linear_model import LinearRegression

reg = LinearRegression() # Create an instance of the LinearRegression class.
reg.fit(X_train, Y_train) # Fit the linear regression model to the training data.

# Generate predictions for the training set using the fitted model.
# Y_train_pred contains the model's predictions based on X_train.
Y_train_pred = reg.predict(X_train)

# Generate predictions for the test set using the fitted model.
# Y_test_pred contains the model's predictions based on X_test;
# it is what we use to evaluate the model on unseen data.
Y_test_pred = reg.predict(X_test)
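
To get a feel for what the fitted model contains, here is a small sketch on synthetic data (not the car sales dataset). We assume a target built as 0.3 times the oldest period plus 0.7 times the most recent one, plus a little noise. The regression learns one weight per input period, so `coef_` has 12 entries, and on this synthetic data it recovers weights close to the ones we chose.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic inputs: 200 samples, each with 12 past periods of demand.
X = rng.uniform(50, 150, size=(200, 12))

# Assumed relationship for the demo: 30% of the oldest period,
# 70% of the most recent one, plus small random noise.
true_w = np.zeros(12)
true_w[0], true_w[-1] = 0.3, 0.7
y = X @ true_w + rng.normal(0, 1, size=200)

model = LinearRegression().fit(X, y)

# One coefficient per input period, plus a single intercept.
print(model.coef_.shape)
print(np.round(model.coef_, 2))
```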

#### We can now create a KPI function kpi_ML()
It will display the accuracy of our model.

Here, we use a DataFrame in order to print the various KPIs in a structured way.

In [6]:
def kpi_ML(Y_train, Y_train_pred, Y_test, Y_test_pred, name=""):

    # Create a DataFrame to store Key Performance Indicators (KPIs) for Machine Learning models.
    df = pd.DataFrame(columns=['MAE', 'RMSE', 'Bias'], index=['Train', 'Test'])
    
    # Set the name of the index to the provided 'name' parameter, typically the model name.
    df.index.name = name

    # Calculate and assign the MAE, RMSE, and Bias for the training dataset.
    df.loc['Train', 'MAE'] = 100 * np.mean(np.abs(Y_train - Y_train_pred)) / np.mean(Y_train)
    df.loc['Train', 'RMSE'] = 100 * np.sqrt(np.mean((Y_train - Y_train_pred)**2)) / np.mean(Y_train)
    df.loc['Train', 'Bias'] = 100 * np.mean((Y_train - Y_train_pred)) / np.mean(Y_train)

    # Calculate and assign the MAE, RMSE, and Bias for the test dataset.
    df.loc['Test', 'MAE'] = 100 * np.mean(np.abs(Y_test - Y_test_pred)) / np.mean(Y_test)
    df.loc['Test', 'RMSE'] = 100 * np.sqrt(np.mean((Y_test - Y_test_pred)**2)) / np.mean(Y_test)
    df.loc['Test', 'Bias'] = 100 * np.mean((Y_test - Y_test_pred)) / np.mean(Y_test)

    # Round the values in the DataFrame to one decimal place for easier reading and interpretation.
    df = df.astype(float).round(1)
    
    # Print the DataFrame to display the calculated KPIs.
    print(df)

In [7]:
# Example usage of the function
kpi_ML(Y_train, Y_train_pred, Y_test, Y_test_pred, name='Regression')

             MAE  RMSE  Bias
Regression                  
Train       17.8  43.9  -0.0
Test        17.8  43.7   1.6
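
To make the three KPIs concrete, here is a worked example on three made-up observations, using the same formulas as kpi_ML(). Each metric is divided by the average demand, so all three read as percentages. Since the error is computed as actual minus prediction, a negative bias means the model forecasts too high on average, and a positive bias means it forecasts too low.

```python
import numpy as np

# Made-up actuals and predictions, for illustration only.
Y = np.array([10.0, 20.0, 30.0])
Y_pred = np.array([12.0, 18.0, 33.0])

error = Y - Y_pred  # [-2.  2. -3.]

# Same scaling as in kpi_ML(): divide by the average demand, times 100.
mae_pct = 100 * np.mean(np.abs(error)) / np.mean(Y)
rmse_pct = 100 * np.sqrt(np.mean(error**2)) / np.mean(Y)
bias_pct = 100 * np.mean(error) / np.mean(Y)

print(round(mae_pct, 1), round(rmse_pct, 1), round(bias_pct, 1))  # 11.7 11.9 -5.0
```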


**Attention Point:**
Do not worry if the MAE and RMSE on your dataset are well above the example benchmark presented here. In some projects, Nicolas Vandeput has seen MAE as high as 80 to 100%, and RMSE well above 500%. Again, we use a linear regression benchmark precisely to get a sense of the order of magnitude of a dataset's complexity.

On a different dataset, with a longer forecast horizon (and more seasonality), linear regressions might not be up to the challenge.

#### Make predictions for the future
We can now change our hat from data scientist (working with training and test sets to evaluate models) to demand planner (using a model to populate a baseline forecast).

We can create a future forecast, the one the supply chain will actually use, by calling our datasets() function with test_loops set to 0. The function then returns X_test filled with the latest demand observations, so we can use it to predict future demand. Moreover, since no data is kept aside for a test set, the training dataset (X_train and Y_train) covers the whole historical demand. This is helpful, as the model gets as much training data as possible.

The cell below takes a shortcut: it builds the same input matrix directly from df and reuses the model fitted earlier, without refitting it on the full history.

In [8]:
# Make predictions for the future
X_future = df.values[:, -X_train.shape[1]:]
future_predictions = reg.predict(X_future)
future_predictions = future_predictions.reshape(-1, 1)

# Create a DataFrame for the future forecast
forecast = pd.DataFrame(data=future_predictions, index=df.index, columns=["Future_Predicted"])

# Print or use the 'forecast' DataFrame as needed
print(forecast.head())

              Future_Predicted
Make                          
Alfa Romeo            6.413065
Aston Martin          1.279442
Audi                650.220440
BMW                1262.958337
Bentley               1.465567
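
For comparison, here is a minimal sketch of the full-history variant, where the model is refit on every available window before forecasting. It runs on a synthetic table so that it is self-contained. The `history` DataFrame, its make names, and its random values are assumptions, and the windowing is written inline rather than through datasets().

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the pivoted sales table: 3 makes x 36 months.
rng = np.random.default_rng(42)
history = pd.DataFrame(
    rng.integers(0, 500, size=(3, 36)),
    index=pd.Index(["Make A", "Make B", "Make C"], name="Make"),
)

X_len, y_len = 12, 1
D = history.values
loops = D.shape[1] + 1 - X_len - y_len  # 36 + 1 - 12 - 1 = 24

# With no test set kept aside, every window goes into training.
windows = np.vstack([D[:, col:col + X_len + y_len] for col in range(loops)])
X_full, Y_full = np.split(windows, [-y_len], axis=1)

# Refit the benchmark on the whole history.
model = LinearRegression().fit(X_full, Y_full.ravel())

# The latest X_len periods of each make are the inputs for the next period.
X_future = D[:, -X_len:]
forecast_full = pd.DataFrame(
    model.predict(X_future), index=history.index, columns=["Forecast"]
)
print(forecast_full)
```

On the real data, the equivalent would be to call datasets(df, test_loops=0), fit a new LinearRegression on the returned X_train and Y_train, and predict on the returned X_test.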
