## Adding New Features to the Training Set
(With comments and code from Nicolas Vandeput's book "Data Science for Supply Chain Forecasting": https://supchains.com/books/#book1)

For many businesses, historical demand is not the only, or even the main, factor that drives future sales. Other internal and external factors drive demand as well. You might sell more or less depending on the weather, GDP growth, the unemployment rate, loan rates, and so on. These factors are external, as the company does not control them. Demand can also be driven by company decisions: price changes, promotions, marketing budget, or another product’s sales. As these factors result from business decisions, we will call them internal factors.

Taking GDP growth as an example, we will show the model both historical demand and historical GDP growth side by side. This allows it to learn that, historically, sales could have been high or low partly because GDP growth was favorable or unfavorable.

In practice, when creating our training dataset, we’ll add the GDP growth next to the historical demand in X_train. See the implementation of X_train and Y_train in
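
As a toy illustration of this side-by-side layout, the sketch below builds a few training rows by hand. The numbers and the names `demand`, `gdp`, and `x_len` are made up for the example and do not come from the dataset; each row holds a window of GDP growth followed by the matching window of demand, with the next period's demand as the target.

```python
import numpy as np

# Hypothetical data: 6 periods of demand and GDP growth for one product
demand = np.array([10, 12, 11, 15, 14, 16])
gdp = np.array([2.5, 2.5, 2.9, 2.9, 0.5, 0.5])
x_len = 3  # number of historical periods used as input features

X_rows, y_rows = [], []
for start in range(len(demand) - x_len):
    gdp_window = gdp[start:start + x_len]
    demand_window = demand[start:start + x_len]
    # GDP growth sits next to the historical demand in the same row
    X_rows.append(np.hstack([gdp_window, demand_window]))
    # The target is the demand of the period right after the window
    y_rows.append(demand[start + x_len])

X_train = np.array(X_rows)
Y_train = np.array(y_rows)
print(X_train.shape, Y_train.shape)  # (3, 6) (3,)
```

Each row of `X_train` has `2 * x_len` columns here: three GDP values and three demand values.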

### Data Transformation for Time Series Analysis
Let's define a function called import_data that automates importing car sales data from a CSV file and transforming it into a pandas DataFrame formatted for time series analysis. The CSV file, located at the provided URL, contains monthly car sales data organized by the car make and model.

In [1]:
import pandas as pd

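# URL of the CSV file
file_path = "https://supchains.com/wp-content/uploads/2021/07/norway_new_car_sales_by_make1.csv"

# Define the import_data function
def import_data(file_path):
    """Load the monthly sales and pivot them to one row per make, one column per period."""
    data = pd.read_csv(file_path)
    # Build a 'YYYY-MM' label; the zero-padded month keeps the columns in chronological order
    data['Period'] = data['Year'].astype(str) + '-' + data['Month'].astype(str).str.zfill(2)
    # Rows = makes, columns = periods, values = summed quantities (0 where there were no sales)
    df = pd.pivot_table(data=data, values=['Quantity'], index='Make', columns='Period', aggfunc='sum', fill_value=0)
    return df

# Create the DataFrame using the import_data function
df = import_data(file_path)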

# Now 'df' contains the data from the provided URL in the desired format.

# Print the DataFrame
print(df.head())

             Quantity                                                          \
Period        2007-01 2007-02 2007-03 2007-04 2007-05 2007-06 2007-07 2007-08   
Make                                                                            
Alfa Romeo         16       9      21      20      17      21      14      12   
Aston Martin        0       0       1       0       4       3       3       0   
Audi              599     498     682     556     630     498     562     590   
BMW               352     335     365     360     431     477     403     348   
Bentley             0       0       0       0       0       1       0       0   

                              ...                                          \
Period       2007-09 2007-10  ... 2016-04 2016-05 2016-06 2016-07 2016-08   
Make                          ...                                           
Alfa Romeo        15      10  ...       3       1       2       1       6   
Aston Martin       0       0  ...       0  
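
To see what the pivot does without downloading the file, here is a minimal sketch on a hand-made table. The column names match those used by import_data, but the rows themselves are invented for illustration.

```python
import pandas as pd

# Invented sample rows with the same columns that import_data relies on
sample = pd.DataFrame({
    'Year': [2007, 2007, 2007, 2007],
    'Month': [1, 2, 1, 1],
    'Make': ['Audi', 'Audi', 'BMW', 'BMW'],
    'Quantity': [5, 7, 3, 4],
})

# Zero-pad the month so the labels sort chronologically ('2007-01', '2007-02', ...)
sample['Period'] = sample['Year'].astype(str) + '-' + sample['Month'].astype(str).str.zfill(2)

# One row per make, one column per period; duplicate entries are summed, gaps become 0
pivot = pd.pivot_table(data=sample, values='Quantity', index='Make',
                       columns='Period', aggfunc='sum', fill_value=0)
print(pivot)
```

In this sketch `values` is passed as a plain string rather than a list, which keeps the column labels flat ('2007-01', ...) instead of nesting them under a top-level 'Quantity' header as in the output above.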

### Retrieving and Structuring Economic Data for Time Series Analysis
Now we are going to fetch economic data through the Statistics Norway (SSB) API and structure it into a pandas DataFrame for time series analysis. The data retrieved is annual GDP growth (the yearly change in GDP volume, in percent) for each year from 2006 to 2017.
Here's what the script does in detail:

- It imports the necessary requests module for making HTTP requests and pandas for data manipulation.
- It specifies the URL for the API endpoint that will provide the economic data.
- It constructs a JSON payload that defines the parameters for the API request. This includes specifying the macroeconomic indicator for GDP ("bnpb.nr23_9"), the content code for the volume ("Volum"), and the range of years of interest.
- It sends a POST request to the API with the payload and parses the JSON response.
- It extracts the GDP values and corresponding years from the response data.
- It creates a pandas DataFrame, X_exo, with two columns: 'Year' and 'GDP', where 'Year' is converted to a datetime format.
- It sets the 'Year' column as the index of the DataFrame and converts the index to monthly periods, so each annual value is labeled with January of its year (e.g., '2006-01', '2007-01', etc.).
- Finally, it prints the resulting DataFrame, which holds one GDP value per year and can then be aligned with the monthly sales data.

In [2]:
import requests
import pandas as pd

# URL for the API request
url = "https://data.ssb.no/api/v0/en/table/09189/"

# JSON payload for the request
payload = {
    "query": [
        {
            "code": "Makrost",
            "selection": {
                "filter": "item",
                "values": ["bnpb.nr23_9"]
            }
        },
        {
            "code": "ContentsCode",
            "selection": {
                "filter": "item",
                "values": ["Volum"]
            }
        },
        {
            "code": "Tid",
            "selection": {
                "filter": "item",
                "values": [str(year) for year in range(2006, 2018)]
            }
        }
    ],
    "response": {
        "format": "json-stat2"
    }
}

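# Making the API request (the timeout avoids hanging if the server is unreachable)
response = requests.post(url, json=payload, timeout=30)
response.raise_for_status()  # Stop early on an HTTP error instead of parsing an error page
data = response.json()

# Extracting the values and years
values = data['value']
years = list(data['dimension']['Tid']['category']['label'])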

# Creating a DataFrame
X_exo = pd.DataFrame({
    'Year': years,
    'GDP': values
})

# Setting the year as index and formatting it to '%Y-%m'
X_exo['Year'] = pd.to_datetime(X_exo['Year'], format='%Y')
X_exo.set_index('Year', inplace=True)
X_exo.index = X_exo.index.to_period('M')

# Displaying the DataFrame
print(X_exo)

         GDP
Year        
2006-01  2.5
2007-01  2.9
2008-01  0.5
2009-01 -1.9
2010-01  0.8
2011-01  1.1
2012-01  2.7
2013-01  1.0
2014-01  2.0
2015-01  1.9
2016-01  1.2
2017-01  2.5
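
The GDP series above has one value per year, whereas the sales table has one column per month. Below is a minimal sketch of one way to align the two, assuming each month simply inherits the GDP figure of its year. That is a simplification chosen for illustration, and the names `periods`, `gdp_by_year`, and `X_GDP` are hypothetical.

```python
import pandas as pd

# Hypothetical monthly labels, in the same 'YYYY-MM' format as the sales table columns
periods = ['2015-11', '2015-12', '2016-01', '2016-02']

# Annual values copied from the output above
gdp_by_year = pd.Series({2015: 1.9, 2016: 1.2})

# Look up the year of each month and repeat that year's value
years = pd.to_datetime(periods, format='%Y-%m').year
X_GDP = [float(gdp_by_year.loc[year]) for year in years]
print(X_GDP)  # [1.9, 1.9, 1.2, 1.2]
```

With the real data, `periods` would come from the period labels of the sales table and `gdp_by_year` from X_exo, so that the resulting list has one entry per monthly column.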


### Machine Learning Data Preparation for Time Series Forecasting with Exogenous Variables
The function datasets_exo turns the sales DataFrame into training and test arrays that include the exogenous variable X_exo alongside the historical demand. It is tailored to time series where the goal is to predict future values from past observations and additional external factors.
The key operations performed by the function include:

- **Data Conversion:** The input DataFrame df, which contains time series data, is converted into a NumPy array D for more efficient numerical operations.

- **Exogenous Variables Preparation:** The exogenous data X_exo, which should provide one value per period (that is, per column of df), is repeated across all time series rows so that every series sees the same external values.

- **Month Extraction:** The function assumes the column labels of df end with a two-digit month (as in '2007-01'). It parses that month into a number from 1 to 12 and repeats it for each row, which can help the model capture seasonality.

- **Training Set Creation:** The function slides a window across the periods. Each window yields, for every series, one row made of a month feature, x_len periods of exogenous data, and x_len periods of demand as input features, followed by y_len periods of demand as the target.

- **Test Set Creation:** If test_loops is greater than zero, the last test_loops windows are split off as the test set. Otherwise, X_test is built from the most recent x_len periods in order to forecast the future, and Y_test is filled with NaN placeholders because the true future values are unknown.

- **Scikit-Learn Compatibility:** When the target length y_len is one, the target arrays are flattened to 1D, which is the shape scikit-learn expects for a single-output target.

Holding out a test set lets us evaluate the model's performance on unseen data, while the mode without a test set lets us simulate making future predictions.
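
To make the windowing arithmetic concrete, the sketch below computes how many rows each set receives. The helper `window_counts` is hypothetical and only mirrors the loop count used by datasets_exo; the example sizes (65 series, 121 monthly periods) are assumed for illustration.

```python
def window_counts(rows, periods, x_len=12, y_len=1, test_loops=12):
    """Return (train_rows, test_rows) produced by sliding windows over the data."""
    # Each window spans x_len input periods plus y_len target periods
    loops = periods + 1 - x_len - y_len
    if loops <= 0:
        raise ValueError("Not enough periods for the requested x_len and y_len")
    if test_loops > 0:
        # The last test_loops windows are held out, each contributing one row per series
        test_rows = rows * test_loops
        train_rows = rows * (loops - test_loops)
    else:
        # No hold-out: every window is used for training, and one block of rows
        # is prepared for the future forecast
        test_rows = rows
        train_rows = rows * loops
    return train_rows, test_rows

print(window_counts(rows=65, periods=121))  # (6305, 780)
```

With the defaults, 121 periods give 109 windows, of which the last 12 are held out for testing.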

In [3]:
import numpy as np

def datasets_exo(df, X_exo, x_len=12, y_len=1, test_loops=12):
    # Convert DataFrame to numpy array
    D = df.values
    rows, periods = D.shape  # Number of time series and number of periods

    # Prepare the exogenous variable: one value per period, repeated for each row
    X_exo = np.asarray(X_exo, dtype=float).reshape(1, -1)
    if X_exo.shape[1] < periods:
        raise ValueError("X_exo must contain at least one value per column of df")
    X_exo = np.repeat(X_exo[:, :periods], rows, axis=0)  # Extra values are ignored

    # Prepare the month variable (1-12), read from the 'YYYY-MM' period labels.
    # get_level_values(-1) also works when the columns are a MultiIndex.
    labels = df.columns.get_level_values(-1)
    X_months = np.repeat(np.reshape([int(str(col)[-2:]) for col in labels], [1, -1]), rows, axis=0)

    # Training set creation
    loops = periods + 1 - x_len - y_len  # Number of sliding windows
    train = []  # Initialize the training set list

    for col in range(loops):
        m = X_months[:, col+x_len].reshape(-1, 1)  # Month of the first period to forecast
        exo = X_exo[:, col:col+x_len]  # Exogenous data over the input window
        d = D[:, col:col+x_len+y_len]  # Demand: x_len inputs followed by y_len targets
        train.append(np.hstack([m, exo, d]))  # One row per series: month, exogenous data, demand

    train = np.vstack(train)  # Stack the windows vertically
    X_train, Y_train = np.split(train, [-y_len], axis=1)  # Split into features and target

    # Test set creation
    if test_loops > 0:
        X_train, X_test = np.split(X_train, [-rows*test_loops], axis=0)
        Y_train, Y_test = np.split(Y_train, [-rows*test_loops], axis=0)
    else:  # No test set: X_test is used to generate the future forecast
        next_month = (X_months[:, -1] % 12 + 1).reshape(-1, 1)  # Month following the last known period
        X_test = np.hstack([next_month, X_exo[:, -x_len:], D[:, -x_len:]])
        Y_test = np.full((X_test.shape[0], y_len), np.nan)  # Dummy values

    # Formatting required for scikit-learn
    if y_len == 1:
        Y_train = Y_train.ravel()
        Y_test = Y_test.ravel()

    return X_train, Y_train, X_test, Y_test

From here, you can simply use this function to generate the new train and test arrays. These can then be used in the various models.
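
As a rough sketch of that last step, the example below fits a model on training arrays and scores it on test arrays. The arrays are random stand-ins with the same layout as the output of datasets_exo (they are not real results), and a plain least-squares fit with NumPy is used here only as a placeholder for whichever regressor you prefer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the arrays returned by datasets_exo (random values).
# 25 features = 1 month + 12 exogenous values + 12 demand values with the defaults.
n_train, n_test, n_features = 200, 20, 25
X_train = rng.normal(size=(n_train, n_features))
true_weights = rng.normal(size=n_features)
Y_train = X_train @ true_weights + rng.normal(scale=0.1, size=n_train)
X_test = rng.normal(size=(n_test, n_features))
Y_test = X_test @ true_weights + rng.normal(scale=0.1, size=n_test)

# Fit ordinary least squares with an intercept column
A_train = np.hstack([np.ones((n_train, 1)), X_train])
coef, *_ = np.linalg.lstsq(A_train, Y_train, rcond=None)

# Predict on the held-out rows and measure the mean absolute error
A_test = np.hstack([np.ones((n_test, 1)), X_test])
Y_pred = A_test @ coef
mae = np.mean(np.abs(Y_test - Y_pred))
print(f"MAE on the test set: {mae:.3f}")
```

Any model that accepts a 2D feature array and a 1D target, such as a scikit-learn regressor, could take the place of the least-squares step.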