## Categorical Features
**(With comments and code from Nicolas Vandeput's book "Data Science for Supply Chain Forecasting")**

Most supply chains serve different markets (often through different channels) and have different product families. What if a machine learning model could benefit from these extra pieces of information: Am I selling this to market A? Is this product part of family B?

In [9]:
import pandas as pd

# URL of the CSV file
file_path = "https://supchains.com/wp-content/uploads/2021/07/norway_new_car_sales_by_make1.csv"

def import_data(path=file_path):
    """Load the car sales data and pivot it to one row per make, one column per period."""
    data = pd.read_csv(path)
    # Build a sortable 'YYYY-MM' period label
    data['Period'] = data['Year'].astype(str) + '-' + data['Month'].astype(str).str.zfill(2)
    # values='Quantity' (a string, not a list) keeps the columns flat: one plain label per period
    df = pd.pivot_table(data=data, values='Quantity', index='Make', columns='Period', aggfunc='sum', fill_value=0)
    return df

# Create the DataFrame using the import_data function
df = import_data()

# Preview the first five makes and the first six periods
print(df.iloc[:5, :6])

Period        2007-01  2007-02  2007-03  2007-04  2007-05  2007-06
Make
Alfa Romeo         16        9       21       20       17       21
Aston Martin        0        0        1        0        4        3
Audi              599      498      682      556      630      498
BMW               352      335      365      360      431      477
Bentley             0        0        0        0        0        1

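We will now update our datasets() function so that it can handle categorical inputs.

The idea is to flag the categorical columns in the historical dataset df based on their names. pd.get_dummies() names each dummy column as prefix + prefix_sep + value (for example, Brand_Audi), so any column whose name contains the separator can be treated as categorical. This works as long as no period label contains that separator; here the periods are formatted as 2007-01, so an underscore is a safe choice.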
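
As a minimal sketch of that naming convention, the toy frame below (its brands and figures are made up for illustration) shows how pd.get_dummies() builds each dummy column name, so that a simple substring test separates the categorical columns from the period columns. It assumes that no period label contains the separator.

```python
import pandas as pd

# Toy demand history: the brands and quantities are invented for illustration
toy = pd.DataFrame(
    {'2007-01': [16, 599, 352], '2007-02': [9, 498, 335]},
    index=['Alfa Romeo', 'Audi', 'BMW'],
)
toy['Brand'] = toy.index  # copy the index into a regular column
toy = pd.get_dummies(toy, columns=['Brand'], prefix_sep='_')

# Dummy columns are named '<prefix><prefix_sep><value>', e.g. 'Brand_Audi'
col_cat = [col for col in toy.columns if '_' in col]
col_demand = [col for col in toy.columns if '_' not in col]
print(col_cat)     # the one-hot brand columns
print(col_demand)  # the period columns
```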

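The datasets_cat() function below prepares the training and test sets for machine learning models, with the categorical variables included.

**Parameters:**
- df (DataFrame): The input data frame containing the demand history (one column per period) and the one-hot encoded categorical columns.
- x_len (int): The number of historical periods used as input.
- y_len (int): The number of future periods to predict.
- test_loops (int): The number of most recent loops (sliding windows) held out for testing. If 0, no test set is held out and X_test is built from the last x_len periods to generate the future forecast.
- cat_name (str): The substring that identifies categorical columns by name (the prefix_sep passed to pd.get_dummies()).

**Returns:**
- X_train (numpy.ndarray): The training inputs: the categorical columns followed by x_len periods of demand.
- Y_train (numpy.ndarray): The training targets (y_len periods; flattened to 1-D when y_len is 1).
- X_test (numpy.ndarray): The test inputs, in the same layout as X_train.
- Y_test (numpy.ndarray): The test targets, or NaN placeholders when test_loops is 0.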

In [10]:
import numpy as np

def datasets_cat(df, x_len=12, y_len=1, test_loops=12, cat_name='_'):
    # Select columns that contain categorical data
    col_cat = [col for col in df.columns if cat_name in col]
    
    # Separate historical demand data and categorical data
    D = df.drop(columns=col_cat).values  # Historical demand
    C = df[col_cat].values  # Categorical info
    
    # Determine the shape of the historical data
    rows, periods = D.shape

    # Training set creation
    loops = periods + 1 - x_len - y_len
    train = []
    
    # Loop to create training sequences
    for col in range(loops):
        train.append(D[:, col:col+x_len+y_len])
    
    # Stack sequences vertically
    train = np.vstack(train)
    
    # Split sequences into input (X) and output (Y) features
    X_train, Y_train = np.split(train, [-y_len], axis=1)
    
    # Concatenate categorical data with input features
    X_train = np.hstack((np.vstack([C]*loops), X_train))

    # Test set creation
    if test_loops > 0:
        # Split the data into training and test sets if test_loops are specified
        X_train, X_test = np.split(X_train, [-rows*test_loops], axis=0)
        Y_train, Y_test = np.split(Y_train, [-rows*test_loops], axis=0)
    else:
        # No test set: X_test is used to generate the future forecast
        X_test = np.hstack((C, D[:, -x_len:]))
        Y_test = np.full((X_test.shape[0], y_len), np.nan)  # Dummy value for Y_test

    # Formatting required for scikit-learn
    if y_len == 1:
        Y_train = Y_train.ravel()
        Y_test = Y_test.ravel()

    return X_train, Y_train, X_test, Y_test
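
To see the shapes this produces, here is a small sketch of the same sliding-window logic on made-up numbers (3 items, 6 periods, and 2 hypothetical dummy columns). It mirrors the stacking steps of datasets_cat() rather than calling it, so it can be run on its own.

```python
import numpy as np

# Invented example: 3 items, 6 periods of demand, 2 dummy columns per item
D = np.arange(18).reshape(3, 6)          # demand history, shape (rows, periods)
C = np.array([[1, 0], [0, 1], [0, 1]])   # one-hot flags, shape (rows, n_dummies)
x_len, y_len = 3, 1

rows, periods = D.shape
loops = periods + 1 - x_len - y_len      # 6 + 1 - 3 - 1 = 3 windows per item

# Each window holds x_len input periods followed by y_len target periods
train = np.vstack([D[:, col:col + x_len + y_len] for col in range(loops)])
X, Y = np.split(train, [-y_len], axis=1)

# The categorical flags are repeated once per window and placed first
X = np.hstack((np.vstack([C] * loops), X))

print(X.shape)  # (rows * loops, n_dummies + x_len)
print(Y.shape)  # (rows * loops, y_len)
print(X[0])     # flags of item 0, then its first x_len demand values
```

Because the windows are stacked from oldest to newest, the last rows * test_loops lines of X and Y are the most recent windows, which is why datasets_cat() splits the test set off the end.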

Let’s now use one-hot encoding to differentiate each brand. We can easily retrieve the brand from the index and then one-hot encode it with the pandas function pd.get_dummies().

Using one-hot encoding (also known as dummification) rather than integer encoding gives each brand its own column, so you can later remove the useless features and keep only the meaningful ones, thus reducing overfitting. For the same reason, it is preferable not to drop one arbitrary dummy column up front, but to wait for the feature analysis and then remove only the meaningless ones.
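
The difference between the two encodings can be sketched on a few made-up brands. Integer encoding (here via pd.factorize(), one possible choice) packs every brand into a single column, so it cannot be pruned brand by brand, whereas one-hot encoding yields one column per brand that can be dropped individually later. The decision to drop Brand_BMW below is purely hypothetical.

```python
import pandas as pd

brands = pd.Series(['Audi', 'BMW', 'Audi', 'Bentley'], name='Brand')

# Integer encoding: a single column of arbitrary codes
codes, uniques = pd.factorize(brands)
print(codes)            # one integer per row
print(list(uniques))    # the brand behind each code

# One-hot encoding: one column per brand, all kept at first
dummies = pd.get_dummies(brands, prefix='Brand', prefix_sep='_')
print(list(dummies.columns))

# Later, only the dummies found to be uninformative are removed
# (dropping 'Brand_BMW' here is a made-up decision, for illustration only)
kept = dummies.drop(columns=['Brand_BMW'])
print(list(kept.columns))
```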

In [11]:
# Import data using the function defined above
df = import_data()
# Copy the index (the car makes) into a new 'Brand' column
df['Brand'] = df.index
# One-hot encode 'Brand': one new dummy column per make (prefix and make joined by prefix_sep)
df = pd.get_dummies(df, columns=['Brand'], prefix_sep='_')

# Prepare the training and test datasets using the datasets_cat function
X_train, Y_train, X_test, Y_test = datasets_cat(df, x_len=12, y_len=1, test_loops=12, cat_name='_')

Now that we have proper training and test sets, we can continue our data science process with feature optimization.