### Hackathon Sample Dataset for Practice

The company's problem is complex, and cases this rich and informative are rare. Before thinking about the methodology, you need to think about the strategy.

Retailers have fixed space for stocking inventory, so they tend to buy a roughly fixed number of items from each apparel manufacturer. If a retailer ordered 500 parkas from XYZ last year (across different models, colors, and sizes), you would expect this year's order to fall between about 450 and 550. Aggregate demand is therefore expected to be less uncertain. When predicting aggregate demand, you can also use advance demand information, as we discussed several times in class.

Once you have aggregate demand, you must calculate choice probabilities. You need to develop a choice model to reach the SKU-level demand forecasts. Suppose that a retailer sells three SKUs (A, B, and C) and that aggregate demand is predicted to be 100 units in April. If the choice probabilities for that season are 0.30, 0.25, and 0.45, respectively, then the SKU-level demand predictions are 30, 25, and 45 units for A, B, and C. To get choice probabilities, you must develop a classification model (e.g., logistic regression). Variables such as price and season can be added to the choice model.

Here you need to pay attention to decision variables. Suppose the choice model uses price and season as explanatory variables. If you want to predict demand for next month, the season follows directly from the month, but price is a decision variable. For a given SKU, you may use the average price as the expected price for next month's sales. Ideally, companies make pricing decisions using such models, so the model you are developing has an impact on revenue management.

Finally, you need a strong understanding of how to handle and group data. Aggregate demand must be formed from the transactional dataset, ideally grouped by month. But if you group demand along a single dimension, you cannot use the advance demand information. Grouping demand by delivery month and by a binary variable indicating whether the demand is advance or urgent would be useful. The good thing about the choice model is that it can be developed directly from the transactional data.
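
The three-SKU example from the brief can be written out as a short calculation. This is only a sketch of the arithmetic; the variable names are illustrative and not part of the dataset.

```python
import math

# Figures taken from the example in the brief above.
aggregate_demand = 100  # predicted aggregate demand for April
choice_probabilities = {"A": 0.30, "B": 0.25, "C": 0.45}

# Choice probabilities must sum to one, otherwise units are lost or double-counted.
assert math.isclose(sum(choice_probabilities.values()), 1.0)

# SKU-level forecast = aggregate demand x choice probability of the SKU.
sku_demand = {sku: aggregate_demand * p for sku, p in choice_probabilities.items()}

for sku, units in sku_demand.items():
    print(f"SKU {sku}: {units:.0f} units")
```

Because the probabilities sum to one, the SKU-level forecasts always add back up to the aggregate forecast.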

________________________________________________________________________________________________________________________________________________

XYZ Sportswear (the real company name is disguised due to an NDA) is a world-renowned company
that sells high-quality parkas in 45 countries. Most sales occur in the western European and
Russian markets. However, XYZ is not actively present in these locations, so the company sells
its products to vendors. Some of the vendors are directly linked to big sporting goods retailers
(e.g., Decathlon). The vendors place their orders well in advance of the requested delivery
date. For each order, they may select more than one item in different quantities. If a vendor
places an order for 3 different parkas in some quantities, for example, the order appears in
three different rows in the dataset, so that each row corresponds to an order code/product code
pair. The orders are stored in a dataset that includes 11 variables, one per column. There are
2421 rows in the dataset, covering the period between 2009 and 2011.

To improve the accuracy of demand forecasts, the demand planners identified two issues that
need to be addressed by an external project team. First, the company needs a choice model to
understand whether the choice probabilities are affected by seasonality. Second, it needs a
model to estimate demand based on the advance orders.

________________________________________________________________________________________________________________________________________________

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, accuracy_score

In [2]:
df= pd.read_csv('/Users/pratiksha/Documents/Predictive Modelling/TEST_XYZ_Sportswear_Orders_Dataset.csv')

In [3]:
df.head(10)

Unnamed: 0,Order Date,Requested Delivery Date,Customer Country Code,Product Code,Description,Order Type,Customer Order Code,Value,Currency,Items,Route
0,01.01.2009,18.02.2009,DE,PK003,Basic Parka,VO,COD00001,269.87,EUR,8,East
1,01.01.2009,10.02.2009,FR,PK001,Premium Parka,VO,COD00002,170.34,EUR,2,West
2,01.01.2009,17.01.2009,ES,PK002,Basic Parka,VO,COD00003,61.09,EUR,7,West
3,03.01.2009,28.01.2009,IT,PK003,Advanced Parka,VO,COD00004,251.18,EUR,6,North
4,03.01.2009,18.01.2009,DE,PK004,Premium Parka,VO,COD00005,153.33,EUR,5,West
5,03.01.2009,14.01.2009,ES,PK001,Premium Parka,VO,COD00006,196.87,EUR,2,West
6,03.01.2009,27.02.2009,ES,PK003,Advanced Parka,VO,COD00007,90.85,EUR,3,West
7,03.01.2009,08.01.2009,DE,PK003,Premium Parka,VO,COD00008,164.53,EUR,9,West
8,04.01.2009,25.01.2009,IT,PK004,Premium Parka,VO,COD00009,110.15,EUR,6,East
9,04.01.2009,15.02.2009,RU,PK004,Advanced Parka,VO,COD00010,124.43,EUR,6,West


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2421 entries, 0 to 2420
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Order Date               2421 non-null   object 
 1   Requested Delivery Date  2421 non-null   object 
 2   Customer Country Code    2421 non-null   object 
 3   Product Code             2421 non-null   object 
 4   Description              2421 non-null   object 
 5   Order Type               2421 non-null   object 
 6   Customer Order Code      2421 non-null   object 
 7   Value                    2421 non-null   float64
 8   Currency                 2421 non-null   object 
 9   Items                    2421 non-null   int64  
 10  Route                    2421 non-null   object 
dtypes: float64(1), int64(1), object(9)
memory usage: 208.2+ KB


In [5]:
df.describe()

Unnamed: 0,Value,Items
count,2421.0,2421.0
mean,173.197071,7.023957
std,71.542403,4.982971
min,50.16,1.0
25%,111.96,3.0
50%,172.83,6.0
75%,234.46,9.0
max,299.99,23.0


### Choice model to understand whether the choice probabilities are affected by seasonality

In [7]:
df['Order Date'] = pd.to_datetime(df['Order Date'], format='%d.%m.%Y')
df['Requested Delivery Date'] = pd.to_datetime(df['Requested Delivery Date'], format='%d.%m.%Y')
df['Month'] = df['Order Date'].dt.month

In [8]:
df.head(10)

Unnamed: 0,Order Date,Requested Delivery Date,Customer Country Code,Product Code,Description,Order Type,Customer Order Code,Value,Currency,Items,Route,Month
0,2009-01-01,2009-02-18,DE,PK003,Basic Parka,VO,COD00001,269.87,EUR,8,East,1
1,2009-01-01,2009-02-10,FR,PK001,Premium Parka,VO,COD00002,170.34,EUR,2,West,1
2,2009-01-01,2009-01-17,ES,PK002,Basic Parka,VO,COD00003,61.09,EUR,7,West,1
3,2009-01-03,2009-01-28,IT,PK003,Advanced Parka,VO,COD00004,251.18,EUR,6,North,1
4,2009-01-03,2009-01-18,DE,PK004,Premium Parka,VO,COD00005,153.33,EUR,5,West,1
5,2009-01-03,2009-01-14,ES,PK001,Premium Parka,VO,COD00006,196.87,EUR,2,West,1
6,2009-01-03,2009-02-27,ES,PK003,Advanced Parka,VO,COD00007,90.85,EUR,3,West,1
7,2009-01-03,2009-01-08,DE,PK003,Premium Parka,VO,COD00008,164.53,EUR,9,West,1
8,2009-01-04,2009-01-25,IT,PK004,Premium Parka,VO,COD00009,110.15,EUR,6,East,1
9,2009-01-04,2009-02-15,RU,PK004,Advanced Parka,VO,COD00010,124.43,EUR,6,West,1


In [9]:
le = LabelEncoder()
df['Season'] = df['Month'].apply(lambda x: 'Spring' if x in [3, 4, 5] else ('Summer' if x in [6, 7, 8] else ('Fall' if x in [9, 10, 11] else 'Winter')))
df['Season_encoded'] = le.fit_transform(df['Season'])
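
`LabelEncoder` assigns integers in alphabetical order (Fall=0, Spring=1, Summer=2, Winter=3), and a logistic regression then treats that column as an ordered numeric feature. One possible alternative, sketched here on a small made-up frame rather than on `df`, is to one-hot encode the season so that no ordering is implied. The helper `month_to_season` and the frame `demo` are illustrative names, not part of the notebook.

```python
import pandas as pd


def month_to_season(month: int) -> str:
    """Map a calendar month (1-12) to the season labels used in the notebook."""
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    if month in (9, 10, 11):
        return "Fall"
    return "Winter"


# Small illustrative frame; in the notebook this would be df['Month'].
demo = pd.DataFrame({"Month": [1, 4, 7, 10, 12]})
demo["Season"] = demo["Month"].map(month_to_season)

# One indicator column per season instead of a single ordinal code.
season_dummies = pd.get_dummies(demo["Season"], prefix="Season", dtype=int)
print(pd.concat([demo, season_dummies], axis=1))
```

With a linear model and an intercept, one of the four indicator columns is usually dropped (for example with `drop_first=True`) to avoid perfect collinearity.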

In [10]:
df.head(10)

Unnamed: 0,Order Date,Requested Delivery Date,Customer Country Code,Product Code,Description,Order Type,Customer Order Code,Value,Currency,Items,Route,Month,Season,Season_encoded
0,2009-01-01,2009-02-18,DE,PK003,Basic Parka,VO,COD00001,269.87,EUR,8,East,1,Winter,3
1,2009-01-01,2009-02-10,FR,PK001,Premium Parka,VO,COD00002,170.34,EUR,2,West,1,Winter,3
2,2009-01-01,2009-01-17,ES,PK002,Basic Parka,VO,COD00003,61.09,EUR,7,West,1,Winter,3
3,2009-01-03,2009-01-28,IT,PK003,Advanced Parka,VO,COD00004,251.18,EUR,6,North,1,Winter,3
4,2009-01-03,2009-01-18,DE,PK004,Premium Parka,VO,COD00005,153.33,EUR,5,West,1,Winter,3
5,2009-01-03,2009-01-14,ES,PK001,Premium Parka,VO,COD00006,196.87,EUR,2,West,1,Winter,3
6,2009-01-03,2009-02-27,ES,PK003,Advanced Parka,VO,COD00007,90.85,EUR,3,West,1,Winter,3
7,2009-01-03,2009-01-08,DE,PK003,Premium Parka,VO,COD00008,164.53,EUR,9,West,1,Winter,3
8,2009-01-04,2009-01-25,IT,PK004,Premium Parka,VO,COD00009,110.15,EUR,6,East,1,Winter,3
9,2009-01-04,2009-02-15,RU,PK004,Advanced Parka,VO,COD00010,124.43,EUR,6,West,1,Winter,3


In [11]:
features = ['Value', 'Season_encoded'] 

In [12]:
target = 'Items'

In [13]:
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.2, random_state=42)

In [14]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [15]:
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

In [16]:
predictions = model.predict(X_test_scaled)

In [21]:
predicted_aggregate_demand = 100  # example value from the brief above (100 units in April)

In [18]:
probability_estimates = model.predict_proba(X_test_scaled)

In [19]:
for i in range(5):
    print(f"Instance {i + 1}: {probability_estimates[i]}")

Instance 1: [0.11495762 0.09336051 0.08639033 0.08448194 0.09682669 0.09705018
 0.10228786 0.07976639 0.08053585 0.01153292 0.01385866 0.01705854
 0.01742618 0.02159725 0.02141702 0.01996295 0.01037732 0.01521337
 0.00699592 0.00422553 0.00040648 0.00332697 0.0009435 ]
Instance 2: [0.10777749 0.07954566 0.07149207 0.07493791 0.08828522 0.08850769
 0.10349438 0.07745145 0.06809823 0.01890086 0.02431415 0.02481437
 0.02067878 0.02722488 0.02938492 0.02561943 0.01946711 0.01772501
 0.01417238 0.00714258 0.00287649 0.0064983  0.00159064]
Instance 3: [0.09877529 0.06088383 0.05763418 0.06175846 0.07525973 0.07259956
 0.09597174 0.07059941 0.05358529 0.02773918 0.03808225 0.03263595
 0.02631157 0.03465585 0.03662997 0.03110454 0.0342205  0.01909732
 0.02603223 0.01335859 0.02179098 0.00953302 0.00174055]
Instance 4: [0.08951861 0.06626822 0.05262559 0.06248874 0.07512123 0.07880889
 0.10001997 0.06984621 0.05402823 0.03039833 0.04201381 0.03502936
 0.01951413 0.02927204 0.03890347 0.03026288

In [20]:
print("Accuracy:", accuracy_score(y_test, predictions))
print("Classification Report:\n", classification_report(y_test, predictions))

Accuracy: 0.10103092783505155
Classification Report:
               precision    recall  f1-score   support

           1       0.12      0.62      0.20        61
           2       0.00      0.00      0.00        31
           3       0.00      0.00      0.00        35
           4       0.00      0.00      0.00        49
           5       0.00      0.00      0.00        44
           6       0.18      0.05      0.08        41
           7       0.07      0.18      0.10        49
           8       0.00      0.00      0.00        45
           9       0.00      0.00      0.00        32
          10       0.00      0.00      0.00         8
          11       0.00      0.00      0.00        10
          12       0.00      0.00      0.00         7
          13       0.00      0.00      0.00         5
          14       0.00      0.00      0.00        11
          15       0.00      0.00      0.00        14
          16       0.00      0.00      0.00         8
          17       0.00    



In [22]:
# Assuming 'predicted_aggregate_demand' is the predicted aggregate demand for a specific time period (e.g., 100 units)
sku_level_demand_predictions = (probability_estimates.T * predicted_aggregate_demand).T

In [23]:
print(f"SKU-level Demand Predictions: {sku_level_demand_predictions}")

SKU-level Demand Predictions: [[11.495762    9.33605126  8.63903332 ...  0.04064825  0.33269692
   0.09435023]
 [10.77774908  7.95456608  7.14920674 ...  0.2876486   0.64983024
   0.159064  ]
 [ 9.87752867  6.08838288  5.76341833 ...  2.1790983   0.95330227
   0.17405501]
 ...
 [10.90146924  5.40820168  6.30466636 ...  3.03743574  0.53458311
   0.06708047]
 [10.57353817  8.08442003  7.02388331 ...  0.27168022  0.71176047
   0.18518477]
 [ 8.14964165  7.03509929  4.82115141 ...  1.25158921  2.24865501
   0.73462937]]
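
The cells above fit the classifier with `Items` as the target, so the 23 probabilities per instance are probabilities of order quantities (1 to 23), not of products. A choice model in the sense of the brief would instead predict which product is chosen. The following is a minimal sketch of that idea on synthetic data, assuming `Product Code` as the target and one-hot season indicators as the only features; the frame `toy` and its random contents are made up for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the order lines (assumption: not the real data).
rng = np.random.default_rng(0)
n_rows = 600
toy = pd.DataFrame({
    "Season": rng.choice(["Winter", "Spring", "Summer", "Fall"], size=n_rows),
    "Product Code": rng.choice(["PK001", "PK002", "PK003", "PK004"], size=n_rows),
})

# Features: season indicators. Target: the product that was chosen.
X = pd.get_dummies(toy["Season"], dtype=int)
y = toy["Product Code"]

choice_model = LogisticRegression(max_iter=1000)
choice_model.fit(X, y)

# One row per season (identity matrix over the indicator columns)
# gives the estimated choice probabilities for that season.
season_rows = pd.DataFrame(np.eye(len(X.columns), dtype=int), columns=X.columns)
choice_probs = pd.DataFrame(
    choice_model.predict_proba(season_rows),
    index=X.columns,
    columns=choice_model.classes_,
)

# Split a predicted aggregate demand of 100 units across products, per season.
predicted_aggregate_demand = 100
sku_forecast = choice_probs * predicted_aggregate_demand
print(sku_forecast.round(1))
```

If the rows of `choice_probs` differ noticeably between seasons on the real data, that would suggest the choice probabilities are affected by seasonality. Order quantities could be taken into account by passing `Items` as `sample_weight` to `fit`.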


### predicting demand but doesn't fully make sense

In [24]:
# 'Month' is the order month created above
# Create a feature for the next month (December wraps around to January)
df['Next_Month'] = df['Month'] % 12 + 1
df['Next_Month_Season'] = df['Next_Month'].apply(lambda x: 'Spring' if x in [3, 4, 5] else ('Summer' if x in [6, 7, 8] else ('Fall' if x in [9, 10, 11] else 'Winter')))

# Candidate features for the model. The dataset has no 'Price' column, so 'Value' stands in for price here.
# 'Next_Month_Season' is still a string and would need encoding before fitting.
features = ['Season_encoded', 'Value', 'Next_Month_Season']


In [26]:
# 'Order Date' and 'Requested Delivery Date' were converted to datetime columns above
# Flag each order line as advance (True: lead time over 30 days) or urgent (False)
df['Advance_Demand'] = (df['Requested Delivery Date'] - df['Order Date']).dt.days > 30  # Adjust the threshold as needed

# Group demand by delivery month and the binary variable.
# 'Month' holds the order month, so the delivery month is derived from 'Requested Delivery Date'.
df['Delivery_Month'] = df['Requested Delivery Date'].dt.to_period('M')
grouped_demand = df.groupby(['Delivery_Month', 'Advance_Demand'])['Items'].sum().reset_index()
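
One way the grouped demand could feed the second task (estimating demand from advance orders) is to put advance and urgent quantities side by side for each delivery month and regress total demand on the advance part. The sketch below does this on synthetic transactions; the frame `toy`, the column `Order_Kind`, and the 30-day threshold carried over from the cell above are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic transactions (assumption: not the real data).
rng = np.random.default_rng(1)
n_rows = 500
order_dates = pd.Timestamp("2009-01-01") + pd.to_timedelta(rng.integers(0, 700, size=n_rows), unit="D")
lead_days = rng.integers(1, 90, size=n_rows)
toy = pd.DataFrame({
    "Order Date": order_dates,
    "Requested Delivery Date": order_dates + pd.to_timedelta(lead_days, unit="D"),
    "Items": rng.integers(1, 20, size=n_rows),
})

# Label each line as advance (lead time over 30 days) or urgent.
lead_time = (toy["Requested Delivery Date"] - toy["Order Date"]).dt.days
toy["Order_Kind"] = np.where(lead_time > 30, "advance_items", "urgent_items")
toy["Delivery_Month"] = toy["Requested Delivery Date"].dt.to_period("M")

# One row per delivery month, one column per order kind.
monthly = toy.pivot_table(
    index="Delivery_Month", columns="Order_Kind", values="Items",
    aggfunc="sum", fill_value=0,
)
monthly["total_items"] = monthly["advance_items"] + monthly["urgent_items"]

# Total monthly demand as a linear function of the advance orders already on the books.
advance_model = LinearRegression().fit(monthly[["advance_items"]], monthly["total_items"])
print("slope:", advance_model.coef_[0], "intercept:", advance_model.intercept_)
```

A slope near 1 with a positive intercept would mean urgent demand is roughly constant on top of the advance orders; a slope above 1 would mean urgent demand scales with advance demand.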


In [28]:
# Assuming 'predicted_aggregate_demand' is your predicted aggregate demand
df['Dynamic_Price'] = df['Value'] * (1 + (df['Items'] / predicted_aggregate_demand - 1) * 0.1)  # Adjust the multiplier as needed


### LR model to estimate demand depending on the advance orders

In [30]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder

# Assuming 'df' is your DataFrame
# Convert 'Advance_Demand' to binary encoding
le = LabelEncoder()
df['Advance_Demand_Encoded'] = le.fit_transform(df['Advance_Demand'])

# Features for the model
features = ['Season_encoded', 'Value', 'Advance_Demand_Encoded']

# Target variable
target = 'Items'

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.2, random_state=42)

# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions on the test set
predictions = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("Mean Squared Error:", mse)
print("R-squared:", r2)


Mean Squared Error: 23.236772195627246
R-squared: 0.05441585178014541
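
To put these two numbers in context: R-squared is defined as one minus the ratio of the model's MSE to the MSE of always predicting the mean of `y_test`. The short calculation below recovers that baseline MSE from the two reported values. It is only arithmetic on the printed output, not a re-run of the model.

```python
# Metrics copied from the linear regression output above.
mse = 23.236772195627246
r2 = 0.05441585178014541

# R^2 = 1 - MSE_model / MSE_baseline, where the baseline always predicts mean(y_test).
# Rearranging gives the baseline MSE, i.e. the variance of Items in the test set.
baseline_mse = mse / (1 - r2)

# Fraction of the baseline error removed by the model (equals R^2 by construction).
improvement = 1 - mse / baseline_mse

print(f"Baseline MSE (predict the mean): {baseline_mse:.3f}")
print(f"Model MSE:                       {mse:.3f}")
print(f"Relative improvement:            {improvement:.1%}")
```

In other words, the three features reduce the squared error by only about 5% relative to predicting the average order quantity for every line.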


### Decision Tree

In [31]:
from sklearn.tree import DecisionTreeRegressor

# Create and train the decision tree model
decision_tree_model = DecisionTreeRegressor()
decision_tree_model.fit(X_train, y_train)

# Predictions on the test set
dt_predictions = decision_tree_model.predict(X_test)


In [32]:
from sklearn.metrics import mean_squared_error, r2_score

# Evaluate decision tree model
dt_mse = mean_squared_error(y_test, dt_predictions)
dt_r2 = r2_score(y_test, dt_predictions)

print("Decision Tree Model:")
print("Mean Squared Error:", dt_mse)
print("R-squared:", dt_r2)


Decision Tree Model:
Mean Squared Error: 34.202577319587625
R-squared: -0.3918204589427474


### Random forest

In [33]:
from sklearn.ensemble import RandomForestRegressor

# Create and train the random forest model
random_forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
random_forest_model.fit(X_train, y_train)

# Predictions on the test set
rf_predictions = random_forest_model.predict(X_test)


In [34]:
# Evaluate random forest model
rf_mse = mean_squared_error(y_test, rf_predictions)
rf_r2 = r2_score(y_test, rf_predictions)

print("Random Forest Model:")
print("Mean Squared Error:", rf_mse)
print("R-squared:", rf_r2)


Random Forest Model:
Mean Squared Error: 22.83060547626528
R-squared: 0.07094417198442104


### Gradient Boosting (XGBoost)

In [35]:
import xgboost as xgb

# Create and train the XGBoost model
xgb_model = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=100, random_state=42)
xgb_model.fit(X_train, y_train)

# Predictions on the test set
xgb_predictions = xgb_model.predict(X_test)


In [36]:
# Evaluate XGBoost model
xgb_mse = mean_squared_error(y_test, xgb_predictions)
xgb_r2 = r2_score(y_test, xgb_predictions)

print("XGBoost Model:")
print("Mean Squared Error:", xgb_mse)
print("R-squared:", xgb_r2)


XGBoost Model:
Mean Squared Error: 22.223115942470923
R-squared: 0.09566500965194846


### Support Vector Machines

In [37]:
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train the SVM model
svm_model = SVR(kernel='linear')
svm_model.fit(X_train_scaled, y_train)

# Predictions on the test set
svm_predictions = svm_model.predict(X_test_scaled)


In [38]:
# Evaluate SVM model
svm_mse = mean_squared_error(y_test, svm_predictions)
svm_r2 = r2_score(y_test, svm_predictions)

print("SVM Model:")
print("Mean Squared Error:", svm_mse)
print("R-squared:", svm_r2)


SVM Model:
Mean Squared Error: 23.640094772389563
R-squared: 0.03800326951635746
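
### Model comparison

The test-set metrics printed in the sections above can be collected into one table for comparison. This sketch only copies the reported numbers; it does not refit any model, and the RMSE column is simply the square root of each reported MSE.

```python
import pandas as pd

# Test-set metrics copied from the outputs above.
results = pd.DataFrame(
    {
        "MSE": [
            23.236772195627246,
            34.202577319587625,
            22.83060547626528,
            22.223115942470923,
            23.640094772389563,
        ],
        "R2": [
            0.05441585178014541,
            -0.3918204589427474,
            0.07094417198442104,
            0.09566500965194846,
            0.03800326951635746,
        ],
    },
    index=["Linear Regression", "Decision Tree", "Random Forest", "XGBoost", "SVM (linear kernel)"],
)

# RMSE is in the same unit as the target (items per order line).
results["RMSE"] = results["MSE"] ** 0.5

# Lowest error first.
results = results.sort_values("MSE")
print(results.round(3))
```

XGBoost has the lowest error, but every model explains less than 10% of the variance in `Items`, and the default decision tree has a negative R-squared, meaning it does worse than predicting the mean.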
