# Truck Allocation Forecast Model

## Background and Objective
This project builds a model that predicts the number of trucks required per day for shipments in a logistics optimization context. Accurate daily forecasts help streamline the logistics process and improve delivery efficiency.

## Logic and Formulas

### 1. Shipment Units Calculation
The number of shipment units is calculated using the following formulas:

#### 1.1 Single Shipment Units

$$S_{\text{single}} = U_d \times R_{\text{single}}$$

where:

- $U_d$ is the daily shipment volume (in units)
- $R_{\text{single}}$ is the single shipment ratio

#### 1.2 Multi Shipment Units

$$S_{\text{multi}} = \frac{U_d \times R_{\text{multi}}}{U_{\text{multi}}}$$

where:

- $U_{\text{multi}}$ is the average units per multi shipment
- $R_{\text{multi}}$ is the multi shipment ratio
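
As a quick numeric check of formulas 1.1 and 1.2, the sketch below evaluates them for one hypothetical day. The function name and the input values are assumptions chosen for illustration, not measurements.

```python
def shipment_units(u_d, r_single, r_multi, u_multi):
    """Return (S_single, S_multi) for one day, following formulas 1.1 and 1.2."""
    s_single = u_d * r_single            # formula 1.1
    s_multi = (u_d * r_multi) / u_multi  # formula 1.2
    return s_single, s_multi

# Hypothetical inputs (assumed for illustration only)
s_single, s_multi = shipment_units(u_d=200_000, r_single=0.30, r_multi=0.70, u_multi=2.0)
print(round(s_single), round(s_multi))
```

With these inputs, 60,000 units ship singly and the remaining 140,000 units are bundled into 70,000 multi shipments.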

### 2. Email and Box Shipment Ratios
The total shipment units are divided into email and box shipments, calculated based on the following ratios:

#### 2.1 Email Shipments

$$S_M = S_{\text{total}} \times S_{M\text{ratio}}$$

#### 2.2 Box Shipments

$$S_B = S_{\text{total}} \times S_{B\text{ratio}}$$

where:

- $S_M$ is the total email shipments
- $S_B$ is the total box shipments
- $S_{\text{total}} = S_{\text{single}} + S_{\text{multi}}$ is the total shipment units
- $S_{M\text{ratio}}$ is the email shipment ratio
- $S_{B\text{ratio}}$ is the box shipment ratio

### 3. Truck Allocation Calculation
Email and box shipments are each divided by their respective capacity, the two results are summed, and the sum is divided by the cargo capacity per truck and rounded up to a whole number of trucks:

$$\text{Total Trucks} = \left\lceil \frac{\frac{S_M}{S_{M\text{capacity}}} + \frac{S_B}{S_{B\text{capacity}}}}{\text{cargo\_per\_truck}} \right\rceil$$

where:

- $S_{M\text{capacity}} = 400$ is the email shipment capacity
- $S_{B\text{capacity}} = 75$ is the box shipment capacity
- $\text{cargo\_per\_truck} = 22$ is the cargo capacity per truck
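
The split from section 2 and the truck formula above can be chained into one small function. This is a minimal sketch: the function name, the 40% / 60% split, and the shipment total are assumptions for illustration, while the three capacity defaults are the constants listed above.

```python
import math

def trucks_needed(s_total, s_m_ratio, s_b_ratio,
                  s_m_capacity=400, s_b_capacity=75, cargo_per_truck=22):
    """Split total shipments by the two ratios, then apply the truck formula."""
    s_m = s_total * s_m_ratio
    s_b = s_total * s_b_ratio
    cargo = s_m / s_m_capacity + s_b / s_b_capacity
    return math.ceil(cargo / cargo_per_truck)  # round up to whole trucks

# Hypothetical day: 130,000 shipments split 40% / 60% (assumed values)
total_trucks = trucks_needed(130_000, 0.4, 0.6)
print(total_trucks)  # 54
```

Here the numerator is 130 + 1,040 = 1,170, and 1,170 / 22 is about 53.2, so the ceiling gives 54 trucks.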

### 4. Moving Average
A 7-day moving average of the truck numbers is added as a feature, with the aim of improving the model's accuracy. It is calculated as follows (for the first six days, the notebook averages over the days available so far):

$$\text{Moving Average of Trucks} = \frac{1}{7}\sum_{i=t-6}^t T_i$$

where:

- $T_i$ is the truck number for day $i$
- $t$ is the current day
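
In pandas, this average can be computed with a rolling window. The sketch below uses a short series of made-up truck counts (an assumption for illustration) and contrasts a strict 7-day window with the `min_periods=1` setting.

```python
import pandas as pd

# Hypothetical daily truck counts (assumed values)
trucks = pd.Series([50, 52, 54, 53, 51, 55, 56, 58])

# Strict 7-day window: undefined (NaN) until seven observations exist
ma_strict = trucks.rolling(window=7).mean()

# min_periods=1: average over however many days are available so far
ma_partial = trucks.rolling(window=7, min_periods=1).mean()

print(ma_strict.iloc[6], ma_partial.iloc[1])
```

The first full window (days 0 to 6) sums to 371, so its average is 53.0. With `min_periods=1`, day 1 already has a value: the mean of the first two days, 51.0.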

## Models Used

### 1. Linear Regression
Linear regression assumes a linear relationship between the explanatory variables and the target variable. While simple and interpretable, linear regression may struggle to capture complex nonlinear relationships in the data.

$$y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_nX_n$$

where:

- $y$ is the target variable (truck number)
- $X_1, X_2, \ldots, X_n$ are the explanatory variables (daily shipment volume, shipment ratios, moving average, etc.)
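
To make the equation concrete, the following sketch fits scikit-learn's `LinearRegression` to a noise-free target built from known coefficients and checks that they are recovered. The data and the coefficients (2, 3, -1) are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 2))       # two hypothetical explanatory variables
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]   # beta_0 = 2, beta_1 = 3, beta_2 = -1 (assumed)

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)      # close to 2.0 and [3.0, -1.0]
```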

### 2. Random Forest
Random Forest is an ensemble learning method that builds multiple decision trees and aggregates their predictions. It is capable of capturing complex nonlinear relationships between the features and the target variable.

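In the run recorded in this notebook, the grid search selected the following hyperparameters:

- $\text{max\_depth} = \text{None}$
- $\text{max\_features} = \text{None}$
- $\text{min\_samples\_split} = 2$
- $\text{min\_samples\_leaf} = 1$
- $\text{n\_estimators} = 200$
- $\text{bootstrap} = \text{True}$, $\text{oob\_score} = \text{True}$

These parameters were selected using GridSearchCV with 5-fold cross-validation.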

## Results
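Results from the run recorded in this notebook (the sample data are generated without a fixed random seed, so the values change between runs):

Linear Regression MAE: 0.35

Random Forest MAE (after hyperparameter tuning): 0.85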

## Next Steps

### 1. Model Improvement
Improve the Random Forest model by refining the data preprocessing steps, or explore other models (e.g., Gradient Boosting) to improve accuracy.

### 2. Data Collection
Use real-world data to assess the model's accuracy. Incorporating external data, such as weather forecasts or public holiday schedules, may improve prediction accuracy.

### 3. Model Evaluation
Evaluate the model using additional metrics such as RMSE and $R^2$ to gain a more comprehensive understanding of its performance.
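
A minimal sketch of how these metrics could be computed with scikit-learn is shown below. The true and predicted truck counts are made-up values for illustration. RMSE is taken as the square root of the mean squared error so that the snippet does not depend on a particular scikit-learn version.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted truck counts (assumed values)
y_true = np.array([50, 52, 54, 56])
y_pred = np.array([51, 52, 53, 58])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.2f}")  # MAE=1.00 RMSE=1.22 R2=0.70
```

RMSE penalizes the single 2-truck miss more heavily than MAE does, which is why it comes out larger here.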

## Explanation for the Models and Methods

### Linear Regression
Linear regression assumes a simple relationship between input features and the target variable. However, in real-world problems, relationships are often more complex and nonlinear. Linear regression provides a baseline to compare more complex models like Random Forest.

### Random Forest
Random Forest is an ensemble technique that helps reduce overfitting compared to a single decision tree by averaging the predictions of multiple trees. This makes it more robust and suitable for complex datasets with nonlinear relationships. The model's performance is further enhanced by hyperparameter tuning, which helps find the best settings for the trees.

### Moving Average
The moving average of the truck numbers over the past 7 days is used to capture the recent level and trend of demand while smoothing out day-to-day fluctuations. This feature can help improve prediction accuracy by providing the model with information on recent trends.
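
Note that both the formula in section 4 and the notebook's rolling call include day $t$'s own truck count, which is the value being predicted. If the feature should rely only on information available before day $t$, one option is to shift the series by one day before averaging. The sketch below is a suggested variant with made-up values, not what the notebook does.

```python
import pandas as pd

# Hypothetical daily truck counts (assumed values)
trucks = pd.Series([50, 52, 54, 53, 51, 55, 56, 58], name="Total_Trucks")

# shift(1) moves every value down one day, so the window for day t
# covers days t-7 .. t-1 and never sees day t itself
lagged_ma = trucks.shift(1).rolling(window=7, min_periods=1).mean()
print(lagged_ma.tolist())
```

The first day has no history and stays NaN, so it would need to be dropped or filled before training.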

## Why We Chose These Approaches
The linear regression model was chosen to provide a baseline comparison. Random Forest, with its ability to capture complex patterns in data, was selected as a more powerful model. Hyperparameter tuning was essential for optimizing the Random Forest model's performance. 

```python
import math
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import PolynomialFeatures

# Create sample data.
# Note: no random seed is set, so the data and all metrics change on every run.
data = {
    "Date": pd.date_range(start="2024-04-01", periods=30, freq='D'),
    "U_d": np.random.randint(180000, 220000, size=30),
    "R_single": np.random.uniform(0.25, 0.35, size=30),
    "R_multi": np.random.uniform(0.65, 0.75, size=30),
    "U_multi": np.random.uniform(2.0, 2.2, size=30),
    "S_M_ratio": np.full(30, 0.4),
    "S_B_ratio": np.full(30, 0.6),
    "Weather": np.random.choice(["Sunny", "Rainy", "Cloudy"], size=30),
    "Holiday": np.random.choice([0, 1], size=30),
    "Sale_Flag": np.random.choice([0, 1], size=30)
}

df = pd.DataFrame(data)

# Encode categorical variables
df = pd.get_dummies(df, columns=["Weather"], drop_first=True)

# Calculate target variable (number of trucks needed)
def calculate_trucks(row, S_M_capacity=400, S_B_capacity=75, cargo_per_truck=22):
    S_single = row["U_d"] * row["R_single"]
    S_multi = (row["U_d"] * row["R_multi"]) / row["U_multi"]
    S_total = S_single + S_multi
    S_M = S_total * row["S_M_ratio"]
    S_B = S_total * row["S_B_ratio"]
    C_total = (S_M / S_M_capacity) + (S_B / S_B_capacity)
    return math.ceil(C_total / cargo_per_truck)

df["Total_Trucks"] = df.apply(calculate_trucks, axis=1)

# Feature Engineering
# Add interaction terms first
df['volume_ratio'] = df['U_d'] * df['R_single']
df['multi_volume'] = df['U_d'] * df['R_multi']

# Add polynomial features.
# Note: the degree-2 expansion also returns the degree-1 columns, so after the
# concat below U_d, R_single, etc. appear twice in df.
poly_cols = ['U_d', 'R_single', 'R_multi', 'U_multi', 'volume_ratio', 'multi_volume']
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[poly_cols])
poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(poly_cols))
df = pd.concat([df, poly_df], axis=1)

# Add moving average and rolling statistics.
# Note: each window includes the current day's Total_Trucks, i.e. the target itself.
df["Moving_Avg_Trucks"] = df["Total_Trucks"].rolling(window=7, min_periods=1).mean()
df['rolling_std'] = df['Total_Trucks'].rolling(window=7, min_periods=1).std()
df['rolling_max'] = df['Total_Trucks'].rolling(window=7, min_periods=1).max()
df['rolling_min'] = df['Total_Trucks'].rolling(window=7, min_periods=1).min()

# With min_periods=1 only rolling_std can be NaN (first day, single observation)
df['rolling_std'] = df['rolling_std'].fillna(0)

# Split features and target variable
X = df.drop(columns=["Date", "Total_Trucks"])
y = df["Total_Trucks"]

# Split into training and test data (30 rows: 24 for training, 6 for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and predict with linear regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)
lr_mae = mean_absolute_error(y_test, y_pred_lr)

# Expanded hyperparameter search for Random Forest.
# Note: oob_score=True is invalid with bootstrap=False; those candidates fail
# to fit and are scored NaN by GridSearchCV.
rf_param_grid = {
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [None, 15, 20, 25],
    'min_samples_split': [2, 5, 8],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False],
    'oob_score': [True, False]
}

# Use regular cross-validation
rf_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    rf_param_grid,
    cv=5,  # Using 5-fold cross-validation
    scoring='neg_mean_absolute_error',
    n_jobs=-1,
    verbose=1
)
rf_search.fit(X_train, y_train)

# Train and predict with best model
best_rf_model = rf_search.best_estimator_
y_pred_rf = best_rf_model.predict(X_test)
rf_mae = mean_absolute_error(y_test, y_pred_rf)

# Feature importance analysis
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': best_rf_model.feature_importances_
})
feature_importance = feature_importance.sort_values('importance', ascending=False)

# Display results
print(f"Linear Regression MAE: {lr_mae}")
print(f"Random Forest MAE (Tuned): {rf_mae}")
print(f"Best Random Forest Parameters: {rf_search.best_params_}")
print("\nTop 10 most important features:")
print(feature_importance.head(10))
```

```
Fitting 5 folds for each of 1152 candidates, totalling 5760 fits

1440 fits failed out of a total of 5760.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1440 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\ensemble\_forest.py", line 448, in fit
    raise ValueError("Out of bag e

Linear Regression MAE: 0.35156280589157757
Random Forest MAE (Tuned): 0.8500000000000014
Best Random Forest Parameters: {'bootstrap': True, 'max_depth': None, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200, 'oob_score': True}

Top 10 most important features:
                      feature  importance
37  volume_ratio multi_volume    0.259284
22           U_d volume_ratio    0.096715
28      R_single multi_volume    0.060593
23           U_d multi_volume    0.060411
19               U_d R_single    0.056122
12                        U_d    0.048254
31       R_multi volume_ratio    0.046713
18                      U_d^2    0.045459
0                         U_d    0.036355
34       U_multi volume_ratio    0.029934
```
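
The 1,440 failed fits are exactly one quarter of the 5,760 total, which matches the share of candidates that combine `bootstrap=False` with `oob_score=True`. scikit-learn rejects that pairing because out-of-bag scoring needs bootstrap samples. One way to avoid the wasted fits is to pass `GridSearchCV` a list of sub-grids so that the invalid pairing is never generated. The sketch below uses a reduced, hypothetical grid and `ParameterGrid` to enumerate the candidates without training anything.

```python
from sklearn.model_selection import ParameterGrid

# oob_score=True is only valid together with bootstrap=True, so the two
# bootstrap settings get separate sub-grids (reduced, hypothetical values)
rf_param_grid = [
    {"bootstrap": [True], "oob_score": [True, False], "n_estimators": [100, 200]},
    {"bootstrap": [False], "oob_score": [False], "n_estimators": [100, 200]},
]

combos = list(ParameterGrid(rf_param_grid))
print(len(combos))  # 6
```

The same list can be passed as the `param_grid` argument of `GridSearchCV`. Since `oob_score` only controls whether an out-of-bag estimate is computed after fitting, dropping it from the grid altogether is another reasonable option.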


## Model Building and Evaluation with Random Data: Analysis and Insights
### 1. Introduction
In this project, I built machine learning models to predict daily truck requirements from a synthetic dataset whose inputs are drawn at random. The randomness stands in for real-world uncertainty, so I focused on techniques that are robust to noise and can generalize in unpredictable scenarios. The aim is to illustrate how uncertain and random data can be handled in practice.

### 2. Understanding Randomness in Data
Real-world datasets often contain random or noisy elements that can significantly affect the performance of predictive models. The presence of randomness means that certain patterns might not always hold true across the entire dataset, which can lead to variability in predictions. This type of dataset is common in business contexts, where customer behavior, demand fluctuations, or stock prices are influenced by many random factors.

### 3. Feature Engineering
Feature engineering is essential in this scenario because the goal is to capture as much useful information as possible while minimizing the impact of noise. I used the following techniques:

- Polynomial Features: To capture non-linear relationships between features, I applied degree-2 polynomial transformations, which add squared terms and pairwise products of the inputs.
- Interaction Terms: By creating interaction features like volume_ratio and multi_volume, I included combinations of variables that might reflect the underlying patterns of the data.

These features were added to the dataset to provide more predictive power, taking into account both linear and non-linear relationships.

### 4. Model Selection and Evaluation
Given the random nature of the data, it was important to use models that can handle uncertainty and noise effectively. I tested two popular models:

- Linear Regression:
Linear regression is a simple model that assumes a linear relationship between the features and the target. Despite its simplicity, it provided a baseline for model performance and allowed me to understand how well a straightforward approach could work on random data. The Mean Absolute Error (MAE) for linear regression was 0.35, indicating reasonable predictive accuracy on the data with randomness.

- Random Forest (Tuned):
The random forest model, an ensemble learning technique, aggregates predictions from multiple decision trees, which usually makes it resilient to noise. I tuned hyperparameters such as the number of estimators and the maximum depth of the trees with a grid search. The MAE for the tuned random forest was 0.85, higher than the linear regression model's 0.35. Because the target is computed directly from the input features by the truck formula, the gap likely has less to do with randomness than with the small sample (30 days, 24 of them used for training): tree ensembles approximate a smooth function with piecewise-constant steps and typically need more data to do so than a linear model with polynomial terms.

### 5. Randomness and Overfitting
One of the key challenges when dealing with random data is overfitting. Overfitting occurs when the model captures not only the true underlying patterns but also the noise in the data. This results in poor generalization to new, unseen data.

To limit overfitting:

- Cross-validation: 5-fold cross-validation was used during the Random Forest grid search, so hyperparameters were selected on held-out folds rather than on the training fit itself.
- Model complexity: the grid search constrained tree complexity through max_depth, min_samples_split, and min_samples_leaf. No explicit regularization (such as Ridge or Lasso) was applied to the linear regression model.
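
The notebook code fits a plain LinearRegression and cross-validates only the Random Forest. As a sketch of what a regularized linear model evaluated with cross-validation might look like, the example below uses Ridge inside a pipeline. The synthetic data, the choice of Ridge, and `alpha=1.0` are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(30, 5))                  # hypothetical features
y = 50 + 10 * X[:, 0] + rng.normal(0, 1, size=30)    # hypothetical noisy target

# Scaling first keeps the L2 penalty comparable across features
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))  # alpha is an assumed value
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
cv_mae = -scores.mean()
print(f"5-fold CV MAE: {cv_mae:.2f}")
```

Reporting the cross-validated MAE for both models would make the comparison less sensitive to a single 6-row test split.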

### 6. Model Interpretation and Feature Importance
Feature importance analysis helps identify what drives the model's predictions. In the random forest model, the ten most important features were dominated by interaction terms such as volume_ratio multi_volume (importance 0.26), with the individual feature U_d also appearing. U_d is listed twice because the polynomial expansion re-adds the original columns, which splits its importance across two identical columns. The ranking shows which features the model found most informative, even in the face of randomness.

### 7. Insights and Takeaways
- Handling Randomness: Random data requires special attention, especially regarding feature selection and model complexity. On this dataset the simple linear model achieved the lower error (MAE 0.35, versus 0.85 for the random forest).
- Bias-Variance Tradeoff: The random forest's higher error is consistent with the bias-variance tradeoff. A more flexible model has higher variance and is more prone to overfitting when training data are scarce or noisy.
- Model Robustness: Models like random forest are generally robust to noise, but with only 30 samples that advantage did not materialize here.

### 8. Conclusion
This project showcased the importance of understanding and addressing randomness in real-world data. By employing techniques like feature engineering, cross-validation, and careful model selection, I was able to build models that not only accounted for randomness but also performed reasonably well despite the challenges. This approach is essential for any data scientist, as real-world data is rarely perfect, and models must be able to adapt to uncertainty and noise.

