<a href="https://colab.research.google.com/github/parhamalikhan/playground-series-s5e1/blob/main/Untitled1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is part of the Kaggle Playground Series 2025 challenge, where the task is to predict the number of stickers sold for different products in various stores and countries.

Approach:

Data Collection: The first step is to load and examine the provided datasets (train.csv, test.csv, and sample_submission.csv) which contain historical sales data, product details, and sample submission format.

Data Cleaning: Handle missing values (here, rows without a num_sold value are dropped) so that the dataset is ready for analysis and model building.

Feature Engineering: Next, we'll create new features from the existing ones to make the dataset more informative for the machine learning model. This might include features like day of the week, holidays, or seasonality effects that might influence sales.

Model Building: Using a machine learning algorithm (this notebook uses LightGBM's LGBMRegressor, a gradient-boosted tree ensemble), we build a model to predict num_sold, the number of items sold.

Model Evaluation: The model's performance will be evaluated using the Mean Absolute Percentage Error (MAPE), a common metric for regression tasks, to understand how accurate the model's predictions are.

Prediction: Finally, the model will be used to generate predictions for the test data, which will then be formatted and submitted as per the required submission structure.

Goal of this Notebook: This notebook walks through loading the data, cleaning it, engineering features, training models, and making predictions. The aim is a model that performs well in the Kaggle competition and forecasts sticker sales effectively.



In [14]:
# 1. Importing the necessary library for connecting to Google Drive
from google.colab import drive

# 2. Mounting Google Drive to access files stored in it
# The following line of code will prompt a link to authenticate and mount Google Drive
drive.mount('/content/drive')  # This will mount Google Drive at the path '/content/drive'

# After this step, you will see a message 'Mounted at /content/drive' confirming successful connection.


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [15]:
# Define the directory path where the datasets are stored on Google Drive
data_dir = '/content/drive/MyDrive/playground-series-s5e1'


In [16]:
import pandas as pd

# Define the path to your data
data_dir = '/content/drive/MyDrive/playground-series-s5e1/'

# Load the datasets
train = pd.read_csv(data_dir + 'train.csv')
test = pd.read_csv(data_dir + 'test.csv')
sample_submission = pd.read_csv(data_dir + 'sample_submission.csv')

# Check the first few rows of the train data
train.head()



index,id,date,country,store,product,num_sold
0,0,2010-01-01,Canada,Discount Stickers,Holographic Goose,
1,1,2010-01-01,Canada,Discount Stickers,Kaggle,973.0
2,2,2010-01-01,Canada,Discount Stickers,Kaggle Tiers,906.0
3,3,2010-01-01,Canada,Discount Stickers,Kerneler,423.0
4,4,2010-01-01,Canada,Discount Stickers,Kerneler Dark Mode,491.0


In [17]:
# Drop rows with NaN values in 'num_sold' column
train = train.dropna(subset=['num_sold'])

# Check for missing values across the entire dataset to ensure it's clean
train.isnull().sum()


column,missing
id,0
date,0
country,0
store,0
product,0
num_sold,0


In [18]:
# Convert the 'date' column to datetime format
train['date'] = pd.to_datetime(train['date'])

# Extract year, month, and day_of_week from the 'date' column
train['year'] = train['date'].dt.year
train['month'] = train['date'].dt.month
train['day_of_week'] = train['date'].dt.dayofweek

# Drop the 'date' column as we now have separate time-related features
train = train.drop(columns=['date'])

# Check the first few rows of the data after feature engineering
train.head()



index,id,country,store,product,num_sold,year,month,day_of_week
1,1,Canada,Discount Stickers,Kaggle,973.0,2010,1,4
2,2,Canada,Discount Stickers,Kaggle Tiers,906.0,2010,1,4
3,3,Canada,Discount Stickers,Kerneler,423.0,2010,1,4
4,4,Canada,Discount Stickers,Kerneler Dark Mode,491.0,2010,1,4
5,5,Canada,Stickers for Less,Holographic Goose,300.0,2010,1,4


In [19]:
# Check the columns in the test dataset
print(test.columns)


Index(['id', 'date', 'country', 'store', 'product'], dtype='object')


In [20]:
# Define features (X) and target (y)
X = train.drop(columns=['num_sold'])  # All columns except the target
y = train['num_sold']  # Target column

# Split data into training and validation sets
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Now X_train is defined, and we can use it for reindexing the test set


Defining Features (X) and Target (y):

X = train.drop(columns=['num_sold']): In this line, I define X, which contains all the features from the dataset, excluding the target variable num_sold. The drop(columns=['num_sold']) ensures that only the columns relevant to the features are included.

y = train['num_sold']: Here, y represents the target variable, which is the num_sold column. This is the value we aim to predict.

Splitting the Data into Training and Validation Sets:

from sklearn.model_selection import train_test_split: This imports the train_test_split function from scikit-learn, which is used to split the dataset into training and validation sets.

train_test_split(X, y, test_size=0.2, random_state=42): This splits the data into training and validation sets.

X and y are the features and target, respectively.

test_size=0.2: This specifies that 20% of the data will be used for the validation set, and the remaining 80% will be used for training.

random_state=42: This ensures that the split is reproducible, meaning you'll get the same split every time you run the code.

Reindexing the Test Set:

After the split, X_train and y_train are used to fit the model, while X_valid and y_valid are held out to evaluate it on data it has not seen during training.

The closing comment in the cell points ahead to a later step: the columns of X_train define the feature set that the test set has to match before the model can predict on it.
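A minimal sketch of what random_state provides, using a small made-up frame. The toy data and the names `toy`, `split_a`, and `split_b` are illustrative only and not part of the competition data. Two calls with the same seed pick the same validation rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the training frame (hypothetical values, for illustration only)
toy = pd.DataFrame({"month": range(1, 11), "num_sold": range(100, 110)})
X_toy = toy.drop(columns=["num_sold"])
y_toy = toy["num_sold"]

# Two independent calls with the same seed
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# The returned tuple is (X_train, X_valid, y_train, y_valid)
same_rows = split_a[1].index.equals(split_b[1].index)
print(len(split_a[0]), len(split_a[1]), same_rows)
```

With 10 rows and test_size=0.2, 8 rows go to training and 2 to validation, and both calls select identical rows.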

In [21]:
# Ensure the feature set X_train is properly defined (this should have been done earlier)
print(X_train.columns)


Index(['id', 'country', 'store', 'product', 'year', 'month', 'day_of_week'], dtype='object')


In [22]:
# Check the columns in the test dataset
print(test.columns)


Index(['id', 'date', 'country', 'store', 'product'], dtype='object')


In [23]:
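# Give the test set the same columns as the training set.
# Caution: reindex does not compute anything. It drops 'date' and creates
# 'num_sold', 'year', 'month' and 'day_of_week' filled with 0, as the output below shows.
test = test.reindex(columns=train.columns, fill_value=0)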

# Check the first few rows of the test data after preprocessing
print(test.head())


       id country              store             product  num_sold  year  \
0  230130  Canada  Discount Stickers   Holographic Goose         0     0   
1  230131  Canada  Discount Stickers              Kaggle         0     0   
2  230132  Canada  Discount Stickers        Kaggle Tiers         0     0   
3  230133  Canada  Discount Stickers            Kerneler         0     0   
4  230134  Canada  Discount Stickers  Kerneler Dark Mode         0     0   

   month  day_of_week  
0      0            0  
1      0            0  
2      0            0  
3      0            0  
4      0            0  
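The output above shows year, month, and day_of_week as 0 for every test row, because reindex only creates the missing columns and does not derive them from date. One possible way to avoid this, sketched under assumptions, is to derive the date features with a single helper and apply it to both train and test. The helper name `add_date_features` and the two demo rows are hypothetical; the dates in particular are placeholders, not values from test.csv.

```python
import pandas as pd

def add_date_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with year, month and day_of_week derived from 'date'."""
    out = df.copy()
    dates = pd.to_datetime(out["date"])
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    out["day_of_week"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    return out.drop(columns=["date"])

# Tiny stand-in for test.csv (same column names; the dates are placeholders)
test_demo = pd.DataFrame({
    "id": [230130, 230131],
    "date": ["2010-01-01", "2010-01-02"],
    "country": ["Canada", "Canada"],
    "store": ["Discount Stickers", "Discount Stickers"],
    "product": ["Holographic Goose", "Kaggle"],
})
test_features = add_date_features(test_demo)
print(test_features)
```

Calling the same function on the raw train and test frames would keep the two feature sets consistent, instead of filling the test columns with a constant.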


In [26]:
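from lightgbm import LGBMRegressor
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Apply label encoding to categorical columns.
# fit_transform refits the encoder on every call, so afterwards label_encoder
# only holds the mapping for 'product'.
train['country'] = label_encoder.fit_transform(train['country'])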
train['store'] = label_encoder.fit_transform(train['store'])
train['product'] = label_encoder.fit_transform(train['product'])

# Separate features (X) and target (y) in the training set
X = train.drop(columns=['num_sold'])
y = train['num_sold']

# Split data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the LGBM model with hyperparameters
lgbm_model = LGBMRegressor(n_estimators=850, max_depth=5, learning_rate=0.09,
                           colsample_bytree=0.41, subsample=0.52, min_child_samples=90)

# Train the model
lgbm_model.fit(X_train, y_train)

# Make predictions on the validation set to evaluate performance
y_pred = lgbm_model.predict(X_valid)

# Calculate the Mean Absolute Percentage Error (MAPE) on the validation set
from sklearn.metrics import mean_absolute_percentage_error
mape = mean_absolute_percentage_error(y_valid, y_pred)
print(f"MAPE on validation set: {mape}")


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.082004 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 297
[LightGBM] [Info] Number of data points in the train set: 177007, number of used features: 7
[LightGBM] [Info] Start training from score 751.724474
MAPE on validation set: 0.556669270584661


In [27]:
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split

# Separate features (X) and target (y) in the training set
X = train.drop(columns=['num_sold'])
y = train['num_sold']

# Split data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the LGBM model with hyperparameters
lgbm_model = LGBMRegressor(n_estimators=850, max_depth=5, learning_rate=0.09,
                           colsample_bytree=0.41, subsample=0.52, min_child_samples=90)

# Train the model
lgbm_model.fit(X_train, y_train)

# Make predictions on the validation set to evaluate performance
y_pred = lgbm_model.predict(X_valid)

# Calculate the Mean Absolute Percentage Error (MAPE) on the validation set
from sklearn.metrics import mean_absolute_percentage_error
mape = mean_absolute_percentage_error(y_valid, y_pred)
print(f"MAPE on validation set: {mape}")


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.013998 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 297
[LightGBM] [Info] Number of data points in the train set: 177007, number of used features: 7
[LightGBM] [Info] Start training from score 751.724474
MAPE on validation set: 0.556669270584661


LGBMRegressor Model Initialization:

In this section, the LGBMRegressor model is initialized with specific hyperparameters such as the number of estimators, tree depth, learning rate, and other settings.

n_estimators=850: The number of boosted trees. Each new tree corrects the errors of the trees before it.

max_depth=5: The maximum depth of each tree. Shallower trees are less prone to overfitting.

learning_rate=0.09: Scales how much each tree contributes to the final prediction.

colsample_bytree=0.41: The fraction of features sampled at random for each tree.

subsample=0.52: The fraction of rows sampled at random for each tree. In LightGBM, row subsampling only takes effect when subsample_freq is set to a positive value; with the default of 0, as here, this setting is ignored.

min_child_samples=90: The minimum number of samples a leaf must contain. Splits that would create smaller leaves are not made.

Splitting Data:

train_test_split(X, y, test_size=0.2, random_state=42) splits the dataset into training and validation sets. 80% of the data is used for training, and 20% is used for validation.

random_state=42: Ensures that the data split is reproducible, meaning you'll get the same split each time the code is run.

Training the Model:

lgbm_model.fit(X_train, y_train): The model is trained using the training data (X_train, y_train).

Making Predictions:

y_pred = lgbm_model.predict(X_valid): Predictions are made on the validation data (X_valid).

MAPE Calculation:

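mean_absolute_percentage_error(y_valid, y_pred) calculates the Mean Absolute Percentage Error (MAPE): the average of |actual - predicted| / |actual| over the validation rows. scikit-learn returns it as a fraction rather than a percentage, so the 0.5567 printed above corresponds to an average error of about 55.67%. A lower MAPE indicates better model performance.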
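To make the metric concrete, here is a small hand-rolled version with made-up numbers. The function name `mape` and the sample values are illustrative. It assumes no actual value is zero; scikit-learn's implementation guards against that case with a small epsilon in the denominator.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.10 means 10%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) / np.abs(y_true)))

# Made-up values: the errors are 10%, 10% and 0% of the actual values
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
score = mape(actual, predicted)
print(round(score, 4))
```

The three relative errors average to 0.2 / 3, or roughly 6.7%.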



In [28]:
lgbm_model = LGBMRegressor(
    n_estimators=1000,     # Increase the number of trees
    max_depth=10,          # Increase depth to capture more complex relationships
    learning_rate=0.05,    # Lower learning rate for more stable learning
    colsample_bytree=0.5,  # Fraction of features sampled per tree
    subsample=0.8,         # Row sampling fraction (only active when subsample_freq > 0)
    min_child_samples=50   # Minimum number of samples per leaf
)

lgbm_model.fit(X_train, y_train)


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004798 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 297
[LightGBM] [Info] Number of data points in the train set: 177007, number of used features: 7
[LightGBM] [Info] Start training from score 751.724474


Auto-choosing row-wise multi-threading: LightGBM tested the available layouts and automatically picked row-wise multi-threading to speed up training.

Total Bins 297: The total number of histogram bins across all features. LightGBM buckets each feature's values into bins so that finding splits is faster.

Number of data points in the train set: 177007: The number of rows in the training dataset used to fit the model.

Number of used features: 7: The number of features (columns) used in the model.

Start training from score 751.724474: The initial prediction before any tree is added. For the default regression objective this is the mean of the training target.



In [29]:
# Make predictions on the validation set
y_pred_valid = lgbm_model.predict(X_valid)

# Calculate the Mean Absolute Percentage Error (MAPE) on the validation set
from sklearn.metrics import mean_absolute_percentage_error
mape_valid = mean_absolute_percentage_error(y_valid, y_pred_valid)
print(f"MAPE on the validation set: {mape_valid}")


MAPE on the validation set: 0.29077502015832285


The MAPE value of 0.29077502015832285 means that the model's predictions are, on average, off by approximately 29.08% from the actual values.

A lower MAPE is generally better, and ideally, you'd want this value to be as small as possible for a more accurate model.

Depending on the context, a MAPE of 29% might be considered acceptable or may indicate room for improvement.

In [31]:
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Apply label encoding to categorical columns in the test set
test['country'] = label_encoder.fit_transform(test['country'])
test['store'] = label_encoder.fit_transform(test['store'])
test['product'] = label_encoder.fit_transform(test['product'])

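# Make predictions on the test set.
# Caution: 'id' is dropped here although the model was trained with it, and
# 'num_sold' (all zeros after the reindex) is still present, so the columns
# do not line up with X_train.
test_predictions = lgbm_model.predict(test.drop(columns=['id']))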

# Check the first few predictions
print(test_predictions[:5])



[ 97.19347813 267.71114386 221.71543815 221.71543815 221.71543815]


The model predicted the number of stickers sold for each item in the test set.

These predictions are continuous numerical values representing the estimated sales:

The first prediction is approximately 97.19.

The second prediction is around 267.71.

The third, fourth, and fifth predictions are all about 221.72.

Model Behavior and Insights:
The model predicts exactly 221.72 for several records. This is better explained by the preprocessing than by overfitting. After the reindex step, year, month, and day_of_week are 0 for every test row, so the model has no date information to tell rows apart. In addition, the columns passed to predict (country, store, product, num_sold, year, month, day_of_week) differ from the training columns (id, country, store, product, year, month, day_of_week), so the values are most likely being read under the wrong features.

These predictions should therefore not be treated as real forecasts. Before tuning hyperparameters or adding features, the test set needs the same date features, encodings, and column order as the training set.
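A minimal sketch of encoding the test set with mappings learned from the training set and then matching the training column order. It uses plain dictionaries instead of LabelEncoder so the mapping is visible; fitting a LabelEncoder per column on train and calling transform on test would serve the same purpose. The frames `train_demo` and `test_demo`, the helper `encode`, and the country "Norway" are hypothetical.

```python
import pandas as pd

# Hypothetical miniature frames with two of the notebook's categorical columns
train_demo = pd.DataFrame({
    "country": ["Canada", "Norway", "Canada"],
    "store": ["Discount Stickers", "Stickers for Less", "Discount Stickers"],
})
test_demo = pd.DataFrame({
    "store": ["Stickers for Less", "Discount Stickers"],
    "country": ["Norway", "Canada"],
})

# Learn one category -> integer mapping per column from the training data only
mappings = {
    col: {cat: code for code, cat in enumerate(sorted(train_demo[col].unique()))}
    for col in ["country", "store"]
}

def encode(df, mappings):
    """Apply the learned mappings; categories unseen in training become NaN."""
    out = df.copy()
    for col, mapping in mappings.items():
        out[col] = out[col].map(mapping)
    return out

train_enc = encode(train_demo, mappings)
# Reuse the training mappings, then put the columns in the training order
test_enc = encode(test_demo, mappings)[train_enc.columns]
print(test_enc)
```

Because the mappings come from the training data alone, a given category always receives the same code in both frames, and the final column selection guarantees that each value reaches the feature it was trained under.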

In [32]:
# Check the first few predictions
print(test_predictions[:10])

# Inspect the range of the predictions
print(f"Min: {min(test_predictions)}")
print(f"Max: {max(test_predictions)}")


[ 97.19347813 267.71114386 221.71543815 221.71543815 221.71543815
  79.95729709 225.56743885 181.41079447 181.41079447 181.41079447]
Min: 79.95729709222474
Max: 272.6630431493777


The model predicted the number of stickers sold for each record in the test set. These are continuous numerical predictions.

The predictions vary between 79.96 and 272.66, indicating a spread in the predicted sales values across the test data.

For some records, the model gives the same predicted value (e.g., 221.72 and 181.41). Identical predictions mean those rows look the same to the model. That is expected here, because year, month, and day_of_week were filled with 0 for every test row in the reindex step.

Insights and Model Behavior:
The Min and Max values show the range of the predictions. Every prediction falls between 79.96 and 272.66, well below the training mean of about 751.72 reported in the LightGBM log ("Start training from score"). A range this narrow and this far from the training average suggests that the test features do not match what the model was trained on, rather than showing that the model captures the variance in the data.

If a wider spread is expected, the first step is to give the test set the same date features, encodings, and column order as the training set. Tuning hyperparameters or adding more relevant features comes after that.
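A quick way to quantify the repetition is to count distinct values. The sketch below uses the ten predictions printed above; the variable names are illustrative.

```python
import numpy as np

# The first ten predictions printed above
preds = np.array([97.19347813, 267.71114386, 221.71543815, 221.71543815, 221.71543815,
                  79.95729709, 225.56743885, 181.41079447, 181.41079447, 181.41079447])

# Unique values and how often each one occurs
values, counts = np.unique(preds, return_counts=True)
n_distinct = len(values)
print(f"{n_distinct} distinct values among {len(preds)} predictions")
for v, c in zip(values, counts):
    if c > 1:
        print(f"{v:.2f} appears {c} times")
```

Running the same check on the full test_predictions array would show how many distinct values the model produces for the whole test set.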



In [33]:
import numpy as np  # Normally placed with the other imports at the top of the notebook

# Clip negative predictions to zero
test_predictions_clipped = np.maximum(test_predictions, 0)

# Check the new range of predictions
print(f"Min: {min(test_predictions_clipped)}")
print(f"Max: {max(test_predictions_clipped)}")


Min: 79.95729709222474
Max: 272.6630431493777


Clipping predictions: It is good practice to handle negative predictions when they make no sense for the problem, as with sales quantities. Setting negative values to zero keeps the predictions realistic. In this run the minimum was already 79.96, so clipping changed nothing.
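Since the predictions in this run contain no negatives, the effect of clipping is easier to see on a small made-up array. The values in `raw` are hypothetical.

```python
import numpy as np

# Hypothetical raw predictions, including one negative value
raw = np.array([-3.2, 0.0, 97.19, 272.66])

# Lower bound of 0, no upper bound; equivalent to np.maximum(raw, 0)
clipped = np.clip(raw, 0, None)
print(clipped.min(), clipped.max())
```

Only the negative entry changes; every non-negative prediction is left as it was.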