## 📘 Notebook Overview

This notebook presents a complete predictive modeling analysis aimed at explaining and predicting a product **rating** variable using supervised learning techniques.

This is the second part of the assignment, in which a GLM is added to the predictive modeling. The first part, a linear regression analysis in the notebook `lin_reg_1.ipynb`, serves as a baseline and provides an initial understanding of the relationship between the explanatory variables and the response.

In the second part, a **Generalized Linear Model (GLM)** is estimated to extend the linear regression framework. Given that the response variable is strictly positive and exhibits right-skewness, a **Gamma distribution with a log link function** is employed. The same dataset, preprocessing pipeline, and train/test split are retained to ensure a fair and consistent comparison between models.
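
For reference, the specification described above can be written compactly as follows. The notation ($x_i$ for the feature vector of observation $i$, $\beta$ for the coefficients, $\phi$ for the dispersion) is introduced here for illustration and does not appear in the original notebook.

$$
Y_i \sim \mathrm{Gamma}(\mu_i, \phi), \qquad \log \mu_i = x_i^\top \beta, \qquad \mathrm{Var}(Y_i) = \phi\,\mu_i^2
$$

Under this specification each coefficient acts multiplicatively on the expected rating, and the standard deviation of the response grows in proportion to its mean.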

For both approaches, model interpretation, predictive performance, and uncertainty quantification are discussed; for the GLM, uncertainty is reported as confidence intervals for the expected rating. The notebook concludes with a comparison of the two modeling strategies and a summary of the main findings.


## 📦 Imports


This section includes all the Python libraries required for data manipulation, visualization, modeling, and evaluation throughout the notebook.

The main libraries used are:
- **NumPy** and **Pandas** for numerical computations and data handling.
- **Matplotlib / Seaborn** for exploratory data analysis and visualization.
- **scikit-learn** for preprocessing, train/test splitting, baseline models, and performance metrics.
- **statsmodels** for statistical modeling, in particular for the implementation of the Generalized Linear Model (GLM).

All imports are grouped at the beginning of the notebook to improve readability and reproducibility.

The data is read from the same folder as this notebook. Rows with a missing target value are set aside for later prediction, and the remaining rows are divided into training and test sets.

In [None]:
# Basic imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# statsmodels imports
import statsmodels.api as sm
import statsmodels.formula.api as smf

# sklearn imports
# split
from sklearn.model_selection import train_test_split, KFold

# impute
from sklearn.impute import SimpleImputer

# pipeline and column transformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# feature extraction
from sklearn.preprocessing import TargetEncoder, StandardScaler, OneHotEncoder, FunctionTransformer
from category_encoders import TargetEncoder as SafeTargetEncoder

# metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


In [49]:
df = pd.read_csv('BigBasket Products.csv', sep = ',')

In [50]:
# Separate data with missing ratings for later prediction
test_no_target = df[df['rating'].isnull()].copy()
df = df[df['rating'].notnull()].copy()

print("="*60)
print("DATA SEPARATION")
print("="*60)
print(f"Dataset with rating (df): {df.shape[0]} rows")
print(f"Dataset without rating (test_no_target): {test_no_target.shape[0]} rows")
print(f"Total: {df.shape[0] + test_no_target.shape[0]} rows")


DATA SEPARATION
Dataset with rating (df): 18929 rows
Dataset without rating (test_no_target): 8626 rows
Total: 27555 rows


In [51]:
# Split data into training (70%) and testing (30%)
df_train, df_test = train_test_split(df, test_size=0.3, random_state=42)

print("="*60)
print("TRAIN-TEST SPLIT")
print("="*60)
print(f"Training set: {df_train.shape[0]} rows ({100*df_train.shape[0]/(df_train.shape[0]+df_test.shape[0]):.1f}%)")
print(f"Test set: {df_test.shape[0]} rows ({100*df_test.shape[0]/(df_train.shape[0]+df_test.shape[0]):.1f}%)")
print(f"Total: {df_train.shape[0] + df_test.shape[0]} rows")

TRAIN-TEST SPLIT
Training set: 13250 rows (70.0%)
Test set: 5679 rows (30.0%)
Total: 18929 rows


## 🧹 Preprocessing


Prior to model estimation, the dataset was preprocessed to ensure compatibility with the considered models and to improve predictive performance.

The main preprocessing steps include:
- Defining the target variable (**rating**) and the set of input variables.
- Handling categorical variables through appropriate encoding techniques.
- Applying feature scaling where necessary.
- Splitting the data into **training** and **test** sets to allow for an out-of-sample evaluation of model performance.

This same preprocessing pipeline is also used for the first model.
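
As a rough illustration of the scaling step listed above, the sketch below applies a log transform followed by standardization to a made-up price column, using NumPy only. The price values are invented for illustration and are not taken from the dataset. The point it aims to show is that the mean and standard deviation are learned from the training data and reused unchanged on the test data.

```python
import numpy as np

# Hypothetical prices, invented for illustration (not taken from the dataset)
train_prices = np.array([10.0, 20.0, 50.0, 400.0])
test_prices = np.array([15.0, 1000.0])

# log1p compresses the long right tail of price-like variables
log_train = np.log1p(train_prices)
log_test = np.log1p(test_prices)

# Standardization statistics come from the training set only
# (population standard deviation, ddof=0, as StandardScaler uses)
mean, std = log_train.mean(), log_train.std()

scaled_train = (log_train - mean) / std
scaled_test = (log_test - mean) / std  # reuses the training mean and std

print(scaled_train.round(3))
print(scaled_test.round(3))
```

Test values can fall outside the range seen in training, as the second test price does here, because the scaler is never refitted on test data.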


In [65]:
X_train = df_train.drop('rating', axis=1)
y_train = df_train['rating']
X_test = df_test.drop('rating', axis=1)
y_test = df_test['rating']

In [53]:
# Define steps for numerical features
numeric_transformer = Pipeline(steps=[
    ('log_transform', FunctionTransformer(np.log1p, validate=True)),
    ('scaler', StandardScaler())
])

# Define steps for categorical features (Target Encoding)
categorical_transformer = TargetEncoder(smooth='auto') 

# Define steps for one-hot features
ohe_transformer = OneHotEncoder(handle_unknown='ignore', drop='first')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, ['sale_price', 'market_price']),
        ('target_cat', categorical_transformer, ['brand', 'sub_category']),
        ('ohe_cat', ohe_transformer, ['category']) 
    ],
    remainder='drop'
)

# Now, apply to your data:
X_train_preprocessed = pd.DataFrame(preprocessor.fit_transform(X_train, y_train))
X_test_preprocessed = pd.DataFrame(preprocessor.transform(X_test))
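
To make the target-encoding step more concrete, here is a minimal sketch of smoothed mean encoding in NumPy. It is a simplification under stated assumptions, not scikit-learn's implementation: the function name `smoothed_target_encode` and the fixed `smooth` value are hypothetical, whereas `TargetEncoder(smooth='auto')` chooses the amount of shrinkage from the data and, in `fit_transform`, encodes the training rows with internal cross-fitting.

```python
import numpy as np

def smoothed_target_encode(categories, target, smooth=10.0):
    """Map each category to a target mean shrunk toward the global mean.

    Hypothetical helper for illustration; `smooth` is a fixed pseudo-count.
    """
    categories = np.asarray(categories)
    target = np.asarray(target, dtype=float)
    global_mean = target.mean()
    encoding = {}
    for cat in np.unique(categories):
        mask = categories == cat
        n = mask.sum()
        # Weighted blend of the category mean and the global mean
        encoding[str(cat)] = (target[mask].sum() + smooth * global_mean) / (n + smooth)
    return encoding, global_mean

# Toy data, invented for illustration
brands = ["a", "a", "a", "b"]
ratings = [4.0, 4.0, 4.0, 2.0]
encoding, global_mean = smoothed_target_encode(brands, ratings, smooth=1.0)
print(encoding, global_mean)
```

Rare categories (such as `"b"` here, seen once) are pulled more strongly toward the global mean than frequent ones, which limits overfitting to small groups.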

## 🧠 Modeling


In this section, a **Generalized Linear Model (GLM)** is estimated to model the rating variable and to extend the linear regression framework presented previously.

The same preprocessed feature matrices used in the linear regression analysis are retained. Because the array-based `sm.GLM` interface in `statsmodels` does not add an intercept automatically, a constant column is explicitly added to both the training and test feature matrices. Indexes are also reset so that the response variable and the design matrix align row by row.
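
The alignment issue can be seen on a toy example. This sketch assumes a response that kept the shuffled index of a train/test split and a feature matrix rebuilt from a NumPy array, which mirrors the situation in this notebook. All values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (values invented for illustration)
y = pd.Series([4.1, 3.9, 4.5], index=[17, 3, 42], name="rating")  # shuffled index left by a split
X = pd.DataFrame(np.array([[0.2], [1.5], [-0.7]]))                # fresh index 0, 1, 2

# Prepend an intercept column of ones, which is the effect of adding a constant
X_design = X.copy()
X_design.insert(0, "const", 1.0)

# The two indexes differ until the response index is reset
print(X_design.index.equals(y.index))
y_aligned = y.reset_index(drop=True)
print(X_design.index.equals(y_aligned.index))
```

Resetting with `drop=True` discards the old labels while keeping the row order, so row `i` of the design matrix still corresponds to element `i` of the response.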

The model is specified using a **Gamma distribution** with a **log link function**, which is appropriate for strictly positive and right-skewed response variables. Model parameters are estimated using the **Iteratively Reweighted Least Squares (IRLS)** algorithm.
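
To show what IRLS does for this particular family and link, the following is a minimal NumPy sketch fitted to synthetic data. It is not the `statsmodels` implementation: the function name `fit_gamma_log_irls`, the starting values, and the simulated data are assumptions made for illustration.

```python
import numpy as np

def fit_gamma_log_irls(X, y, max_iter=50, tol=1e-10):
    """Gamma GLM with log link via IRLS; X must already include a constant column."""
    beta = np.zeros(X.shape[1])
    eta = np.full(len(y), np.log(y.mean()))  # start from an intercept-only fit
    for _ in range(max_iter):
        mu = np.exp(eta)                      # inverse log link
        z = eta + (y - mu) / mu               # working response
        # Gamma variance (mu^2) and the log link give constant IRLS weights,
        # so each update reduces to an ordinary least-squares fit of z on X
        beta_new = np.linalg.lstsq(X, z, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta, eta = beta_new, X @ beta_new
        if converged:
            break
    return beta

# Synthetic data with known coefficients (invented for illustration)
rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([1.3, 0.1])
mu = np.exp(X @ true_beta)
shape = 30.0                                  # dispersion phi = 1 / shape
y = rng.gamma(shape, mu / shape)              # mean of each draw equals mu

beta_hat = fit_gamma_log_irls(X, y)
print(beta_hat)
```

The estimated coefficients should land close to `true_beta`. The constant weights are specific to the Gamma family with a log link; other family and link combinations need a weighted least-squares step.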

After fitting the model on the training data, parameter estimates and inference statistics are examined. Finally, predictions are generated on the test set to allow for an out-of-sample evaluation of predictive performance.


In [54]:
X_train_glm = sm.add_constant(X_train_preprocessed)
X_train_glm = X_train_glm.reset_index(drop=True)
y_train_glm = y_train.reset_index(drop=True)

X_test_glm = sm.add_constant(X_test_preprocessed)
X_test_glm = X_test_glm.reset_index(drop=True)
y_test_glm = y_test.reset_index(drop=True)


In [55]:
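# Gamma family with a log link; `links.Log` is the current class name
# (the lowercase `links.log` alias is deprecated in recent statsmodels releases)
glm_gamma = sm.GLM(
    y_train_glm,
    X_train_glm,
    family=sm.families.Gamma(link=sm.families.links.Log())
)

# Fit by IRLS, the default estimation method for sm.GLM
glm_gamma_results = glm_gamma.fit()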




In [56]:
glm_gamma_results.summary()


| | | | |
|---|---|---|---|
| Dep. Variable: | rating | No. Observations: | 13250 |
| Model: | GLM | Df Residuals: | 13237 |
| Model Family: | Gamma | Df Model: | 12 |
| Link Function: | log | Scale: | 0.034123 |
| Method: | IRLS | Log-Likelihood: | -17233.0 |
| Date: | Sat, 13 Dec 2025 | Deviance: | 653.01 |
| Time: | 14:39:58 | Pearson chi2: | 452.0 |
| No. Iterations: | 9 | Pseudo R-squ. (CS): | 0.1002 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.4009 | 0.049 | 8.247 | 0.000 | 0.306 | 0.496 |
| 0 | 0.0003 | 0.009 | 0.035 | 0.972 | -0.017 | 0.018 |
| 1 | -0.0123 | 0.009 | -1.337 | 0.181 | -0.030 | 0.006 |
| 2 | 0.1335 | 0.005 | 27.634 | 0.000 | 0.124 | 0.143 |
| 3 | 0.1138 | 0.012 | 9.313 | 0.000 | 0.090 | 0.138 |
| 4 | -0.0173 | 0.013 | -1.289 | 0.197 | -0.044 | 0.009 |
| 5 | -0.0077 | 0.011 | -0.731 | 0.465 | -0.028 | 0.013 |
| 6 | -0.0016 | 0.014 | -0.119 | 0.905 | -0.028 | 0.025 |
| 7 | -0.0075 | 0.011 | -0.670 | 0.503 | -0.030 | 0.014 |
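
Because the model uses a log link, a coefficient is easier to read after exponentiation, where it becomes a multiplicative effect on the expected rating. The sketch below does this for the two clearly significant coefficients in the table. Assuming the `ColumnTransformer` emits its outputs in the order the transformers are listed, columns 2 and 3 would correspond to the target-encoded `brand` and `sub_category`. That mapping is an inference from the preprocessing code, not something the summary states.

```python
import math

# Coefficients copied from the summary table above
coefficients = {"column 2": 0.1335, "column 3": 0.1138}

ratios = {}
for name, coef in coefficients.items():
    ratios[name] = math.exp(coef)  # multiplicative change in the expected rating
    pct = 100 * (ratios[name] - 1)
    print(f"{name}: exp({coef}) = {ratios[name]:.3f} ({pct:.1f}% per unit)")
```

Read this way, a one-unit increase in column 2 is associated with an expected rating roughly 14% higher, holding the other features fixed, and a one-unit increase in column 3 with roughly 12% higher.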


In [None]:
y_pred_test_glm = glm_gamma_results.predict(X_test_glm)

## 📊 Analysis and Conclusions

In [None]:
mae_glm = mean_absolute_error(y_test_glm, y_pred_test_glm)
mse_glm = mean_squared_error(y_test_glm, y_pred_test_glm)

mae_glm, mse_glm


(0.46285998015838353, 0.5047628274858941)
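
For readers who want the definitions behind these two numbers, here is a small NumPy sketch that computes MAE and MSE by hand on toy values and converts the reported test MSE to an RMSE. The helper names `mae` and `mse` and the toy arrays are invented for illustration. Only the MSE value passed to `np.sqrt` comes from the output above.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors, in rating points."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mse(y_true, y_pred):
    """Mean squared error: average squared error, penalizing large misses more."""
    return float(np.mean((y_true - y_pred) ** 2))

# Toy values, invented for illustration
y_true = np.array([4.0, 3.5, 5.0, 2.0])
y_pred = np.array([4.2, 3.9, 4.1, 3.0])

toy_mae = mae(y_true, y_pred)
toy_mse = mse(y_true, y_pred)
print(toy_mae, toy_mse)

# RMSE puts the reported test MSE back on the rating scale
rmse_reported = float(np.sqrt(0.5047628274858941))
print(round(rmse_reported, 2))
```

The reported MSE corresponds to an RMSE of about 0.71 rating points. This is noticeably larger than the MAE of about 0.46, which suggests that a minority of predictions miss by a wide margin.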

In [63]:
pred_glm = glm_gamma_results.get_prediction(X_test_glm)
pred_summary = pred_glm.summary_frame(alpha=0.05)

In [64]:
pred_summary

| | mean | mean_se | mean_ci_lower | mean_ci_upper |
|---|---|---|---|---|
| 0 | 3.700570 | 0.016779 | 3.667828 | 3.733603 |
| 1 | 4.313474 | 0.026658 | 4.261541 | 4.366040 |
| 2 | 4.006404 | 0.018691 | 3.969938 | 4.043205 |
| 3 | 3.963245 | 0.026625 | 3.911403 | 4.015774 |
| 4 | 3.555119 | 0.021661 | 3.512917 | 3.597829 |
| ... | ... | ... | ... | ... |
| 5674 | 4.031204 | 0.014193 | 4.003482 | 4.059118 |
| 5675 | 4.261949 | 0.038474 | 4.187205 | 4.338029 |
| 5676 | 3.922421 | 0.021054 | 3.881373 | 3.963904 |
| 5677 | 3.845409 | 0.017438 | 3.811384 | 3.879738 |
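
The columns `mean_ci_lower` and `mean_ci_upper` bound the expected rating, not an individual rating. As a rough sketch of how much wider an interval for a single new observation would be, the code below simulates from a Gamma distribution using the predicted mean of row 0 and the estimated scale from the model summary. This is a plug-in approximation that ignores uncertainty in the coefficients and in the scale estimate, and it is not an output of `statsmodels`.

```python
import numpy as np

# Values copied from the outputs above
mean_hat = 3.700570                      # predicted mean for test row 0
ci_lower, ci_upper = 3.667828, 3.733603  # 95% confidence interval for that mean
phi = 0.034123                           # estimated scale (dispersion)

# Plug-in Gamma distribution: shape = 1/phi and scale = mean * phi,
# so that E[Y] = mean_hat and Var[Y] = phi * mean_hat**2
rng = np.random.default_rng(0)
draws = rng.gamma(shape=1.0 / phi, scale=mean_hat * phi, size=200_000)

pi_lower, pi_upper = np.quantile(draws, [0.025, 0.975])
print(f"95% CI for the mean:          [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"approx. 95% PI for one rating: [{pi_lower:.2f}, {pi_upper:.2f}]")
```

The simulated interval for a single rating spans well over two rating points, while the confidence interval for the mean is under 0.1 wide. The two quantities answer different questions and should not be used interchangeably.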


This notebook explored the use of supervised learning techniques to model and predict a product rating variable. 

The linear regression model provided a strong and interpretable baseline, capturing the main relationships between the explanatory variables and the response. Building on this, a GLM was estimated to better reflect the distributional characteristics of the rating variable.

Given that the response is strictly positive and right-skewed, a **Gamma GLM with a log link** was employed. This specification allows for a more appropriate probabilistic modeling of the data while retaining a linear structure in the predictors.

From a predictive perspective, the GLM achieved a **Mean Absolute Error (MAE) of approximately 0.46** and a **Mean Squared Error (MSE) of approximately 0.50** on the test set. These results are comparable to those obtained with the linear regression model, indicating that both approaches capture similar predictive information from the available features.

Although the GLM does not lead to a substantial improvement in point prediction accuracy, it provides additional benefits. In particular, the fitted model yields **confidence intervals** for the expected rating of each test observation. These intervals are relatively narrow, which indicates that the conditional mean is estimated precisely. They describe uncertainty about the mean rather than the spread of individual ratings, so on their own they do not establish that individual predictions are well calibrated.

Overall, this analysis highlights that improvements in model specification do not necessarily translate into large gains in predictive accuracy when a simpler model already captures most of the signal. Nevertheless, the GLM offers a more statistically appropriate and interpretable framework for modeling the rating variable, reinforcing the robustness of the conclusions drawn from the linear regression analysis.
