# Superstore Business Intelligence Project  
## Notebook 04: Regression Modeling & Discount Impact Quantification  

### Objective

This notebook builds a regression model to quantify:

- Impact of discount on profit  
- Effect of quantity on profitability  
- Operational influence (shipping days)  
- Regional structural effects  

We move from statistical testing → regression modeling, estimating the effect of each driver while holding the others constant.
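
For reference, the specification fitted later in this notebook can be written out as a single equation. This is a sketch of the model implied by the predictor list in the code cells. The symbols $\beta_k$ and $\varepsilon_i$ are notation introduced here, and the region terms are 0/1 indicators measured against whichever region is dropped as the baseline.

$$
\text{Profit}_i = \beta_0 + \beta_1\,\text{Discount}_i + \beta_2\,\text{Quantity}_i + \beta_3\,\text{ShippingDays}_i + \beta_4\,\text{East}_i + \beta_5\,\text{South}_i + \beta_6\,\text{West}_i + \varepsilon_i
$$

Each $\beta_k$ is the expected change in profit for a one-unit change in that predictor with the others held fixed. For `Discount`, which is stored as a fraction, one unit is a move from 0% to 100%.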

In [23]:
import pandas as pd
# import numpy as np
import statsmodels.api as sm

In [24]:
df = pd.read_csv("../data/superstore_enriched.csv")
df.head()

,Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,...,Discount,Profit,Shipping_Days,Late_Shipments,Profit_Margin,Loss_Flag,Discount_Bucket,Order_Year,Order_Month,Order_Month_Name
0,1,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,0.0,41.9136,3,0,0.16,0,No Discount,2016,11,November
1,2,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,0.0,219.582,3,0,0.3,0,No Discount,2016,11,November
2,3,CA-2016-138688,2016-06-12,2016-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,0.0,6.8714,4,0,0.47,0,No Discount,2016,6,June
3,4,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,0.45,-383.031,7,1,-0.4,1,High,2015,10,October
4,5,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,0.2,2.5164,7,1,0.1125,0,Low,2015,10,October


In [25]:
# Create dummy variables for Region
df_model = pd.get_dummies(df, columns=['Region'], drop_first=True)

# Define dependent and independent variables
X = df_model[['Discount', 'Quantity', 'Shipping_Days',
              'Region_East', 'Region_South', 'Region_West']]

y = df_model['Profit']

# Add constant
X = sm.add_constant(X)
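
Because `drop_first=True` removes one region column, it helps to confirm which region becomes the baseline. The sketch below uses a toy frame, not the Superstore file, and the label `Central` is an assumption about the data's region values; the name `toy` is hypothetical as well.

```python
import pandas as pd

# Toy data; the region labels are assumed, not read from the Superstore file.
toy = pd.DataFrame({"Region": ["Central", "East", "South", "West", "East"]})

# drop_first=True drops the first category in sorted order; that category
# becomes the baseline the remaining dummies are compared against.
dummies = pd.get_dummies(toy, columns=["Region"], drop_first=True)

all_levels = sorted(toy["Region"].unique())
kept_levels = [c.replace("Region_", "", 1) for c in dummies.columns]
baseline = [lvl for lvl in all_levels if lvl not in kept_levels]

print(list(dummies.columns))
print("Baseline:", baseline)
```

Running the same comparison on `df["Region"]` and the columns of `df_model` would show the baseline for the real data.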

In [27]:
X = X.astype(float)
y = y.astype(float)

model = sm.OLS(y, X).fit()
print(model.summary()) 

                            OLS Regression Results                            
Dep. Variable:                 Profit   R-squared:                       0.054
Model:                            OLS   Adj. R-squared:                  0.053
Method:                 Least Squares   F-statistic:                     94.23
Date:                Mon, 23 Feb 2026   Prob (F-statistic):          1.38e-115
Time:                        00:30:04   Log-Likelihood:                -68437.
No. Observations:                9994   AIC:                         1.369e+05
Df Residuals:                    9987   BIC:                         1.369e+05
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const            55.0884      8.495      6.485

### Regression Interpretation Guide

Focus on:

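1. Coefficients (coef column)
   - Each coefficient is the expected change in Profit for a one-unit increase in that predictor, holding the others constant.
   - Negative Discount coefficient → Profit decreases as discount increases.
   - Quantity coefficient → Change in profit per additional unit ordered (effect of bulk sales).
   - Shipping_Days → Change in profit per additional shipping day (operational impact).

2. P-values
   - p < 0.05 → Statistically significant predictor at the 5% level.
   - Prob (F-statistic) tests the model as a whole; here it is 1.38e-115, so the predictors are jointly significant.

3. R-squared
   - Measures the share of variance in Profit that the model explains.
   - Here R-squared is 0.054, so the model accounts for about 5% of the variation in Profit.

4. Regional coefficients
   - Show structural profit differences compared to the baseline region, i.e. the region dropped by `drop_first=True` (the one with no `Region_` column in `X`).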

### Check Multicollinearity (VIF)

In [29]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                   for i in range(X.shape[1])]

vif_data

,Feature,VIF
0,const,13.879025
1,Discount,1.059746
2,Quantity,1.000829
3,Shipping_Days,1.001457
4,Region_East,1.639026
5,Region_South,1.452121
6,Region_West,1.710897


## Notebook 04 Summary

Key Outcomes:

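- Built an OLS regression model to quantify profit drivers.
- Estimated the marginal effect of discount on profit (a level coefficient, not an elasticity).
- Assessed operational and regional structural effects.
- Checked multicollinearity using VIF: every predictor has a VIF below 2.

The model provides numerical estimates of how each driver relates to profit. With an R-squared of 0.054 it explains only a small share of profit variation, so the coefficients are better read as average associations than as precise predictions.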

Next Step:

Notebook 05 → Distribution Modeling  
(Understanding statistical behavior of Sales & Profit.)