# Superstore Business Intelligence Project  
## Notebook 03: Statistical Inference & Hypothesis Testing  

### Objective

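This notebook uses statistical hypothesis tests to check three observations from earlier analysis:

- Whether overall mean profit differs significantly from zero  
- Whether discount level is associated with a difference in profit  
- Whether mean profit differs across regions  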

We move from observation → statistical validation.

In [2]:
import pandas as pd
import numpy as np
from scipy import stats

In [4]:
df = pd.read_csv("../data/superstore_enriched.csv")
df.head()

Unnamed: 0,Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,...,Discount,Profit,Shipping_Days,Late_Shipments,Profit_Margin,Loss_Flag,Discount_Bucket,Order_Year,Order_Month,Order_Month_Name
0,1,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,0.0,41.9136,3,0,0.16,0,No Discount,2016,11,November
1,2,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,0.0,219.582,3,0,0.3,0,No Discount,2016,11,November
2,3,CA-2016-138688,2016-06-12,2016-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,0.0,6.8714,4,0,0.47,0,No Discount,2016,6,June
3,4,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,0.45,-383.031,7,1,-0.4,1,High,2015,10,October
4,5,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,0.2,2.5164,7,1,0.1125,0,Low,2015,10,October


____________________________________________________________
## Is Mean Profit Significantly Different From Zero?
___________________________________________________________

### Test 1: One-Sample t-Test on Profit

H0: Mean Profit = 0  
H1: Mean Profit ≠ 0  

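This two-sided test checks whether the mean profit per row (order line item) differs from zero. Rejecting H0 with a positive t-statistic indicates that the average line item is profitable.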

In [5]:
t_stat, p_value = stats.ttest_1samp(df['Profit'], 0)

print("T-Statistic:", t_stat)
print("P-Value:", p_value)

T-Statistic: 12.229268666920722
P-Value: 3.8023808162524127e-34


### Interpretation:

The p-value (≈ 3.8e-34) is far below 0.05, so we reject H0.  
Mean profit is statistically different from zero, and the positive t-statistic (≈ 12.23) shows that it lies above zero.
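
To make the statistic concrete, the sketch below recomputes a one-sample t-statistic by hand as t = (x̄ − μ0) / (s / √n). The `toy_profits` values are made up for illustration and are not Superstore rows, and the helper name `one_sample_t` is hypothetical.

```python
import math
import statistics


def one_sample_t(values, popmean=0.0):
    """Return the one-sample t-statistic and the standard error of the mean."""
    n = len(values)
    sample_mean = statistics.mean(values)
    sample_sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    std_error = sample_sd / math.sqrt(n)
    return (sample_mean - popmean) / std_error, std_error


# Toy data (assumption): six illustrative profit values, not real rows.
toy_profits = [10.0, -5.0, 20.0, 15.0, 0.0, 30.0]
t_toy, se_toy = one_sample_t(toy_profits)

print("t-statistic:", t_toy)
print("standard error:", se_toy)
```

Passing the same list to `stats.ttest_1samp(toy_profits, 0)` should return the same statistic; SciPy then derives the p-value from a t-distribution with n − 1 degrees of freedom.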

______________________________________________
## Does High Discount Reduce Profit? 
______________________________________________

### Test 2: Two-Sample t-Test (High vs Low Discount)

Group 1 (Low Discount): Discount ≤ 0.2  
Group 2 (High Discount): Discount > 0.2  

H0: Mean Profit (Low Discount) = Mean Profit (High Discount)  
H1: Mean Profit (Low Discount) ≠ Mean Profit (High Discount)

In [7]:
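# Split profit by discount level (threshold: 20%)
low_discount = df.loc[df['Discount'] <= 0.2, 'Profit']
high_discount = df.loc[df['Discount'] > 0.2, 'Profit']

# equal_var=False runs Welch's t-test, which does not assume equal group variances
t_stat2, p_value2 = stats.ttest_ind(low_discount, high_discount, equal_var=False)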

print("T-Statistic:", t_stat2)
print("P-Value:", p_value2)

T-Statistic: 16.1409994038947
P-Value: 2.2951802592338373e-54


### Interpretation:

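The p-value (≈ 2.3e-54) is far below 0.05, so we reject H0: mean profit differs between the two discount groups.  
The positive t-statistic (≈ 16.14, low minus high) means rows with discounts of at most 20% have a higher mean profit than rows with larger discounts.

This supports reviewing pricing strategy, although the test shows an association and does not by itself establish that discounts cause the lower profit.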
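
Because the test above uses `equal_var=False`, it is Welch's t-test. The sketch below shows how that statistic and its Welch-Satterthwaite degrees of freedom are computed. The two toy samples are invented for illustration, and the helper name `welch_t` is hypothetical.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Variance of each group mean: sample variance divided by group size
    vm_a = statistics.variance(sample_a) / n_a
    vm_b = statistics.variance(sample_b) / n_b
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    t_value = mean_diff / math.sqrt(vm_a + vm_b)
    dof = (vm_a + vm_b) ** 2 / (vm_a ** 2 / (n_a - 1) + vm_b ** 2 / (n_b - 1))
    return t_value, dof


# Toy data (assumption): illustrative profits for low- and high-discount rows.
toy_low = [8.0, 9.0, 10.0, 11.0, 12.0]
toy_high = [1.0, 3.0, 5.0, 7.0, 9.0]
t_welch, dof_welch = welch_t(toy_low, toy_high)

print("Welch t-statistic:", t_welch)
print("Degrees of freedom:", dof_welch)
```

The sign of the statistic follows the argument order (first sample minus second). For the directional question in the section heading, `stats.ttest_ind` also accepts `alternative='greater'` in SciPy 1.6 and later, which would test specifically whether the low-discount mean is higher.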

_____________________________________________________
## Regional Profit Difference (ANOVA)
____________________________________________________

### Test 3: One-Way ANOVA on Region

H0: Mean Profit is equal across regions  
H1: At least one region differs

In [8]:
# One array of profit values per region
regions = [group['Profit'].values for _, group in df.groupby('Region')]

f_stat, p_value3 = stats.f_oneway(*regions)

print("F-Statistic:", f_stat)
print("P-Value:", p_value3)

F-Statistic: 2.6224781547278115
P-Value: 0.04889160022168425


### Interpretation:

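The p-value (≈ 0.049) falls just below 0.05, so H0 is rejected, but only marginally.  
ANOVA indicates that at least one region's mean profit differs; it does not identify which regions differ.

This gives weak statistical support to the regional differences observed in EDA.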
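
As a rough illustration of what the F-statistic measures, the sketch below computes it by hand as the ratio of between-group to within-group mean squares. It also reports the share of total variation that lies between groups (often called eta squared). The three toy groups are invented and do not correspond to actual regions; `one_way_anova_f` is a hypothetical helper name.

```python
import statistics


def one_way_anova_f(groups):
    """Return (F-statistic, between-group share of total sum of squares)."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)

    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        group_mean = statistics.mean(g)
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, ss_between / (ss_between + ss_within)


# Toy data (assumption): three illustrative groups, not real regions.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_toy, between_share = one_way_anova_f(toy_groups)

print("F-statistic:", f_toy)
print("Between-group share of variation:", between_share)
```

Calling `stats.f_oneway(*toy_groups)` on the same lists should return the same F-statistic. With a large sample, a small F can still produce a p-value below 0.05, so the between-group share is a useful companion figure when judging practical importance.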

_____________________________________________
## Confidence Interval for Mean Profit
_____________________________________________

In [10]:
mean_profit = df['Profit'].mean()
std_profit = df['Profit'].std()  # sample standard deviation (ddof=1)
n = len(df)

# 95% t-interval: mean ± t_crit * (std / sqrt(n))
confidence_interval = stats.t.interval(
    0.95,
    df=n - 1,
    loc=mean_profit,
    scale=std_profit / np.sqrt(n)
)

print("Mean Profit:", mean_profit)
print("95% Confidence Interval: ", confidence_interval)

Mean Profit: 28.65689630778467
95% Confidence Interval:  (24.06354817167874, 33.2502444438906)
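
The interval excludes zero, which agrees with the result of Test 1. As a quick consistency check, the sketch below uses only the numbers printed in this notebook to recover the standard error and the critical value implied by the interval. It assumes Test 1 and the interval used the same rows and the same sample standard deviation; the variable names are illustrative.

```python
# Values copied from the notebook outputs.
mean_profit_reported = 28.65689630778467
t_stat_reported = 12.229268666920722
ci_low, ci_high = 24.06354817167874, 33.2502444438906

# Test 1 used t = mean / SE with a null value of 0, so SE = mean / t.
std_error = mean_profit_reported / t_stat_reported

# The interval is mean ± t_crit * SE, so half-width / SE recovers t_crit.
half_width = (ci_high - ci_low) / 2
implied_t_crit = half_width / std_error

print("Standard error:", std_error)
print("Half-width:", half_width)
print("Implied critical value:", implied_t_crit)
```

The implied critical value should come out at approximately 1.96, the usual two-sided 95% value for a large sample, and the interval midpoint matches the reported mean.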


## Notebook 03 Summary

Statistical validation results:

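- Mean profit is significantly different from zero and positive (t ≈ 12.23, p ≈ 3.8e-34).
- Rows with discounts above 20% have significantly lower mean profit than rows with discounts of 20% or less (Welch t ≈ 16.14, p ≈ 2.3e-54).
- One-way ANOVA finds a marginal difference in mean profit across regions (F ≈ 2.62, p ≈ 0.049).
- The 95% confidence interval for mean profit is approximately (24.06, 33.25), around a mean of 28.66.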

This notebook converts business observations into statistically validated insights.

Next Step:

Notebook 04 → Regression Modeling  
(Quantifying impact of Discount, Quantity, and Region on Profit.)