# CS 203 Assignment 10

## Team Members
- Nishchay Bhutoria (23110222)
- Srivaths P (23110321)

## Imports

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import ks_2samp
from statsmodels.stats.proportion import proportions_ztest

## Part 1: A/B Testing using Ad Click Prediction

### 1. Load the dataset into a pandas DataFrame.

In [2]:
df = pd.read_csv('datasets/ad_click_dataset.csv')
display(df.head(), df.shape)

Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
0,670,User670,22.0,,Desktop,Top,Shopping,Afternoon,1
1,3044,User3044,,Male,Desktop,Top,,,1
2,5912,User5912,41.0,Non-Binary,,Side,Education,Night,1
3,5418,User5418,34.0,Male,,,Entertainment,Evening,1
4,9452,User9452,39.0,Non-Binary,,,Social Media,Morning,0


(10000, 9)

### 2. Perform necessary data cleaning and preprocessing:
- Handle missing values
- Convert categorical columns (e.g., gender, ad_position)


In [3]:
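# Drop every row that has at least one missing value (10000 -> 816 rows).
df = df.dropna()

display(df.head(), df.shape)

# Label-encode each text column. Codes follow the sorted category order,
# so ad_position becomes Bottom = 0, Side = 1, Top = 2.
categorical_columns = df.select_dtypes(include=['object', 'category']).columns
for col in categorical_columns:
    df[col] = df[col].astype('category').cat.codes

display(df.head(), df.shape)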

Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
17,188,User188,56.0,Female,Tablet,Bottom,News,Morning,1
25,4890,User4890,43.0,Male,Tablet,Bottom,Education,Afternoon,1
33,4985,User4985,37.0,Male,Mobile,Top,News,Evening,0
52,9888,User9888,49.0,Male,Mobile,Top,News,Morning,1
102,8201,User8201,59.0,Female,Desktop,Bottom,Social Media,Morning,0


(816, 9)

Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
17,188,45,56.0,0,2,0,2,2,1
25,4890,186,43.0,1,2,0,0,0,1
33,4985,193,37.0,1,1,2,2,1,0
52,9888,436,49.0,1,1,2,2,2,1
102,8201,361,59.0,0,0,0,4,2,0


(816, 9)
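
The integer codes in the second table come from pandas' categorical encoding, which numbers the categories in sorted order when they are inferred from strings. A minimal sketch on a toy series (the values are illustrative, not rows from the dataset) shows one way to recover the code-to-label mapping before filtering on a code:

```python
import pandas as pd

# Toy column containing the three ad positions seen in the dataset preview.
positions = pd.Series(["Top", "Bottom", "Side", "Top"]).astype("category")

# cat.categories lists the labels in code order: code i maps to categories[i].
code_to_label = dict(enumerate(positions.cat.categories))
print(code_to_label)  # {0: 'Bottom', 1: 'Side', 2: 'Top'}

# Integer code of each row, in the original row order.
codes = positions.cat.codes.tolist()
print(codes)  # [2, 0, 1, 2]
```

Checking this mapping is what justifies selecting `ad_position == 2` for Top and `ad_position == 0` for Bottom.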

### 3. Split the dataset into two groups:
- Group A: Users with ad_position = Top (encoded as 2 by the preprocessing in step 2)
- Group B: Users with ad_position = Bottom (encoded as 0 by the preprocessing in step 2)

In [4]:
# After label encoding: Bottom = 0, Side = 1, Top = 2 (Side rows are excluded).
group_A = df[df['ad_position'] == 2]  # Top
group_B = df[df['ad_position'] == 0]  # Bottom

### 4. Use the `proportions_ztest` function from statsmodels to perform an independent two-sample z-test between Group A and Group B.

In [5]:
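# Number of clicks (successes) in each group.
clicks_A = group_A['click'].sum()
clicks_B = group_B['click'].sum()

# Number of users (trials) in each group.
total_A = group_A.shape[0]
total_B = group_B.shape[0]

# Two-sided z-test for equality of the two click-through rates.
count = np.array([clicks_A, clicks_B])
nobs = np.array([total_A, total_B])
z_stat, p_val = proportions_ztest(count, nobs)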

### 5. Print the following:
- The z-score
- The p-value

In [6]:
print(f"Z-score: {z_stat:.4f}")
print(f"P-value: {p_val:.4f}")

alpha = 0.05
if p_val < alpha:
    print("There is a statistically significant difference in click-through rates between the two ad positions.")
else:
    print("There is no statistically significant difference in click-through rates between the two ad positions.")

Z-score: -1.1365
P-value: 0.2557
There is no statistically significant difference in click-through rates between the two ad positions.


### 6. Interpret the result: Is there a statistically significant difference in click-through rates between the two groups? Justify your answer. 

#### Interpretation

We performed a two-sample z-test to compare the click-through rates (CTR) of two user groups based on their ad position:

- **Group A**: Users who saw the ad at the **top** (`ad_position = 2`)
- **Group B**: Users who saw the ad at the **bottom** (`ad_position = 0`)

##### Hypotheses:

- **Null Hypothesis $(H_0)$**: There is **no difference** in click-through rates between ads shown at the top and bottom positions.
- **Alternative Hypothesis $(H_1)$**: There **is a difference** in click-through rates between ads shown at the top and bottom positions.

##### Test Results:

- **Z-score**: -1.1365  
- **P-value**: 0.2557

At a significance level of $\alpha = 0.05$, the p-value is **greater than 0.05**, which means we **fail to reject the null hypothesis**.

##### Conclusion:

There is **no statistically significant difference** in click-through rates between ads placed at the top and bottom of the page. In this sample, **ad position (top vs bottom)** shows **no detectable effect** on click-through rate. Because only 816 complete rows remained after dropping missing values, a small real effect cannot be ruled out.
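
The z-score and p-value reported above can be reproduced by hand. The sketch below implements the pooled two-proportion z-test using only the standard library, which should correspond to what `proportions_ztest` computes with its default arguments. The function name `two_proportion_ztest` and the counts passed to it are hypothetical and are not the group sizes from this assignment.

```python
import math

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    """Pooled two-sample z-test for proportions; returns (z, two-sided p)."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Under H0 both groups share one click rate, estimated from the pooled data.
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts for illustration only (NOT the assignment's data).
z, p = two_proportion_ztest(60, 100, 50, 100)
print(f"z = {z:.4f}, p = {p:.4f}")
```

A negative z-score, as in the result above, simply means that the first group's observed click rate is lower than the second group's.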


## Part 2: Covariate Shift Detection Using Air Quality Data

### 1. You are provided with 3 datasets via this Google Drive link:
- train.csv
- test1.csv
- test2.csv

### 2. Load all three datasets using `pandas`.

In [7]:
train = pd.read_csv('datasets/train.csv')
test1 = pd.read_csv('datasets/test1.csv')
test2 = pd.read_csv('datasets/test2.csv')

display(train.head(), train.shape)
display(test1.head(), test1.shape)
display(test2.head(), test2.shape)

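def fix_commas(df):
    # Values that use a decimal comma (e.g. "22,7") are parsed as floats;
    # strings without a comma (dates, times, a bare "-200") are left unchanged.
    return df.apply(lambda col: col.map(lambda x: float(str(x).replace(',', '.')) if isinstance(x, str) and ',' in x else x))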

train = fix_commas(train)
test1 = fix_commas(test1)
test2 = fix_commas(test2)

display(train.head(), train.shape)
display(test1.head(), test1.shape)
display(test2.head(), test2.shape)

Unnamed: 0.1,Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Unnamed: 15,Unnamed: 16
0,1849,26/05/2004,19.00.00,-200,1130.0,-200.0,227,1368.0,-200.0,933.0,-200.0,1709.0,1269.0,267,195,6754,,
1,2533,24/06/2004,07.00.00,12,1030.0,-200.0,69,851.0,102.0,824.0,68.0,1700.0,983.0,219,570,14742,,
2,3047,15/07/2004,17.00.00,32,1164.0,-200.0,203,1306.0,259.0,648.0,198.0,1886.0,1218.0,355,191,10888,,
3,805,13/04/2004,07.00.00,39,1496.0,524.0,191,1272.0,328.0,667.0,130.0,2011.0,1399.0,110,642,8398,,
4,2962,12/07/2004,04.00.00,-200,780.0,-200.0,18,568.0,24.0,1200.0,34.0,1331.0,501.0,199,513,11803,,


(3200, 18)

Unnamed: 0.1,Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Unnamed: 15,Unnamed: 16
0,3123,18/07/2004,21.00.00,12,1067.0,-200.0,90,938.0,102.0,825.0,99.0,1520.0,912.0,297,248,10160,,
1,877,16/04/2004,07.00.00,45,1657.0,523.0,232,1384.0,352.0,579.0,109.0,2176.0,1600.0,128,710,10428,,
2,3457,01/08/2004,19.00.00,14,1037.0,-200.0,80,900.0,75.0,817.0,95.0,1584.0,619.0,331,327,16200,,
3,1494,12/05/2004,00.00.00,17,1122.0,-200.0,87,926.0,105.0,805.0,88.0,1619.0,1174.0,169,588,11250,,
4,713,09/04/2004,11.00.00,26,-200.0,262.0,-2000,-200.0,219.0,-200.0,121.0,-200.0,-200.0,-200,-200,-200,,


(800, 18)

Unnamed: 0.1,Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Unnamed: 15,Unnamed: 16
0,8500,27/02/2005,22.00.00,10,875.0,-200.0,21,594.0,128.0,1079.0,105.0,793.0,451.0,45,480,4085,,
1,8501,27/02/2005,23.00.00,13,943.0,-200.0,39,703.0,169.0,950.0,119.0,870.0,581.0,43,486,4069,,
2,8502,28/02/2005,00.00.00,16,947.0,-200.0,38,697.0,215.0,913.0,150.0,878.0,698.0,40,500,4115,,
3,8503,28/02/2005,01.00.00,10,865.0,-200.0,18,566.0,111.0,1119.0,94.0,797.0,423.0,40,529,4338,,
4,8504,28/02/2005,02.00.00,6,823.0,-200.0,10,503.0,60.0,1268.0,56.0,755.0,332.0,40,510,4200,,


(800, 18)

Unnamed: 0.1,Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Unnamed: 15,Unnamed: 16
0,1849,26/05/2004,19.00.00,-200.0,1130.0,-200.0,22.7,1368.0,-200.0,933.0,-200.0,1709.0,1269.0,26.7,19.5,0.6754,,
1,2533,24/06/2004,07.00.00,1.2,1030.0,-200.0,6.9,851.0,102.0,824.0,68.0,1700.0,983.0,21.9,57.0,1.4742,,
2,3047,15/07/2004,17.00.00,3.2,1164.0,-200.0,20.3,1306.0,259.0,648.0,198.0,1886.0,1218.0,35.5,19.1,1.0888,,
3,805,13/04/2004,07.00.00,3.9,1496.0,524.0,19.1,1272.0,328.0,667.0,130.0,2011.0,1399.0,11.0,64.2,0.8398,,
4,2962,12/07/2004,04.00.00,-200.0,780.0,-200.0,1.8,568.0,24.0,1200.0,34.0,1331.0,501.0,19.9,51.3,1.1803,,


(3200, 18)

Unnamed: 0.1,Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Unnamed: 15,Unnamed: 16
0,3123,18/07/2004,21.00.00,1.2,1067.0,-200.0,9.0,938.0,102.0,825.0,99.0,1520.0,912.0,29.7,24.8,1.016,,
1,877,16/04/2004,07.00.00,4.5,1657.0,523.0,23.2,1384.0,352.0,579.0,109.0,2176.0,1600.0,12.8,71.0,1.0428,,
2,3457,01/08/2004,19.00.00,1.4,1037.0,-200.0,8.0,900.0,75.0,817.0,95.0,1584.0,619.0,33.1,32.7,1.62,,
3,1494,12/05/2004,00.00.00,1.7,1122.0,-200.0,8.7,926.0,105.0,805.0,88.0,1619.0,1174.0,16.9,58.8,1.125,,
4,713,09/04/2004,11.00.00,2.6,-200.0,262.0,-200.0,-200.0,219.0,-200.0,121.0,-200.0,-200.0,-200.0,-200.0,-200.0,,


(800, 18)

Unnamed: 0.1,Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Unnamed: 15,Unnamed: 16
0,8500,27/02/2005,22.00.00,1.0,875.0,-200.0,2.1,594.0,128.0,1079.0,105.0,793.0,451.0,4.5,48.0,0.4085,,
1,8501,27/02/2005,23.00.00,1.3,943.0,-200.0,3.9,703.0,169.0,950.0,119.0,870.0,581.0,4.3,48.6,0.4069,,
2,8502,28/02/2005,00.00.00,1.6,947.0,-200.0,3.8,697.0,215.0,913.0,150.0,878.0,698.0,4.0,50.0,0.4115,,
3,8503,28/02/2005,01.00.00,1.0,865.0,-200.0,1.8,566.0,111.0,1119.0,94.0,797.0,423.0,4.0,52.9,0.4338,,
4,8504,28/02/2005,02.00.00,0.6,823.0,-200.0,1.0,503.0,60.0,1268.0,56.0,755.0,332.0,4.0,51.0,0.42,,


(800, 18)
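
The results table in step 4 lists nine features, and `CO(GT)`, `T`, `RH`, and `AH` are not among them. One possible explanation (an assumption, not verified against the data files) is that these columns contain strings without a decimal comma, such as a bare `-200`, which `fix_commas` leaves as text so the column keeps the `object` dtype. A minimal sketch of a stricter conversion follows; the helper name `to_float_columns`, its `skip` parameter, and the toy frame are hypothetical.

```python
import pandas as pd

def to_float_columns(df, skip=("Date", "Time")):
    """Return a copy in which every non-skipped text column is parsed as float."""
    out = df.copy()
    for col in out.columns:
        # Leave date/time columns and already-numeric columns untouched.
        if col in skip or pd.api.types.is_numeric_dtype(out[col]):
            continue
        # Swap the decimal comma for a point, then parse the whole column.
        as_text = out[col].astype(str).str.replace(",", ".", regex=False)
        out[col] = pd.to_numeric(as_text, errors="coerce")
    return out

# Toy frame mimicking the mixed formats seen in the previews above.
toy = pd.DataFrame({
    "Date": ["26/05/2004", "24/06/2004"],
    "CO(GT)": ["-200", "1,2"],
    "T": ["26,7", "-200"],
    "NO2(GT)": [-200.0, 68.0],
})
clean = to_float_columns(toy)
print(clean.dtypes)
```

With `errors="coerce"`, any value that still cannot be parsed becomes `NaN` instead of raising, so it is worth counting the `NaN` values afterwards.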

### 3. For each test dataset (`test1.csv` and `test2.csv`), compare it with `train.csv` using the **Kolmogorov–Smirnov test** (`scipy.stats.ks_2samp`). Perform the KS test on the **NO2(GT)** column to identify whether there are any distributional differences. 

In [8]:
train_no2 = train['NO2(GT)'].dropna()
test1_no2 = test1['NO2(GT)'].dropna()
test2_no2 = test2['NO2(GT)'].dropna()

ks_stat_1, p_val_1 = ks_2samp(train_no2, test1_no2)
ks_stat_2, p_val_2 = ks_2samp(train_no2, test2_no2)

print("KS Test Results for NO2(GT):")

print("\nTrain vs Test1:")
print(f"KS Statistic = {ks_stat_1:.4f}")
print(f"P-value      = {p_val_1:.4f}")
print("No significant difference" if p_val_1 >= 0.05 else "Significant difference")

print("\nTrain vs Test2:")
print(f"KS Statistic = {ks_stat_2:.4f}")
print(f"P-value      = {p_val_2:.4f}")
print("No significant difference" if p_val_2 >= 0.05 else "Significant difference")


KS Test Results for NO2(GT):

Train vs Test1:
KS Statistic = 0.0191
P-value      = 0.9722
No significant difference

Train vs Test2:
KS Statistic = 0.4075
P-value      = 0.0000
Significant difference


#### KS Test Result for `NO2(GT)`

- **Train vs Test1**:  
  - KS Statistic $= 0.0191$, P-value $= 0.9722 \implies$ **No significant difference**
  
- **Train vs Test2**:  
  - KS Statistic $= 0.4075$, P-value $\approx 7.2 \times 10^{-96}$ (printed as $0.0000$ at four decimal places) $\implies$ **Significant difference**

The distribution of `NO2(GT)` in `test2.csv` differs from that of the training data, which indicates **covariate shift** in this feature. `test1.csv` shows no such difference.
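
For intuition about what the KS statistic measures: it is the largest vertical gap between the empirical cumulative distribution functions (ECDFs) of the two samples. The sketch below computes that gap directly with NumPy. The helper name `ks_statistic` and the toy samples are hypothetical, and the function returns only the statistic, not a p-value.

```python
import numpy as np

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum absolute difference between ECDFs."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    # The ECDFs only change at observed values, so checking those is enough.
    grid = np.concatenate([a, b])
    ecdf_a = np.searchsorted(a, grid, side="right") / a.size
    ecdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(ecdf_a - ecdf_b)))

print(ks_statistic([1, 2, 3], [1, 2, 3]))  # 0.0 (identical samples)
print(ks_statistic([0, 1], [2, 3]))        # 1.0 (no overlap at all)
```

On this scale, the value of 0.0191 for `test1.csv` means the two ECDFs never differ by more than about 2 percentage points, while 0.4075 for `test2.csv` means they differ by about 41 percentage points at some value of `NO2(GT)`.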

### 4. Report the KS statistic and p-value for each feature.

In [9]:
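# Keep numeric columns only; columns still stored as object dtype are not tested.
numeric_cols = train.select_dtypes(include=['float64', 'int64']).columns
# Drop the leftover index columns ("Unnamed: ...").
numeric_cols = [col for col in numeric_cols if not col.lower().startswith('unnamed')]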

results = []

for col in numeric_cols:
    train_col = train[col].dropna()
    test1_col = test1[col].dropna()
    test2_col = test2[col].dropna()

    ks_stat_1, p_val_1 = ks_2samp(train_col, test1_col)
    ks_stat_2, p_val_2 = ks_2samp(train_col, test2_col)

    results.append({
        'Feature': col,
        'KS Statistic (test1)': ks_stat_1,
        'P-value (test1)': p_val_1,
        'KS Statistic (test2)': ks_stat_2,
        'P-value (test2)': p_val_2
    })

results_df = pd.DataFrame(results)
display(results_df)

Unnamed: 0,Feature,KS Statistic (test1),P-value (test1),KS Statistic (test2),P-value (test2)
0,PT08.S1(CO),0.032813,0.490012,0.1275,1.651177e-09
1,NMHC(GT),0.012812,0.999921,0.227187,1.981349e-29
2,C6H6(GT),0.020938,0.938445,0.1425,8.893534e-12
3,PT08.S2(NMHC),0.021562,0.92354,0.141875,1.118677e-11
4,NOx(GT),0.0175,0.988482,0.524062,4.125921e-162
5,PT08.S3(NOx),0.034375,0.430351,0.322813,1.428152e-59
6,NO2(GT),0.019062,0.972194,0.4075,7.201998e-96
7,PT08.S4(NO2),0.02,0.957373,0.597187,1.349834e-214
8,PT08.S5(O3),0.028125,0.685568,0.136563,7.544642e-11
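
One caveat when reading this table: the previews above contain many `-200` entries. In the UCI Air Quality dataset, whose column names match these files, `-200` marks a missing reading. If the same convention applies here (an assumption), `dropna()` does not remove those entries, and a difference in the share of missing readings can show up as a distribution shift. The sketch below uses synthetic data to compare the KS statistic with and without masking the sentinel. The helper name `mask_sentinel` is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def mask_sentinel(series, sentinel=-200):
    """Replace the missing-value sentinel with NaN and drop it."""
    return series.replace(sentinel, np.nan).dropna()

rng = np.random.default_rng(0)

# Two synthetic samples from the SAME distribution ...
a = rng.normal(100, 10, 500)
b = rng.normal(100, 10, 500)
# ... but with different shares of missing readings (5% vs 40%).
a[:25] = -200
b[:200] = -200
a, b = pd.Series(a), pd.Series(b)

raw = ks_2samp(a, b)
masked = ks_2samp(mask_sentinel(a), mask_sentinel(b))
print(f"raw:    D = {raw.statistic:.4f}")
print(f"masked: D = {masked.statistic:.4f}")
```

Here the raw statistic is driven almost entirely by the difference in missing-value rates, so running the test on masked columns helps separate a shift in missingness from a shift in the measured values.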


### 6. Determine which of the two test datasets (`test1.csv` or `test2.csv`) exhibits a covariate shift relative to the training dataset (`train.csv`). Use the results of the Kolmogorov–Smirnov test to support your answer.

#### Conclusion: Covariate Shift Detection

Using the Kolmogorov–Smirnov test, we compared each test dataset with the training dataset across the nine numeric features listed in the results table.

- **`test1.csv`**:
  - For all features, the p-values are greater than $0.05$ (the smallest is about $0.43$, for `PT08.S3(NOx)`).
  - We therefore fail to reject the null hypothesis for every feature and find no evidence of a distributional shift.

- **`test2.csv`**:
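  - All nine features show p-values far below $0.05$ (the largest is about $1.7 \times 10^{-9}$, for `PT08.S1(CO)`). The largest KS statistics belong to `PT08.S4(NO2)` ($0.597$), `NOx(GT)` ($0.524$), and `NO2(GT)` ($0.408$).
  - This provides strong evidence of distributional differences between `test2.csv` and `train.csv`.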

Therefore, `test2.csv` exhibits covariate shift relative to the training dataset, while `test1.csv` does not.
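
Because nine features are tested for each dataset, the chance of at least one false positive is higher than for a single test. A simple, conservative safeguard is the Bonferroni correction, which compares each p-value with $\alpha / m$ for $m$ tests. The sketch below applies it to the p-values from the results table. Using this correction is an addition for illustration and was not part of the original analysis.

```python
# p-values copied from the results table (train vs test1, train vs test2).
p_test1 = {
    "PT08.S1(CO)": 0.490012, "NMHC(GT)": 0.999921, "C6H6(GT)": 0.938445,
    "PT08.S2(NMHC)": 0.92354, "NOx(GT)": 0.988482, "PT08.S3(NOx)": 0.430351,
    "NO2(GT)": 0.972194, "PT08.S4(NO2)": 0.957373, "PT08.S5(O3)": 0.685568,
}
p_test2 = {
    "PT08.S1(CO)": 1.651177e-09, "NMHC(GT)": 1.981349e-29,
    "C6H6(GT)": 8.893534e-12, "PT08.S2(NMHC)": 1.118677e-11,
    "NOx(GT)": 4.125921e-162, "PT08.S3(NOx)": 1.428152e-59,
    "NO2(GT)": 7.201998e-96, "PT08.S4(NO2)": 1.349834e-214,
    "PT08.S5(O3)": 7.544642e-11,
}

alpha = 0.05
m = len(p_test1)          # number of features tested per dataset
threshold = alpha / m     # Bonferroni-adjusted significance level

shifted_test1 = [f for f, p in p_test1.items() if p < threshold]
shifted_test2 = [f for f, p in p_test2.items() if p < threshold]
print(len(shifted_test1), len(shifted_test2))  # 0 9
```

The conclusion is unchanged under the stricter threshold: no feature in `test1.csv` is flagged, and all nine features in `test2.csv` are.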