# CS 203 Assignment 10

## Team Members
- Nishchay Bhutoria (23110222)
- Srivaths P (23110321)

## Imports

In [81]:
import pandas as pd
import numpy as np
from scipy.stats import ks_2samp
from statsmodels.stats.proportion import proportions_ztest

## Part 1: A/B Testing using Ad Click Prediction

### 1. Load the dataset into a pandas DataFrame.

In [90]:
df = pd.read_csv('datasets/ad_click_dataset.csv')
df.head()

Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
0,670,User670,22.0,,Desktop,Top,Shopping,Afternoon,1
1,3044,User3044,,Male,Desktop,Top,,,1
2,5912,User5912,41.0,Non-Binary,,Side,Education,Night,1
3,5418,User5418,34.0,Male,,,Entertainment,Evening,1
4,9452,User9452,39.0,Non-Binary,,,Social Media,Morning,0


### 2. Perform necessary data cleaning and preprocessing:
- Handle missing values
- Convert categorical columns (e.g., gender, ad_position)


In [91]:
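# Drop rows with no ad position, since the A/B split below depends on it.
df = df.dropna(subset=['ad_position']).copy()

# Encode every text column as integer category codes.
# Codes follow alphabetical order (for ad_position: Bottom=0, Side=1, Top=2).
# Missing values in these columns are encoded as -1.
categorical_columns = df.select_dtypes(include=['object', 'category']).columns
for col in categorical_columns:
    df[col] = df[col].astype('category').cat.codes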

df.head()

Unnamed: 0,id,full_name,age,gender,device_type,ad_position,browsing_history,time_of_day,click
0,670,2088,22.0,-1,0,2,3,0,1
1,3044,750,,1,0,2,-1,-1,1
2,5912,1783,41.0,2,-1,1,0,3,1
5,5942,1794,,2,-1,0,4,1,1
6,7808,2497,26.0,0,0,2,-1,-1,1
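The encoding step relies on how pandas assigns category codes, which is worth seeing in isolation. The sketch below uses a small made-up series (the values are illustrative, not taken from the dataset) to show that codes follow the sorted category order and that missing entries come out as -1.

```python
import numpy as np
import pandas as pd

# Illustrative values only; not rows from ad_click_dataset.csv.
positions = pd.Series(["Top", np.nan, "Side", "Bottom", "Top"])

as_category = positions.astype("category")
codes = as_category.cat.codes

# Map each integer code back to its label for later interpretation.
mapping = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(mapping)
```

Keeping a mapping like this around makes it easier to confirm that `ad_position == 2` really means Top before splitting the groups.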


### 3. Split the dataset into two groups:
- Group A: Users with ad_position = 2 (Top)
- Group B: Users with ad_position = 0 (Bottom)

In [84]:
group_A = df[df['ad_position'] == 2]
group_B = df[df['ad_position'] == 0]

### 4. Use the `proportions_ztest` function from statsmodels to perform an independent two-sample z-test between Group A and Group B.

In [85]:
clicks_A = group_A['click'].sum()
clicks_B = group_B['click'].sum()

total_A = group_A.shape[0]
total_B = group_B.shape[0]

count = np.array([clicks_A, clicks_B])
nobs = np.array([total_A, total_B])
z_stat, p_val = proportions_ztest(count, nobs)

### 5. Print the following:
- The z-score
- The p-value

In [86]:
print(f"Z-score: {z_stat:.8f}")
print(f"P-value: {p_val:.8f}")

alpha = 0.05
if p_val < alpha:
    print("There is a statistically significant difference in click-through rates between the two ad positions.")
else:
    print("There is no statistically significant difference in click-through rates between the two ad positions.")

Z-score: -4.06421541
P-value: 0.00004819
There is a statistically significant difference in click-through rates between the two ad positions.


### 6. Interpret the result: Is there a statistically significant difference in click-through rates between the two groups? Justify your answer. 

Yes, there is a statistically significant difference in click-through rates between the two groups. The p-value (about 0.00005) is well below the 0.05 significance level, so if both positions had the same underlying click-through rate, a difference this large would be very unlikely to arise by chance. The negative z-score (-4.06) shows the direction: Group A (Top) had a lower click-through rate than Group B (Bottom). This indicates that ad position is associated with user engagement in this dataset.
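To make the z-score and p-value less of a black box, here is a minimal sketch of the pooled two-proportion z-test, which is what `proportions_ztest` computes by default for two samples. The function name `two_proportion_ztest` and the click counts are hypothetical and chosen only for illustration; they are not the counts from this dataset.

```python
import math

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    """Pooled two-sided z-test for the difference of two proportions."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Under the null hypothesis both groups share one click rate.
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: group A clicks less often than group B.
z, p = two_proportion_ztest(clicks_a=120, n_a=200, clicks_b=150, n_b=200)
print(f"z = {z:.4f}, p = {p:.6f}")
```

Because the statistic is built from `p_a - p_b`, a negative z means the first group has the lower rate, which is how the sign of the reported z-score is read.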

## Part 2: Covariate Shift Detection Using Air Quality Data

### 1. You are provided with 3 datasets via this Google Drive link:
- train.csv
- test1.csv
- test2.csv

### 2. Load all three datasets using `pandas`.

In [87]:
train = pd.read_csv('datasets/train.csv')
test1 = pd.read_csv('datasets/test1.csv')
test2 = pd.read_csv('datasets/test2.csv')


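def fix_commas(df):
    # Convert decimal-comma strings such as "2,6" to floats; leave all other values unchanged.
    def to_float(x):
        return float(x.replace(',', '.')) if isinstance(x, str) and ',' in x else x
    return df.apply(lambda col: col.map(to_float))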


train = fix_commas(train)
test1 = fix_commas(test1)
test2 = fix_commas(test2)

train.shape, test1.shape, test2.shape

((3200, 18), (800, 18), (800, 18))
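An alternative to post-processing the strings is to let pandas parse decimal commas while reading. The sketch below assumes a semicolon-separated layout, which may not match the actual files; the inline text and its values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents: semicolon-separated fields with decimal commas.
raw = "CO(GT);NO2(GT)\n2,6;113\n1,25;92\n"

# decimal=',' tells the parser to treat the comma as the decimal point.
frame = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")
print(frame.dtypes)
```

If the real files are comma-separated with quoted numeric fields, the `fix_commas` approach used above may remain the simpler choice.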

### 3. For each test dataset (`test1.csv` and `test2.csv`), compare it with `train.csv` using the **Kolmogorov–Smirnov test** (`scipy.stats.ks_2samp`). Perform the KS test on the **NO2(GT)** column to identify whether there are any distributional differences. 

In [88]:
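# Keep only rows with a non-negative NO2(GT) reading. Negative readings are
# treated here as missing-value markers (assumed; the UCI Air Quality data,
# which these columns resemble, uses -200 for this purpose).
# Note that this removes whole rows, so it also affects the per-feature tests below.
train = train[train['NO2(GT)'] >= 0]
test1 = test1[test1['NO2(GT)'] >= 0]
test2 = test2[test2['NO2(GT)'] >= 0]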

train_no2 = train['NO2(GT)'].dropna()
test1_no2 = test1['NO2(GT)'].dropna()
test2_no2 = test2['NO2(GT)'].dropna()

ks_stat_1, p_val_1 = ks_2samp(train_no2, test1_no2)
ks_stat_2, p_val_2 = ks_2samp(train_no2, test2_no2)

print("KS Test Results for NO2(GT):")

print("\nTrain vs Test1:")
print(f"KS Statistic = {ks_stat_1:.8f}")
print(f"P-value      = {p_val_1:.8f}")
print("No significant difference" if p_val_1 >= 0.05 else "Significant difference")

print("\nTrain vs Test2:")
print(f"KS Statistic = {ks_stat_2:.8f}")
print(f"P-value      = {p_val_2:.8f}")
print("No significant difference" if p_val_2 >= 0.05 else "Significant difference")


KS Test Results for NO2(GT):

Train vs Test1:
KS Statistic = 0.01706222
P-value      = 0.99713782
No significant difference

Train vs Test2:
KS Statistic = 0.36885364
P-value      = 0.00000000
Significant difference


Train vs Test1: The KS test finds no significant difference in the distribution of NO2(GT) between the training and test1 datasets (KS = 0.0171, p ≈ 0.9971).

Train vs Test2: The KS test finds a significant difference in the distribution of NO2(GT) between the training and test2 datasets (KS = 0.3689). The p-value prints as 0.00000000 at eight decimal places; it is about 2.5e-74, not exactly zero.
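The KS statistic itself is the largest vertical gap between the two empirical cumulative distribution functions. The following sketch computes it directly with NumPy and compares it against `ks_2samp` on synthetic samples; the helper name `ks_statistic` and the generated data are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of a and b."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # The gap can only change at observed values, so evaluate both CDFs there.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Synthetic samples: two from the same distribution, one shifted.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 500)
similar = rng.normal(0.0, 1.0, 500)
shifted = rng.normal(1.0, 1.0, 500)

print(ks_statistic(reference, similar), ks_2samp(reference, similar).statistic)
print(ks_statistic(reference, shifted), ks_2samp(reference, shifted).statistic)
```

A statistic near 0 means the two empirical CDFs almost coincide, while a value near 1 means the samples barely overlap.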

### 4. Report the KS statistic and p-value for each feature.

In [89]:
numeric_cols = train.select_dtypes(include=['float64', 'int64']).columns
numeric_cols = [col for col in numeric_cols if not col.lower().startswith('unnamed')]

results = []

for col in numeric_cols:
    train_col = train[col].dropna()
    test1_col = test1[col].dropna()
    test2_col = test2[col].dropna()

    ks_stat_1, p_val_1 = ks_2samp(train_col, test1_col)
    ks_stat_2, p_val_2 = ks_2samp(train_col, test2_col)

    results.append({
        'Feature': col,
        'KS Statistic (test1)': ks_stat_1,
        'P-value (test1)': p_val_1,
        'KS Statistic (test2)': ks_stat_2,
        'P-value (test2)': p_val_2
    })

results_df = pd.DataFrame(results)
results_df

Unnamed: 0,Feature,KS Statistic (test1),P-value (test1),KS Statistic (test2),P-value (test2)
0,PT08.S1(CO),0.037426,0.437186,0.108878,9.736907e-07
1,NMHC(GT),0.018621,0.991175,0.261244,4.978435e-37
2,C6H6(GT),0.022405,0.94697,0.165641,4.767387e-15
3,PT08.S2(NMHC),0.022405,0.94697,0.167671,2.072857e-15
4,NOx(GT),0.018128,0.993617,0.487814,1.986449e-132
5,PT08.S3(NOx),0.040097,0.352329,0.309296,5.487765e-52
6,NO2(GT),0.017062,0.997138,0.368854,2.531724e-74
7,PT08.S4(NO2),0.021764,0.958119,0.600455,3.376137e-206
8,PT08.S5(O3),0.028462,0.772064,0.114048,2.372149e-07


### 6. Determine which of the two test datasets (`test1.csv` or `test2.csv`) exhibits a covariate shift relative to the training dataset (`train.csv`). Use the results of the Kolmogorov–Smirnov test to support your answer.

The Kolmogorov–Smirnov (KS) tests indicate a covariate shift between the training data and test2, but not test1. All nine numeric features in test2 differ significantly from the training set: every p-value is below 0.05 (the largest is about 9.7e-07), and the KS statistics range from 0.11 to 0.60. For test1, no feature differs significantly: the smallest p-value is 0.35 and no KS statistic exceeds 0.041.

The NO2(GT) feature illustrates the contrast. The comparison between train and test2 shows a clear distributional divergence (KS = 0.3689, p ≈ 2.5e-74), while the comparison between train and test1 shows close agreement (KS = 0.0171, p = 0.9971).
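Since nine features are tested per dataset, one reasonable check is whether the conclusion survives a multiple-comparison correction. The sketch below applies a Bonferroni threshold (a technique not used in the analysis above, added here as an assumption) to the p-values reported in the per-feature table.

```python
# P-values copied from the per-feature KS results table, in table order.
p_test1 = [0.437186, 0.991175, 0.94697, 0.94697, 0.993617,
           0.352329, 0.997138, 0.958119, 0.772064]
p_test2 = [9.736907e-07, 4.978435e-37, 4.767387e-15, 2.072857e-15,
           1.986449e-132, 5.487765e-52, 2.531724e-74, 3.376137e-206,
           2.372149e-07]

alpha = 0.05
# Bonferroni: divide alpha by the number of tests run per dataset.
threshold = alpha / len(p_test1)

shifted_test1 = sum(p < threshold for p in p_test1)
shifted_test2 = sum(p < threshold for p in p_test2)

print(f"Bonferroni threshold: {threshold:.5f}")
print(f"test1 features flagged: {shifted_test1} of {len(p_test1)}")
print(f"test2 features flagged: {shifted_test2} of {len(p_test2)}")
```

Under this stricter threshold the picture is unchanged: no test1 feature is flagged and every test2 feature is.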