In [0]:
# Step 1: Set the storage account name
storage_account = "stcampaigntp"  # update if your storage account is different

# Step 2: Configure storage access with the account key held in a secret scope
spark.conf.set(
  f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
  dbutils.secrets.get(scope="local-scope", key="storage-account-key")
)

### Step 1: Load and Feature-Engineer the Full Dataset

#### Business Context:

To enable scalable and efficient uplift modeling, we start by preparing a clean and structured version of the raw Criteo ad dataset. The original data includes 12 continuous features (f0–f11) along with treatment and outcome variables. To streamline feature handling and follow the naming conventions used in marketing and experimentation analytics, we rename these features to I1–I12, enforce consistent data types, and remove any corrupted or incomplete records.

This step ensures the data is reliable, interpretable, and ready for downstream modeling and analysis pipelines that support leadership decision-making.

#### Technical Details:

- **Load data:** We read the 1M-row sample (previously stored in the curated Azure Data Lake container) with PySpark for distributed processing.
- **Rename features:** Columns f0 to f11 are renamed to I1 to I12.
- **Clean data:**
  - We remove rows with any missing values using `.dropna()`.
  - We cast every column (features, treatment indicator, and outcome) to `IntegerType`. Because the features are continuous, this cast truncates any fractional part.
- **Save cleaned dataset:** The cleaned dataset is written back to the curated container under a new path, `criteo-1m-features`.

This processing ensures the dataset is clean, typed, and optimized for Spark-based transformations and distributed model training workflows.
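
Because the source features are described as continuous while the cleaning step casts every column to an integer type, it can help to measure how much information the cast discards before committing to it. The sketch below is a plain-Python illustration, not part of the pipeline: `truncation_loss` is a hypothetical helper, the sample values are made up, and it assumes the cast truncates toward zero the way Python's `int()` does.

```python
def truncation_loss(values):
    """Return (share of values changed by an integer cast, largest absolute change)."""
    if not values:
        return 0.0, 0.0
    # A value is "changed" when truncating it toward zero alters it
    changed = sum(1 for v in values if int(v) != v)
    max_error = max(abs(v - int(v)) for v in values)
    return changed / len(values), max_error


# Made-up values standing in for one continuous feature column
sample_feature = [12.616365, 10.059654, 8.976429, 4.0, -1.288207]

share_changed, max_error = truncation_loss(sample_feature)
print(f"share of values changed: {share_changed:.0%}")
print(f"largest absolute change: {max_error:.3f}")
```

If most values change, keeping the feature columns as a floating-point type and casting only the treatment and outcome columns to integers may be the safer choice.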

In [0]:
# Load the curated 1M-row sample
df = spark.read.parquet(f"abfss://curated@{storage_account}.dfs.core.windows.net/criteo_uplift_1m")
print(f"✅ Loaded {df.count():,} rows")

# Feature engineering

# Rename f0–f11 to I1–I12
new_col_names = {
    'f0': 'I1', 'f1': 'I2', 'f2': 'I3', 'f3': 'I4',
    'f4': 'I5', 'f5': 'I6', 'f6': 'I7', 'f7': 'I8',
    'f8': 'I9', 'f9': 'I10', 'f10': 'I11', 'f11': 'I12'
}

for old_name, new_name in new_col_names.items():
    df = df.withColumnRenamed(old_name, new_name)

# Drop any rows with missing values
df = df.dropna()

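# Cast all I-columns and labels to IntegerType
# NOTE: f0–f11 are continuous, so this cast truncates fractional parts;
# cast the I-columns to DoubleType instead if that precision matters.
from pyspark.sql.types import IntegerType

for column_name in df.columns:
    df = df.withColumn(column_name, df[column_name].cast(IntegerType()))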

# Show schema to confirm
df.printSchema()

# Save processed features to 'curated' container
df.write.mode("overwrite").parquet(f"abfss://curated@{storage_account}.dfs.core.windows.net/criteo-1m-features")
print("✅ Feature-engineered data written to curated/criteo-1m-features")



### Step 2: Why Are We Performing Sampling?

This dataset contains 1 million observations from a real-world ad campaign with a binary treatment variable:
- `treatment = 1`: user was shown the ad
- `treatment = 0`: user was not shown the ad (control group)

However, the treatment distribution is highly **imbalanced**:
- ~85% of the users were treated
- ~15% were in control

#### Business Implication:
In uplift modeling, the model needs to learn **differential responses** between the treatment and control groups. If one group (control) is underrepresented, the model will struggle to accurately estimate the **causal effect** of the treatment (e.g., whether the ad really changed behavior).

#### Technical Implication:
We apply **stratified sampling** to:
1. **Reduce overall data size** for fast iteration and modeling (roughly 150k rows, 15% of each group, instead of 1M)
2. **Preserve the treatment/control ratio** so the control group is not under-sampled by chance

This sampling speeds up iteration while keeping enough control users for reliable uplift estimates, which helps the model identify which users to target in future campaigns.


#### A common question:

“If the original dataset has 85% treated and 15% control, and we use stratified sampling, aren’t we just keeping the same imbalance in the sampled dataset? What’s the point of sampling then?”

#### Short answer:

Yes, the class imbalance remains the same after stratified sampling, because we intentionally preserve it. The goal of sampling here is not to fix the imbalance but to reduce dataset size while keeping the same treatment/control structure for modeling.

#### Long answer (business and technical perspective):
**Business logic:**

In real-world marketing or product campaigns (like Criteo’s), most users are exposed to the treatment (the ad), while a smaller set remains in control.

If we artificially balanced the groups during modeling, the uplift model would no longer reflect the real campaign distribution.

So we want the model to learn patterns based on realistic exposure rates.

**Technical explanation:**

We use stratified sampling to keep the proportion of treated and control users the same as in the full dataset.

With simple random sampling, the control share in the sample can drift away from 15% by chance, and uplift modeling relies on comparing treated and control groups.

In summary:

- The imbalance is kept intentionally.
- Sampling only reduces size (for faster iteration and lower memory use); it does not balance the groups.

**Important note:**

If we wanted to test a balanced setting (say 50% treated / 50% control) for research purposes, we would deliberately under-sample treated users. That would be a separate experiment, not the default.
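
The difference between proportional stratified sampling and deliberate balancing can be sketched in plain Python. The block below uses a synthetic 85/15 population (not the Criteo data) and a hypothetical `sample_by` helper that mimics per-stratum Bernoulli sampling; the fraction `15 / 85` for the balanced variant is an assumption derived from the group shares stated above.

```python
import random


def sample_by(rows, key, fractions, seed):
    """Keep each row independently with the probability assigned to its stratum."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions[r[key]]]


def treated_share(rows):
    """Fraction of rows with treatment == 1."""
    return sum(r["treatment"] for r in rows) / len(rows)


# Synthetic population: 85% treated, 15% control (illustrative only)
population = [{"treatment": 1}] * 85_000 + [{"treatment": 0}] * 15_000

# Same fraction in every stratum: the sample shrinks, the ratio is preserved
proportional = sample_by(population, "treatment", {0: 0.15, 1: 0.15}, seed=42)

# Keep all control users and under-sample treated users to approach 50/50
balanced = sample_by(population, "treatment", {0: 1.0, 1: 15 / 85}, seed=42)

print(f"proportional: n={len(proportional):,}, treated share={treated_share(proportional):.3f}")
print(f"balanced:     n={len(balanced):,}, treated share={treated_share(balanced):.3f}")
```

Because each row is kept independently, group sizes are approximate rather than exact, so it is worth checking the per-group counts after sampling.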

In [0]:
from pyspark.sql.functions import col
import matplotlib.pyplot as plt

# Step 1: Stratified Sampling — 15% from each treatment group
fractions = {0: 0.15, 1: 0.15}
sampled_df = df.sampleBy("treatment", fractions=fractions, seed=42).cache()

# Step 2: Show count by treatment group
sampled_df.groupBy("treatment").count().show()

# Step 3: Select only needed columns (features were renamed to I1–I12 in Step 1,
# so no column starts with "feature_")
feature_cols = [f"I{i}" for i in range(1, 13)]
selected_cols = ["treatment", "conversion"] + feature_cols
sampled_df = sampled_df.select([col(c) for c in selected_cols])

# Step 4: Optional — Create temp SQL view for inspection
sampled_df.createOrReplaceTempView("sampled_criteo")

# Step 5: Convert to Pandas
pandas_df = sampled_df.toPandas()

# Step 6: Plot with value labels
ax = pandas_df['treatment'].value_counts(normalize=True).sort_index().plot(
    kind='bar', 
    title='Treatment Group Distribution (Sampled)',
    color=['orange', 'skyblue']
)
ax.set_xlabel("Treatment Group")
ax.set_ylabel("Proportion")
ax.set_xticklabels(['Control (0)', 'Treatment (1)'], rotation=0)

# Add % labels on bars
for p in ax.patches:
    ax.annotate(f"{p.get_height()*100:.1f}%", 
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()


### Step 3: Label Engineering

#### Business Context:
To train an uplift model, a plain "conversion" target is not enough. The label must also record whether the user was treated, so the model can compare outcomes with and without the treatment.

This is done using a 4-way uplift classification:

- Class 0: Not Treated → Did Not Convert
- Class 1: Not Treated → Converted
- Class 2: Treated → Did Not Convert
- Class 3: Treated → Converted

Comparing these four classes helps the model distinguish natural converters from treatment-driven converters.
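
To make the encoding concrete, here is a minimal plain-Python sketch of how the four classes can be built from `(treatment, conversion)` pairs and how their counts yield an average uplift (treated conversion rate minus control conversion rate). The helper names and the toy counts are hypothetical and chosen only for illustration.

```python
from collections import Counter


def encode_uplift_label(treatment, conversion):
    """Map (treatment, conversion), each 0 or 1, to a class in 0..3."""
    return 2 * treatment + conversion


def decode_uplift_label(label):
    """Invert the encoding: returns (treatment, conversion)."""
    return divmod(label, 2)


def average_uplift(labels):
    """Treated conversion rate minus control conversion rate."""
    counts = Counter(labels)
    control_rate = counts[1] / (counts[0] + counts[1])
    treated_rate = counts[3] / (counts[2] + counts[3])
    return treated_rate - control_rate


# Toy data (made up): 100 control users with 3 conversions,
# 200 treated users with 10 conversions
pairs = [(0, 0)] * 97 + [(0, 1)] * 3 + [(1, 0)] * 190 + [(1, 1)] * 10
labels = [encode_uplift_label(t, y) for t, y in pairs]

print(sorted(Counter(labels).items()))
print(f"average uplift: {average_uplift(labels):.3f}")
```

The average uplift is a population-level quantity; an uplift model aims to estimate how this difference varies with the user features.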

In [0]:
# Assuming your DataFrame has 'treatment' and 'conversion' columns
# Encode uplift label: 2*treatment + conversion → gives 0 to 3
pandas_df["uplift_label"] = 2 * pandas_df["treatment"] + pandas_df["conversion"]


In [0]:
import matplotlib.pyplot as plt

# Step 1: Define uplift label meanings
uplift_label_names = {
    0: 'Control - No Conversion',
    1: 'Control - Conversion',
    2: 'Treated - No Conversion',
    3: 'Treated - Conversion'
}

# Step 2: Count and relabel uplift label distribution
uplift_counts = pandas_df['uplift_label'].value_counts().sort_index()
uplift_counts.index = uplift_counts.index.map(uplift_label_names)

# Step 3: Plot uplift label distribution
plt.figure(figsize=(10, 5))
bars = plt.bar(uplift_counts.index, uplift_counts.values, color='steelblue')
plt.title("Uplift Label Distribution")
plt.ylabel("Number of Users")
plt.xlabel("Uplift Label Category")

# Step 4: Annotate values on bars
for bar in bars:
    height = int(bar.get_height())  # cast so the label never shows a trailing ".0"
    plt.annotate(f'{height:,}', 
                 xy=(bar.get_x() + bar.get_width() / 2, height),
                 xytext=(0, 3), 
                 textcoords="offset points",
                 ha='center', va='bottom')

plt.xticks(rotation=20)
plt.tight_layout()
plt.show()


In [0]:
# Save the processed pandas dataframe for modeling
pandas_df.to_parquet("pandas_uplift_sampled.parquet", index=False)
print("✅ Saved processed dataframe to pandas_uplift_sampled.parquet")
