# 🛠️ **Notebook 02: Feature Engineering & Preprocessing**
### **1. Overview**
**Goal**: Prepare the raw data for machine learning models. In the previous EDA step, we identified issues like skewness, outliers, and class imbalance. In this notebook, we address those issues and construct new features to improve model performance.

**Key Steps**:

1. **Cleaning**: Dropping weak features.

2. **Feature Construction**: Creating interaction terms.

3. **Data Splitting**: Stratified Split to handle imbalance.

4. **Transformations**: Log Transform (Skewness) & Winsorizing (Outliers).

5. **Encoding**: Frequency Encoding for high-cardinality columns.

6. **Scaling**: Standardization.

In [1]:
import pandas as pd
import numpy as np 
import os 
import sys 
import pickle

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

if '../src' not in sys.path: 
    sys.path.append('../src')
import config 
import feature_engineering as fe

In [2]:
df = pd.read_csv(config.RAW_DATA_PATH)
print(f"Initial data shape: {df.shape}")
display(df.head())


Initial data shape: (50000, 33)


Unnamed: 0,user_id,age,country,city,reg_days,marketing_source,sessions_30d,sessions_90d,avg_session_duration_90d,median_pages_viewed_30d,...,support_tickets_2024,avg_csat_2024,emails_open_rate_90d,emails_click_rate_90d,review_count_2024,avg_review_stars_2024,rfm_recency,rfm_frequency,rfm_monetary,churn_label
0,U00001,20,Thailand,Bangkok,262,ads_fb,2,4,728.93,4.41,...,1,4.3,0.252,0.029,0,4.46,55,4,80.58,0
1,U00002,34,Indonesia,Jakarta,908,organic,2,6,671.11,7.75,...,0,4.27,0.388,0.023,0,4.79,59,2,49.11,0
2,U00003,31,Indonesia,Surabaya,406,referral,0,3,493.29,2.58,...,0,4.35,0.343,0.014,0,4.59,73,1,11.95,1
3,U00004,23,Malaysia,Johor Bahru,698,ads_fb,0,4,305.83,4.4,...,0,4.54,0.27,0.027,0,4.52,65,1,14.63,1
4,U00005,28,Vietnam,Ho Chi Minh City,650,influencer,1,7,946.16,6.04,...,0,4.04,0.212,0.073,1,4.79,68,5,116.32,1


### **2. Dropping & Cleaning**

**Description**: We start by removing columns that do not hold predictive value or might introduce noise:

- `user_id`: Unique identifier (irrelevant for patterns).

- `marketing_source` & `app_version_major`: Based on initial screening, these features showed weak correlation with Churn and added unnecessary complexity.

In [3]:
print(f"Dropping initial columns: {config.INITIAL_COLS_TO_DROP}")
df = fe.drop_weak_features(df, config.INITIAL_COLS_TO_DROP)
print(f"Shape after dropping columns: {df.shape}")

Dropping initial columns: ['user_id', 'marketing_source', 'app_version_major']
Dropped ['user_id', 'marketing_source', 'app_version_major']
Shape after dropping columns: (50000, 30)


### **3. Feature Construction (Interaction Features)**

**Description**: Sometimes, a combination of two features tells a better story than each one alone. We create "Interaction Features" based on business logic:

- `satisfaction_x_recency`: Combines CSAT and Days Inactive. A user who is both unhappy and inactive is at much higher risk than a happy inactive user.

- `gmv_per_session_90d`: Acts as a proxy for "Engagement Quality." It tells us how valuable each visit is, rather than just counting total visits.
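The two interaction features can be sketched as follows. This is a hypothetical stand-in for `fe.create_interaction_features`, whose source is not shown here. The sample rows displayed in this notebook are consistent with CSAT multiplied by `rfm_recency`, and with a monetary value divided by `sessions_90d + 1`. Which monetary column the real function uses is not visible, so `gmv_2024` below is an assumption.

```python
import pandas as pd


def create_interaction_features_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for fe.create_interaction_features."""
    out = df.copy()
    # Satisfaction score multiplied by days of inactivity.
    out["satisfaction_x_recency"] = out["avg_csat_2024"] * out["rfm_recency"]
    # Spend per session; the +1 avoids division by zero for users with no sessions.
    out["gmv_per_session_90d"] = out["gmv_2024"] / (out["sessions_90d"] + 1)
    return out


# First two rows of the sample shown in this notebook. gmv_2024 is hidden in the
# display, so the rfm_monetary values are reused here as an assumption.
sample = pd.DataFrame({
    "avg_csat_2024": [4.30, 4.27],
    "rfm_recency": [55, 59],
    "gmv_2024": [80.58, 49.11],
    "sessions_90d": [4, 6],
})
result = create_interaction_features_sketch(sample)
print(result[["satisfaction_x_recency", "gmv_per_session_90d"]].round(3))
```

Both features are computed row by row, so creating them before the train/test split does not leak information between subsets.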

In [4]:
print("Creating interaction features from Screening Round...")
df = fe.create_interaction_features(df)
print("Screening Round interaction features created.")

Creating interaction features from Screening Round...
Created interact features.
Screening Round interaction features created.


In [5]:
print(df.columns.tolist())
display(df.head())

['age', 'country', 'city', 'reg_days', 'sessions_30d', 'sessions_90d', 'avg_session_duration_90d', 'median_pages_viewed_30d', 'search_queries_30d', 'device_mix_ratio', 'orders_30d', 'orders_90d', 'orders_2024', 'aov_2024', 'gmv_2024', 'category_diversity_2024', 'days_since_last_order', 'discount_rate_2024', 'refunds_count_2024', 'refund_rate_2024', 'support_tickets_2024', 'avg_csat_2024', 'emails_open_rate_90d', 'emails_click_rate_90d', 'review_count_2024', 'avg_review_stars_2024', 'rfm_recency', 'rfm_frequency', 'rfm_monetary', 'churn_label', 'satisfaction_x_recency', 'gmv_per_session_90d']


Unnamed: 0,age,country,city,reg_days,sessions_30d,sessions_90d,avg_session_duration_90d,median_pages_viewed_30d,search_queries_30d,device_mix_ratio,...,emails_open_rate_90d,emails_click_rate_90d,review_count_2024,avg_review_stars_2024,rfm_recency,rfm_frequency,rfm_monetary,churn_label,satisfaction_x_recency,gmv_per_session_90d
0,20,Thailand,Bangkok,262,2,4,728.93,4.41,1,0.861,...,0.252,0.029,0,4.46,55,4,80.58,0,236.5,16.116
1,34,Indonesia,Jakarta,908,2,6,671.11,7.75,8,0.897,...,0.388,0.023,0,4.79,59,2,49.11,0,251.93,7.015714
2,31,Indonesia,Surabaya,406,0,3,493.29,2.58,1,0.917,...,0.343,0.014,0,4.59,73,1,11.95,1,317.55,2.9875
3,23,Malaysia,Johor Bahru,698,0,4,305.83,4.4,4,0.84,...,0.27,0.027,0,4.52,65,1,14.63,1,295.1,2.926
4,28,Vietnam,Ho Chi Minh City,650,1,7,946.16,6.04,8,0.511,...,0.212,0.073,1,4.79,68,5,116.32,1,274.72,14.54


### **4. Stratified Train-Test Split**

**Description**: This is a critical step. We split the data into Train (65%), Validation (15%), and Test (20%).

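- **Why Stratify?** Since our Churn rate is 25% (imbalanced), using `stratify=y` keeps the same 75:25 class ratio in every subset.

- **Prevention**: Splitting before fitting any data-dependent transformation (such as encoding and scaling) prevents "Data Leakage" from the validation and test sets into training, and the stratified split keeps our evaluation metrics comparable across subsets.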
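The 65/15/20 split takes two calls to `train_test_split`, and the second fraction is relative to what remains after the first call. The sketch below uses synthetic data. The values `0.20` and `0.1875` are inferred from the shapes printed in this notebook, not read from `config`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000
X_demo = pd.DataFrame({"feature": rng.normal(size=n)})
y_demo = pd.Series([1] * 500 + [0] * 1500, name="churn_label")  # 25% churn

# Step 1: hold out 20% of all rows as the test set.
X_full, X_te, y_full, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)
# Step 2: 15% of the total equals 0.15 / 0.80 = 0.1875 of the remaining 80%.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_full, y_full, test_size=0.1875, random_state=42, stratify=y_full
)

for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(f"{name}: n={len(part)}, churn rate={part.mean():.4f}")
```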

In [6]:
X = df.drop(columns=[config.TARGET_VARIABLE])
y = df[config.TARGET_VARIABLE]

X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y,
    test_size=config.TEST_SET_SIZE,
    random_state=config.RANDOM_STATE,
    stratify=y 
)

print(f"X_train_full shape: {X_train_full.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"Churn rate in y_train_full: {y_train_full.mean():.4f}")
print(f"Churn rate in y_test: {y_test.mean():.4f}")

X_train_full shape: (40000, 31)
X_test shape: (10000, 31)
Churn rate in y_train_full: 0.2500
Churn rate in y_test: 0.2500


In [7]:
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full,
    test_size=config.VALIDATION_SET_SIZE, 
    random_state=config.RANDOM_STATE,
    stratify=y_train_full 
)

print(f"X_train shape: {X_train.shape} (~65%)")
print(f"X_val shape: {X_val.shape} (~15%)")
print(f"X_test shape: {X_test.shape} (~20%)")
print(f"\nChurn rate in y_train: {y_train.mean():.4f}")
print(f"Churn rate in y_val: {y_val.mean():.4f}")
print(f"Churn rate in y_test: {y_test.mean():.4f}")

X_train shape: (32500, 31) (~65%)
X_val shape: (7500, 31) (~15%)
X_test shape: (10000, 31) (~20%)

Churn rate in y_train: 0.2500
Churn rate in y_val: 0.2500
Churn rate in y_test: 0.2500


### **5. Handling Skewness (Log Transformation)**

**Description**: As seen in the EDA histograms, features like `gmv_2024` (spending) and `sessions_90d` (activity) have a "Long Tail" (right-skewed).

- **Technique**: We apply `np.log1p`, which computes log(1 + x) and is therefore defined for zero values.

- **Effect**: This compresses the large values, making the distribution closer to "Normal" (Gaussian-like). This typically helps linear models (like Logistic Regression), which are sensitive to extreme feature values.
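A minimal illustration of the transform on made-up values (not taken from the dataset). It shows that `np.log1p` handles zeros and that `np.expm1` reverses it when original units are needed again:

```python
import numpy as np

values = np.array([0.0, 9.0, 99.0, 9999.0])  # illustrative right-skewed values
logged = np.log1p(values)                     # log(1 + x); defined at x = 0
restored = np.expm1(logged)                   # inverse transform

print(np.round(logged, 3))
print(np.allclose(restored, values))
```

Because the transform is a fixed function with no fitted parameters, applying it separately to the train, validation, and test sets does not introduce leakage.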

In [8]:
print(f"Applying log transformation to: {config.COLS_TO_LOG_TRANSFORM}")
X_train = fe.handle_skewness(X_train.copy(), config.COLS_TO_LOG_TRANSFORM)
X_val = fe.handle_skewness(X_val.copy(), config.COLS_TO_LOG_TRANSFORM)
X_test = fe.handle_skewness(X_test.copy(), config.COLS_TO_LOG_TRANSFORM)
print("Skewness handled.")

Applying log transformation to: ['gmv_2024', 'sessions_90d']
Skewness handled.


### **6. Outlier Handling (Winsorizing)**

**Description**: We have extreme values in `days_since_last_order` and in the spending-related features.

- **Technique**: Instead of deleting these rows (which loses information), we use Winsorizing. We cap values at the 1st and 99th percentiles.

- **Benefit**: This keeps the data points but prevents extreme outliers from distorting the model's weights / gradients.
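One way to winsorize without leaking information is to learn the percentile caps on the training split and reuse them for the other splits. The sketch below assumes that approach. The helpers `fit_caps` and `apply_caps` are hypothetical, and how `fe.handle_outliers` computes its caps is not shown in this notebook.

```python
import pandas as pd


def fit_caps(train: pd.DataFrame, cols, lower=0.01, upper=0.99) -> dict:
    """Learn per-column (low, high) caps from the training split only."""
    return {c: (train[c].quantile(lower), train[c].quantile(upper)) for c in cols}


def apply_caps(df: pd.DataFrame, caps: dict) -> pd.DataFrame:
    """Clip each column to the caps learned on the training split."""
    out = df.copy()
    for col, (low, high) in caps.items():
        out[col] = out[col].clip(lower=low, upper=high)
    return out


# Synthetic data: training values 1..100, plus a test set with two extremes.
train_demo = pd.DataFrame({"days_since_last_order": [float(v) for v in range(1, 101)]})
test_demo = pd.DataFrame({"days_since_last_order": [0.0, 50.0, 500.0]})

caps = fit_caps(train_demo, ["days_since_last_order"])
test_capped = apply_caps(test_demo, caps)
print(caps)
print(test_capped)
```

If the caps were instead recomputed on each split, the same raw value could be capped differently in training and in test.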

In [9]:
numerical_cols = [col for col in config.NUMERICAL_COLS_FOR_OUTLIERS if col in X_train.columns]
print(f"\nHandling outliers for {len(numerical_cols)} numerical columns...")
X_train = fe.handle_outliers(X_train.copy(), numerical_cols)
X_val = fe.handle_outliers(X_val.copy(), numerical_cols)
X_test = fe.handle_outliers(X_test.copy(), numerical_cols)
print("Outliers handled using Winsorizing.")


Handling outliers for 17 numerical columns...


Outliers handled using Winsorizing.


### **7. Categorical Encoding**

**Description**: Columns like `country` and `city` can have many unique values (High Cardinality).

- **Why not One-Hot?** One-Hot Encoding would create one new column per unique value, which inflates dimensionality (the "Curse of Dimensionality").

- **Solution**: We use **Frequency Encoding**. We replace each category (e.g., the city name) with the proportion of rows in which it appears. Each column is replaced by a single numeric column (`country_freq`, `city_freq`), so the feature count stays the same.
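Frequency encoding can be sketched as below. The helper `frequency_encode` is hypothetical and is not the code behind `fe.encode_categorical_features`. It assumes that frequencies are learned from the training split and that categories unseen in training map to 0.

```python
import pandas as pd


def frequency_encode(train: pd.DataFrame, others: list, col: str) -> list:
    """Replace `col` with `<col>_freq`, the category's share of training rows."""
    freq = train[col].value_counts(normalize=True)
    encoded = []
    for df in [train, *others]:
        out = df.drop(columns=[col])
        # Categories never seen in training get frequency 0.
        out[f"{col}_freq"] = df[col].map(freq).fillna(0.0).values
        encoded.append(out)
    return encoded


train_demo = pd.DataFrame({"city": ["Bangkok", "Bangkok", "Jakarta", "Surabaya"]})
test_demo = pd.DataFrame({"city": ["Jakarta", "Hanoi"]})  # "Hanoi" is unseen

train_enc, test_enc = frequency_encode(train_demo, [test_demo], "city")
print(train_enc)
print(test_enc)
```

One trade-off of this encoding is that two categories with the same frequency become indistinguishable to the model.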

In [10]:
print(f"\nApplying Frequency Encoding to: {config.HIGH_CARDINALITY_COLS}")
X_train, X_val, X_test = fe.encode_categorical_features(
    X_train.copy(), X_val.copy(), X_test.copy(), config.HIGH_CARDINALITY_COLS
)
print("Categorical features encoded.")
print("Final columns after encoding:", X_train.columns.tolist())


Applying Frequency Encoding to: ['country', 'city']


Categorical features encoded.
Final columns after encoding: ['age', 'reg_days', 'sessions_30d', 'sessions_90d', 'avg_session_duration_90d', 'median_pages_viewed_30d', 'search_queries_30d', 'device_mix_ratio', 'orders_30d', 'orders_90d', 'orders_2024', 'aov_2024', 'gmv_2024', 'category_diversity_2024', 'days_since_last_order', 'discount_rate_2024', 'refunds_count_2024', 'refund_rate_2024', 'support_tickets_2024', 'avg_csat_2024', 'emails_open_rate_90d', 'emails_click_rate_90d', 'review_count_2024', 'avg_review_stars_2024', 'rfm_recency', 'rfm_frequency', 'rfm_monetary', 'satisfaction_x_recency', 'gmv_per_session_90d', 'country_freq', 'city_freq']


### **8. Feature Scaling**

**Description**: Different features have different ranges (e.g., Age is 18-80, GMV is 0-10,000).

- **Technique**: We use `StandardScaler` to rescale each feature so that, on the training set, it has Mean = 0 and Variance = 1.

- **Important**: We `fit` the scaler ONLY on the Training set, and then transform Validation/Test sets. This strictly prevents data leakage.

In [11]:
scaler = StandardScaler()

numerical_cols = X_train.select_dtypes(include=np.number).columns.tolist()

X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_val[numerical_cols] = scaler.transform(X_val[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])
print("Numerical features scaled successfully.")


Numerical features scaled successfully.


In [12]:
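# Create the output directory first; open() fails if it does not exist yet.
os.makedirs(config.PROCESSED_DATA_PATH, exist_ok=True)
scaler_path = os.path.join(config.PROCESSED_DATA_PATH, 'scaler.pkl')
with open(scaler_path, 'wb') as f:
    pickle.dump(scaler, f)
print(f"Scaler saved to: {scaler_path}")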

Scaler saved to: ../data/processed/scaler.pkl


In [13]:
os.makedirs(config.PROCESSED_DATA_PATH, exist_ok=True)

X_train.to_csv(os.path.join(config.PROCESSED_DATA_PATH, 'X_train.csv'), index=False)
X_val.to_csv(os.path.join(config.PROCESSED_DATA_PATH, 'X_val.csv'), index=False)
X_test.to_csv(os.path.join(config.PROCESSED_DATA_PATH, 'X_test.csv'), index=False)

y_train.to_csv(os.path.join(config.PROCESSED_DATA_PATH, 'y_train.csv'), index=False, header=True)
y_val.to_csv(os.path.join(config.PROCESSED_DATA_PATH, 'y_val.csv'), index=False, header=True)
y_test.to_csv(os.path.join(config.PROCESSED_DATA_PATH, 'y_test.csv'), index=False, header=True)

print("All processed data artifacts have been saved to:", config.PROCESSED_DATA_PATH)
print("Files in processed data directory:", os.listdir(config.PROCESSED_DATA_PATH))

All processed data artifacts have been saved to: ../data/processed/
Files in processed data directory: ['scaler.pkl', 'X_test.csv', 'X_train.csv', 'X_val.csv', 'y_test.csv', 'y_train.csv', 'y_val.csv']


# 📝 **Summary**

**Conclusion:**
We have successfully transformed the raw, messy data into a clean, numerical format ready for machine learning algorithms.

**Key Achievements:**

1. **Addressed Data Quality:** Skewness and outliers have been handled using Log Transformation and Winsorizing, ensuring our model isn't confused by extreme values.
2. **Enriched Information:** New interaction features (e.g., `satisfaction_x_recency`) have been created to capture complex user behaviors that single features might miss.
3. **Prevented Leakage:** The train/val/test split was performed *before* any scaling or encoding to ensure the validity of our evaluation.
4. **Ready for Training:** All categorical variables are encoded, and numericals are scaled.

The processed datasets (`X_train`, `y_train`, etc.) and the `scaler` object have been saved to the `../data/processed/` folder.