
In [None]:
# IMPORTANT: SOME KAGGLE DATA SOURCES ARE PRIVATE
# RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES.
import kagglehub
kagglehub.login()


In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

engage_2_value_from_clicks_to_conversions_path = kagglehub.competition_download('engage-2-value-from-clicks-to-conversions')

print('Data source import complete.')


# MLP Project

**Engage2Value: From Clicks to Conversions**

*Predict purchase value from multi-session digital behavior using ML.*




### All about the project

In today’s digital age, every scroll, click, and visit to a website leaves behind a trail of data. Businesses spend heavily on online ads and marketing campaigns, but not all users end up making a purchase. Some just browse, others drop off halfway, and only a few complete a transaction.

This project is about building an intelligent system that can analyze user behavior across multiple digital touchpoints and predict how much a user might spend.

Using machine learning, I aim to learn patterns from users who purchased in the past and use that knowledge to predict the spending behavior for new users, that is, their **purchase value**.


### Why is this project important?
Accurately predicting purchase value helps businesses to:
* Maximize returns on ad spending
* Target the right audience more effectively
* Allocate resources smarter
* Personalize user experience to increase conversions


### Plan of the Project

I will follow a complete **end-to-end machine learning pipeline**, step by step:

1. **Understand the dataset**
   - Explore the provided training data set: its shape, features, data types and overall quality.
  
2. **Clean and prepare the data**
   - Handle the missing values
   - Remove duplicate rows
   - Encode all the categorical variables
   - Scale and impute numerical values as needed
   - Engineer useful features

3. **Perform Exploratory Data Analysis (EDA)**
   - Visualize key relationships and distributions
   - Identify potential driving factors of purchase behavior

4. **Build and evaluate multiple ML models**
   - Build different ML models like Linear Regression, Tree-based models, XGBoost, LightGBM, Multi-layer Perceptron, etc.
   - Use **R² Score** as the main evaluation metric

5. **Compare models and select the best**
   - Tune hyperparameters
   - Avoid overfitting/underfitting
   - Choose the model that performs best on validation data

6. **Make final predictions and submit**
   - Predict purchase value for unseen users
   - Generate submission file for Kaggle


### Objective

The project objectives are:
- Build a high-performing regression model
- Achieve an R² Score ≥ **0.45** on the Kaggle leaderboard. This score helps us understand how well our predictions match the actual values.
- Use **at least 3 different models** as per project requirement
- Document the entire pipeline clearly and neatly
- Be ready to explain every step during the viva voce


### 📘 Outcome
This project bridges the gap between theory and application. It transforms machine learning from a concept I’ve learned into a tool I can now use to solve real-world problems.

By the end of this project, I will have:
- Completed a real-world, business-oriented ML project
- Gained hands-on experience with data cleaning, EDA, modeling, and evaluation
- Produced a clean, reproducible, and professional Kaggle notebook




**Steps I followed**
- Data loading
- Data Preprocessing
- Exploratory Data Analysis
- Data visualization
- Statistical analysis
- Feature Engineering
- Train-Validation Split
- Building a baseline model
- Hyperparameter Tuning
- Comparison of models
- Predict target value from test data

The libraries allowed for this project are the following:

- NumPy - Used for numerical operations, especially arrays, matrices, and linear algebra.
- Pandas - For data manipulation and analysis using dataframes
- Matplotlib - A basic visualization library for plotting graphs like line, bar, scatter, etc.
- Scikit-learn - Core ML library used for models like LogisticRegression, RandomForest, preprocessing, metrics, train-test splitting, pipelines.
- XGBoost - Optimized gradient boosting library that often performs strongly on tabular data.
- Seaborn - Built on top of matplotlib, used for beautiful, statistical visualizations.
- Imblearn - For dealing with imbalanced datasets
- SciPy - Useful for scientific computing, advanced statistical functions, optimization, interpolation, etc.
- Pickle - Python's built-in library for saving/loading ML models or data serialization.
- re (regex) - Python’s built-in regular expressions module for pattern matching and text cleaning.
- Lightgbm - Gradient boosting model like XGBoost, but faster and more memory efficient for large datasets.
- Plotly - Interactive visualizations like 3D plots, dashboards, etc.


All these libraries are preinstalled in the Kaggle Notebook, but they still need to be imported before use.



##### R² score and RMSE (Root Mean Squared Error)

R² Score measures the proportion of variance in the target that is explained by the model.

It is used because it gives a scale-independent measure of fit: 1 is a perfect fit, 0 is no better than always predicting the mean, and the score can go negative when the model does worse than that. It quickly shows how much of the variation in purchaseValue is captured by my features.
It complements RMSE well: while RMSE shows the typical size of the error, R² shows the goodness of fit. RMSE is a key regression metric that measures the magnitude of prediction error in the units of the target. It penalizes larger errors more than smaller ones, making it useful when we want to be cautious about large deviations. Lower RMSE values indicate better model performance.

In this project, I used RMSE along with R² to evaluate and compare all regression models.
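
To make the two metrics concrete, here is a minimal sketch that computes both by hand with NumPy. The helper name `r2_and_rmse` and the toy numbers are made up for illustration and do not come from the competition data. Note that the second set of predictions gives a negative R², because it is worse than always predicting the mean.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) for two equal-length 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return r2, rmse

# Toy values for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
good_pred = [2.5, 5.5, 6.5, 9.5]   # close to the truth
bad_pred = [9.0, 7.0, 5.0, 3.0]    # worse than predicting the mean (6.0) everywhere

r2_good, rmse_good = r2_and_rmse(y_true, good_pred)
r2_bad, rmse_bad = r2_and_rmse(y_true, bad_pred)
print(f"good: R2={r2_good:.2f}, RMSE={rmse_good:.2f}")
print(f"bad:  R2={r2_bad:.2f}, RMSE={rmse_bad:.2f}")
```

In the notebook itself the same numbers come from `r2_score` and `mean_squared_error` in scikit-learn; the hand-written version is only meant to show what they measure.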


## My End-to-End ML Project Routine
1. Data Cleaning
- Check duplicates - Remove duplicate rows to avoid bias or redundancy.
- Check for variance - Drop constant or near-constant columns that have no predictive power.
- Check missing values - If >95% missing, drop the column as imputation will not be reliable. Else, decide imputation strategy (mean, median, mode, etc.).
- Check for outliers - Handle skewness (log/sqrt transforms) to make distributions more symmetric.
  
2. Data Preprocessing
- Imputing (fill missing values)
  - Numeric → Mean/Median
  - Categorical → Mode/Most frequent

- Encoding (for categorical features)
  - One-hot encoding or label encoding as per need.

- Scaling (normalize/standardize numeric features)
  - StandardScaler or MinMaxScaler for models sensitive to feature scale (SVM, KNN, etc.).


3. Feature Engineering
Create new features that may improve predictive power. Example: hour of day or day of week derived from a UNIX timestamp.

4. Model Training & Selection
- Train multiple baseline models (Linear, Tree-based, Boosting, etc.).
- Compare using validation metrics (R², RMSE for regression; F1, precision, recall for classification).
- Select Top 3 best-performing models.
  
5. Hyperparameter Tuning
- Use GridSearchCV / RandomizedSearchCV on the top 3 models.
- Optimize parameters like n_estimators, max_depth (trees), learning_rate (boosting), C, gamma (SVM)
- Re-compare tuned models.
6. Final Model Selection
- Pick the best model based on tuned results.
- Train on full training data (train + validation).
- Apply to test data for final predictions.
  
7. Evaluation & Interpretation
- Check performance on test set.
- Analyze feature importance (if applicable).
- Document all steps clearly for reproducibility.
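
Step 3 of the routine above mentions UNIX timestamps as a source of new features. A minimal sketch of that idea, assuming the `sessionStart` column holds UNIX time in seconds (the toy values and the `session_*` column names are made up for illustration):

```python
import pandas as pd

# Hypothetical sample; 'sessionStart' is assumed to be UNIX seconds
sample = pd.DataFrame({"sessionStart": [1500000000, 1500086400, 1500300000]})

# Convert the integer timestamps to timezone-aware datetimes (UTC)
ts = pd.to_datetime(sample["sessionStart"], unit="s", utc=True)

# Derive calendar features that a model can use directly
sample["session_hour"] = ts.dt.hour
sample["session_dayofweek"] = ts.dt.dayofweek        # Monday = 0, Sunday = 6
sample["session_is_weekend"] = ts.dt.dayofweek >= 5  # Saturday or Sunday

print(sample)
```

Whether such features help depends on the data, so they are worth checking against the validation score rather than assuming they improve it.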


Let me first import all the libraries required and allowed for this project.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import re
import pickle
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score  # regression metrics used in this project
from sklearn.ensemble import RandomForestRegressor
from scipy import stats
import xgboost as xgb
import lightgbm as lgb

### Loading the dataset

Once the libraries are imported, let me load my training dataset and see what it looks like. I am naming the training DataFrame pjtrn.

In [None]:
pjtrn = pd.read_csv("/kaggle/input/engage-2-value-from-clicks-to-conversions/train_data.csv")

### Know the features of the data set

Now let me get to know the dataset better: its shape, features and data types, before going on to preprocessing.

In [None]:
print("Shape of training data:")
pjtrn.shape

In [None]:
print("The first rows of data look like: ")
pjtrn.head()

In [None]:
pjtrn.info()

### Know the Data Types of Columns

In [None]:
print("The data types of the columns are:")
pjtrn.dtypes

Let me find out the categorical, numerical and boolean features in the dataset.

In [None]:
categorical_features = pjtrn.select_dtypes(include=['object']).columns.tolist()
print("Categorical Features:\n", categorical_features)

numerical_features = pjtrn.select_dtypes(include=['int64', 'float64']).columns.tolist()
print("\nNumerical Features:\n", numerical_features)

boolean_features = pjtrn.select_dtypes(include=['bool']).columns.tolist()
print("\nBoolean Features:\n", boolean_features)

This looks messy, so let me arrange the features in a table. I wrap each list in pd.Series so that lists of different lengths can share one DataFrame (the shorter columns are padded with NaN).

In [None]:
feature_type_df = pd.DataFrame({
    'Categorical Features': pd.Series(categorical_features),
    'Numerical Features': pd.Series(numerical_features),
    'Boolean Features': pd.Series(boolean_features)
})
feature_type_df

Now, I am going to find the descriptive statistics of the data set. These are needed to choose the imputation strategy while preprocessing the data.

In [None]:
pjtrn.describe()

Let me make it a bit easier to read.

In [None]:
pjtrn.describe().T

With describe() I got the count of values for each column, the mean, standard deviation, minimum and maximum, and the quartile values. The 50% row is the median, which matters when imputing skewed numerical columns, so I am also pulling it out on its own for easy reference.

In [None]:
pjtrn.median(numeric_only=True)

### Data Cleaning
Now that I know the features of my data set, I am going to clean it for better quality. I will
1. Check for duplicate rows and, if there are any, drop them to avoid redundancy.
2. Check for variance and drop columns with constant values, as they add no predictive power.
3. Check for missing values and drop columns with more than 95% missing, as they cannot be reliably imputed.
4. Check for outliers and transform the columns that are highly skewed.

First let me check for duplicates and drop them.

In [None]:
pjtrn.shape

In [None]:
duplicates_count = pjtrn.duplicated().sum()
print(f"Number of duplicate rows: {duplicates_count}")

if duplicates_count > 0:
    pjtrn.drop_duplicates(inplace=True)
    print("Duplicates removed.")
else:
    print("No duplicates found.")
print("Shape of new dataset after removing duplicates: ")
pjtrn.shape

There were 236 duplicate rows which cause redundancy, so I removed them.

Next, let me check for variance.

In [None]:
no_variance_cols = [col for col in pjtrn.columns if pjtrn[col].nunique() <= 1]


print(f"Number of no-variance columns: {len(no_variance_cols)}")
print("No-variance columns:")

for col in no_variance_cols:
    print(col)

pjtrn[no_variance_cols].head()


There are 21 columns which have no variance and add no value to the model training. So I am removing them.

In [None]:
pjtrn.drop(columns=no_variance_cols, inplace=True)
print("Dropped no-variance, constant value columns.")

In [None]:
pjtrn.shape

Next, I am checking the columns with missing values. The columns with more than 95% missing percentage can be removed.

In [None]:
print(pjtrn.isnull().sum())
pjtrn.isnull().sum().sum()

In [None]:
missing_pct = pjtrn.isnull().mean() * 100
print("Missing value percentages:\n", missing_pct.sort_values(ascending=False))

high_missing_cols = missing_pct[missing_pct > 95].index
print("\n \n High missing value columns:\n", high_missing_cols)

There are 4 columns with more than 95% missing values, so I am removing them.

In [None]:
pjtrn.drop(columns=high_missing_cols, inplace=True)
print("Dropped columns with >95% missing values:", list(high_missing_cols))

In [None]:
pjtrn.shape

Next I am gonna check the columns with outliers.

In [None]:
import numpy as np
import pandas as pd

# Select numeric columns only
numeric_cols = pjtrn.select_dtypes(include=[np.number]).columns

# Dictionary to store outlier counts
outlier_summary = {}

for col in numeric_cols:
    Q1 = pjtrn[col].quantile(0.25)
    Q3 = pjtrn[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers_count = ((pjtrn[col] < lower_bound) | (pjtrn[col] > upper_bound)).sum()

    # Store results if there are outliers
    if outliers_count > 0:
        outlier_summary[col] = outliers_count

# Convert to DataFrame for better display
outlier_df = pd.DataFrame(list(outlier_summary.items()), columns=['Column', 'Outlier Count'])

# Show table sorted by number of outliers
outlier_df = outlier_df.sort_values(by='Outlier Count', ascending=False).reset_index(drop=True)
print(outlier_df)


There are 5 columns with outliers. Let me see how these values are spread using boxplots.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Loop through only columns that have outliers
for col in outlier_df['Column']:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=pjtrn[col])
    plt.title(f"Outlier Visualization for {col}")
    plt.xlabel(col)
    plt.show()

The first 4 graphs showed the outliers clearly. But the 5th graph looked different, so I checked the column values.

In [None]:
pjtrn['gclIdPresent'].head()

In [None]:
pjtrn['gclIdPresent']


gclIdPresent is a binary feature, so the boxplot shows a line at 0 and a single point at 1. The outlier here is not a bad data point but simply the other category. Therefore, no outlier treatment is needed for this column.
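
Why the IQR rule flags a binary column can be seen in a small sketch. The toy series below (90 zeros and 10 ones) is made up for illustration and is not the real `gclIdPresent` distribution. When at least 75% of the values are 0, both quartiles are 0, the IQR is 0, and every 1 falls outside the bounds.

```python
import pandas as pd

# Toy binary column: mostly zeros, a few ones (illustrative values only)
flag = pd.Series([0] * 90 + [1] * 10)

q1 = flag.quantile(0.25)
q3 = flag.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Same rule as in the outlier cell above
n_flagged = int(((flag < lower) | (flag > upper)).sum())
print(f"bounds = [{lower}, {upper}], flagged = {n_flagged}")
```

All ten ones are reported as "outliers", although they are simply the minority category.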

The other four columns are skewed, so I will apply a log transformation (log1p) to the highly skewed ones and a square-root transformation to the moderately skewed ones.
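
A short sketch of what the log transformation does and how to undo it, using made-up values. `np.log1p` computes log(1 + x), so zeros stay at zero, and `np.expm1` is its exact inverse. This matters if the target column is among the transformed columns: the model then predicts on the log scale, and predictions would need `np.expm1` before being reported in the original units.

```python
import numpy as np
import pandas as pd

# Toy right-skewed values (illustrative only), including a zero and one extreme value
values = pd.Series([0, 1, 2, 3, 5, 8, 13, 1000], dtype=float)

logged = np.log1p(values)      # compress the long right tail
restored = np.expm1(logged)    # inverse transform back to the original scale

print("skew before:", round(values.skew(), 2))
print("skew after: ", round(logged.skew(), 2))
print("round trip ok:", np.allclose(restored, values))
```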

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Identify numeric columns (exclude binary like gclIdPresent)
numeric_cols = pjtrn.select_dtypes(include=[np.number]).columns.tolist()
binary_cols = [col for col in numeric_cols if pjtrn[col].nunique() <= 2]
numeric_cols = [col for col in numeric_cols if col not in binary_cols]

# Function to plot before/after transformation
def plot_before_after(col, transformed):
    fig, axes = plt.subplots(2, 2, figsize=(10, 6))

    # Before - Histogram
    sns.histplot(pjtrn[col], kde=True, ax=axes[0, 0])
    axes[0, 0].set_title(f"{col} - Histogram (Before)")

    # After - Histogram
    sns.histplot(transformed, kde=True, ax=axes[0, 1])
    axes[0, 1].set_title(f"{col} - Histogram (After)")

    # Before - Boxplot
    sns.boxplot(x=pjtrn[col], ax=axes[1, 0])
    axes[1, 0].set_title(f"{col} - Boxplot (Before)")

    # After - Boxplot
    sns.boxplot(x=transformed, ax=axes[1, 1])
    axes[1, 1].set_title(f"{col} - Boxplot (After)")

    plt.tight_layout()
    plt.show()

# 2. Transformation rules
transformed_data = pjtrn.copy()
transformation_summary = []

for col in numeric_cols:
    skewness = pjtrn[col].skew()

    if abs(skewness) > 1:  # High skew
        transformed_data[col] = np.log1p(pjtrn[col])  # log1p handles zero values safely
        transformation_summary.append((col, skewness, "Log Transformation"))
        plot_before_after(col, transformed_data[col])

    elif 0.5 < abs(skewness) <= 1:  # Moderate skew
        transformed_data[col] = np.sqrt(pjtrn[col])
        transformation_summary.append((col, skewness, "Square Root Transformation"))
        plot_before_after(col, transformed_data[col])

    else:
        transformation_summary.append((col, skewness, "No Transformation"))

# 3. Show summary table
transformation_df = pd.DataFrame(transformation_summary, columns=["Column", "Skewness", "Transformation Applied"])
print(transformation_df)


The outliers are also handled now.
So far, the duplicate rows, the no-variance columns and the high-missing-value columns are removed, and the skewed columns are transformed (the transformed values are stored in transformed_data).
Let me check the shape of pjtrn.

In [None]:
pjtrn.shape

### Getting Clean Dataset

Now, I have the clean dataset without duplicate rows, no-variance columns or high-missing-value columns, and with the skewed columns transformed. Now I can move on to preprocessing.

In [None]:
# Replace pjtrn with the transformed data
pjtrn = transformed_data.copy()

# Confirm the shape is the same but data values are updated
print(pjtrn.shape)
pjtrn.head()


## Data Preprocessing
Now that we have clean data, we can do some preprocessing to make our data more meaningful and useful for modelling.
1. Imputation (Handling Missing Values in Remaining Columns)
Even after removing high-missing columns, some columns may still have smaller percentages of missing values. I am handling them like this:
- Numerical columns → Fill with mean or median
- Categorical columns → Fill with mode (most frequent value)

2. Encoding (Converting Categorical → Numerical) - Converting categorical (text or label) data into numerical form so machine learning models can process it.
Machine learning models (like Random Forest, XGBoost) can’t handle raw text labels. So they have to be encoded to numerical values.
- One-Hot Encoding → For nominal categories (no order, e.g., “red”, “blue”)
- Label Encoding → For ordinal categories (has order, e.g., “low”, “medium”, “high”)

3. Scaling (Standardizing Numerical Values) - Adjusting numerical feature values to a common range or distribution without changing their relationships.
Scaling is needed for algorithms sensitive to feature magnitude (SVM, KNN, Logistic Regression), but tree-based models (RandomForest, XGBoost) don’t require it.
- Min-Max Scaler - Scales data to a fixed range, usually [0, 1]. Preserves the shape of the distribution but compresses values into the range.
- Standard Scaler (Z-score Normalization) - Centers data to mean 0 and standard deviation 1.
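
The two scalers can be compared on a tiny example. This is a sketch with made-up numbers; `train_col` and `valid_col` are hypothetical names. The scalers are fitted on the training values only and then applied to new data, which is why a new value can land outside [0, 1] after Min-Max scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy single-feature data (illustrative values only)
train_col = np.array([[1.0], [2.0], [3.0], [4.0]])
valid_col = np.array([[5.0]])   # larger than anything seen in training

# Fit on training data only, so no information leaks from the validation data
minmax = MinMaxScaler().fit(train_col)
standard = StandardScaler().fit(train_col)

train_minmax = minmax.transform(train_col)
train_std = standard.transform(train_col)
valid_minmax = minmax.transform(valid_col)

print("min-max (train):", train_minmax.ravel())
print("standard (train):", train_std.ravel())
print("min-max (new value 5.0):", valid_minmax.ravel())
```

Tree-based models are not affected by either transformation, so in this notebook scaling would mainly matter for a model such as SVR.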


### Imputing Missing Values
I am now going to impute the missing values in the columns with a low share of missing data.

In [None]:
numerical_cols = pjtrn.select_dtypes(include=['number']).columns.tolist()
categorical_cols = pjtrn.select_dtypes(include=['object', 'category']).columns.tolist()

print("Numerical columns:", numerical_cols)
print("Categorical columns:", categorical_cols)


In [None]:
# I am checking the skewness of columns

skewness = pjtrn[numerical_cols].skew().sort_values(ascending=False)
print("Skewness of numerical columns:")
print(skewness)


In [None]:
# Numerical columns skewness-based imputation
median_impute_cols = ['gclIdPresent', 'sessionNumber', 'purchaseValue']
mean_impute_cols = ['totalHits', 'pageViews', 'sessionStart', 'sessionId', 'date', 'userId']

# Impute median for high-skew columns (assign back instead of chained inplace, which newer pandas warns about)
for col in median_impute_cols:
    pjtrn[col] = pjtrn[col].fillna(pjtrn[col].median())

# Impute mean for low/moderate skew columns
for col in mean_impute_cols:
    pjtrn[col] = pjtrn[col].fillna(pjtrn[col].mean())

# Impute mode for categorical columns
for col in categorical_cols:
    pjtrn[col] = pjtrn[col].fillna(pjtrn[col].mode()[0])

print("Missing values imputed successfully.")


In [None]:
# Check for missing values in each column
missing_counts = pjtrn.isnull().sum()
print(missing_counts[missing_counts > 0])
print("Total missing values in dataset:", pjtrn.isnull().sum().sum())


### Encoding
As all missing values are imputed, next I am going to encode all the categorical variables.
Which encoding to use depends on the type and nature of the categorical variables:
1. One-Hot Encoding
   - Use for nominal categorical variables (categories with no order)
   - Example: colors (red, blue, green), product types, city names
   - Creates a new binary column for each category

2. Label Encoding
   - Use for ordinal categorical variables (categories with a clear order)
   - Example: low, medium, high; education levels (high school < college < masters)
   - Converts categories to integers (0, 1, 2, ...); the order is preserved only if the mapping is defined to follow it

So, first I have to know what type my categorical columns are.

In [None]:
# Step 1: Identify categorical columns
categorical_cols = pjtrn.select_dtypes(include=['object', 'category']).columns.tolist()
print("Categorical columns found:")
print(categorical_cols)

# Step 2: Display unique values for each categorical column to help decide type
"""for col in categorical_cols:
    unique_vals = pjtrn[col].unique()
    print(f"\nColumn: {col}")
    print(f"Unique values ({len(unique_vals)}): {unique_vals}")"""


From the results, I can see that all columns are nominal and nothing is ordinal. So, the best and safest approach is to use One-Hot Encoding for all of them.

In [None]:
#pjtrn = pd.get_dummies(pjtrn, columns=categorical_cols, drop_first=True)

#print("One-Hot Encoding applied. New shape:", pjtrn.shape)


The number of columns increased dramatically after encoding, which I am not comfortable with, so I have commented that cell out. Let me try another approach.
1. Check cardinality (unique values) of categorical columns before encoding
   - If some columns have very high cardinality (like userId or sessionId), encoding them fully isn’t useful.
   - Drop or exclude high-cardinality ID-like columns (e.g., userId, sessionId) from encoding because they don’t generalize well to unseen data.

2. Use alternative encoding for high-cardinality columns:
   - Target encoding (mean target value per category)
   - Frequency encoding (replace category with its frequency)
   - Hashing trick (hash categories into fixed number of bins)

3. One-Hot Encode only low-cardinality categorical columns and treat high-cardinality columns differently or drop them if not important.

In [None]:
for col in categorical_cols:
    print(f"{col}: {pjtrn[col].nunique()} unique values")


For low cardinality (< 30 unique) columns, use One-Hot Encoding.

For the remaining, high cardinality columns, use Frequency Encoding or Target Encoding. Frequency encoding is a simple way to convert categorical variables into numbers by replacing each category with the frequency (count or proportion) of that category in the dataset.
How it works:
- For each category in a column, count how many times it appears in the dataset.
- Replace every occurrence of that category with this count (or proportion).
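
A minimal sketch of frequency encoding on made-up city names. The frequencies are learned from the training column and reused on new data; a category that never appeared in training has no entry in the mapping, so it comes back as NaN and is filled with 0 here. Filling with 0 is one possible choice, not the only one.

```python
import pandas as pd

# Toy data (illustrative values only)
train_city = pd.Series(["Delhi", "Delhi", "Paris", "Delhi", "Lima"])
new_city = pd.Series(["Paris", "Tokyo"])   # 'Tokyo' never appears in training

# Learn the proportion of each category from the training column
freq = train_city.value_counts(normalize=True)

# Replace each category with its learned proportion
train_encoded = train_city.map(freq)
new_encoded = new_city.map(freq).fillna(0.0)   # unseen categories get 0

print(train_encoded.tolist())
print(new_encoded.tolist())
```

Keeping the learned `freq` mapping per column is what allows the test file to be encoded consistently with the training data later.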

In [None]:
# Lists based on the unique-value counts printed above
one_hot_cols = ['browser', 'geoCluster', 'trafficSource.campaign', 'geoNetwork.networkDomain',
                'os', 'geoNetwork.subContinent', 'trafficSource.medium', 'deviceType',
                'userChannel', 'geoNetwork.continent']

freq_encode_cols = ['trafficSource.keyword', 'geoNetwork.region', 'trafficSource',
                    'locationCountry', 'geoNetwork.city', 'geoNetwork.metro',
                    'trafficSource.referralPath']

# 1. Frequency Encoding
for col in freq_encode_cols:
    freq = pjtrn[col].value_counts(normalize=True)  # use proportion for scaling
    pjtrn[col] = pjtrn[col].map(freq)

# 2. One-Hot Encoding with drop_first=True to avoid dummy trap
pjtrn = pd.get_dummies(pjtrn, columns=one_hot_cols, drop_first=True)

print("Encoding done. New dataset shape:", pjtrn.shape)


In [None]:
# Check for categorical columns in the dataset
categorical_cols_after_encoding = pjtrn.select_dtypes(include=['object', 'category']).columns.tolist()

if len(categorical_cols_after_encoding) == 0:
    print("No categorical columns remain in the dataset.")
else:
    print("Categorical columns found:", categorical_cols_after_encoding)


## Train-Test Split
Now that I have clean data with no missing values and all values encoded, I can move to the train-test split of data for modelling.

In [None]:
print(pjtrn.columns.tolist())

purchaseValue is my target column, so I separate it from the features and split the data 80% / 20%. The 20% part is held out from training and used to validate the models (the code calls it X_test, y_test).

In [None]:
from sklearn.model_selection import train_test_split

# 'purchaseValue' is the target column
X = pjtrn.drop('purchaseValue', axis=1)  # Features
y = pjtrn['purchaseValue']               # Target

# Split 80% train, 20% test with random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")


## Training the dataset with ML Models
I am considering 5 different models. The training cell below fits the first four; SVR is described here for completeness.
1. Linear Regression
Linear Regression is the simplest regression model that assumes a linear relationship between input features and the continuous target variable. It fits a line (or hyperplane) that minimizes the sum of squared differences between actual and predicted values.
Why use it:
- Acts as a strong baseline to compare other models against.
- Easy to interpret coefficients.
- Fast to train and requires less computational power.

2. Random Forest Regressor
Random Forest is an ensemble of decision trees where each tree is trained on a random subset of data and features. The final prediction is the average of predictions from all trees.
Why use it:
- Handles nonlinear relationships well.
- Robust to outliers and noise.
- Less prone to overfitting compared to single decision trees.
- Requires minimal data preprocessing (no scaling needed).

3. Decision Tree Regressor
A Decision Tree Regressor splits the data into regions based on feature values by creating a tree of decisions. It predicts the target by averaging the values in each region (leaf node).
Why use it:
- Easy to understand and interpret (visualizable).
- Captures nonlinear relationships.
- Fast to train on moderate-sized datasets.
- Serves as a good simple nonlinear baseline before ensembles.
  
4. XGBoost Regressor
XGBoost (Extreme Gradient Boosting) is a powerful boosting algorithm that builds trees sequentially, where each new tree attempts to correct errors made by previous trees. It uses gradient descent optimization and regularization.
Why use it:
- Often achieves state-of-the-art performance on tabular data.
- Efficient and fast with built-in handling of missing values.
- Supports parallel processing.

5. Support Vector Regressor (SVR)
SVR extends Support Vector Machines to regression problems by fitting the best line within a margin (epsilon) around the data points. It tries to keep errors within this margin, effectively ignoring small errors.
Why use it:
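- Works best on small to medium datasets, since training time grows quickly with the number of rows.
- Can model nonlinear relationships using kernels (e.g., RBF kernel).
- Its epsilon-insensitive loss ignores small errors and grows only linearly for large ones, which gives some robustness to outliers.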
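
Since SVR is sensitive to feature scale, it is usually wrapped in a pipeline with a scaler. The sketch below shows that pattern on synthetic data. The sine-shaped data, the `*_demo` names and the parameter values are assumptions for illustration and are not tuned for this competition.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic nonlinear data (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=300)

# Scale first, then fit an RBF-kernel SVR; the scaler is fitted on training rows only
svr_pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
svr_pipeline.fit(X_demo[:200], y_demo[:200])

# Evaluate on the 100 rows that were not used for fitting
demo_r2 = r2_score(y_demo[200:], svr_pipeline.predict(X_demo[200:]))
print(f"SVR demo R2: {demo_r2:.3f}")
```

On the full training data SVR can be slow, so fitting it on a random subsample first is a reasonable way to check whether it is competitive.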

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
import xgboost as xgb

# Define models with their names
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'XGBoost': xgb.XGBRegressor(random_state=42, verbosity=0),

}

# Assuming X_train, y_train, X_test, y_test are already defined

from sklearn.metrics import mean_squared_error, r2_score

for name, model in models.items():
    # Create pipeline — if you have preprocessing, you can add it here
    pipeline = Pipeline([
        ('model', model)
    ])

    # Train
    pipeline.fit(X_train, y_train)

    # Predict
    y_pred = pipeline.predict(X_test)

    # Evaluate
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f"{name} Performance:")
    print(f"Mean Squared Error: {mse:.4f}")
    print(f"R-squared: {r2:.4f}")
    print("-" * 30)


## Hyperparameter Tuning
Now that I have seen which models perform best, I am going to tune their hyperparameters to improve the results.
Hyperparameter tuning is the process of systematically searching for the best combination of hyperparameters (settings) for a machine learning model to optimize its performance. Unlike model parameters learned during training (e.g., weights in regression), hyperparameters are set before training and control the learning process.
- Different hyperparameter values can significantly impact model accuracy, generalization, and training time.
- Tuning helps prevent overfitting or underfitting by finding the right balance.
- It enables models to learn the underlying patterns more effectively.

Models and their Important Hyperparameters
- Random Forest - Number of trees (n_estimators), max tree depth (max_depth), min samples per leaf (min_samples_leaf)
- XGBoost - Learning rate (learning_rate, also called eta), number of trees (n_estimators), max depth (max_depth), subsample ratio (subsample)
- Decision Tree - Max depth (max_depth), min samples split (min_samples_split)
- Support Vector Regressor (SVR) - Kernel type (kernel), regularization parameter (C), epsilon (epsilon)
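
The cost of a search is roughly the number of parameter combinations times the number of cross-validation folds. A small sketch that counts the model fits for a Random Forest grid like the one used in this section (`demo_grid` repeats those values under a hypothetical name; the counts leave out the final refit on the whole training split):

```python
from sklearn.model_selection import ParameterGrid

# Same values as the Random Forest grid used in this section
demo_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

cv_folds = 3
n_candidates = len(ParameterGrid(demo_grid))   # every combination in the grid
grid_fits = n_candidates * cv_folds            # exhaustive grid search

n_iter = 20
random_fits = n_iter * cv_folds                # randomized search samples n_iter combinations

print(f"Grid search: {n_candidates} candidates, {grid_fits} fits")
print(f"Randomized search: {n_iter} candidates, {random_fits} fits")
```

This is why the grid has to stay small for GridSearchCV, while RandomizedSearchCV can cover a wider range for a fixed budget.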

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Define the model
rf = RandomForestRegressor(random_state=42)

# Define hyperparameter grid (keep small for GridSearchCV)
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=3,              # 3-fold cross-validation
    scoring='neg_mean_squared_error',  # Use MSE for scoring
    n_jobs=-1,         # Use all cores
    verbose=2
)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best parameters and best score
print("Best parameters found by GridSearchCV:")
print(grid_search.best_params_)

print(f"Best RMSE: {(-grid_search.best_score_)**0.5:.4f}")


In [None]:
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Define the model
rf = RandomForestRegressor(random_state=42)

# Define hyperparameter distributions (can be wider)
param_dist = {
    'n_estimators': [50, 100, 150, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=20,         # Number of parameter settings sampled
    cv=3,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=2,
    random_state=42
)

# Fit RandomizedSearchCV
random_search.fit(X_train, y_train)

# Best parameters and best score
print("Best parameters found by RandomizedSearchCV:")
print(random_search.best_params_)

print(f"Best RMSE: {(-random_search.best_score_)**0.5:.4f}")


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb

# Define pipelines for each model
pipelines = {
    'RandomForest': Pipeline([('model', RandomForestRegressor(random_state=42))]),
    'DecisionTree': Pipeline([('model', DecisionTreeRegressor(random_state=42))]),
    'XGBoost': Pipeline([('model', xgb.XGBRegressor(random_state=42, verbosity=0))])
}

# Hyperparameter distributions for tuning (inside 'model' step)
param_distributions = {
    'RandomForest': {
        'model__n_estimators': [50, 100, 150],
        'model__max_depth': [10, 20, None],
        'model__min_samples_split': [2, 5],
        'model__min_samples_leaf': [1, 2]
    },
    'DecisionTree': {
        'model__max_depth': [None, 5, 10, 20],
        'model__min_samples_split': [2, 5, 10],
        'model__min_samples_leaf': [1, 2, 4],
        'model__max_features': [None, 'sqrt', 'log2']  # 'auto' was removed for tree regressors in scikit-learn 1.3
    },
    'XGBoost': {
        'model__n_estimators': [50, 100, 150, 200],
        'model__max_depth': [3, 5, 7],
        'model__learning_rate': [0.01, 0.1, 0.2],
        'model__subsample': [0.6, 0.8, 1.0],
        'model__colsample_bytree': [0.6, 0.8, 1.0],
        'model__reg_alpha': [0, 0.1, 0.5],
        'model__reg_lambda': [1, 1.5, 2]
    }
}

results = {}

for model_name in ['RandomForest', 'DecisionTree', 'XGBoost']:
    print(f"\nTuning hyperparameters for {model_name}...\n")

    random_search = RandomizedSearchCV(
        estimator=pipelines[model_name],
        param_distributions=param_distributions[model_name],
        n_iter=20,
        cv=3,
        scoring='neg_mean_squared_error',
        n_jobs=-1,
        verbose=2,
        random_state=42
    )

    random_search.fit(X_train, y_train)

    best_params = random_search.best_params_
    best_rmse = (-random_search.best_score_)**0.5

    print(f"Best parameters for {model_name}: {best_params}")
    print(f"Best RMSE for {model_name}: {best_rmse:.4f}")

    # Save results for reference
    results[model_name] = {'best_params': best_params, 'best_rmse': best_rmse}


## Validation on Test Data

In [None]:
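test_df = pd.read_csv("/kaggle/input/engage-2-value-from-clicks-to-conversions/test_data.csv")
# NOTE: `best_model` is not defined earlier in this notebook; it should be the fitted
# estimator chosen during tuning. X_test must hold the test_df features, preprocessed
# like the training data, not the hold-out split created by train_test_split.
y_pred_test = best_model.predict(X_test)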


### Submitting the file

In [None]:
submission = pd.DataFrame({
    'id': test_df['id'],       # Replace 'id' with the actual ID column name in your test set
    # Use the test-set predictions and undo the log1p transform applied to the target
    'purchaseValue': np.expm1(y_pred_test)
})

# Save submission to CSV
submission.to_csv('submission.csv', index=False)

print("Submission file 'submission.csv' created successfully.")

# L1

In [None]:
import numpy as np

# 4. Outlier handling (numeric columns only)
numeric_cols = pjtrn.select_dtypes(include=[np.number]).columns

for col in numeric_cols:
    Q1 = pjtrn[col].quantile(0.25)
    Q3 = pjtrn[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers_count = ((pjtrn[col] < lower_bound) | (pjtrn[col] > upper_bound)).sum()
    print(outliers_count)
    """if outliers_count > 0:
        print(f"{col}: {outliers_count} outliers detected — capping values.")
        pjtrn[col] = np.where(pjtrn[col] < lower_bound, lower_bound,
                              np.where(pjtrn[col] > upper_bound, upper_bound, pjtrn[col]))"""


Let me visualize the data I have.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
sns.histplot(pjtrn['purchaseValue'], bins=50, kde=True)
plt.title('Distribution of Purchase Value')
plt.xlabel('Purchase Value')
plt.ylabel('Count')
plt.show()


This plot shows that `purchaseValue` is highly right-skewed with many low values and a few very high outliers.
Log transformation is applied to normalize this distribution for better model performance.
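
Because the model is trained on the log-transformed target, its predictions come out on the log scale and have to be mapped back before they are reported as purchase values. A minimal sketch of that round trip, using made-up values rather than the competition data:

```python
import numpy as np

# Hypothetical purchase values (illustrative only): zeros plus a few large amounts.
purchase_value = np.array([0.0, 0.0, 12.5, 80.0, 15000.0])

# log1p(x) = log(1 + x) is defined at 0, unlike a plain log.
purchase_value_log = np.log1p(purchase_value)

# expm1 is the inverse of log1p and restores the original scale.
restored = np.expm1(purchase_value_log)

print(purchase_value_log.round(3))
print(np.allclose(restored, purchase_value))
```

The same `np.expm1` call is what would be applied to predictions of `purchaseValue_log` before writing them out.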


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

pjtrn['purchaseValue_log'] = np.log1p(pjtrn['purchaseValue'])

plt.figure(figsize=(10, 6))
sns.histplot(pjtrn['purchaseValue_log'], bins=50, kde=True)
plt.title('Distribution of Log-Transformed Purchase Value')
plt.xlabel('Log(Purchase Value)')
plt.ylabel('Count')
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.heatmap(pjtrn.isnull(), cbar=False, cmap="viridis")
plt.title('Missing Values Heatmap')
plt.show()


This visual highlights which features have missing values and how widespread they are.
It helps me decide on the right imputation strategy, such as mean, median, or mode.


In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(pjtrn.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

The correlation heatmap helps identify strong linear relationships between numerical features.
I use this to detect multicollinearity and select features most related to the target.

In [None]:
plt.figure(figsize=(8, 5))
sns.boxplot(data=pjtrn, x='deviceType', y='purchaseValue')
plt.title('Purchase Value by Device Type')
plt.xticks(rotation=45)
plt.show()

This boxplot shows the spread of `purchaseValue` and the presence of outliers for each device type.
Plots like this guide me in deciding whether to cap outliers or apply transformations.


In [None]:
plt.figure(figsize=(10, 5))
sns.countplot(data=pjtrn, x='browser', order=pjtrn['browser'].value_counts().index)
plt.title('Count of Users by Browser')
plt.xticks(rotation=45)
plt.show()


Countplots display the distribution of categorical feature values.
This helps identify dominant categories and whether encoding is needed.


## Data Preprocessing

Now I need to find the missing values, duplicate rows and outliers in the dataset. For missing values, I need to know which columns are affected, the percentage missing in each, and how they rank. This is important because the training dataset needs to be clean for model training; otherwise the results will be biased.

### Missing Values Analysis
Let me start by finding the missing values in the columns of the dataset. I am using the pandas `isnull()` method for that.

In [None]:
pjtrn.isnull()

In [None]:
pjtrn.isnull().sum()

In [None]:
pjtrn.isnull().sum().sum()

In [None]:
missing_cols = pjtrn.columns[pjtrn.isnull().any()]
pjtrn[missing_cols].T

In [None]:
missing_percent = (pjtrn.isnull().sum() / len(pjtrn)) * 100
missing_percent

##### Missing Value Analysis Results
From this I understood that 11 columns have missing values, and a total of 882719 values are missing. The percentage of missing values per column was also computed. The columns without missing values are clean and ready to use. Columns with less than 30% missing values can be imputed without much risk. And it is wise to drop the columns that have more than 95% missing values, as they can barely contribute to model training.

- I observed that several features have a significant proportion of missing values.
- Features like `trafficSource.adContent`, `adwordsClickInfo.slot`, and others have more than **95% missing values**, making them less useful for modeling.
- Some features like `trafficSource.keyword`, `totals.bounces`, and `new_visits` have moderate missing values and could still carry useful information.

So, I am going to drop the columns with a very high share of missing values and impute the rest.
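
The drop-or-impute rule can be sketched as a small helper. This is only an illustration under assumptions: `split_by_missing` and the toy column names are hypothetical, and the 95% threshold is the one stated above.

```python
import numpy as np
import pandas as pd

def split_by_missing(df, drop_threshold=95.0):
    """Return (columns to drop, columns to impute) based on percent missing."""
    missing_percent = df.isnull().mean() * 100
    to_drop = missing_percent[missing_percent > drop_threshold].index.tolist()
    to_impute = missing_percent[
        (missing_percent > 0) & (missing_percent <= drop_threshold)
    ].index.tolist()
    return to_drop, to_impute

# Toy frame with hypothetical columns (not the competition data).
toy = pd.DataFrame({
    "mostly_empty": [np.nan] * 99 + [1.0],     # 99% missing
    "some_gaps": [np.nan] * 30 + [0.0] * 70,   # 30% missing
    "complete": range(100),                    # nothing missing
})

to_drop, to_impute = split_by_missing(toy)
print("drop:", to_drop)
print("impute:", to_impute)
```

Keeping the threshold as a parameter makes it easy to check how sensitive the column list is to the cutoff.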

In [None]:
print("Number of columns in the dataset:", pjtrn.shape[1])
print("Columns with more than 95% missing values:")
high_missing = missing_percent[missing_percent > 95]
print(high_missing)

Based on missing percentages and business understanding, I will **drop** features with more than 95% missing values and plan to **impute** or process the remaining ones accordingly.
First let me drop the highly missing value columns, those with more than 95% missing values.


In [None]:
pjtrn.drop(columns=high_missing.index, inplace=True, errors='ignore')

I dropped the columns with mostly missing values; let's see how many columns are left.

In [None]:
print("Number of columns left in the dataset:", pjtrn.shape[1])
missing_percent = (pjtrn.isnull().sum() / len(pjtrn)) * 100

low_missing = missing_percent[missing_percent > 0]
print("Remaining columns with missing values:")
print(low_missing)


I am generating the list of columns with missing data.

In [None]:
missing_cols = [
    'trafficSource.isTrueDirect',
    'trafficSource.keyword',
    'pageViews',
    'trafficSource.referralPath',
    'totals.bounces',
    'new_visits'
]

print("📌 Data types of missing columns:")
print(pjtrn[missing_cols].dtypes)


Let me sort them into categorical and numerical columns.

In [None]:
missing_dtypes = pjtrn[[
    'trafficSource.isTrueDirect',
    'trafficSource.keyword',
    'pageViews',
    'trafficSource.referralPath',
    'totals.bounces',
    'new_visits'
]].dtypes

numerical_missing_cols = missing_dtypes[missing_dtypes == 'float64'].index.tolist()
print(" Numerical columns with missing values:")
print(numerical_missing_cols)

categorical_missing_cols = missing_dtypes[missing_dtypes == 'object'].index.tolist()
print("\n Categorical columns with missing values:")
print(categorical_missing_cols)


In [None]:
missing_cols = missing_percent[missing_percent > 0].index.tolist()

categorical_missing = [col for col in missing_cols if pjtrn[col].dtype == 'object']
numerical_missing   = [col for col in missing_cols if pjtrn[col].dtype in ['float64', 'int64']]
boolean_missing     = [col for col in missing_cols if pjtrn[col].dtype == 'bool']

print(" Categorical Features with Missing Values:")
print(categorical_missing)

print("\n Numerical Features with Missing Values:")
print(numerical_missing)

print("\n Boolean Features with Missing Values:")
print(boolean_missing)


#### Imputing Missing Values

Having dropped the columns that were almost entirely missing, the next step is to impute the remaining columns in a meaningful way. I will handle different types of features separately:

Numerical columns ➝ Mean/Median

Categorical columns ➝ Mode (most frequent value)

Boolean columns ➝ Mode

Three categorical features, and three numerical features have to be imputed. Let me do the categorical first. Categorical values will be imputed with the mode of the column.

In [None]:
categorical_missing = ['trafficSource.isTrueDirect', 'trafficSource.keyword', 'trafficSource.referralPath']
# Find and display the mode for each categorical column
for col in categorical_missing:
    mode_value = pjtrn[col].mode()[0]
    print(f"Mode of '{col}': {mode_value}")


Let me impute the values.

In [None]:
pjtrn.fillna({
    'trafficSource.isTrueDirect': True,
    'trafficSource.keyword': '(not provided)',
    'trafficSource.referralPath': '/'
}, inplace=True)


Let me cross-check once more.

In [None]:
categorical_cols = pjtrn.select_dtypes(include=['object', 'category']).columns

missing_categorical = pjtrn[categorical_cols].isnull().sum()
missing_categorical = missing_categorical[missing_categorical > 0]

if missing_categorical.empty:
    print(" No missing values in categorical features.")
else:
    print(" Missing values still exist in these categorical columns:")
    print(missing_categorical)

Now I am going to impute the numerical values, starting by finding the mean, median and mode of each column.

In [None]:
numerical_cols_with_na = ['pageViews', 'totals.bounces', 'new_visits']

for col in numerical_cols_with_na:
    print(f"\n🔹 Column: {col}")
    print(f"Mode   : {pjtrn[col].mode()[0]}")
    print(f"Mean   : {pjtrn[col].mean()}")
    print(f"Median : {pjtrn[col].median()}")

Choosing between mean, median, or mode for imputing missing values depends on the distribution of the data and the nature of the variable.
1. Use Mean when:
- The data is normally distributed (symmetric).
- There are no extreme outliers.
- You want a value that represents the arithmetic average.

Good for: sensor readings, marks, heights, etc. Not good if there are spikes or extreme values (outliers).

2. Use Median when:
- The data is skewed, not symmetric.
- There are outliers.
- You want a robust measure that is not influenced by extremes.

Best choice when the feature has 0s and a few very large numbers (like page views or bounces).
Example: If most customers visit 1–2 pages but a few visit 100s, median is safer.

3. Use Mode when:
- The data is categorical or discrete with repeating values.
- Imputing values like "Yes"/"No", 0/1, or specific categories.
- Ideal for binary or categorical numerical-looking data.
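
To make the difference concrete, here is a small sketch on made-up values (not the project data) showing how a single extreme session moves the mean but not the median, and how the mode fills a binary flag:

```python
import pandas as pd

# Hypothetical page-view counts: most sessions are short, one is extreme.
page_views = pd.Series([1, 1, 2, 2, 3, 250])

print("mean  :", page_views.mean())    # pulled up by the single extreme session
print("median:", page_views.median())  # stays at a typical session length

# Hypothetical binary flag with a gap: the mode is the most frequent class.
bounce_flag = pd.Series([1.0, 1.0, 0.0, None, 1.0])
filled = bounce_flag.fillna(bounce_flag.mode()[0])
print(filled.tolist())
```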



In [None]:
missing_numerical_data = {
    'Feature': ['pageViews', 'totals.bounces', 'new_visits'],
    'Missing %': [0.006, 59.36, 30.60],
    'Likely Data Type/Behavior': ['Count-like, skewed', 'Binary (0/1)', 'Binary (0/1)'],
    'Best Imputation': ['Median', 'Mode', 'Mode']
}

imputation_plan_df = pd.DataFrame(missing_numerical_data)
imputation_plan_df

Now let me impute the values.

In [None]:
pjtrn['pageViews'] = pjtrn['pageViews'].fillna(pjtrn['pageViews'].median())

pjtrn['totals.bounces'] = pjtrn['totals.bounces'].fillna(pjtrn['totals.bounces'].mode()[0])

pjtrn['new_visits'] = pjtrn['new_visits'].fillna(pjtrn['new_visits'].mode()[0])

`pageViews` is a count-like numeric variable (e.g., 1.0, 3.0, 15.0...). Such variables often have skewed distributions (a few very large values), so the median is preferred over the mean: it is robust to outliers and gives a better central value. The mode is better suited to categorical or binary values; for a wide-ranging count it is usually not meaningful.

`totals.bounces` is imputed with the mode. Its likely values are 0 or 1 (bounce or not), so it is effectively a binary categorical feature, even though it is stored as a float (1.0 or 0.0). The mode works well here because it fills missing values with the most frequent class (either 0 or 1).

`new_visits` is also imputed with the mode, for the same reason: its likely values are 0 or 1 (new visit or not), so I simply assign the most frequent value.

Let me crosscheck once more for missing values.

In [None]:
pjtrn[['pageViews', 'totals.bounces', 'new_visits']].isnull().sum()

### Removing Duplicates
Next, I checked for the duplicate rows in the datasets.

In [None]:
duplicate_rows = pjtrn.duplicated()
print("Total number of duplicated rows:", duplicate_rows.sum())

In [None]:
pjtrn[duplicate_rows].head()

Removing the duplicate rows.

In [None]:
pjtrn.drop_duplicates(inplace=True)

print("Duplicates removed. New dataset shape:", pjtrn.shape)

### Removing Outliers

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

num_cols = ['pageViews', 'totals.bounces', 'new_visits', 'purchaseValue', 'totalHits']

plt.figure(figsize=(15, 8))
for i, col in enumerate(num_cols, 1):
    plt.subplot(2, 3, i)
    sns.boxplot(data=pjtrn, x=col)
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()

In [None]:
outlier_summary = {}

for col in num_cols:
    Q1 = pjtrn[col].quantile(0.25)
    Q3 = pjtrn[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = pjtrn[(pjtrn[col] < lower) | (pjtrn[col] > upper)]
    outlier_summary[col] = outliers.shape[0]
    print(f"{col}: {outliers.shape[0]} outliers")

import pandas as pd
outlier_df = pd.DataFrame.from_dict(outlier_summary, orient='index', columns=['# of Outliers'])
outlier_df


Handling the outliers in `pageViews` by capping at the 1st and 99th percentiles.

In [None]:
lower_pv = pjtrn['pageViews'].quantile(0.01)
upper_pv = pjtrn['pageViews'].quantile(0.99)
pjtrn['pageViews'] = pjtrn['pageViews'].clip(lower_pv, upper_pv)

Log transformation of the `purchaseValue` column:
1. Because the target variable is skewed
- As seen in the histogram earlier, purchaseValue is highly right-skewed.
- Skewed targets often hurt model performance, especially linear models and tree-based regressors.
2. Log transformation helps by:
- Compressing large values and reducing the effect of outliers.
- Making the target distribution more normal-like (bell-shaped).
- Helping models learn better relationships in the data.
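
As a rough illustration of the compression effect, the sketch below measures skewness before and after `log1p` on a synthetic right-skewed series. The log-normal data is an assumption chosen for the example, not the real `purchaseValue` column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed target (log-normal); stands in for a purchase amount.
target = pd.Series(rng.lognormal(mean=3.0, sigma=1.5, size=10_000))

skew_raw = target.skew()
skew_log = np.log1p(target).skew()

print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

A skewness close to 0 after the transform indicates a far more symmetric distribution for the model to fit.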

In [None]:
import numpy as np
pjtrn['purchaseValue_log'] = np.log1p(pjtrn['purchaseValue'])

In [None]:
lower_th = pjtrn['totalHits'].quantile(0.01)
upper_th = pjtrn['totalHits'].quantile(0.99)
pjtrn['totalHits'] = pjtrn['totalHits'].clip(lower_th, upper_th)
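
The two capping steps above follow the same pattern, so they could be wrapped in one helper. A minimal sketch, assuming a hypothetical function name `cap_to_percentiles` and made-up hit counts:

```python
import pandas as pd

def cap_to_percentiles(series, lower_q=0.01, upper_q=0.99):
    """Clip values outside the [lower_q, upper_q] quantile range."""
    lower, upper = series.quantile(lower_q), series.quantile(upper_q)
    return series.clip(lower, upper)

# Hypothetical hit counts with one extreme session.
hits = pd.Series(list(range(1, 100)) + [5000])
capped = cap_to_percentiles(hits)

print(len(capped) == len(hits))   # capping keeps every row
print(hits.max(), capped.max())   # the extreme value is pulled down to the cap
```

Note that capping does not remove rows: extreme values are replaced by the percentile bounds, so the dataset keeps its size.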

### Exploratory Data Analysis
EDA is the process of visually and statistically exploring the dataset to understand its structure, patterns, anomalies, and relationships between features before modeling.

Why do EDA?
- To understand the data distribution
- To detect outliers or inconsistencies
- To identify patterns or trends
- To see how features relate to the target variable
- To decide what transformations or preprocessing might be needed

Types of EDA
I am going to explore data using two main approaches:

1. Univariate Analysis - Studying one feature at a time

- For categorical features: Value counts, bar plots
- For numerical features: Histograms, box plots, summary statistics

2. Bivariate/Multivariate Analysis - Studying two or more features together

- Feature vs. Target (e.g., purchaseValue)
- Correlation matrix between numerical features
- Grouped aggregations

Tools I’ll Use
- pandas (for value counts and describe)
- matplotlib and seaborn (for plotting)
- plotly (for interactive graphs, if needed later)

In [None]:
# Step 1: Select only numerical columns using .select_dtypes
numerical_features = pjtrn.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Step 2: Check the list
print("Numerical Features:")
print(numerical_features)


In [None]:
categorical_features = pjtrn.select_dtypes(include=['object']).columns.tolist()
print("Categorical Features:")
print(categorical_features)


In [None]:
boolean_features = pjtrn.select_dtypes(include=['bool']).columns.tolist()
print("Boolean Features:")
print(boolean_features)


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
sns.histplot(pjtrn['purchaseValue'], bins=50, kde=True)
plt.title("Distribution of Purchase Value")
plt.xlabel("Purchase Value")
plt.ylabel("Count")
plt.show()


In [None]:
sns.boxplot(x=pjtrn['purchaseValue'])


### Feature Engineering

In [None]:
feature_summary = {
    'Numerical': len(numerical_features),
    'Categorical': len(categorical_features),
    'Boolean': len(boolean_features)
}

import pandas as pd
pd.DataFrame.from_dict(feature_summary, orient='index', columns=['Count'])


In [None]:
pjtrn[numerical_features].describe().T

Converting UNIX timestamp to date and time

In [None]:
import pandas as pd

pjtrn['session_date'] = pd.to_datetime(pjtrn['sessionStart'], unit='s')

pjtrn['session_hour'] = pjtrn['session_date'].dt.hour
pjtrn['session_weekday'] = pjtrn['session_date'].dt.weekday
pjtrn['session_month'] = pjtrn['session_date'].dt.month

pjtrn[['sessionStart', 'session_date', 'session_hour', 'session_weekday', 'session_month']].head()


In [None]:
pjtrn['has_keyword'] = pjtrn['trafficSource.keyword'].apply(lambda x: 0 if pd.isnull(x) or x == '(not provided)' else 1)

In [None]:
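# NOTE: missing referral paths were filled with '/' during imputation, so notnull()
# is 1 for every row here; build this flag before imputing to make it informative.
pjtrn['has_referral'] = pjtrn['trafficSource.referralPath'].notnull().astype(int)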

In [None]:
res_split = pjtrn['device.screenResolution'].str.extract(r'(?P<screen_width>\d+)x(?P<screen_height>\d+)')

pjtrn['screen_width'] = pd.to_numeric(res_split['screen_width'], errors='coerce')
pjtrn['screen_height'] = pd.to_numeric(res_split['screen_height'], errors='coerce')

In [None]:
pjtrn['browser_major_version'] = pjtrn['device.browserVersion'].str.extract(r'(\d+)').astype(float)

In [None]:
screen_split = pjtrn['screenSize'].str.extract(r'(?P<screen_width_px>\d+)x(?P<screen_height_px>\d+)')
pjtrn['screen_width_px'] = pd.to_numeric(screen_split['screen_width_px'], errors='coerce')
pjtrn['screen_height_px'] = pd.to_numeric(screen_split['screen_height_px'], errors='coerce')

In [None]:
pjtrn['device.isMobile'] = pjtrn['device.isMobile'].astype(int)

In [None]:
from sklearn.preprocessing import LabelEncoder

low_card_cols = ['deviceType', 'browser', 'userChannel', 'geoNetwork.continent', 'os']
le = LabelEncoder()
for col in low_card_cols:
    pjtrn[col] = le.fit_transform(pjtrn[col])

In [None]:
pjtrn = pd.get_dummies(pjtrn, columns=['device.language'], drop_first=True)

#### Correlation Analysis


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

numerical_cols = pjtrn.select_dtypes(include=['int64', 'float64']).columns

corr_matrix = pjtrn[numerical_cols].corr()

plt.figure(figsize=(14, 10))
sns.heatmap(corr_matrix, cmap='coolwarm', annot=True, fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix of Numerical Features")
plt.show()


### Feature Selection based on Correlation

To improve model performance and reduce noise, I computed the Pearson correlation of all numerical features with the target (`purchaseValue_log`). Features with an absolute correlation of at least 0.1 were retained. This helped in:

- Reducing irrelevant or noisy features
- Speeding up model training
- Improving model interpretability

The final feature set includes: `totalHits`, `pageViews`, `new_visits`, etc.


In [None]:
target_corr = corr_matrix['purchaseValue_log'].sort_values(ascending=False)
print("🔍 Top correlations with purchaseValue_log:")
print(target_corr)


What I am doing:
1. Get the correlations with the target
2. Take the absolute correlation (to catch both positive and negative relationships)
3. Select features with an absolute correlation of at least 0.1
4. Remove the target itself


In [None]:

target_corr = corr_matrix['purchaseValue_log']


abs_corr = target_corr.abs()


selected_features = abs_corr[abs_corr >= 0.1].index.tolist()


selected_features.remove('purchaseValue_log')

print("📌 Selected features based on correlation:")
print(selected_features)


Now I have the Final dataset for modeling

In [None]:
X = pjtrn[selected_features]
y = pjtrn['purchaseValue_log']

### Split the Dataset
X = my selected features from before

y = target variable (purchaseValue_log)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

## Model Training
Model training is the process where a machine learning algorithm learns patterns from the input features (X_train) and their corresponding target values (y_train). The model adjusts its internal parameters to minimize the error in predictions.

By training models on historical data, we enable them to make accurate predictions on unseen (test) data. A well-trained model captures meaningful patterns and generalizes well.

I will train multiple regression models such as Linear Regression, Decision Tree, Random Forest, and XGBoost. Each model will be evaluated using metrics like RMSE and R². Later, I’ll fine-tune the best-performing models using GridSearchCV for better accuracy.

### Train a Baseline Model - Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("🔍 Linear Regression Results:")
print("RMSE:", rmse)
print("R² Score:", r2)


### Train a Tree-Based Model - Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

y_rf_pred = rf.predict(X_test)

# Evaluation
rmse_rf = np.sqrt(mean_squared_error(y_test, y_rf_pred))
r2_rf = r2_score(y_test, y_rf_pred)

print("🌳 Random Forest Results:")
print("RMSE:", rmse_rf)
print("R² Score:", r2_rf)


In [None]:
print("\n📊 Model Comparison")
print(f"Linear Regression → RMSE: {rmse:.4f}, R²: {r2:.4f}")
print(f"Random Forest     → RMSE: {rmse_rf:.4f}, R²: {r2_rf:.4f}")

In [None]:
print("🔍 Features used in training:")
print(X_train.columns.tolist())

In [None]:
X = pjtrn[selected_features].copy()
y = pjtrn['purchaseValue_log'].copy()

# Ensure 'purchaseValue' and related columns are not in X
print(X.columns)


In [None]:
# Check highest pairwise correlation
abs_corr_matrix = X.corr().abs()
np.fill_diagonal(abs_corr_matrix.values, 0)  # Ignore self-correlation

highly_corr_pairs = abs_corr_matrix[abs_corr_matrix > 0.98].stack()
print(highly_corr_pairs)


In [None]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

y_rf_pred = rf.predict(X_test)

# Evaluation
rmse_rf = np.sqrt(mean_squared_error(y_test, y_rf_pred))
r2_rf = r2_score(y_test, y_rf_pred)

print("🌳 Random Forest Results:")
print("RMSE:", rmse_rf)
print("R² Score:", r2_rf)


While evaluating the Random Forest model, I initially observed an R² score of 1.0 — which was unusually perfect. Upon inspection, I found that the original `purchaseValue` column (which is the target itself) was mistakenly included as a feature.

This led to data leakage, causing the model to "cheat" and memorize the answer. After removing the leaked feature, the model performance was re-evaluated to reflect realistic results.

So,
- I dropped 'purchaseValue' from X.
- Used log-transformed purchase value as target
- And then split the train and test
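
A simple guard against this kind of leakage is to keep an explicit list of target-derived columns and filter them out of any automatically selected feature list. The sketch below uses synthetic data; the column names mirror the notebook, but the values and the `leaky_columns` list are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic frame: purchaseValue depends on pageViews, noise is unrelated.
df = pd.DataFrame({
    "pageViews": rng.integers(1, 20, size=500),
    "noise": rng.normal(size=500),
})
df["purchaseValue"] = df["pageViews"] * rng.uniform(5, 15, size=500)
df["purchaseValue_log"] = np.log1p(df["purchaseValue"])

target = "purchaseValue_log"
# Columns derived from the target must never be offered as features.
leaky_columns = ["purchaseValue"]

corr = df.corr()[target].abs()
naive = corr[corr >= 0.1].index.drop(target).tolist()
safe = [col for col in naive if col not in leaky_columns]

print("naive selection:", naive)   # includes the raw target
print("safe selection :", safe)
```

Correlation-based selection will always rank the raw target near the top, which is why the filter has to be explicit.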


In [None]:

X = pjtrn[selected_features].drop(columns=['purchaseValue'], errors='ignore')

y = pjtrn['purchaseValue_log']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

y_rf_pred = rf.predict(X_test)

# Evaluation
rmse_rf = np.sqrt(mean_squared_error(y_test, y_rf_pred))
r2_rf = r2_score(y_test, y_rf_pred)

print("🌳 Random Forest Results:")
print("RMSE:", rmse_rf)
print("R² Score:", r2_rf)


### Model Evaluation using Pipelines

To ensure clean and scalable modeling, I used Scikit-learn Pipelines for all 7 regression models. This allowed me to include preprocessing steps (like Standard Scaling) for models like SVR and KNN, while skipping it for tree-based models that don’t require it.

The pipeline ensured:
- Cleaner code
- Avoidance of data leakage
- Reproducible model comparisons

Each pipeline was trained and evaluated using RMSE and R² metrics.


### Why Use Pipelines?

Pipelines are a best practice in machine learning to streamline the workflow and ensure consistency.

Pipelines:
- Avoid data leakage by fitting transformations on the training data only and then applying them to the test data
- Ensure cleaner, modular, and reproducible code
- Allow preprocessing and modeling to be bundled together
- Support hyperparameter tuning across all steps using GridSearchCV

In this project, pipelines were especially useful for models like SVR and KNN which require scaling, while avoiding unnecessary preprocessing for tree-based models.
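
The leakage point is easiest to see with cross-validation: when the scaler sits inside the pipeline, it is refitted on the training part of every fold. A minimal sketch on synthetic data (the data and the fold setup are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic features on very different scales; only the first one drives y.
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
y = X[:, 0] + rng.normal(scale=0.1, size=200)

# The scaler lives inside the pipeline, so in every fold it is fitted on the
# training part only and then applied to the held-out part.
knn_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor(n_neighbors=5)),
])

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(knn_pipe, X, y, cv=cv,
                         scoring="neg_root_mean_squared_error")
print("RMSE per fold:", (-scores).round(3))
```

Scaling the full dataset before splitting would instead let statistics from the held-out rows influence the training features.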


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor
import numpy as np

models = {
    "Linear Regression": Pipeline([('model', LinearRegression())]),
    "Random Forest": Pipeline([('model', RandomForestRegressor(random_state=42))]),
    "Decision Tree": Pipeline([('model', DecisionTreeRegressor(random_state=42))]),
    "KNN": Pipeline([('scaler', StandardScaler()), ('model', KNeighborsRegressor())]),
    "SVR": Pipeline([('scaler', StandardScaler()), ('model', SVR())]),
    "Gradient Boosting": Pipeline([('model', GradientBoostingRegressor(random_state=42))]),
    "XGBoost": Pipeline([('model', XGBRegressor(random_state=42, verbosity=0))])
}

model_results = {}

for name, pipeline in models.items():
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    model_results[name] = {"RMSE": rmse, "R2": r2}

print("\n📊 Model Performance Comparison (via Pipeline)")
for name, result in model_results.items():
    print(f"{name:<20} → RMSE: {result['RMSE']:.4f}, R²: {result['R2']:.4f}")


I used StandardScaler because it standardizes the data by:
- Centering the mean at 0
- Scaling the variance to 1

This means it transforms numerical features so they all contribute equally during distance or margin-based calculations.


K-Nearest Neighbors (KNN):
KNN is a distance-based algorithm (uses Euclidean or similar distance). If one feature (like pageViews) has a larger range than another (like new_visits), it will dominate the distance metric. StandardScaler ensures all features contribute equally to the distance.

Support Vector Regressor (SVR):
SVR fits a function that keeps most points within a margin around its predictions. With the default RBF kernel, the result depends on distances between points, so features with large ranges dominate the kernel values and distort the margin calculations. Scaling is therefore strongly recommended for SVR to perform well.

I did not scale the features for the other models because they are tree-based (Decision Tree, Random Forest, Gradient Boosting, XGBoost). These models are not sensitive to feature scale: they split nodes based on feature thresholds, not distances, so scaling doesn't change how the splits are made.

For Linear Regression, scaling can help with interpreting coefficients, but it is not required for performance. Scaling would matter with regularization (like Ridge or Lasso), but here it’s plain Linear Regression.

So, I applied StandardScaler only to models like KNN and SVR because they are sensitive to feature scales, unlike tree-based models which are scale-invariant.
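
The effect of scale on distances can be checked by hand. The sketch below uses three made-up rows with a wide-range count and a 0/1 flag, and standardizes them with the same formula StandardScaler applies (column mean 0, unit variance):

```python
import numpy as np

# Hypothetical rows: column 0 is a wide-range count, column 1 is a 0/1 flag.
X = np.array([
    [10.0, 0.0],
    [12.0, 1.0],
    [300.0, 0.0],
])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw distances: the flag difference between rows 0 and 1 barely registers.
print("raw   :", dist(X[0], X[1]), dist(X[0], X[2]))

# Standardize each column to mean 0 and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print("scaled:", dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2]))
```

Before scaling, the count column decides almost entirely which rows are "near"; after scaling, a flip in the flag weighs about as much as a large change in the count.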

### Model Performance Comparison

I evaluated seven regression models using RMSE and R² metrics via Scikit-learn Pipelines. Here's a summary of the results:

- **XGBoost** delivered the best performance with RMSE of 3.3882 and R² of 0.7796.
- Tree-based ensemble models (Random Forest, Gradient Boosting, XGBoost) outperformed others significantly.
- **Linear Regression** performed the poorest, likely due to its assumption of linearity, which doesn’t hold well for this dataset.

These insights suggest that ensemble tree models are better suited for predicting `purchaseValue_log` in this case.


### Hyperparameter Tuning

To improve model performance, I applied `GridSearchCV` to fine-tune the hyperparameters of the top-performing models:

- **Random Forest**
- **Gradient Boosting**
- **XGBoost**

This method uses cross-validation to evaluate different combinations of parameters and selects the one with the best RMSE score.
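
One detail worth spelling out is the sign convention: scikit-learn always maximizes the score, so error metrics are exposed as negatives (`neg_root_mean_squared_error`) and the best score has to be negated to read it as an RMSE. A small sketch on synthetic data, with a deliberately trivial grid chosen only for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Synthetic linear data with noise of standard deviation 0.5.
X = rng.normal(size=(120, 2))
y = 3.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=120)

search = GridSearchCV(
    LinearRegression(),
    param_grid={"fit_intercept": [True, False]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X, y)

# best_score_ is the highest (least negative) mean score, so negate it for RMSE.
best_rmse = -search.best_score_
print(f"best CV RMSE: {best_rmse:.3f}")
```

With `scoring='neg_mean_squared_error'` the negated score is an MSE, which is why a square root is taken in that case.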

**Advantages of Hyperparameter Tuning:**
- Helps avoid overfitting or underfitting
- Enhances generalization to unseen data
- Boosts the overall predictive performance


Models to Tune:
- Random Forest
- Gradient Boosting
- XGBoost

Why these?
- They’re already top performers.
- They benefit a lot from hyperparameter tuning.
- And they’re all tree-based, so a common tuning pattern applies to all of them.

In [None]:
from sklearn.model_selection import GridSearchCV

# Define pipeline
rf_pipe = Pipeline([
    ('model', RandomForestRegressor(random_state=42))
])

# Define parameter grid
rf_param_grid = {
    'model__n_estimators': [50, 100],
    'model__max_depth': [None, 10, 20],
    'model__min_samples_split': [2, 5],
}

# Grid Search
rf_grid = GridSearchCV(estimator=rf_pipe,
                       param_grid=rf_param_grid,
                       scoring='neg_root_mean_squared_error',
                       cv=5,
                       n_jobs=-1,
                       verbose=1)

rf_grid.fit(X_train, y_train)

# Best score and parameters
print("🔍 Best RF RMSE (CV):", -rf_grid.best_score_)
print("🏆 Best RF Parameters:", rf_grid.best_params_)


GridSearchCV exhaustively tries all combinations of the hyperparameters provided in the grid. For each combination, it performs cross-validation (here 5-fold) to measure model performance. It returns the combination that gives the best performance score (in this case, the lowest RMSE).

Small search space:

My parameter grid is relatively small:
- n_estimators: 2 options (50, 100)
- max_depth: 3 options (None, 10, 20)
- min_samples_split: 2 options (2, 5)

→ 2 × 3 × 2 = 12 combinations

Since the total number of combinations is small (12), an exhaustive search is feasible and reliable.
GridSearch ensures that every single combination is evaluated, which gives me confidence that the selected hyperparameters are the best among the options I chose. Since all combinations are tested systematically, the results are deterministic and reproducible, provided random_state is fixed.
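
The size of the search can be verified without fitting anything. This sketch uses scikit-learn's `ParameterGrid` on the same Random Forest grid and multiplies by the number of folds:

```python
from sklearn.model_selection import ParameterGrid

rf_param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [None, 10, 20],
    "model__min_samples_split": [2, 5],
}

n_candidates = len(ParameterGrid(rf_param_grid))
n_folds = 5

print("candidate settings:", n_candidates)
print("model fits during the search:", n_candidates * n_folds)
```

So the 12 candidates translate into 60 model fits with 5-fold cross-validation, plus one final refit on the full training set.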

RandomizedSearchCV randomly samples a subset of combinations from the parameter space. It is useful when the search space is huge or training is time-consuming. In my case the grid was small and I wanted an exhaustive evaluation, so I did not need the sampling approximation of RandomizedSearchCV.
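For comparison, here is a minimal sketch of what the randomized alternative looks like. It is not part of this project's pipeline: the data is synthetic (`make_regression`), and the grid values and `n_iter=5` are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption, not the project's dataset).
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=42)

param_distributions = {
    "n_estimators": [10, 20, 30],
    "max_depth": [None, 3, 5],
    "min_samples_split": [2, 5],
}  # 3 x 3 x 2 = 18 possible combinations

random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,  # evaluate only 5 sampled combinations instead of all 18
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=42,
)
random_search.fit(X_demo, y_demo)

n_tried = len(random_search.cv_results_["params"])
print("Combinations tried:", n_tried, "of 18")
print("Best sampled parameters:", random_search.best_params_)
```

The trade-off is visible in `n_iter`: the search costs less, but the best combination in the full grid may never be sampled.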

In [None]:
gb_pipe = Pipeline([
    ('model', GradientBoostingRegressor(random_state=42))
])

gb_param_grid = {
    'model__n_estimators': [100, 200],
    'model__learning_rate': [0.05, 0.1],
    'model__max_depth': [3, 5]
}

gb_grid = GridSearchCV(estimator=gb_pipe,
                       param_grid=gb_param_grid,
                       scoring='neg_root_mean_squared_error',
                       cv=5,
                       n_jobs=-1,
                       verbose=1)

gb_grid.fit(X_train, y_train)

print("🔍 Best GB RMSE (CV):", -gb_grid.best_score_)
print("🏆 Best GB Parameters:", gb_grid.best_params_)


### 🎯 Hyperparameter Tuning: Gradient Boosting

To improve the performance of Gradient Boosting Regressor, I performed hyperparameter tuning using GridSearchCV with 5-fold cross-validation.

The grid included:
- `n_estimators`: [100, 200]
- `learning_rate`: [0.05, 0.1]
- `max_depth`: [3, 5]

**Best RMSE (CV):** 3.3855  
**Best Parameters:**
- learning_rate = 0.1
- max_depth = 5
- n_estimators = 100

Within this grid, these values gave the lowest cross-validated RMSE. Note that `max_depth = 5` is the largest depth tried, so a wider grid might find a slightly better setting.
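One way to check whether a setting leans toward overfitting is to compare training and cross-validation error for each candidate. The sketch below uses `return_train_score=True` on synthetic data; the data and the small grid are assumptions for illustration, not the project's actual setup.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption, not the project's dataset).
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=15.0, random_state=42)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid={"max_depth": [1, 3, 5], "n_estimators": [30]},
    scoring="neg_root_mean_squared_error",
    cv=3,
    return_train_score=True,  # also record the score on the training folds
)
search.fit(X_demo, y_demo)

res = search.cv_results_
gaps = []
for params, train, val in zip(res["params"], res["mean_train_score"], res["mean_test_score"]):
    train_rmse, val_rmse = -train, -val  # scores are negated RMSE
    gaps.append(val_rmse - train_rmse)
    print(params, "train RMSE:", round(train_rmse, 2), "CV RMSE:", round(val_rmse, 2))
```

A training RMSE far below the cross-validation RMSE suggests overfitting, while two similarly high values suggest underfitting.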


In [None]:
xgb_pipe = Pipeline([
    ('model', XGBRegressor(random_state=42, verbosity=0))
])

xgb_param_grid = {
    'model__n_estimators': [100, 200],
    'model__learning_rate': [0.05, 0.1],
    'model__max_depth': [3, 5]
}

xgb_grid = GridSearchCV(estimator=xgb_pipe,
                        param_grid=xgb_param_grid,
                        scoring='neg_root_mean_squared_error',
                        cv=5,
                        n_jobs=-1,
                        verbose=1)

xgb_grid.fit(X_train, y_train)

print("🔍 Best XGB RMSE (CV):", -xgb_grid.best_score_)
print("🏆 Best XGB Parameters:", xgb_grid.best_params_)


### 🎯 Hyperparameter Tuning: XGBoost Regressor

XGBoost was tuned using GridSearchCV with 5-fold cross-validation.

**Parameter grid included:**
- `n_estimators`: [100, 200]
- `learning_rate`: [0.05, 0.1]
- `max_depth`: [3, 5]

**Best RMSE (CV):** 3.3849  
**Best Parameters:**
- learning_rate = 0.05
- max_depth = 5
- n_estimators = 200

XGBoost emerged as the best-performing model in cross-validation, edging out Gradient Boosting by about 0.0006 RMSE (3.3849 vs. 3.3855).
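To put the margin between the two models in perspective, the reported cross-validation scores can be compared directly. The numbers below are the two RMSE values quoted in the sections above; the variable names are only for this illustration.

```python
gb_cv_rmse = 3.3855   # best CV RMSE reported for Gradient Boosting
xgb_cv_rmse = 3.3849  # best CV RMSE reported for XGBoost

abs_gap = gb_cv_rmse - xgb_cv_rmse
rel_gap_pct = 100 * abs_gap / gb_cv_rmse

print(f"Absolute gap: {abs_gap:.4f}")
print(f"Relative gap: {rel_gap_pct:.3f}%")
```

A gap this small may be within fold-to-fold variation; the `std_test_score` entry of each grid's `cv_results_` is one place to check that.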


In [None]:
xgb_best = xgb_grid.best_estimator_
y_pred_xgb = xgb_best.predict(X_test)

# Take the square root explicitly; the squared= argument was removed in scikit-learn 1.6.
rmse_xgb = mean_squared_error(y_test, y_pred_xgb) ** 0.5
r2_xgb = r2_score(y_test, y_pred_xgb)

print("📊 XGBoost (Tuned) Test Results:")
print("RMSE:", rmse_xgb)
print("R² Score:", r2_xgb)


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

rf_pipeline = Pipeline([
    ('model', RandomForestRegressor(random_state=42))
])

rf_param_grid = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [5, 10],
    'model__min_samples_split': [2, 5, 10]
}

rf_grid = GridSearchCV(
    rf_pipeline,
    rf_param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    verbose=1,
    n_jobs=-1
)

rf_grid.fit(X_train, y_train)

print("🔍 Best RF RMSE (CV):", -rf_grid.best_score_)
print("🏆 Best RF Parameters:", rf_grid.best_params_)


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

rf_pipe = Pipeline([
    ('model', RandomForestRegressor(random_state=42))
])

rf_params = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [None, 10, 20],
    'model__min_samples_split': [2, 5]
}

rf_grid = GridSearchCV(
    rf_pipe,
    param_grid=rf_params,
    scoring='neg_root_mean_squared_error',
    cv=5,
    n_jobs=-1,
    verbose=2
)

rf_grid.fit(X_train, y_train)


In [None]:
print(" Best RF Parameters:", rf_grid.best_params_)

best_rmse = -rf_grid.best_score_
print("Best RF RMSE (CV):", best_rmse)

best_rf_model = rf_grid.best_estimator_


In [None]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = best_rf_model.predict(X_test)

# Take the square root explicitly; the squared= argument was removed in scikit-learn 1.6.
rmse_test = mean_squared_error(y_test, y_pred) ** 0.5
r2_test = r2_score(y_test, y_pred)

print("Test RMSE:", rmse_test)
print("Test R² Score:", r2_test)


## Testing on the Real Test Data Set

In [None]:
import pandas as pd
pjtest = pd.read_csv('/kaggle/input/engage-2-value-from-clicks-to-conversions/test_data.csv')

pjtest.shape, pjtest.head()

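Missing values in the test set are imputed with statistics computed from the training set (`pjtrn`): the mode for the categorical and flag columns, and the median for `pageViews`.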
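The idea of filling test-set gaps with training-set statistics can be seen on a toy example. The frames and values below are made up for illustration and are not project data.

```python
import pandas as pd

# Toy frames (illustrative values only, not project data).
train_demo = pd.DataFrame({
    "pageViews": [1.0, 3.0, 5.0, 100.0],
    "keyword": ["(not provided)", "(not provided)", "shoes", None],
})
test_demo = pd.DataFrame({
    "pageViews": [2.0, None],
    "keyword": [None, "hats"],
})

# Statistics are computed on the training frame only.
page_views_median = train_demo["pageViews"].median()
keyword_mode = train_demo["keyword"].mode()[0]

# The test frame is filled with those training statistics.
test_demo["pageViews"] = test_demo["pageViews"].fillna(page_views_median)
test_demo["keyword"] = test_demo["keyword"].fillna(keyword_mode)
print(test_demo)
```

Using training statistics keeps the test set from influencing the preprocessing and ensures both sets are filled with the same values.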

In [None]:
for col in ['trafficSource.isTrueDirect', 'trafficSource.keyword', 'trafficSource.referralPath']:
    pjtest[col] = pjtest[col].fillna(pjtrn[col].mode()[0])

pjtest['pageViews'] = pjtest['pageViews'].fillna(pjtrn['pageViews'].median())
pjtest['totals.bounces'] = pjtest['totals.bounces'].fillna(pjtrn['totals.bounces'].mode()[0])
pjtest['new_visits'] = pjtest['new_visits'].fillna(pjtrn['new_visits'].mode()[0])


In [None]:
pjtest['pageViews_log'] = np.log1p(pjtest['pageViews'])
pjtest['totalHits_log'] = np.log1p(pjtest['totalHits'])

In [None]:
for col in ['browser', 'os', 'deviceType', 'geoNetwork.continent']:
    # Map category -> frequency rank in the training data; unseen categories become -1.
    rank_map = {cat: rank for rank, cat in enumerate(pjtrn[col].value_counts().index)}
    pjtest[col] = pjtest[col].map(rank_map).fillna(-1)


In [None]:
pjtest_encoded = pd.get_dummies(pjtest)
pjtest_encoded = pjtest_encoded.reindex(columns=X_train.columns, fill_value=0)

In [None]:
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline


xgb_final = Pipeline([
    ('model', XGBRegressor(
        learning_rate=0.05,
        max_depth=5,
        n_estimators=200,
        random_state=42,
        verbosity=0
    ))
])

xgb_final.fit(X, y)


In [None]:
pjtest_preds_log = xgb_final.predict(pjtest_encoded)

pjtest_preds = np.expm1(pjtest_preds_log)

In [None]:

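# Predictions on the model's output scale (no inverse transform applied here).
final_predictions = xgb_final.predict(pjtest_encoded)

# Frame keyed by sessionId (not written to disk in this cell).
submission = pd.DataFrame({
    'sessionId': pjtest['sessionId'],
    'purchaseValue': final_predictions
})

# Frame keyed by a running row id; this is the one saved below.
submission_df = pd.DataFrame({
    'id': range(len(pjtest)),
    'purchaseValue': final_predictions
})

# NOTE: the next cell overwrites submission.csv using pjtest_preds (np.expm1 applied).
submission_df.to_csv('submission.csv', index=False)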

print(" Submission file created: submission.csv")

In [None]:
submission = pd.DataFrame({
    'userId': pjtest['userId'],
    'purchaseValue': pjtest_preds
})

submission.to_csv('submission.csv', index=False)
print(" Submission file created: submission.csv")
submission.head()  # last expression in the cell, so the preview is displayed

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew

# Load your training data
df = pd.read_csv("/kaggle/input/engage-2-value-from-clicks-to-conversions/train_data.csv")

# Select numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns

# Function to classify distribution
def classify_distribution(skew_val):
    if pd.isna(skew_val):
        return "Likely Constant"
    elif abs(skew_val) < 0.5:
        return "Approximately Normal"
    elif skew_val > 0:
        return "Right Skewed"
    else:
        return "Left Skewed"

# Create results table
skew_results = []
for col in numeric_cols:
    skew_val = skew(df[col].dropna())
    dist_type = classify_distribution(skew_val)

    # Suggest handling
    if dist_type == "Right Skewed" and abs(skew_val) > 1:
        action = "Consider log/sqrt transformation"
    elif dist_type == "Left Skewed" and abs(skew_val) > 1:
        action = "Consider square/power transformation"
    elif dist_type == "Likely Constant":
        action = "Drop column (no variation)"
    else:
        action = "No transformation needed"

    skew_results.append({
        "Column": col,
        "Skewness": round(skew_val, 4) if not pd.isna(skew_val) else None,
        "Distribution": dist_type,
        "Recommended Action": action
    })

skew_df = pd.DataFrame(skew_results)

# Show table
print("\n📊 Skewness Summary Table")
display(skew_df)

# Plot histograms for each numeric column
for col in numeric_cols:
    plt.figure(figsize=(5,3))
    sns.histplot(df[col].dropna(), kde=True)
    plt.title(f"{col} Distribution\nSkewness: {skew(df[col].dropna()):.2f}")
    plt.show()


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew

# Load dataset
df = pd.read_csv("/kaggle/input/engage-2-value-from-clicks-to-conversions/train_data.csv")

# Select numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns

# Function to apply right transformation
def transform_column(series, skew_val):
    if pd.isna(skew_val) or series.nunique() <= 1:
        return None, "Drop column (no variation)"
    elif abs(skew_val) < 0.5:
        return series, "No transformation needed"
    elif skew_val > 1:
        return np.log1p(series), "Log(x+1) transformation"
    elif 0.5 < skew_val <= 1:
        return np.sqrt(series), "Square root transformation"
    elif skew_val < -1:
        return np.power(series, 2), "Square transformation"
    elif -1 <= skew_val < -0.5:
        return np.power(series, 3), "Cube transformation"
    else:
        return series, "No transformation applied"

# Store transformation summary
transform_summary = []
df_transformed = df.copy()

for col in numeric_cols:
    original_skew = skew(df[col].dropna())
    transformed_series, action = transform_column(df[col].dropna(), original_skew)

    if transformed_series is None:
        df_transformed.drop(columns=[col], inplace=True)
        transformed_skew = None
    else:
        df_transformed[col] = transformed_series
        transformed_skew = skew(transformed_series.dropna())

        # Plot before and after
        fig, axes = plt.subplots(1, 2, figsize=(10, 3))
        sns.histplot(df[col].dropna(), kde=True, ax=axes[0])
        axes[0].set_title(f"{col} - Before\nSkew: {original_skew:.2f}")
        sns.histplot(transformed_series.dropna(), kde=True, ax=axes[1])
        axes[1].set_title(f"{col} - After\nSkew: {transformed_skew:.2f}")
        plt.tight_layout()
        plt.show()

    transform_summary.append({
        "Column": col,
        "Original Skewness": round(original_skew, 4) if not pd.isna(original_skew) else None,
        "Transformation Applied": action,
        "Transformed Skewness": round(transformed_skew, 4) if transformed_skew is not None else None
    })

# Create and display summary table
transform_df = pd.DataFrame(transform_summary)
display(transform_df)


In [None]:
"""# --------------------
# 📍 MILESTONE 1
# --------------------

# ✅ Step 1: Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split

# ✅ Step 2: Load dataset
pjtrn = pd.read_csv("/kaggle/input/engage-2-value-from-clicks-to-conversions/train_data.csv")

# ✅ Step 3: Basic Info
print("Shape of training data:", pjtrn.shape)
print("\nColumns in the dataset:\n", pjtrn.columns.tolist())
print("\nData types:\n", pjtrn.dtypes)
print("\nSample rows:\n")
pjtrn.head()

# ✅ Step 4: Categorize feature types
dtype_map = pjtrn.dtypes
feature_type_df = pd.DataFrame({
    'Feature': pjtrn.columns,
    'Data Type': dtype_map.values,
})

categorical_features = feature_type_df[feature_type_df['Data Type'] == 'object']['Feature'].tolist()
numerical_features = feature_type_df[feature_type_df['Data Type'].isin(['int64', 'float64'])]['Feature'].tolist()
boolean_features = feature_type_df[feature_type_df['Data Type'] == 'bool']['Feature'].tolist()

# ✅ Step 5: Identify missing values
missing_check = pjtrn.isnull()
missing_count = missing_check.sum()
missing_values_df = missing_count.to_frame(name='Missing Values')
missing_values_df['% of Total Values'] = (missing_values_df['Missing Values'] / len(pjtrn)) * 100
missing_values_df = missing_values_df[missing_values_df['Missing Values'] > 0]
missing_values_df.reset_index(inplace=True)
missing_values_df.rename(columns={'index': 'Feature'}, inplace=True)
missing_values_df = missing_values_df.sort_values(by='Missing Values', ascending=False)

# ✅ Step 6: Impute missing categorical features with mode
pjtrn['trafficSource.isTrueDirect'] = pjtrn['trafficSource.isTrueDirect'].fillna(True)
pjtrn['trafficSource.keyword'] = pjtrn['trafficSource.keyword'].fillna('(not provided)')
pjtrn['trafficSource.referralPath'] = pjtrn['trafficSource.referralPath'].fillna('/')

# ✅ Step 7: Impute numerical features with mean
pjtrn['pageViews'] = pjtrn['pageViews'].fillna(pjtrn['pageViews'].mean())
pjtrn['totals.bounces'] = pjtrn['totals.bounces'].fillna(pjtrn['totals.bounces'].mean())
pjtrn['new_visits'] = pjtrn['new_visits'].fillna(pjtrn['new_visits'].mean())

# ✅ Step 8: EDA - Descriptive Statistics
pjtrn.describe(include='all').T

# ✅ Step 9: EDA - Visualizations
plt.figure(figsize=(12,6))
sns.histplot(pjtrn['purchaseValue'], bins=50, kde=True)
plt.title("Distribution of Purchase Value")
plt.show()

plt.figure(figsize=(10,5))
sns.boxplot(x=pjtrn['purchaseValue'])
plt.title("Boxplot of Purchase Value")
plt.show()

# ✅ Step 10: Train-Validation Split
X = pjtrn.drop(['purchaseValue'], axis=1)
y = pjtrn['purchaseValue']

# Drop high-missing-value features before modeling (optional)
high_missing = missing_values_df[missing_values_df['% of Total Values'] > 90]['Feature'].tolist()
X = X.drop(columns=high_missing)

# Simple encoding for categorical columns (optional early baseline)
X = pd.get_dummies(X, drop_first=True)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# ✅ Step 11: Baseline model (Linear Regression)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_val)

print("\nBaseline Model Evaluation:")
print("MSE:", mean_squared_error(y_val, y_pred))
print("R^2 Score:", r2_score(y_val, y_pred))"""


## Milestone 2

In [None]:
"""# Impute remaining numerical NaN values with median
X = X.fillna(X.median())"""


In [None]:
"""# Drop target column to get features
X = pjtrn.drop(columns=['purchaseValue'])

# Select only numerical features
X = X.select_dtypes(include=['int64', 'float64'])

# Fill remaining missing values in numerical features
X = X.fillna(X.median())

# Target variable
y = pjtrn['purchaseValue']"""


In [None]:
"""from sklearn.linear_model import LinearRegression

# Linear Regression Model
lin_reg = LinearRegression()
lin_reg.fit(X, y)
y_pred_lr = lin_reg.predict(X)

# Performance Metrics
from sklearn.metrics import mean_squared_error, r2_score
mse_lr = mean_squared_error(y, y_pred_lr)
r2_lr = r2_score(y, y_pred_lr)

print("Linear Regression Results:")
print(f"Mean Squared Error: {mse_lr}")
print(f"R2 Score: {r2_lr}")
"""

In [None]:
"""from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

# ----------- 2. Stochastic Gradient Descent Regressor -----------
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)

# Fit the model
sgd_reg.fit(X, y)

# Predict
y_pred_sgd = sgd_reg.predict(X)

# Evaluate
mse_sgd = mean_squared_error(y, y_pred_sgd)
r2_sgd = r2_score(y, y_pred_sgd)

print("SGD Regressor Results:")
print("Mean Squared Error:", mse_sgd)
print("R2 Score:", r2_sgd)
"""

In [None]:
"""from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)"""


In [None]:
"""sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
sgd_reg.fit(X_scaled, y)
y_pred_sgd = sgd_reg.predict(X_scaled)

# Evaluate
mse_sgd = mean_squared_error(y, y_pred_sgd)
r2_sgd = r2_score(y, y_pred_sgd)

print("Scaled SGD Regressor Results:")
print("Mean Squared Error:", mse_sgd)
print("R2 Score:", r2_sgd)"""


In [None]:
"""from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. Create a pipeline with scaling + SGD
sgd_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('sgd', SGDRegressor(random_state=42, max_iter=1000, tol=1e-3))
])

# 2. Define parameter grid
param_grid = {
    'sgd__alpha': [0.0001, 0.001, 0.01],               # Regularization strength
    'sgd__penalty': ['l2', 'l1', 'elasticnet'],        # Type of penalty
    'sgd__learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive'],
    'sgd__eta0': [0.001, 0.01, 0.1]                    # Initial learning rate
}

# 3. Grid search
grid_search = GridSearchCV(sgd_pipeline, param_grid, cv=5, scoring='r2', verbose=1, n_jobs=-1)
grid_search.fit(X, y)

# 4. Best parameters and score
print("Best Parameters:\n", grid_search.best_params_)
print("Best R2 Score from CV:", grid_search.best_score_)

# 5. Evaluate on full training data
best_sgd_model = grid_search.best_estimator_
y_pred_grid = best_sgd_model.predict(X)

from sklearn.metrics import mean_squared_error, r2_score
print("Final SGD on Full Data")
print("MSE:", mean_squared_error(y, y_pred_grid))
print("R²:", r2_score(y, y_pred_grid))
"""

In [None]:
"""from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# 1. Split Data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Scale the Data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# ------------------------------
# 🌲 Random Forest Regressor
# ------------------------------
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_val)

rf_mse = mean_squared_error(y_val, y_pred_rf)
rf_r2 = r2_score(y_val, y_pred_rf)

print("🌲 Random Forest Regressor Results:")
print("MSE:", rf_mse)
print("R² Score:", rf_r2)

# ------------------------------
# ⚡ XGBoost Regressor
# ------------------------------
xgb_model = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train_scaled, y_train)
y_pred_xgb = xgb_model.predict(X_val_scaled)

xgb_mse = mean_squared_error(y_val, y_pred_xgb)
xgb_r2 = r2_score(y_val, y_pred_xgb)

print("\n⚡ XGBoost Regressor Results:")
print("MSE:", xgb_mse)
print("R² Score:", xgb_r2)"""


In [None]:
"""from lightgbm import LGBMRegressor

lgb_model = LGBMRegressor(n_estimators=100, learning_rate=0.1, max_depth=6, random_state=42)
lgb_model.fit(X_train_scaled, y_train)
y_pred_lgb = lgb_model.predict(X_val_scaled)

lgb_mse = mean_squared_error(y_val, y_pred_lgb)
lgb_r2 = r2_score(y_val, y_pred_lgb)

print("🌿 LightGBM Regressor Results:")
print("MSE:", lgb_mse)
print("R² Score:", lgb_r2)"""


In [None]:
"""from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Define your base model
xgb = XGBRegressor(objective='reg:squarederror', random_state=42)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1, 0.2],
    'subsample': [0.7, 1.0],
    'colsample_bytree': [0.7, 1.0],
    'reg_alpha': [0, 0.1],
    'reg_lambda': [1, 10]
}

# Setup the GridSearch
grid_search = GridSearchCV(
    estimator=xgb,
    param_grid=param_grid,
    scoring='r2',
    cv=3,
    verbose=1,
    n_jobs=-1
)

# Fit to the data
grid_search.fit(X, y)

# Best results
print("Best Parameters:")
print(grid_search.best_params_)
print("Best R² Score from CV:", grid_search.best_score_)

# Evaluate the best model on full data
best_xgb = grid_search.best_estimator_
y_pred_best_xgb = best_xgb.predict(X)

from sklearn.metrics import mean_squared_error, r2_score
mse_best = mean_squared_error(y, y_pred_best_xgb)
r2_best = r2_score(y, y_pred_best_xgb)

print("\n🎯 Tuned XGBoost Regressor Results:")
print("MSE:", mse_best)
print("R² Score:", r2_best)"""


In [None]:
"""from xgboost import XGBRegressor

# Train the tuned XGBoost model again
tuned_xgb = XGBRegressor(
    colsample_bytree=0.7,
    learning_rate=0.2,
    max_depth=7,
    n_estimators=200,
    reg_alpha=0,
    reg_lambda=1,
    subsample=0.7,
    random_state=42
)

# Fit the model on the training data
tuned_xgb.fit(X, y)"""


In [None]:
"""# 1. Load Test Data
import pandas as pd
pjtst = pd.read_csv("/kaggle/input/engage-2-value-from-clicks-to-conversions/test_data.csv")

# 2. Preprocess Test Data — same steps as training
# (Use same encoders, scalers if any, and handle missing values)

# Example: Simple mode/mean imputation (update with exact steps you used)
pjtst['trafficSource.isTrueDirect'] = pjtst['trafficSource.isTrueDirect'].fillna(True)
pjtst['trafficSource.keyword'] = pjtst['trafficSource.keyword'].fillna('(not provided)')
pjtst['trafficSource.referralPath'] = pjtst['trafficSource.referralPath'].fillna('/')
pjtst['pageViews'] = pjtst['pageViews'].fillna(pjtst['pageViews'].median())
pjtst['totals.bounces'] = pjtst['totals.bounces'].fillna(pjtst['totals.bounces'].median())
pjtst['new_visits'] = pjtst['new_visits'].fillna(pjtst['new_visits'].median())

# 3. Select the same features used in training
X_test = pjtst[X.columns]  # same columns used to train XGBoost

# 4. Predict using final model
final_predictions = tuned_xgb.predict(X_test)

# 5. Create submission DataFrame
submission = pd.DataFrame({
    'sessionId': pjtst['sessionId'],    # Or any ID column required
    'purchaseValue': final_predictions  # This is the predicted target
})

# Create the submission DataFrame with correct column names
submission_df = pd.DataFrame({
    'id': range(len(pjtst['userId'])),                # Rename userId to id
    'purchaseValue': final_predictions
})

# Save to CSV in proper format
submission_df.to_csv('submission.csv', index=False)

print("✅ Submission file created: submission.csv")"""
