# Introduction
In this notebook, our goal is to build a predictive model that estimates how valuable a user is likely to be to the business. This involves:

- Analyzing user behavior and platform data

- Engineering meaningful features

- Applying regression models to predict user value

- Evaluating and tuning model performance


# Libraries Used
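- Pandas and NumPy: For Data Analysis
- Matplotlib and Seaborn: For Data Visualization
- Scikit-learn: For Model Selection, Preprocessing, Fitting, and Evaluation
- XGBoost: One of the Tree Regressor models used
- LightGBM: A further gradient-boosted tree library, imported below
- category_encoders: Provides the `TargetEncoder` imported below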

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, RobustScaler
from sklearn.metrics import r2_score
import xgboost as xgb
from category_encoders import TargetEncoder
from sklearn.ensemble import GradientBoostingRegressor, BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
import lightgbm as lgb
import warnings
import seaborn as sns
warnings.filterwarnings('ignore')

train = pd.read_csv('/kaggle/input/engage-2-value-from-clicks-to-conversions/train_data.csv')
test = pd.read_csv('/kaggle/input/engage-2-value-from-clicks-to-conversions/test_data.csv')
submission = pd.read_csv('/kaggle/input/engage-2-value-from-clicks-to-conversions/sample_submission.csv')

In [None]:
train.shape

In [None]:
train.info()

In [None]:
train.head()

In [None]:
train.tail()

# Dataset Overview

This dataset captures various aspects of user interactions and sessions on a digital platform. Below are the major categories of features:

## Key Feature Categories

### User Behavior & Session Metrics

- `totalHits`, `pageViews`, `totals.bounces`, `new_visits`, `totals.visits`:  
  Indicators of user engagement and session activity.

- `sessionNumber`, `sessionStart`:  
  Information related to session sequence and timing.

### Device & Technical Attributes

- `deviceType`, `os`, `browser`, `screenSize`, `device.browserSize`, `device.language`:  
  Details about the user's device and browsing environment.

- `browserMajor`, `device.*`:  
  Encompasses a variety of device-level descriptors such as model, version, and screen specifications.

- `gclIdPresent`:  
  Signals the presence of a Google Click ID used in ad tracking.

### Traffic & Marketing Source

- `userChannel`, `trafficSource`, `trafficSource.medium`, `trafficSource.keyword`, `trafficSource.campaign`:  
  Insights into how users arrived at the platform.

- `trafficSource.adwordsClickInfo.*`:  
  Contains attributes from advertising sources, including ad network type and slot.

- `trafficSource.adContent`, `trafficSource.referralPath`, `trafficSource.isTrueDirect`:  
  Provide further attribution details.

### Geographical Context

- `geoNetwork.city`, `locationCountry`, `geoNetwork.continent`, `geoNetwork.subContinent`, `geoNetwork.metro`, `geoNetwork.region`:  
  Geographic identifiers to help understand regional behavior trends.

- `geoCluster`, `locationZone`:  
  Groupings based on geographic or behavioral patterns.

### Identifiers

- `userId`, `sessionId`:  
  Unique identifiers for each user and session, allowing for multi-session analysis.

## Target Variable

- `purchaseValue`:  
  The amount (in currency units) spent by the customer during the session.  
  This is the target variable to be predicted.


In [None]:
train.nunique()

In [None]:
test.shape

In [None]:
train.columns

# 1. Exploratory Data Analysis
In this module, each feature of the training dataset is examined in turn: the distribution of categories for categorical columns, and summary statistics for numerical columns.
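
The per-feature checks in this module (type, number of distinct values, share of missing values, dominant value) can be sketched as one small helper. This is a minimal illustration on a made-up toy frame; the function name `summarize_column` is hypothetical and not part of the notebook.

```python
import pandas as pd

def summarize_column(df: pd.DataFrame, col: str) -> dict:
    """Return dtype, distinct-value count, missing share and most frequent value."""
    s = df[col]
    counts = s.value_counts(dropna=True)  # sorted by frequency, most frequent first
    return {
        "dtype": str(s.dtype),
        "n_unique": int(s.nunique(dropna=True)),
        "missing_share": float(s.isna().mean()),
        "top_value": counts.index[0] if len(counts) else None,
    }

# Toy frame standing in for `train` (values are made up for illustration).
toy = pd.DataFrame({"browser": ["Chrome", "Chrome", "Safari", None]})
summary = summarize_column(toy, "browser")
print(summary)
```

Calling the helper on a column of `train` would give the same facts that the `value_counts` and `describe` cells below report separately.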

## Feature 1: `trafficSource.isTrueDirect`

- **Type**: `Categorical`
- **Description**: `Indicates whether a session originated from a direct source or not`
- **Distribution**: 
  - Barplot of value counts
- **Key Insights**:
  - `Most of the values are missing`
- **Missing Values**: `63% NaN`
- **Next Steps**:
  - `Fill the NaN with False`


In [None]:
sns.countplot(x=train['trafficSource.isTrueDirect'])

In [None]:
train['trafficSource.isTrueDirect'].value_counts(dropna=False) #Majority NULL

In [None]:
train['trafficSource.isTrueDirect'].describe(include='all')

## Feature 2: `browser`

- **Type**: `Categorical`
- **Description**: `Indicates which browser the user used`
- **Distribution**: 
  - Barplot of value counts
- **Key Insights**:
  - `Chrome browser dominates the column in both train.csv and test.csv`
- **Missing Values**: `None`
- **Next Steps**:
  - `Group rare categories as 'Other'`
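
Grouping rare categories can be sketched as a small vectorized helper that learns the frequent categories from the training column and applies the same mapping to the test column. The helper name `group_rare`, the `top_n` value, and the toy data are assumptions for illustration only.

```python
import pandas as pd

def group_rare(train_col: pd.Series, test_col: pd.Series, top_n: int, other: str = "Other"):
    """Keep the top_n most frequent training categories; map everything else to `other`."""
    keep = train_col.value_counts().nlargest(top_n).index
    return (
        train_col.where(train_col.isin(keep), other),
        test_col.where(test_col.isin(keep), other),
    )

# Toy columns standing in for train['browser'] and test['browser'] (values made up).
train_browser = pd.Series(["Chrome", "Chrome", "Safari", "Safari", "Opera"])
test_browser = pd.Series(["Chrome", "Edge", "Safari"])

train_grouped, test_grouped = group_rare(train_browser, test_browser, top_n=2)
print(train_grouped.tolist(), test_grouped.tolist())
```

Deriving the kept categories from the training column alone means that a browser seen only in the test set also falls into `'Other'`.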


In [None]:
train['browser'].value_counts(dropna=False)

In [None]:
train['browser'].describe(include='all')

In [None]:
browser_counts = train['browser'].value_counts()
over_1000 = browser_counts[browser_counts > 1000]
under_1000 = browser_counts[browser_counts <= 1000]

In [None]:
xg = over_1000.index.tolist()
yg = over_1000.values.tolist()

plt.figure(figsize=(12, 6))
plt.bar(xg, yg)

plt.xticks(rotation=45)
plt.title("Browsers with more than 1000 Occurrences")
plt.xlabel("Browser")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

In [None]:
xu = under_1000.index.tolist()
yu = under_1000.values.tolist()

plt.figure(figsize=(12, 6))
plt.bar(xu, yu)

plt.xticks(rotation=60)
plt.title("Browsers with 1000 or Fewer Occurrences")
plt.xlabel("Browser")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

## Feature 3: `device.screenResolution`

- **Type**: `N/A`
- **Description**: `Supposed to indicate the screen resolution of the user's device`
- **Distribution**: 
  - N/A
- **Key Insights**:
  - `This feature's contents aren't disclosed`
- **Missing Values**: `N/A`
- **Next Steps**:
  - `Drop this feature`


In [None]:
train['device.screenResolution'].value_counts(dropna=False) #Single valued and no NULL

## Feature 4: `trafficSource.adContent`

- **Type**: `Categorical`
- **Description**: `Indicates the content of the Advertisement which the user clicked on`
- **Distribution**: 
  - Barplot of value counts
- **Key Insights**:
  - `Most of the trafficSource is not from an Advertisement`
- **Missing Values**: `97%`
- **Next Steps**:
  - `Drop this feature`

In [None]:
train['trafficSource.adContent'].value_counts(dropna=False) #97% NULL Values

In [None]:
top_content = train['trafficSource.adContent'].value_counts().nlargest(10).index

# Filter data to include only top 10 categories
filtered = train[train['trafficSource.adContent'].isin(top_content)]

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(data=filtered, x='trafficSource.adContent', order=top_content)
plt.xticks(rotation=45, ha='right')
plt.title('Top 10 Ad Content Categories')
plt.xlabel('Ad Content')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

In [None]:
train['trafficSource.adContent'].describe()

## Feature 5: `trafficSource.keyword`

- **Type**: `Categorical`
- **Description**: `Indicates the specific keyword a user searched with`
- **Distribution**: 
  - Barplot of value counts
- **Key Insights**:
  - `Most of the trafficSource is not from a keyword search`
- **Missing Values**: `62%`
- **Next Steps**:
  - `Group rare categories as 'Other'`

In [None]:
train['trafficSource.keyword'].value_counts(dropna=False) #62% NULL Values

In [None]:
train['trafficSource.keyword'].describe()

## Feature 6: `screenSize`

- **Type**: `N/A`
- **Description**: `Supposed to indicate the screen size of the user's device`
- **Distribution**: 
  - N/A
- **Key Insights**:
  - `This feature's contents aren't disclosed`
- **Missing Values**: `N/A`
- **Next Steps**:
  - `Drop this feature`


In [None]:
train['screenSize'].value_counts(dropna=False) #Single value no null

## Feature 7: `geoCluster`

- **Type**: `Categorical`
- **Description**: `Indicates which geographical region a user is accessing from`
- **Distribution**: 
  - Bar plot of value_counts
- **Key Insights**:
  - `Uniformly distributed among both train and test datasets`
- **Missing Values**: `None`
- **Next Steps**:
  - `Keep the feature as is`

In [None]:
train['geoCluster'].value_counts(dropna=False) #No NULL values

In [None]:
train['geoCluster'].describe()

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(data=train, x='geoCluster')
plt.xticks(rotation=45, ha='right')
plt.title('geoCluster Region')
plt.xlabel('Region')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

## Feature 8: `trafficSource.adwordsClickInfo.slot`

- **Type**: `Categorical`
- **Description**: `Indicates the position of an Advertisement on a page`
- **Distribution**: 
  - Bar plot of value_counts
- **Key Insights**:
  - `Most visits weren't from an Advertisement`
- **Missing Values**: `96%`
- **Next Steps**:
  - `Drop this feature`

In [None]:
train['trafficSource.adwordsClickInfo.slot'].value_counts(dropna=False) #96% NULL Values

In [None]:
train['trafficSource.adwordsClickInfo.slot'].describe()

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(data=train, x='trafficSource.adwordsClickInfo.slot')
plt.xticks(rotation=45, ha='right')
plt.title('trafficSource.adwordsClickInfo.slot')
plt.xlabel('Slot')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

## Feature 9: `device.mobileDeviceBranding`

- **Type**: `N/A`
- **Description**: `Supposed to indicate the brand of the mobile device the user used`
- **Distribution**: 
  - N/A
- **Key Insights**:
  - `This feature's contents aren't disclosed`
- **Missing Values**: `N/A`
- **Next Steps**:
  - `Drop this feature`


In [None]:
train['device.mobileDeviceBranding'].value_counts(dropna=False) #not available in demo dataset

## Feature 10: `device.mobileInputSelector`

- **Type**: `N/A`
- **Description**: `Supposed to indicate how a mobile device handles user input`
- **Distribution**: 
  - N/A
- **Key Insights**:
  - `This feature's contents aren't disclosed`
- **Missing Values**: `N/A`
- **Next Steps**:
  - `Drop this feature`


In [None]:
train['device.mobileInputSelector'].value_counts(dropna=False) #not available in demo dataset

## Feature 11: `userId`

- **Type**: `Categorical (but raw data given as Numerical)`
- **Description**: `Represents the Identity of every User that accessed the page`
- **Distribution**: 
  - Nothing measurable here as they're categories
- **Key Insights**:
  - `Column is given as Numerical datatype but true meaning is Categorical`
- **Missing Values**: `None`
- **Next Steps**:
  - `Encoding this feature as categorical`


In [None]:
train['userId']

In [None]:
train['userId'].value_counts(dropna=False) #No NULL Values

In [None]:
train['userId'].describe()

## Feature 12: `trafficSource.campaign`

- **Type**: `Categorical`
- **Description**: `Indicates the marketing campaign that brought the user to the website`
- **Distribution**: 
  - Bar plot of value_counts
- **Key Insights**:
  - `Most of the traffic did not result from a Marketing Campaign (not set)`
- **Missing Values**: `95%`
- **Next Steps**:
  - `Drop this feature`


In [None]:
train['trafficSource.campaign'].value_counts(dropna=False) #Majority value (95%): (not set) -> NULL

In [None]:
train['trafficSource.campaign'].describe()

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(data=train, x='trafficSource.campaign')
plt.xticks(rotation=45, ha='right')
plt.title('Traffic Source Campaign')
plt.xlabel('Campaign')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

## Feature 13: `device.mobileDeviceMarketingName`

- **Type**: `N/A`
- **Description**: `Supposed to indicate the marketing name of the mobile device used`
- **Distribution**: 
  - N/A
- **Key Insights**:
  - `This feature's contents aren't disclosed`
- **Missing Values**: `N/A`
- **Next Steps**:
  - `Drop this feature`


In [None]:
train['device.mobileDeviceMarketingName'].value_counts(dropna=False) #not available in demo dataset

## Feature 14: `geoNetwork.networkDomain`

- **Type**: `Categorical`
- **Description**: `Indicates which network domain a user is accessing from`
- **Distribution**: 
  - Bar plot of value_counts
- **Key Insights**:
  - `Uniformly distributed among both train and test datasets`
- **Missing Values**: `None`
- **Next Steps**:
  - `Keep the feature as is`

In [None]:
train['geoNetwork.networkDomain'].value_counts(dropna=False)

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(data=train, x='geoNetwork.networkDomain')
plt.xticks(rotation=45, ha='right')
plt.title('GeoNetwork Domain')
plt.xlabel('Domain')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

## Feature 15: `gclIdPresent`

- **Type**: `Categorical`
- **Description**: `Indicates the ID passed in the URL with Advertisement clicks`
- **Distribution**: 
  - Bar plot of value_counts
- **Key Insights**:
  - `Most traffic is not from Advertisements`
- **Missing Values**: `None`
- **Next Steps**:
  - `Keep the feature as is`

In [None]:
train['gclIdPresent'].value_counts(dropna=False)

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(data=train, x='gclIdPresent')
plt.xticks(rotation=45, ha='right')
plt.title('gclIdPresent')
plt.xlabel('Category')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

## Feature 16: `device.operatingSystemVersion`

- **Type**: `N/A`
- **Description**: `Supposed to indicate the Operating System Version of the device used`
- **Distribution**: 
  - N/A
- **Key Insights**:
  - `This feature's contents aren't disclosed`
- **Missing Values**: `N/A`
- **Next Steps**:
  - `Drop this feature`


In [None]:
train['device.operatingSystemVersion'].value_counts(dropna=False) #not available in demo dataset

## Feature 17: `sessionNumber`

- **Type**: `Numerical`
- **Description**: `Indicates the number of sessions initiated`
- **Distribution**: 
  - Box plot to analyze outliers
- **Key Insights**:
  - `Most of the users initiated only one session`
- **Missing Values**: `None`
- **Next Steps**:
  - `Bin the feature according to sessionNumber`
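
Binning `sessionNumber` could look like the sketch below, which uses `pd.cut`. The bin edges and labels are hypothetical choices for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy session numbers standing in for train['sessionNumber'] (values made up).
session_number = pd.Series([1, 1, 2, 3, 7, 40], name="sessionNumber")

# Hypothetical bins: first session, sessions 2-3, sessions 4-10, more than 10.
edges = [0, 1, 3, 10, np.inf]
labels = ["1", "2-3", "4-10", "11+"]

# pd.cut uses right-closed intervals by default: (0, 1], (1, 3], (3, 10], (10, inf].
session_bin = pd.cut(session_number, bins=edges, labels=labels)
print(session_bin.value_counts(sort=False))
```

Since most users have a single session, a dedicated first bin keeps that dominant group separate from the long tail of returning users.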


In [None]:
train['sessionNumber'].value_counts(dropna=False)

In [None]:
plt.figure(figsize=(8, 4))
sns.boxplot(x=train['sessionNumber'], color='lightgreen')
plt.title("Boxplot of Session Number")
plt.xlabel("Session Number")
plt.tight_layout()
plt.show()

In [None]:
train['sessionNumber'].describe()

## Feature 18: `device.flashVersion`

- **Type**: `N/A`
- **Description**: `Supposed to indicate the version of Flash in the user's device`
- **Distribution**: 
  - N/A
- **Key Insights**:
  - `This feature's contents aren't disclosed`
- **Missing Values**: `N/A`
- **Next Steps**:
  - `Drop this feature`


In [None]:
train['device.flashVersion'].value_counts(dropna=False) #not available in demo dataset

## Feature 19: `geoNetwork.region`

- **Type**: `Categorical`
- **Description**: `Indicates which region the network is accessing from`
- **Distribution**: 
  - Bar plot of value_counts
- **Key Insights**:
  - `More than 50% of regions are undefined`
- **Missing Values**: `52%`
- **Next Steps**:
  - `Group rare categories as 'Other'`

In [None]:
train['geoNetwork.region'].value_counts(dropna=False) #Majority not available in dataset

In [None]:
train['geoNetwork.region'].describe()

In [None]:
train['geoNetwork.region'][train['geoNetwork.region'] == 'not available in demo dataset'].shape[0] / train['geoNetwork.region'].shape[0]

In [None]:
top_regions = train['geoNetwork.region'].value_counts().nlargest(10).index
plt.figure(figsize=(10, 6))
sns.countplot(data=train, x='geoNetwork.region', order=top_regions)
plt.xticks(rotation=45, ha='right')
plt.title('GeoNetwork Region')
plt.xlabel('Region')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

## Feature 20: `trafficSource`

- **Type**: `Categorical`
- **Description**: `Indicates the source of traffic`
- **Distribution**: 
  - Bar plot of value_counts
- **Key Insights**:
  - `Direct sources dominate`
- **Missing Values**: `None`
- **Next Steps**:
  - `Group rare categories as 'Other'`

In [None]:
train['trafficSource'].value_counts(dropna=False)

In [None]:
train['trafficSource'].describe()

In [None]:
top_sources = train['trafficSource'].value_counts().nlargest(10).index
plt.figure(figsize=(10, 6))
sns.countplot(data=train, x='trafficSource', order=top_sources)
plt.xticks(rotation=45, ha='right')
plt.title('Traffic Source')
plt.xlabel('Source')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

## Feature 21: `totals.visits`

- **Type**: `Categorical`
- **Description**: `Indicates whether the user has visited or not`
- **Distribution**: 
  - N/A
- **Key Insights**:
  - `Column is given as numerical but nature is categorical and no variance`
- **Missing Values**: `None`
- **Next Steps**:
  - `Drop this feature`

In [None]:
train['totals.visits'].value_counts(dropna=False) #Singular value without NULL

## Feature 22: `geoNetwork.networkLocation`

- **Type**: `N/A`
- **Description**: `Supposed to indicate the location of the network used`
- **Distribution**: 
  - N/A
- **Key Insights**:
  - `This feature's contents aren't disclosed`
- **Missing Values**: `N/A`
- **Next Steps**:
  - `Drop this feature`


In [None]:
train['geoNetwork.networkLocation'].value_counts(dropna=False) #not available in demo dataset

## Feature 23: `sessionId`

- **Type**: `Categorical (but given as Numerical)`
- **Description**: `Indicates the Id of the initiated session`
- **Distribution**: 
  - No distribution to analyze as it's not a measurable column
- **Key Insights**:
  - `Given data is numerical but is categorical in nature`
- **Missing Values**: `None`
- **Next Steps**:
  - `Convert into categorical`


In [None]:
train['sessionId'].value_counts(dropna=False)

In [None]:
train['sessionId'].describe()

## Feature 24: `os`

- **Type**: `Categorical`
- **Description**: `Indicates the Operating System of the user`
- **Distribution**: 
  - Bar plot of value_counts
- **Key Insights**:
  - `Windows and Mac make up for more than 60% of users`
- **Missing Values**: `None`
- **Next Steps**:
  - `Keep feature as is`

In [None]:
train['os'].value_counts(dropna=False)

In [None]:
train['os'].describe()

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(data=train, x='os')
plt.xticks(rotation=45, ha='right')
plt.title('OS')
plt.xlabel('os')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

## Feature 25: `geoNetwork.subContinent`

- **Type**: `Categorical`
- **Description**: `Indicates the continent or part of continent the network is accessing from`
- **Distribution**: 
  - Bar plot of value_counts
- **Key Insights**:
  - `More than 50% of users are accessing from Northern America`
- **Missing Values**: `None`
- **Next Steps**:
  - `Keep feature as is`

In [None]:
train['geoNetwork.subContinent'].value_counts(dropna=False)

In [None]:
train['geoNetwork.subContinent'].describe()

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(data=train, x='geoNetwork.subContinent')
plt.xticks(rotation=45, ha='right')
plt.title('Network Subcontinent')
plt.xlabel('Subcontinent')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

## Feature 26: `trafficSource.medium`

- **Type**: `Categorical`
- **Description**: `Indicates the medium through which the traffic arrived`
- **Distribution**: 
  - Bar plot of value_counts
- **Key Insights**:
  - `More than 65% of users are accessing without a medium or 'Organic'`
- **Missing Values**: `None`
- **Next Steps**:
  - `Keep feature as is`

In [None]:
train['trafficSource.medium'].value_counts(dropna=False)

In [None]:
train['trafficSource.medium'].describe()

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(data=train, x='trafficSource.medium')
plt.xticks(rotation=45, ha='right')
plt.title('Traffic Source Medium')
plt.xlabel('Medium')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

## Feature 27: `trafficSource.adwordsClickInfo.isVideoAd`

- **Type**: `Categorical`
- **Description**: `Indicates whether the traffic source is through a Video Advertisement`
- **Distribution**: 
  - Bar plot of value_counts
- **Key Insights**:
  - `More than 96% of users didn't access through a Video Advertisement`
- **Missing Values**: `96%`
- **Next Steps**:
  - `Drop this feature`

In [None]:
train['trafficSource.adwordsClickInfo.isVideoAd'].value_counts(dropna=False) #96% NULL Values

In [None]:
train['trafficSource.adwordsClickInfo.isVideoAd'].describe()

In [None]:
train['trafficSource.adwordsClickInfo.isVideoAd'].isnull().sum() / train['trafficSource.adwordsClickInfo.isVideoAd'].shape[0]

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(data=train, x='trafficSource.adwordsClickInfo.isVideoAd')
plt.xticks(rotation=45, ha='right')
plt.title('Traffic Source Through Video Ad')
plt.xlabel('Class')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

## Feature 28: `browserMajor`

- **Type**: `N/A`
- **Description**: `Supposed to indicate the major version number of a browser`
- **Distribution**: 
  - N/A
- **Key Insights**:
  - `This feature's contents aren't disclosed`
- **Missing Values**: `N/A`
- **Next Steps**:
  - `Drop this feature`


In [None]:
train['browserMajor'].value_counts(dropna=False) #not available in demo dataset

## Feature 29: `locationCountry`

- **Type**: `Categorical`
- **Description**: `Indicates the country the user is accessing from`
- **Distribution**: 
  - Bar plot of value counts
- **Key Insights**:
  - `Most are from United States`
- **Missing Values**: `None`
- **Next Steps**:
  - `Group rare categories as 'Other'`


In [None]:
train['locationCountry'].value_counts(dropna=False)

In [None]:
train['locationCountry'].describe()

## Feature 30: `device.browserSize`

- **Type**: `N/A`
- **Description**: `Supposed to indicate the dimensions of the browser window`
- **Distribution**: 
  - N/A
- **Key Insights**:
  - `This feature's contents aren't disclosed`
- **Missing Values**: `N/A`
- **Next Steps**:
  - `Drop this feature`


In [None]:
train['device.browserSize'].value_counts(dropna=False) #not available in demo dataset

## Feature 31: `trafficSource.adwordsClickInfo.adNetworkType`

- **Type**: `Categorical`
- **Description**: `Indicates the Advertisement Network Type`
- **Distribution**: 
  - Bar plot of value counts
- **Key Insights**:
  - `Most traffic is not from an Advertisement (extended)`
- **Missing Values**: `96%`
- **Next Steps**:
  - `Drop this feature`


In [None]:
train['trafficSource.adwordsClickInfo.adNetworkType'].value_counts(dropna=False) #96% NULL Values

In [None]:
train['trafficSource.adwordsClickInfo.adNetworkType'].describe()

In [None]:
train['trafficSource.adwordsClickInfo.adNetworkType'].isnull().sum() / train['trafficSource.adwordsClickInfo.adNetworkType'].shape[0]

## Feature 32: `socialEngagementType`

- **Type**: `N/A`
- **Description**: `Indicates the social engagement level or type of the user`
- **Distribution**: 
  - N/A
- **Key Insights**:
  - `This feature only takes one value. No variance.`
- **Missing Values**: `None`
- **Next Steps**:
  - `Drop this feature`


In [None]:
train['socialEngagementType'].value_counts(dropna=False) #Singular valued non null column

## Feature 33: `geoNetwork.city`

- **Type**: `N/A`
- **Description**: `Supposed to indicate the city the network is accessing from`
- **Distribution**: 
  - Bar plot of value counts
- **Key Insights**:
  - `Most cities aren't available in the data`
- **Missing Values**: `N/A`
- **Next Steps**:
  - `Group rare cities as 'Other'`


In [None]:
train['geoNetwork.city'].value_counts(dropna=False) #Majority not available in demo dataset

In [None]:
train['geoNetwork.city'].describe()

## Feature 34: `trafficSource.adwordsClickInfo.page`

- **Type**: `Categorical`
- **Description**: `Indicates the Page Number of the Advertisement the user clicked from`
- **Distribution**: 
  - Bar plot of value counts
- **Key Insights**:
  - `Most traffic is not from an Advertisement (extended)`
- **Missing Values**: `96%`
- **Next Steps**:
  - `Drop this feature`

In [None]:
train['trafficSource.adwordsClickInfo.page'].value_counts(dropna=False) #96% NULL Values

In [None]:
train['trafficSource.adwordsClickInfo.page'].describe()

## Feature 35: `geoNetwork.metro`

- **Type**: `Categorical`
- **Description**: `Indicates the Market Area from which traffic originated`
- **Distribution**: 
  - Bar plot of value counts
- **Key Insights**:
  - `More than 70% of data is missing`
- **Missing Values**: `70%`
- **Next Steps**:
  - `Drop this feature`
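
The 70% figure counts placeholder strings as missing, not only true `NaN`. A minimal sketch of that "effective missing" calculation follows; the helper name `effective_missing_share` and the toy values are assumptions, while the two placeholder strings are the ones noted in the cell comments of this notebook.

```python
import numpy as np
import pandas as pd

# Placeholder strings that are treated as missing alongside NaN.
PLACEHOLDERS = ["not available in demo dataset", "(not set)"]

def effective_missing_share(s: pd.Series) -> float:
    """Share of values that are NaN or one of the placeholder strings."""
    return float((s.isna() | s.isin(PLACEHOLDERS)).mean())

# Toy column standing in for train['geoNetwork.metro'] (values made up).
metro = pd.Series(
    ["not available in demo dataset", "(not set)", "London", np.nan, "New York NY"]
)
share = effective_missing_share(metro)
print(share)
```

A plain `isnull()` check would miss these rows, because the placeholders are ordinary strings.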

In [None]:
train['geoNetwork.metro'].value_counts(dropna=False) #52% not available + 18% (not set) = 70% effective NULL

In [None]:
train['geoNetwork.metro'].describe()

## Feature 36: `pageViews`

- **Type**: `Numerical`
- **Description**: `Indicates the number of times the page was visited`
- **Distribution**: 
  - Box plot for tracking views
- **Key Insights**:
  - `Most users view once`
- **Missing Values**: `None`
- **Next Steps**:
  - `Keep feature as is or aggregate with userId`
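
Aggregating `pageViews` per `userId` could be sketched with `groupby(...).transform`, which broadcasts each user's aggregate back onto that user's session rows. The new column names and the toy data are hypothetical.

```python
import pandas as pd

# Toy sessions standing in for `train` (values made up).
toy = pd.DataFrame({
    "userId": [101, 101, 202, 303, 303, 303],
    "pageViews": [1, 5, 2, 1, 1, 10],
})

# Per-user aggregates, repeated on every session row of that user.
grouped = toy.groupby("userId")["pageViews"]
toy["user_pageViews_sum"] = grouped.transform("sum")
toy["user_pageViews_mean"] = grouped.transform("mean")
toy["user_session_count"] = grouped.transform("count")  # non-null pageViews rows per user
print(toy)
```

Unlike `agg`, `transform` keeps the original row count, so the new columns can be used directly as session-level features.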

In [None]:
train['pageViews'].value_counts(dropna=False) #non-null column

In [None]:
train['pageViews'].describe()

In [None]:
plt.figure(figsize=(8, 4))
sns.boxplot(x=train['pageViews'], color='lightgreen')
plt.title("Boxplot of Page Views")
plt.xlabel("Page Views")
plt.tight_layout()
plt.show()

## Feature 37: `locationZone`

- **Type**: `Numerical`
- **Description**: `Indicates the location zone of the user`
- **Distribution**: 
  - N/A
- **Key Insights**:
  - `Only one zone is given in the data`
- **Missing Values**: `None`
- **Next Steps**:
  - `Drop this feature`

In [None]:
train['locationZone'].value_counts(dropna=False) #Single-valued, non-null column

## Feature 38: `device.mobileDeviceModel`

- **Type**: `N/A`
- **Description**: `Supposed to indicate the Mobile Device Model`
- **Distribution**: 
  - N/A
- **Key Insights**:
  - `Data wasn't disclosed`
- **Missing Values**: `N/A`
- **Next Steps**:
  - `Drop this feature`

In [None]:
train['device.mobileDeviceModel'].value_counts(dropna=False) #not available in demo dataset

## Feature 38: `trafficSource.referralPath`

- **Type**: `Categorical`
- **Description**: `Indicates the path of the referring page that led the user to the website`
- **Distribution**: 
  - N/A
- **Key Insights**:
  - `More than 60% of users didn't use a referral path`
- **Missing Values**: `63%`
- **Next Steps**:
  - `Drop this feature`

In [None]:
train['trafficSource.referralPath'].value_counts(dropna=False) #Majority NULL Values

In [None]:
train['trafficSource.referralPath'].describe()

## Feature 39: `totals.bounces`

- **Type**: `Categorical`
- **Description**: `Supposed to indicate the number of single page sessions without interaction`
- **Distribution**: 
  - Bar plot of value counts
- **Key Insights**:
  - `Data is given as categorical (0 and 1 class)`
- **Missing Values**: `60%`
- **Next Steps**:
  - `Treat as binary feature and fillna with 0`
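
Treating `totals.bounces` as a binary flag might look like the sketch below. It assumes that a missing value means no bounce was recorded, which is the reading implied by the `fillna` step; the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy column standing in for train['totals.bounces']: 1.0 where a bounce was recorded.
bounces = pd.Series([1.0, np.nan, np.nan, 1.0, np.nan], name="totals.bounces")

# Treat NaN as "no bounce" and store the flag as a compact integer.
bounced = bounces.fillna(0).astype("int8")
print(bounced.tolist())
```

The same pattern would apply to `new_visits`, where the planned step is also to impute nulls with 0.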

In [None]:
train['totals.bounces'].value_counts(dropna=False)

In [None]:
train['totals.bounces'].describe()

## Feature 40: `date`

- **Type**: `Categorical (but given as Numerical)`
- **Description**: `Indicates the date that the user accessed`
- **Distribution**: 
  - Treat as Categorical
- **Key Insights**:
  - `Data given is numerical but date is not measurable`
- **Missing Values**: `None`
- **Next Steps**:
  - `Convert to categorical column`
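
A sketch of the conversion follows. It assumes the column stores dates as `YYYYMMDD` integers, which is an assumption about the data layout and should be checked against the `value_counts` output. The second option, parsing the value to derive calendar parts, is an optional alternative and not something the notebook states it does.

```python
import pandas as pd

# Toy values standing in for train['date']; the YYYYMMDD layout is an assumption.
date_raw = pd.Series([20170312, 20170312, 20171225], name="date")

# Option 1: keep the raw value but mark it as categorical.
date_cat = date_raw.astype(str).astype("category")

# Option 2 (alternative): parse to datetime and derive calendar parts.
date_parsed = pd.to_datetime(date_raw.astype(str), format="%Y%m%d")
date_parts = pd.DataFrame({
    "month": date_parsed.dt.month,
    "dayofweek": date_parsed.dt.dayofweek,  # Monday = 0, Sunday = 6
})
print(date_cat.cat.categories.tolist())
print(date_parts)
```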

In [None]:
train['date'].value_counts(dropna=False)

In [None]:
train['date'].describe()

## Feature 41: `device.language`

- **Type**: `N/A`
- **Description**: `Supposed to indicate the language settings in the device`
- **Distribution**: 
  - N/A
- **Key Insights**:
  - `Data is not disclosed`
- **Missing Values**: `N/A`
- **Next Steps**:
  - `Drop this feature`

In [None]:
train['device.language'].value_counts(dropna=False) #not available in demo dataset

## Feature 42: `deviceType`

- **Type**: `Categorical`
- **Description**: `Indicates the type of device used`
- **Distribution**: 
  - Categorical Distribution
- **Key Insights**:
  - `More than 70% are Desktop Users`
- **Missing Values**: `None`
- **Next Steps**:
  - `Keep feature as is`

In [None]:
train['deviceType'].value_counts(dropna=False)

In [None]:
train['deviceType'].describe()

## Feature 43: `userChannel`

- **Type**: `Categorical`
- **Description**: `Indicates the channel through which the user is accessing`
- **Distribution**: 
  - Categorical Distribution
- **Key Insights**:
  - `Organic Search dominates the column`
- **Missing Values**: `None`
- **Next Steps**:
  - `Keep feature as is`

In [None]:
train['userChannel'].value_counts(dropna=False)

In [None]:
train['userChannel'].describe()

## Feature 44: `device.browserVersion`

- **Type**: `N/A`
- **Description**: `Supposed to indicate the version of the browser`
- **Distribution**: 
  - N/A
- **Key Insights**:
  - `Data was not disclosed`
- **Missing Values**: `N/A`
- **Next Steps**:
  - `Drop this feature`

In [None]:
train['device.browserVersion'].value_counts(dropna=False) #not available in demo dataset

## Feature 45: `totalHits`

- **Type**: `Numerical`
- **Description**: `Indicates the total number of interactions/hits for a session`
- **Distribution**: 
  - Boxplot of hits
- **Key Insights**:
  - `Most visits have one interaction`
- **Missing Values**: `None`
- **Next Steps**:
  - `Keep feature as is`

In [None]:
train['totalHits'].value_counts(dropna=False)

In [None]:
train['totalHits'].describe()

In [None]:
plt.figure(figsize=(8, 4))
sns.boxplot(x=train['totalHits'], color='lightgreen')
plt.title("Boxplot of Total Hits")
plt.xlabel("Hits")
plt.tight_layout()
plt.show()

## Feature 46: `device.screenColors`

- **Type**: `N/A`
- **Description**: `Supposed to indicate the color depth of the user's device screen`
- **Distribution**: 
  - N/A
- **Key Insights**:
  - `Data was not disclosed`
- **Missing Values**: `N/A`
- **Next Steps**:
  - `Drop this feature`

In [None]:
train['device.screenColors'].value_counts(dropna=False) #not available in demo dataset

## Feature 47: `sessionStart`

- **Type**: `Categorical`
- **Description**: `Indicates the timestamp of when a session starts`
- **Distribution**: 
  - No distribution
- **Key Insights**:
  - `Data is identical to sessionId`
- **Missing Values**: `N/A`
- **Next Steps**:
  - `Drop this feature`
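
Before dropping a column as a duplicate, the claim that it matches another column can be verified directly. The sketch below does this on made-up toy values.

```python
import pandas as pd

# Toy frame standing in for `train` (values made up).
toy = pd.DataFrame({
    "sessionId": [1500000001, 1500000002, 1500000003],
    "sessionStart": [1500000001, 1500000002, 1500000003],
})

# Element-wise comparison; True only if every row matches.
is_duplicate = bool((toy["sessionStart"] == toy["sessionId"]).all())
if is_duplicate:
    toy = toy.drop(columns=["sessionStart"])
print(is_duplicate, list(toy.columns))
```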

In [None]:
train['sessionStart'].value_counts(dropna=False)

## Feature 48: `geoNetwork.continent`

- **Type**: `Categorical`
- **Description**: `Indicates the continent of the network`
- **Distribution**: 
  - Bar plot of value counts
- **Key Insights**:
  - `Americas dominates the column`
- **Missing Values**: `None`
- **Next Steps**:
  - `Keep feature as is`

In [None]:
train['geoNetwork.continent'].value_counts(dropna=False)

In [None]:
train['geoNetwork.continent'].describe()

## Feature 49: `device.isMobile`

- **Type**: `Categorical`
- **Description**: `Indicates whether the user used a Mobile Device`
- **Distribution**: 
  - Categorical Distribution
- **Key Insights**:
  - `More than 80% of users are not using a mobile device`
- **Missing Values**: `None`
- **Next Steps**:
  - `Keep feature as is`

In [None]:
train['device.isMobile'].value_counts(dropna=False)

## Feature 50: `new_visits`

- **Type**: `Categorical`
- **Description**: `Indicates whether the user is a returning user or a new user`
- **Distribution**: 
  - Categorical
- **Key Insights**:
  - `Most users are new users (new visits)`
- **Missing Values**: `30%`
- **Next Steps**:
  - `Impute null values with 0`

In [None]:
train['new_visits'].value_counts(dropna=False)

In [None]:
train['new_visits'].describe()

## Label Column Analysis: purchaseValue
- **Type**: `Numerical`
- **Description**: `Indicates the amount purchased in a session`
- **Distribution**: 
  - Boxplot to detect outliers
- **Key Insights**:
  - `More than 80% of the column is zero (zero-inflated)`
- **Missing Values**: `None`
- **Next Steps**:
  - `Keep feature as is`
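
Zero inflation and the effect of a log transform can be checked with a few lines. The sketch below runs on made-up values; `np.log1p` is the transform this notebook uses for its histogram, and `np.expm1` is its inverse.

```python
import numpy as np
import pandas as pd

# Toy target standing in for train['purchaseValue'] (values made up).
purchase_value = pd.Series([0.0, 0.0, 0.0, 0.0, 25.0, 1200.0])

# Fraction of sessions with no purchase.
zero_share = float((purchase_value == 0).mean())

# log1p maps 0 to 0 and compresses the long right tail; expm1 inverts it.
log_value = np.log1p(purchase_value)
restored = np.expm1(log_value)
print(round(zero_share, 3))
```

If a model were trained on the log-transformed target, its predictions would need `np.expm1` applied to return to currency units.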

In [None]:
train['purchaseValue'].value_counts(dropna=False)

In [None]:
train['purchaseValue'].describe()

In [None]:
log_value = np.log1p(train['purchaseValue'])

plt.figure(figsize=(10,5))
sns.histplot(log_value, bins=50, kde=True)
plt.title("Log-Transformed Purchase Value Distribution")
plt.show()

In [None]:
plt.figure(figsize=(8, 4))
sns.boxplot(x=train['purchaseValue'], color='lightgreen')
plt.title("Boxplot of Purchase Value")
plt.xlabel("Purchase Value")
plt.tight_layout()
plt.show()

In [None]:
train['purchaseValue'].skew()

In [None]:
train['purchaseValue'].value_counts(normalize=True)

# 2. Data Cleaning
In this module, we focus on the following:

- Imputing null values
- Grouping rare categories into a single category in categorical columns
- Dropping unnecessary columns (columns with a single unique value, and columns with more than 80% missing values)
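
The rare-category grouping in this section follows one repeated pattern: learn the most frequent values from the training column, then map everything else to `'Other'` in both train and test. A minimal sketch of that pattern as a reusable helper is shown here; the function name `group_rare_categories` and the toy values are assumptions for illustration and do not appear in the notebook.

```python
import pandas as pd

def group_rare_categories(train_col, test_col, top_n, other_label="Other"):
    """Keep the top_n most frequent training categories; map the rest to other_label."""
    # Learn the frequent categories from the training column only,
    # so the test column is mapped with the same vocabulary.
    top = train_col.value_counts(dropna=False).nlargest(top_n).index
    train_out = train_col.where(train_col.isin(top), other_label)
    test_out = test_col.where(test_col.isin(top), other_label)
    return train_out, test_out

# Toy data (hypothetical values, not from the competition files)
toy_train = pd.Series(["Chrome", "Chrome", "Chrome", "Safari", "Safari", "Firefox", "Opera"])
toy_test = pd.Series(["Chrome", "Edge", "Safari"])

new_train, new_test = group_rare_categories(toy_train, toy_test, top_n=2)
print(new_train.tolist())
print(new_test.tolist())
```

Using `Series.where` with `isin` avoids a Python-level `apply` per row, which may matter on larger frames.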

In [None]:
single_value_drop_cols = [col for col in train.columns if train[col].nunique(dropna=False) == 1]
train.drop(columns=single_value_drop_cols, inplace=True)
test.drop(columns=single_value_drop_cols, inplace=True)

In [None]:
mostly_null = [col for col in train.columns if train[col].isnull().sum() / train.shape[0] > 0.8]
mostly_null

In [None]:
train.drop(columns=mostly_null, inplace=True)
test.drop(columns=mostly_null, inplace=True)

In [None]:
train.columns

In [None]:
train['trafficSource.isTrueDirect'] = train['trafficSource.isTrueDirect'].fillna(False)
test['trafficSource.isTrueDirect'] = test['trafficSource.isTrueDirect'].fillna(False)

In [None]:
top_browsers = train['browser'].value_counts().nlargest(20).index
top_browsers

In [None]:
train['browser'] = train['browser'].apply(lambda x: x if x in top_browsers else 'Other')
test['browser'] = test['browser'].apply(lambda x: x if x in top_browsers else 'Other')

In [None]:
train['browser'].value_counts(dropna=False)

In [None]:
# top_ad_content = train['trafficSource.adContent'].value_counts(dropna=False).nlargest(10).index
# top_ad_content

In [None]:
# train['trafficSource.adContent'] = train['trafficSource.adContent'].apply(
#     lambda x:x if x in top_ad_content else 'Other')
# test['trafficSource.adContent'] = test['trafficSource.adContent'].apply(
#     lambda x:x if x in top_ad_content else 'Other')

In [None]:
train['trafficSource.keyword'] = train['trafficSource.keyword'].fillna('unknown')
train['trafficSource.keyword'] = train['trafficSource.keyword'].replace('(not provided)', 'unknown')
test['trafficSource.keyword'] = test['trafficSource.keyword'].fillna('unknown')
test['trafficSource.keyword'] = test['trafficSource.keyword'].replace('(not provided)', 'unknown')

In [None]:
keyword = train['trafficSource.keyword'].value_counts().nlargest(20).index
keyword

In [None]:
train['trafficSource.keyword'] = train['trafficSource.keyword'].apply(
    lambda x:x if x in keyword else 'Other')
test['trafficSource.keyword'] = test['trafficSource.keyword'].apply(
    lambda x:x if x in keyword else 'Other')

In [None]:
# train['trafficSource.adwordsClickInfo.slot'].value_counts(dropna=False)

In [None]:
# train['trafficSource.adwordsClickInfo.slot'] = train['trafficSource.adwordsClickInfo.slot'].fillna("Organic")
# test['trafficSource.adwordsClickInfo.slot'] = test['trafficSource.adwordsClickInfo.slot'].fillna("Organic")

In [None]:
train['userId'] = train['userId'].astype('category')
test['userId'] = test['userId'].astype('category')

In [None]:
test['userId']

In [None]:
top_campaigns = train['trafficSource.campaign'].value_counts(dropna=False).nlargest(10).index
top_campaigns

In [None]:
train['trafficSource.campaign'] = train['trafficSource.campaign'].apply(
    lambda x:x if x in top_campaigns else 'Other')
test['trafficSource.campaign'] = test['trafficSource.campaign'].apply(
    lambda x:x if x in top_campaigns else 'Other')

In [None]:
def bucket_session_number(x):
    if x == 1:
        return 'First session'
    elif x == 2:
        return 'Second session'
    elif x <= 5:
        return '3rd–5th session'
    elif x <= 10:
        return '6th–10th session'
    else:
        return '11+ sessions'

In [None]:
train['sessionNumber'] = train['sessionNumber'].apply(bucket_session_number)
test['sessionNumber'] = test['sessionNumber'].apply(bucket_session_number)

In [None]:
top_regions = train['geoNetwork.region'].value_counts(dropna=False).nlargest(50).index
top_regions

In [None]:
train['geoNetwork.region'] = train['geoNetwork.region'].apply(
    lambda x:x if x in top_regions else 'Other')
test['geoNetwork.region'] = test['geoNetwork.region'].apply(
    lambda x:x if x in top_regions else 'Other')

In [None]:
top_sources = train['trafficSource'].value_counts(dropna=False).nlargest(20).index
top_sources

In [None]:
train['trafficSource'] = train['trafficSource'].apply(lambda x:x if x in top_sources else 'Other')
test['trafficSource'] = test['trafficSource'].apply(lambda x:x if x in top_sources else 'Other')

In [None]:
train['trafficSource'].value_counts()

In [None]:
train['sessionId'] = train['sessionId'].astype('category')
test['sessionId'] = test['sessionId'].astype('category')

In [None]:
top_os = train['os'].value_counts(dropna=False).nlargest(10).index
top_os

In [None]:
train['os'] = train['os'].apply(lambda x:x if x in top_os else 'Other')
test['os'] = test['os'].apply(lambda x:x if x in top_os else 'Other')

In [None]:
# train.drop(columns=['trafficSource.adwordsClickInfo.isVideoAd'], inplace=True)
# test.drop(columns=['trafficSource.adwordsClickInfo.isVideoAd'], inplace=True)

In [None]:
top_countries = train['locationCountry'].value_counts(dropna=False).nlargest(30).index
top_countries

In [None]:
train['locationCountry'] = train['locationCountry'].apply(lambda x:x if x in top_countries else 'Other')
test['locationCountry'] = test['locationCountry'].apply(lambda x:x if x in top_countries else 'Other')

In [None]:
# train.drop(columns=['trafficSource.adwordsClickInfo.adNetworkType'], inplace=True)
# test.drop(columns=['trafficSource.adwordsClickInfo.adNetworkType'], inplace=True)

In [None]:
train['geoNetwork.city'] = train['geoNetwork.city'].replace('not available in demo dataset', 'Unknown')
train['geoNetwork.city'] = train['geoNetwork.city'].replace('(not set)', 'Unknown')
test['geoNetwork.city'] = test['geoNetwork.city'].replace('not available in demo dataset', 'Unknown')
test['geoNetwork.city'] = test['geoNetwork.city'].replace('(not set)', 'Unknown')

In [None]:
top_cities = train['geoNetwork.city'].value_counts(dropna=False).nlargest(20).index
top_cities

In [None]:
train['geoNetwork.city'] = train['geoNetwork.city'].apply(lambda x:x if x in top_cities else 'Other')
test['geoNetwork.city'] = test['geoNetwork.city'].apply(lambda x:x if x in top_cities else 'Other')

In [None]:
# train['trafficSource.adwordsClickInfo.page'] = train['trafficSource.adwordsClickInfo.page'].fillna(0)
# test['trafficSource.adwordsClickInfo.page'] = test['trafficSource.adwordsClickInfo.page'].fillna(0)

In [None]:
train['geoNetwork.metro'] = train['geoNetwork.metro'].replace('not available in demo dataset', 'Unknown')
train['geoNetwork.metro'] = train['geoNetwork.metro'].replace('(not set)', 'Unknown')
test['geoNetwork.metro'] = test['geoNetwork.metro'].replace('not available in demo dataset', 'Unknown')
test['geoNetwork.metro'] = test['geoNetwork.metro'].replace('(not set)', 'Unknown')

In [None]:
top_metro = train['geoNetwork.metro'].value_counts(dropna=False).nlargest(20).index
top_metro

In [None]:
train['geoNetwork.metro'] = train['geoNetwork.metro'].apply(lambda x:x if x in top_metro else 'Other')
test['geoNetwork.metro'] = test['geoNetwork.metro'].apply(lambda x:x if x in top_metro else 'Other')

In [None]:
train['pageViews'].value_counts(dropna=False)

In [None]:
train.drop(columns=['trafficSource.referralPath'], inplace=True)
test.drop(columns=['trafficSource.referralPath'], inplace=True)

In [None]:
train['totals.bounces'] = train['totals.bounces'].fillna(0)
test['totals.bounces'] = test['totals.bounces'].fillna(0)

In [None]:
train['date'] = train['date'].astype('category')
test['date'] = test['date'].astype('category')

In [None]:
train.drop(columns=['sessionStart'], inplace=True)
test.drop(columns=['sessionStart'], inplace=True)

In [None]:
train['new_visits'].value_counts(dropna=False)

In [None]:
train['new_visits'] = train['new_visits'].fillna(0)
test['new_visits'] = test['new_visits'].fillna(0)

In [None]:
train['pageViews'] = train['pageViews'].fillna(0)
test['pageViews'] = test['pageViews'].fillna(0)

In [None]:
train.drop(columns=['sessionId'], inplace=True)
test.drop(columns=['sessionId'], inplace=True)

## User-Based Aggregation
- `purchaseValue` is aggregated per `userId` using `count`, `mean`, and `sum`
- `pageViews` is aggregated per `userId` using `sum`
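
Because the `purchaseValue` statistics are computed on the full training set and joined back onto the same rows, each training row's feature includes its own label, which tends to inflate training and validation scores. One common mitigation is out-of-fold aggregation, where each row only sees statistics computed from other folds. The sketch below is an illustration under that assumption, not the approach used in this notebook; `oof_user_mean` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def oof_user_mean(df, key="userId", target="purchaseValue", n_splits=5, seed=0):
    """Mean of `target` per `key`, computed for each row from the other folds only."""
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_splits, size=len(df))  # random fold id per row
    global_mean = df[target].mean()
    out = pd.Series(np.nan, index=df.index, dtype=float)
    for k in range(n_splits):
        in_fold = fold == k
        # Statistics come from rows outside fold k ...
        means = df.loc[~in_fold].groupby(key)[target].mean()
        # ... and are applied to rows inside fold k.
        out.loc[in_fold] = df.loc[in_fold, key].map(means).to_numpy()
    # Users with no rows outside their own fold fall back to the global mean.
    return out.fillna(global_mean)

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "userId": ["a", "a", "b", "b", "c"],
    "purchaseValue": [10.0, 0.0, 5.0, 5.0, 7.0],
})
toy["purchaseValue_mean_oof"] = oof_user_mean(toy, n_splits=2)
print(toy)
```

For the test set, the plain per-user statistics from the full training data can still be joined, since test labels are never involved.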

In [None]:
user_stats = train.groupby('userId').agg({
    'purchaseValue': ['count', 'mean', 'sum'],
    'pageViews': 'sum'
})

In [None]:
user_stats.columns = ['_'.join(col) for col in user_stats.columns]
train = train.join(user_stats, on='userId')
test = test.join(user_stats, on='userId')

In [None]:
for col in ['purchaseValue_mean', 'purchaseValue_sum', 'purchaseValue_count', 'pageViews_sum']:
    test[col] = test[col].fillna(train[col].mean())

In [None]:
test.isnull().sum()

In [None]:
target = 'purchaseValue'

# 3. Data Preprocessing and Splitting
In this module, we focus on the following:

- Splitting the training data into train and validation sets
- Defining a preprocessing step that target-encodes the categorical columns
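
Target encoding replaces each category with a statistic of the label, and a smoothing parameter pulls categories with few rows toward the global mean. The sketch below shows a simplified additive-smoothing variant to convey the idea; it is an assumption for illustration and not necessarily the exact formula that `category_encoders.TargetEncoder` applies. The name `smoothed_target_mean` and the toy series are hypothetical.

```python
import pandas as pd

def smoothed_target_mean(categories, target, m=7.0):
    """Blend each category's mean with the global mean; m acts like m pseudo-observations."""
    prior = target.mean()
    stats = target.groupby(categories).agg(["sum", "count"])
    # Categories with few rows stay close to the prior; frequent ones approach their own mean.
    encoding = (stats["sum"] + m * prior) / (stats["count"] + m)
    return categories.map(encoding)

# Toy data (hypothetical values)
toy_cat = pd.Series(["x", "x", "x", "y"])
toy_y = pd.Series([10.0, 10.0, 10.0, 0.0])

encoded = smoothed_target_mean(toy_cat, toy_y, m=1.0)
print(encoded.tolist())
```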

In [None]:
X = train.drop(columns=[target])
y = train[target]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
X.shape

In [None]:
train.info()

In [None]:
cat_cols = X.select_dtypes(include=['object', 'bool', 'category']).columns.tolist()

In [None]:
# cat_transform = Pipeline(steps=[
#     ('imputer', SimpleImputer(strategy='most_frequent')),
#     ('ohe', TargetEncoder())
# ])

preprocessor = ColumnTransformer(transformers=[
    ('cat', TargetEncoder(smoothing=7.0), cat_cols)
], remainder='passthrough')


# 4. Model Fitting
In this module, we focus on the following:

- Fit a pipeline (preprocessing plus model) for each of three tree-based models
- Set up hyperparameter tuning with RandomizedSearchCV for the XGBoost and bagging pipelines
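
When the estimator passed to `RandomizedSearchCV` is a `Pipeline`, each parameter is addressed as `<step name>__<parameter name>`. The sketch below illustrates this on synthetic data, with a scikit-learn regressor standing in for the notebook's models; the names `demo_pipeline`, `demo_params`, and `demo_search` and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

demo_pipeline = Pipeline(steps=[
    ("scale", RobustScaler()),
    ("model", GradientBoostingRegressor(random_state=0)),
])

# Keys follow the "<step name>__<parameter name>" convention
demo_params = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [2, 3],
    "model__learning_rate": [0.05, 0.1],
}

demo_search = RandomizedSearchCV(
    demo_pipeline, demo_params, scoring="r2",
    n_iter=4, cv=3, random_state=42,
)
demo_search.fit(X_demo, y_demo)  # the search only runs once fit is called
print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))
```

After fitting, `demo_search.best_estimator_` holds the pipeline refit on all of the data with the best parameters found.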

In [None]:
model = xgb.XGBRegressor(objective='reg:squarederror', random_state=0, n_jobs=-1, subsample=0.7,
                        n_estimators=500, min_child_weight=3, max_depth=8, learning_rate=0.1, 
                        colsample_bytree=0.7)

In [None]:
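# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`;
# the old name was removed in 1.4, so adjust this for newer versions.
bag_model = BaggingRegressor(
    base_estimator=DecisionTreeRegressor(
        max_depth=8,
        min_samples_leaf=3
    ), n_estimators=100, max_samples=0.7, max_features=0.7, random_state=0, n_jobs=-1)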

In [None]:
lgb_model = lgb.LGBMRegressor(objective='regression', random_state=0, n_estimators=300, learning_rate=0.1,
    subsample=0.7, colsample_bytree=0.7, max_depth=6, min_child_weight=3, n_jobs=-1)

In [None]:
pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', model)
])

In [None]:
bag_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('bag', bag_model)
])

In [None]:
lgbm_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('lgbm', lgb_model)  # step name matches the LightGBM model it holds
])

In [None]:
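# Keys are prefixed with the pipeline step name ('model') so that
# RandomizedSearchCV can route them to the XGBoost estimator.
param_grid = {
    'model__n_estimators': [200, 300, 500],
    'model__max_depth': [4, 6, 8],
    'model__learning_rate': [0.01, 0.05, 0.1],
    'model__subsample': [0.7, 0.8, 1.0],
    'model__colsample_bytree': [0.7, 0.8, 1.0],
    'model__min_child_weight': [1, 3, 5]
}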

In [None]:
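# Keys are prefixed with the pipeline step name ('bag'); nested tree
# parameters go through the bagging estimator's `base_estimator`.
param_grid_bagging = {
    'bag__n_estimators': [50, 100, 200],
    'bag__max_samples': [0.5, 0.7, 1.0],
    'bag__max_features': [0.5, 0.7, 1.0],
    'bag__base_estimator__max_depth': [5, 8, 12],
    'bag__base_estimator__min_samples_leaf': [1, 3, 5],
}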

In [None]:
search = RandomizedSearchCV(
    pipeline,
    param_grid,
    scoring='r2',
    n_iter=30,
    cv=3,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

In [None]:
search_bag = RandomizedSearchCV(
    bag_pipeline,
    param_grid_bagging,
    scoring='r2',
    n_iter=30,
    cv=3,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

In [None]:
pipeline.fit(X_train, y_train)

In [None]:
bag_pipeline.fit(X_train, y_train)

In [None]:
lgbm_pipeline.fit(X_train, y_train)

# 5. Model Evaluation and Prediction
In this module, we focus on the following:

- Evaluation of the Validation and Training R2 Score on all three models
- Observing the predictions
- Preparing the submission.csv file
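
The R2 score used throughout this section is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up numbers and shows why clipping negative predictions to zero cannot hurt when the true values are non-negative; `r2_manual` and the demo arrays are hypothetical names for illustration.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, the same definition sklearn's r2_score uses."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up values: the first prediction is negative although the target is never negative
y_true_demo = np.array([0.0, 0.0, 10.0, 20.0])
y_pred_demo = np.array([-1.0, 1.0, 9.0, 18.0])

r2_raw = r2_manual(y_true_demo, y_pred_demo)
# Clipping moves negative predictions toward the non-negative truth
r2_clipped = r2_manual(y_true_demo, np.clip(y_pred_demo, 0, None))
print(r2_raw, r2_clipped)
```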

In [None]:
y_train_pred = pipeline.predict(X_train)
r2_train = r2_score(y_train, y_train_pred)
print("XGBoost Model Train Score: ", r2_train)

In [None]:
y_val_pred = pipeline.predict(X_val)
r2 = r2_score(y_val, y_val_pred)
print("XGBoost Model Validation Score: ", r2)

In [None]:
y_train_pred_bag = bag_pipeline.predict(X_train)
r2_train_bag = r2_score(y_train, y_train_pred_bag)
print("Bagging Model Train Score: ", r2_train_bag)

In [None]:
y_bag_pred = bag_pipeline.predict(X_val)
r2_bag = r2_score(y_val, y_bag_pred)
print("Bagging Regressor Model Validation Score: ", r2_bag)

In [None]:
y_train_pred_lgbm = lgbm_pipeline.predict(X_train)
r2_train_lgbm = r2_score(y_train, y_train_pred_lgbm)
print("LGBM Model Train Score: ", r2_train_lgbm)

In [None]:
y_lgbm_pred = lgbm_pipeline.predict(X_val)
r2_lgbm = r2_score(y_val, y_lgbm_pred)
print("LGBM Model Validation Score: ", r2_lgbm)

In [None]:
booster = pipeline.named_steps['model'].get_booster()

In [None]:
feature_names = pipeline.named_steps['preprocess'].get_feature_names_out()
booster.feature_names = feature_names.tolist()

In [None]:
xgb.plot_importance(booster, importance_type='gain', xlabel='Gain')
plt.show()

In [None]:
y_test_pred = pipeline.predict(test)

In [None]:
len(y_test_pred[y_test_pred < 0])

In [None]:
y_test_pred = y_test_pred.clip(min=0)

In [None]:
submission = pd.DataFrame({
    "id": test.index,  
    "purchaseValue": y_test_pred.clip(min=0)
})

In [None]:
submission.to_csv("submission3.csv", index=False)

In [None]:
submission2 = pd.read_csv('/kaggle/input/engage-2-value-from-clicks-to-conversions/sample_submission.csv')
submission2.shape

In [None]:
submission2

In [None]:
submission3 = pd.read_csv('/kaggle/working/submission3.csv')

In [None]:
submission3