## Problem Statement

Most organizations today rely on email campaigns to communicate with their users. Email is a popular way to pitch products to users and build a trusting relationship with them.


Email campaigns contain different types of CTA (Call To Action). The ultimate goal of email campaigns is to maximize the Click Through Rate (CTR).


CTR is a measure of success for email campaigns: the higher the click rate, the better the campaign is performing. CTR is calculated as the number of users who clicked on at least one CTA divided by the total number of users the email was delivered to.


CTR = (No. of users who clicked on at least one CTA) / (No. of emails delivered)
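To make the formula concrete, here is a minimal sketch of the calculation. The function name and the campaign numbers are hypothetical and only illustrate the arithmetic; they are not taken from the data set.

```python
def click_through_rate(n_clicked: int, n_delivered: int) -> float:
    """CTR = users who clicked at least one CTA / emails delivered."""
    if n_delivered <= 0:
        raise ValueError("n_delivered must be positive")
    return n_clicked / n_delivered

# Hypothetical campaign: 1,200 emails delivered, 150 recipients clicked a CTA.
ctr = click_through_rate(150, 1200)
print(f"CTR = {ctr:.3f}")  # CTR = 0.125
```

The `click_rate` column we predict later is on this same 0 to 1 scale.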

## Objective

We have to build a machine learning-based approach to predict the CTR of an email campaign.

## Table of Contents

* __Step 1: Importing the Relevant Libraries__
    
* __Step 2: Data Inspection__
    
* __Step 3: Exploratory Data Analysis__
    
* __Step 4: Data preparation for Building our Model__

* __Step 5: Machine Learning Models and their evaluation__

* __Step 6: Final training of selected ML model__

* __Step 7: Creating submission file__

### Step 1: Importing the Relevant Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

import warnings
# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

### Step 2: Data Inspection

In [2]:
train = pd.read_csv("train_F3fUq2S.csv")
test = pd.read_csv("test_Bk2wfZ3.csv")

In [3]:
train.shape,test.shape

((1888, 22), (762, 21))

* __We have 1888 rows and 22 columns in Train set whereas Test set has 762 rows and 21 columns.__

In [5]:
#ratio of null values
train.isnull().sum()/train.shape[0] *100

campaign_id           0.0
sender                0.0
subject_len           0.0
body_len              0.0
mean_paragraph_len    0.0
day_of_week           0.0
is_weekend            0.0
times_of_day          0.0
category              0.0
product               0.0
no_of_CTA             0.0
mean_CTA_len          0.0
is_image              0.0
is_personalised       0.0
is_quote              0.0
is_timer              0.0
is_emoticons          0.0
is_discount           0.0
is_price              0.0
is_urgency            0.0
target_audience       0.0
click_rate            0.0
dtype: float64

In [5]:
#ratio of null values
test.isnull().sum()/test.shape[0] *100

campaign_id           0.0
sender                0.0
subject_len           0.0
body_len              0.0
mean_paragraph_len    0.0
day_of_week           0.0
is_weekend            0.0
times_of_day          0.0
category              0.0
product               0.0
no_of_CTA             0.0
mean_CTA_len          0.0
is_image              0.0
is_personalised       0.0
is_quote              0.0
is_timer              0.0
is_emoticons          0.0
is_discount           0.0
is_price              0.0
is_urgency            0.0
target_audience       0.0
dtype: float64

* __There are no missing values in either the Train or the Test set.__

In [6]:
#categorical features
categorical = train.select_dtypes(include=['object'])  # np.object was removed from recent NumPy releases
print("Categorical Features in Train Set:",categorical.shape[1])

#numerical features
numerical= train.select_dtypes(include =[np.float64,np.int64])
print("Numerical Features in Train Set:",numerical.shape[1])

Categorical Features in Train Set: 1
Numerical Features in Train Set: 21


In [6]:
#categorical features
categorical = test.select_dtypes(include=['object'])  # np.object was removed from recent NumPy releases
print("Categorical Features in Test Set:",categorical.shape[1])

#numerical features
numerical= test.select_dtypes(include =[np.float64,np.int64])
print("Numerical Features in Test Set:",numerical.shape[1])

Categorical Features in Test Set: 1
Numerical Features in Test Set: 20


In [8]:
train.head()

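| | campaign_id | sender | subject_len | body_len | mean_paragraph_len | day_of_week | is_weekend | times_of_day | category | product | ... | is_image | is_personalised | is_quote | is_timer | is_emoticons | is_discount | is_price | is_urgency | target_audience | click_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 76 | 10439 | 39 | 5 | 1 | Noon | 6 | 26 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 0.103079 |
| 1 | 2 | 3 | 54 | 2570 | 256 | 5 | 1 | Morning | 2 | 11 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 0.7 |
| 2 | 3 | 3 | 59 | 12801 | 16 | 5 | 1 | Noon | 2 | 11 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 16 | 0.002769 |
| 3 | 4 | 3 | 74 | 11037 | 30 | 4 | 0 | Evening | 15 | 9 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 0.010868 |
| 4 | 5 | 3 | 80 | 10011 | 27 | 5 | 1 | Noon | 6 | 26 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 14 | 0.142826 |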


* __We have 1 categorical feature in both the Train and Test sets: 'times_of_day'.__

### Step 3: Exploratory Data Analysis

In [9]:
train['times_of_day'].value_counts()

Evening    1317
Noon        447
Morning     124
Name: times_of_day, dtype: int64

* __There are no irregularities in the column, so we can encode this categorical feature to train our model.__

In [12]:
#Detecting Multicollinearity with VIF
X = train.drop(['times_of_day','campaign_id','click_rate'],axis=1)
# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                          for i in range(len(X.columns))]
vif_data

| | feature | VIF |
|---|---|---|
| 0 | sender | 3.382995 |
| 1 | subject_len | 11.648248 |
| 2 | body_len | 9.629524 |
| 3 | mean_paragraph_len | 3.121267 |
| 4 | day_of_week | 7.210303 |
| 5 | is_weekend | 2.622069 |
| 6 | category | 5.216433 |
| 7 | product | 3.09409 |
| 8 | no_of_CTA | 2.807581 |
| 9 | mean_CTA_len | 7.808439 |


* __As we can see, features like subject_len, body_len, mean_CTA_len and target_audience have high VIF values, indicating that each of them is largely explained by the other features. Using these features together therefore leads to a model with high multicollinearity.__

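* __Tree-based models like RandomForestRegressor and XGBRegressor are far less sensitive to multicollinearity than linear models: their predictions are largely unaffected, although feature importance can be split between correlated features.__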
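To show what a VIF value measures, here is a minimal NumPy-only sketch on synthetic data: each feature is regressed on the remaining ones and VIF = 1 / (1 - R²). The helper `vif` and the toy columns are hypothetical. This sketch adds an intercept to each regression, which the statsmodels call above does not do on its own, so its numbers are not expected to match the table.

```python
import numpy as np

def vif(X: np.ndarray, i: int) -> float:
    """VIF of column i: 1 / (1 - R^2) from regressing column i on the other columns."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    # Add an intercept so that R^2 is the usual centered R^2.
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

# Synthetic features: `c` is nearly a copy of `a`, `b` is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
X_demo = np.column_stack([a, b, c])

vifs = [vif(X_demo, i) for i in range(X_demo.shape[1])]
print([round(v, 2) for v in vifs])
```

The two nearly duplicated columns get a large VIF, while the independent column stays close to 1.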

### Step 4: Data preparation for Building our Model

In [14]:
#Label encoding: fit on train and reuse the same mapping for test
var_mod = train.select_dtypes(include='object').columns
for i in var_mod:
    le = LabelEncoder()
    train[i] = le.fit_transform(train[i])
    test[i] = le.transform(test[i])

* __Encoding the categorical columns of the training and test datasets.__
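The integer codes are only comparable between the two sets when they come from one mapping learned on the training data. The idea can be sketched in plain Python; the helper names and toy columns below are hypothetical, and the sorted-order assignment mirrors how LabelEncoder orders its classes.

```python
def fit_label_mapping(values):
    """Map each distinct label to an integer, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

def apply_label_mapping(values, mapping):
    """Encode values with an existing mapping; unseen labels raise KeyError."""
    return [mapping[v] for v in values]

# Hypothetical toy columns; the test column happens to lack 'Morning'.
train_col = ["Noon", "Evening", "Morning", "Evening"]
test_col = ["Noon", "Evening", "Noon"]

mapping = fit_label_mapping(train_col)
print(mapping)                                 # {'Evening': 0, 'Morning': 1, 'Noon': 2}
print(apply_label_mapping(test_col, mapping))  # [2, 0, 2]

# Fitting on the test column alone would give 'Noon' a different code.
print(fit_label_mapping(test_col))             # {'Evening': 0, 'Noon': 1}
```

If both sets contain exactly the same categories, separate fits happen to agree; reusing the training mapping removes that dependency.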

In [222]:
train.columns

Index(['campaign_id', 'sender', 'subject_len', 'body_len',
       'mean_paragraph_len', 'day_of_week', 'is_weekend', 'times_of_day',
       'category', 'product', 'no_of_CTA', 'mean_CTA_len', 'is_image',
       'is_personalised', 'is_quote', 'is_timer', 'is_emoticons',
       'is_discount', 'is_price', 'is_urgency', 'target_audience',
       'click_rate'],
      dtype='object')

In [27]:
# Separate valid features and target
X= train.drop(columns = ['campaign_id','click_rate','is_timer', 'is_personalised', 'is_discount'])
y= train['click_rate']

In [29]:
# 20% data as validation set
X_train,X_valid,y_train,y_valid = train_test_split(X,y,test_size=0.2,random_state=22)

### Step 5: Machine Learning Models and their evaluation

In [30]:
#Comparing different Machine Learning Models
algos = [LinearRegression(),  Ridge(), Lasso(),
          KNeighborsRegressor(), DecisionTreeRegressor(), RandomForestRegressor(), XGBRegressor()]

names = ['Linear Regression', 'Ridge Regression', 'Lasso Regression',
         'K Neighbors Regressor', 'Decision Tree Regressor', 'Random Forest Regressor', 'XGBoost']

r2_list = []
for model in algos:
    # Fit on the training split and score on the held-out validation split
    model.fit(X_train,y_train)
    y_pred = model.predict(X_valid)
    r2= r2_score(y_valid,y_pred)
    r2_list.append(r2)
evaluation = pd.DataFrame({'Model': names,
                           'R2_score': r2_list})
evaluation

| | Model | R2_score |
|---|---|---|
| 0 | Linear Regression | 0.124127 |
| 1 | Ridge Regression | 0.124132 |
| 2 | Lasso Regression | 0.06033 |
| 3 | K Neighbors Regressor | 0.249348 |
| 4 | Decision Tree Regressor | 0.001897 |
| 5 | Random Forest Regressor | 0.468798 |
| 6 | XGBoost | 0.578963 |


* __We will proceed with the XGBoost regression model, as it gives the highest validation R2 score (0.579).__
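To help read the scores in the comparison, here is a minimal sketch of how the R2 score is defined: 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper `r2` and the click rates are hypothetical; in the notebook we use `r2_score` from scikit-learn.

```python
import numpy as np

def r2(y_true, y_pred) -> float:
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return float(1.0 - ss_res / ss_tot)

y_true = [0.10, 0.02, 0.30, 0.05]  # hypothetical click rates

perfect = r2(y_true, y_true)                    # predictions equal the targets
baseline = r2(y_true, [np.mean(y_true)] * 4)    # always predict the mean
print(perfect, baseline)
```

A score of 1 means perfect predictions and a score near 0 means the model does no better than predicting the mean click rate, which is roughly where the untuned decision tree lands.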

In [31]:
#Performing Grid Search to find the best parameters for XGBoost
from sklearn.model_selection import GridSearchCV
params = { 'max_depth': [20, 25, 30], 
           'learning_rate': [0.05, 0.1],
           'n_estimators': [50, 100, 150],
           'colsample_bytree': [0.2, 0.3],
           'min_child_weight': [0, 1]}
xgbr = XGBRegressor(seed = 20)
clf = GridSearchCV(estimator=xgbr, 
                   param_grid=params,
                   scoring='r2', 
                   verbose=1)
clf.fit(X, y)
print("Best parameters:", clf.best_params_)

Fitting 5 folds for each of 72 candidates, totalling 360 fits
Best parameters: {'colsample_bytree': 0.3, 'learning_rate': 0.05, 'max_depth': 20, 'min_child_weight': 0, 'n_estimators': 100}
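The "72 candidates, 360 fits" in the log follow directly from the grid: every combination of the listed values is tried once per cross-validation fold. A minimal sketch of that count, assuming the 5 folds reported in the log:

```python
from itertools import product

params = {'max_depth': [20, 25, 30],
          'learning_rate': [0.05, 0.1],
          'n_estimators': [50, 100, 150],
          'colsample_bytree': [0.2, 0.3],
          'min_child_weight': [0, 1]}

# One candidate per combination of parameter values: 3 * 2 * 3 * 2 * 2
candidates = list(product(*params.values()))
n_folds = 5  # number of folds reported in the search log
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)  # 72 360
```

Adding one more value to any parameter multiplies the number of fits, so the grid grows quickly.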


In [33]:
my_model = XGBRegressor(n_estimators=100, max_depth=20, learning_rate=0.05, colsample_bytree=0.3, min_child_weight=0, seed=20)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)],
             verbose=False)

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.3,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.05, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=20, max_leaves=0, min_child_weight=0,
             missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=20,
             reg_alpha=0, reg_lambda=1, ...)

In [34]:
# Model Evaluation
y_pred = my_model.predict(X_valid)
r2_score(y_valid, y_pred)

0.500449233840567

### Step 6: Final training of selected ML model

In [35]:
my_model.fit(X, y)

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.3,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.05, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=20, max_leaves=0, min_child_weight=0,
             missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=20,
             reg_alpha=0, reg_lambda=1, ...)

### Step 7: Creating submission file

In [37]:
campaign_id = test["campaign_id"]
test = test.drop(columns = ['campaign_id','is_timer', 'is_personalised', 'is_discount'])
final_predictions = my_model.predict(test)
final_predictions = pd.DataFrame(final_predictions, columns=['click_rate'])
submission = pd.concat([campaign_id, final_predictions], axis=1)
submission.to_csv('my_submission_A.csv', index=False)