<a href="https://colab.research.google.com/github/meyush0/Email-Campaign-Effectiveness-Prediction/blob/main/Email_Campaign_Effectiveness_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name** -  **Email Campaign Effectiveness Prediction**



##### **Project Type**    - Classification
##### **Contribution**    - Ayush Singh( Individual )

# **Project Summary -**

An email campaign is a sequence of marketing efforts that contacts multiple recipients at once. Email campaigns are designed to reach subscribers at the best time with valuable content and relevant offers, which helps businesses build deep, trusting relationships with their customers. Email marketing can make communication with clients easier and more effective. It is a powerful channel between a business and its audience: it helps not only to increase sales but also to build brand image. Most small to medium business owners use Gmail-based email marketing strategies for offline targeting, converting prospective customers into leads so that they stay with the business.

The main objective is to create a machine learning model that characterizes each mail and tracks whether it is ignored, read, or acknowledged by the reader.

Exploratory data analysis helped us understand the features, the relationships between them, and their impact on the target (the client's response), and it surfaced important insights.

Because the data is labeled and the target column is categorical, I used classification algorithms for the prediction task.

The email campaign data contains information about the emails that were sent, the customers, and their responses. Checking the shape of the data, I found that it has 68353 observations and 12 columns.

Null values and outliers were treated accordingly, and I checked the distributions of the features. To reduce multicollinearity, I created new features from existing correlated features. I used the Synthetic Minority Oversampling Technique (SMOTE) to handle the imbalance in the target column. I trained Logistic Regression, Random Forest, and XGBoost classifiers and tuned their hyperparameters. To evaluate the models, I split the data into a training set and a testing set, fitting each model on the training set and evaluating it on the testing set. I looked at several metrics, such as precision, recall, and F1 score. Because the problem statement asks us to characterize mails based on the user's response, I chose the F1 score and the AUC-ROC score as the main metrics, compared them across the classifiers, and selected the best model. Finally, I checked the feature importance of the best-performing model.

Once the best model is identified, it can be deployed in a production environment to help small to medium business owners improve the effectiveness of their email marketing campaigns. By using the model to characterize and track emails, they will be able to make more informed decisions about how to target their marketing efforts and increase customer retention.

# **GitHub Link -**

https://github.com/meyush0/Email-Campaign-Effectiveness-Prediction/tree/main

# **Problem Statement**


Small to medium business owners are using Gmail-based email marketing strategies to convert prospective customers into leads, but they are unable to track which emails are being ignored, read, or acknowledged by the reader. They want to create a machine learning model to help characterize and track these emails. The main objective is to improve the effectiveness of their email marketing efforts and increase customer retention.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#Load dataset
data=pd.read_csv("/content/drive/MyDrive/ML - Capstone ( Classification)/data_email_campaign.csv")

### Dataset First View

In [None]:
# Dataset First 5 Look
data.head()

In [None]:
#last 5 values
data.tail()

### Dataset Rows & Columns count

In [None]:
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns")

### Dataset Information

In [None]:
data.info()

#### Duplicate Values

In [None]:
print(f"There are {len(data[data.duplicated()])} duplicates in the dataset")

#### Missing Values/Null Values

In [None]:
data.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(data.isnull(),cbar=False)

### What did you know about your dataset?

I got to know the following things about the dataset:
* There are 68353 rows and 12 columns present in the dataset.
* The datatype of each column.
* There are null values in four features: Customer_Location, Total_Past_Communications, Total_Links, and Total_Images.
* There are no duplicate rows.

## ***2. Understanding Your Variables***

In [None]:
#column names
data.columns.values

In [None]:
#new dataframe
Cat_data=data.copy()

In [None]:
#change the dtype of ['Email_Type','Email_Source_Type','Email_Campaign_Type','Time_Email_sent_Category','Email_Status'] from int to str (object)
#because these are nominal categorical variables and we want their categorical description.
Cat_data[['Email_Type','Email_Source_Type','Email_Campaign_Type','Time_Email_sent_Category','Email_Status']]=Cat_data[['Email_Type',
                                          'Email_Source_Type','Email_Campaign_Type','Time_Email_sent_Category','Email_Status']].astype("str")

In [None]:
# categorical Description
Cat_data.describe(include=object)

### Observation

* There are two unique email types, 1 and 2; type 1 is the most frequent, with 48866 emails.

* Email_Source_Type has 2 unique values, 1 and 2; 1 is the most frequent, with 37149 emails.

* There are 7 different customer locations. Most customers are in location G, with frequency 23173.

* Email_Campaign_Type has 3 unique values (1, 2 and 3); type 2 is the most used, with frequency 48273.

* Time_Email_sent_Category has 3 categories (1, 2 and 3); 2 is the most common time slot for sending emails.

* Email_Status also has 3 categories (0, 1 and 2); most emails have status 0.

In [None]:
#numerical description
Cat_data.describe()

### **Observations:**

* Subject_Hotness_Score is a float ranging from 0 to 5, where 0 is not hot and 5 is very hot. The average subject hotness score in this dataset is around 1.10.

* The average of Total_Past_Communications is around 29; the maximum is around 67 and the minimum is 0.

* The average word count is around 700 words. The longest email has around 1316 words and the shortest has around 40 words.

* The average number of links per email is around 10; the maximum is around 49 links and the minimum is 1 link.

* The average number of images per email is around 4; the maximum is around 45 images and the minimum is 0.
---

# Variables Description

* **Email_ID** - Email ID of the customer.
* **Email_Type** - Contains 2 categories, 1 and 2. We can assume that the types are something like promotional email or sales email.
* **Subject_Hotness_Score** - A score for the email's subject, based on how good and effective the content is.
* **Email_Source_Type** - The source of the email, such as sales, marketing or product.
* **Email_Campaign_Type** - The campaign type of the email.
* **Customer_Location** - Categorical data describing the demographic location of the customer.
* **Total_Past_Communications** - The total number of previous mails from the same source.
* **Time_Email_sent_Category** - Has 3 categories, 1, 2 and 3, which are taken to be the morning, evening and night time slots.
* **Word_Count** - Total number of words in the email.
* **Total_Links** - Total number of links in the email.
* **Total_Images** - Total number of images in the email.
* **Email_Status** - Our target variable, which records whether the mail was ignored, read, or acknowledged by the reader.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in data.columns.tolist():
  print("No. of unique values in '{}' is {}.".format(i, data[i].nunique()))

## 3. ***Data Wrangling***

Data wrangling is the process of gathering, cleaning, and transforming raw data into another format for better understanding, decision-making, access, and analysis in less time. Data wrangling is also known as data munging.

In [None]:
data.drop(columns="Email_ID",inplace=True)

In [None]:
## extracting the numerical features

num_feature=Cat_data.describe().columns.values

In [None]:
## creating a set of categorical features excluding 'Email_Status' and 'Email_ID'.
cat_feature=set(Cat_data.describe(include="object").columns.values)-{'Email_Status','Email_ID'}

In [None]:
data.head()

In [None]:
table =[{value:pd.pivot_table(data, values =value, index =['Email_Type', 'Email_Source_Type','Time_Email_sent_Category'],
                         columns =['Email_Status'], aggfunc = np.sum)} for value in num_feature]
table

### What all manipulations have you done and insights you found?

From the above manipulations, for the columns of Email_Status 0, 1 and 2 respectively, I found out that

* Subject_Hotness_Score is maximum for Email_Type 1, for Email_Source_type 2 and for Time_Email_Sent_Category 2.

* Total_Past_Communications is maximum for Email_Type 1, for Email_Source_type 1 and for Time_Email_Sent_Category 2.

* Email_Status is maximum for Email_Type 1, for Email_Source_type 2 and for Time_Email_Sent_Category 2.

* Total_Links is maximum for Email_Type 1, for Email_Source_type 2 and for Time_Email_Sent_Category 2.

* Total_Images is maximum for Email_Type 1, for Email_Source_type 2 and for Time_Email_Sent_Category 2.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1  Discrete Distribution of Target Variable

In [None]:
# visualize the target variable
data2 = data['Email_Status'].value_counts()

# Plot pie chart
plt.figure(figsize=(10, 6))
plt.pie(data2, autopct='%1.2f%%', labels=data2.index, colors=('lightpink', 'yellow', 'thistle'))
plt.title('Email_Status Distribution', fontsize=18)
plt.show()

##### 1. Why did you pick the specific chart?

Our target variable is categorical, and bar charts and pie charts are typically used to visualize categorical data.

* A **bar chart** places the separate values of the data on the x-axis and the height of the bar indicates the count of that category.
* A **pie plot** is a proportional representation of the numerical data in a column.

##### 2. What is/are the insight(s) found from the chart?

From the pie chart above, we conclude that
* No. of emails read: 11039, i.e., 16.15%
* No. of emails acknowledged: 2373, i.e., 3.47%
* No. of emails ignored: 54941, i.e., 80.38%

This result shows that most of the emails were ignored.
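
As a quick check of the arithmetic, the percentages can be recomputed from the class counts reported above. This is only a sketch: the label names are chosen here for readability, and the imbalance ratio is an extra illustration rather than something the notebook computes.

```python
# Class counts as reported above (0 = ignored, 1 = read, 2 = acknowledged).
counts = {"Ignored": 54941, "Read": 11039, "Acknowledged": 2373}
total = sum(counts.values())  # should equal the number of rows in the dataset

# Share of each class as a percentage, rounded to two decimals.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}
for label, pct in shares.items():
    print(f"{label}: {pct}%")

# Ratio of the largest class to the smallest, a rough measure of imbalance.
imbalance_ratio = counts["Ignored"] / counts["Acknowledged"]
print(f"Ignored emails per acknowledged email: {imbalance_ratio:.1f}")
```

A ratio this large is the reason the target column needs imbalance handling before modelling.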

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights help us understand the effectiveness of the email campaign. Only 3.47% of emails were acknowledged, whereas 16.15% were read and 80.38% were ignored. If we can identify the factors that lead an email to be acknowledged and apply them to other emails, this can support business growth.

Negative growth here would mean a drop in the number of acknowledged emails. Since acknowledged emails are already a small share compared with the other classes, any further decrease would hurt the business.

#### Chart - 2 Distribution of Numerical Variables

In [None]:
# visualization of numerical feature
from scipy import stats

for col in num_feature :
    # sns.set_style("ticks")
    # sns.set_context("poster");
    plt.figure(figsize=(25,6))

    plt.subplot(1, 3, 1)
    fig = sns.boxplot(y=data[col], color='#00FF7F')
    fig.set_ylabel(col)

    plt.subplot(1, 3, 2)
    # note: distplot is deprecated in recent seaborn releases; it is kept here for its fit= overlay
    sns.distplot(data[col], color = '#055E85', fit = stats.norm);
    feature = data[col]
    plt.axvline(feature.mean(), color='#ff033e', linestyle='dashed', linewidth=3,label= 'mean');  #red
    plt.axvline(feature.median(), color='#A020F0', linestyle='dashed', linewidth=3,label='median'); #purple
    plt.legend(bbox_to_anchor = (1.0, 1), loc = 'upper right')

    plt.subplot(1,3,3)
    f=sns.boxplot(x=data["Email_Status"],y=data[col])
    f.set_xticklabels(['Ignored','Read', 'Acknowledged'])


##### 1. Why did you pick the specific chart?

Distplots and boxplots are well suited to continuous variables: they show the distribution of the data and make outliers and quartile positions visible.

##### 2. What is/are the insight(s) found from the chart?

***Boxplot***
---
A boxplot gives us a five-number summary, which consists of the minimum, the first quartile, the median, the third quartile, and the maximum.

From the boxplots we can see the outliers: all numerical variables have outliers except Word_Count.
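
To make the five-number summary concrete, here is a minimal sketch on toy values (not taken from the dataset). It assumes NumPy's default linear interpolation for percentiles, which is also the default of pandas `quantile`.

```python
import numpy as np

# Toy values, chosen only for illustration; 100 plays the role of an outlier.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Quartiles with NumPy's default (linear) interpolation.
q1, median, q3 = np.percentile(values, [25, 50, 75])

five_number = {
    "min": values.min(),
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": values.max(),
}
print(five_number)
```

Note that the maximum (100) sits far from the third quartile, which is exactly the kind of point a boxplot draws as an individual marker beyond the whisker.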

---
***Distplot***
---
The direction of skew in the data can be judged from the mean, median and mode.

If mean > median > mode, the distribution is positively skewed.

If mean = median = mode, the distribution is symmetric (no skew), as is the case for a normal distribution.

If mean < median < mode, the distribution is negatively skewed.

Subject_Hotness_Score, Total_Images and Total_Links are positively skewed, whereas Word_Count and Total_Past_Communications show a roughly normal distribution.
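
The mean-versus-median rule can be checked against a numeric skewness coefficient. The sketch below uses two toy samples (not from the dataset) and `scipy.stats.skew`, the same function used later in this notebook.

```python
import numpy as np
from scipy.stats import skew

# Toy samples for illustration only.
right_skewed = np.array([1, 1, 2, 2, 3, 4, 10, 20])  # long tail to the right
symmetric = np.array([1, 2, 3, 4, 5])

for name, sample in [("right_skewed", right_skewed), ("symmetric", symmetric)]:
    print(name, "mean:", sample.mean(), "median:", np.median(sample),
          "skewness:", skew(sample))

# For the right-skewed sample the mean is pulled above the median
# and the skewness coefficient is positive.
mean_above_median = right_skewed.mean() > np.median(right_skewed)
```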

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These plots were drawn to understand the distribution of each variable, which helps when building the model and treating null values or outliers. They also show how the values of each variable relate to creating an effective email campaign.

No insights here point to negative growth.

#### Chart - 3  Distribution of Categorical Variables and Categorical Features vs Email Status

In [None]:
categorical_variables = ['Email_Type','Email_Source_Type','Customer_Location','Email_Campaign_Type','Time_Email_sent_Category']
Target_var = ['Email_Status']

for value in categorical_variables:
  ax = sns.countplot(x=data[value], hue=data[Target_var[0]])
  # number of non-null categories (x==x is False for NaN)
  unique = len([x for x in data[value].unique() if x==x])
  bars = ax.patches
  for j in range(unique):
      # bars of category j, one per Email_Status class
      catbars=bars[j:][::unique]
      #get height
      total = sum([x.get_height() for x in catbars])
      #print percentage
      for bar in catbars:
        ax.text(bar.get_x()+bar.get_width()/2.,
                    bar.get_height(),
                    f'{bar.get_height()/total:.0%}',
                    ha="center",va="bottom")
  plt.show()


##### 1. Why did you pick the specific chart?

A countplot is well suited to categorical variables: it shows the number of observations in each category.

##### 2. What is/are the insight(s) found from the chart?

We can observe:
* The percentage of each class within each category.
* The distribution of Email_Status is similar across the categories of most variables. The exception is Email_Campaign_Type, which shows a very different trend: for Email_Campaign_Type = 1, no. of emails ignored < no. of emails acknowledged < no. of emails read.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights help us understand the effectiveness of the email campaign. Campaign type 1 shows more engagement, so increasing its use should move the company towards positive growth.

Conversely, if the company increases campaign type 2, it moves in the direction of negative growth.

#### Chart - 4 Correlation Heatmap

#### Pearson Correlation

In [None]:
# Correlation Heatmap visualization code
sns.set_context('notebook')
plt.figure(figsize = (14,8))
plt.xticks(fontsize= 14)
plt.yticks(fontsize= 14)
sns.heatmap(data[num_feature].corr(), annot=True,linewidth=.5,cmap="Greens")
plt.title("Pearson Correlation")

##### 1. Why did you pick the specific chart?

A correlation heatmap is a graphical representation of a correlation matrix representing the correlation between different variables. Each cell shows the correlation between two variables. The value of correlation can take any value from -1 to 1.

The Pearson correlation coefficient measures the strength of a linear association between two numerical variables.

Thus, to see the correlation between all the numerical variables along with the correlation coefficients, I used a Pearson correlation heatmap.

##### 2. What is/are the insight(s) found from the chart?

Using Pearson correlation, we observed a high positive correlation (0.78) between Total_Links and Total_Images, which indicates multicollinearity.
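
For reference, the Pearson coefficient shown in each heatmap cell is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch on toy arrays (not the dataset), compared against `np.corrcoef`:

```python
import numpy as np

# Toy data for illustration only.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# Deviations from the mean.
dx, dy = x - x.mean(), y - y.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)).
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same value from NumPy's correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```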

#### Spearman Correlation

In [None]:
import scipy.stats
plt.figure(figsize = (14,8))
plt.xticks(fontsize= 14)
plt.yticks(fontsize= 14)
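# numeric_only=True skips non-numeric columns such as Customer_Location (required by pandas >= 2.0)
heatmap= sns.heatmap(data.corr(method="spearman", numeric_only=True),vmin= -1,vmax=1,annot=True)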
plt.title("Spearman Correlation")

##### 1. Why did you pick the specific chart?

Correlation heatmaps can be used to find potential relationships between variables and to understand the strength of these relationships. The value of correlation can take any value from -1 to 1.

Spearman rank correlation is a non-parametric measure of the strength of the monotonic association between two variables.

Thus, to see the correlation between all the variables along with the correlation coefficients, I used a Spearman correlation heatmap.

##### 2. What is/are the insight(s) found from the chart?

Using the Spearman correlation heatmap, we observed that
* There is a high positive correlation (0.64) between Total_Links and Total_Images.
* There is a high negative correlation (-0.68) between Email_Campaign_Type and Subject_Hotness_Score.
* Compared with all other variables, Total_Past_Communications has the highest correlation with the target variable (Email_Status).
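
Spearman correlation is equivalent to Pearson correlation computed on the ranks of the values, which is why it captures monotonic but non-linear relationships. A small sketch with a toy DataFrame (column names `x` and `y` are hypothetical):

```python
import pandas as pd

# Toy data: y = x**3 is perfectly monotonic but not linear.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 8, 27, 64, 125]})

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

# Pearson correlation of the ranks reproduces the Spearman value.
pearson_on_ranks = df.rank().corr(method="pearson").loc["x", "y"]

print(pearson, spearman, pearson_on_ranks)
```

Here the Pearson value stays below 1 because the relationship is not a straight line, while the rank-based value treats it as a perfect association.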

#### Chart - 5 Pair Plot

In [None]:
sns.pairplot(data,hue ="Email_Status", diag_kind = "kde" ,kind = "scatter",palette = "rocket")

##### 1. Why did you pick the specific chart?

A pair plot shows the distribution of each variable on the diagonal and the pairwise relationships between variables off the diagonal.

Pair plots are used to find the set of features that best explains a relationship between two variables or that forms the most separated clusters.

Thus, I used a pair plot to analyse the patterns in the data and the relationships between the features. It complements the correlation heatmap by showing the same relationships graphically.

##### 2. What is/are the insight(s) found from the chart?

From the chart above, there are few linear relationships between the variables.

Total_Links and Total_Images show some linear relation, and we already know they are correlated from the earlier heatmap.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset.

### Hypothetical Statement - 1
----
Test whether Total_Past_Communications has a normal distribution.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.


* Null Hypothesis H0: Total_Past_Communications has a normal distribution.

* Alternative Hypothesis H1: Total_Past_Communications does not have a normal distribution.
* Test Type : Shapiro-Wilk Test

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# perform Shapiro-Wilk Normality Test
from scipy.stats import shapiro
# drop missing values first: Total_Past_Communications still contains nulls at this point,
# and NaN values would otherwise invalidate the result
stat, p = shapiro(data["Total_Past_Communications"].dropna())
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Fail to reject the null hypothesis, i.e. Total_Past_Communications is consistent with a normal distribution')
else:
    print('Reject the null hypothesis, i.e. Total_Past_Communications does not have a normal distribution.')

##### Which statistical test have you done to obtain P-Value?

I used the Shapiro-Wilk test to obtain the p-value and check whether the data has a normal distribution.

##### Why did you choose the specific statistical test?

The Shapiro-Wilk test is an appropriate test for the normality of data. I applied it to Total_Past_Communications because this variable has the highest correlation with the target variable (Email_Status) among all the numerical variables.
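
To show how the test behaves, the sketch below runs it on two synthetic samples (generated here, not taken from the dataset): one drawn from a normal distribution and one from a strongly skewed exponential distribution. The sample sizes, seed and parameters are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)

# Synthetic samples for illustration only.
normal_sample = rng.normal(loc=30, scale=10, size=500)
skewed_sample = rng.exponential(scale=10, size=500)

stat_normal, p_normal = shapiro(normal_sample)
stat_skewed, p_skewed = shapiro(skewed_sample)

print(f"normal sample: W={stat_normal:.3f}, p={p_normal:.3f}")
print(f"skewed sample: W={stat_skewed:.3f}, p={p_skewed:.3g}")

# A small p-value means we reject the null hypothesis of normality.
reject_skewed = p_skewed < 0.05
```

One caveat worth keeping in mind: the SciPy documentation notes that for more than 5000 observations the p-value may not be accurate, and this dataset has 68353 rows, so the result is best read together with the distribution plots.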

### Hypothetical Statement - 2
---
The Email_Type of the campaign will not have any significant impact on the Email_Status

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null Hypothesis: There is no relationship between Email_Type and Email_Status.
* Alternative Hypothesis: There is a relationship between Email_Type and Email_Status.
* Test Type : chi-square test of independence.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# perform chi-square test of independence
chi2, p_value, dof, expected = stats.chi2_contingency(pd.crosstab(data['Email_Type'], data['Email_Status']))

if p_value < 0.05:
    print("Reject the null hypothesis - the Email_Type has a significant impact on the Email_Status")
else:
    print("Fail to reject the null hypothesis - the Email_Type does not have a significant impact on the Email_Status")

##### Which statistical test have you done to obtain P-Value?

For this hypothesis, I used the chi-square test of independence, a statistical test that determines whether there is a significant association between two categorical variables. In this case, the two variables are Email_Type and Email_Status.

##### Why did you choose the specific statistical test?

This test is appropriate for determining whether a relationship exists between two categorical variables.
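
As an illustration of what the test computes, here is a sketch on a made-up 2x3 contingency table (two email types by three statuses; the counts are invented). It shows the expected counts under independence alongside the statistic returned by `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = 2 email types, columns = 3 statuses.
observed = np.array([[50, 30, 20],
                     [20, 30, 50]])

chi2, p_value, dof, expected = chi2_contingency(observed)

# Expected counts under independence: row_total * column_total / grand_total.
manual_expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

print("expected counts:\n", expected)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.3g}")
```

The degrees of freedom are (rows - 1) x (columns - 1), so a 2x3 table like Email_Type by Email_Status has 2.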

### Hypothetical Statement - 3
----
The Customer_Location will not have any significant impact on the Total_Links, Total_Images and Total_Past_Communications in the email

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null Hypothesis: The mean of Total_Images is equal among the locations (A, B, C, D, E, F, G) (H0: μ1 = μ2 = μ3 = μ4 = μ5 = μ6 = μ7)
* Alternative Hypothesis: The mean of Total_Images is not equal among the locations (A, B, C, D, E, F, G) (H1: at least one mean is different from the others)
* Test Type : ANOVA Test

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# perform ANOVA test on Total_Images across locations
# drop missing values first: f_oneway returns NaN if any group contains NaN
anova_data = data[['Customer_Location', 'Total_Images']].dropna()
# one array of Total_Images per location (A to G)
groups = [group['Total_Images'] for _, group in anova_data.groupby('Customer_Location')]
f_value, p_value = stats.f_oneway(*groups)
if p_value < 0.05:
    print("Reject the null hypothesis - the Customer_Location has a significant impact on the Total_Images in the email")
else:
    print("Fail to reject the null hypothesis - the Customer_Location does not have a significant impact on the Total_Images in the email")


##### Which statistical test have you done to obtain P-Value?

For this hypothesis, I used the ANOVA (Analysis of Variance) test to obtain the p-value, because ANOVA determines whether there is a statistically significant difference between the means of two or more groups.

##### Why did you choose the specific statistical test?

In this case, we have different locations (A, B, C, D, E, F, G) and we want to determine whether there is a significant difference in the means of Total_Links, Total_Images and Total_Past_Communications among these groups. The test above is run for Total_Images; the same procedure applies to the other two variables.

ANOVA is an appropriate test here because Total_Links, Total_Images and Total_Past_Communications are numerical variables and we want to compare the means of multiple groups.
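
The grouping step can be sketched on a small made-up table. The column names match the dataset, but the values are invented, and one missing value is included to show why rows with nulls are dropped before calling `f_oneway`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: three locations with clearly different means.
toy = pd.DataFrame({
    "Customer_Location": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "Total_Images": [1, 2, 3, np.nan, 10, 11, 12, 13, 20, 21, 22, 23],
})

# Drop rows where the tested variable is missing.
clean = toy.dropna(subset=["Total_Images"])

# One array per location, passed to f_oneway as separate groups.
groups = [g["Total_Images"].to_numpy() for _, g in clean.groupby("Customer_Location")]
f_value, p_value = stats.f_oneway(*groups)

print(f"F={f_value:.2f}, p={p_value:.3g}")
```

Running the same steps with `Total_Links` or `Total_Past_Communications` in place of `Total_Images` covers the other two variables named in the hypothesis.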

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
## Lets go and see the percentage of missing values
data.isnull().mean()*100

In [None]:
#divide columns on the basis of percentage of missing values.
null_percent_col=data.isnull().mean()[data.isnull().mean()>0]
null_percentl_col=null_percent_col[null_percent_col.values<0.05]
null_percenth_col=data.isnull().mean()[data.isnull().mean()>0.05]

In [None]:
#all columns with missing values (fraction of rows missing greater than 0).
null_percent_col

In [None]:
#missing value percentage less than 5%.
null_percentl_col

In [None]:
# missing value percentage greater than 5%.
null_percenth_col

In [None]:
#summary about columns on the basis of percentage of missing values.
print(f"There are {len(null_percent_col.index)} columns {null_percent_col.index.values}  having null values and  the columns which have less than 5% null values are {null_percentl_col.index.values} \n and more than 5% null values are {null_percenth_col.index.values}")

When the percentage of missing data in a column is high, we remove that column. Hence I drop the column 'Customer_Location'.

In [None]:
data2=data.drop(columns=['Customer_Location'])

In [None]:
#checking distribution of other null value to find correct way to impute
for cat in ['Total_Past_Communications','Total_Links','Total_Images']:
    sns.distplot(data2[cat], hist= True);
    feature = data2[cat]
    plt.axvline(feature.mean(), color='#ff033e', linestyle='dashed', linewidth=1,label= 'mean');  #red
    plt.axvline(feature.median(), color='#A020F0', linestyle='dashed', linewidth=1,label='median'); #purple
    plt.legend(bbox_to_anchor = (1.0, 1), loc = 'upper right')
    plt.title(f'{cat.title()}');
    plt.xlabel(cat)
    plt.show()
    print('='*120)

Since all these variables are positively skewed, we fill the null values with the median, because it is not influenced by outliers.

In [None]:
# assign the result back instead of using inplace=True on a column selection,
# which may not modify data2 in recent pandas versions
for col_name in ["Total_Links", "Total_Images", "Total_Past_Communications"]:
    data2[col_name] = data2[col_name].fillna(data2[col_name].median())

In [None]:
## Again check the percentage of missing values per column
data2.isnull().mean()*100

#### What all missing value imputation techniques have you used and why did you use those techniques?

Missing value imputation replaces null values with a measure of central tendency. The most commonly used measures are the mean, median, and mode.

I replaced the null values with the median in all these variables because they are positively skewed, and the median, being the middle value of the data, is not influenced by outliers.
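
A small sketch of why the median is preferred here, using a toy Series (values invented) with one extreme value and one missing entry:

```python
import numpy as np
import pandas as pd

# Toy column: one extreme value (100) and one missing entry.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan])

mean_value = s.mean()      # pulled upwards by the extreme value
median_value = s.median()  # stays in the middle of the typical values

filled_with_mean = s.fillna(mean_value)
filled_with_median = s.fillna(median_value)

print("mean:", mean_value, "median:", median_value)
print(filled_with_median.tolist())
```

The mean-imputed entry lands far from most of the observed values, while the median-imputed entry looks like a typical observation.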

### 2. Handling Outliers

In [None]:
# Importing skew from scipy.stats
from scipy.stats import skew
#Find skewness of every numerical variable
skewness= [{num_col:skew(data2[num_col])} for num_col in num_feature ]
skewness

In [None]:
def find_outliers(data, col):
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = data[(data[col] < lower_bound) | (data[col] > upper_bound)]

    return outliers


In [None]:
for feature in ['Total_Links', 'Total_Images','Subject_Hotness_Score','Total_Past_Communications']:
    outliers = find_outliers(data2, feature)
    print(f"Outliers in {feature}: {len(outliers)}")

In [None]:
total_outliers = []

for feature in num_feature:
    outliers = find_outliers(data2, feature)
    total_outliers.append(outliers.index)

# Combine the outlier indices into a single set
outliers_index = set().union(*total_outliers)

# Drop rows with outliers
data3 = data2.drop(list(outliers_index), axis=0)

In [None]:
data3.head()

In [None]:
#before removal of outliers
print(f"number of rows are {data2.shape[0]} before removal of outliers")

In [None]:
#after removal of outliers
print(f"number of rows are {data3.shape[0]} after removal of outliers")

##### What all outlier treatment techniques have you used and why did you use those techniques?

Initially, I computed the skewness of each numerical variable to understand its distribution. All variables are positively skewed except Word_Count, i.e., they contain some outliers. To remove those outliers, I used the trimming technique, which drops the rows containing outlier values.

I used this technique because it can improve the quality of the dataset and the accuracy and stability of statistical models.
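
The trimming step can be traced on a toy DataFrame. This is a sketch, not the notebook's exact code: the column names `a` and `b` and the helper `iqr_outlier_index` are hypothetical, but the 1.5 x IQR rule and the union of outlier indices follow the approach used above.

```python
import pandas as pd

def iqr_outlier_index(df, col):
    """Return the index labels of rows whose value in `col` lies outside 1.5 * IQR."""
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df.index[(df[col] < lower) | (df[col] > upper)]

# Toy data: column a has a high outlier (row 8), column b a low outlier (row 4).
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 100],
    "b": [10, 11, 12, 13, -50, 14, 15, 16, 17],
})

# Union of outlier rows across all columns, then drop them (trimming).
outlier_rows = set()
for col in toy.columns:
    outlier_rows |= set(iqr_outlier_index(toy, col))

trimmed = toy.drop(index=sorted(outlier_rows))
print(sorted(outlier_rows), len(toy), "->", len(trimmed))
```

Because a row is dropped if it is an outlier in any column, comparing the row counts before and after trimming, as done above, shows how much data the step costs.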

### 3. Categorical Encoding
---
Categorical Encoding is a process where we transform categorical data into numerical data.

In [None]:
cat_features= set(cat_feature)-{'Customer_Location'}
cat_features

Because `Customer_Location` has a high percentage of missing values, we exclude it from further analysis.

In [None]:
# get_dummies needs a list (not a set) of columns; dtype=int keeps the dummy columns numeric
data4=pd.get_dummies(data3, columns=sorted(cat_features), drop_first=True, dtype=int).reset_index(drop=True)
data4.head()

In [None]:
data4.shape

#### What all categorical encoding techniques have you used & why did you use those techniques?

Here I used one-hot encoding to transform the categorical features into numerical ones, since all of these variables are nominal.

I used one-hot encoding because it makes the data more useful and expressive, and it can be rescaled easily.
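
A minimal sketch of what one-hot encoding with `drop_first=True` produces, using a toy column with the same name as one of the dataset's features (the values are invented):

```python
import pandas as pd

# Toy column with three categories.
toy = pd.DataFrame({"Email_Campaign_Type": [1, 2, 3, 2]})

# drop_first=True drops the first category (1), which becomes the baseline:
# a row with zeros in both remaining columns belongs to category 1.
encoded = pd.get_dummies(toy, columns=["Email_Campaign_Type"],
                         drop_first=True, dtype=int)
print(encoded)
```

Dropping one level per feature avoids a set of dummy columns that always sums to 1, which would otherwise be perfectly collinear.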

### 4. Feature Manipulation & Selection
----
We previously saw that two numeric features, `'Total_Links'` and `'Total_Images'`, are highly correlated. We can combine these two features to create a new feature and drop the original features.

#### 1. Feature Manipulation

In [None]:
# creating a new column called Total_link_Images
data4['Total_link_Images'] = data4['Total_Links'] + data4['Total_Images']

In [None]:
# dropping the two original columns
data4.drop(columns=['Total_Links','Total_Images'],inplace= True)

In [None]:
#after manipulation
data4.head()

In [None]:
data4.shape

#### 2. Feature Selection

In [None]:
#Independent variables
col=set(data4.columns.values)-{"Email_Status"}
col

In [None]:
#Multicollinearity by VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)

In [None]:
calc_vif(data4[[i for i in col]])

In [None]:
calc_vif(data4[[i for i in col if i not in ["Email_Campaign_Type_2"]]])

In [None]:
new_num_features=calc_vif(data4[[i for i in col if i not in ["Email_Campaign_Type_2"]]]).variables.values

In [None]:
data4.drop(columns=["Email_Campaign_Type_2"],inplace=True)

##### What all feature selection methods have you used  and why?


I calculated the Variance Inflation Factor (VIF), which helps detect multicollinearity in the data. A VIF above a threshold somewhere between 5 and 10 is commonly treated as a sign of multicollinearity; I set the threshold to 6 and dropped the multicollinear features `"Email_Campaign_Type_2"` and `"Customer_Location_G"` to bring every VIF below 6.
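
To make the idea behind VIF concrete, here is a small NumPy-only sketch of the definition VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing column j on the other columns. The helper `vif_numpy` and the synthetic data are illustrative assumptions, not part of the notebook, and adding an intercept term is a choice made in this sketch.

```python
import numpy as np

def vif_numpy(X):
    """Return the VIF of each column of X, using VIF_j = 1 / (1 - R_j^2)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)  # nearly a copy of a

vifs = vif_numpy(np.column_stack([a, b, c]))
print(vifs)
```

The two nearly identical columns get very large VIF values, while the independent column stays close to 1.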


##### Which all features you found important and why?

In [None]:
from sklearn.ensemble import RandomForestClassifier

def randomforest_embedded(x, y):
    # Create the random forest with hyperparameters
    model = RandomForestClassifier(n_estimators=550)

    # Fit the model
    model.fit(x, y)

    # Get the importance of the resulting features
    importances = model.feature_importances_

    # Create a data frame for visualization
    final_df = pd.DataFrame({"Features": pd.DataFrame(x).columns, "Importances": importances})

    # Sort in ascending order for better visualization
    final_df = final_df.sort_values('Importances')

    return final_df

In [None]:
# Getting feature importance of selected features
randomforest_embedded(x=data4[new_num_features],y=data4["Email_Status"])

In [None]:
#Drop the columns which are insignificant for our dataset.
drop=['Time_Email_sent_Category_3',"Email_Type_2"]
data4.drop(drop,inplace=True,axis=1)

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Separating the columns into symmetric and skewed features
symmetric_feature = []
non_symmetric_feature = []

# Looping through columns in the dataset
for i in data4.describe().columns.values:
    # Check if the absolute difference between mean and median is less than 0.1
    if abs(data4[i].mean() - data4[i].median()) < 0.1:
        symmetric_feature.append(i)
    else:
        non_symmetric_feature.append(i)

# Symmetrically distributed features
print("Symmetric Distributed Features : -", symmetric_feature)

# Skewed features
print("Skewed Distributed Features : -", non_symmetric_feature)


Since all of the features have a skewed distribution, the data needs to be transformed.

First I draw a probability plot, a graphical technique for assessing whether or not a dataset follows a normal distribution. The data are plotted against a theoretical distribution in such a way that the points should form approximately a straight line.

If the points form approximately a straight line, the dataset follows a normal distribution; otherwise, it does not.

In [None]:
from scipy import stats
from scipy.stats import norm  # used later as the reference curve in the distribution plots

# Loop through non-symmetric features
for variable in non_symmetric_feature:
    sns.set_context('notebook')

    # Create a figure with two subplots
    plt.figure(figsize=(14, 5))

    # Subplot 1: Histogram
    plt.subplot(1, 2, 1)   # means 1 row, 2 Columns and 1st plot
    data4[variable].hist(bins=30)

    # Subplot 2: QQ plot
    plt.subplot(1, 2, 2)
    stats.probplot(data4[variable], dist='norm', plot=plt)
    plt.title(variable)

    # Show the plot
    plt.show()

    # Print separator line
    print('=' * 120)


In [None]:
# Categorical features do not need a transformation, so only the continuous features are transformed
for col in ['Total_Past_Communications', 'Word_Count', 'Total_link_Images', 'Subject_Hotness_Score']:
    # Doing square root transformation
    data4[col] = np.sqrt(data4[col])


I applied a square-root transformation to all continuous features because they are all moderately skewed.
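
The effect of a square-root transformation on a right-skewed feature can be checked numerically. The sketch below uses a synthetic gamma-distributed series as a stand-in for a non-negative count column; the data and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical right-skewed, non-negative feature
raw = pd.Series(rng.gamma(shape=2.0, scale=10.0, size=5000))
transformed = np.sqrt(raw)

skew_before = raw.skew()
skew_after = transformed.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness closer to zero after the transformation indicates a more symmetric distribution.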

In [None]:
# Draw the distribution plot of the transformed features with their mean and median.
plt.figure(figsize=(18, 10))  # one figure holding all four subplots
for i, col in enumerate(['Subject_Hotness_Score', 'Total_Past_Communications', 'Word_Count', 'Total_link_Images']):
    plt.subplot(2, 2, i + 1)
    # sns.distplot is deprecated in recent seaborn releases; it is kept here for its fit=norm option
    sns.distplot(data4[col], color='#055E85', fit=norm)
    feature = data4[col]
    plt.axvline(feature.mean(), color='#ff033e', linestyle='dashed', linewidth=3, label='mean')      # red
    plt.axvline(feature.median(), color='#A020F0', linestyle='dashed', linewidth=3, label='median')  # purple
    plt.title(f'{col.title()}')
    plt.legend()
plt.tight_layout()
plt.show()

### 6. Data Scaling
---
If the ranges of the features differ greatly, we should apply feature scaling, which brings all the features into a similar range.


In [None]:
data4.head(3)

In [None]:
# Scaling your data
#standard scaler
from sklearn.preprocessing import StandardScaler
for col in ['Subject_Hotness_Score','Total_Past_Communications','Word_Count',
            'Total_link_Images']:
    data4[col] = StandardScaler().fit_transform(data4[col].values.reshape(-1, 1))

In [None]:
data4.head(3)

##### Which method have you used to scale you data and why?

Broadly, standardization is used when the data follows a Gaussian distribution, and normalization is used when it does not.

Since all of the numerical features are approximately normally distributed after the transformation, I applied the standard scaler to scale them.
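
The cell above scales the whole of `data4` before the train/test split. A common alternative is to fit the scaler on the training rows only and reuse its statistics for the test rows, so that no information from the test set enters preprocessing. The sketch below illustrates that variant on synthetic data; all names in it are illustrative and it is not the notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))  # synthetic features
y = rng.integers(0, 3, size=200)                      # synthetic 3-class target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_tr)   # mean and std come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test rows reuse the training mean and std

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

The scaled training columns have mean 0 and standard deviation 1; the test columns will be close to, but not exactly at, those values.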

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

This dataset does not need any dimensionality reduction.

Dimensionality reduction is a technique that is used to reduce the number of features in a dataset. It is often used when the number of features is very large, as this can lead to problems such as overfitting and slow computation. There are a variety of techniques that can be used for dimensionality reduction, such as principal component analysis (PCA) and singular value decomposition (SVD).
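
For reference only, since it is not applied in this project, here is a minimal PCA sketch on synthetic data. Two of the four columns are near-copies of the other two, so two principal components capture almost all of the variance. The data and names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 2))
# Synthetic 4-column matrix in which two columns are near-copies of the others
X_pca = np.column_stack([base, base + 0.01 * rng.normal(size=(300, 2))])

pca = PCA(n_components=2).fit(X_pca)
explained = pca.explained_variance_ratio_.sum()
print(f"variance explained by 2 components: {explained:.4f}")
```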

### 8. Data Splitting

In [None]:
from sklearn.model_selection import train_test_split
# Split the data into train and test.
X_train, X_test, y_train, y_test = train_test_split(data4.drop("Email_Status",axis=1),data4["Email_Status"],test_size = 0.20, random_state = 0)
print(f"There are {y_train.shape[0]} rows for training and {y_test.shape[0]} for testing")

### 9. Handling Imbalanced Dataset

In [None]:
data4['Email_Status'].value_counts()

##### Do you think the dataset is imbalanced? Explain Why.

Imbalance means that the number of data points available for the different classes is unequal. With two classes, balanced data would mean 50% of the points in each class. For most machine learning techniques, a little imbalance is not a problem: if 60% of the points belong to one class and 40% to the other, performance should not degrade significantly. Only when the class imbalance is high, e.g. 90% of the points in one class and 10% in the other, do standard optimization criteria or performance measures become less effective and need modification.

In our dataset the class ratio of the dependent column is 80:16:4. A model trained on this data will therefore be biased towards the majority class and will predict it far too often, so the dataset should be balanced before model creation.
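
A quick way to see why an 80:16:4 ratio is a problem is to score a "model" that always predicts the majority class. The labels below are synthetic placeholders (0, 1, 2) that reproduce only the ratio quoted above.

```python
from collections import Counter

# Synthetic labels reproducing the 80:16:4 class ratio
labels = [0] * 80 + [1] * 16 + [2] * 4

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# A trivial classifier that always predicts the majority class
predictions = [majority_class] * len(labels)
baseline_accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Recall per class: share of each true class that was predicted correctly
recall_per_class = {
    c: sum(p == t == c for p, t in zip(predictions, labels)) / n
    for c, n in counts.items()
}
print(baseline_accuracy, recall_per_class)
```

The baseline reaches 80% accuracy while never identifying a single minority-class email, which is why accuracy alone is a poor guide here.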

In [None]:
# Handling the imbalanced dataset using SMOTE (applied to the training set only)
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_train, y_train = sm.fit_resample(X_train, y_train)

In [None]:
#visualization of resampled data
from collections import Counter
counter = Counter(y_train)
for key,value in counter.items():
    per = value / len(y_train) * 100
    print('Class=%d, n=%d (%.3f%%)' % (key, value, per))
# plot the distribution
plt.bar(counter.keys(), counter.values())
plt.show()

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

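I used SMOTE (Synthetic Minority Oversampling Technique), applied to the training set only so that the test set keeps its original class distribution. SMOTE balances the classes by generating synthetic minority-class samples instead of discarding majority-class rows.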
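
The core idea of SMOTE is that each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates that idea; `smote_like_samples` is a hypothetical helper written for this explanation, not the `imblearn` implementation used above.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        # Distances from the chosen point to every minority point
        dists = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # skip position 0, the point itself
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the segment between the two points
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(minority, n_new=6)
print(new_points)
```

Because every new point is an interpolation, the synthetic samples stay inside the region already covered by the minority class.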

## ***7. ML Model Implementation***

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold

from xgboost import XGBClassifier
from xgboost import XGBRFClassifier


### ML Model - 1 Logistic Regression

In [None]:
# ML Model - 1 Implementation
lr = LogisticRegression(fit_intercept=True,
            class_weight='balanced',multi_class='multinomial')
# Fit the Algorithm
lr.fit(X_train, y_train)

In [None]:
# Checking the coefficients
lr.coef_

In [None]:
# Checking the intercept value
lr.intercept_

In [None]:
# Predict on the model
# Get the predicted probabilities
train_probability_lr = lr.predict_proba(X_train)
test_probability_lr = lr.predict_proba(X_test)

In [None]:
# Get the predicted classes
y_pred_train_lr = lr.predict(X_train)
y_pred_lr = lr.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
#define a function which print the result of Evaluation metrics.
def print_metrics(actual_train,actual_test,predicted_train,predicted_test,test_probability):

    print('accuracy on train data is {}'.format(accuracy_score(actual_train,predicted_train)))
    print('accuracy on test data is {}'.format(accuracy_score(actual_test,predicted_test)))
    print('precision on test data is {}'.format(precision_score(actual_test,predicted_test,average='weighted')))
    print('recall on test data is {}'.format(recall_score(actual_test,predicted_test,average='weighted')))
    print('f1 Score on test data is {}'.format(f1_score(actual_test,predicted_test,average='weighted')))
    print('roc_auc_score on test data is {}'.format(roc_auc_score(actual_test, test_probability,multi_class='ovr',average='weighted')))

In [None]:
print_metrics(y_train,y_test,y_pred_train_lr,y_pred_lr,test_probability_lr)

In [None]:
# Visualizing evaluation Metric Score chart that is confusion matrix for both training and testing data

def confusion_metrics(actual_train,actual_test,predicted_train,predicted_test):
    labels = ['Ignored', 'Opened', 'Acknowledged']
    cm1 = confusion_matrix(actual_train, predicted_train)
    ax1= plt.subplot()
    sns.heatmap(cm1, annot=True, fmt='d', ax=ax1)  # annot=True writes the count in each cell; fmt='d' shows it as an integer
    ax1.set_xlabel('Predicted labels')
    ax1.set_ylabel('True labels')
    ax1.set_title('Confusion Matrix for training data')
    ax1.xaxis.set_ticklabels(labels)
    ax1.yaxis.set_ticklabels(labels)
    plt.show()
    print(" ")

    cm2 = confusion_matrix(actual_test, predicted_test)
    ax2= plt.subplot()
    sns.heatmap(cm2, annot=True, fmt='d', ax=ax2)
    # labels, title and ticks
    ax2.set_xlabel('Predicted labels')
    ax2.set_ylabel('True labels')
    ax2.set_title('Confusion Matrix for testing data')
    ax2.xaxis.set_ticklabels(labels)
    ax2.yaxis.set_ticklabels(labels)
    plt.show()

In [None]:
confusion_metrics(y_train,y_test,y_pred_train_lr,y_pred_lr)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques.
model = LogisticRegression(fit_intercept=True, max_iter=10000,
            class_weight='balanced',multi_class='multinomial')
solvers = ['lbfgs']
penalty = ['l2']
c_values = [1000,100, 10, 1.0, 0.1, 0.01,0.001]
# define grid search
grid = dict(solver=solvers,penalty=penalty,C=c_values)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
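# 'f1' alone is defined for binary targets only; the target here has three classes, so use the weighted average
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='f1_weighted', error_score=0)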

# Fit the Algorithm
grid_result=grid_search.fit(X_train, y_train)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

In [None]:
# Predict on the model
# Get the predicted probabilities
train_probability_lrh = grid_result.predict_proba(X_train)
test_probability_lrh = grid_result.predict_proba(X_test)

In [None]:
# Predict on the model
# Get the predicted classes
y_pred_train_lrh = grid_result.predict(X_train)
y_pred_lrh = grid_result.predict(X_test)

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV, which applies the grid search technique to find the hyperparameters that give the best model performance.

The goal is to find the hyperparameter values that give the best predictions from the model. One option is a manual search by trial and error, but that takes a great deal of time even for a single model.

For this reason, methods such as random search and grid search were introduced. Grid search tries every combination of the specified hyperparameter values, evaluates the model for each combination, and selects the best-performing one. This makes it time-consuming and computationally expensive, with the cost growing with the number of hyperparameters involved.

GridSearchCV combines grid search with cross-validation: each combination is scored on held-out folds of the training data rather than on a single split.

That is why I used the GridSearchCV method for hyperparameter optimization.
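
The following is a minimal, self-contained sketch of a cross-validated grid search on a three-class problem. The data is synthetic and the grid is deliberately tiny. One detail worth noting for multiclass targets: the plain `'f1'` scorer assumes a binary target, so an averaged variant such as `'f1_weighted'` is used here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic 3-class data standing in for the email features
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=3, random_state=0)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring="f1_weighted",  # multiclass-aware scorer
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.predict` and `search.predict_proba` delegate to the best estimator refitted on the full training data.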

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
print_metrics(y_train,y_test,y_pred_train_lrh,y_pred_lrh,test_probability_lrh)

In [None]:
# Visualizing evaluation Metric Score chart that is confusion matrix for both training and testing data
confusion_metrics(y_train,y_test,y_pred_train_lrh,y_pred_lrh)

Cross-validation and hyperparameter tuning brought no noticeable improvement in the results, probably because the dataset is large enough for the hold-out method to give reliable estimates.


### ML Model - 2 Random Forest Classifier

In [None]:
# ML Model - 2 Implementation
# Create an instance of the RandomForestClassifier
rf_model = RandomForestClassifier()

# Fit the Algorithm
rf_model.fit(X_train,y_train)

In [None]:
# Predict on the model
# Making predictions on train and test data
y_pred_train_rf = rf_model.predict(X_train)
y_pred_rf = rf_model.predict(X_test)

In [None]:
# Predict on the model
# Get the predicted probabilities
train_probability_rf = rf_model.predict_proba(X_train)
test_probability_rf = rf_model.predict_proba(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
print_metrics(y_train,y_test,y_pred_train_rf,y_pred_rf,test_probability_rf)

In [None]:
# Visualizing evaluation Metric Score chart that is confusion matrix for both training and testing data
confusion_metrics(y_train,y_test,y_pred_train_rf,y_pred_rf)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques
# Number of trees
n_estimators = [50,80,100]

# Maximum depth of trees
max_depth = [4,6,8]

# Minimum number of samples required to split a node
min_samples_split = [50,100,150]

# Minimum number of samples required at each leaf node
min_samples_leaf = [40,50]

# HYperparameter Grid
param_dict = {'n_estimators' : n_estimators,
              'max_depth' : max_depth,
              'min_samples_split' : min_samples_split,
              'min_samples_leaf' : min_samples_leaf}

# Create an instance of the RandomForestClassifier
rf_model = RandomForestClassifier()

# Grid search ('f1' alone is binary-only, so the weighted average is used for the three-class target)
rf_grid = GridSearchCV(estimator=rf_model,
                       param_grid = param_dict,
                       cv = 5, verbose=2, scoring='f1_weighted')


# Fit the Algorithm
rf_grid.fit(X_train,y_train)

In [None]:
#best parameter
print("Best: %f using %s" % (rf_grid.best_score_, rf_grid.best_params_))

In [None]:
# Predict on the model
# Making predictions on train and test data
y_pred_train_rfh = rf_grid.predict(X_train)
y_pred_rfh = rf_grid.predict(X_test)

In [None]:
# Predict on the model
# Get the predicted probabilities
train_probability_rfh = rf_grid.predict_proba(X_train)
test_probability_rfh = rf_grid.predict_proba(X_test)

##### Which hyperparameter optimization technique have you used and why?

I again used GridSearchCV, for the same reasons given for the Logistic Regression model: it evaluates every combination in the specified hyperparameter grid with cross-validation and keeps the best-scoring one.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
print_metrics(y_train,y_test,y_pred_train_rfh,y_pred_rfh,test_probability_rfh)

In [None]:
# Visualizing evaluation Metric Score chart that is confusion matrix for both training and testing data
confusion_metrics(y_train,y_test,y_pred_train_rfh,y_pred_rfh)

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

I used the following metrics to evaluate the model; their impact on the business is as follows:

**Accuracy:** This metric indicates the percentage of correctly classified instances out of the total number of instances. In a business setting, this would indicate the overall effectiveness of the model in making correct predictions. A high accuracy score would have a positive impact on the business, as it would indicate a high level of confidence in the model's predictions.

**Precision:** This metric indicates the proportion of true positive predictions out of all positive predictions made by the model. In a business setting, this would indicate the level of confidence in the model's ability to identify positive instances correctly. A high precision score would have a positive impact on the business, as it would indicate that the model makes few false positive predictions.

**Recall:** This metric indicates the proportion of true positive predictions out of all actual positive instances. In a business setting, this would indicate the model's ability to identify all positive instances. A high recall score would have a positive impact on the business, as it would indicate that the model misses few positive instances.

**F1 Score:** This metric is a combination of precision and recall and is used to balance the trade-off between the two. In a business setting, this would indicate the overall effectiveness of the model in making correct predictions while also avoiding false positives and false negatives. A high F1 score would have a positive impact on the business, as it would indicate that the model is making accurate predictions while also being able to identify all positive instances.

**ROC AUC:** This metric indicates the ability of the model to distinguish between positive and negative instances. In a business setting, this would indicate the model's ability to correctly classify instances as positive or negative. A high ROC AUC score would have a positive impact on the business, as it would indicate that the model is able to correctly classify instances.

In summary, the Random Forest Classifier can be considered as an efficient model for the business, especially when it achieves high scores in all of these evaluation metrics, which would indicate that it can accurately predict outcomes, identify all positive instances, and correctly classify instances as positive or negative.
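
Because the target has three classes, `print_metrics` reports the `average='weighted'` variants of precision, recall and F1: each class gets its own score, and the scores are averaged with weights proportional to class support. The sketch below shows both views on a tiny made-up example; the label coding 0, 1, 2 for ignored, opened and acknowledged is an assumption made for illustration.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Tiny made-up example with three classes
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2, 1]

for name, fn in [("precision", precision_score), ("recall", recall_score), ("f1", f1_score)]:
    per_class = fn(y_true, y_pred, average=None)       # one score per class
    weighted = fn(y_true, y_pred, average="weighted")  # support-weighted mean
    print(name, per_class.round(3), round(weighted, 3))

weighted_recall = recall_score(y_true, y_pred, average="weighted")
```

Support-weighted recall is mathematically the same quantity as accuracy, which is why the Recall and Test Accuracy columns in the model comparison table are identical.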

### ML Model - 3 XgBoost Classifier

In [None]:
# Create an instance of the XGBClassifier
xg_model = XGBClassifier()

# Fit the Algorithm
xg_models=xg_model.fit(X_train,y_train)

In [None]:
# Predict on the model
# Making predictions on train and test data

y_pred_train_xg = xg_models.predict(X_train)
y_pred_xg = xg_models.predict(X_test)

In [None]:
# Predict on the model
# Get the predicted probabilities
train_probability_xg = xg_models.predict_proba(X_train)
test_probability_xg = xg_models.predict_proba(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
print_metrics(y_train,y_test,y_pred_train_xg,y_pred_xg,test_probability_xg)

In [None]:
# Visualizing evaluation Metric Score chart that is confusion matrix for both training and testing data
confusion_metrics(y_train,y_test,y_pred_train_xg,y_pred_xg)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques
# Number of trees
n_estimators = [50,80,100]

# Maximum depth of trees
max_depth = [4,6,8]

# Hyperparameter Grid
# Note: min_samples_split and min_samples_leaf are RandomForestClassifier
# parameters, not XGBClassifier ones, so they are left out of this grid.
param_dict = {'n_estimators' : n_estimators,
              'max_depth' : max_depth}

# Create an instance of the XGBClassifier
xg_model = XGBClassifier()

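# Fit the Algorithm
# Grid search ('roc_auc' alone supports binary targets only, so the one-vs-rest weighted variant is used)
xg_grid = GridSearchCV(estimator=xg_model,
                       param_grid = param_dict,
                       cv = 5, verbose=2, scoring='roc_auc_ovr_weighted')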

xg_grid1=xg_grid.fit(X_train,y_train)

In [None]:
#best parameter
print("Best: %f using %s" % (xg_grid.best_score_, xg_grid.best_params_))


In [None]:
# Predict on the model
# Making predictions on train and test data

y_pred_train_xgh = xg_grid1.predict(X_train)
y_pred_xgh = xg_grid1.predict(X_test)

In [None]:
# Predict on the model
# Get the predicted probabilities
train_probability_xgh = xg_grid1.predict_proba(X_train)
test_probability_xgh = xg_grid1.predict_proba(X_test)

##### Which hyperparameter optimization technique have you used and why?

I again used GridSearchCV, for the same reasons given for the Logistic Regression model: it evaluates every combination in the specified hyperparameter grid with cross-validation and keeps the best-scoring one.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
print_metrics(y_train,y_test,y_pred_train_xgh,y_pred_xgh,test_probability_xgh)

In [None]:
# Visualizing evaluation Metric Score chart that is confusion matrix for both training and testing data
confusion_metrics(y_train,y_test,y_pred_train_xgh,y_pred_xgh)

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

When evaluating the effectiveness of an email campaign in a classification model, the following evaluation metrics would be considered for a positive business impact:

* **Precision**: This metric indicates the proportion of true positive predictions (emails that were opened and resulted in a desired action) out of all positive predictions made by the model. In a business setting, this would indicate the level of confidence in the model's ability to identify individuals who are likely to engage with the campaign. A high precision score would have a positive impact on the business, as it would indicate that the model is not making false positive predictions and is effectively identifying individuals who are likely to engage with the campaign.

* **Recall**: This metric indicates the proportion of true positive predictions (emails that were opened and resulted in a desired action) out of all actual positive instances (emails that were opened and resulted in a desired action). In a business setting, this would indicate the model's ability to identify all individuals who engaged with the campaign. A high recall score would have a positive impact on the business, as it would indicate that the model is not missing any individuals who engaged with the campaign.

* **F1 Score**: This metric is a combination of precision and recall and is used to balance the trade-off between the two. In a business setting, this would indicate the overall effectiveness of the model in identifying individuals who are likely to engage with the campaign while also avoiding false positives and false negatives. A high F1 score would have a positive impact on the business, as it would indicate that the model is effectively identifying individuals who are likely to engage with the campaign while also being able to identify all individuals who engaged with the campaign.

* **ROC AUC**: This metric indicates the ability of the model to distinguish between positive and negative instances. In a business setting, this would indicate the model's ability to correctly classify instances as positive (engaged with the campaign) or negative (did not engage with the campaign). A high ROC AUC score would have a positive impact on the business, as it would indicate that the model is able to correctly classify individuals as likely to engage with the campaign or not.

The evaluation metrics that would be considered for a positive business impact of an email campaign effectiveness in a classification model are **precision, recall** which combine to provide F1 score. These metrics would indicate the model's ability to identify individuals who are likely to engage with the campaign while also being able to identify all individuals who engaged with the campaign, and correctly classify instances as positive or negative.
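
For a three-class target, ROC AUC is computed here in one-vs-rest fashion: each class in turn is treated as "positive" against the other two, and the per-class AUCs are averaged by support. The sketch below uses made-up probabilities in which every class is perfectly separated, so the resulting score is 1.0.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true_demo = np.array([0, 0, 1, 1, 2, 2])
# Made-up class probabilities; each row sums to 1, as roc_auc_score expects
proba_demo = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.5, 0.2],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
])

auc_ovr = roc_auc_score(y_true_demo, proba_demo, multi_class="ovr", average="weighted")
print(auc_ovr)
```

This is the same call that `print_metrics` makes on the test-set probabilities.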

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

In [None]:
#all classifiers
Model = ["Logistic Regression","Random Forest","Xgboost"]

In [None]:
#creating dataframe for all classifiers using dictionary
pd.DataFrame({"Model":Model,
'Train Accuracy':[0.502620,0.543123,0.697187],
'Test Accuracy':[0.603999,0.654914,0.732894],
'Precision':[0.777709,0.783135,0.780748],
'Recall':[0.603999,0.654914,0.732894],
'F1 Score':[0.670009,0.703793,0.752840],
'roc_auc_score':[0.726082,0.7654076,0.784439]})

From the table above we can see that:

1) XGBoost has the highest training and testing accuracy.

2) XGBoost also has the best recall, F1 score and ROC AUC score.

Hence XGBoost is the best model overall, and I have chosen the hyperparameter-optimized XGBoost model as the final model.
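
The same comparison can be made programmatically instead of by eye. The sketch below copies the F1 and ROC AUC figures from the table above and picks the best model for each metric with `idxmax`; the variable names are illustrative.

```python
import pandas as pd

# Scores copied from the comparison table above
scores = pd.DataFrame({
    "Model": ["Logistic Regression", "Random Forest", "Xgboost"],
    "F1 Score": [0.670009, 0.703793, 0.752840],
    "roc_auc_score": [0.726082, 0.7654076, 0.784439],
}).set_index("Model")

best_by_f1 = scores["F1 Score"].idxmax()
best_by_auc = scores["roc_auc_score"].idxmax()
print(best_by_f1, best_by_auc)
```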

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In [None]:
#Feature Importance
feature_importances_xg = pd.DataFrame(xg_models.feature_importances_,
                                   index = X_train.columns,
                                    columns=['importance_xg']).sort_values('importance_xg',
                                                                        ascending=False)[:10]

plt.subplots(figsize=(17,6))
plt.title("Feature importances")
plt.bar(feature_importances_xg.index, feature_importances_xg['importance_xg'],
        color="green",  align="center")
plt.xticks(feature_importances_xg.index, rotation = 85)

plt.show()

# **Conclusion**

* ### Email_Campaign_Type_3 was the most important feature. If the Email_Campaign_Type was 1, there was a 90% likelihood of the email being read or acknowledged.


* ### Both Time_Email_Sent and Customer_Location were insignificant in determining the Email status. The ratio of the Email Status was the same irrespective of the demographic location or the time frame in which the emails were sent.

  ### Emails sent in time category 2 (the middle of the day) were read and acknowledged more often than those sent in categories 1 and 3.

* ### As the word_count increases beyond the 600 mark we see that there is a high possibility of that email being ignored. The ideal mark is 400-600. No one is interested in reading long emails!

* ### Emails that were ignored contained more pictures.

* ### With the exception of Word Count, practically all continuous variables had outliers. After analysis, it was discovered that outliers account for more than 5% of the minority data and will affect the results in either direction, therefore it was preferable to leave them in.

* ### Although SMOTE appears to have performed much better, information loss is possible.

* ### Based on the metrics, the XGBoost Classifier worked best, with a test F1 score of about 75% and a ROC AUC score of about 78%.

----



`Thank-You  :-)  `