# **Email Engagement Prediction**

## **Introduction:**

In this notebook, we aim to build a machine learning model to predict the likelihood of email engagement, specifically focusing on whether the email will be 'clicked' or not. We will explore different ML algorithms to achieve the best possible accuracy.

## Table of Contents:

1. [**Loading and Preprocessing Data**](#loading-and-preprocessing-data)
2. [**Feature Selection and Splitting**](#feature-selection-and-splitting)
3. [**Model Training and Evaluation**](#model-training-and-evaluation)
4. [**Conclusion**](#conclusion)


## 1. Loading and Preprocessing Data <a name="loading-and-preprocessing-data"></a>


In [25]:
pip install --upgrade scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.8 MB)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.2.2
    Uninstalling scikit-learn-1.2.2:
      Successfully uninstalled scikit-learn-1.2.2
Successfully installed scikit-learn-1.3.2


In [27]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import ConfusionMatrixDisplay  # public import path; confusion_matrix is already imported above

# Function to load and preprocess the dataset
def load_and_preprocess_data(file_path):
    df = pd.read_csv(file_path)
    # Add any additional preprocessing steps as needed
    return df
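
The preprocessing placeholder above could be filled in along the following lines. This is only a sketch: the helper name `clean_email_data` is hypothetical, the column names are taken from the feature and target definitions used later in this notebook, and dropping rows with missing values is an assumption about the data rather than a step the notebook actually performs.

```python
import pandas as pd

# Column names as used later in the notebook (features plus target)
EXPECTED_COLUMNS = ['opened', 'responded', 'subject_len', 'body_len', 'clicked']

def clean_email_data(df, expected_columns=EXPECTED_COLUMNS):
    """Hypothetical helper: check the schema and drop incomplete rows."""
    missing = [col for col in expected_columns if col not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    # Assumption: rows with a missing feature or target value are discarded
    cleaned = df.dropna(subset=expected_columns)
    return cleaned.reset_index(drop=True)

# Toy frame standing in for the real CSV
demo_df = pd.DataFrame({
    'opened': [1, 0, 1],
    'responded': [0, 0, 1],
    'subject_len': [10, None, 30],
    'body_len': [100, 200, 300],
    'clicked': [1, 0, 1],
})
print(clean_email_data(demo_df).shape)
```

Such a helper could be called inside `load_and_preprocess_data` right after `pd.read_csv`, if dropping incomplete rows is appropriate for the dataset.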

## 2. Feature Selection and Splitting <a name="feature-selection-and-splitting"></a>

In [16]:
# Function to split the dataset into features and target variable
def split_features_target(df, target_column, feature_columns):
    X = df[feature_columns]  # Features
    y = df[target_column]  # Target variable
    return X, y

# Function to split the dataset into train and test sets
def split_train_test(X, y, test_size=0.2, random_state=42):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return X_train, X_test, y_train, y_test
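
With a small dataset, a purely random split can leave the train and test sets with different class proportions. One possible variant, sketched here on toy data, passes `stratify=y` to `train_test_split` so that both splits keep the same ratio of 'clicked' to 'not clicked'. The function name `split_train_test_stratified` and the toy data are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_train_test_stratified(X, y, test_size=0.2, random_state=42):
    # stratify=y preserves the class proportions of y in both splits
    return train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )

# Toy data: 20 rows with perfectly balanced classes
X_demo = pd.DataFrame({'subject_len': range(20), 'body_len': range(100, 120)})
y_demo = pd.Series([0, 1] * 10, name='clicked')

X_tr, X_te, y_tr, y_te = split_train_test_stratified(X_demo, y_demo)
print(y_tr.value_counts().to_dict(), y_te.value_counts().to_dict())
```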

## 3. Model Training and Evaluation <a name="model-training-and-evaluation"></a>

In [17]:
# Function to create and train the machine learning model
def train_model1(X_train, y_train):
    model = Pipeline([
        ('scaler', StandardScaler()),  # Standardize features
        ('classifier', RandomForestClassifier(random_state=42))  # RandomForestClassifier
    ])
    model.fit(X_train, y_train)
    return model

# Function to evaluate the model on the test set
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    classification_rep = classification_report(y_test, y_pred)
    confusion_mat = confusion_matrix(y_test, y_pred)
    return accuracy, classification_rep, confusion_mat

In [18]:
# Load and preprocess data
file_path = '/content/email_data.csv'
df = load_and_preprocess_data(file_path)

# Define the target variable and feature columns
target_column = 'clicked'
feature_columns = ['opened', 'responded', 'subject_len', 'body_len']

# Split features and target variable
X, y = split_features_target(df, target_column, feature_columns)

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = split_train_test(X, y)

In [21]:
# Train the model
trained_model1 = train_model1(X_train, y_train)

# Evaluate the model
accuracy, classification_rep, confusion_mat = evaluate_model(trained_model1, X_test, y_test)

# Display evaluation metrics
print(f"Accuracy: {accuracy}")
print("Classification Report:\n", classification_rep)
print("Confusion Matrix:\n", confusion_mat)

Accuracy: 0.5882352941176471
Classification Report:
               precision    recall  f1-score   support

           0       0.60      0.67      0.63         9
           1       0.57      0.50      0.53         8

    accuracy                           0.59        17
   macro avg       0.59      0.58      0.58        17
weighted avg       0.59      0.59      0.59        17

Confusion Matrix:
 [[6 3]
 [4 4]]
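
The figures in the classification report can be recomputed by hand from the confusion matrix printed above. The sketch below uses scikit-learn's layout, where rows are true labels and columns are predicted labels, so the four cells unpack as true negatives, false positives, false negatives, and true positives.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
confusion_mat = np.array([[6, 3],
                          [4, 4]])

tn, fp, fn, tp = confusion_mat.ravel()

accuracy = (tp + tn) / confusion_mat.sum()   # (4 + 6) / 17
precision_clicked = tp / (tp + fp)           # 4 / 7
recall_clicked = tp / (tp + fn)              # 4 / 8

print(f"accuracy={accuracy:.2f}, "
      f"precision(clicked)={precision_clicked:.2f}, "
      f"recall(clicked)={recall_clicked:.2f}")
# prints: accuracy=0.59, precision(clicked)=0.57, recall(clicked)=0.50
```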


**Insights and Considerations:**


*   The confusion matrix shows that the model correctly predicted 6 instances of 'not clicked' and 4 instances of 'clicked'. However, it misclassified 3 instances of 'not clicked' as 'clicked' and 4 instances of 'clicked' as 'not clicked'.
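*   With an accuracy of about 0.59 on a 17-sample test set, the model does only slightly better than always predicting 'not clicked' (9/17, about 0.53). There is considerable room for improvement, particularly in recall for the 'clicked' class (0.50).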
*   Fine-tuning hyperparameters, experimenting with different algorithms, or adjusting the decision threshold may enhance model performance.
*   Consideration of additional features or feature engineering could also contribute to better predictive capabilities.
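
Adjusting the decision threshold, mentioned in the list above, can be sketched as follows. The helper `predict_with_threshold` is hypothetical, and the data comes from `make_classification` as a stand-in for the email dataset, so the printed numbers say nothing about the real model. The idea is to compare the predicted probability of the positive class against a custom cutoff instead of the default 0.5.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with four features (assumption, not the email data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

def predict_with_threshold(model, X, threshold=0.5):
    """Label a sample as 1 when its predicted probability of class 1 reaches the threshold."""
    proba_positive = model.predict_proba(X)[:, 1]
    return (proba_positive >= threshold).astype(int)

# A lower threshold trades precision for recall, and a higher one does the opposite
for threshold in (0.3, 0.5, 0.7):
    y_pred = predict_with_threshold(clf, X_te, threshold)
    precision = precision_score(y_te, y_pred, zero_division=0)
    recall = recall_score(y_te, y_pred)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

The same helper works with a `Pipeline` whose final step exposes `predict_proba`. Note that `SVC` only provides `predict_proba` when constructed with `probability=True`.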








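To improve on this result, the next step adjusts the RandomForestClassifier settings (balanced class weights) and adds SVM and logistic regression pipelines as alternatives for comparison. Each pipeline standardizes the features first, which matters for SVM and logistic regression; tree-based models such as random forests are largely insensitive to feature scaling. Setting `class_weight='balanced'` compensates for any class imbalance in the training data, although the test set here is close to balanced (9 vs. 8).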




In [23]:
# Function to create and train a model with balanced class weights and a selectable algorithm

def train_model2(X_train, y_train, algorithm='random_forest'):
    if algorithm == 'random_forest':
        model = Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', RandomForestClassifier(random_state=42, class_weight='balanced', n_jobs=-1))
        ])
    elif algorithm == 'svm':
        model = Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', SVC(kernel='linear', class_weight='balanced'))
        ])
    elif algorithm == 'logistic_regression':
        model = Pipeline([
            ('scaler', StandardScaler()),
            ('classifier', LogisticRegression(class_weight='balanced'))
        ])
    else:
        raise ValueError("Invalid algorithm. Choose 'random_forest', 'svm', or 'logistic_regression'.")

    model.fit(X_train, y_train)
    return model

In [22]:
# Train the model (Random Forest as an example, you can try SVM or Logistic Regression)
trained_model2 = train_model2(X_train, y_train, algorithm='random_forest')

# Evaluate the model
accuracy, classification_rep, confusion_mat = evaluate_model(trained_model2, X_test, y_test)

# Display evaluation metrics
print(f"Accuracy: {accuracy}")
print("Classification Report:\n", classification_rep)

Accuracy: 0.6470588235294118
Classification Report:
               precision    recall  f1-score   support

           0       0.64      0.78      0.70         9
           1       0.67      0.50      0.57         8

    accuracy                           0.65        17
   macro avg       0.65      0.64      0.64        17
weighted avg       0.65      0.65      0.64        17



**Insights and Considerations:**



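*   Accuracy rose from about 0.59 to about 0.65. On a 17-sample test set this corresponds to one additional correct prediction (11 instead of 10), which came from the 'not clicked' class (recall 0.67 to 0.78).
*   Precision increased for both classes: from 0.60 to 0.64 for 'not clicked' and from 0.57 to 0.67 for 'clicked'.
*   Recall for 'clicked' instances is unchanged at 0.50, so half of the actual clicks are still missed. This remains the main area for improvement.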
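
Because a single 17-sample test set makes differences between models hard to judge, one option is to compare the three algorithms with stratified cross-validation. The sketch below is an assumption-laden illustration: it rebuilds the same scaler-plus-classifier pipelines as `train_model2`, runs on synthetic data from `make_classification` rather than the email dataset, and uses 5 folds as an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with four features (assumption, not the email data)
X_demo, y_demo = make_classification(
    n_samples=100, n_features=4, n_informative=3, n_redundant=1, random_state=42
)

classifiers = {
    'random_forest': RandomForestClassifier(random_state=42, class_weight='balanced'),
    'svm': SVC(kernel='linear', class_weight='balanced'),
    'logistic_regression': LogisticRegression(class_weight='balanced'),
}

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, clf in classifiers.items():
    pipeline = Pipeline([('scaler', StandardScaler()), ('classifier', clf)])
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring='accuracy')
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean accuracy={scores.mean():.3f} (std={scores.std():.3f})")
```

Reporting the spread across folds alongside the mean gives a rough sense of whether a gap between two algorithms is larger than the fold-to-fold variation.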


## 4. Conclusion <a name="conclusion"></a>

The adjusted model improved overall accuracy from about 59% to about 65% on the 17-sample test set. Precision increased for both classes and recall increased for 'not clicked' instances, but recall for 'clicked' instances stayed at 0.50 and remains the main area for improvement. Because the test set is so small, the accuracy gain amounts to a single additional correct prediction, so these results should be read as indicative rather than conclusive.

In the upcoming detailed analysis report, we will delve into the finer nuances of the model's performance and explore comprehensive strategies for further improvement.