# Spam Email Classification


## Introduction
- The proliferation of email as a primary communication tool has brought both significant convenience and new challenges, chief among them spam. Spam emails not only clutter inboxes but also pose security risks, since they may contain malicious content.
- Effectively classifying emails as spam or non-spam is therefore critical for improving user experience and maintaining cybersecurity.
This notebook aims to classify emails into spam and non-spam categories using data analysis and machine learning techniques.
- This notebook details the methodology, analysis, and conclusions drawn from the project. It covers the data preprocessing steps, the visualization techniques used, and the statistical analysis conducted. It also discusses the challenges encountered and the solutions implemented to address them.


## Methodology

The notebook followed a structured approach consisting of the following steps:
- Data Collection and Preparation: Gather and prepare the data for analysis.
- Exploratory Data Analysis: Investigate data to find patterns, spot anomalies, and check assumptions using summary statistics and graphs.
- Feature Engineering: Create relevant features from the data.
- Model Development: Develop and train various machine learning models.
- Model Evaluation: Assess the performance of the models using various metrics.
- External Validation: Test the final model on new data to ensure its robustness and generalizability.

The details of each step are given below.

---

## A. Import Libraries

- This cell imports the essential libraries needed for data analysis, visualization, and machine learning tasks. It imports numpy and pandas, which are fundamental for handling numerical data and data frames, respectively.
- For visualization, matplotlib.pyplot and seaborn are imported, providing tools for creating a wide range of static, animated, and interactive plots.
- The train_test_split function from the sklearn.model_selection module is brought in to facilitate the division of the dataset into training and testing subsets, crucial for model evaluation.
- Lastly, the code imports classification_report and confusion_matrix from sklearn.metrics, which are used to assess the performance of classification models by generating detailed reports and confusion matrices, respectively.

In [None]:
# Import necessary libraries for data analysis and visualization
import numpy as np  
import pandas as pd  

# Import libraries for plotting and visualization
import matplotlib.pyplot as plt  
import seaborn as sb  

# Import train_test_split function for splitting data into training and test sets
from sklearn.model_selection import train_test_split

# Importing necessary functions for evaluating classification performance
from sklearn.metrics import classification_report, confusion_matrix

- This cell imports the Natural Language Toolkit (nltk) library, which is widely used for various natural language processing (NLP) tasks.
- Specifically, it imports the stopwords module from nltk.corpus, which provides a collection of common stop words that can be filtered out during text processing to improve the efficiency of NLP tasks.

In [None]:
# Import the Natural Language Toolkit (nltk) library for natural language processing tasks
import nltk
from nltk.corpus import stopwords  # Import the stopwords module to work with common stop words

# Download the stopwords dataset from nltk
nltk.download("stopwords")

---

## B. Data Collection and Preparation

- First, load the dataset into a DataFrame using the pandas library in Python. This is the first step in data exploration and preprocessing, and it enables us to perform various operations such as cleaning, transforming, and analyzing the data.

In [None]:
# Import dataset 
df = pd.read_csv("/kaggle/input/email-spam-classification-dataset-csv/emails.csv", index_col=0)
df

---

## C. Exploratory Data Analysis

- Next, check the structure and missing values to prepare the data for further processing and modeling. Descriptive statistics are also calculated to summarize the data and help in understanding its characteristics; they are useful for identifying trends, outliers, and the overall distribution of the data.

### 1. DataFrame Inspection

- Upon loading the dataset into a DataFrame, it is important to review its structure. This includes checking the column names to understand what features are present, verifying data types (e.g., numeric, categorical, text), and inspecting a sample of data entries to get an initial sense of the data. 

#### a. Display the First 10 Rows of the DataFrame

- The head method provides a quick overview of the dataset, showing the initial rows along with the column names and some of the data contained within.

In [None]:
df.head(10)

#### b. Display the Last 10 Rows of the DataFrame

- The tail method provides a quick look at the end of the dataset, showing the final rows along with the column names and some of the data contained within.

In [None]:
df.tail(10)

#### c. Checking Shape

- The df.shape attribute returns a tuple representing the dimensions of the DataFrame.
- The first element of the tuple indicates the number of rows, and the second element indicates the number of columns.

In [None]:
df.shape

#### d. List of Columns

- The df.columns attribute returns an Index object containing the column labels of the DataFrame.
- This allows us to see the names of all the columns. 

In [None]:
df.columns

#### e. Data Types of Each Feature

- The df.dtypes attribute returns a Series with the data type of each column in the DataFrame.
- This is useful for understanding the types of data we are working with, such as integers, floats, strings, or more complex types.
- Knowing the data types helps in performing appropriate data processing and analysis tasks, as certain operations are only applicable to specific data types.

In [None]:
df.dtypes

#### f. Information Summary

- The df.info() method provides a concise summary of the DataFrame's structure, including the total number of entries (rows), the number of non-null values in each column, the data type of each column, and the memory usage of the DataFrame.

In [None]:
df.info()

#### Observation after DataFrame Inspection

- The dataset consists of 5172 emails represented as rows, where each email's features are the counts of the 3000 most common words.
- This creates a high-dimensional feature space with sparse data, typical in text classification tasks.
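
To put a number on the sparsity noted above, one option is to measure the fraction of zero entries in the count matrix. The sketch below uses a small made-up matrix (`toy_counts` is hypothetical and not taken from the dataset); on the real data the same expression could be applied to the word-count columns of `df`.

```python
import numpy as np

# Hypothetical word-count matrix: 4 emails x 6 words (illustration only)
toy_counts = np.array([
    [3, 0, 0, 1, 0, 0],
    [0, 0, 2, 0, 0, 0],
    [1, 0, 0, 0, 0, 4],
    [0, 0, 0, 0, 0, 0],
])

# Fraction of cells equal to zero; values close to 1.0 indicate a very sparse matrix
sparsity = float((toy_counts == 0).mean())
print(f"Sparsity: {sparsity:.2%}")
```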

### 2. Summary of DataFrame Statistics

- Generating descriptive statistics provides insights into the data’s central tendencies and variability.
- Because this dataset contains only numerical features, df.describe() is used to calculate the count, mean, standard deviation, minimum, quartiles (including the median), and maximum of each column.
- These statistics help in identifying patterns and potential anomalies in the data.

In [None]:
df.describe()

- The features show significant variation, with high mean values for common words, substantial standard deviations, and a wide range of minimum and maximum values.
- Quartile statistics reveal that many features have low occurrence rates for a large portion of the data, with the median and 75th percentile providing insights into data distribution and spread.
- The high standard deviations and the wide range of minimum and maximum values in the summary statistics highlight the richness and complexity of text data.

### 3. Checking Null Values
- Identifying and handling missing values is a crucial step.
- Missing values can arise from various sources and can lead to biased or inaccurate model predictions if not addressed.

In [None]:
df.isnull().sum()

- This indicates no missing (null) values in any of the columns.
- This means that the dataset is complete and there are no entries that need to be handled or imputed due to missing data.

### 4. Label Distribution

- Count the occurrences of each category in the 'Prediction' column.
- Analyzing the distribution of the target labels (spam vs. non-spam) helps understand the balance of the dataset.

In [None]:
# Count the number of occurrences of each value in the 'Prediction' column
df['Prediction'].value_counts()

In [None]:
# Create a new figure with a specified size
plt.figure(figsize=(4, 2))

# Create a count plot using Seaborn to show the count of occurrences for each category in the 'Prediction' column
sb.countplot(x = 'Prediction',
            data = df)
plt.xticks([0,1],['Not Spam','Spam'])

# Display the plot
plt.show()

#### Observation after Checking Label Distribution
- The dataset is imbalanced: there are significantly more non-spam emails than spam ones.
- This imbalance can affect the performance of classification algorithms, as they might be biased towards the majority class (non-spam).
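
One common way to keep an imbalanced label distribution consistent between training and test sets is a stratified split. The sketch below is an optional illustration on made-up labels (`X_toy` and `y_toy` are hypothetical); it is not part of the notebook's own pipeline, which splits without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 non-spam (0) and 30 spam (1)
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)  # placeholder features

# stratify=y_toy asks the splitter to preserve the class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Train spam ratio:", y_tr.mean())
print("Test spam ratio:", y_te.mean())
```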

### 5. Create Distribution Plots for the First 12 Columns

- The cell below generates distribution plots (histograms with Kernel Density Estimate overlays) for the first 12 columns.
- This allows us to visually inspect how the values of each feature are distributed across non-spam and spam emails.

In [None]:
# Number of columns to plot
num_columns = 12

# Create a figure and axes
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(12, 12))

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Filter data based on spam and non-spam
spam_df = df[df['Prediction'] == 1]
non_spam_df = df[df['Prediction'] == 0]

# Loop through the first 12 columns and create distribution plots for spam and non-spam
for i, col in enumerate(df.columns[:num_columns]):
    # Plot for non-spam emails
    sb.histplot(non_spam_df[col], ax=axes[i], kde=True, color='blue', label='Non-Spam')
    # Plot for spam emails
    sb.histplot(spam_df[col], ax=axes[i], kde=True, color='red', label='Spam')
    axes[i].set_title(col)
    axes[i].legend()

# Adjust the layout
plt.tight_layout()
plt.show()

- Overall, these histograms illustrate that common words such as "the", "to", "etc", "and", "for", "of", "a", "you", "in", "on", and "is" appear more frequently in non-spam emails than in spam emails. However, the histograms show raw counts, so this difference could partly reflect the larger number of non-spam emails in the dataset rather than a true difference in word usage.

### 6. The Correlation Matrix for the Numeric Columns

- This step visualizes the pairwise correlations between numeric features in the dataset.
- Plotting the heatmap makes it easy to identify strong and weak correlations, patterns, and relationships among features.

In [None]:
# Calculate the correlation matrix for the numeric columns in the dataset.
correlation_matrix = df.corr()

# Build a matrix of booleans (all True) with the same shape as the correlation matrix
ones_corr = np.ones_like(correlation_matrix, dtype=bool)

# Keep only the upper triangle as the mask, so the heatmap hides the redundant upper half
mask = np.triu(ones_corr)

# Create a heatmap to visualize the correlation matrix.
sb.heatmap(correlation_matrix, mask=mask, annot=False, cmap="coolwarm")

# Customize the plot
plt.title("Correlation Heatmap")
plt.show()

- The heatmap shows that most word pairs have very low or no correlation, as indicated by the predominance of blue and white colors.
- There are a few small red and blue dots scattered, which indicate some word pairs have a noticeable positive or negative correlation, but these are relatively rare.
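
With thousands of features, individual cells of the heatmap are hard to read, so it can help to list the strongest pairs directly. The following is a minimal sketch on a made-up frame (`toy` and its columns are hypothetical); the same steps could be applied to `correlation_matrix`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'a' and 'b' move together, 'c' is only weakly related
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair appears exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature_1, feature_2) -> correlation and sort by absolute strength
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs.head(3))
```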

---

## D. Data Modelling

- Begin by applying a simple model to see how well the data supports classification.
- This initial step uses basic machine learning algorithms to establish a baseline for model performance.
- Assessing how well these simple models perform helps evaluate the usefulness of the data and identify potential areas for improvement.
- This baseline also provides a reference point for more complex models and techniques, ensuring that any gains in performance are meaningful and not just a result of overfitting or data leakage.

### 1. Train-Test Sets Preparation

#### a. Create a New DataFrame X by Dropping the 'Prediction' Column from the Original DataFrame

In [None]:
X = df.drop("Prediction", axis=1)
X

#### b. Extract the 'Prediction' Column From the Original DataFrame and Store It in a New Variable y

In [None]:
y = df["Prediction"]
y

#### c. Split the Data Into Training and Test Sets

In [None]:
# Split the data into training and test sets
   # X_train and y_train are the training sets
   # X_test and y_test are the test sets
# test_size=0.2 means 20% of the data will be used for testing and 80% for training
# random_state=42 ensures reproducibility of the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting training and test sets
print(f"Training Features Shape: {X_train.shape}")
print(f"Training Labels Shape: {y_train.shape}")
print(f"Test Features Shape: {X_test.shape}")
print(f"Test Labels Shape: {y_test.shape}")

In [None]:
# Determine the number of features (columns) in the DataFrame X
# Determine the unique classes in the Series y and assign it to n_classes
n_features, n_classes = X.shape[1], np.unique(y)

# Display the number of features and the unique classes
n_features, n_classes

### 2. Apply Multinomial Naive Bayes

- First, apply the Naive Bayes algorithm because it is well-suited for text classification tasks, where each email is represented by word counts (or frequencies) of the most common words.
- MultinomialNB is chosen because it works with features that represent counts, which aligns with the dataset structure where each cell represents the count of a specific word in an email.
- In addition, MultinomialNB naturally supports binary classification tasks, where the goal is to predict whether an email is spam (1) or not spam (0).
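
To make the idea behind a multinomial Naive Bayes classifier on word counts concrete, here is a small from-scratch sketch with Laplace smoothing. It is a simplified illustration, not scikit-learn's implementation; the function names, the three-word vocabulary, and the toy data are all hypothetical.

```python
import math

def train_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word across all emails of class c
        word_totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(word_totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in word_totals]
    return log_prior, log_likelihood

def predict_multinomial_nb(x, log_prior, log_likelihood):
    """Return the class with the highest log posterior score for one email."""
    scores = {
        c: log_prior[c] + sum(n * lp for n, lp in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical vocabulary: ["free", "meeting", "win"]; label 1 = spam, 0 = not spam
X_toy = [[3, 0, 2], [2, 0, 1], [0, 2, 0], [0, 3, 0], [1, 2, 0]]
y_toy = [1, 1, 0, 0, 0]

log_prior, log_likelihood = train_multinomial_nb(X_toy, y_toy)
pred_spammy = predict_multinomial_nb([2, 0, 1], log_prior, log_likelihood)
pred_hammy = predict_multinomial_nb([0, 4, 0], log_prior, log_likelihood)
print(pred_spammy, pred_hammy)
```

Each word contributes its count times its class-specific log probability, which is why count features fit this model naturally.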

In [None]:
# Importing the Multinomial Naive Bayes classifier from scikit-learn
from sklearn.naive_bayes import MultinomialNB

In [None]:
# Instantiate a Multinomial Naive Bayes classifier
nb = MultinomialNB()

# Train (fit) the classifier on the training data
nb.fit(X_train, y_train)

In [None]:
# Use the trained Naive Bayes classifier to predict the labels of the test data
y_pred_nb = nb.predict(X_test)

In [None]:
# Print the classification report, which includes precision, recall, F1-score, and support for each class
print("Classification report: ")
print(classification_report(y_test, y_pred_nb))

#### Explanation of Each Metric Result When Applying Multinomial Naive Bayes

- Precision: This measures the accuracy of positive predictions made by the model. For class 0 (not spam), the precision is 0.98, indicating that 98% of emails predicted as not spam were actually not spam. For class 1 (spam), the precision is 0.89, meaning that 89% of emails predicted as spam were actually spam.
- Recall: It measures the proportion of actual positives that were correctly identified by the model. For class 0, the recall is 0.95, indicating that 95% of actual not spam emails were correctly identified as not spam. For class 1, the recall is 0.96, meaning that 96% of actual spam emails were correctly identified as spam.
- F1-score: The F1-score is the harmonic mean of precision and recall, providing a single metric that balances both precision and recall. The F1-score for class 0 is 0.97, and for class 1, it is 0.92.
- Support: Support is the number of actual occurrences of each class in the test dataset. In this case, there are 739 instances of class 0 (not spam) and 296 instances of class 1 (spam).
- Accuracy: Accuracy measures the overall correctness of the model's predictions across all classes. Here, the overall accuracy is 0.95, meaning that the model correctly predicted 95% of the emails in the test set.
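
The definitions above can be checked by hand from the four cells of a binary confusion matrix. The counts below are hypothetical and chosen only for easy arithmetic; they are not the notebook's results.

```python
# Hypothetical confusion-matrix counts (illustration only)
tn, fp = 90, 10   # actual not spam: predicted not spam / predicted spam
fn, tp = 5, 45    # actual spam:     predicted not spam / predicted spam

# Metrics for class 1 (spam)
precision_spam = tp / (tp + fp)   # share of predicted spam that is really spam
recall_spam = tp / (tp + fn)      # share of real spam that was caught
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)
support_spam = tp + fn            # number of actual spam emails

# Overall accuracy across both classes
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision_spam:.3f} recall={recall_spam:.3f} "
      f"f1={f1_spam:.3f} support={support_spam} accuracy={accuracy:.3f}")
```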

In [None]:
# Compute the confusion matrix to evaluate the performance of the Naive Bayes classifier
cm_nb = confusion_matrix(y_test, y_pred_nb)

# Create a DataFrame from the confusion matrix with labels for rows and columns
df_cm_nb = pd.DataFrame(cm_nb, columns=np.unique(y_test), index=np.unique(y_test))

# Set the names for the index (rows) and columns of the DataFrame
df_cm_nb.index.name = 'Actual'
df_cm_nb.columns.name = 'Predicted'

# Create a new figure for the heatmap with a specific size
plt.figure(figsize=(1.5, 1.5))

# Generate a heatmap visualization of the confusion matrix
sb.heatmap(df_cm_nb, annot=True, annot_kws={"size": 12}, cbar=False, square=True, fmt="d", cmap="Reds")

# Display the heatmap
plt.show()

#### Discussion of Applying Multinomial Naive Bayes

- The confusion matrix shows that false positives for class 1 (legitimate emails predicted as spam) need to be reduced.
- Multinomial Naive Bayes was initially chosen because it is well-suited for text classification tasks where features (word counts or frequencies) are non-negative integers. It assumes that features are conditionally independent given the class, which can work reasonably well for word count data.

### 3. Apply Logistic Regression

- Next, apply Logistic Regression, because it does not rely on the independence assumption of Multinomial Naive Bayes and can therefore account for correlated features when relating them to the target variable.
- Multinomial Naive Bayes assumes that features (word counts) are conditionally independent given the class label (spam or not spam). This means that, within a class, the occurrence of one word is treated as independent of the occurrence of other words. This is often not true in real-world text data, where the occurrences of words in emails can be correlated; Logistic Regression can potentially handle such data more accurately.
- Logistic Regression, on the other hand, models the relationship between the features (word counts) and the probability of each class (spam or not spam) using a logistic function.
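
As a rough illustration of the logistic function mentioned above, the sketch below turns a weighted sum of word counts into a spam probability. The weights, bias, vocabulary, and function name are hypothetical; a fitted model would learn its own coefficients from the training data.

```python
import math

def predict_spam_probability(word_counts, weights, bias):
    """Apply the logistic (sigmoid) function to a weighted sum of word counts."""
    z = bias + sum(w * x for w, x in zip(weights, word_counts))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for the vocabulary ["free", "meeting", "win"]
weights = [0.8, -1.1, 0.6]
bias = -0.5

p_spammy = predict_spam_probability([2, 0, 1], weights, bias)
p_hammy = predict_spam_probability([0, 3, 0], weights, bias)

# A probability above 0.5 would be classified as spam (class 1)
print(f"{p_spammy:.3f} {p_hammy:.3f}")
```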

In [None]:
# Importing the Logistic Regression model from the scikit-learn library.
from sklearn.linear_model import LogisticRegression

In [None]:
# Initializing the Logistic Regression model with a fixed random state for reproducibility
# The max_iter parameter is set to 1000 to ensure the solver has sufficient iterations to converge
reg_log = LogisticRegression(random_state=42, max_iter=1000)

# Training the Logistic Regression model on the training data (X_train and y_train)
reg_log.fit(X_train, y_train)

In [None]:
# Predict the labels for the test data using the trained Logistic Regression model
y_pred_reg_log = reg_log.predict(X_test)

In [None]:
# Print the classification report to evaluate the performance of the Logistic Regression model
print("Classification report: ")
print(classification_report(y_test, y_pred_reg_log))

In [None]:
# Compute the confusion matrix to evaluate the performance of the Logistic Regression model
cm_reg_log = confusion_matrix(y_test, y_pred_reg_log)

# Create a DataFrame for the confusion matrix with appropriate labels for rows and columns
df_cm_reg_log = pd.DataFrame(cm_reg_log, columns=np.unique(y_test), index = np.unique(y_test))

# Set names for the index (rows) and columns of the DataFrame
df_cm_reg_log.index.name = 'Actual'
df_cm_reg_log.columns.name = 'Predicted'

# Create a new figure for the heatmap with specified dimensions
plt.figure(figsize = (1.5,1.5))

# Generate a heatmap for the confusion matrix with annotations and custom formatting
sb.heatmap(df_cm_reg_log, annot=True, annot_kws={"size": 12}, cbar=False, square=True, fmt="d", cmap="Reds")

# Display the heatmap
plt.show()

### 4. Comparison of Two Models: Multinomial Naive Bayes and Logistic Regression

- Accuracy: Logistic Regression has a higher accuracy (0.97) than Multinomial Naive Bayes (0.95).
- Precision: For Class 0, both models have a high precision of 0.98. For Class 1, Logistic Regression (0.94) has higher precision than Multinomial Naive Bayes (0.89).
- Recall: For Class 0, Logistic Regression has a slightly better recall (0.98) than Multinomial Naive Bayes (0.95).
- F1-Score: Logistic Regression has a better f1-score for both classes, particularly for Class 1 (0.95 vs. 0.92).

Therefore, Logistic Regression performs better than MultinomialNB across most metrics, particularly regarding precision and f1-score for Class 1, which suggests it handles the positive class better.

---

## E. Improve the Performance of the Logistic Regression Model by Feature Engineering

### 1. Remove Stop Words

- Stop words are common words that appear frequently in a text but carry little meaningful information, such as "the," "is," "in," "and," etc.
- Removing these words can improve the performance of machine learning models by reducing noise and dimensionality, leading to more meaningful feature representation and potentially better classification results.

In [None]:
# Import the list of stop words from the NLTK library
stop_words = list(stopwords.words('english'))

# Print the list of stop words
print(stop_words)

In [None]:
# Remove the stop words columns from the DataFrame
# The 'errors="ignore"' parameter ensures that non-existent columns are ignored
df_filtered_sw = df.drop(stop_words, axis=1, errors="ignore")

# Display the resulting DataFrame after stop words removal
df_filtered_sw

In [None]:
# Create a new DataFrame X_filtered_sw by dropping the 'Prediction' column from df_filtered_sw
X_filtered_sw = df_filtered_sw.drop("Prediction", axis=1)

# Display X_filtered_sw, which now contains the filtered features after stop words removal
X_filtered_sw

In [None]:
# Create a new Series y_filtered_sw containing the 'Prediction' column from df_filtered_sw
y_filtered_sw = df_filtered_sw["Prediction"]

# Display y_filtered_sw, which now contains the labels after stop words removal
y_filtered_sw

#### Split the Filtered Dataset Into Training and Testing Sets

In [None]:
X_train_filtered_sw, X_test_filtered_sw, y_train_filtered_sw, y_test_filtered_sw = train_test_split(X_filtered_sw, y_filtered_sw, test_size=0.2, random_state=42)

# Print the shapes of the training and testing sets to verify the split
print(X_train_filtered_sw.shape, y_train_filtered_sw.shape, X_test_filtered_sw.shape, y_test_filtered_sw.shape)

#### Apply Logistic Regression

In [None]:
# Initialize a Logistic Regression model with specified random state and maximum iterations
reg_log_filtered_sw = LogisticRegression(random_state=42, max_iter=1000)

# Train the Logistic Regression model on the filtered training data
reg_log_filtered_sw.fit(X_train_filtered_sw, y_train_filtered_sw)

In [None]:
# Predict the labels for the filtered test data using the trained Logistic Regression model
y_pred_reg_log_filtered_sw = reg_log_filtered_sw.predict(X_test_filtered_sw)

In [None]:
# Print a header for the classification report
print("Classification report: ")

# Print the classification report to evaluate the performance of the Logistic Regression model
print(classification_report(y_test_filtered_sw, y_pred_reg_log_filtered_sw))

In [None]:
# Compute the confusion matrix to evaluate the performance of the Logistic Regression model
cm_reg_log_filtered_sw = confusion_matrix(y_test_filtered_sw, y_pred_reg_log_filtered_sw)

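# Create a DataFrame from the confusion matrix with labels for rows and columns
df_cm_reg_log_filtered_sw = pd.DataFrame(cm_reg_log_filtered_sw, columns=np.unique(y_test_filtered_sw), index=np.unique(y_test_filtered_sw))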

# Set the names for the DataFrame's index (rows) and columns
df_cm_reg_log_filtered_sw.index.name = 'Actual'
df_cm_reg_log_filtered_sw.columns.name = 'Predicted'

# Create a new figure for the heatmap 
plt.figure(figsize = (1.5,1.5))

# Generate the heatmap visualization
sb.heatmap(df_cm_reg_log_filtered_sw, annot=True, annot_kws={"size": 12}, cbar=False, square=True, fmt="d", cmap="Reds")

plt.show()

#### Discussion of Removing Stop Words

The model's performance did not improve after removing stop words. There could be several reasons why this happened:

- The classifier might not rely heavily on stop words to separate spam from non-spam emails in this dataset.
- Removing stop words might not have reduced the feature space significantly enough to improve model performance.
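
One way to check the second point is to count how many feature columns are actually stop words, and therefore how much the feature space shrinks. The sketch below uses short made-up lists (`feature_columns` and `toy_stop_words` are hypothetical); with the real data, the NLTK stop-word list and the columns of `df` could be used instead.

```python
# Hypothetical column names and stop-word list, for illustration only
feature_columns = ["the", "to", "money", "and", "offer", "meeting", "of"]
toy_stop_words = ["the", "to", "and", "of", "is", "were"]

stop_set = set(toy_stop_words)

# Stop words that actually exist as feature columns and would be dropped
dropped = [col for col in feature_columns if col in stop_set]
kept = [col for col in feature_columns if col not in stop_set]

# Share of the feature space removed by stop-word filtering
reduction = len(dropped) / len(feature_columns)
print(f"Dropped {len(dropped)} of {len(feature_columns)} columns ({reduction:.1%})")
```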

### 2. Scaling

- Next, min-max scaling is applied before Logistic Regression, because the model tends to perform better when features are on a similar scale. Min-max scaling was chosen because the data distribution is not Gaussian (not normal).
- If features are on different scales (e.g., one feature ranges from 0 to 1000 and another from 0 to 10), features with larger scales can dominate the model's learning process. Scaling ensures that all features are transformed to a similar scale, preventing this dominance and allowing the model to learn from each feature more uniformly.
- Besides, scaling ensures that the optimization process converges more quickly and efficiently, leading to faster training times.
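
Min-max scaling maps each feature with x' = (x - min) / (max - min), where min and max are taken per column. The NumPy sketch below shows the arithmetic on made-up data (`X_train_toy` and `X_test_toy` are hypothetical), learning the statistics from the training rows only and reusing them for the test rows.

```python
import numpy as np

# Hypothetical count features: rows are emails, columns are words
X_train_toy = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]])
X_test_toy = np.array([[2.0, 25.0], [12.0, 10.0]])

# "Fit": learn the per-column min and max from the training data only
col_min = X_train_toy.min(axis=0)
col_max = X_train_toy.max(axis=0)
# Guard against division by zero for constant columns
col_range = np.where(col_max > col_min, col_max - col_min, 1.0)

# "Transform": apply the same training statistics to both sets
X_train_scaled = (X_train_toy - col_min) / col_range
X_test_scaled = (X_test_toy - col_min) / col_range

print(X_train_scaled)
print(X_test_scaled)
```

Because the test rows reuse the training statistics, a test value can fall slightly outside [0, 1] when it exceeds the range seen during training.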

In [None]:
# Importing preprocessing module from sklearn
from sklearn import preprocessing

#### Min-Max Scaler

In [None]:
# Initialize MinMaxScaler for scaling features to a specified range (default is [0, 1])
min_max_scaler = preprocessing.MinMaxScaler()

In [None]:
# Fit and transform the training data using MinMaxScaler to scale features to [0, 1]
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax

In [None]:
# Transform the test data using the MinMaxScaler fitted on X_train
# (transform, not fit_transform, so the test set is scaled with the training min/max)
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax

#### Apply Logistic Regression

In [None]:
# Initialize Logistic Regression with MinMax scaled data
reg_log_minmax = LogisticRegression(random_state=42, max_iter=1000)
reg_log_minmax.fit(X_train_minmax, y_train)

In [None]:
# Predict using the logistic regression model with MinMax scaled test data
y_pred_reg_log_minmax = reg_log_minmax.predict(X_test_minmax)

In [None]:
# Print classification report to evaluate model performance
print("Classification report: ")
print(classification_report(y_test, y_pred_reg_log_minmax))

In [None]:
# Compute the confusion matrix to assess model performance
cm_reg_log_minmax = confusion_matrix(y_test, y_pred_reg_log_minmax)

# Create a DataFrame from confusion matrix for visualization
df_cm_reg_log_minmax = pd.DataFrame(cm_reg_log_minmax, columns=np.unique(y_test), index=np.unique(y_test))

# Set the names for the DataFrame's index (rows) and columns
df_cm_reg_log_minmax.index.name = 'Actual'
df_cm_reg_log_minmax.columns.name = 'Predicted'

# Create a new figure for the heatmap visualization
plt.figure(figsize=(1.5, 1.5))

# Generate the heatmap using Seaborn
sb.heatmap(df_cm_reg_log_minmax, annot=True, annot_kws={"size": 12}, cbar=False, square=True, fmt="d", cmap="Reds")

# Display the heatmap
plt.show()

#### Discussion of the Performance after Using Min-Max Scaling

- Precision and recall for both classes (0 and 1) have improved in the current result. For class 0 (not spam), precision and recall increased from 0.98 to 0.99, indicating fewer false positives and better identification of actual not-spam emails. For class 1 (spam), precision and recall increased from 0.94 to 0.97, showing better performance in identifying spam emails correctly.
- The F1-scores have also improved across both classes, reflecting a better balance between precision and recall. Class 0's F1-score increased from 0.98 to 0.99, and class 1's F1-score improved from 0.95 to 0.97. This indicates a more robust performance in classification after scaling.
- Overall accuracy increased from 0.97 to 0.98. This means that the model is making correct predictions for 98% of the emails in the test set after scaling, compared to 97% previously.

---

## F. Model Comparison

In [None]:
# Create figure for the heatmaps
fig, axs = plt.subplots(1, 4, figsize=(20, 5))

# Heatmap for Naive Bayes
sb.heatmap(df_cm_nb, annot=True, annot_kws={"size": 20}, cbar=False, square=True, fmt="d", cmap="Reds", ax=axs[0])
axs[0].set_title('Naive Bayes', fontsize=18)
axs[0].set_xlabel('Predicted', fontsize=16)
axs[0].set_ylabel('Actual', fontsize=16)
axs[0].tick_params(axis='both', which='major', labelsize=16)

# Heatmap for Logistic Regression
sb.heatmap(df_cm_reg_log, annot=True, annot_kws={"size": 20}, cbar=False, square=True, fmt="d", cmap="Blues", ax=axs[1])
axs[1].set_title('Logistic Regression', fontsize=18)
axs[1].set_xlabel('Predicted', fontsize=16)
axs[1].set_ylabel('Actual', fontsize=16)
axs[1].tick_params(axis='both', which='major', labelsize=16)

# Heatmap for Logistic Regression on Filtered Stopwords
sb.heatmap(df_cm_reg_log_filtered_sw, annot=True, annot_kws={"size": 20}, cbar=False, square=True, fmt="d", cmap="Greens", ax=axs[2])
axs[2].set_title('Logistic Regression (Filtered Stopwords)', fontsize=18)
axs[2].set_xlabel('Predicted', fontsize=16)
axs[2].set_ylabel('Actual', fontsize=16)
axs[2].tick_params(axis='both', which='major', labelsize=16)

# Heatmap for Logistic Regression on Min-Max Scaled Features
sb.heatmap(df_cm_reg_log_minmax, annot=True, annot_kws={"size": 20}, cbar=False, square=True, fmt="d", cmap="Purples", ax=axs[3])
axs[3].set_title('Logistic Regression (Min-Max Scaled)', fontsize=18)
axs[3].set_xlabel('Predicted', fontsize=16)
axs[3].set_ylabel('Actual', fontsize=16)
axs[3].tick_params(axis='both', which='major', labelsize=16)

# Adjust layout
plt.tight_layout()
plt.show()

- In conclusion, Min-Max scaling has effectively contributed to the improved performance of the logistic regression model by enhancing its ability to learn from the data, leading to better classification results for identifying spam and non-spam emails in our dataset.
- The combination of Min-Max scaling and Logistic Regression achieved the highest performance.

---

## G. Cross-Validation

- Cross-validation is a technique used to evaluate the performance of a model by dividing the data into multiple subsets, training the model on some subsets, and testing it on the remaining subsets.
- This process is repeated several times to ensure the model's effectiveness and to prevent overfitting.

In [None]:
# Import cross-validation score evaluation function
from sklearn.model_selection import cross_val_score

In [None]:
# Apply MinMax scaling to the feature matrix X
X_minmax = min_max_scaler.fit_transform(X)
X_minmax

In [None]:
# Create a Logistic Regression model with specified parameters
log_reg = LogisticRegression(random_state=42, max_iter=1000)

# Perform cross-validation on the MinMax scaled data X_minmax and target y
reg_log_minmax_scores = cross_val_score(log_reg, X_minmax, y, cv=5, scoring='accuracy')

# Print the mean accuracy of cross-validation scores for Logistic Regression with MinMax scaling
print("Logistic Regression Minmax CV Accuracy: ", reg_log_minmax_scores.mean())

#### Discussion on the Cross-Validation

- The "Logistic Regression Minmax CV Accuracy: 0.95417281043553" represents the average accuracy of a Logistic Regression model trained on data that has been scaled using MinMax scaling, evaluated using 5-fold cross-validation.
- This metric indicates that, on average, the model correctly predicts the class of emails (spam or not spam) about 95.4% of the time across different folds of the dataset.
- This score provides a robust estimate of the model's performance and its ability to generalize to unseen data, accounting for variations in the training and validation subsets used in cross-validation.
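
A possible refinement, sketched here as an assumption rather than as the notebook's method, is to place the scaler and the classifier in a single scikit-learn pipeline. `cross_val_score` then refits the scaler on the training part of each fold, so the held-out fold never influences the scaling. The data below is synthetic (`X_toy`, `y_toy`) and only stands in for the email features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic count data standing in for the email features (illustration only)
rng = np.random.default_rng(42)
X_toy = rng.poisson(lam=2.0, size=(200, 20))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 4).astype(int)

# Scaling and classification are fitted together inside each cross-validation fold
pipeline = make_pipeline(
    MinMaxScaler(),
    LogisticRegression(random_state=42, max_iter=1000),
)

scores = cross_val_score(pipeline, X_toy, y_toy, cv=5, scoring="accuracy")
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```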

---

## H. Testing with External Dataset

- Link of the dataset: https://www.kaggle.com/datasets/ozlerhakan/spam-or-not-spam-dataset
- The dataset contains 2,500 non-spam emails and 500 spam emails.

In [None]:
# Retrieve all columns except the last one from the dataframe `df`
columns = df.columns[:-1]
columns

In [None]:
import re  # Import the re module for working with regular expressions
from collections import Counter  # Import the Counter class from the collections module

#### Import External Data

In [None]:
# Read the CSV file "spam_or_not_spam.csv", which serves as the new (external) data
df_external = pd.read_csv("/kaggle/input/spam-or-not-spam-dataset/spam_or_not_spam.csv")
df_external

In [None]:
# Ensure all entries in the 'email' column are strings
df_external['email'] = df_external['email'].astype(str)

#### Data Preprocessing

In [None]:
# Define a function to check if the text contains only unaccented Latin letters (a-z, A-Z) and whitespace
def contains_only_latin_characters(text):
    return bool(re.match(r'^[a-zA-Z\s]*$', text))

# Keep only the rows whose text consists solely of Latin letters and whitespace
df_external = df_external[df_external['email'].apply(contains_only_latin_characters)]

# Reset the index of the DataFrame
df_external = df_external.reset_index(drop=True)

df_external
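
The filter is stricter than its name suggests: it rejects any email that contains a digit, a punctuation mark, or an accented letter, and it accepts an empty string. A small check on made-up strings (not rows from the dataset) illustrates this behaviour:

```python
import re

def contains_only_latin_characters(text):
    # Same pattern as in the cell above: ASCII letters and whitespace only
    return bool(re.match(r'^[a-zA-Z\s]*$', text))

# Made-up sample strings for illustration
samples = ["hello world", "win 100 dollars", "café offer", ""]
results = {s: contains_only_latin_characters(s) for s in samples}
print(results)
```

Only `"hello world"` and the empty string pass, so it is worth checking how many rows of `df_external` survive this step before interpreting the evaluation.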

#### Feature Extraction

In [None]:
# Function to preprocess and tokenize text
def preprocess_text(text):
    text = re.sub(r'[^a-z\s]', '', text.lower())  # Convert to lowercase, then remove everything except a-z and whitespace
    text = re.sub(r"\W", " ", text)  # Only whitespace can still match \W here; turn each such character into a space
    text = re.sub(r'\s+', " ", text)  # Collapse runs of whitespace into a single space

    # Tokenize the text by splitting on whitespace
    tokens = text.split()
    return tokens

# Initialize a list to store the word frequency data for each email
data_list = []

# Process each email
for email in df_external["email"]:
    # Preprocess the text
    tokens = preprocess_text(email)
    # Count word frequencies
    word_counts = Counter(tokens)
    # Create a dictionary for the current email's word frequencies
    data = {word: word_counts.get(word, 0) for word in columns}
    data_list.append(data)

# Create a DataFrame from the list of dictionaries
X_test_external = pd.DataFrame(data_list)

X_test_external
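
The loop above projects each external email onto the vocabulary of the training data: words outside the vocabulary are ignored, and vocabulary words that do not occur get a count of zero, so the resulting matrix has the same columns as the training features. A minimal sketch of that idea on a toy example, where `count_vocab_words` and the three-word vocabulary are hypothetical stand-ins for the notebook's `columns`:

```python
import re
from collections import Counter

def count_vocab_words(text, vocabulary):
    """Count occurrences of each vocabulary word in text, in vocabulary order."""
    tokens = re.sub(r'[^a-z\s]', '', text.lower()).split()
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocabulary]

# Hypothetical three-word vocabulary standing in for the training columns
toy_vocabulary = ["free", "meeting", "the"]
row = count_vocab_words("FREE offer! Claim the free prize.", toy_vocabulary)
print(row)  # [2, 0, 1]
```

The words "offer", "claim", and "prize" are silently dropped because they are not in the vocabulary, which is one reason a model can lose accuracy on emails written in a different style from the training set.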

In [None]:
# Extract the last column (assumed to be the target variable) from the DataFrame df_external
y_test_external = df_external.iloc[:,-1] 
y_test_external

In [None]:
# Create countplot
sb.countplot(x=y_test_external)
plt.title('Countplot of y')
plt.xlabel('y')
plt.ylabel('Count')
plt.show()

In [None]:
# Scale the features in X_test_external with the already-fitted MinMaxScaler.
# Use transform (not fit_transform) so the minima and maxima learned earlier are reused.
X_test_minmax_external = min_max_scaler.transform(X_test_external)
X_test_minmax_external
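
Min-Max scaling maps each feature value $x$ to $(x - \min) / (\max - \min)$, where the minimum and maximum are learned per column. Reusing the values learned from the training data keeps external features on the scale the model was trained on; values outside the training range then fall outside $[0, 1]$. A pure-Python sketch of the idea follows. The helper names are hypothetical, the numbers are made up, and this is a simplification rather than scikit-learn's internal implementation.

```python
def fit_min_max(rows):
    """Learn the per-column minimum and maximum from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def apply_min_max(rows, mins, maxs):
    """Scale rows with previously learned minima and maxima."""
    scaled = []
    for row in rows:
        scaled.append([
            # Simplification: constant columns are mapped to 0.0
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, mins, maxs)
        ])
    return scaled

train = [[0, 2], [4, 6], [2, 10]]  # made-up training features
new = [[8, 4]]                     # made-up external sample
mins, maxs = fit_min_max(train)
print(apply_min_max(new, mins, maxs))  # [[2.0, 0.25]]
```

The first feature of the new sample scales to 2.0 because 8 is twice the largest training value. Refitting the scaler on the new sample alone would hide that difference from the model.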

In [None]:
# Predict using the logistic regression model with MinMax scaled test data
y_pred_reg_log_minmax_external = reg_log_minmax.predict(X_test_minmax_external)

In [None]:
# Print the classification report comparing y_test_external and y_pred_reg_log_minmax_external
print("Classification report: ")
print(classification_report(y_test_external, y_pred_reg_log_minmax_external))

In [None]:
# Compute the confusion matrix to assess model performance
cm_reg_log_minmax_external = confusion_matrix(y_test_external, y_pred_reg_log_minmax_external)

# Create a DataFrame from the confusion matrix for visualization, labelled with the external dataset's classes
df_reg_log_minmax_external = pd.DataFrame(cm_reg_log_minmax_external, columns=np.unique(y_test_external), index=np.unique(y_test_external))

# Set the names for the DataFrame's index (rows) and columns
df_reg_log_minmax_external.index.name = 'Actual'
df_reg_log_minmax_external.columns.name = 'Predicted'

# Create a new figure for the heatmap visualization
plt.figure(figsize=(1.5, 1.5))

# Generate the heatmap using Seaborn
sb.heatmap(df_reg_log_minmax_external, annot=True, annot_kws={"size": 12}, cbar=False, square=True, fmt="d", cmap="Reds")

# Display the heatmap
plt.show()

#### Discussion on the Result of Testing on the External Dataset

- Although performance drops on the external dataset, the model still identifies a substantial portion of spam emails (recall of 0.75 for spam). This suggests that the model retains some ability to generalize, which is promising for handling real-world data.
- The model maintains high precision for not-spam emails (0.91) on the external dataset, meaning that most emails predicted as not spam are indeed not spam. This matters for user trust because relatively little spam reaches the inbox. How often legitimate emails are wrongly flagged as spam is a separate question, measured by recall for the not-spam class (or, equivalently, precision for the spam class).
- To improve performance on new incoming emails, consider training on more diverse data, handling class imbalance better, and tuning the model further.
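
The per-class precision and recall in the classification report can be derived directly from the confusion matrix. The sketch below assumes the scikit-learn convention that rows are actual classes and columns are predicted classes; `per_class_metrics` is a hypothetical helper and the counts are made up, not the notebook's output.

```python
def per_class_metrics(cm):
    """Compute precision and recall per class from a square confusion matrix.

    cm[i][j] is the number of samples with actual class i predicted as class j.
    """
    n = len(cm)
    metrics = {}
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column total
        actual_k = sum(cm[k])                          # row total
        metrics[k] = {
            "precision": tp / predicted_k if predicted_k else 0.0,
            "recall": tp / actual_k if actual_k else 0.0,
        }
    return metrics

# Hypothetical counts: class 0 = not spam, class 1 = spam
toy_cm = [[80, 20],
          [10, 40]]
print(per_class_metrics(toy_cm))
```

In this toy matrix, precision for class 0 is 80 / 90 (spam that slipped through lowers it), while recall for class 0 is 80 / 100 (legitimate emails flagged as spam lower it).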

---

## I. Findings and Insights

- Overall, the dataset consists of 5,172 emails with features derived from counts of the 3,000 most common words, creating a high-dimensional and sparse feature space. Exploratory Data Analysis revealed significant variation in feature values, with high means for common words and a wide range of values, and no missing data. The label distribution is notably imbalanced, with more non-spam emails than spam, which can bias the model towards the majority class.

- Regarding model performance, Multinomial Naive Bayes was effective for text classification but still left room to reduce false positives in spam detection. Logistic Regression outperformed Naive Bayes across several metrics, including accuracy (0.97 vs. 0.95), precision for spam detection (0.94 vs. 0.89), recall, and F1-score. In feature engineering, removing stop words did not improve model performance, possibly because it changed the feature space only slightly. Applying Min-Max scaling, however, increased precision, recall, and F1-score for both classes and raised overall accuracy from 0.97 to 0.98.

- Cross-validation showed that Logistic Regression with Min-Max scaling achieved an average accuracy of 95.4% across 5 folds, indicating strong generalization on the original dataset. Performance dropped on the external dataset (accuracy of 0.7), but the model maintained high precision for non-spam emails (0.91) and good recall for spam (0.75), suggesting it retains some ability to generalize. Future improvements could focus on incorporating more diverse training data and addressing class imbalance more effectively.

- Challenges included the feature independence assumption in Naive Bayes, which may not capture all the complexities of the data; Logistic Regression was applied to address this. While Min-Max scaling improved the performance of Logistic Regression, further feature engineering could provide additional benefits. Another challenge was managing class imbalance to avoid bias towards non-spam emails. Both under-sampling and over-sampling were attempted, but both yielded worse results on the external dataset. Alternative methods, such as more advanced resampling techniques or class weights, are therefore worth considering.
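
As an illustration of the class-weight idea, one common heuristic weights each class by `n_samples / (n_classes * class_count)`, so that the minority class contributes more to the loss. The sketch below is not part of the notebook's experiments; `balanced_class_weights` is a hypothetical helper and the labels are illustrative only.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Illustrative labels: 0 = not spam, 1 = spam, with a 5:1 imbalance
toy_labels = [0] * 2500 + [1] * 500
print(balanced_class_weights(toy_labels))  # {0: 0.6, 1: 3.0}
```

scikit-learn's `LogisticRegression` applies this same heuristic when constructed with `class_weight='balanced'`, so trying it would not require computing the weights by hand.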