In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/spam-mails-dataset/spam_ham_dataset.csv


In [3]:
import string

import re

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

import nltk

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

### Essential Library Imports for Email Classification

To effectively preprocess the email text data and train our classification model, we need to import several important Python libraries:

- **string**: Provides `string.punctuation`, a string constant containing the ASCII punctuation characters, which we use to filter punctuation from the text.
- **re**: Enables the use of regular expressions for pattern matching, which is essential for text cleaning tasks such as punctuation removal.
- **pandas**: A powerful library for data manipulation and analysis, which allows us to handle our dataset as a DataFrame.
- **numpy**: Provides support for large, multi-dimensional arrays and matrices, and includes mathematical functions to operate on these arrays.
- **matplotlib**: A popular data visualization library, which we'll use to create visual representations of our data.
- **nltk**: The Natural Language Toolkit, which contains tools like stopwords and stemmers to process and simplify text data.
- **sklearn**: The Scikit-learn library, essential for machine learning tasks. We use `train_test_split` to divide the dataset into training and testing sets and `RandomForestClassifier` as the classification model.

These libraries form the foundation of our spam classification project and will assist in various tasks like text preprocessing, data handling, and model training. The cell above imports them.

---

In [4]:
df = pd.read_csv('/kaggle/input/spam-mails-dataset/spam_ham_dataset.csv')

### Loading the Spam/Ham Dataset

In this step, we will load the **Spam/Ham Dataset** using the `pandas` library. This dataset contains labeled email data, where each email is classified as either "spam" or "ham" (non-spam). The dataset will be used to train a machine learning model that can automatically classify emails based on their content.

The file is read into a **DataFrame** using `pd.read_csv()`, allowing us to efficiently manipulate, preprocess, and analyze the data in a tabular format. Once loaded, we can begin to clean and prepare the text data for model training.

---

In [5]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  5171 non-null   int64 
 1   label       5171 non-null   object
 2   text        5171 non-null   object
 3   label_num   5171 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 161.7+ KB


|   | Unnamed: 0 | label | text | label_num |
|---|---|---|---|---|
| 0 | 605 | ham | Subject: enron methanol ; meter # : 988291\r\n... | 0 |
| 1 | 2349 | ham | Subject: hpl nom for january 9 , 2001\r\n( see... | 0 |
| 2 | 3624 | ham | Subject: neon retreat\r\nho ho ho , we ' re ar... | 0 |
| 3 | 4685 | spam | Subject: photoshop , windows , office . cheap ... | 1 |
| 4 | 2030 | ham | Subject: re : indian springs\r\nthis deal is t... | 0 |


### Exploring the Dataset: Structure and First Look

Before we begin processing the data, it's important to understand the structure and contents of our **Spam/Ham Dataset**. The following steps will help us get an overview of the data:

- **`df.info()`**: This command provides essential information about the dataset, including the number of entries, column names, data types, and whether there are any missing values.
  
- **`df.head()`**: This function allows us to preview the first five rows of the dataset, giving us an initial glimpse into the data and its structure. We'll use this to check the content of the emails and their corresponding labels (spam or ham).

These initial exploratory steps are crucial for ensuring that the dataset is properly loaded and formatted before we proceed with preprocessing and model training.

---

In [6]:
stemmer = PorterStemmer()

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Initializing Text Preprocessing Tools

In this step, we will initialize key tools from the **Natural Language Toolkit (NLTK)** to help us preprocess the email text data:

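- **PorterStemmer**: A rule-based stemming algorithm that strips common suffixes to reduce words to a shared stem. For example, "running" and "runs" are both reduced to "run". The stem is not always a dictionary word ("january" becomes "januari" in the output of the preprocessing step), and not every related form is merged ("runner" is left unchanged). This reduces redundancy in word forms.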
  
- **Stopwords**: Stopwords are common words like "the", "is", "in", and "and", which do not contribute much to the meaning of the text. We'll download the list of English stopwords from NLTK and remove them from the email content during preprocessing to focus on the more significant words.

These tools clean and simplify the text, which shrinks the vocabulary and can improve the accuracy of our machine learning model. The cell above initializes them.

---

In [7]:
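def preprocess_text(text):
    """Lowercase, strip punctuation, drop stopwords, and stem each word."""
    text = text.lower()
    # re.escape keeps characters such as ], ^, - and \ literal inside the character class
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)

    # Split on whitespace and drop English stopwords
    words = text.split()
    filtered_words = [word for word in words if word not in stop_words]

    # Reduce each remaining word to its Porter stem
    stemmed_words = [stemmer.stem(word) for word in filtered_words]

    return " ".join(stemmed_words)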

df['simplified_text'] = df['text'].apply(preprocess_text)

print(df[['text', 'simplified_text']].head(3))

                                                text  \
0  Subject: enron methanol ; meter # : 988291\r\n...   
1  Subject: hpl nom for january 9 , 2001\r\n( see...   
2  Subject: neon retreat\r\nho ho ho , we ' re ar...   

                                     simplified_text  
0  subject enron methanol meter 988291 follow not...  
1  subject hpl nom januari 9 2001 see attach file...  
2  subject neon retreat ho ho ho around wonder ti...  


### Preprocessing Email Text Data for Spam Classification

Before training our machine learning model to classify emails as spam or not, we need to preprocess the text data to improve its quality and ensure the model performs well. The preprocessing steps include:

1. **Lowercasing**: Convert all text in the email to lowercase to maintain uniformity and avoid case-sensitive mismatches.
2. **Removing Punctuation**: Punctuation marks are treated as noise for this task, so each one is replaced with a space using a regular expression. Replacing rather than deleting keeps words separated (e.g., "low-price" becomes "low price", not "lowprice").
3. **Tokenization and Stopword Removal**: The text will be split into individual words (tokens), and common English stopwords (e.g., "the", "is", "and") will be removed to focus on more meaningful words.
4. **Stemming**: Words will be reduced to their root form using the Porter Stemmer (e.g., "running" becomes "run"), which reduces the dimensionality of the feature space.

By applying these preprocessing steps, we aim to simplify and clean the data, making it more suitable for machine learning algorithms. The code above applies these transformations to the `text` column of our dataset, creating a new column `simplified_text` with the processed text.
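The first three steps can be tried on a single string without NLTK. The sketch below is only an illustration: `TOY_STOP_WORDS` is a made-up, five-word stand-in for NLTK's English stopword list, the sample sentence is invented, and stemming is left out.

```python
import re
import string

# Hypothetical, tiny stopword list for illustration only;
# the notebook uses NLTK's full English list instead.
TOY_STOP_WORDS = {"the", "is", "for", "a", "this"}

# re.escape keeps characters such as ], ^, - and \ literal inside the character class.
PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")

def clean(text):
    """Lowercase, replace punctuation with spaces, split on whitespace, drop stopwords."""
    text = PUNCT_RE.sub(" ", text.lower())
    return [word for word in text.split() if word not in TOY_STOP_WORDS]

tokens = clean("Subject: This is the BEST offer... for a low-price!")
print(tokens)  # ['subject', 'best', 'offer', 'low', 'price']
```

Because punctuation is replaced with a space rather than deleted, "low-price" yields two tokens instead of the merged "lowprice".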

---

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['simplified_text'])

### TF-IDF Vectorization of Text Data

In this step, we're using `TfidfVectorizer` from the `sklearn.feature_extraction.text` module to convert the textual data into numerical features. TF-IDF (Term Frequency-Inverse Document Frequency) helps in weighing the importance of words relative to each document in the dataset.

- **`TfidfVectorizer()`**: Initializes the TF-IDF vectorizer.
- **`fit_transform(df['simplified_text'])`**: 
  - **`fit`**: Learns the vocabulary from the 'simplified_text' column of our dataframe.
  - **`transform`**: Converts the text into a TF-IDF-weighted document-term matrix.

The result, `X`, is a sparse matrix where each row represents a document and each column represents a unique word from the corpus, with the corresponding values being their TF-IDF scores.
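To make the TF-IDF scores less of a black box, here is a hand-rolled sketch of the weighting that `TfidfVectorizer` applies with its default settings, as I understand them: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row. The three toy documents are invented, and the sketch splits on whitespace, whereas the vectorizer's default tokenizer also drops single-character tokens.

```python
import math
from collections import Counter

# Toy, already-preprocessed documents (invented for illustration).
docs = ["free offer free", "meeting schedule", "free meeting"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(tokens):
    """Raw count times IDF for each term, then scale the row to unit L2 norm."""
    counts = Counter(tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

row = tfidf_row(tokenized[0])
print(row)
```

In the first document, "free" appears twice, so it outweighs "offer" even though "offer" is the rarer term across the corpus.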

---

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, df['label'], test_size=0.2, random_state=42)

### Splitting the Dataset into Training and Testing Sets

In this step, we split our dataset into training and testing subsets using `train_test_split` from the `sklearn.model_selection` module. This ensures we have separate data for training the model and for evaluating its performance.

- **`train_test_split(X, df['label'], test_size=0.2, random_state=42)`**:
  - **`X`**: The TF-IDF matrix representing the features (transformed text data).
  - **`df['label']`**: The target labels associated with each document (spam or not spam).
  - **`test_size=0.2`**: 20% of the data will be used for testing, and 80% for training.
  - **`random_state=42`**: Sets a seed for reproducibility so that the same random splitting can be achieved every time the code is run.

The output variables:
- **`X_train`**: Training set features (80% of the TF-IDF data).
- **`X_test`**: Testing set features (20% of the TF-IDF data).
- **`y_train`**: Training set labels corresponding to `X_train`.
- **`y_test`**: Testing set labels corresponding to `X_test`.
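One caveat: the vectorizer in this notebook was fitted on the full dataset before the split, so the vocabulary and IDF weights were learned partly from the test emails. A common alternative, sketched below as an assumption rather than as what the notebook does, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so that the vectorizer only ever sees training data. The eight toy texts and labels are invented stand-ins for `df['simplified_text']` and `df['label']`; `stratify` and `random_state` on the classifier are optional extras not used in the notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-ins for df['simplified_text'] and df['label'] (invented data).
texts = ["free prize claim", "meet monday offic", "cheap pill free", "project report attach",
         "win cash prize", "lunch tomorrow team", "free cash offer", "schedul call client"]
labels = ["spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]

# Split the raw text first; stratify keeps the spam/ham ratio equal in both subsets.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# The pipeline fits the vectorizer on the training text only,
# then reuses that fitted vocabulary when predicting on the test text.
pipeline = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=42))
pipeline.fit(text_train, y_train)

vocabulary = pipeline.named_steps["tfidfvectorizer"].vocabulary_
train_words = {word for text in text_train for word in text.split()}
predictions = pipeline.predict(text_test)
print(len(vocabulary), list(predictions))
```

Words that occur only in the test emails never enter the vocabulary, which is closer to how the model would meet genuinely unseen mail.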

---

In [12]:
model = RandomForestClassifier()
model.fit(X_train, y_train)

### Training the Random Forest Classifier

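In this step, we train a Random Forest Classifier on the training data. Random Forest is an ensemble learning method that builds many decision trees, each on a bootstrap sample of the training data, and combines their predictions. The classic formulation outputs the mode (majority vote) of the classes predicted by the individual trees; scikit-learn's implementation instead averages the trees' predicted class probabilities and picks the class with the highest average.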

- **`RandomForestClassifier()`**: Initializes the Random Forest Classifier.
- **`fit(X_train, y_train)`**: 
  - **`X_train`**: The training features (TF-IDF matrix for the training data).
  - **`y_train`**: The training labels (whether the emails are spam or not).

The model learns patterns in the training data to make predictions on unseen data in the future.
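There are two common ways to combine the trees' outputs: a hard majority vote over predicted classes, or an average of predicted class probabilities (the approach scikit-learn's documentation describes for its forests). The toy sketch below uses three invented per-tree probabilities to show that the two rules can disagree; it is not how the library is implemented internally.

```python
from collections import Counter

# Hypothetical spam probabilities from three trees for a single email.
tree_spam_probs = [0.9, 0.4, 0.4]

# Hard voting: each tree casts one vote for its most likely class.
votes = ["spam" if prob >= 0.5 else "ham" for prob in tree_spam_probs]
hard_vote = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probabilities, then pick the more likely class.
mean_spam_prob = sum(tree_spam_probs) / len(tree_spam_probs)
soft_vote = "spam" if mean_spam_prob >= 0.5 else "ham"

print(hard_vote, soft_vote)  # ham spam
```

With fully grown trees (`max_depth=None`) the leaves are often pure, so per-tree probabilities tend to be 0 or 1 and the two rules usually agree.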

---

In [13]:
from sklearn.metrics import classification_report, accuracy_score
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

         ham       0.99      0.98      0.99       742
        spam       0.96      0.97      0.96       293

    accuracy                           0.98      1035
   macro avg       0.97      0.98      0.98      1035
weighted avg       0.98      0.98      0.98      1035

Accuracy: 0.9797101449275363


### Evaluating the Model Performance (Accuracy: 97.97%)

After training the Random Forest Classifier, we now evaluate its performance on the test data using various metrics:

- **`model.predict(X_test)`**: Generates predictions (`y_pred`) for the test set based on the patterns learned from the training data.
  
- **`classification_report(y_test, y_pred)`**: Provides detailed performance metrics including:
  - **Precision**: The proportion of correctly predicted positive observations out of all predicted positives.
  - **Recall (Sensitivity)**: The proportion of correctly predicted positive observations out of all actual positives.
  - **F1-Score**: The harmonic mean of precision and recall, giving a single performance score for each class.
  - **Support**: The number of occurrences of each class in the true labels.

- **`accuracy_score(y_test, y_pred)`**: Calculates the overall accuracy of the model, which is the proportion of correctly predicted labels out of the total test samples.

This evaluation helps in understanding the model’s performance in terms of both accuracy and how well it balances precision and recall for each class.
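The per-class numbers in the report follow directly from counts of true positives, false positives, and false negatives. The sketch below recomputes them by hand for the "spam" class on eight invented labels; `per_class_metrics` is a hypothetical helper, not part of scikit-learn.

```python
# Invented true and predicted labels for eight emails.
y_true = ["spam", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam", "ham", "ham"]

def per_class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, f1, support) treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn

precision, recall, f1, support = per_class_metrics(y_true, y_pred, positive="spam")
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, support, accuracy)
```

Here two of the three predicted spam emails are truly spam (precision 2/3), two of the three actual spam emails are caught (recall 2/3), and six of eight predictions are correct (accuracy 0.75).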

---

In [14]:
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [100, 200, 300], 'max_depth': [None, 10, 20, 30]}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)

{'max_depth': None, 'n_estimators': 100}


### Hyperparameter Tuning with Grid Search

In this step, we use **GridSearchCV** to perform an exhaustive search over specified hyperparameters for the Random Forest Classifier, optimizing the model's performance.

- **`param_grid`**: A dictionary defining the hyperparameters to be tested:
  - **`n_estimators`**: Number of trees in the forest (100, 200, or 300).
  - **`max_depth`**: The maximum depth of each tree (None, 10, 20, or 30). `None` allows nodes to expand until all leaves are pure or contain fewer than the minimum number of samples required to split.

- **`GridSearchCV()`**: 
  - Initializes a grid search with the Random Forest Classifier.
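  - **`cv=5`**: Performs 5-fold cross-validation: the training set is split into 5 folds, and each candidate is trained on 4 folds and scored on the remaining one, rotating until every fold has served once as the validation fold. The candidate's score is the mean over the 5 folds.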

- **`fit(X_train, y_train)`**: Trains multiple models with different combinations of the specified hyperparameters on the training data.

- **`grid_search.best_params_`**: Outputs the best combination of hyperparameters after the search.

This method helps in finding the most effective set of parameters to improve model performance.
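To get a feel for the cost of the search, the sketch below enumerates the grid and counts the model fits using only the standard library. `k_fold_indices` is a hypothetical, simplified splitter with contiguous, unshuffled folds; for classifiers, `GridSearchCV` with an integer `cv` uses stratified folds instead, so treat this as an illustration of the idea only.

```python
from itertools import product

param_grid = {"n_estimators": [100, 200, 300], "max_depth": [None, 10, 20, 30]}
cv = 5

# Every combination of the listed values is one candidate model.
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(combinations) * cv  # each candidate is fitted once per fold

def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop

folds = list(k_fold_indices(10, cv))
print(len(combinations), n_fits)  # 12 60
```

With the default `refit=True`, the search then fits the best candidate one more time on the whole training set, so this grid trains 61 forests in total.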

---

In [15]:
optimized_model = RandomForestClassifier(max_depth=None, n_estimators=100)

optimized_model.fit(X_train, y_train)

y_pred_optimized = optimized_model.predict(X_test)

print(classification_report(y_test, y_pred_optimized))
print("Accuracy:", accuracy_score(y_test, y_pred_optimized))

              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       742
        spam       0.96      0.98      0.97       293

    accuracy                           0.98      1035
   macro avg       0.98      0.98      0.98      1035
weighted avg       0.98      0.98      0.98      1035

Accuracy: 0.9835748792270531


### Training and Evaluating the Optimized Model (Accuracy: 98.36%)

After identifying the optimal hyperparameters through Grid Search, we train a new Random Forest model with the best parameters and evaluate its performance.

- **`RandomForestClassifier(max_depth=None, n_estimators=100)`**: 
  - Initializes the Random Forest Classifier with the best parameters found from the Grid Search.
  - **`max_depth=None`**: Allows the trees to expand fully.
  - **`n_estimators=100`**: Specifies the number of trees in the forest as 100.

- **`fit(X_train, y_train)`**: Trains the optimized model on the training data.

- **`y_pred_optimized = optimized_model.predict(X_test)`**: Predicts the labels for the test set using the optimized model.

- **`classification_report(y_test, y_pred_optimized)`**: Displays precision, recall, F1-score, and support for each class (spam or not spam).

- **`accuracy_score(y_test, y_pred_optimized)`**: Calculates the overall accuracy of the optimized model.

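This step is meant to show whether hyperparameter tuning improved on the default configuration. Here the selected parameters (`max_depth=None`, `n_estimators=100`) are the same as the defaults of `RandomForestClassifier` in current scikit-learn releases, so the optimized model has the same configuration as the first one. The small difference in test accuracy (0.9797 vs. 0.9836) therefore most likely reflects run-to-run randomness, since neither model sets `random_state`, rather than a gain from tuning.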
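When comparing two forests, it helps to know how much the score moves from the random seed alone. The sketch below uses synthetic data from `make_classification` (not the email dataset) and a hypothetical helper, `accuracy_for_seed`, to train the same configuration under several seeds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real notebook would use the TF-IDF matrix and labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def accuracy_for_seed(seed):
    """Train one forest with a fixed seed and return its test accuracy."""
    model = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=seed)
    return model.fit(X_train, y_train).score(X_test, y_test)

# The same seed reproduces the same forest; different seeds may give different scores.
same_seed = (accuracy_for_seed(0), accuracy_for_seed(0))
different_seeds = [accuracy_for_seed(seed) for seed in range(5)]
print(same_seed, different_seeds)
```

If the spread across seeds is as large as the gap between two models, that gap is not evidence that one configuration is better. Passing `random_state` to both models, or reusing `grid_search.best_estimator_`, makes the comparison repeatable.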

---