
<h1>Email Spam Classification Using Naive Bayes</h1>

# 1. Introduction


In this project, we will build a classification model that identifies whether an email is spam. The dataset consists of two columns: **message_content** (the email body) and **is_spam** (a binary label, 1 for spam and 0 for not spam). We will use text processing techniques and the **Naive Bayes** algorithm for the classification task.



# 2. Step-by-Step Workflow


We will follow these steps to complete the project:

1. Import necessary libraries.
2. Load the dataset.
3. Preprocess the text data (lowercasing, tokenization, removing non-alphabetic tokens and stopwords).
4. Split the data into training and testing sets.
5. Convert the text into numerical features using TF-IDF vectorization.
6. Build the Naive Bayes classification model and evaluate it using accuracy and classification metrics.
7. Predict whether new email messages are spam or not.

In [None]:
# Step 1: Importing the necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [None]:
# Downloading NLTK resources
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# 3. Loading and Exploring the Dataset

Here we will load the dataset, inspect its structure, and understand its composition. Ensure your dataset contains the **message_content** and **is_spam** columns.

In [None]:
# Step 2: Loading the dataset
df = pd.read_csv('spam_dataset.csv')

# Display the first few rows of the dataset to understand its structure
df.head()


Unnamed: 0,message_content,is_spam
0,"Hello Lonnie,\n\nJust wanted to touch base reg...",0
1,"Congratulations, you've won a prize! Call us n...",1
2,You have been pre-approved for a credit card w...,1
3,"Limited time offer, act now! Only a few spots ...",1
4,Your loan has been approved! Transfer funds to...,1
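
Before preprocessing, it can help to check how the two labels are distributed, since accuracy is only informative when the classes are reasonably balanced. The following is a minimal sketch using only the standard library. The inline rows and the helper name `label_distribution` are hypothetical stand-ins, not the actual contents of `spam_dataset.csv`.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the same two-column layout as the dataset
SAMPLE_CSV = """message_content,is_spam
"Hello, just checking in about the meeting.",0
"Congratulations, you've won a prize!",1
"Your loan has been approved!",1
"Lunch tomorrow?",0
"Limited time offer, act now!",1
"""

def label_distribution(csv_text):
    """Return the share of each is_spam label as {label: fraction}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    counts = Counter(int(row["is_spam"]) for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}

distribution = label_distribution(SAMPLE_CSV)
print(distribution)
```

On the loaded DataFrame, `df['is_spam'].value_counts(normalize=True)` should give the same kind of summary.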


# 4. Text Preprocessing
Raw text data needs to be processed before feeding it into a machine learning model. We'll apply the following preprocessing steps:

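1. Lowercasing: Convert all text to lowercase for uniformity.
2. Tokenization: Split text into individual words (tokens).
3. Non-alphabetic token removal: Drop tokens that are not purely alphabetic, such as punctuation and numbers.
4. Stopwords removal: Remove common words like "the" and "is" that carry little information for telling spam from legitimate mail.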

In [None]:
# Step 3: Preprocessing the text
# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Lowercase the text and split it into tokens
    tokens = word_tokenize(text.lower())

    # Keep only purely alphabetic tokens (drops punctuation and numbers)
    tokens = [word for word in tokens if word.isalpha()]

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Join tokens back to form the preprocessed text
    return " ".join(tokens)

# Apply the preprocessing function to the message_content column
df['message_content'] = df['message_content'].apply(preprocess_text)

# Display the first few preprocessed messages
df.head()


Unnamed: 0,message_content,is_spam
0,hello lonnie wanted touch base regarding proje...,0
1,congratulations prize call us claim account se...,1
2,credit card high limit special offer available...,1
3,limited time offer act spots left immediate ac...,1
4,loan approved transfer funds today hurry oppor...,1
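
To see what each preprocessing step removes, the sketch below traces a single message through the pipeline without any NLTK downloads. It is an approximation under stated assumptions: a regular expression stands in for `word_tokenize`, and `STOP_WORDS` is a tiny hypothetical set rather than NLTK's English list, so results on other inputs may differ from the cell above.

```python
import re

# Hypothetical, tiny stopword set; NLTK's English list is much longer
STOP_WORDS = {"you", "have", "a", "the", "is", "now"}

def trace_preprocessing(text):
    """Return the intermediate result of each preprocessing step."""
    lowered = text.lower()
    # Rough stand-in for word_tokenize: words, or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", lowered)
    # Keep purely alphabetic tokens (drops punctuation and numbers)
    alphabetic = [t for t in tokens if t.isalpha()]
    # Drop stopwords
    kept = [t for t in alphabetic if t not in STOP_WORDS]
    return {"tokens": tokens, "alphabetic": alphabetic, "cleaned": " ".join(kept)}

steps = trace_preprocessing("You have WON a prize! Call us now: 555-0100.")
for name, value in steps.items():
    print(f"{name}: {value}")
```

The phone number survives tokenization but is removed by the alphabetic filter, which is worth keeping in mind if digits turn out to be a useful spam signal.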


# 5. Splitting the Dataset
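We'll split the dataset into **training** and **testing** sets, holding out 20% of the messages for testing (`test_size=0.2`) and fixing `random_state=42` so the split is reproducible. The training set will be used to train the model, while the testing set will be used to evaluate its performance.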

In [None]:
# Step 4: Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(df['message_content'], df['is_spam'], test_size=0.2, random_state=42)

# Check the shape of the training and testing sets
print(f"Training set size: {X_train.shape}")
print(f"Testing set size: {X_test.shape}")


Training set size: (800,)
Testing set size: (200,)
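
A purely random split does not guarantee that both sets keep the same spam ratio. In scikit-learn, passing `stratify=df['is_spam']` to `train_test_split` requests a stratified split. The pure-Python sketch below only illustrates the idea; the helper `stratified_split` and the label list are hypothetical, and it is not scikit-learn's implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split row indices so each class sends test_size of its rows to the test set."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 15 legitimate emails (0) and 5 spam emails (1)
labels = [0] * 15 + [1] * 5
train_idx, test_idx = stratified_split(labels)
print("Test labels:", [labels[i] for i in test_idx])
```

With these labels, the test set receives three legitimate messages and one spam message, matching the 3:1 ratio of the full list.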


# 6. TF-IDF Vectorization
Text data cannot be directly fed into machine learning algorithms. We use TF-IDF (Term Frequency-Inverse Document Frequency) to convert text data into numerical features that can be used by the classifier.

In [None]:
# Step 5: Applying TF-IDF Vectorization
vectorizer = TfidfVectorizer()

# Fit the vocabulary and IDF weights on the training data only, then apply them to the test data
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Check the size of the TF-IDF matrices
print(f"TF-IDF training matrix shape: {X_train_tfidf.shape}")
print(f"TF-IDF testing matrix shape: {X_test_tfidf.shape}")


TF-IDF training matrix shape: (800, 1571)
TF-IDF testing matrix shape: (200, 1571)
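
To make the weighting concrete, here is a small worked sketch of TF-IDF in plain Python. It assumes the formula that scikit-learn documents for its defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation per document) and splits on whitespace instead of using the vectorizer's tokenizer. The function name `tfidf` and the three documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(documents):
    """Compute L2-normalised TF-IDF weights for whitespace-tokenised documents."""
    n_docs = len(documents)
    tokenised = [doc.split() for doc in documents]
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Hypothetical, already preprocessed messages
docs = ["prize call claim", "loan approved call", "project meeting"]
rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, "->", {term: round(w, 3) for term, w in row.items()})
```

In the first document, "call" receives a lower weight than "prize" because it also appears in the second document, which is the effect the inverse document frequency is meant to produce.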


# 7. Building the Naive Bayes Classifier
Naive Bayes is a simple yet effective classification algorithm for text data, particularly for tasks like spam detection. We will train the model using the training data and then evaluate its performance on the test set.
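
The decision rule behind the classifier can be sketched in a few lines: each class is scored by its log prior plus the sum of log word likelihoods, with additive (Laplace) smoothing so unseen word and class combinations do not yield zero probability. This is an illustrative sketch under assumptions, not the scikit-learn implementation: the class `TinyMultinomialNB` and the training messages are hypothetical, and it works on raw word counts rather than TF-IDF weights.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes over whitespace-tokenised text."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive smoothing strength

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(doc.split())
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, document):
        n_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_label in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocabulary)
            score = math.log(n_label / n_docs)  # log prior
            for word in document.split():
                if word in self.vocabulary:  # words unseen in training are ignored
                    count = self.word_counts[label][word]
                    score += math.log((count + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, document):
        scores = self.log_scores(document)
        return max(scores, key=scores.get)

# Hypothetical preprocessed training messages (1 = spam, 0 = not spam)
train_docs = ["prize call claim", "loan approved transfer funds",
              "project meeting tomorrow", "lunch meeting project notes"]
train_labels = [1, 1, 0, 0]

toy_model = TinyMultinomialNB().fit(train_docs, train_labels)
print(toy_model.predict("claim prize today"))
print(toy_model.predict("project meeting notes"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.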

In [None]:
# Step 6: Training the Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

# Predicting the test set results
y_pred = model.predict(X_test_tfidf)

# Evaluating the model's performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 1.0
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        99
           1       1.00      1.00      1.00       101

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200
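
The columns of the classification report all derive from counts of true positives, false positives, and false negatives. The sketch below computes them for the spam class by hand; the helper `binary_metrics` and the two label lists are hypothetical and unrelated to the results above.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: one spam message is missed, nothing is falsely flagged
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 1, 0, 0, 0, 0, 0, 1]

metrics = binary_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

Here precision is perfect because no legitimate message is flagged, while recall drops to 0.75 because one of the four spam messages slips through.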



In [None]:
import joblib

# Save the trained model in .joblib format
joblib.dump(model, 'spam_classifier.joblib')

# Save the fitted TF-IDF vectorizer
joblib.dump(vectorizer, 'tfidf_vectorizer.joblib')


['tfidf_vectorizer.joblib']

# 8. Predicting New Emails
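Once the model is trained, we can use it to predict whether new emails are spam or not. New text must go through the same `preprocess_text` function and the already fitted vectorizer (`transform`, not `fit_transform`) so that its features line up with those seen during training.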

In [None]:
# Step 7: Predicting on new email content
new_email = ["End of Summer Sale: 50% OFF Lifetime Access!"]
new_email_processed = vectorizer.transform([preprocess_text(new_email[0])])

# Making the prediction
prediction = model.predict(new_email_processed)

print("Prediction (0 = Not Spam, 1 = Spam):", prediction)


Prediction (0 = Not Spam, 1 = Spam): [1]


# 9. Conclusion
In this project, we built an email spam classifier using Naive Bayes. We cleaned the text, converted it into TF-IDF features, and trained the model on a labeled dataset. The model reached an accuracy of 1.0 on the 200-message test set, and we can use it to classify unseen emails as spam or not.

Further improvements could be made by tuning the model (for example, the smoothing parameter `alpha` of `MultinomialNB`), trying other algorithms such as Logistic Regression or SVM, or using more advanced techniques like neural networks.