<a href="https://colab.research.google.com/github/keerthana6126/FMML_Projects_and_labs/blob/main/Spam_Email_Detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This project applies machine learning to text classification using the well-known SMS Spam dataset. The goal is to classify each message as either spam or ham (non-spam).

🔹 Key Steps:

Created a clean dataset (200 rows, balanced ham/spam) for demonstration and training.

Applied TF-IDF vectorization to transform SMS text into numerical features.

Implemented two models:

Logistic Regression → Baseline linear classifier

Random Forest → Ensemble of decision trees, compared against the linear baseline

Evaluated using precision, recall, F1-score, ROC-AUC, and confusion matrix visualization.

Saved trained models (.pkl files) for future inference and deployment.
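
The saving step listed above could be sketched roughly as follows. This is only an illustration: it assumes `joblib` (a dependency of scikit-learn), trains on a handful of made-up messages rather than the project's data, and writes to a hypothetical file name `spam_lr.pkl` in the temp directory. The vectorizer is stored together with the classifier because the model can only score text that was transformed with the same vocabulary.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up messages standing in for the real dataset (illustrative only)
texts = [
    "Win a free prize now, click here",
    "Are we still meeting for lunch today?",
    "Urgent! Claim your cash reward today",
    "Can you send me the report by evening?",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(texts)
model = LogisticRegression(max_iter=1000).fit(X, labels)

# Hypothetical output path; the original project's file names are not known
path = os.path.join(tempfile.gettempdir(), "spam_lr.pkl")

# Persist the vectorizer and the model together in one file
joblib.dump({"vectorizer": vec, "model": model}, path)

# Reload and predict, as an inference script would
bundle = joblib.load(path)
restored_pred = bundle["model"].predict(bundle["vectorizer"].transform(texts))
original_pred = model.predict(X)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.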

🔹 Outcome:
The project demonstrates proficiency in NLP preprocessing, model comparison, and spam detection. It can be extended to larger datasets or deployed as a simple web app/API for real-world use.
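
One possible starting point for the web app or API extension mentioned above is to wrap the vectorizer and classifier in a single scikit-learn `Pipeline`, so that a service only has to pass in raw text. The sketch below is an assumption about how that could look, not the project's code: the training messages are made up, and `spam_pipeline` and `classify` are hypothetical names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training messages (illustrative only)
train_texts = [
    "Win a free prize now, click here",
    "Congratulations, you won a free holiday",
    "Are we still meeting for lunch today?",
    "I will call you after the meeting",
]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# The pipeline applies TF-IDF and then the classifier in one call
spam_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
spam_pipeline.fit(train_texts, train_labels)


def classify(message: str) -> str:
    """Return 'spam' or 'ham' for a single raw message."""
    predicted = spam_pipeline.predict([message])[0]
    return "spam" if predicted == 1 else "ham"


result = classify("Claim your free prize today")
```

A web framework handler would then only need to call `classify` on the request text and return the label.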

In [4]:
# 📌 SMS Spam Classification with Logistic Regression & Random Forest

# Imports
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from google.colab import files

# 🔹 Step 1: Upload the CSV file
uploaded = files.upload()   # Choose the SMS spam CSV from your system (here: SMSSpamCollection_large.csv)

# 🔹 Step 2: Load dataset (expects a 'label' column with 'spam'/'ham' and a 'text' column)
df = pd.read_csv(next(iter(uploaded)), encoding="latin-1")

# Convert label to numeric (1 = spam, 0 = ham)
df['y'] = (df['label'] == 'spam').astype(int)

# Features and target
X = df['text']
y = df['y']

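# TF-IDF Vectorizer
# Note: this fits on all rows before the split, so the vocabulary and IDF weights
# also reflect the test messages; fitting on the training rows only is stricter.
vec = TfidfVectorizer(max_features=5000, stop_words='english')
X_vec = vec.fit_transform(X)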

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.2, random_state=42)

# 🔹 Logistic Regression baseline
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

print("=== Logistic Regression ===")
print(classification_report(y_test, y_pred_lr))
print(confusion_matrix(y_test, y_pred_lr))

# 🔹 Random Forest Classifier
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("\n=== Random Forest ===")
print(classification_report(y_test, y_pred_rf))
print(confusion_matrix(y_test, y_pred_rf))


Saving SMSSpamCollection_large.csv to SMSSpamCollection_large.csv
=== Logistic Regression ===
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        22
           1       1.00      1.00      1.00        18

    accuracy                           1.00        40
   macro avg       1.00      1.00      1.00        40
weighted avg       1.00      1.00      1.00        40

[[22  0]
 [ 0 18]]

=== Random Forest ===
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        22
           1       1.00      1.00      1.00        18

    accuracy                           1.00        40
   macro avg       1.00      1.00      1.00        40
weighted avg       1.00      1.00      1.00        40

[[22  0]
 [ 0 18]]
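
The reports above cover precision, recall, and F1. ROC-AUC additionally needs predicted probabilities rather than hard labels, so a minimal sketch of that computation follows. It runs on a few made-up messages (not the project's data), and it splits the raw text first and fits the TF-IDF vectorizer on the training part only, so that no test-set vocabulary reaches the model. The `stratify` argument is an added assumption that keeps both classes present in each split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Made-up messages (illustrative only): 6 spam, 6 ham
texts = [
    "Win a free prize now, click here",
    "Urgent! Claim your cash reward today",
    "Congratulations, you won a free holiday",
    "Free entry to win cash, text now",
    "Claim your free ringtone, reply now",
    "You have won a prize, call now to claim",
    "Are we still meeting for lunch today?",
    "Can you send me the report by evening?",
    "I will call you after the meeting",
    "See you at home tonight",
    "Thanks for the help with the report",
    "Let me know when you reach home",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# Split the raw text first so the vectorizer never sees test messages
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels
)

vec = TfidfVectorizer(stop_words="english")
X_train_vec = vec.fit_transform(X_train)  # fit on training text only
X_test_vec = vec.transform(X_test)        # reuse the fitted vocabulary

model = LogisticRegression(max_iter=1000).fit(X_train_vec, y_train)

# Column 1 of predict_proba is the probability of class 1 (spam)
spam_scores = model.predict_proba(X_test_vec)[:, 1]
auc = roc_auc_score(y_test, spam_scores)

y_pred = model.predict(X_test_vec)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

print(f"ROC-AUC: {auc:.3f}")
print(cm)
```

For a plotted matrix, `ConfusionMatrixDisplay.from_predictions(y_test, y_pred)` from `sklearn.metrics` draws the same counts with matplotlib.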
