<a href="https://colab.research.google.com/github/noobhacker02/CBT-CIP/blob/main/Project_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 📧 Task 4: Spam Email Detection App - CipherByte Internship

As part of my internship at **CipherByte Technologies**, I developed a machine learning-powered web application to detect spam emails using **Naive Bayes Classification**.

This interactive tool lets users enter an email message and get an instant prediction, **Spam** or **Ham**, from the trained model.

---

## 💡 Overview

The project focuses on binary text classification using natural language processing techniques. It combines:
- Text preprocessing
- Feature extraction (TF-IDF)
- Naive Bayes classification
- Web deployment with **Gradio**

---

## 🧠 Model Details

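- 📘 **Dataset**: Labeled Excel workbook (`Spam Email Detection.xlsx`, sheet `spam`), with the label in column `v1` and the message text in column `v2`
- 🧹 **Preprocessing**:
  - Lowercasing
  - Removing URLs that start with `http`, replacing digit runs with a space, and stripping ASCII punctuation
- 🧪 **Vectorization**: `TfidfVectorizer` with unigrams and bigrams (`ngram_range=(1, 2)`), English stop-word filtering, and `max_features=5000`
- 📊 **Model**: Multinomial Naive Bayes (`MultinomialNB`) with `alpha=0.1`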
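
The preprocessing steps listed above can be sketched with the standard library alone. This is a minimal illustration rather than the exact notebook function: the name `clean_text`, the sample string, and the final whitespace-collapsing step are assumptions added here for readability.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Lowercase, strip URLs, blank out digits, and drop punctuation."""
    text = str(text).lower()
    text = re.sub(r"http\S+", "", text)   # remove http/https URLs
    text = re.sub(r"\d+", " ", text)      # replace digit runs with a space
    text = text.translate(_PUNCT_TABLE)   # delete ASCII punctuation
    # Assumption (not in the notebook): collapse repeated whitespace.
    return " ".join(text.split())


# Hypothetical sample message, used only to show the effect of each step.
sample = "WIN $1000 now!!! Visit http://example.com/prize today."
print(clean_text(sample))  # prints: win now visit today
```

Note that the URL pattern only catches links beginning with `http`; addresses written as `www.example.com` would pass through and merely lose their dots.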

---

## 🧾 Metrics

- 🔍 **Accuracy**: `~97%` on the held-out test split (20% of the data, `test_size=0.2`, `random_state=42`)
- 📑 **Classification Report**: Includes precision, recall, and F1-score for both spam and ham
- 🧱 **Confusion Matrix**: Helps visualize model performance on test data
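
To make these metrics concrete, here is a small sketch that computes them with scikit-learn on made-up labels. The `y_true` and `y_pred` lists are hypothetical and are not results from this project; they only show how the numbers relate to one another.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels for ten test messages (0 = ham, 1 = spam).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order [ham, spam].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Precision and recall for the spam class, derived from the matrix.
spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)

print(f"accuracy={acc:.2f} precision={spam_precision:.2f} recall={spam_recall:.2f}")
print(classification_report(y_true, y_pred, target_names=["ham", "spam"]))
```

For a spam filter, the false-positive count (`fp`, ham flagged as spam) is often worth watching separately from accuracy, since a legitimate email landing in the spam folder is usually the costlier mistake.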

---

## 🚀 Web Interface with Gradio

- 🔤 **Input**: Email message (free text)
- 📮 **Output**: "Spam" or "Ham"
- 🌐 **Built with**: Gradio for quick browser-based interaction

---

## 🧪 Sample Usage

```plaintext
Input: "Congratulations! You've won a $1000 gift card. Click here to claim."
Output: Spam

Input: "Hi John, let’s reschedule our meeting for tomorrow."
Output: Ham
```
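
The same kind of call can be reproduced offline with a toy model. The sketch below is an assumption-laden illustration: the six training messages are invented, and `make_pipeline` is used here as a convenience in place of the notebook's separate vectorizer and model objects. Predictions from such a tiny corpus say nothing about the real model's accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; the real model is trained on the Excel dataset.
messages = [
    "win a free gift card now",
    "claim your free prize click here",
    "congratulations you won cash",
    "let us reschedule our meeting for tomorrow",
    "can you send me the project report",
    "lunch at noon with the team",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

# TF-IDF features feeding a Multinomial Naive Bayes classifier.
spam_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    MultinomialNB(alpha=0.1),
)
spam_pipeline.fit(messages, labels)


def classify(message):
    """Return "Spam" or "Ham" for a single raw message."""
    prediction = spam_pipeline.predict([message])[0]
    return "Spam" if prediction == 1 else "Ham"


for text in [
    "Click here to claim your free gift card",
    "Hi John, the meeting is tomorrow",
]:
    print(text, "->", classify(text))
```

Wrapping both steps in one pipeline means the vectorizer fitted on the training data is always the one applied at prediction time, which avoids accidentally refitting TF-IDF on new input.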

---

## ⚙️ Tech Stack

| Tool/Library        | Purpose                       |
|---------------------|-------------------------------|
| `pandas`            | Data loading & manipulation   |
| `re` & `string`     | Text cleaning & normalization |
| `scikit-learn`      | Model training & evaluation   |
| `Gradio`            | UI for real-time interaction  |
| Excel (`.xlsx`)     | Dataset source format         |

---





## 👨‍💻 Developed By

**Talha Shaikh**  
🔗 [LinkedIn](https://www.linkedin.com/in/talha-s-145729339/)  
📌 Project for **#CipherByteTech** Internship

---

> “Spam doesn’t stand a chance when machine learning is in the inbox.”

```python
!pip install gradio

import pandas as pd
import re
import string
import gradio as gr
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset
def load_and_preprocess_data():
    df = pd.read_excel('Spam Email Detection.xlsx', sheet_name='spam')
    df = df[['v1', 'v2']].copy()  # Select relevant columns; copy to avoid SettingWithCopyWarning
    df.columns = ['label', 'message']  # Rename columns
    df = df.dropna(subset=['message'])  # Remove rows with missing messages
    return df

# Preprocess the text
def preprocess_text(text):
    # Ensure the text is a string
    text = str(text).lower()
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\d+', ' ', text)     # Replace numbers with space
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    return text

# Convert labels to binary format
def convert_labels(df):
    df['label'] = df['label'].map({'ham': 0, 'spam': 1})
    return df

# Train Naive Bayes model
def train_model(df):
    df['cleaned_message'] = df['message'].apply(preprocess_text)
    df = convert_labels(df)

    X = df['cleaned_message']
    y = df['label']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    tfidf = TfidfVectorizer(max_features=5000, stop_words='english', ngram_range=(1, 2))
    X_train_tfidf = tfidf.fit_transform(X_train)
    X_test_tfidf = tfidf.transform(X_test)

    model = MultinomialNB(alpha=0.1)  # Adjust hyperparameter alpha
    model.fit(X_train_tfidf, y_train)

    y_pred = model.predict(X_test_tfidf)
    accuracy = accuracy_score(y_test, y_pred)
    classification_rep = classification_report(y_test, y_pred)
    confusion_mat = confusion_matrix(y_test, y_pred)

    return model, tfidf, accuracy, classification_rep, confusion_mat

# Predict whether the email is spam or ham
def predict_spam_or_ham(model, tfidf, message):
    cleaned_message = preprocess_text(message)
    message_tfidf = tfidf.transform([cleaned_message])
    prediction = model.predict(message_tfidf)
    return "Spam" if prediction[0] == 1 else "Ham"

# Load data and train model
df = load_and_preprocess_data()
model, tfidf, accuracy, classification_rep, confusion_mat = train_model(df)

# Gradio interface
def gradio_interface(message):
    result = predict_spam_or_ham(model, tfidf, message)
    return result

# Create Gradio interface
iface = gr.Interface(
    fn=gradio_interface,
    inputs="text",
    outputs="text",
    title="Spam Email Detection - CipherByte Internship",
    description=(
        "Enter an email message to detect if it is spam or ham. "
        "Developed by Talha Shaikh | "
        "[LinkedIn](https://www.linkedin.com/in/talha-s-145729339/) | "
        "#cipherbytetech"
    ),
)

# Launch the Gradio interface
iface.launch()
```

```plaintext
It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://b654c25b19b2c2d9de.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
```


