<a href="https://colab.research.google.com/github/itsayushi0/CODTECH/blob/main/Task_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📧 Spam Email Detection - Machine Learning Model

## 📌 Project Overview
This project is **Task 4** of the **CODTECH Python Internship**, where the goal is to **build a predictive machine learning model** with **Scikit-learn** that classifies or predicts outcomes from a dataset.  
We implemented a **spam detection** system that classifies messages as either **Spam** or **Ham (Not Spam)**. Although the project is titled "Spam Email Detection", the model is trained and evaluated on SMS messages.

---

## 🛠️ Technologies Used
- **Python 3**
- **Google Colab** (Jupyter Notebook environment)
- **Libraries:**
  - `pandas` - Data handling
  - `numpy` - Numerical operations
  - `scikit-learn` - Machine learning model building
  - `matplotlib` & `seaborn` - Data visualization

---

## 📂 Dataset
We used the **SMS Spam Collection** dataset:
- **Source:** [sms.tsv in the pycon-2016-tutorial repository](https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv)
- **Description:** A tab-separated file with no header row. Each row holds a label (`ham` or `spam`) followed by the message text.
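
To make the file layout concrete, here is a minimal sketch that parses two rows in the same `label<TAB>message` shape using only the standard library. The two sample rows are hypothetical and are not taken from the dataset; the notebook itself loads the real file with `pd.read_csv`.

```python
import csv
import io

# Hypothetical rows in the same layout as sms.tsv: label<TAB>message, no header
sample_tsv = (
    "ham\tAre we still meeting at 5?\n"
    "spam\tWIN a free prize now! Reply YES\n"
)

# QUOTE_NONE treats any quote characters inside a message as ordinary text
reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = [(label, message) for label, message in reader]

for label, message in rows:
    print(f"{label:>4} | {message}")
```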

---

## 🔍 Workflow
1. **Load Dataset** from URL.
2. **Explore Data** (shape, class distribution, sample messages).
3. **Preprocess Data:**
   - Convert labels (`ham` → 0, `spam` → 1)
   - Split into training and test sets
4. **Feature Extraction:**
   - Convert text into numeric vectors using **CountVectorizer** (Bag-of-Words)
5. **Model Training:**
   - Use **Multinomial Naive Bayes** classifier
6. **Model Evaluation:**
   - Accuracy Score
   - Precision, Recall, F1-score
   - Confusion Matrix visualization
7. **Result Interpretation & Conclusion**
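
Step 4 is the least obvious part of the workflow, so the sketch below shows roughly what a Bag-of-Words transform does, in plain Python. It is an illustration, not scikit-learn's implementation: the helper names (`tokenize`, `fit_vocabulary`, `transform`) and the sample messages are assumptions made for this example. The token rule mirrors `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

# Same token rule as CountVectorizer's default: words of 2+ characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(messages):
    """Map each distinct training token to a column index (alphabetical order)."""
    tokens = sorted({tok for msg in messages for tok in tokenize(msg)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(messages, vocabulary):
    """Turn each message into per-token counts; tokens not in the vocabulary are ignored."""
    vectors = []
    for msg in messages:
        counts = Counter(tokenize(msg))
        vectors.append([counts.get(tok, 0) for tok in vocabulary])
    return vectors


# Hypothetical messages, not taken from the dataset
train_messages = ["Free prize, claim your free prize now", "See you at lunch"]
vocab = fit_vocabulary(train_messages)          # learned from training data only
train_vectors = transform(train_messages, vocab)
test_vectors = transform(["free lunch today"], vocab)  # "today" is unseen, so it is dropped
```

The vocabulary is built from the training messages only and then reused for the test messages, which is why the notebook calls `fit_transform` on the training set and plain `transform` on the test set.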

---

## 📊 Results
- **Accuracy:** ~98%
- **Precision (Spam):** High, meaning few ham messages are wrongly flagged as spam.
- **Recall (Spam):** High, meaning most spam messages are caught.
- **Confusion Matrix:** Shows few misclassifications in either class.
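
For reference, the sketch below shows how these metrics follow from the four cells of a binary confusion matrix. The function name `spam_metrics` and the counts passed to it are hypothetical and chosen only for illustration; they are not this project's results.

```python
def spam_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 for the spam (positive) class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Precision: of everything flagged as spam, how much really was spam
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all real spam, how much was caught
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only
metrics = spam_metrics(tn=87, fp=1, fn=3, tp=9)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With labels encoded as 0 (ham) and 1 (spam), the four counts can be read from scikit-learn's matrix with `tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()`.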

---

```python
# Step 1: Install necessary libraries (notebook shell command; already available on Colab)
!pip install scikit-learn pandas numpy matplotlib seaborn

# Step 2: Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Step 3: Load dataset (tab-separated, no header row)
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_csv(url, sep='\t', names=['label', 'message'])

# Step 4: Data overview
print(df.head())
print("\nDataset Shape:", df.shape)
print("\nClass Distribution:\n", df['label'].value_counts())

# Step 5: Convert labels to binary (ham -> 0, spam -> 1)
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

# Step 6: Split into train and test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], df['label'], test_size=0.2, random_state=42)

# Step 7: Text vectorization (fit the vocabulary on training data only)
vectorizer = CountVectorizer()
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)

# Step 8: Train Naive Bayes model
model = MultinomialNB()
model.fit(X_train_vect, y_train)

# Step 9: Predictions
y_pred = model.predict(X_test_vect)

# Step 10: Evaluation
print("\nAccuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Confusion matrix (rows = actual class, columns = predicted class)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
```
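
To give some intuition for what the classifier in Step 8 does, here is a rough pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, which corresponds to `MultinomialNB`'s default `alpha=1.0`. It is a simplified illustration rather than scikit-learn's implementation: the function names `train_nb` and `predict_nb`, the whitespace tokenization, and the toy messages are all assumptions made for this example.

```python
import math
from collections import Counter


def train_nb(messages, labels, alpha=1.0):
    """Estimate log class priors and smoothed log word likelihoods per class."""
    vocab = {word for msg in messages for word in msg.lower().split()}
    word_counts = {c: Counter() for c in set(labels)}
    for msg, label in zip(messages, labels):
        word_counts[label].update(msg.lower().split())

    # Prior: fraction of training messages in each class
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in word_counts}

    # Likelihood: smoothed share of each word among all words seen in the class
    log_likelihood = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood


def predict_nb(message, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in message.lower().split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Toy training data (hypothetical): 1 = spam, 0 = ham
toy_messages = ["win free prize now", "free cash prize", "lunch at noon", "see you at lunch"]
toy_labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_nb(toy_messages, toy_labels)
print(predict_nb("free prize", log_prior, log_likelihood))
print(predict_nb("lunch at noon today", log_prior, log_likelihood))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.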
