# 🔥 Real-World Spam Detection with Naive Bayes (SMS Spam)
Build a **spam detector** using the classic **SMS Spam Collection** dataset (ham vs spam).

You will:
1) Download & load the dataset
2) Split into train and test sets
3) Text → numbers (TF-IDF)
4) Train **Multinomial Naive Bayes**
5) Evaluate (accuracy + precision/recall/F1 + confusion matrix)
6) Inspect top spam/ham words
7) Try your own messages


## 1) Download & load the dataset

In [None]:
import io, zipfile, requests
import pandas as pd

url = 'https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip'
r = requests.get(url, timeout=60)
r.raise_for_status()

z = zipfile.ZipFile(io.BytesIO(r.content))
print('Files in zip:', z.namelist())

with z.open('SMSSpamCollection') as f:
    raw = f.read().decode('utf-8', errors='replace')

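# Split on '\n' only: str.splitlines() also breaks on other Unicode line
# boundaries, which could cut a message in two if one appears inside it.
rows = [line.split('\t', 1) for line in raw.split('\n') if '\t' in line]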
df = pd.DataFrame(rows, columns=['label', 'text'])
df['label'] = df['label'].str.strip()
df['text'] = df['text'].str.strip()

df.head()

In [None]:
print('Dataset size:', len(df))
print(df['label'].value_counts())

## 2) Train/test split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=42, stratify=df['label']
)
print('Train:', len(X_train), 'Test:', len(X_test))

## 3) Text → numbers (TF-IDF)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words='english',
    ngram_range=(1, 2),
    min_df=2
)

X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

print('Vectorized shape:', X_train_vec.shape)

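To make the "text → numbers" step more concrete, here is a rough pure-Python sketch of what TF-IDF weighting computes on a tiny made-up corpus. It assumes the smoothed IDF that scikit-learn documents as its default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It ignores stop words, bigrams, and `min_df`, so the numbers will not match the vectorizer above. `tfidf_rows` and `toy_docs` are hypothetical names used only for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (assumed formula, see the lead-in): rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # guard against empty docs
        rows.append([w / norm for w in raw])
    return vocab, rows

# Made-up messages, not taken from the dataset.
toy_docs = ["free prize call now", "call me now", "free free prize"]
vocab, rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, rows):
    weights = {t: round(w, 3) for t, w in zip(vocab, row) if w > 0}
    print(doc, '->', weights)
```

In the second toy message, `me` appears in only one document, so it ends up with a larger weight than `call` or `now`, which each appear in two.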
## 4) Train Naive Bayes (MultinomialNB)

In [None]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=0.5)  # alpha = additive (Laplace/Lidstone) smoothing strength
nb.fit(X_train_vec, y_train)
print('Model trained!')

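As a rough illustration of what the `alpha` smoothing parameter does, the sketch below estimates a multinomial Naive Bayes model by hand on four made-up messages. It assumes the standard additive-smoothing estimate `(count + alpha) / (total + alpha * vocabulary_size)` and works on raw word counts rather than TF-IDF weights, so it is a simplification of the cell above, not a reimplementation of it. `train_nb`, `predict_nb`, and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=0.5):
    """Estimate log priors and smoothed per-class word log-likelihoods."""
    vocab = sorted({tok for d in docs for tok in d.lower().split()})
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        denom = total + alpha * len(vocab)
        # alpha > 0 keeps words unseen in a class from getting probability zero.
        log_like[c] = {t: math.log((word_counts[c][t] + alpha) / denom) for t in vocab}
    return log_prior, log_like

def predict_nb(msg, log_prior, log_like):
    """Return (best_label, per-class log scores); out-of-vocabulary words are skipped."""
    scores = {}
    for c in log_prior:
        known = [t for t in msg.lower().split() if t in log_like[c]]
        scores[c] = log_prior[c] + sum(log_like[c][t] for t in known)
    return max(scores, key=scores.get), scores

# Made-up training messages, not taken from the dataset.
toy_docs = ["win free prize now", "free cash prize",
            "see you at dinner", "are you free for dinner"]
toy_labels = ["spam", "spam", "ham", "ham"]

log_prior, log_like = train_nb(toy_docs, toy_labels, alpha=0.5)
label, scores = predict_nb("free prize", log_prior, log_like)
print(label, scores)
```

Because `prize` never appears in the toy ham messages, its ham likelihood comes entirely from `alpha`. A smaller `alpha` makes such unseen words count more strongly against a class.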
## 5) Evaluate

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_pred = nb.predict(X_test_vec)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('\nConfusion Matrix (rows=true, cols=pred):')
print(confusion_matrix(y_test, y_pred, labels=['ham','spam']))

print('\nReport:')
print(classification_report(y_test, y_pred))

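The report above lists precision, recall, and F1. As a small sketch of how those numbers follow from the confusion-matrix counts, the function below computes them by hand for `spam` as the positive class. The labels are made up for illustration and are not results from this dataset. `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Compute confusion counts and accuracy/precision/recall/F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn, "accuracy": accuracy,
            "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 8 ham, 2 spam, with one spam message missed.
toy_true = ["ham"] * 8 + ["spam"] * 2
toy_pred = ["ham"] * 8 + ["spam", "ham"]

metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

In this toy case accuracy is 0.9 while spam recall is only 0.5. That gap is why the evaluation reports precision and recall alongside accuracy when one class is much rarer than the other.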
## 6) Inspect top spam/ham words (what the model learned)

In [None]:
import numpy as np

feature_names = np.array(vectorizer.get_feature_names_out())
class_index = {c:i for i,c in enumerate(nb.classes_)}
spam_i = class_index['spam']
ham_i  = class_index['ham']

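# Per-feature log-ratio: log P(term | spam) - log P(term | ham).
# Higher = stronger spam indicator; lower (more negative) = stronger ham indicator.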
scores = nb.feature_log_prob_[spam_i] - nb.feature_log_prob_[ham_i]

top_spam = feature_names[np.argsort(scores)[-20:]][::-1]
top_ham  = feature_names[np.argsort(scores)[:20]]

print('Top indicators of SPAM:')
print(', '.join(top_spam))
print('\nTop indicators of HAM:')
print(', '.join(top_ham))

## 7) Try your own messages

In [None]:
def predict_message(msg: str):
    """Return the predicted label and a {class: probability} dict for one message."""
    vec = vectorizer.transform([msg])
    pred = nb.predict(vec)[0]
    proba = nb.predict_proba(vec)[0]
    return pred, {cls: float(p) for cls, p in zip(nb.classes_, proba)}

examples = [
    'WIN a free vacation now!!! Click here to claim',
    'Hey are we still on for dinner at 7?',
    'URGENT! You have won a 1000 cash prize. Call now',
]

for m in examples:
    pred, proba = predict_message(m)
    print('Message:', m)
    print('Prediction:', pred)
    print('Probabilities:', proba)
    print('-'*80)

## 8) Next upgrades (optional)
- Compare TF-IDF vs CountVectorizer
- Tune `alpha`
- Try Logistic Regression / Linear SVM
- Add text cleaning (URLs, phone numbers, etc.)
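
For the text-cleaning idea in the list above, one possible starting point is to replace URLs, phone-like digit runs, and remaining numbers with placeholder tokens, so the model sees "some URL" rather than thousands of distinct strings. The regular expressions and the token names (`urltoken`, `phonetoken`, `numtoken`) below are illustrative assumptions and deliberately loose. They are not tuned to this dataset.

```python
import re

# Loose, illustrative patterns (assumptions, not a complete URL or phone grammar).
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")  # 8+ characters of digits/separators
NUMBER_RE = re.compile(r"\d+")

def clean_text(text: str) -> str:
    """Replace URLs, phone-like numbers, and other numbers with placeholder tokens."""
    text = URL_RE.sub(" urltoken ", text)
    text = PHONE_RE.sub(" phonetoken ", text)   # run before NUMBER_RE so phones stay whole
    text = NUMBER_RE.sub(" numtoken ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

cleaned = clean_text("URGENT! Call 0800 123 4567 or visit www.example.com to claim 1000 now")
print(cleaned)
```

A function like this could be passed to `TfidfVectorizer` through its `preprocessor` parameter. That parameter replaces the built-in lowercasing step, which is why `clean_text` lowercases the text itself.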
