# Spam vs. Ham Classifier using scikit-learn
**Procedure/Pipeline**
<ol><li>Install and import the necessary libraries.
<li>Load the SMS Spam Collection dataset.
<li>Preprocess it (encode the labels).
<li>Split into training and test sets.
<li>Convert text to TF-IDF vectors.
<li>Train Naive Bayes, SVM, Random Forest, and other classifiers.
<li>Evaluate accuracy and precision/recall/F1.
<li>Document your results and recommend the best model.</ol>

In [None]:
# Step 1: Install dependencies (if not already installed)
# pip install pandas scikit-learn requests

import pandas as pd  # pandas (aliased as pd) loads, views, and manipulates tabular data such as spreadsheet rows and columns.
from sklearn.model_selection import train_test_split  # Splits the data into training and test sets so you can measure how well a model generalizes.
from sklearn.feature_extraction.text import TfidfVectorizer  # Converts raw text (sentences) into numeric feature vectors using TF-IDF weighting.
from sklearn.naive_bayes import MultinomialNB  # Multinomial Naive Bayes: a simple, fast classifier that works well for text classification.
from sklearn.metrics import classification_report, accuracy_score  # accuracy_score: fraction of correct predictions; classification_report: precision/recall/F1 per class.


In [None]:
# Step 2: Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"

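# Download and extract the dataset
import requests
import zipfile
import io

response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early if the download failed.
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    z.extractall()  # Extracts the archive (including SMSSpamCollection) into the current directory.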

# Read the dataset
# sep='\t': the file uses a tab (\t) between fields.
# header=None: the file has no header row, so pandas should not treat the first line as column names.
# names=["label", "message"]: names the two columns label (spam/ham) and message (text of the SMS).

df = pd.read_csv("SMSSpamCollection", sep='\t', header=None, names=["label", "message"])

print("Dataset sample:")
print(df.head()) # Prints the first 5 rows by default.

Dataset sample:
  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
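
The download-and-read step can also be done without writing files to disk. The following is a minimal sketch, assuming a tiny in-memory archive as a stand-in for `response.content`; the two sample rows and the names `fake_download` and `sample_df` are made up for illustration.

```python
import io
import zipfile

import pandas as pd

# Build a tiny zip archive in memory; it stands in for response.content.
sample = "ham\tSee you at lunch\nspam\tWin a free prize now\n"
fake_download = io.BytesIO()
with zipfile.ZipFile(fake_download, "w") as z:
    z.writestr("SMSSpamCollection", sample)

# Open the archive member directly and let pandas parse the tab-separated rows.
with zipfile.ZipFile(io.BytesIO(fake_download.getvalue())) as z:
    with z.open("SMSSpamCollection") as f:
        sample_df = pd.read_csv(f, sep="\t", header=None, names=["label", "message"])

print(sample_df)
```

Reading straight from the archive avoids leaving extracted files in the working directory. Whether that matters depends on how the notebook is run.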


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    5572 non-null   object
 1   message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [None]:
df['label'].value_counts()

label  count
ham     4825
spam     747
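
The counts above show that the classes are imbalanced. As a small worked example using only those two numbers, the arithmetic below computes the accuracy of a trivial "always predict ham" baseline, which is worth keeping in mind when reading accuracy figures later.

```python
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

spam_share = spam_count / total        # fraction of messages that are spam
baseline_accuracy = ham_count / total  # accuracy of always predicting ham

print(f"Total messages: {total}")
print(f"Spam share: {spam_share:.3f}")
print(f"Always-ham baseline accuracy: {baseline_accuracy:.3f}")
```

Because a model that never flags spam already scores about 0.87 on accuracy, precision and recall for the spam class are the more informative metrics here.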


In [None]:
# Step 3: Encode labels as numbers: "ham" → 0, "spam" → 1.
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

# Step 4: Train-test split. Holds out 20% of the messages (and their labels) for testing; random_state=42 makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], df['label'], test_size=0.2, random_state=42
)
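
The split above is purely random. With imbalanced classes, one optional variation is a stratified split, which keeps the class ratio the same in both parts. The sketch below uses made-up toy data (`toy`, 80 ham and 20 spam); note that adding `stratify` to the notebook's own split would change the results reported later.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 ham (0) and 20 spam (1), mimicking an imbalanced corpus.
toy = pd.DataFrame({
    "message": [f"message {i}" for i in range(100)],
    "label": [0] * 80 + [1] * 20,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["message"], toy["label"],
    test_size=0.2, random_state=42,
    stratify=toy["label"],  # keep the 80/20 class ratio in both splits
)

print("train:", y_tr.value_counts().to_dict())
print("test: ", y_te.value_counts().to_dict())
```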

**vectorizer = TfidfVectorizer(stop_words='english')**<br>
Creates a TfidfVectorizer instance and stores it in vectorizer.<br>
TF-IDF stands for Term Frequency – Inverse Document Frequency: it scores each word by how important it is to a message relative to the whole corpus. Words that appear in nearly every message get a lower weight.<br>
stop_words='english' tells the vectorizer to remove common English stop words (like "the", "is", "and") before building features.
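
To make the weighting concrete, here is a hand-rolled sketch of TF-IDF on three made-up, already-tokenized messages. It assumes the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, and the L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults; tokenization and stop-word removal are left out, and `tfidf_vector` is a hypothetical helper, not a library function.

```python
import math

docs = [
    ["free", "prize", "call"],
    ["call", "me", "later"],
    ["free", "free", "entry"],
]

def tfidf_vector(doc, corpus):
    """Return {term: weight} for one tokenized document, L2-normalized."""
    n_docs = len(corpus)
    weights = {}
    for term in set(doc):
        tf = doc.count(term)                          # raw term count in this document
        df = sum(term in d for d in corpus)           # number of documents containing the term
        idf = math.log((1 + n_docs) / (1 + df)) + 1   # smoothed inverse document frequency
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first_doc_weights = tfidf_vector(docs[0], docs)
for term, weight in sorted(first_doc_weights.items()):
    print(f"{term}: {weight:.3f}")
```

In the first message, "prize" appears in only one document, so it ends up with a higher weight than "free" and "call", which each appear in two.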

**fit_transform does two things:**<br>
<ul><li>fit: builds the vocabulary and IDF statistics from the training text (learns which words exist and how common they are), and
<li>transform: converts each training message into a numeric vector (a row of a sparse matrix) using those learned weights.</ul>
The result X_train_tfidf is a numeric matrix that machine learning models can use.

**X_test_tfidf = vectorizer.transform(X_test)**

Converts the test messages to numeric vectors using the same vocabulary and IDF learned from the training data.<br>
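Important: do not call fit_transform on the test data. Refitting would learn a different vocabulary, so the columns would no longer match the features the model was trained on, and fitting the vectorizer on test messages at any point leaks test-set information into the evaluation.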
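
The difference between fitting and transforming is easy to see on a toy example. In this sketch the messages and the names `demo_vectorizer`, `train_matrix`, and `test_matrix` are made up for illustration, and the vectorizer uses its default settings (no stop-word removal).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_msgs = ["win a free prize", "lunch at noon", "free entry today"]
test_msgs = ["free lunch", "zebra crossing"]  # "zebra" and "crossing" never occur in training

demo_vectorizer = TfidfVectorizer()
train_matrix = demo_vectorizer.fit_transform(train_msgs)  # learns vocabulary and IDF
test_matrix = demo_vectorizer.transform(test_msgs)        # reuses them unchanged

print(sorted(demo_vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
print(test_matrix.toarray().round(2))
```

Both matrices have the same number of columns, and words that were never seen during fitting are simply ignored, so the second test message becomes an all-zero row.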

In [None]:
# Step 5: Vectorize text (TF-IDF)
vectorizer = TfidfVectorizer(stop_words='english')

X_train_tfidf = vectorizer.fit_transform(X_train)

X_test_tfidf = vectorizer.transform(X_test)

**clf = MultinomialNB()**

Creates an instance of the Multinomial Naive Bayes classifier and stores it in clf. This model is simple and commonly used for text classification. It is designed for discrete features such as word counts, but in practice it also works with fractional TF-IDF weights.

**clf.fit(X_train_tfidf, y_train)**

Trains (fits) the classifier on the training features and labels. The model learns how TF-IDF patterns map to spam or ham.
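
As a rough illustration of what fitting and predicting look like end to end, the sketch below trains the same kind of model on four made-up messages. The data and the `toy_*` names are assumptions for the example only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

toy_messages = [
    "win free prize now",
    "free cash prize",
    "see you at lunch",
    "are you home now",
]
toy_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

toy_vectorizer = TfidfVectorizer()
toy_clf = MultinomialNB()  # default smoothing, alpha=1.0
toy_clf.fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

new_messages = ["free prize", "see you"]
new_features = toy_vectorizer.transform(new_messages)
toy_predictions = toy_clf.predict(new_features)
toy_probabilities = toy_clf.predict_proba(new_features)  # columns follow toy_clf.classes_

print(toy_predictions)
print(toy_probabilities.round(2))
```

Words seen only in spam training messages push a new message toward label 1, and words seen only in ham push it toward label 0.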

In [None]:
# Step 6: Train a Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)

**y_pred = clf.predict(X_test_tfidf)**

Uses the trained model to predict labels for the test set.<br>
y_pred contains the predicted 0/1 values.

**print("\nClassification Report:\n", classification_report(y_test, y_pred))**

Prints a detailed report with these metrics for each class (described here for the spam class, label 1):
<ul>
<li>Precision: of the messages predicted as spam, how many were actually spam?
<li>Recall: of all actual spam messages, how many did the model find?
<li>F1-score: harmonic mean of precision and recall (a balanced single-number summary).
<li>Support: number of true instances of the class in the test set.</ul>
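
The definitions above can be checked by hand on a small example. The labels below are made up (they are not the notebook's predictions), and the metrics are computed for the spam class only.

```python
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 1 = spam, 0 = ham
y_hat  = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]  # hypothetical model output

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))  # spam correctly flagged
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))  # ham wrongly flagged as spam
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))  # spam that was missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
support = sum(t == 1 for t in y_true)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
```

Here three of the four spam messages are caught and one ham message is wrongly flagged, so precision and recall are both 3/4.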

In [None]:
# Step 7: Evaluate
y_pred = clf.predict(X_test_tfidf)
print("\nAccuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.97847533632287

Classification Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       1.00      0.84      0.91       149

    accuracy                           0.98      1115
   macro avg       0.99      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115
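
The procedure at the top also calls for trying SVM, Random Forest, and other classifiers, while the cells above train only Naive Bayes. One possible way to run that comparison is sketched below. The helper `compare_models`, the choice of `LinearSVC` and `RandomForestClassifier` with these settings, and the toy messages are all assumptions for illustration; no results for the real dataset are implied.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def compare_models(X_train, y_train, X_test, y_test):
    """Fit each candidate and return {name: spam-class F1 on the test set}."""
    candidates = {
        "naive_bayes": MultinomialNB(),
        "linear_svm": LinearSVC(),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    }
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = f1_score(y_test, model.predict(X_test))
    return scores

# Toy stand-in data so the sketch runs on its own.
train_text = ["win free prize now", "free cash prize", "claim your free entry",
              "see you at lunch", "are you home now", "call me after lunch"]
train_y = [1, 1, 1, 0, 0, 0]
test_text = ["free prize entry", "lunch at home"]
test_y = [1, 0]

cmp_vectorizer = TfidfVectorizer()
scores = compare_models(cmp_vectorizer.fit_transform(train_text), train_y,
                        cmp_vectorizer.transform(test_text), test_y)
for name, score in scores.items():
    print(f"{name}: F1 = {score:.2f}")
```

In the notebook itself, the same helper could be called with X_train_tfidf, y_train, X_test_tfidf, and y_test to fill in a results table before recommending a model.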

