# **This notebook compares CountVectorizer vs. TF-IDF features for spam detection using Multinomial Naive Bayes.**





# **Spam classification using CountVectorizer features and Naive Bayes**

Step 1 – Load and inspect data
Read the sms.tsv file and run basic checks (first rows, class counts, missing values).

Step 2 – Pre-processing

Lowercase

Remove punctuation

Label-encode

Step 3 – Train/Test split
Stratified 80/20 split with a fixed random seed (random_state=42).

Step 4 – Count Vectorization + Model Training

Step 5 – Results & Comparison
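
The pre-processing in Step 2 can be sketched on a single string before applying it to the whole column. This is a minimal illustration using only the standard library; `clean_message` and the sample text are hypothetical names introduced here, and the regular expression is the same pattern the notebook passes to `str.replace`.

```python
import re


def clean_message(text: str) -> str:
    """Lowercase the text and replace punctuation with spaces."""
    # [^\w\s] matches anything that is neither a word character nor whitespace.
    # Note: the underscore counts as a word character, so it is kept.
    return re.sub(r"[^\w\s]", " ", text.lower())


# Hypothetical sample message, not taken from the dataset
sample = "WINNER!! Claim your FREE prize now."
cleaned = clean_message(sample)
print(cleaned.split())
```

Replacing punctuation with a space (rather than deleting it) keeps words such as "prize.now" from being glued together, at the cost of some extra whitespace that the vectorizer ignores anyway.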


In [None]:
# Load the data
import pandas as pd
url="http://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_csv(url, sep='\t', header=None, names=['label', 'messages'])
# Basic checks; print() is used because only the last expression of a cell is displayed
print(df.head())
df.info()
print(df.describe())
print(df['label'].value_counts())
print(df.isna().sum())
# Label encoding (LabelEncoder sorts classes alphabetically: ham -> 0, spam -> 1)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['label_num'] = le.fit_transform(df['label'])
print(df[['label', 'label_num']].head())
# Lowercase the messages and replace punctuation with spaces
df['messages'] = df['messages'].str.lower().str.replace(r'[^\w\s]', ' ', regex=True)
# Train/test split
from sklearn.model_selection import train_test_split
# Features and target
x = df['messages']
y = df['label_num']
# stratify=y keeps the ham/spam ratio the same in both splits
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)
print(len(x_train), len(y_train), len(x_test), len(y_test))

# Count vectorization (unigrams and bigrams, English stop words removed)
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english',ngram_range=(1,2))
x_train_cv = cv.fit_transform(x_train)
x_test_cv  = cv.transform(x_test)

print(x_train_cv.shape, x_test_cv.shape)
#Train a classifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
model=MultinomialNB()
model.fit(x_train_cv,y_train)
#predictions
y_pred=model.predict(x_test_cv)
#evaluation
print("Accuracy :" , accuracy_score(y_test,y_pred))
print("\nClassification Report \n ",classification_report(y_test,y_pred))
print("\nConfusion Matrix \n",confusion_matrix(y_test,y_pred))

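## ***🟢 Bottom Line – CountVectorizer Model***

* The detector is **very accurate** on the held-out 20% test set (see the accuracy printed by the cell above).
* Only a handful of normal (ham) messages were falsely flagged as spam (false positives, FP).
* It caught almost all spam, missing only a small number of spam messages (false negatives, FN).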
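
The FP and FN counts mentioned above can be read directly from the confusion matrix. The sketch below uses made-up toy labels (not the notebook's results) and assumes the encoding produced by `LabelEncoder`, where ham is 0 and spam is 1.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (0 = ham, 1 = spam); not the notebook's results
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels [0, 1], ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel())

precision = tp / (tp + fp)  # share of flagged messages that really are spam
recall = tp / (tp + fn)     # share of spam messages that were caught

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In the notebook itself, the same unpacking would apply to `confusion_matrix(y_test, y_pred)`.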


# **Spam classification using TF-IDF features and Naive Bayes. Goal: compare with the previous CountVectorizer model.**

Step 1 – Load and inspect data
Reuse my earlier code for reading the sms.tsv file and basic checks.

Step 2 – Pre-processing

Lowercase

Remove punctuation

Label-encode

Step 3 – Train/Test split
Same as before.

Step 4 – TF-IDF Vectorization + Model Training


Step 5 – Results & Comparison


In [None]:
# Load the data
import pandas as pd
url="http://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_csv(url, sep='\t', header=None, names=['label', 'messages'])
# Basic checks; print() is used because only the last expression of a cell is displayed
print(df.head())
df.info()
print(df.describe())
print(df['label'].value_counts())
print(df.isna().sum())
# Label encoding (LabelEncoder sorts classes alphabetically: ham -> 0, spam -> 1)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['label_num'] = le.fit_transform(df['label'])
print(df[['label', 'label_num']].head())
# Lowercase the messages and replace punctuation with spaces
df['messages'] = df['messages'].str.lower().str.replace(r'[^\w\s]', ' ', regex=True)
# Train/test split
from sklearn.model_selection import train_test_split
# Features and target
x = df['messages']
y = df['label_num']
# Same split parameters as the CountVectorizer section, so both models see identical data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)
print(len(x_train), len(y_train), len(x_test), len(y_test))
# TF-IDF vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF features: fit on the training split only, then transform both splits
tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
x_train_tfidf = tfidf.fit_transform(x_train)
x_test_tfidf  = tfidf.transform(x_test)

#Train a classifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
model=MultinomialNB()
model.fit(x_train_tfidf,y_train)
#predictions
y_pred=model.predict(x_test_tfidf)
#evaluation
print("Accuracy :" , accuracy_score(y_test,y_pred))
print("\nClassification Report \n ",classification_report(y_test,y_pred))
print("\nConfusion Matrix \n",confusion_matrix(y_test,y_pred))

## ***🟢 Bottom Line – TF-IDF Model***

The TF-IDF version is similarly high-performing, with accuracy on par with (or slightly higher than) the CountVectorizer model.

Because TF-IDF down-weights words that appear in many messages, it puts more weight on distinctive terms, which can lead to fewer false positives.
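
The down-weighting can be seen in the inverse document frequency (IDF) values that `TfidfVectorizer` learns. The sketch below fits the vectorizer on a hypothetical three-message corpus (not the SMS data) with default settings, under which scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1 for n documents and document frequency df(t).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "call" appears in every message, "prize" in only one
docs = [
    "call me now",
    "call me later",
    "call now to win a prize",
]

vec = TfidfVectorizer()
vec.fit(docs)

# Map each term to its learned IDF weight via the fitted vocabulary
idf = {term: float(vec.idf_[index]) for term, index in vec.vocabulary_.items()}

for term in ("call", "now", "prize"):
    print(f"{term:>6}: idf = {idf[term]:.3f}")
```

A term that occurs in every document gets the minimum IDF of 1, while rarer terms get larger weights, so distinctive words contribute more to each message's feature vector.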

Spam detection remains strong: almost every spam message is caught, with only a small number of false negatives.
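
To make the comparison concrete, both vectorizers can be run through the same pipeline and scored with the same metrics. The following is a minimal sketch, not the notebook's original code: `compare_vectorizers` is a hypothetical helper, and the tiny corpus exists only so the example runs on its own. The vectorizer settings mirror the ones used above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_vectorizers(x_train, y_train, x_test, y_test):
    """Fit MultinomialNB on count and TF-IDF features and return test metrics."""
    vectorizers = {
        "count": CountVectorizer(stop_words="english", ngram_range=(1, 2)),
        "tfidf": TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    }
    results = {}
    for name, vectorizer in vectorizers.items():
        # The pipeline fits the vectorizer on the training text only
        pipe = make_pipeline(vectorizer, MultinomialNB())
        pipe.fit(x_train, y_train)
        pred = pipe.predict(x_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, pred),
            # Precision and recall are computed for the spam class (label 1)
            "precision": precision_score(y_test, pred, zero_division=0),
            "recall": recall_score(y_test, pred, zero_division=0),
        }
    return results


# Hypothetical toy data (1 = spam, 0 = ham), only to show the call
train_texts = ["free prize win now", "claim free cash prize",
               "see you at lunch", "are you coming home"]
train_labels = [1, 1, 0, 0]
test_texts = ["win free cash", "lunch at home"]
test_labels = [1, 0]

toy_results = compare_vectorizers(train_texts, train_labels, test_texts, test_labels)
for name, metrics in toy_results.items():
    print(name, metrics)
```

In the notebook, the call would be `compare_vectorizers(x_train, y_train, x_test, y_test)`, which puts spam-class precision (fewer false positives) and recall (fewer false negatives) for both feature types side by side.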