# Spam Ham Detection using Naive Bayes by kuldeep sharma


Hi, I'm Kuldeep Sharma. I'm working on a project to detect spam and ham messages using Python.
It's basically an automated system that classifies incoming messages as either spam or ham based on their content.
It's a pretty interesting project, and I'm excited to see how it turns out.

# code

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# load the data into a pandas DataFrame
data = pd.read_csv('./SMSSpamCollection.tsv', sep='\t', header=None)
data.columns = ['label', 'text']

# preprocess the text by converting to lowercase, removing punctuation, and tokenizing
data['text'] = data['text'].str.lower()
data['text'] = data['text'].str.replace('[^\w\s]', '', regex=False)
data['text'] = data['text'].str.split()

# convert the text into a matrix of token counts
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(data['text'].apply(lambda x: ' '.join(x)))
y = np.array(data['label'])

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train the Naive Bayes classifier on the training set
clf = MultinomialNB()
clf.fit(X_train, y_train)

# evaluate the performance of the classifier on the testing set
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label='spam')
recall = recall_score(y_test, y_pred, pos_label='spam')
f1 = f1_score(y_test, y_pred, pos_label='spam')

print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 score: {f1:.2f}')

# tune the hyperparameters of the classifier using cross-validation
scores = cross_val_score(clf, X, y, cv=5)
print(f'Cross-validation scores: {scores}')
print(f'Average cross-validation score: {np.mean(scores):.2f}')


Accuracy: 0.98
Precision: 0.89
Recall: 0.95
F1 score: 0.92
Cross-validation scores: [0.98114901 0.98025135 0.97396768 0.97843666 0.98203055]
Average cross-validation score: 0.98
