## **Spam Detection**

This project detects spam emails: it takes the text of a message as input and predicts whether the message is spam or ham.  
I used the Multinomial Naive Bayes algorithm and achieved a weighted-average **F1 score of 0.99** on the held-out test set (0.95 on the spam class).
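
To make the approach concrete, here is a minimal sketch of the same idea on a handful of made-up messages: word counts from `CountVectorizer` feed a `MultinomialNB` classifier. The toy messages, labels, and the name `toy_model` are illustrative assumptions and are not part of `spam.csv`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages, not taken from spam.csv
messages = [
    "win a free prize now",
    "free entry win cash",
    "claim your free cash prize",
    "are we meeting for lunch today",
    "see you at home tonight",
    "call me when you get home",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

# Count word occurrences, then fit Naive Bayes on those counts
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(messages, labels)

# Classify two unseen messages
predictions = toy_model.predict(["free cash prize", "see you at lunch"])
print(predictions)
```

Words such as "free" and "cash" appear only in the toy spam messages, so the first unseen message should be scored as spam and the second as ham.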

In [1]:
import pandas as pd 
import numpy as np 
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report


In [2]:
Df = pd.read_csv(r"spam.csv")
Df

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


Preprocessing step

In [3]:
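# Encode the label as an integer: spam -> 1, ham -> 0
replace_map = {"spam": 1, "ham": 0}
Df['spam'] = Df['Category'].map(replace_map)
Df.drop(['Category'], axis=1, inplace=True)
# Note: drop_duplicates() returns a new DataFrame. The result is not assigned here,
# so duplicate rows are still present in the data used in the following cells.
Df.drop_duplicates()
Df.isna().sum()
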


Message    0
spam       0
dtype: int64
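
The `drop_duplicates()` call in the cell above does not modify `Df`, because pandas returns a new frame unless the result is assigned. A small sketch on a hypothetical toy frame (the names `toy` and `deduped` are illustrative) shows the difference:

```python
import pandas as pd

# Hypothetical toy frame with one duplicated row
toy = pd.DataFrame({
    "Message": ["hi", "hi", "win cash"],
    "spam": [0, 0, 1],
})

toy.drop_duplicates()  # result discarded: `toy` is unchanged
rows_before = len(toy)

# Assigning the result actually removes the duplicate row
deduped = toy.drop_duplicates().reset_index(drop=True)
rows_after = len(deduped)

print(rows_before, rows_after)
```

Applying the assigned form to `Df` would change the row count and therefore the scores reported in this notebook, so the numbers shown here correspond to the data with duplicates kept.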

In [4]:
X = Df['Message']
y = Df['spam']  # 1-D Series, the target shape scikit-learn expects

print(f"the messages data set is \n{X.head()}")
print(f"the target data set is \n{y.head()}")

the messages data set is 
0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: Message, dtype: object
the target data set is 
0    0
1    0
2    1
3    0
4    0
Name: spam, dtype: int64


In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=123)
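
The split above is not stratified, and ham messages heavily outnumber spam in this data. As a possible variation (not what this notebook ran, and it would change the reported scores), passing `stratify` keeps the class ratio the same in the train and test sets. The sketch below uses hypothetical toy data to show the effect:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 ham (0) and 20 spam (1)
toy_X = [f"message {i}" for i in range(100)]
toy_y = [0] * 80 + [1] * 20

# stratify=toy_y preserves the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=123, stratify=toy_y
)

test_counts = Counter(y_te)
print(test_counts)
```

With 20 test samples and an 80/20 class ratio, the stratified test set contains 16 ham and 4 spam messages.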

In [6]:
from sklearn.pipeline import Pipeline

# Bag-of-words counts feeding a Multinomial Naive Bayes classifier
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

param_grid = {
    'vectorizer__max_features': [5000, 10000, None],
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'vectorizer__stop_words': [None, 'english'],
    'vectorizer__min_df': [1, 2, 5],
    'vectorizer__max_df': [0.75, 0.85, 1.0],
    'vectorizer__binary': [False, True],
    'nb__alpha': [0.1, 0.5, 1.0],
    'nb__fit_prior': [True, False]
}
grid_search = GridSearchCV(
    classifier, 
    param_grid, 
    cv=5, 
    scoring='accuracy', 
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

# Evaluate the best model on the test set
y_pred = grid_search.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Best Parameters: {'nb__alpha': 0.5, 'nb__fit_prior': True, 'vectorizer__binary': False, 'vectorizer__max_df': 0.75, 'vectorizer__max_features': None, 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 2), 'vectorizer__stop_words': 'english'}
Best Score: 0.9874358935644436
Test Accuracy: 0.9865470852017937
Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99       962
           1       0.97      0.93      0.95       153

    accuracy                           0.99      1115
   macro avg       0.98      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115
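
The f1-score column in the report is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it from the rounded values printed above (the helper name `f1_from` is illustrative); because the inputs are already rounded, the result only approximates what scikit-learn computes from the raw counts.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision/recall values copied from the classification report
spam_f1 = f1_from(0.97, 0.93)
ham_f1 = f1_from(0.99, 0.99)

print(round(spam_f1, 2), round(ham_f1, 2))
```

The spam class comes out at about 0.95 and the ham class at 0.99, matching the report. The weighted average of 0.99 is dominated by the much larger ham class.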



