INTRODUCTION TO A SPAM DETECTOR MODEL USING MACHINE LEARNING

Introduction to the Spam Detector

The volume of online communication, particularly through email, messaging platforms, and social media, has led to a significant increase in unwanted and potentially harmful messages, commonly known as spam. A spam detector is a software system that automatically identifies and filters out such messages to protect users and improve their experience.

Spam detectors combine rule-based filtering with machine learning algorithms to analyze the content, sender information, and metadata of messages. Traditional methods relied on keyword matching and blacklists, while modern systems use Natural Language Processing (NLP) and classification algorithms (e.g., Naive Bayes, Support Vector Machines, and Neural Networks) to distinguish spam from legitimate messages. This notebook trains a Multinomial Naive Bayes classifier on word counts extracted from the message text.
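To make the contrast with the traditional approach concrete, here is a minimal sketch of a rule-based filter. The keyword list, the blacklisted address, and the function name `rule_based_is_spam` are hypothetical and chosen only for illustration; they are not part of this notebook's model.

```python
import re

# Hypothetical keyword list and sender blacklist, chosen only for illustration.
SPAM_KEYWORDS = {"free", "winner", "prize", "claim"}
BLACKLISTED_SENDERS = {"promo@example.com"}

def rule_based_is_spam(message, sender=""):
    """Flag a message if the sender is blacklisted or it contains a spam keyword."""
    if sender.lower() in BLACKLISTED_SENDERS:
        return True
    words = set(re.findall(r"[a-z']+", message.lower()))
    return bool(words & SPAM_KEYWORDS)

print(rule_based_is_spam("Claim your FREE prize today"))      # True
print(rule_based_is_spam("Are we still meeting for lunch?"))  # False
print(rule_based_is_spam("Fr33 pr1ze inside"))                # False
```

The last call shows the main weakness of fixed rules: a simple obfuscation such as "Fr33 pr1ze" slips past the keyword list, which is one reason learned classifiers are preferred.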

Effective spam detection not only improves user productivity and safety but also helps reduce exposure to phishing, scams, and malware. As cyber threats evolve, continuous training of spam detection models using real-world data becomes essential to maintaining their effectiveness.

Objectives

• Enhance user experience by reducing the volume of unwanted or irrelevant messages.
• Improve security by filtering out potentially malicious content, such as phishing attempts and malware.
• Increase efficiency in communication systems by ensuring that users receive only legitimate and relevant messages.
• Adapt to new patterns in spam content by continuously learning from new data and updating the model's predictions accordingly.
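The last objective, learning continuously from new data, can be sketched with incremental training. This is an assumption about how one might do it, not the approach used in the cells that follow: it swaps `CountVectorizer` for scikit-learn's stateless `HashingVectorizer` and updates `MultinomialNB` with `partial_fit`. The two batches are made-up messages standing in for data that arrives over time.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# HashingVectorizer is stateless, so new words never require refitting a vocabulary.
# alternate_sign=False keeps features non-negative, which MultinomialNB requires.
stream_vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
online_model = MultinomialNB()

# Toy batches standing in for messages that arrive over time (0 = ham, 1 = spam).
batch_1 = (["win a free prize now", "are we meeting for lunch"], [1, 0])
batch_2 = (["free entry claim your prize", "see you at home tonight"], [1, 0])

for messages, labels in (batch_1, batch_2):
    X_batch = stream_vectorizer.transform(messages)
    # classes must list every label on the first call; repeating it later is harmless.
    online_model.partial_fit(X_batch, labels, classes=[0, 1])

new_message = stream_vectorizer.transform(["claim your free prize"])
print(online_model.predict(new_message))
```

Each call to `partial_fit` updates the existing word statistics instead of retraining from scratch, so the model can absorb newly labelled spam as it is collected.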

In [14]:
# Import the libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

In [16]:
# Load the dataset, keeping only the label (v1) and message text (v2) columns
df = pd.read_csv("spam.csv", encoding='latin1')[['v1', 'v2']]
df.columns = ['label', 'message']

In [18]:
df

,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [20]:
# Encode the labels as integers: ham -> 0, spam -> 1
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

In [22]:
# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.2, random_state=42)

In [24]:
# Convert the messages to word-count vectors; the vocabulary is learned from the training set only
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

In [26]:
# Instantiate the Multinomial Naive Bayes model and fit it on the training vectors
model = MultinomialNB()
model.fit(X_train_vec, y_train)

In [28]:
# Predict on the test set, then report accuracy and per-class metrics
y_pred = model.predict(X_test_vec)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.9838565022421525
Classification Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99       965
           1       0.99      0.89      0.94       150

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115
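In the report above, the spam class (label 1) has a precision of 0.99 but a recall of 0.89, so roughly one spam message in ten is still classified as ham. A confusion matrix makes the underlying counts explicit. The sketch below uses made-up labels, not this notebook's predictions, to show how precision and recall follow from those counts.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (0 = ham, 1 = spam); not the notebook's results.
toy_true = [0, 0, 0, 0, 0, 1, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 1, 0, 1]

# For binary labels [0, 1], ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Passing the notebook's own `y_test` and `y_pred` to `confusion_matrix` in the same way would give the exact number of missed spam messages behind the 0.89 recall.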



In [32]:
# Define a helper that classifies a single message as spam or ham
def predict_spam(text):
    text_vec = vectorizer.transform([text])
    prediction = model.predict(text_vec)
    return "spam" if prediction[0] == 1 else "ham"

print(predict_spam("you have won 30gig of data! send 312 to claim."))

spam
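The helper above relies on the vectorizer and the model being kept in sync by hand. One possible alternative, shown here as a sketch and not as the notebook's method, is to bundle both steps in a scikit-learn pipeline and expose the predicted spam probability. The training messages, `spam_pipeline`, and `predict_with_confidence` are hypothetical names and data used only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data for illustration (0 = ham, 1 = spam)
train_messages = [
    "win a free prize now",
    "free entry claim your prize",
    "are we meeting for lunch",
    "see you at home tonight",
]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes and classifies in one call
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(train_messages, train_labels)

def predict_with_confidence(text):
    """Return the predicted label and the estimated spam probability."""
    # predict_proba columns follow classes_ ([0, 1]), so column 1 is spam
    spam_probability = spam_pipeline.predict_proba([text])[0][1]
    label = "spam" if spam_probability >= 0.5 else "ham"
    return label, spam_probability

label, probability = predict_with_confidence("claim your free prize")
print(label, round(probability, 3))
```

Returning the probability alongside the label lets a caller choose a stricter or looser threshold than 0.5, depending on whether missed spam or wrongly blocked ham is the more costly error.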


Conclusion

In conclusion, a spam detector model improves digital communication by automatically identifying and filtering out unwanted or harmful messages. In this notebook, a Multinomial Naive Bayes classifier trained on word counts reached about 98% accuracy on the 1,115 held-out messages, with a precision of 0.99 and a recall of 0.89 on the spam class. Very few legitimate messages were flagged, but roughly one spam message in ten was still classified as ham.

As spam tactics continue to evolve, the model must be updated and retrained regularly on new data to maintain its accuracy and reliability. With further refinement, such systems can be integrated into messaging platforms to filter spam at scale and in real time.