<a href="https://colab.research.google.com/github/mahadev-k-anil/python-tutorial/blob/main/spam_filter_naive_bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Spam filter using Multinomial Naive Bayes with scikit-learn

## Step 1: Import the necessary libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

## Step 2: Create a sample dataset
This example builds a small dataset inline: ten messages, five labeled spam and five labeled ham. A real application would load a much larger labeled dataset, for example from a CSV file.

In [2]:
data = {
    'message': [
        'Free entry in our weekly prize draw!',
        'Hey man, how about a game of golf tomorrow?',
        'WINNER! You have been selected to win a prize!',
        'Hi, are you free for a meeting today?',
        'Your free mobile phone just for being a loyal customer!',
        'Hey mate, can you call me on the weekend?',
        'CONGRATULATIONS! You have won a FREE holiday.',
        'Lunch tomorrow? Just checking in.',
        'Get a free prize now! No purchase necessary.',
        'Can you send me the report by end of day?',
    ],
    'label': [
        'spam',
        'ham',
        'spam',
        'ham',
        'spam',
        'ham',
        'spam',
        'ham',
        'spam',
        'ham',
    ]
}
df = pd.DataFrame(data)

print(df)

                                             message label
0               Free entry in our weekly prize draw!  spam
1        Hey man, how about a game of golf tomorrow?   ham
2     WINNER! You have been selected to win a prize!  spam
3              Hi, are you free for a meeting today?   ham
4  Your free mobile phone just for being a loyal ...  spam
5          Hey mate, can you call me on the weekend?   ham
6      CONGRATULATIONS! You have won a FREE holiday.  spam
7                  Lunch tomorrow? Just checking in.   ham
8       Get a free prize now! No purchase necessary.  spam
9          Can you send me the report by end of day?   ham
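
As a rough sketch of the CSV route mentioned above, the cell below reads labeled messages with pandas. The column names `label` and `message` and the in-memory CSV text are assumptions made for illustration; a real file will have its own path, layout, and possibly a different encoding.

```python
import io

import pandas as pd

# Stand-in for a file on disk. In practice you would pass a path to
# pd.read_csv instead, e.g. pd.read_csv("spam.csv") (hypothetical file name).
csv_text = io.StringIO(
    "label,message\n"
    "spam,Free entry in our weekly prize draw!\n"
    'ham,"Hey man, how about a game of golf tomorrow?"\n'
)

csv_df = pd.read_csv(csv_text)

# Basic hygiene before training: drop rows with missing text and exact duplicates.
csv_df = csv_df.dropna(subset=["message"]).drop_duplicates().reset_index(drop=True)

print(csv_df.shape)
print(csv_df["label"].value_counts())
```

Messages that contain commas must be quoted in the file, as in the second row here, or they will be split across columns.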


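## Step 3: Split the data into training and test sets
This step separates the dataset into two parts: one for training the model and one for evaluating its performance. With test_size=0.3, three of the ten messages are held out for testing and seven are used for training; random_state=42 makes the split reproducible.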

In [3]:
X = df['message']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")


Training set size: 7
Test set size: 3
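
On a dataset this small, a purely random split can leave one class under-represented in the test set. One common option, shown here as a sketch rather than as part of the original workflow, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The placeholder messages and the `test_size=0.4` value are assumptions chosen so that both classes divide evenly.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder data: ten messages with alternating labels (five spam, five ham).
messages = [f"message {i}" for i in range(10)]
labels = ["spam", "ham"] * 5

# stratify=labels keeps the spam/ham ratio identical in the train and test parts.
m_train, m_test, l_train, l_test = train_test_split(
    messages, labels, test_size=0.4, random_state=42, stratify=labels
)

print("Train:", Counter(l_train))
print("Test: ", Counter(l_test))
```

Here the test set receives two messages of each class and the training set three of each.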


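## Step 4: Create a feature vector using CountVectorizer
Machine learning models cannot process raw text directly. CountVectorizer converts a collection of text documents into a matrix of token counts, a representation known as bag-of-words. By default it lowercases the text and drops single-character tokens, which is why 'a' does not appear in the vocabulary. The vocabulary is learned from the training messages only, so words that occur only in test messages are ignored.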

In [4]:
vectorizer = CountVectorizer()

# Fit the vectorizer on the training data and transform the training text to vectors
X_train_vec = vectorizer.fit_transform(X_train)

# Transform the test text to vectors using the *same* vectorizer
X_test_vec = vectorizer.transform(X_test)

# Display the feature names for a better understanding
print("Feature names (vocabulary):", vectorizer.get_feature_names_out())


Feature names (vocabulary): ['are' 'been' 'being' 'by' 'can' 'checking' 'congratulations' 'customer'
 'day' 'draw' 'end' 'entry' 'for' 'free' 'have' 'hi' 'holiday' 'in' 'just'
 'loyal' 'lunch' 'me' 'meeting' 'mobile' 'of' 'our' 'phone' 'prize'
 'report' 'selected' 'send' 'the' 'to' 'today' 'tomorrow' 'weekly' 'win'
 'winner' 'won' 'you' 'your']


## Step 5: Train the Multinomial Naive Bayes classifier
The MultinomialNB model is trained on the vectorized training data and its corresponding labels.

In [5]:
model = MultinomialNB()
model.fit(X_train_vec, y_train)
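
For readers curious about what `fit` learns, the sketch below recomputes the per-class word probabilities by hand on a tiny made-up corpus and compares them with the model's `feature_log_prob_` attribute. It assumes the default smoothing `alpha=1.0` (Laplace smoothing); the texts and variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["free prize now", "free free holiday", "lunch tomorrow", "meeting tomorrow"]
tags = np.array(["spam", "spam", "ham", "ham"])

vec = CountVectorizer()
dense = vec.fit_transform(texts).toarray()

alpha = 1.0  # additive (Laplace) smoothing; this is also the scikit-learn default
nb = MultinomialNB(alpha=alpha).fit(dense, tags)

# For each class c and word w:
#   P(w | c) = (count of w in class c + alpha) / (all word counts in c + alpha * vocabulary size)
manual = []
for cls in nb.classes_:
    word_counts = dense[tags == cls].sum(axis=0)
    smoothed = (word_counts + alpha) / (word_counts.sum() + alpha * dense.shape[1])
    manual.append(np.log(smoothed))
manual = np.array(manual)

print(nb.classes_)                                # ['ham' 'spam']
print(np.allclose(manual, nb.feature_log_prob_))  # True
```

Smoothing matters because a word that never appears in one class would otherwise get probability zero and veto that class outright.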


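## Step 6: Make predictions and evaluate the model
Next, evaluate the model's performance on the unseen test data. The classification_report provides a summary of the precision, recall, and F1-score for each class. With only three test messages, the perfect scores shown here say little about how the model would perform on real mail.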

In [6]:
predictions = model.predict(X_test_vec)

print("Predictions:", predictions)
print("Actual labels:", y_test.values)

print("\nClassification Report:")
print(classification_report(y_test, predictions))


Predictions: ['spam' 'ham' 'ham']
Actual labels: ['spam' 'ham' 'ham']

Classification Report:
              precision    recall  f1-score   support

         ham       1.00      1.00      1.00         2
        spam       1.00      1.00      1.00         1

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3
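
Because every prediction above is correct, the report does not show how precision and recall differ. The sketch below uses invented labels, not results from this model, to show a confusion matrix with one false positive and one false negative, treating spam as the positive class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels, for illustration only.
y_true = ["spam", "ham", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "spam", "ham", "ham", "spam"]

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
print(cm)
# [[2 1]
#  [1 2]]

# Precision: of the messages flagged as spam, how many really were spam (2 of 3).
precision = precision_score(y_true, y_pred, pos_label="spam")
# Recall: of the real spam messages, how many were caught (2 of 3).
recall = recall_score(y_true, y_pred, pos_label="spam")
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.67 recall=0.67
```

For a spam filter, a false positive (a legitimate message sent to the spam folder) is usually the costlier mistake, so precision on the spam class deserves particular attention.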



## Step 7: Test with new messages
Use the trained model to classify a new, unseen message.

In [7]:
new_message = ['You have won a free lottery prize! Call now to claim it.']
new_message_vec = vectorizer.transform(new_message)
prediction = model.predict(new_message_vec)

print(f"The message '{new_message[0]}' is classified as: {prediction[0]}")


The message 'You have won a free lottery prize! Call now to claim it.' is classified as: spam
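
Vectorizing and predicting are always done together, so one possible refinement is to bundle both steps with scikit-learn's `make_pipeline`. This is a sketch of that alternative, not the approach used in the cells above; the four training messages are a subset of the sample data, and the names `spam_pipeline` and `incoming` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "Free entry in our weekly prize draw!",
    "Get a free prize now! No purchase necessary.",
    "Lunch tomorrow? Just checking in.",
    "Can you send me the report by end of day?",
]
train_labels = ["spam", "spam", "ham", "ham"]

# The pipeline vectorizes raw text and feeds the counts to the classifier.
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(train_texts, train_labels)

incoming = ["Claim your free prize now"]
label = spam_pipeline.predict(incoming)[0]

# predict_proba returns one probability per class, ordered as in classes_.
probs = spam_pipeline.predict_proba(incoming)[0]
print(label, dict(zip(spam_pipeline.classes_, probs.round(3))))
```

Passing raw strings directly to the pipeline removes the risk of forgetting to call `transform` on new messages, or of fitting a second vectorizer by mistake.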
