# Spam detection system using Python, the Natural Language Toolkit (NLTK), and the scikit-learn library

## Here's a summary of the spam detection model:

1. Loading the Data: The model starts by importing the SMS Spam Collection dataset, which contains SMS messages marked as either 'ham' (not spam) or 'spam'.

2. Preparing the Data: The labels are converted into numeric values, with 'ham' represented as 0 and 'spam' represented as 1.

3. Splitting the Data: The dataset is divided into two parts, a training set and a testing set. 80% of the data is used for training, while the remaining 20% is used to evaluate the model's performance.

4. Vectorization: To process the text of the SMS messages, we use scikit-learn's `CountVectorizer`. It transforms the text into a matrix of token counts (one row per message, one column per word), which can be used as input for our machine learning model.

5. Training the Model: We train a Multinomial Naive Bayes classifier on the training data. This classifier works well for classification tasks with discrete features such as word counts.

6. Prediction: Once our model is trained, we apply it to predict labels (i.e. whether a message is classified as 'ham' or 'spam') for the test set.

7. Evaluation: Lastly, we assess our model's performance by comparing its predicted labels against the true labels of the test set.
The classification report provides metrics such as precision, recall and F1 score for both the 'ham' and 'spam' categories. It also includes the overall accuracy of the model.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Load the dataset
df = pd.read_csv('SMSSpamCollection', sep='\t', names=["label", "message"])

# Map the label to a binary variable
df['label'] = df.label.map({'ham': 0, 'spam': 1})

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.2)

# Vectorize the texts
vectorizer = CountVectorizer()
X_train_transformed = vectorizer.fit_transform(X_train)
X_test_transformed = vectorizer.transform(X_test)

# Train a Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train_transformed, y_train)

# Predict the labels for the test set
y_pred = classifier.predict(X_test_transformed)

# Print the classification report
print(classification_report(y_test, y_pred))
```

```text
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       970
           1       0.98      0.93      0.95       145

    accuracy                           0.99      1115
   macro avg       0.98      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115
```
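The call to `train_test_split` in the code above passes no `random_state`, so the exact figures in the report will change from run to run. The following is a sketch of one possible variation, not the notebook's own method: it fixes the seed, stratifies the split so both sets keep the same ham/spam ratio, and bundles vectorizer and classifier into a scikit-learn pipeline. The ten messages and every name ending in `_toy` are hypothetical stand-ins for `df['message']` and `df['label']`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in data (1 = spam, 0 = ham); the notebook would use
# df['message'] and df['label'] here instead.
messages_toy = [
    "win a free prize now", "free cash claim now",
    "urgent prize waiting call now", "claim your free ringtone",
    "are we still on for lunch", "see you at home tonight",
    "can you call me later", "thanks for the lift today",
    "running late see you soon", "lunch at noon tomorrow",
]
labels_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# random_state makes the split repeatable; stratify keeps the class ratio
# roughly equal in the training and testing sets.
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    messages_toy, labels_toy, test_size=0.2, random_state=42, stratify=labels_toy
)

# The pipeline fits the vectorizer on the training text only and reuses
# the learned vocabulary when predicting.
pipeline_toy = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline_toy.fit(X_train_toy, y_train_toy)

y_pred_toy = pipeline_toy.predict(X_test_toy)
print(list(zip(X_test_toy, y_pred_toy)))
```

With a pipeline, raw message strings can be passed straight to `predict`, which removes the risk of forgetting to transform new text with the fitted vectorizer.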



## The classification report summarizes how your spam detection system performed on the test data. Let's go through the meaning of each term in relation to your system:

- Precision: This measures the proportion of correctly identified items (true positives) among all items the model assigned to a class (true positives plus false positives). In your case, for the 'ham' class (0), the precision is 0.99, indicating that about 99% of the messages labeled as 'ham' were actually 'ham'. Similarly, for the 'spam' class (1), the precision is 0.98, meaning that about 98% of the messages labeled as 'spam' were indeed 'spam'.

- Recall: This is the ratio of true positives to the sum of true positives and false negatives (items of a class that the model assigned to another class). For your system, the recall for the 'ham' class is 1.00 after rounding, indicating that nearly all actual 'ham' messages were correctly identified; because the 'spam' precision is below 1.00, at least one 'ham' message was flagged as 'spam'. For the 'spam' class, the recall is 0.93, meaning that about 93% of the actual 'spam' messages were caught and roughly 7% were misclassified as 'ham'.

- F1 score: The F1 score is the harmonic mean of precision and recall. It reaches its best value at 1 (perfect precision and recall) and its worst at 0. In your situation, the F1 score for the 'ham' category is 0.99, while for the 'spam' category it stands at 0.95, which is a strong result.

- Support: The support is the number of actual occurrences of each category in the test data. In your case, there were 970 'ham' messages and 145 'spam' messages in the test data.

- Macro avg: The macro average is the unweighted mean of a metric across the categories, without considering the proportion of each category in the data.

- Weighted avg: The weighted average is the mean of a metric across the categories, with each category weighted by its support, i.e. its proportion in the actual data.

- Accuracy: Accuracy is calculated by dividing the number of correct predictions by the total number of predictions. In your case, an accuracy score of 0.99 means that the model correctly classified about 99% of the messages.
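The definitions above can be checked by hand on a small example. The sketch below computes precision, recall, F1 and support from raw counts and compares the averaged precision with scikit-learn's `precision_recall_fscore_support`. The labels `y_true_toy` and `y_hat_toy` and the helper `per_class_metrics` are hypothetical and only serve the illustration; they are not the notebook's test set.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels for ten messages (1 = spam, 0 = ham).
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]


def per_class_metrics(y_true, y_hat, positive):
    """Return (precision, recall, f1, support) for one class.

    Assumes the class is both present and predicted at least once,
    so none of the denominators is zero.
    """
    pairs = list(zip(y_true, y_hat))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    support = tp + fn  # number of actual items of this class
    return precision, recall, f1, support


metrics = {cls: per_class_metrics(y_true_toy, y_hat_toy, cls) for cls in (0, 1)}
total_support = sum(m[3] for m in metrics.values())

# Macro average: plain mean over classes. Weighted average: mean weighted by support.
macro_precision = sum(m[0] for m in metrics.values()) / len(metrics)
weighted_precision = sum(m[0] * m[3] for m in metrics.values()) / total_support
accuracy = sum(t == p for t, p in zip(y_true_toy, y_hat_toy)) / len(y_true_toy)

# Cross-check the hand-computed averages against scikit-learn.
sk_macro = precision_recall_fscore_support(y_true_toy, y_hat_toy, average="macro")[0]
sk_weighted = precision_recall_fscore_support(y_true_toy, y_hat_toy, average="weighted")[0]

print(metrics)
print(macro_precision, sk_macro)
print(weighted_precision, sk_weighted)
print(accuracy)
```

Because the toy classes are imbalanced (6 ham, 4 spam), the macro and weighted averages differ slightly, which is the same effect visible in the last two rows of the classification report.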