# SMS Spam Classifier: Phase 4
## In this phase, we continue building our spam classifier by selecting a model, training it, and evaluating it.


In [None]:
# Import necessary packages
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Load the training data from a CSV file
training_data = pd.read_csv('sms_spam.csv', encoding='ISO-8859-1')

## Section 1: Model Selection

### Choosing the Machine Learning Algorithm

For our SMS text classification project, the choice of machine learning algorithm strongly influences the model's effectiveness. We chose the Multinomial Naive Bayes algorithm for the reasons outlined below.

### Multinomial Naive Bayes Algorithm

Multinomial Naive Bayes is a probabilistic algorithm that finds applications in various fields, with particular suitability for text classification tasks. Here's a brief overview of the algorithm:

1) Multinomial Distribution: The algorithm assumes that the features (in our case, words in SMS messages) are generated from a multinomial distribution. This makes it well suited to discrete data such as word counts, which are common in text analysis.

2) Naive Bayes Assumption: The "naive" part of the name comes from the assumption of conditional independence between features. The algorithm presumes that, given the class label, how often a particular word occurs in a message is independent of how often other words occur. This simplification reduces the probability calculation to a product of per-word probabilities, which makes it computationally efficient.

3) Feature Vector Representation: In text classification, each document (SMS message) is represented as a feature vector, with each feature corresponding to a word or term. The value of each feature is the frequency of that word in the document or a weighted statistic such as TF-IDF.

4) Parameter Estimation: The algorithm estimates its parameters from the training data, including the probability of each word occurring in spam and non-spam messages. These estimates are then used to make predictions.
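The four points above can be illustrated with a minimal from-scratch sketch. It is not how scikit-learn implements `MultinomialNB` internally. The toy messages, the `train` and `predict` helpers, and the smoothing value `alpha` are assumptions made for illustration, and the sketch uses raw word counts rather than TF-IDF weights.

```python
import math
from collections import Counter

# Made-up toy messages (not taken from sms_spam.csv)
toy_messages = [
    ("win cash now", "spam"),
    ("free cash prize", "spam"),
    ("see you at lunch", "ham"),
    ("lunch at noon", "ham"),
]


def train(messages, alpha=1.0):
    """Estimate log priors and smoothed per-class word log probabilities."""
    class_counts = Counter(label for _, label in messages)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in messages:
        word_counts[label].update(text.split())

    vocab = {word for counts in word_counts.values() for word in counts}
    log_prior = {
        label: math.log(count / len(messages))
        for label, count in class_counts.items()
    }

    log_likelihood = {}
    for label, counts in word_counts.items():
        # Additive smoothing keeps unseen words from getting zero probability
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[label] = {
            word: math.log((counts[word] + alpha) / total) for word in vocab
        }
    return log_prior, log_likelihood, vocab


def predict(text, log_prior, log_likelihood, vocab):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label in log_prior:
        # Conditional independence: per-word log probabilities simply add up
        scores[label] = log_prior[label] + sum(
            log_likelihood[label][word] for word in text.split() if word in vocab
        )
    return max(scores, key=scores.get)


log_prior, log_likelihood, vocab = train(toy_messages)
toy_prediction = predict("free cash", log_prior, log_likelihood, vocab)
print(toy_prediction)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.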

Our choice of Multinomial Naive Bayes is driven by its efficiency and its fit for text classification. However, its conditional independence assumption rarely holds for natural language, so we interpret the results with that limitation in mind.

In [1]:
from sklearn.naive_bayes import MultinomialNB

# Initialize the Multinomial Naive Bayes classifier
model = MultinomialNB()

## Section 2: Model Training

### Data Splitting

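To train and evaluate our Multinomial Naive Bayes model, we converted the message text to TF-IDF features and then split the dataset into a training set (80%) and a testing set (20%). Holding out a test set is essential for assessing the model's performance on messages it has not seen during training.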

In [None]:
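# Separate the message text from the labels
X = training_data['text']
y = training_data['type']

# Convert the text to TF-IDF features.
# Assumption: no fitted `tfidf_vectorizer` is carried over from an earlier phase, so one is fitted here.
tfidf_vectorizer = TfidfVectorizer()
X_vectorized = tfidf_vectorizer.fit_transform(X)

# Hold out 20% of the messages for testing
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=42)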
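The cell above vectorizes all messages before splitting. If the vectorizer was fitted on the full dataset, its vocabulary and IDF weights also reflect the test messages. A minimal sketch of one alternative is to split the raw text first and let a scikit-learn `Pipeline` fit TF-IDF on the training portion only. The toy messages and the `toy_`-prefixed variable names are made up for illustration and are not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up messages standing in for the 'text' and 'type' columns
toy_texts = [
    "win a free prize now",
    "free cash waiting for you",
    "claim your free reward today",
    "urgent prize call now",
    "are we still on for lunch",
    "see you at the meeting",
    "can you call me later",
    "running late see you soon",
]
toy_labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# Split the raw text first so the test messages stay unseen
toy_train_text, toy_test_text, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits TF-IDF on the training text only, then trains the classifier
toy_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_train_text, toy_y_train)

# At prediction time the same fitted vectorizer is applied to the test text
toy_predictions = toy_pipeline.predict(toy_test_text)
print(toy_predictions)
```

Bundling the vectorizer and classifier also means new messages can be passed to `predict` as plain strings, without a separate transform step.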

### Model Training
With the training data prepared, we proceeded to train our Multinomial Naive Bayes model.

In [None]:
# Model Training
model.fit(X_train, y_train)

The model is now ready for evaluation in the next section.

## Section 3: Model Evaluation

### Performance Metrics

To assess the performance of our Multinomial Naive Bayes model, we employed common text classification metrics. These metrics provide a quantitative measure of how well the model can distinguish between "spam" and "ham" messages. We focused on the following metrics:

1) Accuracy: The proportion of all messages that are classified correctly.

2) Precision: The proportion of messages predicted as "spam" that really are "spam". High precision means few false positives.

3) Recall: The proportion of actual "spam" messages that the model identifies. High recall means few false negatives.

4) F1-Score: The harmonic mean of precision and recall, which accounts for both false positives and false negatives in a single number.
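To make these definitions concrete, the sketch below computes all four metrics from confusion-matrix counts, treating "spam" as the positive class. The helper name `classification_metrics` and the counts are hypothetical and are not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return accuracy, precision, recall, f1


# Hypothetical counts: 40 spam caught, 10 ham flagged as spam,
# 10 spam missed, 140 ham correctly passed
toy_accuracy, toy_precision, toy_recall, toy_f1 = classification_metrics(
    tp=40, fp=10, fn=10, tn=140
)
print(toy_accuracy, toy_precision, toy_recall, toy_f1)
```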

In [3]:
# Section 3: Model Evaluation
# Performance Metrics

# Predict using the trained model
y_pred = model.predict(X_test)

# Calculate performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label="spam")
recall = recall_score(y_test, y_pred, pos_label="spam")
f1 = f1_score(y_test, y_pred, pos_label="spam")

# Results
print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"Precision: {precision * 100:.2f}%")
print(f"Recall: {recall * 100:.2f}%")
print(f"F1-Score: {f1 * 100:.2f}%")


Accuracy: 97.04%
Precision: 100.00%
Recall: 78.00%
F1-Score: 87.64%


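On the test set, 100% precision means that no "ham" message was flagged as "spam", while 78% recall means that the model missed about 22% of the "spam" messages. Taken together, the metrics describe a conservative classifier: it rarely raises false alarms, at the cost of letting some spam through.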
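As a quick consistency check, the F1-score can be recomputed from the reported precision and recall using the harmonic-mean formula. The variable names below are chosen for illustration.

```python
# Values copied from the reported output above
reported_precision = 1.00
reported_recall = 0.78

# F1 is the harmonic mean of precision and recall
f1_check = (
    2 * reported_precision * reported_recall
    / (reported_precision + reported_recall)
)
print(f"F1-Score: {f1_check * 100:.2f}%")  # F1-Score: 87.64%
```

The result agrees with the F1-score printed by `f1_score`.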