# SMS spam classifier
## Project by Mwafag Malik Omer Ahmed

This project develops a machine learning model that classifies SMS messages as spam or ham (not spam). It uses the SMS Spam Collection Dataset from the UCI Machine Learning Repository. The broader goal is to showcase my competence in machine learning: building the model drew on skills such as **Python programming, scikit-learn, data preprocessing, feature extraction, model training, and evaluation.**

# Importing necessary libraries
Here, we import the necessary libraries for our project:
- `numpy` and `pandas` for data manipulation
- `scikit-learn` for text feature extraction, train/test splitting, the Naive Bayes classifier, and evaluation metrics

In [7]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Load dataset
In this section, we load the SMS Spam Collection Dataset into a pandas DataFrame. This dataset contains SMS messages labeled as spam or ham (not spam).

In [6]:
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'Message'])

# Display the first few rows of the dataset
df.head()

|   | Label | Message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
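Before modeling, it can help to check how many messages carry each label. The following is a minimal sketch on a hypothetical three-row sample (the messages and the names `sample`, `toy_df`, and `label_counts` are illustrative, not taken from the dataset); it reads tab-separated text the same way as the cell above and counts the labels.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample in the same two-column layout as the dataset file.
sample = "ham\tSee you at lunch\nspam\tWin a free prize now\nham\tCall me later\n"

# Same read_csv arguments as the loading cell, applied to an in-memory buffer.
toy_df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=["Label", "Message"])

# Count how many messages fall under each label.
label_counts = toy_df["Label"].value_counts()
print(label_counts)
```

On the real data, the same `value_counts()` call on `df['Label']` would show how imbalanced the two classes are.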


# Data preprocessing
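We preprocess the dataset by mapping the labels to binary values (spam = 1, ham = 0) and splitting the data into training (70%) and test (30%) sets. We then convert the text messages into word-count feature vectors using `CountVectorizer`, fitting its vocabulary on the training set only so that no information from the test set leaks into the features.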

In [8]:
# Converting labels to binary values
df['Label'] = df['Label'].map({'spam': 1, 'ham': 0})

# Split the data
X = df['Message']
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Convert text data to feature vectors
vectorizer = CountVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)
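To make the vectorization step concrete, here is a small sketch on hypothetical toy messages (the strings and the `toy_*` names are illustrative assumptions). It shows that `CountVectorizer` learns its vocabulary from the training text and silently ignores words in the test text that it has not seen.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, not taken from the dataset.
toy_train = ["free prize now", "call me now"]
toy_test = ["free call tomorrow"]

toy_vectorizer = CountVectorizer()
train_counts = toy_vectorizer.fit_transform(toy_train)  # learn vocabulary and count words
test_counts = toy_vectorizer.transform(toy_test)        # reuse the learned vocabulary

# Vocabulary terms ordered by their column index in the count matrix.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)                   # ['call', 'free', 'me', 'now', 'prize']
print(train_counts.toarray())  # one row per training message
print(test_counts.toarray())   # 'tomorrow' is not in the vocabulary, so it is dropped
```

Each row is a message and each column is a word count, which is the input format the classifier expects.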


# Train the model

Here, we train a multinomial Naive Bayes classifier (`MultinomialNB`) on the training vectors. Multinomial Naive Bayes is designed for count features such as word counts, trains quickly, and is a common baseline for text classification.

In [11]:
# Train the Naive Bayes model
model = MultinomialNB()
model.fit(X_train_vectors, y_train)
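Once fitted, the classifier can label unseen text as long as the text passes through the same vectorizer. The sketch below illustrates this on a hypothetical four-message training set (all strings and the names `toy_messages`, `toy_labels`, `vec`, `clf`, and `new_messages` are assumptions made for illustration); it is not the notebook's trained model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data: 1 = spam, 0 = ham.
toy_messages = [
    "win a free prize now",
    "free entry win cash",
    "are you coming home",
    "see you at lunch",
]
toy_labels = [1, 1, 0, 0]

# Fit the vectorizer and the classifier on the toy training data.
vec = CountVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(toy_messages), toy_labels)

# New messages must be transformed with the already-fitted vectorizer.
new_messages = ["free cash prize", "see you at home"]
new_vectors = vec.transform(new_messages)

preds = clf.predict(new_vectors)        # predicted class per message
probs = clf.predict_proba(new_vectors)  # columns follow clf.classes_, here [0, 1]

for text, label, p in zip(new_messages, preds, probs):
    print(f"{text!r}: label={label}, P(spam)={p[1]:.2f}")
```

With the notebook's own objects, the equivalent call would be `model.predict(vectorizer.transform([...]))`.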

# Evaluate the model

In this section, we evaluate the performance of our trained model using the test data. We calculate the accuracy and print a classification report which includes precision, recall, and F1-score.

In [12]:
# Predict on the test set
y_pred = model.predict(X_test_vectors)

# Calculate accuracy and print classification report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print('Classification Report:')
print(report)

Accuracy: 0.9904306220095693
Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1448
           1       0.98      0.95      0.96       224

    accuracy                           0.99      1672
   macro avg       0.99      0.97      0.98      1672
weighted avg       0.99      0.99      0.99      1672
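The precision and recall columns in the report are derived from confusion-matrix counts. The sketch below works through that derivation on hypothetical labels (`y_true_toy` and `y_pred_toy` are made-up values, not the notebook's predictions) and checks the hand-computed values against scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical ground truth and predictions: 1 = spam, 0 = ham.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")

# The library functions give the same values for the positive class.
print(precision_score(y_true_toy, y_pred_toy), recall_score(y_true_toy, y_pred_toy))
```

Running `confusion_matrix(y_test, y_pred)` on the notebook's variables would give the corresponding counts for the real test set.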



# Conclusion

In this project, we successfully developed a machine learning model to classify SMS messages as spam or not spam using the Naive Bayes algorithm. The process involved loading and preprocessing the SMS Spam Collection Dataset, converting the text data into numerical feature vectors, training a Naive Bayes classifier, and evaluating its performance.

On the held-out test set (1,672 messages), the model achieved an accuracy of approximately 0.99 (99%). The classification report adds detail: for the spam class, precision is 0.98, recall is 0.95, and the F1-score is 0.96, so the model rarely flags ham as spam but misses about 5% of spam messages. Because only 224 of the test messages are spam, these per-class metrics are more informative than accuracy alone.
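To put the accuracy figure in context, the arithmetic below uses only numbers printed in the evaluation output (the supports 1448 and 224, and the reported accuracy). It computes the accuracy a trivial "always predict ham" baseline would reach on this test set and the number of misclassified messages implied by the reported accuracy. The variable names are illustrative.

```python
# Values copied from the evaluation output above.
ham_support = 1448
spam_support = 224
reported_accuracy = 0.9904306220095693

total = ham_support + spam_support

# Accuracy of a baseline that labels every message as ham.
majority_baseline = ham_support / total

# Number of test messages the model got right and wrong.
correct = round(reported_accuracy * total)
errors = total - correct

print(f"Test messages: {total}")
print(f"Always-ham baseline accuracy: {majority_baseline:.3f}")
print(f"Misclassified messages: {errors}")
```

The baseline comes out at roughly 0.866, so the model's 0.99 is a real improvement rather than a side effect of class imbalance.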

The results demonstrate the effectiveness of the Naive Bayes algorithm in text classification tasks, particularly in identifying spam messages. This project showcases the practical application of machine learning techniques to solve real-world problems and highlights essential skills in data preprocessing, feature extraction, model training, and evaluation.

More broadly, the project shows how machine learning can automate a routine decision such as spam filtering, and why evaluation on held-out data is needed before trusting a model's reliability and accuracy.
