# Gaussian Naive Bayes Lab

The goal of this lab is to classify emails as either "spam" or "not spam" (labelled "ham" in the dataset). We will use the Gaussian Naive Bayes algorithm to build our classifier.

The dataset used for this lab can be found here: [Spam.csv Dataset](https://www.kaggle.com/datasets/mfaisalqureshi/spam-email?resource=download)

### Import Libraries


In [14]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

### Data Preprocessing

In [15]:
# Load the data
raw_df = pd.read_csv('spam.csv', encoding='latin-1')
raw_df.head()

|   | Category | Message |
|---|----------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [16]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [17]:
# Convert category column to binary
raw_df['Category'] = raw_df['Category'].map({'ham': 0, 'spam': 1})
raw_df.head()

|   | Category | Message |
|---|----------|---------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [18]:
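# The message text is the model input (the features), not the labels
messages = raw_df['Message']

In [19]:
# Feature extraction: TF-IDF over the message text.
# GaussianNB does not accept sparse input, hence .toarray().
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(messages).toarray()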
y = raw_df['Category']

In [20]:
# Split the data into training and testing sets; 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Modelling and Evaluation

In [12]:
# Instantiate the model
gnb = GaussianNB()

# Train the model using the training sets
gnb.fit(X_train, y_train)

# Predict the response for test dataset
y_pred = gnb.predict(X_test)
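
As a rough illustration of what `fit` learns, Gaussian Naive Bayes stores a mean and a variance for every (class, feature) pair and treats each feature as normally distributed within a class. The sketch below uses a tiny made-up two-feature dataset (hypothetical values, not the TF-IDF matrix) and checks that the fitted `theta_` attribute matches the per-class means computed by hand.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical two-feature data, for illustration only
toy_X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
                  [0.9, 0.8], [1.0, 0.7], [0.8, 1.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_gnb = GaussianNB().fit(toy_X, toy_y)

# theta_ holds one mean per (class, feature)
manual_means = np.array([toy_X[toy_y == c].mean(axis=0) for c in (0, 1)])
means_match = np.allclose(toy_gnb.theta_, manual_means)
print(means_match)

# Class priors are estimated from the class frequencies in the training data
print(toy_gnb.class_prior_)

# Points near each class mean are assigned to that class
toy_pred = toy_gnb.predict(np.array([[0.1, 0.1], [0.9, 0.9]]))
print(toy_pred)
```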

In [29]:
# Evaluate the model performance
def print_metrics(y_true, y_pred):
    print(f"Precision Score: {precision_score(y_true, y_pred)}")
    print(f"Recall Score: {recall_score(y_true, y_pred)}")
    print(f"Accuracy Score: {accuracy_score(y_true, y_pred)}")
    print(f"F1 Score: {f1_score(y_true, y_pred)}")


In [30]:
print_metrics(y_test, y_pred)

Precision Score: 0.5702479338842975
Recall Score: 0.9261744966442953
Accuracy Score: 0.8968609865470852
F1 Score: 0.7058823529411765


### Interpretation of Results


1. **Precision Score**: The precision score measures the ratio of true positive predictions to the total number of positive predictions. In this context, a precision score of approximately 0.57 means that out of all the emails predicted as spam, about 57% of them are actually spam.

2. **Recall Score**: The recall score measures the ratio of true positive predictions to the total number of actual positive instances. A recall score of approximately 0.93 indicates that the model is able to correctly identify about 93% of all the actual spam emails.

3. **Accuracy Score**: The accuracy score measures the ratio of correct predictions to the total number of predictions. An accuracy score of approximately 0.90 means that the model correctly classifies about 90% of all emails. Because most messages in this dataset are ham, accuracy alone can overstate how well the model detects spam.

4. **F1 Score**: The F1 score is the harmonic mean of precision and recall, so it is high only when both are high. Here, the F1 score is approximately 0.71, pulled down by the low precision despite the high recall.
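
The four definitions above can be checked with plain arithmetic. The confusion-matrix counts below are not printed anywhere in the lab; they are inferred from the reported scores, assuming the 20% split of 5572 messages yields 1115 test messages. Treat them as a worked example rather than output from the notebook.

```python
# Counts inferred from the reported scores (an assumption, see above)
tp = 138  # spam predicted as spam
fp = 104  # ham predicted as spam
fn = 11   # spam predicted as ham
tn = 862  # ham predicted as ham
total = tp + fp + fn + tn

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / total
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that labels every message as ham
baseline_accuracy = (tn + fp) / total

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"Accuracy: {accuracy:.4f}")
print(f"F1: {f1:.4f}")
print(f"Always-ham baseline accuracy: {baseline_accuracy:.4f}")
```

Comparing the accuracy with the always-ham baseline shows how little the accuracy figure says on its own when one class dominates.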

In summary, the model performs reasonably well in terms of recall and accuracy, but poorly on precision: roughly 43% of the emails it flags as spam are actually legitimate. For a spam filter this is usually the costlier error, since legitimate emails end up hidden from the user. Depending on the specific requirements and constraints of your application, you may want to further tune the model or explore different algorithms to improve its performance.

## Note
This is a basic implementation of the Gaussian Naive Bayes algorithm for spam classification. For more advanced applications, you may consider additional preprocessing steps, feature engineering, hyperparameter tuning, and model evaluation techniques to further enhance the model's performance.
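
As one possible starting point for hyperparameter tuning, the sketch below searches over `var_smoothing`, the only tunable parameter of `GaussianNB`, using cross-validated F1. It runs on synthetic stand-in data from `make_classification` (an assumption made to keep the example self-contained); in the lab you would pass `X_train` and `y_train` instead, and the grid values are illustrative rather than recommended.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in data (hypothetical, not spam.csv)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.85, 0.15], random_state=42
)

# Candidate smoothing values on a log scale (illustrative grid)
param_grid = {'var_smoothing': np.logspace(-9, 0, 10)}

# 5-fold cross-validation, scored by F1 on the positive class
search = GridSearchCV(GaussianNB(), param_grid, scoring='f1', cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```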