# Spam or Ham Detection Using a Naive Bayes Classifier

Task description: <br>
3. Classification using the naive Bayes method (discrete distributions). The implementation should assume that the features are discrete/qualitative. The expected input is a dataset with p discrete/qualitative features, a vector of labels, and a vector of prior class probabilities. The output consists of the predicted labels and the posterior probabilities. An appropriate visualization is an added value.

Karol Cyganik 148250

## Data preparation

### Data reading

In [1]:
import pandas as pd

data = pd.read_csv('spam.csv', encoding='latin-1')
data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


### Data stats & preprocessing

In [2]:
data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
data.columns = ['label', 'text']
data.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
import plotly.express as px

fig = px.histogram(data, x='label', title='Distribution of Spam and Ham Messages')
fig.show()

In [4]:
def show_most_common_words(data, title='Most Common Words'):
    most_common_words = pd.Series(' '.join(data).lower().split()).value_counts()[:20]
    fig = px.bar(x=most_common_words.index, y=most_common_words.values, title=title)
    fig.show()

In [5]:
show_most_common_words(data['text'], title='Most Common Words in the Dataset')

In [6]:
show_most_common_words(data[data['label'] == 'spam']['text'], title='Most Common Words in Spam Messages')

In [7]:
show_most_common_words(data[data['label'] == 'ham']['text'], title='Most Common Words in Ham Messages')

Stopwords dominate the most common words in both classes, so we remove them together with punctuation.

In [8]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

def clean_text(text):
    text = text.lower()
    text = word_tokenize(text)
    text = [word for word in text if word not in stop_words and word not in punctuation]
    text = ' '.join(text)
    return text

cleaned_data = data.copy()
cleaned_data['text'] = cleaned_data['text'].apply(clean_text)
cleaned_data.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,label,text
0,ham,go jurong point crazy .. available bugis n gre...
1,ham,ok lar ... joking wif u oni ...
2,spam,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,u dun say early hor ... u c already say ...
4,ham,nah n't think goes usf lives around though


In [9]:
show_most_common_words(cleaned_data['text'], title='Most Common Words in the cleaned Dataset')

In [10]:
show_most_common_words(cleaned_data[cleaned_data['label'] == 'spam']['text'], title='Most Common Words in Spam Messages')

In [11]:
show_most_common_words(cleaned_data[cleaned_data['label'] == 'ham']['text'], title='Most Common Words in Ham Messages')

### Bag-of-words vector creation

In [12]:
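# Convert text to bag-of-words count vectors
from sklearn.feature_extraction.text import CountVectorizer

# Use a separate vectorizer per dataset so each fitted vocabulary stays available
vectorizer_cleaned = CountVectorizer()
bow_cleaned = vectorizer_cleaned.fit_transform(cleaned_data['text'])
vectorizer = CountVectorizer()  # fitted on the raw text
bow = vectorizer.fit_transform(data['text'])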

print(bow_cleaned.shape)
print(bow.shape)

(5572, 8606)
(5572, 8672)


### Transform data to X,y format

In [13]:
from sklearn.model_selection import train_test_split

X_train_cleaned, X_test_cleaned, y_train_cleaned, y_test_cleaned = train_test_split(bow_cleaned, cleaned_data['label'], test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(bow, data['label'], test_size=0.2, random_state=42)
print(X_train_cleaned.shape, X_test_cleaned.shape)
print(X_train.shape, X_test.shape)

(4457, 8606) (1115, 8606)
(4457, 8672) (1115, 8672)


## Model training

### Naive Bayes classifier from sklearn for reference

#### Cleaned data without punctuation marks and stopwords

In [14]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

nb = MultinomialNB()
nb.fit(X_train_cleaned, y_train_cleaned)
y_pred_cleaned = nb.predict(X_test_cleaned)

print('Accuracy: ', accuracy_score(y_test_cleaned, y_pred_cleaned))
print(classification_report(y_test_cleaned, y_pred_cleaned))

Accuracy:  0.9748878923766816
              precision    recall  f1-score   support

         ham       0.99      0.98      0.99       965
        spam       0.90      0.92      0.91       150

    accuracy                           0.97      1115
   macro avg       0.94      0.95      0.95      1115
weighted avg       0.98      0.97      0.98      1115



#### Raw data

In [15]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

nb = MultinomialNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)

print('Accuracy: ', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy:  0.97847533632287
              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       965
        spam       0.91      0.93      0.92       150

    accuracy                           0.98      1115
   macro avg       0.95      0.96      0.95      1115
weighted avg       0.98      0.98      0.98      1115



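Removing stopwords and punctuation, which is useful in many machine learning tasks, slightly lowers the performance of the Naive Bayes classifier here: accuracy drops from 0.978 to 0.975, and the loss falls mainly on the spam class, where precision, recall and F1 each decrease by about 0.01.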

In [16]:
probabilities = nb.predict_proba(X_test)
print(probabilities[:5])

[[1.32325954e-01 8.67674046e-01]
 [1.00000000e+00 4.20758364e-11]
 [6.01285769e-04 9.99398714e-01]
 [1.00000000e+00 9.21224812e-16]
 [6.20362416e-26 1.00000000e+00]]
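The columns of `predict_proba` follow the order of `nb.classes_`, which scikit-learn stores sorted, so in the output above the first column is ham and the second is spam. A minimal toy sketch of looking up the spam column by name instead of by position (the count matrix and the names `X_toy`, `y_toy`, `toy_nb` are made up for illustration and are not part of the dataset):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: the two columns stand for counts of two hypothetical words
X_toy = np.array([[3, 0], [2, 0], [0, 2], [0, 3]])
y_toy = np.array(['spam', 'spam', 'ham', 'ham'])

toy_nb = MultinomialNB().fit(X_toy, y_toy)

# predict_proba columns are ordered like toy_nb.classes_ (sorted labels)
toy_proba = toy_nb.predict_proba(np.array([[2, 0]]))
spam_col = list(toy_nb.classes_).index('spam')

print(toy_nb.classes_)
print('P(spam | message) =', toy_proba[0, spam_col])
```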


### Custom implementation of Naive Bayes classifier
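
For reference, the quantities that a naive Bayes classifier with discrete features has to compute can be written as follows. The notation is introduced here and does not come from the task description: $\pi_c$ is the prior probability of class $c$, $N_c$ is the number of training samples in class $c$, $N_{c,j,v}$ is the number of those samples whose feature $j$ takes value $v$, $K_j$ is the number of distinct values of feature $j$, and $\alpha$ is an assumed Laplace smoothing constant.

$$
P(y = c \mid x_1, \dots, x_p) = \frac{\pi_c \prod_{j=1}^{p} P(x_j \mid y = c)}{\sum_{c'} \pi_{c'} \prod_{j=1}^{p} P(x_j \mid y = c')}
$$

$$
\hat{P}(x_j = v \mid y = c) = \frac{N_{c,j,v} + \alpha}{N_c + \alpha K_j}
$$

The predicted label is the class with the largest posterior. In practice the products are usually evaluated as sums of logarithms to avoid numerical underflow.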

In [17]:
def naive_bayes():
    pass
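
The stub above is still empty. One possible shape for it, following the task description (discrete features, a label vector and a prior vector as input; predicted labels and posterior probabilities as output), is sketched below. This is an illustrative assumption and not the author's implementation: the function name `naive_bayes_sketch`, its parameters, the Laplace smoothing constant `alpha`, and the handling of unseen feature values are all choices made here.

```python
import numpy as np

def naive_bayes_sketch(X_train, y_train, priors, X_test, alpha=1.0):
    """Categorical naive Bayes with Laplace smoothing (illustrative sketch).

    X_train, X_test: 2-D arrays of discrete feature values.
    y_train: 1-D array of class labels.
    priors: prior probabilities, ordered like the sorted unique labels.
    Returns (predicted labels, posterior matrix with one column per sorted class).
    """
    X_train, X_test = np.asarray(X_train), np.asarray(X_test)
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)

    # Start every test row from the log priors, one column per class
    log_post = np.tile(np.log(np.asarray(priors, dtype=float)), (X_test.shape[0], 1))

    for j in range(X_train.shape[1]):
        values = np.unique(X_train[:, j])  # categories of feature j seen in training
        for k, c in enumerate(classes):
            col = X_train[y_train == c, j]
            denom = len(col) + alpha * len(values)
            # Smoothed estimate of P(x_j = v | y = c) for every seen category
            log_p = {v: np.log((np.sum(col == v) + alpha) / denom) for v in values}
            # Assumption: values unseen in training get the zero-count probability
            unseen = np.log(alpha / denom)
            log_post[:, k] += [log_p.get(v, unseen) for v in X_test[:, j]]

    # Normalize in log space to obtain posterior probabilities
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    return classes[post.argmax(axis=1)], post

# Toy usage with two binary features (made-up data)
X_demo = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y_demo = np.array(['spam', 'spam', 'ham', 'ham'])
labels, post = naive_bayes_sketch(X_demo, y_demo, [0.5, 0.5], np.array([[1, 1], [0, 0]]))
print(labels)
print(post)
```

On the notebook's data this could be tried with binarized dense inputs, for example `X_train.toarray() > 0` and `X_test.toarray() > 0`, with priors taken from the class frequencies. The Python loops make it slow for thousands of features, so it is meant as a reference for the logic rather than a replacement for `MultinomialNB`.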

## Explainability of sklearn nb model

In [18]:
import numpy as np

feature_names = np.array(vectorizer.get_feature_names_out())
log_prob = nb.feature_log_prob_
# Feature indices sorted by ascending log-probability, separately for each class
order = np.argsort(log_prob, axis=1)

# Plotly bar chart of the 10 most probable words per class
for i, label in enumerate(nb.classes_):
    top = order[i][-10:]
    fig = px.bar(x=feature_names[top], y=np.exp(log_prob[i][top]), title=f'Feature Importance for {label}')
    fig.show()

## Synthetic data generation

To further improve the performance of the Naive Bayes classifier, we could generate synthetic data with a generative model run locally.
TODO