# Building a Spam Filter with Multinomial Naive Bayes

The purpose of this project is to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The goal is to write a program that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

To train the algorithm, we'll use a dataset of 5,572 SMS messages that have already been classified by humans. The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The data collection process is described in more detail on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the papers authored by Tiago A. Almeida and José María Gómez Hidalgo.

# Exploring the Dataset

The first step is to explore the data, starting with reading in the dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import operator
from wordcloud import WordCloud, STOPWORDS

In [None]:
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])


print(f'Number of SMS messages: {sms.shape[0]:,}')
print(f'Number of missing values in the dataframe: {sms.isnull().sum().sum()}\n')

def pretty_print_table(df, substring):
    '''Pretty-prints a table of the result of `value_counts` method (in % and
    rounded) on the `Label` column of an input dataframe. Prints the title of
    the table with an input substring incorporated.
    '''
    print(f'Spam vs. ham {substring}, %')
    spam_ham_pct = round(df['Label'].value_counts(normalize=True)*100, 0)
    print(spam_ham_pct.to_markdown(tablefmt='pretty', headers=['Label', '%']))

# Pretty-printing % of spam and ham messages
pretty_print_table(df=sms, substring='(non-spam)')

# Plotting % of spam and ham messages
spam_pct = round(sms['Label'].value_counts(normalize=True)*100, 0)
fig, ax = plt.subplots(figsize=(8,2))
spam_pct.plot.barh(color='slateblue')
ax.set_title('Spam vs. ham, %', fontsize=25)
ax.set_xlabel(None)
ax.tick_params(axis='both', labelsize=16, left=False)
for side in ['top', 'right', 'left']:
    ax.spines[side].set_visible(False)
plt.show()

sms.head()

A preliminary glance at the data shows that about 87% of the messages are ham, while the remaining 13% are spam. At a high level, this tracks with experience, since most messages that people receive are, in fact, ham.

## Training and Test Set

Next, we split the dataset into a training set, which accounts for 80% of the data, and a test set, which holds the remaining 20%.

In [None]:
sms_randomized = sms.sample(frac=1, random_state=1)

# Creating a training set (80%) and a test set (20%)
training_set = sms_randomized[:4458].reset_index(drop=True)
test_set = sms_randomized[4458:].reset_index(drop=True)

# Finding the % of spam and ham in both sets
pretty_print_table(df=training_set, substring='in the training set')
print('\n')
pretty_print_table(df=test_set, substring='in the test set')

Both sets show roughly the same proportions of spam and ham as the full dataset, as expected.

# Data Cleaning

The next step is to calculate all the probabilities the algorithm will need. Before doing so, the data has to be cleaned and brought into a format from which those probabilities are easy to extract.

The main goal is to have a count of each unique word in each SMS.

## Letter Case and Punctuation

First, we remove punctuation and convert all letters to lower case.

In [None]:
# Before cleaning
training_set.head()

In [None]:
# Removing punctuation and making all the words lower case
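# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()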
training_set.head()
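
To make the effect of the `\W` pattern concrete, here is a minimal sketch on a single made-up message. The helper name `clean_sms` and the sample text are illustrative assumptions and are not part of the project code.

```python
import re

def clean_sms(text):
    # Replace every non-word character with a space, lower-case the result,
    # then split on whitespace to get a list of words
    return re.sub(r'\W', ' ', text).lower().split()

# Made-up sample message, used only for illustration
sample = "WINNER!! Claim your FREE prize-draw entry now."
tokens = clean_sms(sample)
print(tokens)
```

Note that hyphenated words such as "prize-draw" are split into two separate words, which matters later when messages like "super-prize" are classified.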

## Creating the Vocabulary

Next up is creating the lexicon, the list of unique words in our training set.

In [None]:
training_set['SMS'] = training_set['SMS'].str.split()
training_set.head(3)

In [None]:
vocabulary = []
for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))
print(f'Number of unique words in the vocabulary of the training set: {len(vocabulary):,}')

### The Final Training Set

The final step uses the vocabulary from above to transform the training set into a table of word counts, with one row per message and one column per unique word.

In [None]:
# Creating a dictionary where each key is a unique word from the vocabulary,
# and each value is a list of the frequencies of that word in each message
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}
for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index]+=1
        
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head(3)
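
The same transformation can be traced by hand on a toy corpus. The sketch below uses three made-up, already tokenized messages; the names `toy_messages`, `toy_vocabulary`, and `toy_counts` are assumptions introduced only for this illustration.

```python
# Three made-up, already cleaned and split messages
toy_messages = [['free', 'prize', 'free'],
                ['see', 'you', 'soon'],
                ['free', 'soon']]

# Unique words across all messages (sorted only to make the output stable)
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One list per word, holding that word's count in each message
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, message in enumerate(toy_messages):
    for word in message:
        toy_counts[word][index] += 1

print(toy_vocabulary)
print(toy_counts['free'])
```

Each dictionary value becomes one column of the final table, so `toy_counts['free']` reads as "free occurs twice in message 0, never in message 1, once in message 2".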

In [None]:
training_set_final = pd.concat([training_set, word_counts], axis=1)
training_set_final.head(3)

## The Most Frequent Words in Spam Messages

Having a count of the most frequently used words in the spam messages will provide some solid insight for testing the filter.

In [None]:
spam_sms = training_set_final[training_set_final['Label']=='spam']
ham_sms = training_set_final[training_set_final['Label']=='ham']

In [None]:
# Creating a dictionary of words from all spam messages with their frequencies
spam_dict = {}
for sms in spam_sms['SMS']:
    for word in sms:
        if word not in spam_dict:
            spam_dict[word]=0
        spam_dict[word]+=1

In [None]:
# Sorting the dictionary in descending order of word frequencies 
sorted_spam_dict = dict(sorted(spam_dict.items(), key=operator.itemgetter(1), reverse=True))

In [None]:
selected = ['call', 'free', 'stop', 'mobile', 'text', 'claim', 'www', 
            'prize', 'send', 'cash', 'nokia', 'win', 'urgent', 'service',
            'contact', 'com', 'msg', 'chat', 'guaranteed', 'customer', 
            'awarded', 'sms', 'ringtone', 'video', 'rate', 'latest', 
            'award', 'code', 'camera', 'chance', 'apply', 'valid', 'selected',
            'offer', 'tones', 'collection', 'mob', 'network', 'attempt', 
            'bonus', 'delivery', 'weekly', 'club', 'http', 'help', 'dating',
            'vouchers', 'poly', 'auction', 'ltd', 'pounds', 'special',
            'services', 'games', 'await', 'double', 'unsubscribe', 'hot',
            'price', 'sexy', 'camcorder', 'content', 'top', 'calls', 
            'account', 'private', 'winner', 'savamob', 'offers', 'pobox',
            'gift', 'net', 'quiz', 'expires', 'freemsg', 'play', 'ipod',
            'last', 'order', 'anytime', 'congratulations', 'caller', 'points',
            'identifier', 'voucher', 'statement', 'operator', 'real', 
            'mobiles', 'important', 'join', 'rental', 'valued', 'congrats',
            'final', 'enjoy', 'unlimited', 'tv', 'charged', 'sex']

# Extracting only the 100 most frequent spam words with their frequencies
filtered_sorted_spam_dict = {}
for word in selected:
    if word in sorted_spam_dict:
        filtered_sorted_spam_dict[word]=sorted_spam_dict[word] 
        
print(f'The number of the most popular spam words selected: {len(filtered_sorted_spam_dict)}')

In [None]:
# Creating a word cloud
fig = plt.subplots(figsize=(12,10)) 
wordcloud = WordCloud(width=1000, height=700,
                      background_color='white', 
                      random_state=1).generate_from_frequencies(filtered_sorted_spam_dict)
plt.title('The most frequent words in spam messages\n', fontsize=29)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

## Calculating Constants

Now, it's time to start creating the spam filter.

When a new message comes in, the Naive Bayes algorithm classifies it by comparing the values it gets from these two equations:

![alt text](formula1.png "Classification Formulae A")

To calculate <i>P(wi|Spam)</i> and <i>P(wi|Ham)</i> inside the formulas above, we use:

![alt text](formula2.png "Classification Formulae B")


where:


* <i>N<sub>wi|Spam</sub></i> — the number of times the word wi occurs in spam messages,
* <i>N<sub>wi|Ham</sub></i> — the number of times the word wi occurs in ham messages,
* <i>N<sub>Spam</sub></i> — total number of words in spam messages,
* <i>N<sub>Ham</sub></i> — total number of words in ham messages,
* <i>N<sub>Vocabulary</sub></i> — total number of unique words in the vocabulary,
* <i>α</i> — a smoothing parameter.

Of course some of these will have the same value for every new message: <i>P(Spam)</i>, <i>P(Ham)</i>, <i>N<sub>Spam</sub></i>, <i>N<sub>Ham</sub></i>, <i>N<sub>Vocabulary</sub></i>.  We can use Laplace smoothing and set our <i>α</i> value to 1.
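
As a small illustration of why the smoothing parameter matters, the sketch below applies the smoothed estimate (word count plus α, divided by the class word total plus α times the vocabulary size) to made-up counts. The function name `smoothed_probability` and all numbers are assumptions chosen for the example.

```python
def smoothed_probability(word_count, class_word_total, vocabulary_size, alpha=1):
    # (N_wi|class + alpha) / (N_class + alpha * N_Vocabulary)
    return (word_count + alpha) / (class_word_total + alpha * vocabulary_size)

# Made-up counts of 10 vocabulary words inside spam messages (they sum to 20)
toy_word_counts = [4, 0, 3, 2, 1, 5, 0, 2, 2, 1]
toy_class_total = sum(toy_word_counts)
toy_vocab_size = len(toy_word_counts)

p_seen = smoothed_probability(4, toy_class_total, toy_vocab_size)
p_unseen = smoothed_probability(0, toy_class_total, toy_vocab_size)
p_unsmoothed = smoothed_probability(0, toy_class_total, toy_vocab_size, alpha=0)
total = sum(smoothed_probability(c, toy_class_total, toy_vocab_size)
            for c in toy_word_counts)

print(p_seen, p_unseen, p_unsmoothed, total)
```

Without smoothing, a word never seen in spam gets probability 0 and wipes out the whole product for that class. With α = 1 it keeps a small positive probability, and the probabilities over the vocabulary still sum to 1.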

Now to calculate the constants:

In [None]:
p_spam = training_set_final['Label'].value_counts()['spam']/len(training_set_final)
p_ham = training_set_final['Label'].value_counts()['ham']/len(training_set_final)

n_spam = 0
n_ham = 0
for i in range(len(training_set_final)):
    row = list(training_set_final.iloc[i].values)
    for j in range(2,len(row)):
        if row[0]=='spam':
            n_spam+=row[j]
        else:
            n_ham+=row[j]
            
n_vocabulary = len(vocabulary)
alpha = 1

print(f'p_spam: {p_spam:.2f}\n'
      f'p_ham: {p_ham:.2f}\n'
      f'n_spam: {n_spam:,}\n'
      f'n_ham: {n_ham:,}\n'
      f'n_vocabulary: {n_vocabulary:,}\n'
      f'alpha: {alpha}')

## Calculating Parameters

The parameters <i>P(wi|Spam)</i> and <i>P(wi|Ham)</i> will vary depending on the individual words. However, both probabilities for each individual word remain constant for every new message, since they only depend on the training set. This means that we can use our training set to calculate both probabilities for each word in our vocabulary beforehand, which makes the Naive Bayes algorithm very fast compared to other algorithms. When a new message comes in, most of the needed computations are already done, which enables the algorithm to almost instantly classify the new message.

There are 7,783 words in our vocabulary, hence we'll need to calculate a total of 15,566 probabilities (<i>P(wi|Spam)</i> and <i>P(wi|Ham)</i> for each word) using the following equations:

![alt text](formula3.png "Parameter Calculation")

In [None]:
p_wi_spam = {}
p_wi_ham = {}

for word in vocabulary:
    p_wi_spam[word] = (spam_sms[word].sum()+alpha)/(n_spam+alpha*n_vocabulary)
    p_wi_ham[word] = (ham_sms[word].sum()+alpha)/(n_ham+alpha*n_vocabulary)

## Classifying a New Message

With the constants and parameters calculated, we can build the spam filter. The filter is a function that:

* Ingests a new message as input
* Calculates <i>P(Spam|message)</i> and <i>P(Ham|message)</i> using the following formulas:

![alt text](formula4.png)

* Compares both values and:
    * if <i>P(Ham|message)</i> > <i>P(Spam|message)</i>, then the message is classified as ham,
    * if <i>P(Ham|message)</i> < <i>P(Spam|message)</i>, then the message is classified as spam,
    * if <i>P(Ham|message)</i> = <i>P(Spam|message)</i>, then the algorithm may request human help.

If a new message contains some words that are not in the vocabulary, these words will be ignored for the purposes of calculating probabilities.

And we can test the function with obviously spam or ham messages:

In [None]:
def classify_test_set(message):
    '''Takes in a message as a string, removes punctuation, and makes all the
    words lower case, calculates P(Spam|message) and P(Ham|message) based on
    the constants and parameters calculated earlier in the project, compares
    the two values and classifies the message as spam or ham, or requires 
    human classification. 
    '''
    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if word in p_wi_spam:
            p_spam_given_message*=p_wi_spam[word]
        if word in p_wi_ham:
            p_ham_given_message*=p_wi_ham[word]
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

# Testing the function
print(classify_test_set('Do you want to win an amazing super-prize today?'))
print(classify_test_set('Ian, you look super-amazing today!'))

The filter handles both of these obvious cases correctly.
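
One practical caveat: multiplying many small probabilities can underflow to zero for long messages, which would make both classes tie. A common workaround is to add logarithms instead. The sketch below is a log-space variant under that assumption; the function name `classify_log` and the toy parameter values are made up for illustration and do not come from the training set.

```python
import math
import re

def classify_log(message, p_spam, p_ham, p_wi_spam, p_wi_ham):
    # Same cleaning as the filter above: strip punctuation, lower-case, split
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are ignored
        if word in p_wi_spam and word in p_wi_ham:
            log_spam += math.log(p_wi_spam[word])
            log_ham += math.log(p_wi_ham[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

# Made-up parameters for four words
toy_p_wi_spam = {'win': 0.02, 'prize': 0.03, 'today': 0.005, 'you': 0.01}
toy_p_wi_ham = {'win': 0.001, 'prize': 0.0005, 'today': 0.004, 'you': 0.03}

first = classify_log('Do you want to win an amazing super-prize today?',
                     0.13, 0.87, toy_p_wi_spam, toy_p_wi_ham)
second = classify_log('Ian, you look super-amazing today!',
                      0.13, 0.87, toy_p_wi_spam, toy_p_wi_ham)
# 500 repetitions of one word: a plain product of probabilities would underflow
long_message = classify_log('win ' * 500, 0.13, 0.87, toy_p_wi_spam, toy_p_wi_ham)
print(first, second, long_message)
```

Because the logarithm is monotonic, comparing the two sums gives the same decision as comparing the two products whenever the products are representable.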

## Measuring the Spam Filter's Accuracy

From the previous work, we have a test set of messages.  The algorithm will treat each message as new since it was not in the training data set.  The output will be a classification label which we can use to compare to the human-assigned label.

In [None]:
test_set['Predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

And we can compare the accuracy of predicted vs. actual labels:

In [None]:
# Calculating the accuracy of the spam filter
correct = 0
total = len(test_set)        # number of sms in the test set
for row in test_set.iterrows():
    if row[1]['Predicted']==row[1]['Label']:
        correct+=1
accuracy = correct/total*100
print(f'The accuracy of the spam filter: {accuracy:.2f}%')

According to the result, the filter surpasses our target of 80% accuracy.
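
Since about 87% of the messages are ham, a filter that always answers "ham" would already score close to 87% accuracy, so accuracy alone can flatter the result. The sketch below computes precision and recall for the spam class alongside accuracy; the function name `evaluate` and the toy labels are assumptions made for illustration.

```python
def evaluate(actual, predicted, positive='spam'):
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    # Guard against division by zero when nothing is predicted or present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall}

# Made-up labels: 8 ham messages followed by 2 spam messages
toy_actual = ['ham'] * 8 + ['spam'] * 2
always_ham = ['ham'] * 10
toy_predicted = ['ham'] * 7 + ['spam'] + ['spam', 'ham']

baseline = evaluate(toy_actual, always_ham)
filter_scores = evaluate(toy_actual, toy_predicted)
print(baseline)
print(filter_scores)
```

In this toy case both predictors reach 80% accuracy, but the always-ham baseline has a recall of 0 because it never catches a spam message.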

## Incorrectly-Classified Messages

We can see that there were some messages that were classified incorrectly.  Some manual review will help understand what went wrong.

In [None]:
false_spam = test_set[(test_set['Predicted']=='spam')&(test_set['Label']=='ham')].reset_index(drop=True)
false_ham = test_set[(test_set['Predicted']=='ham')&(test_set['Label']=='spam')].reset_index(drop=True)
unclear = test_set[test_set['Predicted']=='needs human classification'].reset_index(drop=True)

print('Total number of wrongly classified messages: ', len(false_spam)+len(false_ham)+len(unclear))
print('_________________________________________________________________________\n')
print('FALSE SPAM MESSAGES:')
for row in false_spam.iterrows():
    print(f'{row[0]+1}. ', row[1]['SMS'])
print('_________________________________________________________________________\n')
print('FALSE HAM MESSAGES:')
for row in false_ham.iterrows():
    print(f'{row[0]+1}. ', row[1]['SMS'])
print('_________________________________________________________________________\n')
print('UNCLEAR MESSAGES:')
for row in unclear.iterrows():
    print(f'{row[0]+1}. ', row[1]['SMS'])
print('_________________________________________________________________________')

* On very rare occasions, ham messages are incorrectly flagged as spam when they are very short (bearing in mind that some words of a new message may be absent from the vocabulary) and also contain suspicious ad-style words such as unlimited, phone, calls, messages, contact, or sent. These words appeared mostly in spam messages in the training set, and we saw several of them earlier on the word cloud. In addition, very short false spam messages can contain seemingly neutral words (like July) that occurred only once or twice in the training set and, by coincidence, only in spam messages.
* Spam messages incorrectly classified as ham tend to be rather long and to contain a high percentage of "normal" words, which lets them slip past the filter. They usually include contact details, websites, sums of money, or words like asap. When they are short, they may consist mostly of words absent from the vocabulary.
* The message that was not classified at all (originally a ham message) is quite long and makes heavy use of slang and abbreviations that are most probably absent from the vocabulary. Most of its other words look neutral and could appear in both spam and ham messages. There are, though, some potentially suspicious words (saved, boost, secret, energy, instantly) that raised the probability of spam for this message until it equaled that of ham.
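
One way to check observations like these is to rank the words of a misclassified message by how strongly each one points toward spam, using the logarithm of the ratio of the two word probabilities. The sketch below assumes that approach; the function name `rank_words_by_spam_evidence` and the probabilities are made up and do not come from the trained filter.

```python
import math

def rank_words_by_spam_evidence(words, p_wi_spam, p_wi_ham):
    scores = {}
    for word in set(words):
        # Words outside the vocabulary carry no evidence and are skipped
        if word in p_wi_spam and word in p_wi_ham:
            scores[word] = math.log(p_wi_spam[word] / p_wi_ham[word])
    # Positive scores favor spam, negative scores favor ham
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up word probabilities for illustration
toy_spam = {'unlimited': 0.004, 'calls': 0.006, 'tomorrow': 0.0005, 'the': 0.01}
toy_ham = {'unlimited': 0.0002, 'calls': 0.001, 'tomorrow': 0.002, 'the': 0.012}

ranked = rank_words_by_spam_evidence(
    ['unlimited', 'calls', 'tomorrow', 'the', 'xyz'], toy_spam, toy_ham)
for word, score in ranked:
    print(f'{word}: {score:+.2f}')
```

In a short message, one or two words with large positive scores can outweigh everything else, which matches the pattern described for the false spam messages.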

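# Making the Algorithm Case-Sensitive

As an experiment, we rebuild the filter without converting the messages to lower case, to check whether letter case helps separate spam from ham.
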
In [None]:
training_set_exp = sms_randomized[:4458].reset_index(drop=True)
test_set_exp = sms_randomized[4458:].reset_index(drop=True)
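# Removing punctuation but keeping the original letter case, then splitting
# each message into a list of words so the loops below iterate over words
# rather than over single characters
training_set_exp['SMS'] = training_set_exp['SMS'].str.replace(r'\W', ' ', regex=True).str.split()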

vocabulary_exp = []
for sms in training_set_exp['SMS']:
    for word in sms:
        vocabulary_exp.append(word)
vocabulary_exp = list(set(vocabulary_exp))

word_counts_per_sms_exp = {unique_word: [0] * len(training_set_exp['SMS']) for unique_word in vocabulary_exp}
for index, sms in enumerate(training_set_exp['SMS']):
    for word in sms:
        word_counts_per_sms_exp[word][index]+=1
        
word_counts_exp = pd.DataFrame(word_counts_per_sms_exp)

training_set_final_exp = pd.concat([training_set_exp, word_counts_exp], axis=1)
    
spam_sms_exp = training_set_final_exp[training_set_final_exp['Label']=='spam']
ham_sms_exp = training_set_final_exp[training_set_final_exp['Label']=='ham']

p_spam_exp = training_set_final_exp['Label'].value_counts()['spam']/len(training_set_final_exp)
p_ham_exp = training_set_final_exp['Label'].value_counts()['ham']/len(training_set_final_exp)

n_spam_exp = 0
n_ham_exp = 0
for i in range(len(training_set_final_exp)):
    row = list(training_set_final_exp.iloc[i].values)
    for j in range(2,len(row)):
        if row[0]=='spam':
            n_spam_exp+=row[j]
        else:
            n_ham_exp+=row[j]
            
n_vocabulary_exp = len(vocabulary_exp)
alpha = 1

p_wi_spam_exp = {}
p_wi_ham_exp = {}
for word in vocabulary_exp:
    p_wi_spam_exp[word] = (spam_sms_exp[word].sum()+alpha)/(n_spam_exp+alpha*n_vocabulary_exp)
    p_wi_ham_exp[word] = (ham_sms_exp[word].sum()+alpha)/(n_ham_exp+alpha*n_vocabulary_exp)
    
def classify_test_set_exp(message):
    message = re.sub(r'\W', ' ', message)  # punctuation removed, letter case kept
    message = message.split()
    p_spam_given_message_exp = p_spam_exp
    p_ham_given_message_exp = p_ham_exp
    for word in message:
        if word in p_wi_spam_exp:
            p_spam_given_message_exp*=p_wi_spam_exp[word]
        if word in p_wi_ham_exp:
            p_ham_given_message_exp*=p_wi_ham_exp[word]
    if p_ham_given_message_exp > p_spam_given_message_exp:
        return 'ham'
    elif p_spam_given_message_exp > p_ham_given_message_exp:
        return 'spam'
    else:
        return 'needs human classification'
    
test_set_exp['Predicted'] = test_set_exp['SMS'].apply(classify_test_set_exp)

correct_exp = 0
total_exp = len(test_set_exp)

for row in test_set_exp.iterrows():
    if row[1]['Predicted']==row[1]['Label']:
        correct_exp+=1
accuracy_exp = correct_exp/total_exp*100
print(f'The accuracy of the spam filter: {accuracy_exp:.2f}%')

We see that making the filtering system more complex by introducing letter case sensitivity rendered our spam filter much less accurate in labeling new messages (the accuracy dropped by 13.5%), even though it still exceeds the 80% accuracy we aimed for at the beginning. It seems that letter case doesn't make any valuable difference when it comes to distinguishing between spam and ham messages. Hence, for classifying new messages, we keep the previous spam filter, with its accuracy of 98.74%.

## Conclusion

In this project, we created a highly accurate spam filter based on the multinomial Naive Bayes algorithm and a dataset of 5,572 labeled SMS messages. The spam filter takes in a new message and classifies it as spam or ham. We reached an accuracy of 98.74%, almost 19 percentage points above our initial target of 80%. Below are some additional conclusions and insights from this project:

* The few messages that were classified incorrectly have some features in common. False spam messages tend to be very short, contain words absent from the vocabulary, and include typical spam-like words or neutral words that, by coincidence, were previously seen only in spam messages. False ham messages tend to be rather long and have a high percentage of neutral words or words absent from the vocabulary. In the unclassified messages, we can expect a roughly balanced mixture of neutral and spam-like words.
* The attempt to increase the accuracy even further by making the algorithm sensitive to letter case had the opposite effect: the spam filter became much less accurate, with the accuracy dropping by 13.5%. It seems that letter case doesn't make any valuable difference when it comes to distinguishing between spam and ham messages.
* The 100 most popular meaningful spam-prone words revealed the following patterns: encouraging people to take further action, promising them something alluring, urging them, sexual content, inviting them to visit web resources, and advertising various digital devices and products.

In [None]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
def load_data(filepath):
    return pd.read_csv(filepath, encoding='latin-1')

data = load_data("spam.csv")


In [None]:
def preprocess_message(message):
    # Lower-case the text and replace every non-word character with a space
    return re.sub(r'\W', ' ', message.lower())

# Replace 'message_column_name' with the column in your dataset that contains the SMS messages
data['processed'] = data['message_column_name'].apply(preprocess_message)


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    data['processed'], 
    data['label_column_name'], ## Replace 'label_column_name' with the column name in your dataset that contains the labels (i.e., 'spam' or 'ham').
    test_size=0.2, 
    random_state=42)

In [None]:
vectorizer = CountVectorizer()
X_train_transformed = vectorizer.fit_transform(X_train)
X_test_transformed = vectorizer.transform(X_test)

In [None]:
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_transformed, y_train)

In [None]:
y_pred = nb_classifier.predict(X_test_transformed)

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")


In [None]:
conf_matrix = confusion_matrix(y_test, y_pred)

def print_confusion_matrix(conf_matrix):
    print("Confusion Matrix:")
    print("                 Predicted:")
    print("                 Ham | Spam")
    print("--------------------------")
    print(f"Actual Ham  |  {conf_matrix[0][0]}   | {conf_matrix[0][1]}")
    print("--------------------------")
    print(f"Actual Spam |  {conf_matrix[1][0]}   | {conf_matrix[1][1]}")

print_confusion_matrix(conf_matrix)


In [2]:
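# NOTE:
# Rows are actual labels and columns are predicted labels, in sorted label
# order ('ham', then 'spam'). Treating spam as the positive class, the
# confusion matrix is arranged as:

# [TN, FP]
# [FN, TP]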


In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
# Tokenize and pad sequences
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(messages)  # Assuming 'messages' is your list of SMSes

sequences = tokenizer.texts_to_sequences(messages)
padded_sequences = pad_sequences(sequences, maxlen=100, padding="post", truncating="post")

In [None]:
# Vanilla RNN model
model_rnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 64, input_length=100),
    tf.keras.layers.SimpleRNN(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model_rnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_rnn.summary()

In [None]:
# LSTM model
model_lstm = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 64, input_length=100),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model_lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_lstm.summary()


In [None]:
# CNN model for text
model_cnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 64, input_length=100),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model_cnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_cnn.summary()

In [None]:
model_rnn.fit(padded_sequences, labels, epochs=10, validation_split=0.1)  # Assuming 'labels' is your list of binary labels (1 for spam, 0 for ham)

In [None]:
from sklearn.svm import SVC

In [None]:
# Initializing the SVM classifier. Using the 'linear' kernel is common for text classification.
svm_classifier = SVC(kernel='linear', random_state=42)
svm_classifier.fit(X_train_transformed, y_train)

In [None]:
# Making predictions on the test set
y_pred_svm = svm_classifier.predict(X_test_transformed)

# Calculating accuracy
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print(f'SVM Model Accuracy: {accuracy_svm * 100:.2f}%')

# Displaying the confusion matrix for the SVM model
conf_matrix_svm = confusion_matrix(y_test, y_pred_svm)
print_confusion_matrix(conf_matrix_svm)

In [None]:
# Required Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score, f1_score
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch  # needed for torch.tensor in the dataset classes below
from torch.utils.data import Dataset, DataLoader

# Data Preparation
data = pd.read_csv('spam.csv', encoding='latin-1')
data = data[['v1', 'v2']]
data.columns = ['label', 'message']

# Train-Validation-Test Split
X_train, X_temp, y_train, y_temp = train_test_split(data['message'], data['label'], test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# 1. Deep Learning Models: vanilla RNN, LSTM, and CNN
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, LSTM, Conv1D, MaxPooling1D, Dense, Flatten

# Tokenization and Sequence Padding
max_words = 5000
max_len = 100
tokenizer = Tokenizer(num_words=max_words, oov_token='<OOV>')
tokenizer.fit_on_texts(X_train)
sequences_train = tokenizer.texts_to_sequences(X_train)
sequences_val = tokenizer.texts_to_sequences(X_val)

X_train_padded = pad_sequences(sequences_train, maxlen=max_len, padding='post', truncating='post')
X_val_padded = pad_sequences(sequences_val, maxlen=max_len, padding='post', truncating='post')

# binary_crossentropy needs numeric targets, so encode the string labels (ham=0, spam=1)
y_train_bin = y_train.map({'ham': 0, 'spam': 1}).values
y_val_bin = y_val.map({'ham': 0, 'spam': 1}).values

# Vanilla RNN
model_rnn = Sequential([
    Embedding(max_words, 32, input_length=max_len),
    SimpleRNN(32),
    Dense(1, activation='sigmoid')
])
model_rnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_rnn.fit(X_train_padded, y_train_bin, epochs=5, validation_data=(X_val_padded, y_val_bin))

# LSTM
model_lstm = Sequential([
    Embedding(max_words, 32, input_length=max_len),
    LSTM(32),
    Dense(1, activation='sigmoid')
])
model_lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_lstm.fit(X_train_padded, y_train_bin, epochs=5, validation_data=(X_val_padded, y_val_bin))

# CNN
model_cnn = Sequential([
    Embedding(max_words, 32, input_length=max_len),
    Conv1D(32, 5, activation='relu'),
    MaxPooling1D(5),
    Flatten(),
    Dense(1, activation='sigmoid')
])
model_cnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_cnn.fit(X_train_padded, y_train_bin, epochs=5, validation_data=(X_val_padded, y_val_bin))


# 2. Word Embeddings - BERT:
# Define dataset for BERT
class SMSTextDataset(Dataset):
    def __init__(self, messages, labels, tokenizer, max_len):
        self.messages = messages
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.messages)

    def __getitem__(self, item):
        message = str(self.messages[item])
        label = self.labels[item]
        encoding = self.tokenizer.encode_plus(
            message,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            truncation=True,       # cut messages longer than max_len
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'message_text': message,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Training BERT for SMS Classification
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# Fine-tuning BERT with the Hugging Face Trainer
# (TrainingArguments and Trainer are already imported at the top of this cell)

# Data Preprocessing for BERT
label_map = {'ham': 0, 'spam': 1}
train_encodings = tokenizer(list(X_train), truncation=True, padding='max_length', max_length=100)
val_encodings = tokenizer(list(X_val), truncation=True, padding='max_length', max_length=100)
train_labels = y_train.map(label_map).tolist()
val_labels = y_val.map(label_map).tolist()

class SMSTextDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

train_dataset = SMSTextDataset(train_encodings, train_labels)
val_dataset = SMSTextDataset(val_encodings, val_labels)

# BERT Training
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy="steps",
    save_steps=500
)

trainer = Trainer(
    model=bert_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

trainer.train()


# 3. Ensemble Methods:
# Convert messages into TF-IDF representation
vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)
X_test_tfidf = vectorizer.transform(X_test)

# Random Forest
rf = RandomForestClassifier()
rf.fit(X_train_tfidf, y_train)

# Gradient boosting (scikit-learn's GradientBoostingClassifier, not the XGBoost library)
xgb = GradientBoostingClassifier()
xgb.fit(X_train_tfidf, y_train)

# AdaBoost
ada = AdaBoostClassifier()
ada.fit(X_train_tfidf, y_train)

# 4. SVM with TF-IDF:
svm = SVC(probability=True)  # Set probability=True for AUC-ROC later on
svm.fit(X_train_tfidf, y_train)

# 5. Regularization Techniques:
# Dropout and L2 weight regularization added to the deep learning models.
# These definitions replace the models above and still need compile() and fit() before use.
from tensorflow.keras import regularizers

model_rnn = Sequential([
    Embedding(max_words, 32, input_length=max_len),
    SimpleRNN(32, dropout=0.2, recurrent_dropout=0.2, kernel_regularizer=regularizers.l2(0.01)),
    Dense(1, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01))
])

model_lstm = Sequential([
    Embedding(max_words, 32, input_length=max_len),
    LSTM(32, dropout=0.2, recurrent_dropout=0.2, kernel_regularizer=regularizers.l2(0.01)),
    Dense(1, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01))
])

model_cnn = Sequential([
    Embedding(max_words, 32, input_length=max_len),
    Conv1D(32, 5, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    MaxPooling1D(5),
    Flatten(),
    Dense(1, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01))
])

import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.models import Model

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Symbolic inputs for the functional API.
# Assumption: messages are tokenized with padding='max_length' and max_length=128.
input_ids = Input(shape=(128,), dtype=tf.int32, name='input_ids')
attention_mask = Input(shape=(128,), dtype=tf.int32, name='attention_mask')

# Get BERT's output by calling the encoder (without its built-in classification head)
bert_outputs = bert_model.bert(input_ids, attention_mask=attention_mask)

# Add dropout for regularization
output = Dropout(0.2)(bert_outputs[1])  # [1] captures the pooled_output from BERT

# Add a dense classification layer with L2 regularization
classifier = Dense(1, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01))(output)

# Construct the new model
model = Model(inputs=[input_ids, attention_mask], outputs=classifier)

# Fine-tuning BERT usually needs a far smaller learning rate than Adam's default of 1e-3
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5), loss='binary_crossentropy', metrics=['accuracy'])
# ... continue with fine-tuning


# 6. Transfer Learning:
# Implemented above with BERT

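# 7. Data Augmentation:
# Simple augmentation helpers: synonym replacement, back-translation,
# random deletion, and random swap.
# WordNet must be downloaded once: import nltk; nltk.download('wordnet')
from nltk.corpus import wordnet

def synonym_replacement(text):
    '''Replaces each distinct word with the first lemma of its first WordNet
    synset, whenever that lemma differs from the word itself.
    '''
    words = text.split()
    new_words = words.copy()
    num_replaced = 0
    for random_word in set(words):
        synonyms = wordnet.synsets(random_word)
        if len(synonyms) == 0:
            continue
        # WordNet joins multi-word lemmas with underscores
        synonym = synonyms[0].lemmas()[0].name().replace('_', ' ')
        if synonym.lower() == random_word.lower():
            continue  # The first lemma is often the word itself
        new_words = [synonym if word == random_word else word for word in new_words]
        num_replaced += 1
    if num_replaced > 0:
        return ' '.join(new_words)
    return text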

# Example:
# augmented_text = synonym_replacement("This is a sample text")

# Using googletrans library for this example, but there are many others
from googletrans import Translator

def back_translate(text, lang='es'):
    translator = Translator()
    translated_text = translator.translate(text, dest=lang).text
    translated_back_text = translator.translate(translated_text, dest='en').text
    return translated_back_text

# Example:
# augmented_text = back_translate("This is a sample text")

import random

def random_deletion(text, p=0.2):  # p is the probability of deleting each word
    words = text.split()
    if len(words) <= 1:  # Nothing to delete from an empty or one-word text
        return text
    remaining = [word for word in words if random.uniform(0, 1) > p]
    if len(remaining) == 0:  # If all words are deleted, keep one at random
        return random.choice(words)
    return ' '.join(remaining)

# Example:
# augmented_text = random_deletion("This is a sample text")

def random_swap(text):
    words = text.split()
    if len(words) < 2:
        return text
    idx1, idx2 = random.sample(range(len(words)), 2)
    words[idx1], words[idx2] = words[idx2], words[idx1]
    return ' '.join(words)

# Example:
# augmented_text = random_swap("This is a sample text")


# 8. Model Stacking:
# Getting predictions on validation set
val_preds_rf = rf.predict(X_val_tfidf)
val_preds_xgb = xgb.predict(X_val_tfidf)
val_preds_ada = ada.predict(X_val_tfidf)
# TODO: Get BERT and SVM predictions on validation set
from transformers import BertTokenizer, TFBertForSequenceClassification
from sklearn.preprocessing import LabelEncoder
import numpy as np
import tensorflow as tf

# Assuming you've already loaded the tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForSequenceClassification.from_pretrained('path_to_saved_model')  # Replace 'path_to_saved_model' with the path to your fine-tuned BERT model

def get_bert_predictions(texts):
    # The tokenizer expects a list of strings, so convert a pandas Series or array first
    inputs = tokenizer(list(texts), return_tensors="tf", truncation=True, padding=True, max_length=128)
    outputs = model(inputs['input_ids'], attention_mask=inputs['attention_mask'])
    logits = outputs[0]
    probs = tf.nn.softmax(logits, axis=1)
    predictions = np.argmax(probs, axis=1)  # Convert softmax outputs to class labels
    return predictions

bert_predictions = get_bert_predictions(X_val)  # X_val holds the raw validation texts vectorized above


# Reuse the TfidfVectorizer and the SVM fitted above instead of fitting new ones,
# so the SVM sees the same features as the other base models
svm_model = svm

def get_svm_predictions(texts):
    tfidf_texts = vectorizer.transform(texts)
    predictions = svm_model.predict(tfidf_texts)
    return predictions

svm_predictions = get_svm_predictions(X_val)


# Stacking models
from sklearn.linear_model import LogisticRegression

# The meta-learner is trained on the validation-set predictions of all five base models.
# The column order here must match the order used for X_test_stack.
X_train_stack = np.column_stack((val_preds_rf, val_preds_xgb, val_preds_ada, bert_predictions, svm_predictions))
model_stack = LogisticRegression().fit(X_train_stack, y_val)

# Predictions on test set
test_preds_rf = rf.predict(X_test_tfidf)
test_preds_xgb = xgb.predict(X_test_tfidf)
test_preds_ada = ada.predict(X_test_tfidf)

# BERT and SVM predictions on test set
# (the tokenizer and both models were already loaded during the validation step)

def get_bert_test_predictions(texts):
    inputs = tokenizer(list(texts), return_tensors="tf", truncation=True, padding=True, max_length=128)
    outputs = model(inputs['input_ids'], attention_mask=inputs['attention_mask'])
    logits = outputs[0]
    probs = tf.nn.softmax(logits, axis=1)
    predictions = np.argmax(probs, axis=1)  # Convert softmax outputs to class labels
    return predictions

bert_test_predictions = get_bert_test_predictions(X_test)  # X_test holds the raw test texts


def get_svm_test_predictions(texts):
    tfidf_texts = vectorizer.transform(texts)
    predictions = svm_model.predict(tfidf_texts)
    return predictions

svm_test_predictions = get_svm_test_predictions(X_test)


X_test_stack = np.column_stack((test_preds_rf, test_preds_xgb, test_preds_ada, bert_test_predictions, svm_test_predictions))
y_pred_stack = model_stack.predict(X_test_stack)

# Evaluation
# Evaluate the models using AUC-ROC, precision, recall, F1 score, and accuracy.
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score, accuracy_score

def evaluate_model(y_true, y_pred, y_score=None):
    # AUC-ROC
    # AUC-ROC ranks examples by score, so pass positive-class probabilities as
    # `y_score` when they are available. With hard 0/1 labels only, the value
    # reflects a single decision threshold.
    auc = roc_auc_score(y_true, y_score if y_score is not None else y_pred)

    # Precision
    precision = precision_score(y_true, y_pred)

    # Recall
    recall = recall_score(y_true, y_pred)

    # F1 Score
    f1 = f1_score(y_true, y_pred)

    # Accuracy
    accuracy = accuracy_score(y_true, y_pred)

    return {
        "AUC-ROC": auc,
        "Precision": precision,
        "Recall": recall,
        "F1 Score": f1,
        "Accuracy": accuracy
    }

# Assuming `y_test` contains your true labels and `bert_test_predictions` & `svm_test_predictions` are your model predictions
bert_metrics = evaluate_model(y_test, bert_test_predictions)
svm_metrics = evaluate_model(y_test, svm_test_predictions)

# Print out the metrics
print("BERT Model Evaluation:")
for metric, value in bert_metrics.items():
    print(f"{metric}: {value:.4f}")

print("\nSVM Model Evaluation:")
for metric, value in svm_metrics.items():
    print(f"{metric}: {value:.4f}")

# Stacking models for predictions on test set
X_test_stack = np.column_stack((test_preds_rf, test_preds_xgb, test_preds_ada, bert_test_predictions, svm_test_predictions))
stacked_test_predictions = model_stack.predict(X_test_stack)

# To evaluate
accuracy = accuracy_score(y_test, stacked_test_predictions)
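# AUC-ROC from the meta-learner's spam probability (assumes spam is encoded as 1)
roc_auc = roc_auc_score(y_test, model_stack.predict_proba(X_test_stack)[:, 1])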
precision = precision_score(y_test, stacked_test_predictions)
recall = recall_score(y_test, stacked_test_predictions)
f1 = f1_score(y_test, stacked_test_predictions)

print(f"Accuracy: {accuracy}")
print(f"ROC-AUC: {roc_auc}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
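
The manual stacking above feeds the meta-learner hard 0/1 predictions from a single validation split. As an alternative sketch, and not the approach used in this notebook, scikit-learn's `StackingClassifier` can build the meta-features from out-of-fold probabilities instead. The toy corpus, the `texts` and `labels` names, and the hyperparameters below are assumptions made for illustration, and the BERT and SVM base models are left out to keep the example small.

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration; 1 = spam, 0 = ham
texts = [
    "win a free prize now", "free cash claim your prize", "urgent call now to win cash",
    "claim your free ringtone now", "you won a free holiday call now",
    "free entry win cash prize", "urgent prize waiting claim now", "win free tickets text now",
    "are we meeting for lunch today", "see you at home tonight", "can you call me later",
    "thanks for the lift yesterday", "what time is the meeting", "i will be home late tonight",
    "lunch at noon sounds good", "let me know when you arrive",
]
labels = [1] * 8 + [0] * 8

# Base models are trained on TF-IDF features; the logistic regression
# meta-learner is trained on their out-of-fold spam probabilities
stack = make_pipeline(
    TfidfVectorizer(),
    StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
            ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
            ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
        ],
        final_estimator=LogisticRegression(),
        stack_method="predict_proba",
        cv=3,
    ),
)
stack.fit(texts, labels)

# Probability of the spam class for two unseen messages
new_messages = ["win a free prize now", "see you at lunch"]
spam_scores = stack.predict_proba(new_messages)[:, 1]
print(spam_scores)
```

Because the pipeline exposes `predict_proba`, these scores can be passed straight to `roc_auc_score`, which avoids computing AUC-ROC from thresholded labels.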
