# Introduction
In this notebook, we'll develop a classifier that distinguishes between "ham" (legitimate) and "spam" (unsolicited) SMS messages. Accurate message classification underpins applications such as email and SMS filtering.

## Import dependencies

In [45]:
# Standard library
import os
import re

# Data handling and plotting
import numpy as np
import pandas as pd
import seaborn as sns

# Text preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# Feature extraction, models, and evaluation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             make_scorer, precision_score, recall_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.utils import class_weight

# Download the stopword list before loading it
# (WordNetLemmatizer additionally needs the 'wordnet' corpus: nltk.download('wordnet'))
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/p/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

## Data acquisition

In [10]:
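data_dir = './smsspamcollection/'
dfs = []
for file in os.listdir(data_dir):
  # Each file is tab-separated with no header row: label first, then the message text
  dfs.append(pd.read_csv(os.path.join(data_dir, file), sep='\t', dtype=str,
                         names=["label", "message"]))

# Stack all files into one frame with a fresh 0..n-1 index
messages_df = pd.concat(dfs, axis=0).reset_index(drop=True)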

messages_df

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
11141,spam,"""This is the 2nd time we have tried 2 contac..."
11142,ham,"Will �_ b going to esplanade fr home?,,,"
11143,ham,"""Pity, * was in mood for that. So...any othe..."
11144,ham,The guy did some bitching but I acted like i...


## Validate dataset 
Let's check whether the dataset contains any missing or invalid values.

In [11]:
display(messages_df['label'].value_counts())
print("check missing values in messages")
display(messages_df[messages_df['message'].isnull() ])

label
ham     9652
spam    1494
Name: count, dtype: int64

check missing values in messages


Unnamed: 0,label,message


## Preprocessing and Feature engineering

### Data Preprocessing
Before building the classifier, we'll perform data cleaning and preprocessing steps. This includes tasks such as:

- Mapping labels to binary values (0 for "ham", 1 for "spam").
- Discarding messages that contain no alphanumeric characters.
- Cleaning and transforming text data, including removing special characters, converting to lowercase, tokenization, removing stopwords, and lemmatization.
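
To make the steps above concrete, here is a minimal sketch of the label mapping, the alphanumeric check, and the text cleaning applied to a single message. The stopword set `DEMO_STOPWORDS` and the helper names are hypothetical stand-ins chosen for illustration; the notebook itself uses NLTK's English stopword list plus a few SMS abbreviations, and also lemmatizes each token, which this sketch leaves out so that it runs without any NLTK downloads.

```python
import re

# Hypothetical, tiny stopword set for illustration only.
DEMO_STOPWORDS = {"u", "c", "so", "then", "say", "already"}

# Binary label mapping used throughout the notebook.
label_map = {"ham": 0, "spam": 1}


def has_alphanumeric(text):
    """True if the message contains at least one ASCII letter or digit."""
    return bool(re.search(r"[a-zA-Z0-9]", text))


def clean_message(text, stopwords=DEMO_STOPWORDS):
    """Replace non-alphanumerics with spaces, lowercase, tokenize, drop stopwords."""
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    tokens = text.lower().split()
    return " ".join(tok for tok in tokens if tok not in stopwords)


message = "U dun say so early hor... U c already then say..."
print(label_map["ham"], has_alphanumeric(message), clean_message(message))
print(has_alphanumeric(":-)"))  # a message like this would be discarded
```

Lowercasing happens before the stopword check, so "U" and "u" are treated alike.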

### Feature Engineering
We'll convert the cleaned text into numeric features suitable for machine learning algorithms. We use TF-IDF (Term Frequency-Inverse Document Frequency), which weights each term by how often it occurs in a message and how rare it is across the whole corpus.
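
As a rough illustration of what TF-IDF computes, the sketch below builds the weights by hand for a three-message toy corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which is our understanding of `TfidfVectorizer`'s default settings; the function name `tfidf` and the toy corpus are made up for this example, and tokenization here is a plain whitespace split rather than scikit-learn's tokenizer.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalized rows."""
    docs = [doc.split() for doc in corpus]
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        # Scale each row to unit length; guard against an all-zero row.
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows


corpus = ["free entry win cup", "ok lar joking", "free prize win"]
vocab, rows = tfidf(corpus)
for term, weight in zip(vocab, rows[0]):
    if weight:
        print(f"{term}: {weight:.3f}")
```

In the first message, "cup" and "entry" receive a higher weight than "free" and "win", because the latter two also appear in another message.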

In [12]:
def data_preprocessing(df):
  # Work on a copy so the caller's DataFrame is left untouched
  df = df.copy()

  # Map labels to 0 (ham) and 1 (spam)
  df['label'] = df['label'].map({'ham': 0, 'spam': 1})

  # Discard messages that contain no alphabetic or numeric characters
  has_alphanumeric = df['message'].apply(lambda x: bool(re.search(r'[a-zA-Z0-9]', str(x))))
  df = df[has_alphanumeric].copy()

  # Message cleaning (WordNetLemmatizer needs the 'wordnet' corpus: nltk.download('wordnet'))
  STOPWORDS = set(stopwords.words('english') + ['u', 'ü', 'ur', '4', '2', 'im', 'dont', 'doin', 'ure'])
  lemmatizer = WordNetLemmatizer()

  def clean(message):
    # Replace non-alphanumeric characters with spaces, lowercase, and tokenize
    words = re.sub(r'[^a-zA-Z0-9]', ' ', str(message)).lower().split()
    # Drop stopwords and lemmatize the remaining tokens
    return ' '.join(lemmatizer.lemmatize(word) for word in words if word not in STOPWORDS)

  df['cleaned_message'] = df['message'].apply(clean)

  # Feature engineering: TF-IDF over the cleaned messages
  tfidf = TfidfVectorizer()
  features = tfidf.fit_transform(df['cleaned_message']).toarray()
  return features, df.label

features, labels = data_preprocessing(messages_df)



In [13]:
def data_splitting(features,labels):
  X_train, X_test, y_train, y_test = train_test_split(features,labels, test_size = 0.20, random_state = 0)
  return X_train, X_test, y_train, y_test

## Training & Evaluation
### Model Building
We'll train and evaluate a `Multinomial Naive Bayes` model and handle the imbalance between the "ham" and "spam" classes by setting its class priors explicitly. The priors are derived from balanced class weights, with an extra hand-picked factor of 5 for "spam". This raises the prior of the minority "spam" class above its empirical frequency, so the model is less inclined to default to "ham".
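
The class-prior arithmetic used by the `Model` class below can be worked through by hand. This sketch assumes the label counts printed by `value_counts()` earlier (9652 ham, 1494 spam); the notebook computes its weights after discarding messages without alphanumeric characters, so the priors it prints differ slightly. It also assumes that "balanced" weights follow `n_samples / (n_classes * class_count)` and that `MultinomialNB` uses `class_prior` as given, without renormalizing, which is our reading of scikit-learn's behavior.

```python
# Label counts reported by value_counts() earlier in the notebook.
counts = {0: 9652, 1: 1494}  # 0 = ham, 1 = spam
n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' weight: n_samples / (n_classes * number of samples in the class).
balanced = {c: n_samples / (n_classes * k) for c, k in counts.items()}

# The notebook inverts the weights and scales spam by a hand-picked factor of 5.
scale = {0: 1, 1: 5}
priors = {c: scale[c] / balanced[c] for c in counts}

# Prediction only compares classes, so the relative share is what matters.
total = sum(priors.values())
share = {c: p / total for c, p in priors.items()}
empirical = {c: k / n_samples for c, k in counts.items()}

print("balanced weights:", balanced)
print("priors passed to the model:", priors)
print("relative prior share:", share)
print("empirical class frequency:", empirical)
```

With these counts, the spam share of the prior comes out far higher than spam's empirical frequency, which is the intended effect of the adjustment.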

### Model Evaluation
We'll evaluate the model using accuracy, precision, recall, F1-score, and the confusion matrix. Precision, recall, and F1-score are reported as support-weighted averages over both classes. We'll then tune the smoothing parameter `alpha` with a 5-fold cross-validated grid search, selecting the value with the best weighted precision.

In [59]:
class Model:
    def __init__(self, features, labels) -> None:
        self.X_train, self.X_test, self.y_train, self.y_test = data_splitting(features, labels)
        
        # Class labels: 0 = ham, 1 = spam
        classes = np.array([0, 1])
        
        # Balanced weights: n_samples / (n_classes * class count)
        class_weights = class_weight.compute_class_weight('balanced', classes=classes, y=labels)
        # Invert the weights and scale spam by a hand-picked factor of 5 to get the priors
        self.class_weights = {0: 1/class_weights[0], 1: 5/class_weights[1]}
        
        # Initialize the model with the adjusted class priors
        self.model = MultinomialNB(class_prior=[self.class_weights[0], self.class_weights[1]])
        
    def train(self):
        self.model = self.model.fit(self.X_train, self.y_train)
    
    def evaluate(self):
        # Score the current model on the held-out test set
        y_pred = self.model.predict(self.X_test)
        accuracy = accuracy_score(y_true=self.y_test, y_pred=y_pred)
        precision = precision_score(y_true=self.y_test, y_pred=y_pred, average='weighted')
        recall = recall_score(y_true=self.y_test, y_pred=y_pred, average='weighted')
        f1 = f1_score(y_true=self.y_test, y_pred=y_pred, average='weighted')
        confusion_m = confusion_matrix(y_true=self.y_test, y_pred=y_pred)    
        
        return {
            'confusion_matrix': confusion_m,
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1_score': f1
        }

    def hyperparameter_tuning(self, params):
        # Define scoring metrics (support-weighted averages over both classes)
        scoring = {'accuracy': make_scorer(accuracy_score),
                    'precision': make_scorer(precision_score, average='weighted'),
                    'recall': make_scorer(recall_score, average='weighted'),
                    'f1_score': make_scorer(f1_score, average='weighted'),
                    }
        
        # Perform grid search with 5-fold cross-validation over the grid passed in
        # (previously this read the global param_grid instead of the argument)
        grid_search = GridSearchCV(estimator=self.model,
                                    param_grid=params,
                                    scoring=scoring,
                                    refit='precision',
                                    cv=5,
                                    return_train_score=True)
        
        grid_search.fit(self.X_train, self.y_train)
        
        # Get best parameters
        self.best_params = grid_search.best_params_
        
        # Keep the estimator refit on the training set with the best parameters
        self.model = grid_search.best_estimator_

        return self.best_params,self.model
    
model = Model(features, labels)
print("Train and evaluate a simple model:\n")
model.train()
for key, value in model.evaluate().items():
    if key == 'confusion_matrix':
        print(key," : \n",value)
    else:
        print(key," : ",value)


print("\n\nhyperparameter tuning:\n")
param_grid = {'alpha': [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]}
best_params = model.hyperparameter_tuning(param_grid)
print('best parameters are:', best_params)
for key, value in model.evaluate().items():
    if key == 'confusion_matrix':
        print(key," : \n",value)
    else:
        print(key," : ",value)


Train and evaluate a simple model:

confusion_matrix  : 
 [[1878   57]
 [   6  288]]
accuracy  :  0.971736204576043
precision  :  0.9754435471396843
recall  :  0.971736204576043
f1_score  :  0.9726753811792016


hyperparameter tuning:

best parameters are: ({'alpha': 0.1}, MultinomialNB(alpha=0.1, class_prior=[1.7318014540885018, 1.3409927295574904]))
confusion_matrix  : 
 [[1893   42]
 [   3  291]]
accuracy  :  0.9798115746971736
precision  :  0.9819906729736021
recall  :  0.9798115746971736
f1_score  :  0.9803366841921854
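
To see how the reported scores follow from the confusion matrix, the sketch below recomputes accuracy and the support-weighted precision and recall from the tuned model's matrix above. It assumes the usual scikit-learn layout, with rows as true classes and columns as predicted classes; the variable names are ours.

```python
# Confusion matrix reported after tuning: rows = true class, columns = predicted class.
cm = [[1893, 42],
      [3, 291]]
total = sum(sum(row) for row in cm)

# Accuracy: correct predictions (the diagonal) over all test messages.
accuracy = (cm[0][0] + cm[1][1]) / total

# Per-class precision: correct predictions / all predictions of that class (column sum).
precision = [cm[c][c] / (cm[0][c] + cm[1][c]) for c in (0, 1)]
# Per-class recall: correct predictions / all true members of that class (row sum).
recall = [cm[c][c] / sum(cm[c]) for c in (0, 1)]

# 'weighted' averaging weights each class by its support (row sum).
support = [sum(row) for row in cm]
weighted_precision = sum(p * s for p, s in zip(precision, support)) / total
weighted_recall = sum(r * s for r, s in zip(recall, support)) / total

print("accuracy:", accuracy)
print("per-class precision (ham, spam):", precision)
print("per-class recall (ham, spam):", recall)
print("weighted precision:", weighted_precision)
print("weighted recall:", weighted_recall)
```

The weighted recall always equals the accuracy, which is why the two values printed above are identical. The weighted averages are also dominated by the large "ham" class: spam precision on its own is about 0.874 (42 ham messages flagged as spam), while spam recall is about 0.990.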
