# SPAM vs NOT-SPAM with LightGBM

In this notebook, I use a classical Machine Learning method to classify spam messages. The model is LightGBM, which achieves an F1-Score of 0.96.

### Import stuff

In [1]:
import gc
import re
import scipy
import joblib
import numpy as np
import pandas as pd
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn import model_selection
from tqdm import tqdm_notebook as tqdm
from sklearn import metrics, preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer

from nltk.tokenize import MWETokenizer

gc.enable()

In [2]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

Defining the configurations for training and the LightGBM model.

In [3]:
Config = {
    'num_folds': 6,
    'seed': 541,
    'TARGET': 'target',
}

LGBM_PARAMS = {
    'seed': Config['seed'],
    'objective': 'binary',
    'early_stopping_round': 1000,
    'verbosity': -1,
    'n_estimators': 2000,
}

The following function creates a K-Fold split of the data, stratified on the target variable. For reproducibility, the dataset is shuffled with a fixed seed before the folds are created.

In [4]:
def create_folds(data):
    
    data['kfold'] = -1
    
    kf = model_selection.StratifiedKFold(n_splits=Config['num_folds'], shuffle=True, random_state=Config['seed'])
    for f, (t_, v_) in enumerate(kf.split(X=data, y=data[Config['TARGET']])):
        data.loc[v_, 'kfold'] = f
    
    return data
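To see what stratification buys us, here is a minimal sketch on a made-up frame (the `toy` data and the fold count of 5 are assumptions for illustration, not the notebook's data). It uses the same `StratifiedKFold` pattern as *create_folds* and checks that every fold gets the same share of the minority class.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 60 rows, 10 positive, to mimic class imbalance.
toy = pd.DataFrame({'target': [1] * 10 + [0] * 50})
toy['kfold'] = -1

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (_, valid_idx) in enumerate(skf.split(X=toy, y=toy['target'])):
    # valid_idx holds positional indices; they match the default RangeIndex here.
    toy.loc[valid_idx, 'kfold'] = fold

# Rows per fold and positives per fold.
fold_summary = toy.groupby('kfold')['target'].agg(['size', 'sum'])
print(fold_summary)
```

Because `.loc` is label-based, this pattern relies on the frame having a fresh 0..n-1 index, which is why the notebook calls `reset_index(drop=True)` before creating folds.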

The *vectorize* function returns the Tf-Idf vectors in sparse form.

* I experimented with different *ngram_range* values, but the performance is better with the default (single words only).
* I am not using the *max_features* parameter: the vectors stay in sparse form and the text samples are short, so there is no need to limit the number of features.
* English Stop Words are removed before the vectors are computed, using the *stop_words* parameter of *TfidfVectorizer*.
* Lemmatization does not help: the score dips by 0.02.

In [5]:
def tokenize(text):
    mtokenizer = MWETokenizer()
    mwe = mtokenizer.tokenize(text.split())
    words =[]
    for t in mwe:
        if t.isalpha():
            words.append(t)
    return words

def lemmatize(text): 
    tokens = tokenize(text)
    lemmatized_sentence = []
    
    for token in tokens:
        lemmatized_sentence.append(lemmatizer.lemmatize(token))

    return ' '.join(lemmatized_sentence)

def vectorize(train_text, test_text, use_lemmatizer=False):
    # The flag is named use_lemmatizer so it does not shadow the lemmatize() function above.
    if use_lemmatizer:
        train_text = [lemmatize(text) for text in train_text]
        test_text = [lemmatize(text) for text in test_text]
    
    vectorizer = TfidfVectorizer(stop_words='english')
    
    train_vectors = vectorizer.fit_transform(train_text) # .toarray() 
    test_vectors = vectorizer.transform(test_text) # .toarray()
    
    return train_vectors, test_vectors
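As a quick illustration of the points above, the sketch below fits `TfidfVectorizer(stop_words='english')` on a few invented messages (the example strings are assumptions, not rows from the dataset). It shows that the output is a SciPy sparse matrix, that stop words never enter the vocabulary, and that words seen only at test time are ignored.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical messages, for illustration only.
train_docs = ['win a free prize now', 'are we meeting for lunch', 'free entry to win cash']
test_docs = ['free lunch prize for unicorn']

vec = TfidfVectorizer(stop_words='english')
train_m = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_m = vec.transform(test_docs)        # reuses them; unseen words are dropped

vocab = vec.vocabulary_
print(sparse.issparse(train_m), train_m.shape, test_m.shape)
print(sorted(vocab))
```

Fitting only on the training text keeps the test messages from influencing the vocabulary or the IDF weights.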

* The dataset is highly imbalanced, so F1-Score is a more informative metric than accuracy. The *get_score* function evaluates the predictions and returns the macro-averaged F1-Score. If the *report* parameter is set to *True*, it also prints the Classification Report.
* *lgb_f1_score* is the custom metric function that lets the LightGBM model use its **early_stopping** feature with F1. (There isn't a default option.)
* *train_fn* takes the training data and trains the model for 1 Fold. For example, if Fold 0 is given, the model is trained on Folds 1, 2, 3, 4, 5 and evaluated on the unseen Fold 0. This exercise is repeated for all folds.
* The *run* function loops over all folds, trains on each, and calculates the average score.

In [6]:
def get_score(y_true, y_preds, report=False):
    y_true = [id_to_class[i] for i in y_true]
    y_preds = [id_to_class[i] for i in y_preds]
    
    if report:
        print(metrics.classification_report(y_true, y_preds))  
    return round(metrics.f1_score(y_true, y_preds, average='macro'), 2)

def lgb_f1_score(y_hat, data):
    y_true = data.get_label()
    y_hat = (y_hat > 0.5) * 1 
    return 'f1', get_score(y_true, y_hat), True

def train_fn(train_vectors, train_folds, fold):
    train_indices = train_folds['kfold'] != fold
    valid_indices = train_folds['kfold'] == fold
    
    x_train = train_vectors[train_indices]
    y_train = train_folds[Config['TARGET']][train_indices].values
    
    x_valid = train_vectors[valid_indices]
    y_valid = train_folds[Config['TARGET']][valid_indices].values
        
    lgb_train = lgb.Dataset(x_train, y_train)
    lgb_valid = lgb.Dataset(x_valid, y_valid, reference=lgb_train)
    
    model = lgb.train(LGBM_PARAMS,  lgb_train, valid_sets=[lgb_valid], feval=lgb_f1_score, verbose_eval=False)
    y_preds = model.predict(x_valid)

    y_preds = (y_preds > 0.5) * 1
    
    score = get_score(y_valid, y_preds)
    print(f'Fold: {fold}, Score: {score}')
    
    joblib.dump(model, f'model_{fold}.bin')
    
    del model
    del train_indices, valid_indices
    del x_train, y_train, x_valid, y_valid
    gc.collect()
    
    return score


def run(train_vectors, train_folds): 
    score_avg = 0
    for fold in range(Config['num_folds']):
        score_avg += train_fn(train_vectors, train_folds, fold)
        
    score_avg /= Config['num_folds']
    print('Average Score', round(score_avg, 2))
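To make the case for F1 on imbalanced data concrete, here is a small sketch with invented labels (a 90/10 split is an assumption chosen for easy arithmetic). A classifier that always predicts the majority class looks good on accuracy but poor on macro F1, the averaging used in *get_score*.

```python
from sklearn import metrics

# Hypothetical labels: 90 ham (0) and 10 spam (1).
y_true = [0] * 90 + [1] * 10
# A useless model that predicts ham for everything.
y_pred = [0] * 100

accuracy = metrics.accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted spam class as 0 without a warning.
macro_f1 = metrics.f1_score(y_true, y_pred, average='macro', zero_division=0)

print(f'accuracy: {accuracy:.2f}, macro F1: {macro_f1:.2f}')
```

Macro averaging weights both classes equally, so missing every spam message halves the score even though spam is only a tenth of the data.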

Importing the data. The file is read with *latin-1* encoding because reading it as *utf-8* raises a decoding error: some characters in the text are not valid *utf-8*.

* The dataset has weird column names, so the first step is to rename the columns.
* I have created the *class_to_id* and *id_to_class* dictionaries, which map the target to a numeric encoding and vice-versa.
* I am also creating 2 extra features: *num_words* is the number of words in the text, calculated with the Python *.split()* method, and *num_characters* is the number of characters in the text.

In [7]:
df = pd.read_csv('../input/sms-spam-collection-dataset/spam.csv', encoding='latin-1')\
        .rename(columns={'v1': 'target', 'v2': 'text'})[['target', 'text']]

df['num_words'] = df['text'].apply(lambda x: len(x.split()))
df['num_characters'] = df['text'].apply(lambda x: len(x))

class_to_id = {'ham': 0, 'spam': 1}
id_to_class = {id_: class_ for class_, id_ in class_to_id.items()}

df['target'] = df['target'].map(class_to_id)
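The encoding choice when reading the file can be illustrated with a single string. This is a sketch under an assumption: the notebook does not show which bytes in *spam.csv* fail to decode, so the pound sign below is only a plausible stand-in. A character stored as one *latin-1* byte is not valid on its own in *utf-8*, which is the kind of error `pd.read_csv` would surface.

```python
# Hypothetical message; in latin-1 the pound sign is the single byte 0xA3.
raw = 'Win £1000 cash'.encode('latin-1')

try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    # 0xA3 is a continuation byte in UTF-8 and cannot start a character.
    utf8_ok = False

text = raw.decode('latin-1')  # latin-1 maps every byte to a character, so this cannot fail
print(utf8_ok, text)
```

Since *latin-1* never raises, it is a convenient fallback, but it will silently produce wrong characters if the file was actually written in some other encoding.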

I am splitting the data to hold out 0.15 of the entire dataset for testing the model. These samples are never seen by the model during training. A seed is used for reproducible results. The K-Fold split is computed on the remaining 0.85 of the data. I have chosen K = 6.

In [8]:
TRAIN_DATA, TEST_DATA = model_selection.train_test_split(
    df, 
    test_size=0.15, 
    stratify=df[Config['TARGET']].values,
    random_state=Config['seed'],
)

TRAIN_DATA = TRAIN_DATA.reset_index(drop=True)
TEST_DATA = TEST_DATA.reset_index(drop=True)

TRAIN_DATA = create_folds(TRAIN_DATA)

* Using the *vectorize* function defined above, I will get the sparse form of the train and test vectors.
* Then, since I want to use the *num_words* and *num_characters* features, I will convert them to sparse matrices too.
* I will concatenate the Tf-Idf vectors with the sparse form of the 2 features, column-wise.

In [9]:
train_vectors, test_vectors = vectorize(TRAIN_DATA.text.tolist(), TEST_DATA.text.tolist())

train_sparse = scipy.sparse.csr_matrix(TRAIN_DATA[['num_words', 'num_characters']].values)
test_sparse = scipy.sparse.csr_matrix(TEST_DATA[['num_words', 'num_characters']].values)

train_sparse = scipy.sparse.hstack([train_sparse, train_vectors]).tocsr()
test_sparse = scipy.sparse.hstack([test_sparse, test_vectors]).tocsr()
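The concatenation step can be checked on a tiny example. The matrices below are made up for illustration (two rows, three Tf-Idf columns, two count features); only the `scipy.sparse` calls mirror the cell above.

```python
import numpy as np
from scipy import sparse

# Hypothetical Tf-Idf block: 2 documents x 3 terms.
tfidf = sparse.csr_matrix(np.array([[0.0, 0.5, 0.0],
                                    [0.7, 0.0, 0.2]]))
# Hypothetical dense features: num_words and num_characters per document.
counts = np.array([[4, 20],
                   [6, 31]])

# hstack returns COO format, so convert back to CSR for row indexing.
combined = sparse.hstack([sparse.csr_matrix(counts), tfidf]).tocsr()

print(combined.shape)
print(combined.toarray())
```

The conversion back to CSR matters here because *train_fn* slices rows with a boolean mask, which the COO format returned by `hstack` does not support.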

Let the training begin!

The model has achieved an Average Cross-Validation F1-Score of 0.96.

In [10]:
run(train_sparse, TRAIN_DATA)



Fold: 0, Score: 0.96
Fold: 1, Score: 0.95
Fold: 2, Score: 0.97
Fold: 3, Score: 0.95
Fold: 4, Score: 0.96
Fold: 5, Score: 0.95
Average Score 0.96


Inference Time.

* I will load the trained models, calculate the probabilities and keep adding them to *test_preds*.
* Dividing the sum by *num_folds* gives the average predicted probabilities.
* When these predictions are evaluated, the F1-Score is 0.96.

**This means that the CV score is the same as the Test Score. That's a good sign: it suggests the model is not overfitting and performs well on unseen data.**

In [11]:
model_path = 'model_{}.bin'
test_preds = np.zeros((TEST_DATA.shape[0]))

for fold in range(Config['num_folds']):
    model = joblib.load(model_path.format(fold))    
    test_preds += model.predict(test_sparse)

test_preds /= Config['num_folds']
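The averaging above is plain arithmetic on probability arrays. Here is a minimal sketch with invented per-fold outputs (three hypothetical models, three test messages) instead of the saved LightGBM models:

```python
import numpy as np

# Hypothetical spam probabilities from 3 fold models for 3 test messages.
fold_probs = [
    np.array([0.9, 0.2, 0.60]),
    np.array([0.8, 0.1, 0.30]),
    np.array([0.7, 0.3, 0.45]),
]

avg = np.zeros(3)
for probs in fold_probs:
    avg += probs           # accumulate, as in the loop above
avg /= len(fold_probs)     # mean probability per message

classes = (avg > 0.5).astype(int)
print(avg, classes)
```

Averaging probabilities before thresholding lets a confident model outweigh a hesitant one, which a simple majority vote over per-model classes would not do.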

In [12]:
test_classes = (test_preds > 0.5) * 1
print('Test F1-Score:', get_score(TEST_DATA[Config['TARGET']], test_classes, report=True))

              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       724
        spam       0.97      0.88      0.93       112

    accuracy                           0.98       836
   macro avg       0.98      0.94      0.96       836
weighted avg       0.98      0.98      0.98       836

Test F1-Score: 0.96
