<a href="https://colab.research.google.com/github/mkaramib/NLP/blob/main/Classification/spam_classifier_pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spam Classifier 
In this Colab notebook, a spam classifier is implemented using BERT and PyTorch.

## Libraries

In [None]:
# install transformers
!pip install transformers==3.0.0

In [3]:
# import libraries
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import transformers
from transformers import AutoModel, BertTokenizerFast
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import AdamW   # hugging face transformers
from sklearn.utils.class_weight import compute_class_weight


# use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Data
As in most ML applications, the first step is to load the data set. We use a spam data set stored in a CSV file.

In [None]:
# load data in data-frame
spam_df = pd.read_csv("./sample_data/spams_dataset.csv")
#spam_df.head()
spam_df.shape

In [None]:
# check the balance of data
spam_df['label'].value_counts(normalize =True)

### Data Split
In this section, we split the data into train (70%), validation (15%), and test (15%) sets using the sklearn library.

In [6]:
# step 1: split into train and non-train data-sets.
train_text, other_text, train_labels, other_labels = train_test_split(spam_df['text'], spam_df['label'], 
                                                                    random_state=2020, 
                                                                    test_size=0.3, 
                                                                    stratify=spam_df['label'])

# step 2: split the remaining data into validation and test sets
val_text, test_text, val_labels, test_labels = train_test_split(other_text, other_labels, 
                                                                random_state=2020, 
                                                                test_size=0.5, 
                                                                stratify=other_labels)
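
As a rough check of the two-step split above, the sketch below works out the resulting set sizes. The dataset size `n_rows` is a hypothetical value chosen for illustration, and the exact counts returned by `train_test_split` may differ slightly because of its own rounding.

```python
# Hypothetical dataset size, for illustration only.
n_rows = 1000

# Step 1: 30% of the rows are held out from training (test_size=0.3).
n_other = round(n_rows * 0.3)
n_train = n_rows - n_other

# Step 2: the held-out rows are halved into validation and test sets (test_size=0.5).
n_test = round(n_other * 0.5)
n_val = n_other - n_test

print(n_train, n_val, n_test)
print(n_train / n_rows, n_val / n_rows, n_test / n_rows)
```

Because both calls pass `stratify`, each of the three sets should keep roughly the same spam/ham ratio as the full data set.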

## BERT Model
In this section, we load the pre-trained BERT model and its tokenizer.


### Import BERT


In [None]:
# import BERT-base pretrained model
bert = AutoModel.from_pretrained('bert-base-uncased')

# Load the BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

### Tokenization
In this section, we show how to use BERT for tokenization.

#### Samples
In this section, we show the tokenization for sample sentences.

In [None]:
# sample data
samples = ["this is a bert model tutorial.", "we will fine-tune a bert model.", "bert model is the most well known models in NLP."]

# encode text
samples_ids = tokenizer.batch_encode_plus(samples, padding=True, return_token_type_ids=False)

# print ids
print(samples_ids)
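
To make the printed `input_ids` and `attention_mask` easier to read, here is a minimal sketch of what padding to the longest sequence means. The helper `pad_batch` and the token ids in `toy_batch` are hypothetical; they only imitate the shape of the tokenizer output. The padding id is assumed to be 0, which is the `[PAD]` id of `bert-base-uncased`.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every id list to the length of the longest one and build masks."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids; 101 and 102 stand for [CLS] and [SEP].
toy_batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
encoded = pad_batch(toy_batch)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The shorter sequence is extended with zeros, and its mask tells BERT which positions carry real tokens.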

#### Max Length sequence
To choose a padding length, we first look at the length distribution of the training samples, measured in whitespace-separated words.

In [None]:
# get length of all the messages in the train set
train_lengths = [len(sample.split()) for sample in train_text]

# plot the histogram of the lengths of the training samples
pd.Series(train_lengths).hist(bins = 40)

# find actual max length
max_l = max(train_lengths)
print(max_l)

Although the longest training sample has 125 words, we can pad to a smaller, pre-defined maximum length instead.
If no maximum length is defined, every sequence is padded to the length of the longest sequence in the batch.

In [10]:
# define a max length for padding
max_seq_lenght = 30
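
One way to judge a candidate maximum length is to check what share of the samples would fit without truncation. The sketch below uses made-up lengths; `coverage` and `toy_lengths` are hypothetical names, and in this notebook the same idea could be applied to `train_lengths`. Note that `train_lengths` counts words, while BERT works on WordPiece tokens plus the special `[CLS]` and `[SEP]` tokens, so token counts are usually somewhat higher.

```python
def coverage(lengths, max_len):
    """Return the fraction of samples whose length is at most max_len."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

# Made-up sample lengths, for illustration only.
toy_lengths = [5, 12, 18, 22, 25, 28, 30, 41, 60, 125]
share = coverage(toy_lengths, 30)
print(f"{share:.0%} of the samples fit within 30 tokens")
```

A value close to 100% suggests that little information is lost by truncating at that length.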

#### Data Set Tokenization
In this step, we tokenize the training, validation, and test sets.

In [11]:
# tokenize and encode sequences in the training set
train_tokens = tokenizer.batch_encode_plus(
    train_text.tolist(),
    #max_length = max_seq_lenght,
    padding =True, # or padding = 'longest'
    # padding = 'max_length'   # when using the pre-defined max length
    return_token_type_ids=False
)

# tokenize and encode sequences in the validation set
val_tekens = tokenizer.batch_encode_plus(
    val_text.tolist(),
    padding =True, # or padding = 'longest'
    return_token_type_ids=False
)

# tokenize and encode sequences in the test set
test_tokens = tokenizer.batch_encode_plus(
    test_text.tolist(),
    padding =True, # or padding = 'longest'
    return_token_type_ids=False
)

In [None]:
train_tokens[0]

### Build Tensors
After tokenization, we convert the resulting integer sequences (token ids and attention masks) and the labels into tensors.

In [13]:
# Build tensors of train data
train_seq = torch.tensor(train_tokens['input_ids'])
train_mask = torch.tensor(train_tokens['attention_mask'])
train_y = torch.tensor(train_labels.tolist())

# Build tensors of validation data
val_seq = torch.tensor(val_tekens['input_ids'])
val_mask = torch.tensor(val_tekens['attention_mask'])
val_y = torch.tensor(val_labels.tolist())

# Build tensors of test data
test_seq = torch.tensor(test_tokens['input_ids'])
test_mask = torch.tensor(test_tokens['attention_mask'])
test_y = torch.tensor(test_labels.tolist())

### Build Data Loaders


In [14]:
#define a batch size
batch_size = 32

# wrap tensors
train_data = TensorDataset(train_seq, train_mask, train_y)

# sampler for sampling the data during training
train_sampler = RandomSampler(train_data)

# dataLoader for train set
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# wrap tensors
val_data = TensorDataset(val_seq, val_mask, val_y)

# sampler for iterating over the validation data in order
val_sampler = SequentialSampler(val_data)

# dataLoader for validation set
val_dataloader = DataLoader(val_data, sampler = val_sampler, batch_size=batch_size)
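
The number of batches a loader yields per epoch matters later, because the average loss is divided by `len(train_dataloader)`. With the default `drop_last=False`, the last batch may be smaller than the others. The sketch below shows the arithmetic; `n_samples` is a hypothetical value.

```python
import math

n_samples = 700   # hypothetical number of training samples
batch_size = 32   # same batch size as in the cell above

# With drop_last=False, a partial final batch is kept.
n_batches = math.ceil(n_samples / batch_size)
last_batch_size = n_samples - (n_batches - 1) * batch_size
print(n_batches, last_batch_size)
```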

### BERT Fine Tune
In this section, we follow the steps required to fine-tune BERT:

1.   Freezing BERT parameters
2.   Define Model Architecture
3.   Find Class Weights
4.   Fine-Tune process
5.   Train the model



#### Step 1: Freezing BERT parameters

In [15]:
# freeze all the parameters
for param in bert.parameters():
    param.requires_grad = False
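
To see the effect of freezing, it can help to count trainable and frozen parameters. The sketch below uses two small `nn.Linear` layers as a stand-in for BERT and the classification head; `encoder`, `head`, and `toy_model` are hypothetical names and not part of this notebook.

```python
import torch.nn as nn

# Stand-in for a pre-trained encoder followed by a new classification head.
encoder = nn.Linear(8, 4)
head = nn.Linear(4, 2)
toy_model = nn.Sequential(encoder, head)

# Freeze the encoder only, in the same way BERT is frozen above.
for param in encoder.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in toy_model.parameters() if not p.requires_grad)
print(f"trainable: {trainable}, frozen: {frozen}")
```

Only the head receives gradient updates, which keeps training fast at the cost of not adapting the encoder itself.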

#### Step 2: Define Model Architecture

In [16]:
class BERT_Arch(nn.Module):

    def __init__(self, bert):
      
      super(BERT_Arch, self).__init__()

      self.bert = bert 
      
      # dropout layer
      self.dropout = nn.Dropout(0.1)
      
      # relu activation function
      self.relu =  nn.ReLU()

      # dense layer 1
      self.fc1 = nn.Linear(768,512)
      
      # dense layer 2 (Output layer)
      self.fc2 = nn.Linear(512,2)

      # log-softmax activation (its output pairs with nn.NLLLoss)
      self.softmax = nn.LogSoftmax(dim=1)

    #define the forward pass
    def forward(self, sent_id, mask):

      # pass the inputs to BERT; transformers 3.x returns a tuple of
      # (sequence output, pooled [CLS] output)
      _, cls_hs = self.bert(sent_id, attention_mask=mask)
      
      x = self.fc1(cls_hs)

      x = self.relu(x)

      x = self.dropout(x)

      # output layer
      x = self.fc2(x)
      
      # apply log-softmax activation
      x = self.softmax(x)

      return x

In [17]:
# pass the pre-trained BERT to the architecture defined above
model = BERT_Arch(bert)

# push the model to GPU
model = model.to(device)

# define the optimizer
optimizer = AdamW(model.parameters(), lr = 1e-3)

In [None]:
print(model)

#### Step 3: Find Class Weights

In [19]:
#compute the class weights
# keyword arguments are required by recent scikit-learn versions
class_wts = compute_class_weight(class_weight='balanced', classes=np.unique(train_labels), y=train_labels)

print(class_wts)

[0.57743559 3.72848948]
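
The `'balanced'` option weights each class by `n_samples / (n_classes * class_count)`, so the rarer class gets the larger weight. The sketch below reproduces that formula in plain Python on made-up labels; `balanced_class_weights` and `toy_labels` are hypothetical names.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * counts[cls]) for cls in sorted(counts)}

# Made-up labels: 8 ham (0) and 2 spam (1).
toy_labels = [0] * 8 + [1] * 2
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)
```

In the output above, the spam class is weighted about 3.73 and the ham class about 0.58, which reflects the imbalance of the training labels.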


In [20]:
# convert class weights to tensor
weights= torch.tensor(class_wts,dtype=torch.float)
weights = weights.to(device)

# loss function
cross_entropy  = nn.NLLLoss(weight=weights) 

# number of training epochs
epochs = 10
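
Because the model ends with `LogSoftmax`, it returns log-probabilities, and `NLLLoss` then picks the negative log-probability of the true class. The plain-Python sketch below illustrates that computation with class weights. It is an illustration under our understanding of the weighted mean reduction, not PyTorch's implementation, and all names and values in it are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax of one row of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def weighted_nll(log_probs, targets, class_weights):
    """Weighted mean of the negative log-probabilities of the true classes."""
    total = sum(-class_weights[t] * row[t] for row, t in zip(log_probs, targets))
    return total / sum(class_weights[t] for t in targets)

# Made-up raw scores: the first sample is classified well, the second poorly.
toy_logits = [[2.0, 0.0], [1.0, 0.0]]
toy_targets = [0, 1]
toy_log_probs = [log_softmax(row) for row in toy_logits]

# With equal weights this is the plain mean negative log-likelihood.
loss_equal = weighted_nll(toy_log_probs, toy_targets, [1.0, 1.0])
# Upweighting class 1 gives its (badly predicted) sample more influence.
loss_weighted = weighted_nll(toy_log_probs, toy_targets, [0.5, 3.0])
print(loss_equal, loss_weighted)
```

The weighted loss is larger here because the misclassified sample belongs to the upweighted class, which is the intended effect for the rare spam class.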

#### Step 4: Fine-Tune process

In [21]:
# function to train the model
def train():
  
  model.train()

  total_loss = 0
  
  # empty list to save model predictions
  total_preds=[]
  
  # iterate over batches
  for step,batch in enumerate(train_dataloader):
    
    # progress update after every 50 batches.
    if step % 50 == 0 and step != 0:
      print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(train_dataloader)))

    # push the batch to gpu
    batch = [r.to(device) for r in batch]
 
    sent_id, mask, labels = batch

    # clear previously calculated gradients 
    model.zero_grad()        

    # get model predictions for the current batch
    preds = model(sent_id, mask)

    # compute the loss between actual and predicted values
    loss = cross_entropy(preds, labels)

    # add on to the total loss
    total_loss = total_loss + loss.item()

    # backward pass to calculate the gradients
    loss.backward()

    # clip the gradient norm to 1.0, which helps prevent the exploding-gradient problem
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    # update parameters
    optimizer.step()

    # the predictions live on the training device, so move them to the CPU
    preds=preds.detach().cpu().numpy()

    # append the model predictions
    total_preds.append(preds)

  # compute the training loss of the epoch
  avg_loss = total_loss / len(train_dataloader)
  
  # total_preds is a list with one array per batch, each of shape (batch size, no. of classes).
  # concatenate them into a single array of shape (number of samples, no. of classes)
  total_preds  = np.concatenate(total_preds, axis=0)

  #returns the loss and predictions
  return avg_loss, total_preds
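
The gradient clipping step in `train()` rescales all gradients together so that their joint L2 norm does not exceed the given maximum. The sketch below shows the idea in plain Python on a flat list of values. `clip_by_global_norm` is a hypothetical helper, and details of the PyTorch implementation, such as numerical safeguards, are left out.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradient values so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]

toy_grads = [3.0, 4.0]  # joint L2 norm is 5.0
clipped = clip_by_global_norm(toy_grads, 1.0)
print(clipped)
```

The direction of the update is preserved; only its length is reduced, and gradients that are already small enough are left unchanged.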

In [22]:
# function for evaluating the model
def evaluate():
  
  print("\nEvaluating...")
  
  # deactivate dropout layers
  model.eval()

  total_loss = 0
  
  # empty list to save the model predictions
  total_preds = []

  # iterate over batches
  for step,batch in enumerate(val_dataloader):
    
    # Progress update every 50 batches.
    if step % 50 == 0 and step != 0:
      print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(val_dataloader)))

    # push the batch to gpu
    batch = [t.to(device) for t in batch]

    sent_id, mask, labels = batch

    # deactivate autograd
    with torch.no_grad():
      
      # model predictions
      preds = model(sent_id, mask)

      # compute the validation loss between actual and predicted values
      loss = cross_entropy(preds,labels)

      total_loss = total_loss + loss.item()

      preds = preds.detach().cpu().numpy()

      total_preds.append(preds)

  # compute the validation loss of the epoch
  avg_loss = total_loss / len(val_dataloader) 

  # reshape the predictions in form of (number of samples, no. of classes)
  total_preds  = np.concatenate(total_preds, axis=0)

  return avg_loss, total_preds

#### Step 5: Train the model

In [None]:
# set initial loss to infinite
best_valid_loss = float('inf')

# empty lists to store training and validation loss of each epoch
train_losses=[]
valid_losses=[]

#for each epoch
for epoch in range(epochs):
     
    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))
    
    #train model
    train_loss, _ = train()
    
    #evaluate model
    valid_loss, _ = evaluate()
    
    #save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')
    
    # append training and validation loss
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)
    
    print(f'\nTraining Loss: {train_loss:.3f}')
    print(f'Validation Loss: {valid_loss:.3f}')

### Load Trained Model

In [None]:
# load the weights of the best model onto the current device
path = 'saved_weights.pt'
model.load_state_dict(torch.load(path, map_location=device))

### Evaluate Model
In this section, the trained model will be evaluated.

In [25]:
# get predictions for test data (dropout disabled, no gradient tracking)
model.eval()
with torch.no_grad():
  preds = model(test_seq.to(device), test_mask.to(device))
  preds = preds.detach().cpu().numpy()

The following cell reports precision, recall, and F1-score per class on the test set.

In [None]:
# model's performance
preds = np.argmax(preds, axis = 1)
print(classification_report(test_y, preds))
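
To show what the two lines above do, the sketch below turns rows of log-probabilities into class labels and computes precision and recall for the spam class by hand. The helpers `argmax_rows` and `precision_recall` and all values are hypothetical; `classification_report` reports these figures for every class.

```python
def argmax_rows(rows):
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]

def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up log-probabilities for four messages and their true labels.
toy_log_probs = [[-0.1, -2.4], [-1.9, -0.2], [-0.3, -1.4], [-2.0, -0.1]]
toy_true = [0, 1, 1, 1]

toy_pred = argmax_rows(toy_log_probs)
precision, recall = precision_recall(toy_true, toy_pred)
print(toy_pred, precision, recall)
```

Here every message flagged as spam really is spam, but one spam message is missed, so precision is higher than recall.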

The following code prints the confusion matrix.

In [None]:
# confusion matrix
pd.crosstab(test_y.numpy(), preds, rownames=['actual'], colnames=['predicted'])