# Classify FLS with DistilBERT

This notebook trains a DistilBERT-based classifier for FLS. The input data is the output of the notebook ``01_extract_sentences``. The notebook is organized as follows:

**Step 1:** Prepare the datasets: tokenize the text, balance the classes, convert the data to tensors, and load it in batches.

**Step 2:** Train the FLS classifier.

## Import packages

In [2]:
import pandas as pd
import numpy as np
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, AdamW
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader
import time
from tqdm import tqdm
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE


In [3]:
# If there's a GPU available...
if torch.cuda.is_available():        
    # Tell PyTorch to use the GPU    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 4 GPU(s) available.
We will use the GPU: Tesla V100-SXM2-32GB


## Import data

In [4]:
# Load train data
df = pd.read_csv('../../data/01_interim/train.csv')

# Report the number of sentences
print('Number of training sentences: {:,}\n'.format(df.shape[0]))

Number of training sentences: 294,202



## Train validation split

In [5]:
# Define some variables
random_state = 42

# Split data into train and validation sets
train_text, validation_text, train_labels, validation_labels = train_test_split(df['Sentence'], df['Label'], 
                                                            random_state=random_state, test_size=0.1)


In [6]:
print(f"Number of training data: {len(train_text)} sentences")
print(f"Number of validation data: {len(validation_text)} sentences")

Number of training data: 264781 sentences
Number of validation data: 29421 sentences
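
The sizes above follow from how a fractional `test_size` is rounded: as far as I know, scikit-learn rounds the test share up and assigns the remaining samples to training. The sketch below reproduces that arithmetic for the 294,202 sentences loaded earlier. It is only a sanity check under that assumption, not part of the pipeline.

```python
import math

n_samples = 294_202   # number of training sentences reported above
test_size = 0.1       # fraction passed to train_test_split

# Assumption: the fractional test size is rounded up, the rest goes to training
n_validation = math.ceil(test_size * n_samples)
n_train = n_samples - n_validation

print(n_train, n_validation)
```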


## Prepare dataset

### Tokenize text

In [7]:
# Load the DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', do_lower_case=True)

# Tokenize all of the sentences and map the tokens to their token IDs
train_text_encoded = [tokenizer.encode(sent, 
                                 add_special_tokens=True, # Add '[CLS]' and '[SEP]'
                                 padding='max_length', # Pad & truncate all sentences to set max length.
                                 truncation=True,
                                 max_length=512, # max length is set to 512 (max length of BERT model) to capture as much information as possible
                                 ) for sent in train_text]
validation_text_encoded = [tokenizer.encode(sent,
                                    add_special_tokens=True, # Add '[CLS]' and '[SEP]'
                                    padding='max_length', # Pad & truncate all sentences.
                                    truncation=True,
                                    max_length=512,
                                    ) for sent in validation_text]
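
Because every sequence is padded to 512 tokens, most positions in a typical sentence are padding. A model can be told to ignore those positions through an attention mask, which this notebook does not build. The sketch below shows one way to derive such a mask from already padded token IDs, assuming the padding ID is 0 (the `[PAD]` ID of `distilbert-base-uncased`, also available as `tokenizer.pad_token_id`). The toy IDs and the helper name `build_attention_mask` are illustrative only.

```python
import numpy as np

PAD_TOKEN_ID = 0  # assumption: [PAD] id of distilbert-base-uncased


def build_attention_mask(input_ids, pad_token_id=PAD_TOKEN_ID):
    """Return 1 for real tokens and 0 for padding positions."""
    input_ids = np.asarray(input_ids)
    return (input_ids != pad_token_id).astype(np.int64)


# Toy batch: two "sentences" padded to length 6 (101 = [CLS], 102 = [SEP])
toy_ids = [
    [101, 2057, 5987, 102, 0, 0],
    [101, 3930, 102, 0, 0, 0],
]
toy_mask = build_attention_mask(toy_ids)
print(toy_mask)
```

Such a mask could be converted with `torch.tensor(...)`, stored in the `TensorDataset` next to the token IDs, and passed to the model as `model(input_ids, attention_mask=attention_mask, labels=labels)`.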



### Balance dataset

In [9]:
# As the data is extremely imbalanced, we will use SMOTE to balance the data
sm = SMOTE(random_state=random_state, sampling_strategy='auto')
train_text_resampled, train_labels_resampled = sm.fit_resample(train_text_encoded, train_labels)

In [19]:
print(f"Total length of train data after resampling: {len(train_text_resampled)}")
print("Distribution of labels after resampling: \n")
train_labels_resampled.value_counts()

Total length of train data after resampling: 466892
Distribution of labels after resampling: 



1    233446
0    233446
Name: Label, dtype: int64
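
SMOTE creates a synthetic minority sample by picking a minority sample, choosing one of its nearest minority neighbours, and interpolating linearly between the two. The sketch below illustrates only that interpolation step on toy token-ID vectors. It is not imbalanced-learn's implementation, and the IDs, the fixed interpolation factor, and the name `interpolate` are made up for illustration. It also recovers the class counts before resampling from the numbers printed above.

```python
import numpy as np


def interpolate(sample, neighbour, lam):
    """Linear interpolation used by SMOTE: sample + lam * (neighbour - sample)."""
    sample = np.asarray(sample, dtype=float)
    neighbour = np.asarray(neighbour, dtype=float)
    return sample + lam * (neighbour - sample)


# Two toy encoded minority sentences (hypothetical token IDs, 0 = padding)
sentence_a = [101, 2000, 3000, 102, 0]
sentence_b = [101, 2400, 3600, 102, 0]

synthetic = interpolate(sentence_a, sentence_b, lam=0.5)
print(synthetic)  # values halfway between the two ID vectors

# Class counts implied by the output above
n_train = 264_781       # training sentences before resampling
n_majority = 233_446    # size of each class after resampling
n_minority = n_train - n_majority
n_synthetic = n_majority - n_minority
print(n_minority, n_synthetic)
```

Because the interpolated values are token IDs rather than continuous features, a synthetic row will generally not decode to a real sentence. Oversampling the raw text or weighting the loss by class are common alternatives.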

### Convert to tensors and load data in DataLoader

Convert the training and validation data into tensors, which are multi-dimensional arrays that can be processed by a neural network.

In [11]:
# Train and validation data to tensors
train_text_tensor = torch.tensor(train_text_resampled).to(device)
train_labels_tensor = torch.tensor(train_labels_resampled).to(device)

validation_text_tensor = torch.tensor(validation_text_encoded).to(device)
validation_labels_tensor = torch.tensor(validation_labels.values).to(device)

Define the batch size for training and validation data. The batch size determines the number of samples that will be processed together in each iteration during training.

In [12]:
# Define the batch size
batch_size = 32

# Create the DataLoader for our training set
train_data = TensorDataset(train_text_tensor, train_labels_tensor)
train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)

# Create the DataLoader for our validation set
validation_data = TensorDataset(validation_text_tensor, validation_labels_tensor)
validation_dataloader = DataLoader(validation_data, batch_size=batch_size, shuffle=False)
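
With `batch_size = 32`, the number of batches per epoch is the dataset size divided by the batch size, rounded up, since the `DataLoader` keeps the final, smaller batch by default (`drop_last=False`). A quick check with the dataset sizes reported earlier:

```python
import math

batch_size = 32
n_train_resampled = 466_892   # training samples after SMOTE
n_validation = 29_421         # validation samples

train_batches = math.ceil(n_train_resampled / batch_size)
validation_batches = math.ceil(n_validation / batch_size)

# Size of the final, incomplete validation batch
last_validation_batch = n_validation - (validation_batches - 1) * batch_size

print(train_batches, validation_batches, last_validation_batch)
```

These counts match the progress-bar totals (14591 and 920) in the training log.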

## Train model

In [None]:
# Instantiate the model
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2,output_attentions=False)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
# Initialize the best validation accuracy.
best_validation_accuracy = 0.0

model.to(device) # send the model to GPU
optimizer = AdamW(model.parameters(), lr=2e-5) # choose an optimizer for the gradient descent
loss_values = [] # average training loss per epoch; can be compared with validation results to choose the number of epochs and avoid overfitting

# define number of epochs
epochs = 3
for epoch in range(epochs): # number of epochs, i.e. how many times the whole dataset is passed through the model
      # switch (back) to train mode at the start of every epoch, because the
      # validation step below leaves the model in eval mode (dropout disabled)
      model.train()
      # =================================
      #              Training
      # =================================
      
      print("epoch: ", epoch+1)
      print("Training...")
      # capture time
      total_t0 = time.time()
      train_total_loss = 0
      for batch in tqdm(train_dataloader): # split into batches to fit into the memory
            input_ids, labels = batch
            # `.to()` is not in-place for tensors, so keep the returned tensors
            input_ids = input_ids.to(device)
            labels = labels.to(device)
            
            # Always clear any previously calculated gradients before performing a
            # backward pass. 
            optimizer.zero_grad()


            # Perform a forward pass (evaluate the model on this training batch).
            # Because we provide `labels`, the output contains the loss as its
            # first element, followed by the logits.
            # The documentation for the analogous BERT model class is here: 
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
            outputs = model(input_ids,labels=labels)
            # Extract the cross-entropy loss between predicted and true labels
            loss = outputs[0]

            # Accumulate the training loss over all of the batches so that we can
            # calculate the average loss at the end. `loss` is a Tensor containing a
            # single value; the `.item()` function just returns the Python value 
            # from the tensor.
            train_total_loss += loss.item()
            # Perform a backward pass to calculate the gradients.
            loss.backward()

            # Clip the norm of the gradients to 1.0.
            # This is to help prevent the "exploding gradients" problem.
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            # Update parameters using the optimizer and the gradient values
            optimizer.step()
      
      # print result summaries
      print("")
      print("summary results")
      print("epoch | train loss | train time")
      
      # Calculate the average loss over the training data.
      avg_train_loss = train_total_loss / len(train_dataloader)
      
      # Store the loss value for plotting the learning curve.
      loss_values.append(avg_train_loss)
      
      
      # training time end
      training_time = time.time() - total_t0
      print(f"{epoch+1:5d} | {avg_train_loss:.5f} |  {training_time:}")
      
      # =================================
      #             Validation
      # =================================
      # After the completion of each training epoch, measure our performance on
      # our validation set.
      print("")
      print("Running Validation...")
      # capture time
      total_t0 = time.time()
      # switch to evaluation mode, i.e. disable dropout
      model.eval()
      
      # Evaluate data for one epoch; no gradients are needed for validation
      with torch.no_grad():
            # Tracking variables
            preds_list = []
            accuracy_list = []
            labelsset=[]
            accuracy_sum = 0
            for batch in tqdm(validation_dataloader):
                  input_ids, labels = batch
                  input_ids = input_ids.to(device)
                  outputs = model(input_ids)
                  logits = outputs.logits.detach().cpu().numpy()   # raw class scores; argmax does not need a softmax
                  pred = np.argmax(logits, axis=1).tolist()   # predicted class per sentence
                  acc=accuracy_score(labels.detach().cpu().numpy().tolist(), pred)
                  accuracy_sum+=acc
                  preds_list.extend(pred)
                  accuracy_list.append(acc)
                  labelsset.extend(labels.detach().cpu().numpy())
      
      mean_accuracy = accuracy_sum / len(validation_dataloader)
      print("  Accuracy: {0:.2f}".format(mean_accuracy))




epoch:  1
Training...


We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
100%|██████████| 14591/14591 [1:54:06<00:00,  2.13it/s]



summary results
epoch | train loss | train time
    1 | 0.02923 |  6846.694295167923

Running Validation...


100%|██████████| 920/920 [02:18<00:00,  6.66it/s]


  Accuracy: 0.99
epoch:  2
Training...


100%|██████████| 14591/14591 [1:49:41<00:00,  2.22it/s]



summary results
epoch | train loss | train time
    2 | 0.01572 |  6581.928730726242

Running Validation...


100%|██████████| 920/920 [02:18<00:00,  6.67it/s]


  Accuracy: 0.99
epoch:  3
Training...


100%|██████████| 14591/14591 [1:49:34<00:00,  2.22it/s]



summary results
epoch | train loss | train time
    3 | 0.01384 |  6574.551898956299

Running Validation...


100%|██████████| 920/920 [02:18<00:00,  6.66it/s]

  Accuracy: 0.99
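
The validation loop averages per-batch accuracies. Because the last batch is smaller than the others, that average can differ slightly from the accuracy over all sentences, and `preds_list` and `labelsset` already hold what is needed for the latter. A minimal sketch with made-up predictions, using a batch size of 4 and a final batch of 2 (the helper `accuracy` and all values are hypothetical):

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    correct = sum(int(l == p) for l, p in zip(labels, preds))
    return correct / len(labels)


# Toy data: 6 samples split into batches of 4 and 2
labels = [0, 0, 0, 1, 1, 0]
preds = [0, 0, 0, 1, 0, 0]
batch_size = 4

batches = [
    (labels[i:i + batch_size], preds[i:i + batch_size])
    for i in range(0, len(labels), batch_size)
]

# Mean of per-batch accuracies, as in the validation loop above
mean_batch_accuracy = sum(accuracy(l, p) for l, p in batches) / len(batches)

# Accuracy over all samples at once
overall_accuracy = accuracy(labels, preds)

print(mean_batch_accuracy, overall_accuracy)
```

In the notebook, `accuracy_score(labelsset, preds_list)` would give the overall figure. `classification_report(labelsset, preds_list)`, which is already imported, would add per-class precision and recall, which tends to be more informative than accuracy on an imbalanced validation set.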





In [15]:
# Save model
# When this model was trained, it was saved to a local folder rather than pushed directly to the HuggingFace repository; the folder was uploaded to the website afterwards.
model.save_pretrained('../../models/distilbert-fls')

## Next step

This classifier is evaluated together with the FinBERT-based FLS classifier (notebook ``11b_fls-classifier_finbert``) in the notebook ``12_evaluate_fls_classifiers``.