<a href="https://colab.research.google.com/github/pimverschuuren/ComplaintDepartment/blob/main/TransformerMulticlassifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Get the dataset in compressed form.

In [1]:
!wget https://files.consumerfinance.gov/ccdb/complaints.csv.zip

--2021-10-27 16:25:11--  https://files.consumerfinance.gov/ccdb/complaints.csv.zip
Resolving files.consumerfinance.gov (files.consumerfinance.gov)... 52.84.158.12, 52.84.158.8, 52.84.158.48, ...
Connecting to files.consumerfinance.gov (files.consumerfinance.gov)|52.84.158.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 383843806 (366M) [binary/octet-stream]
Saving to: ‘complaints.csv.zip’


2021-10-27 16:25:14 (123 MB/s) - ‘complaints.csv.zip’ saved [383843806/383843806]



Decompress the data.

In [2]:
!unzip complaints.csv.zip

Archive:  complaints.csv.zip
  inflating: complaints.csv          


Setting up GPU if available

In [3]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

Install and import some libraries.

In [4]:
# Install the transformers package of Hugging Face.
!pip install transformers

# Importing the libraries needed
import pandas as pd
import torch
import time
import numpy as np
import torch.nn.functional as F
import transformers
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertModel, DistilBertTokenizer
from transformers import BertModel, BertTokenizer
torch.backends.cudnn.deterministic = True

Collecting transformers
  Downloading transformers-4.11.3-py3-none-any.whl (2.9 MB)
[K     |████████████████████████████████| 2.9 MB 7.2 MB/s 
[?25hCollecting huggingface-hub>=0.0.17
  Downloading huggingface_hub-0.0.19-py3-none-any.whl (56 kB)
[K     |████████████████████████████████| 56 kB 4.2 MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 58.4 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 58.4 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 64.3 MB/s 
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempti

Load the dataset into a pandas dataframe

In [5]:
total_dataset_full = pd.read_csv('complaints.csv')

Possible prediction variables: Company public response and product.

Future use of the predictions: In case of company public responses, it is hard for a human being to choose the optimal public response to a complaint from 10 categories. Predicting a public response based on the complaint would be a helpful tool.

Lets pre-process the data by removing nan values from the target variable and complaints. Also, when printing the frequency if each class in the target variable we see a large class imbalance.

In [6]:
text_variable = 'Consumer complaint narrative'
target_variable = 'Company public response'

print("Total number of statistics: "+str(len(total_dataset_full)))

total_dataset_full = total_dataset_full.dropna(subset=[text_variable])
total_dataset_full = total_dataset_full.dropna(subset=[target_variable])

print("Remaining number of statistics: "+str(len(total_dataset_full)))
print("Number of remaining classes are: "+str(total_dataset_full[target_variable].nunique()))
total_dataset_full[target_variable].value_counts()

Total number of statistics: 2317009
Remaining number of statistics: 391199
Number of remaining classes are: 11


Company has responded to the consumer and the CFPB and chooses not to provide a public response                            306269
Company believes it acted appropriately as authorized by contract or law                                                    46119
Company chooses not to provide a public response                                                                            19818
Company believes the complaint is the result of a misunderstanding                                                           4643
Company disputes the facts presented in the complaint                                                                        4251
Company believes complaint is the result of an isolated error                                                                2811
Company believes complaint caused principally by actions of third party outside the control or direction of the company      2759
Company believes complaint represents an opportunity for improvement to better serve consu

Here we resample the classes to balance the dataset in its target variable. This is to avoid the model learns to predict only the class that occurs the most in the training data.

In [23]:
from sklearn.utils import resample

total_dataset = None

# Define the number of occurences wanted for each class. 1000 is realitively low
max_len = 20000

for index, class_val in enumerate(total_dataset_full[target_variable].unique()):

  class_dataset = total_dataset_full.loc[total_dataset_full[target_variable] == class_val]

  class_dataset = resample(class_dataset,
                                 replace=True,
                                 n_samples=max_len,
                                 random_state=42)

  if index == 0:
    total_dataset = class_dataset.copy()
  else:
    total_dataset = pd.concat([total_dataset, class_dataset])

print(total_dataset[target_variable].value_counts())

# Include this line if a smaller dataset is needed for debugging.
#total_dataset = total_dataset_full.sample(frac=0.0005)

#print(total_dataset[target_variable].value_counts())

Company believes complaint caused principally by actions of third party outside the control or direction of the company    20000
Company believes the complaint is the result of a misunderstanding                                                         20000
Company believes the complaint provided an opportunity to answer consumer's questions                                      20000
Company chooses not to provide a public response                                                                           20000
Company believes complaint is the result of an isolated error                                                              20000
Company disputes the facts presented in the complaint                                                                      20000
Company has responded to the consumer and the CFPB and chooses not to provide a public response                            20000
Company can't verify or dispute the facts in the complaint                                       

Make a loading object that will pass tokenize and pass the data to the dataloader to avoid loading all the data in memory.

In [20]:
encode_dict = {}

def encode_product(x):
    if x not in encode_dict.keys():
        encode_dict[x]=len(encode_dict)
    return encode_dict[x]

class dataset_fold_BERT(Dataset):
    def __init__(self, xfold, yfold, tokenizer, max_len):
        self.len = len(xfold)
        self.xfold = xfold
        self.yfold = yfold
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __getitem__(self, index):
        sentence = str(self.xfold.iloc[index][text_variable])
        #title = " ".join(title.split())
        #print(sentence)
        inputs = self.tokenizer.encode_plus(
            sentence,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True,
            truncation=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'targets': self.yfold[index]
        } 
    
    def __len__(self):
        return self.len

Make a dataloader for k folds with stratified target variable classes.

In [24]:
from sklearn.model_selection import StratifiedKFold

# Define the number of folds.
k = 5

# Get the number of categories for the target variable.
n_class = total_dataset[target_variable].nunique()

kfold = StratifiedKFold(n_splits=(k))

# Define a maximum length for the complaint to be truncated to.
max_len = 512

# Define batch size
batch_size = 4

# Number of training epochs.
epochs = 1

# Learning rate for the optimizer.
lr = 1e-05

# Tokenizer to convert the text into tokens.
#tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

predictors = total_dataset.drop(target_variable, axis=1)

# Convert the products to integers.
target = total_dataset[target_variable].apply(lambda x: encode_product(x))

# Create a dict that will contain the dataloaders for all folds.
all_dataloaders = {}

# Define the dataloader parameters.
train_params = {'batch_size': batch_size,
                'shuffle': True,
                'num_workers': 0
                }
fold_count = 1

# Loop over the folds.
for _, fold in kfold.split(predictors, target):

    fold_name = "fold_"+str(fold_count)
    fold_count = fold_count + 1

    # Only keep the text variable column.
    X_fold = predictors.iloc[fold]
    y_fold = target.iloc[fold]

    # Convert to tensor.
    y_fold = torch.tensor(y_fold.values.astype(np.int64))

    # Get the dataset fold.
    training_fold = dataset_fold_BERT(X_fold, y_fold, tokenizer, max_len)

    # Get the dataloader.
    dataloader_fold = DataLoader(training_fold, **train_params)

    # Put all the dataloaders in a dict.
    all_dataloaders[fold_name] = dataloader_fold

Define two different models. Both use the pretrained BERT model as first layers. 

In [10]:
class BERTClass(torch.nn.Module):
    def __init__(self, n_class, hidden_dim, dropout):
        super(BERTClass, self).__init__()
        self.l1 = BertModel.from_pretrained("bert-base-uncased")
        #self.l1 = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.pre_classifier = torch.nn.Linear(768, hidden_dim)
        self.dropout = torch.nn.Dropout(dropout)
        self.classifier = torch.nn.Linear(hidden_dim, n_class)

    def forward(self, input_ids, attention_mask):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

class GRUBERTClass(torch.nn.Module):
    def __init__(self,
                 hidden_dim,
                 output_dim,
                 n_layers,
                 bidirectional,
                 dropout):
        
        super().__init__()
        
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        
        embedding_dim = self.bert.config.to_dict()['hidden_size']
        
        self.rnn = torch.nn.GRU(embedding_dim,
                          hidden_dim,
                          num_layers = n_layers,
                          bidirectional = bidirectional,
                          batch_first = True,
                          dropout = 0 if n_layers < 2 else dropout)
        
        self.out = torch.nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        
        self.dropout = torch.nn.Dropout(dropout)
    
    def forward(self, text):
        
        with torch.no_grad():
          embedded = self.bert(text)[0]
        
        _, hidden = self.rnn(embedded)
        
        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        else:
            hidden = self.dropout(hidden[-1,:,:])
        
        output = self.out(hidden)
        
        return output


Instantiate the models.

In [11]:
hidden_dim = 256
n_layers = 2
bidirectional = False
dropout = 0.2

model = BERTClass(n_class, hidden_dim, dropout)
'''
model = GRUBERTClass(hidden_dim,
                 n_class,
                 n_layers,
                 bidirectional,
                 dropout)
'''
model.to(device)

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/420M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BERTClass(
  (l1): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    

Define a function that gives the number of learnable parameters. This gives an indication of the model complexity.

In [12]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

count_parameters(model)

The model has 109,681,931 trainable parameters


109681931

In [13]:
# Creating the loss function and optimizer
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  model.parameters(), lr=lr)

Define functions to calculate the accuracy, recall and precision.

In [14]:
from sklearn.metrics import balanced_accuracy_score

def calculate_accu(big_idx, targets):
# Calculate the accuracy for a multiclass prediction.
    n_correct = (big_idx==targets).sum().item()
    return n_correct

def averaged_recall_precision(confusion_matrix):
    
    sum_precision = 0
    sum_recall = 0

    # Sum along row/col.
    sum_rows = np.sum(confusion_matrix, axis=0)
    sum_cols = np.sum(confusion_matrix, axis=1)

    # Sum all the precisions.
    for i_class in range(confusion_matrix.shape[0]):
      sum_precision += confusion_matrix[i_class,i_class]/sum_rows[i_class]
      
    
    # Sum all recalls.
    for i_class in range(confusion_matrix.shape[0]):
      sum_recall += confusion_matrix[i_class,i_class]/sum_cols[i_class]

    return sum_recall/confusion_matrix.shape[0], sum_precision/confusion_matrix.shape[0]

Define a function that can calculate the passed time for during training.



In [15]:
def passed_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Save the untrained model so that for each fold the model can be trained from its initial state.

In [17]:
torch.save(model.state_dict(), 'initial_state.pt')

Define the training loop.

In [25]:
from sklearn.metrics import classification_report

def cv_training(n_epochs, n_steps_per_print):

    final_precision = 0
    final_recall = 0
    final_f1 = 0
    final_acc = 0

    for fold_val_key in all_dataloaders:

      # Initiate the initial state of the model for each fold.
      model.load_state_dict(torch.load('initial_state.pt')) 

      # Reset everything for validation fold.
      tr_loss = 0
      n_tr_correct = 0
      n_val_correct = 0
      confusion_matrix_train = np.zeros((n_class, n_class))
      confusion_matrix_val = np.zeros((n_class, n_class))

      nb_tr_steps = 0
      nb_tr_examples = 0

      nb_val_examples = 0

      model.train()
      time_a = time.time()

      for fold_train_key in all_dataloaders:
        
        if fold_train_key == fold_val_key:
          continue

        print("Training on "+fold_train_key)

        # Perform training.
        for epoch in range(n_epochs): 

          print(f"Epoch {epoch}")

          for _,data in enumerate(all_dataloaders[fold_train_key], 0):
              
              # Get tokenized input text.
              ids = data['ids'].to(device, dtype = torch.long)
              mask = data['mask'].to(device, dtype = torch.long)
              
              # Get the target variable.
              targets = data['targets'].to(device, dtype = torch.long)

              # If using BERT Classifier, also pass masks.
              outputs = model(ids, mask)

              # If using GRU Classiifier, only pass tokens.
              #outputs = model(ids)

              # Get the loss function.
              loss = loss_function(outputs, targets)

              # Sum the loss
              tr_loss += loss.item()

              # Get class that had the highest classifier output i.e. the class
              # that the model predicts.
              big_val, big_idx = torch.max(outputs.data, dim=1)

              # Update the confusion matrix.
              for i in range(len(big_idx)):
                confusion_matrix_train[big_idx[i],targets[i]] += 1

              # Get the number of correct classifications for the whole batch.
              n_tr_correct += calculate_accu(big_idx, targets)

              nb_tr_steps += 1
              nb_tr_examples+=targets.size(0)
              
              if _%n_steps_per_print==0:
                  mins, secs = passed_time(time_a, time.time())
                  print(f'Passed Time: {mins}m {secs}s')

                  loss_step = tr_loss/nb_tr_steps
                  acc = (n_tr_correct*100)/nb_tr_examples
                  av_recall, av_precision = averaged_recall_precision(confusion_matrix_train)
                  f1_score = 2*av_recall*av_precision/(av_recall + av_precision)

                  print("==================================")
                  print("After "+str(nb_tr_steps)+" steps:")
                  print(f"Training Loss: {loss_step}")
                  print(f"Training Av. Recall: {av_recall}")
                  print(f"Training Av. Precision: {av_precision}")
                  print(f"Training F1-score: {f1_score}")
                  print(f"Training Accuracy: {acc}")
                  print("==================================")

                  time_a = time.time()

              optimizer.zero_grad()
              loss.backward()
              # # When using GPU
              optimizer.step()
              
      # Turns of dropout.
      model.eval()

      # Fix the parameters.
      with torch.no_grad():
        
        for index,data in enumerate(all_dataloaders[fold_val_key], 0):

          # Get tokenized input text.
          ids = data['ids'].to(device, dtype = torch.long)
          mask = data['mask'].to(device, dtype = torch.long)

          # Get the target variable.
          targets = data['targets'].to(device, dtype = torch.long)

          # If using BERT Classifier, also pass masks.
          outputs = model(ids, mask)
          
          # If using GRU Classiifier, only pass tokens.
          #outputs = model(ids)

          # Get class that had the highest classifier output i.e. the class
          # that the model predicts.
          big_val, big_idx = torch.max(outputs.data, dim=1)

          # Get the number of correct classifications for the whole batch.
          n_val_correct += calculate_accu(big_idx, targets)

          # Update the confusion matrix.
          for i in range(len(big_idx)):
            confusion_matrix_val[big_idx[i],targets[i]] += 1

          nb_val_examples+=targets.size(0)

        acc = (n_val_correct*100)/nb_val_examples
        av_recall, av_precision = averaged_recall_precision(confusion_matrix_val)
        f1_score = 2*av_recall*av_precision/(av_recall + av_precision)

        print("============= Validation of "+fold_val_key+"=============")
        print(f"Val. Accuracy: {acc}")
        print(f"Val. Av. recall: {av_recall}")
        print(f"Val. Av. precision: {av_precision}")
        print(f"Val. F1-score: {f1_score}")
        print("==================================")
        
        final_acc += acc
        final_recall += av_recall
        final_precision += av_precision
        final_f1 += f1_score

    final_acc = final_acc/k
    final_recall = final_recall/k
    final_precision = final_precision/k
    final_f1 = final_f1/k

    print("============= All folds averages =============")
    print(f"Val. Accuracy: {final_acc}")
    print(f"Val. Av. recall: {final_recall}")
    print(f"Val. Av. precision: {final_precision}")
    print(f"Val. F1-score: {final_f1}")
    
    return 

In [None]:
cv_training(1,1000)

Training on fold_2
Epoch 0
Passed Time: 0m 0s
After 1 steps:
Training Loss: 2.4772846698760986
Training Av. Recall: nan
Training Av. Precision: nan
Training F1-score: nan
Training Accuracy: 0.0




Passed Time: 4m 8s
After 1001 steps:
Training Loss: 2.398004163395275
Training Av. Recall: 0.10231305307799682
Training Av. Precision: 0.09687927714583443
Training F1-score: 0.09952205101111723
Training Accuracy: 9.84015984015984
Passed Time: 4m 8s
After 2001 steps:
Training Loss: 2.3911084721292157
Training Av. Recall: 0.10141251447808926
Training Av. Precision: 0.10403659356705565
Training F1-score: 0.10270779612293816
Training Accuracy: 10.444777611194402
Passed Time: 4m 9s
After 3001 steps:
Training Loss: 2.387397681105657
Training Av. Recall: 0.1025630549234081
Training Av. Precision: 0.10851405085962197
Training F1-score: 0.10545466328041425
Training Accuracy: 10.804731756081306
Passed Time: 4m 8s
After 4001 steps:
Training Loss: 2.379718685084598
Training Av. Recall: 0.10970323102033434
Training Av. Precision: 0.11755730075059567
Training F1-score: 0.1134945484979206
Training Accuracy: 11.72831792051987
Passed Time: 4m 8s
After 5001 steps:
Training Loss: 2.3464381298144517
Train