Introduction to BERT and the problem at hand

Exploratory Data Analysis and Preprocessing 

Training/Validation Split

Loading Tokenizer and Encoding our Data

Setting up BERT Pretrained Model

Creating Data Loaders

Setting Up Optimizer and Scheduler

Defining our Performance Metrics

Creating our Training Loop

Loading and Evaluating our Model

In [14]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [15]:
!pip install transformers



In [16]:
import torch
from tqdm.notebook import tqdm

from transformers import BertTokenizer
from torch.utils.data import TensorDataset

from transformers import BertForSequenceClassification

import pandas as pd

In [17]:
df = pd.read_csv('/content/drive/MyDrive/Colab Data/complaints.csv.zip')

In [18]:
df.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,2019-09-24,Debt collection,I do not know,Attempts to collect debt not owed,Debt is not yours,transworld systems inc. \nis trying to collect...,,TRANSWORLD SYSTEMS INC,FL,335XX,,Consent provided,Web,2019-09-24,Closed with explanation,Yes,,3384392
1,2019-09-19,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,PA,15206,,Consent not provided,Web,2019-09-20,Closed with non-monetary relief,Yes,,3379500
2,2019-11-08,Debt collection,I do not know,Communication tactics,Frequent or repeated calls,"Over the past 2 weeks, I have been receiving e...",,"Diversified Consultants, Inc.",NC,275XX,,Consent provided,Web,2019-11-08,Closed with explanation,Yes,,3433198
3,2019-09-15,Debt collection,Other debt,Attempts to collect debt not owed,Debt was result of identity theft,Pioneer has committed several federal violatio...,,Pioneer Capital Solutions Inc,CA,925XX,,Consent provided,Web,2019-09-15,Closed with explanation,Yes,,3374555
4,2021-03-02,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,,,"EQUIFAX, INC.",FL,33325,,,Web,2021-03-02,Closed with explanation,Yes,,4176536


In [19]:
df = df[pd.notnull(df['Consumer complaint narrative'])]
col = ['Product', 'Consumer complaint narrative']
df = df[col]
df.columns = ['Product', 'Consumer_complaint_narrative']

In [20]:
df['Product'].value_counts()

Credit reporting, credit repair services, or other personal consumer reports    248115
Debt collection                                                                 134651
Mortgage                                                                         75042
Credit card or prepaid card                                                      50365
Credit reporting                                                                 31588
Checking or savings account                                                      29404
Student loan                                                                     27735
Credit card                                                                      18838
Bank account or service                                                          14885
Money transfer, virtual currency, or money service                               12994
Vehicle loan or lease                                                            12287
Consumer Loan                              

In [21]:
import numpy as np

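# Ratio of the smallest class size to the per-class cap:
# each class keeps at most (smallest class size / samplingStrategy) rows
samplingStrategy = 0.02

# Undersample: classes above the cap are randomly reduced to it, smaller classes are kept whole.
# A plain int avoids the overflow np.int16 would hit for caps above 32767.
nsamples_per_class = int(df['Product'].value_counts().min() / samplingStrategy)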
undersample = lambda df: df.loc[np.random.choice(a=df.index, size=min(len(df.index), nsamples_per_class), replace=False)]
df_bal = df.groupby(['Product'], as_index=False).apply(undersample)

print(df_bal.shape)

(13108, 2)
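The capping rule used above can be illustrated without pandas. The sketch below is a minimal standard-library version run on made-up data; the helper name `undersample_by_cap` and the toy labels are illustrative assumptions and are not part of the notebook.

```python
import random
from collections import Counter, defaultdict

def undersample_by_cap(rows, label_of, sampling_strategy, seed=0):
    """Keep at most (smallest class size / sampling_strategy) rows per class."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)

    smallest = min(len(group) for group in groups.values())
    cap = int(smallest / sampling_strategy)

    balanced = []
    for group in groups.values():
        # Sample without replacement; classes below the cap are kept whole.
        balanced.extend(rng.sample(group, min(len(group), cap)))
    return balanced, cap

# Toy data: (label, text) pairs with a heavy class imbalance.
rows = (
    [("A", f"a{i}") for i in range(1000)]
    + [("B", f"b{i}") for i in range(120)]
    + [("C", f"c{i}") for i in range(4)]
)
balanced, cap = undersample_by_cap(rows, label_of=lambda r: r[0], sampling_strategy=0.25)
counts = Counter(label for label, _ in balanced)
print(cap, dict(counts))
```

With a smallest class of 4 rows and a strategy of 0.25, the cap is 16, so the two large classes shrink to 16 rows each while the smallest class is untouched.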


In [22]:
df = df_bal

In [23]:
possible_labels = df.Product.unique()

label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
label_dict

{'Bank account or service': 0,
 'Checking or savings account': 1,
 'Consumer Loan': 2,
 'Credit card': 3,
 'Credit card or prepaid card': 4,
 'Credit reporting': 5,
 'Credit reporting, credit repair services, or other personal consumer reports': 6,
 'Debt collection': 7,
 'Money transfer, virtual currency, or money service': 8,
 'Money transfers': 9,
 'Mortgage': 10,
 'Other financial service': 11,
 'Payday loan': 12,
 'Payday loan, title loan, or personal loan': 13,
 'Prepaid card': 14,
 'Student loan': 15,
 'Vehicle loan or lease': 16,
 'Virtual currency': 17}
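The same label-to-id mapping can be written as a dictionary comprehension, together with the inverse mapping that is needed later to print class names. This is a small sketch on made-up product names; sorting the unique labels first is an assumption added here to make the ids reproducible across runs.

```python
# Toy label column (illustrative values only)
products = ["Mortgage", "Debt collection", "Mortgage", "Student loan"]

# Sorted unique labels give a stable label -> id assignment
label_to_id = {label: idx for idx, label in enumerate(sorted(set(products)))}
id_to_label = {idx: label for label, idx in label_to_id.items()}

encoded = [label_to_id[p] for p in products]
print(label_to_id)
print(encoded)
```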

In [24]:
# map() looks each product up in label_dict directly
df['label'] = df.Product.map(label_dict)

In [25]:
df.head()

Unnamed: 0,Unnamed: 1,Product,Consumer_complaint_narrative,label
0,1749743,Bank account or service,XXXX XXXX we used bank card to pay for car rep...,0
0,1743430,Bank account or service,I made a bill payment using TD Bank 's Online ...,0
0,1682612,Bank account or service,"Hello, We started a personal property claim wi...",0
0,1598661,Bank account or service,My issue is that there are no regulations for ...,0
0,1705430,Bank account or service,Well Fargo 's system displayed incorrect accou...,0


In [26]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.15, 
                                                  random_state=42, 
                                                  stratify=df.label.values)

In [27]:
df['data_type'] = ['not_set']*df.shape[0]

df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

In [28]:
df.groupby(['Product', 'label', 'data_type']).count()

Product,label,data_type,Consumer_complaint_narrative
Bank account or service,0,train,680
Bank account or service,0,val,120
Checking or savings account,1,train,680
Checking or savings account,1,val,120
Consumer Loan,2,train,680
Consumer Loan,2,val,120
Credit card,3,train,680
Credit card,3,val,120
Credit card or prepaid card,4,train,680
Credit card or prepaid card,4,val,120
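The even 680/120 counts per class follow from stratification: the validation fraction is applied within each class rather than to the dataset as a whole. The sketch below shows the idea with the standard library only. It is a simplified illustration, not scikit-learn's actual allocation algorithm, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed=42):
    """Split indices so each class contributes about test_size of its rows to validation."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels: 20 rows of class 0 and 40 rows of class 1
labels = [0] * 20 + [1] * 40
train_idx, val_idx = stratified_split(labels, test_size=0.25)
val_counts = Counter(labels[i] for i in val_idx)
print(len(train_idx), len(val_idx), dict(val_counts))
```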


In [29]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          do_lower_case=True)





In [30]:
# padding='max_length' replaces the deprecated pad_to_max_length=True, and
# truncation=True makes the cut at max_length explicit instead of relying on the default.
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].Consumer_complaint_narrative.values.tolist(), 
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length', 
    truncation=True, 
    max_length=256, 
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].Consumer_complaint_narrative.values.tolist(), 
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length', 
    truncation=True, 
    max_length=256, 
    return_tensors='pt'
)


input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)


In [31]:
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

In [32]:
len(dataset_train), len(dataset_val)

(11141, 1967)
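Every encoded row has the same length because each sequence is padded or cut to `max_length`, and the attention mask marks which positions hold real tokens. The sketch below mimics that on plain lists. The helper `pad_or_truncate` and the token ids are illustrative assumptions, and unlike the real tokenizer it does not re-append the closing special token after truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])        # cut sequences that are too long
    mask = [1] * len(ids)                     # 1 = real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 = padding

# Arbitrary ids standing in for an already tokenized sentence
tokens = [101, 11, 12, 13, 102]

padded_ids, padded_mask = pad_or_truncate(tokens, max_length=8)
short_ids, short_mask = pad_or_truncate(tokens, max_length=3)
print(padded_ids, padded_mask)
print(short_ids, short_mask)
```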

In [33]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)





Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [34]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 16

dataloader_train = DataLoader(dataset_train, 
                              sampler=RandomSampler(dataset_train), 
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val, 
                                   sampler=SequentialSampler(dataset_val), 
                                   batch_size=batch_size)

In [35]:
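from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# transformers' own AdamW is deprecated in favor of torch.optim.AdamW.
# weight_decay=0.0 keeps the default of the transformers optimizer
# (torch.optim.AdamW would otherwise default to 0.01).
optimizer = AdamW(model.parameters(),
                  lr=1e-5, 
                  eps=1e-8,
                  weight_decay=0.0)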

In [36]:
epochs = 3

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)
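With 11141 training rows and a batch size of 16 there are 697 batches per epoch, so the schedule spans 2091 steps over 3 epochs. A linear schedule with warmup scales the base learning rate by a factor that ramps from 0 to 1 during warmup and then decays linearly to 0. The function below is a plain-Python sketch of that factor as I understand it; `linear_schedule_factor` is a name chosen here for illustration.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 up to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
total_steps = 697 * 3   # batches per epoch * epochs

for step in (0, total_steps // 2, total_steps):
    lr = base_lr * linear_schedule_factor(step, 0, total_steps)
    print(step, lr)
```

Because `num_warmup_steps=0` here, the learning rate starts at its full value of 1e-5 and reaches 0 on the final step.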

In [53]:
from sklearn.metrics import f1_score

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

def accuracy_per_class(preds, labels):
    # Map integer ids back to product names for printing
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        n_correct = int((y_preds == label).sum())
        # Fraction of this class's examples predicted correctly (per-class recall), not a percentage
        acc = np.round(n_correct / len(y_true), decimals=2)
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {n_correct}/{len(y_true)} = {acc}\n')
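To make the weighted F1 score concrete, the sketch below computes it by hand: an F1 score per class, averaged with each class weighted by its number of true examples. It uses only the standard library and toy labels. The helper names are illustrative, and the intent is to mirror what `average='weighted'` means rather than to replace scikit-learn.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for every class."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def weighted_f1(y_true, y_pred):
    """Average of per-class F1, weighted by each class's count in y_true."""
    support = Counter(y_true)
    scores = per_class_f1(y_true, y_pred)
    return sum(scores[c] * support[c] for c in support) / len(y_true)

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
scores = per_class_f1(y_true, y_pred)
overall = weighted_f1(y_true, y_pred)
print(scores, overall)
```

A class the model never predicts gets an F1 of 0 and pulls the weighted score down in proportion to its support.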

In [38]:
import random
import numpy as np

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [39]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


In [40]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals
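`evaluate` returns raw logits, one row per example and one column per class. The predicted class is the index of the largest logit, and a softmax turns a row into probabilities when a confidence value is wanted. The sketch below shows both on toy numbers; the function names are illustrative.

```python
import math

def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits_batch):
    """Index of the largest logit in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits_batch]

logits_batch = [[0.1, 2.0, -1.0],
                [3.0, 0.0, 0.5]]
predicted = predict(logits_batch)
probs = [softmax(row) for row in logits_batch]
print(predicted)
print([round(max(p), 3) for p in probs])
```

Softmax preserves the ordering of the logits, so taking the argmax of the logits and of the probabilities gives the same class.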

In [41]:
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }       

        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)


        optimizer.step()
        scheduler.step()
        
        # loss is already the batch mean; len(batch) is the number of tensors (3), not the batch size
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
         
        
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)            
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')



Epoch 1
Training loss: 2.048479602360828
Validation loss: 1.482991490664521
F1 Score (Weighted): 0.4664825250812034




Epoch 2
Training loss: 1.348320839052734
Validation loss: 1.2846289988213437
F1 Score (Weighted): 0.5642545278723318




Epoch 3
Training loss: 1.167393882869819
Validation loss: 1.233807732055827
F1 Score (Weighted): 0.5876427127657708
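The training loop clips gradients to a global norm of 1.0 before each optimizer step. The idea is to measure the L2 norm over all gradients together and, if it exceeds the limit, scale every gradient by the same factor so the direction is preserved. Below is a plain-Python sketch of that rule on nested lists. The helper `clip_by_global_norm` is illustrative, and the small `eps` in the denominator is an assumption modeled on how PyTorch guards against division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale gradients up
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

# Two toy "parameter tensors" whose combined norm is 5.0
large = [[3.0], [4.0]]
clipped, norm_before = clip_by_global_norm(large, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))
print(norm_before, norm_after)

# Gradients already under the limit are left unchanged
small = [[0.3], [0.4]]
unchanged, _ = clip_by_global_norm(small, max_norm=1.0)
print(unchanged)
```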



In [42]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [56]:
model.load_state_dict(torch.load('finetuned_BERT_epoch_3.model', map_location=torch.device('cpu')))

<All keys matched successfully>

In [57]:
_, predictions, true_vals = evaluate(dataloader_validation)

In [58]:
accuracy_per_class(predictions, true_vals)

Class: Bank account or service
Accuracy: 84/120 = 0.7

Class: Checking or savings account
Accuracy: 60/120 = 0.5

Class: Consumer Loan
Accuracy: 28/120 = 0.23

Class: Credit card
Accuracy: 76/120 = 0.63

Class: Credit card or prepaid card
Accuracy: 43/120 = 0.36

Class: Credit reporting
Accuracy: 67/120 = 0.56

Class: Credit reporting, credit repair services, or other personal consumer reports
Accuracy: 46/120 = 0.38

Class: Debt collection
Accuracy: 87/120 = 0.72

Class: Money transfer, virtual currency, or money service
Accuracy: 66/120 = 0.55

Class: Money transfers
Accuracy: 88/120 = 0.73

Class: Mortgage
Accuracy: 109/120 = 0.91

Class: Other financial service
Accuracy: 0/44 = 0.0

Class: Payday loan
Accuracy: 96/120 = 0.8

Class: Payday loan, title loan, or personal loan
Accuracy: 43/120 = 0.36

Class: Prepaid card
Accuracy: 101/120 = 0.84

Class: Student loan
Accuracy: 111/120 = 0.92

Class: Vehicle loan or lease
Accuracy: 88/120 = 0.73

Class: Virtual currency


In [59]:
f1_score_func(predictions, true_vals)

0.5876427127657708