<a href="https://colab.research.google.com/github/pimverschuuren/ComplaintDepartment/blob/main/SubSample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Get the dataset in compressed form.

In [None]:
!wget https://files.consumerfinance.gov/ccdb/complaints.csv.zip

--2021-10-27 16:00:18--  https://files.consumerfinance.gov/ccdb/complaints.csv.zip
Resolving files.consumerfinance.gov (files.consumerfinance.gov)... 13.32.150.6, 13.32.150.22, 13.32.150.68, ...
Connecting to files.consumerfinance.gov (files.consumerfinance.gov)|13.32.150.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 383843806 (366M) [binary/octet-stream]
Saving to: ‘complaints.csv.zip’


2021-10-27 16:00:22 (95.8 MB/s) - ‘complaints.csv.zip’ saved [383843806/383843806]



Decompress the data.

In [None]:
!unzip complaints.csv.zip

Archive:  complaints.csv.zip
  inflating: complaints.csv          


Set up the GPU if one is available.

In [None]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

Install and import some libraries.

In [None]:
# Install the transformers package of Hugging Face.
!pip install transformers

# Importing the libraries needed
import pandas as pd
import torch
import time
import numpy as np
import torch.nn.functional as F
import transformers
from torch.utils.data import Dataset, DataLoader
#from transformers import DistilBertModel, DistilBertTokenizer
from transformers import BertModel, BertTokenizer
torch.backends.cudnn.deterministic = True

Collecting transformers
  Downloading transformers-4.11.3-py3-none-any.whl (2.9 MB)
Collecting huggingface-hub>=0.0.17
  Downloading huggingface_hub-0.0.19-py3-none-any.whl (56 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempt

Load the dataset into a pandas DataFrame.

In [None]:
total_dataset = pd.read_csv('complaints.csv')

Let's pre-process the data by removing rows with NaN values in the complaint narrative. The target variable is encoded as integers further below.

In [None]:
text_variable = 'Consumer complaint narrative'
target_variable = 'Company public response'

print("Total number of statistics: "+str(len(total_dataset)))

total_dataset = total_dataset.dropna(subset=[text_variable])

print("Remaining number of statistics: "+str(len(total_dataset)))

Total number of statistics: 2317009
Remaining number of statistics: 805096


Use the stratified k-fold procedure to produce subsamples with equal proportions of product class.
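
As a rough illustration of the stratified k-fold idea, here is a minimal standard-library sketch. The function name `stratified_folds`, the toy labels, and the seed are assumptions made for this example and are not part of the notebook; in practice a library routine such as scikit-learn's `StratifiedKFold` would likely be used instead.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Split row positions into n_folds lists with near-equal class proportions.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group row positions by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class, key=str):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled positions of this class round-robin over the folds,
        # so every fold receives roughly the same share of each class.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Toy labels standing in for the integer-encoded target column.
toy_labels = [0] * 8 + [1] * 4
folds = stratified_folds(toy_labels, n_folds=4)

for fold in folds:
    counts = [sum(1 for i in fold if toy_labels[i] == c) for c in (0, 1)]
    print(len(fold), counts)
```

Each fold in this toy case holds two samples of class 0 and one of class 1, mirroring the 2:1 ratio of the full label list.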

In [None]:
encode_dict = {}

def encode_product(x):
    if x not in encode_dict.keys():
        encode_dict[x]=len(encode_dict)
    return encode_dict[x]

class dataset_fold_BERT(Dataset):
    def __init__(self, xfold, yfold, tokenizer, max_len):
        self.len = len(xfold)
        self.xfold = xfold
        self.yfold = yfold
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __getitem__(self, index):
        # Use positional indexing: after dropna the pandas index has gaps.
        sentence = str(self.xfold.iloc[index][text_variable])
        inputs = self.tokenizer.encode_plus(
            sentence,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            return_token_type_ids=True,
            truncation=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'targets': torch.tensor(self.yfold.iloc[index], dtype=torch.long)
        }
    
    def __len__(self):
        return self.len
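
The dataset class above returns fixed-length token ids together with an attention mask. The following sketch shows the idea without the tokenizer: sequences are cut or padded to `max_len`, and the mask marks real tokens with 1 and padding with 0. The helper `pad_or_truncate`, the token ids, and `pad_id=0` are made up for illustration. Note that the real tokenizer keeps the closing special token when truncating, which this simplified version does not.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (ids, mask), both of length max_len.

    Hypothetical helper: a simplified stand-in for the tokenizer's
    padding and truncation, not its actual implementation.
    """
    kept = token_ids[:max_len]              # truncate long sequences
    n_pad = max_len - len(kept)             # number of padding slots needed
    ids = kept + [pad_id] * n_pad           # pad short sequences
    mask = [1] * len(kept) + [0] * n_pad    # 1 = real token, 0 = padding
    return ids, mask


# A short sequence gets padded, a long one gets truncated.
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_len=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_len=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```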

In [None]:
# Get the number of categories for the target variable.
n_class = total_dataset[target_variable].nunique()

# Define a maximum length for the complaint to be truncated to.
max_len = 512

# Define batch size
batch_size = 4

# Tokenizer to convert the text into tokens.
#tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

predictors = total_dataset.drop(target_variable, axis=1)

# Convert the target categories to integers.
target = total_dataset[target_variable].apply(encode_product)

# Create a dict that will contain the dataloaders for all folds.
all_dataloaders = {}

# Define the dataloader parameters.
train_params = {'batch_size': batch_size,
                'shuffle': True,
                'num_workers': 0
                }

# Get the dataset.
training_data = dataset_fold_BERT(predictors, target, tokenizer, max_len)

# Get the dataloader.
dataloader = DataLoader(training_data, **train_params)


In [None]:
class BERTEncoderClass(torch.nn.Module):
    def __init__(self, n_class, hidden_dim, dropout):
        super().__init__()
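        # Use the cased checkpoint so the model vocabulary matches the
        # 'bert-base-cased' tokenizer loaded above.
        self.l1 = BertModel.from_pretrained("bert-base-cased")
        #self.l1 = DistilBertModel.from_pretrained("distilbert-base-cased")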
        self.pre_classifier = torch.nn.Linear(768, hidden_dim)

    def forward(self, input_ids, attention_mask):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        output = self.pre_classifier(pooler)
        return output
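
The forward pass above keeps only the hidden vector of the first token of each sequence (`hidden_state[:, 0]`) and feeds it through a linear layer. Here is a small sketch of those two steps with nested lists standing in for tensors. The helper `linear`, the toy dimensions, and all numbers are assumptions chosen so the arithmetic is easy to follow.

```python
def linear(x, weight, bias):
    """Compute y = W x + b, with weight laid out as (out_features, in_features)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


# Toy hidden states with shape (batch=2, sequence=3, hidden=2).
hidden_state = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]],
]

# Equivalent of hidden_state[:, 0]: the first token's vector of every sequence.
pooled = [sequence[0] for sequence in hidden_state]

# Project each pooled vector from 2 dimensions down to 1.
weight = [[1.0, 1.0]]
bias = [0.5]
projected = [linear(vector, weight, bias) for vector in pooled]

print(pooled)
print(projected)
```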

Define a loop that passes every batch through the encoder and prints the resulting embeddings.

In [None]:
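def fill_wordvec_hist():
    # NOTE: `model` is not created anywhere above; instantiate a
    # BERTEncoderClass and move it to `device` before calling this.
    # Gradients are not needed because the outputs are only inspected.
    with torch.no_grad():
        for data in dataloader:

            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.long)

            outputs = model(ids, mask)

            print(outputs)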
