This notebook illustrates how to use the transformers library to fine-tune a text classification model on our own dataset. I am using the same dataset we used with the other (relatively) older approaches. Keep in mind that training takes some time (roughly 21 minutes for the run recorded below) and needs some GPU resources.

Based on: https://huggingface.co/transformers/v3.4.0/custom_datasets.html 

In [1]:
#install the required libraries
!pip install transformers
!pip install datasets
!pip install pandas
!pip install scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.22.2-py3-none-any.whl (4.9 MB)
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.10.0-py3-none-any.whl (163 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.0 tokenizers-0.12.1 transformers-4.22.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.5.2-py3-none-any.whl (432 kB)

In [2]:
#import what we need later
import datasets
from datasets import load_dataset
from datasets import Dataset, DatasetDict

import pandas as pd

from sklearn.model_selection import train_test_split


In [3]:
#Read the csv file containing our data
our_data = pd.read_csv("Full-Economic-News-DFE-839861.csv" , encoding = "ISO-8859-1" )

In [4]:
#Keep the two columns we need (text, relevance), using only rows where relevance is "yes" or "no".
#There seem to be some NaNs, which are skipped.
mylen = len(our_data["text"].tolist())
mytexts = [] #will contain the text strings
mylabels = [] #will contain the labels: 1 for "yes", 0 for "no"
for i in range(0,mylen):
    if str(our_data['relevance'][i]) == 'yes':
        mytexts.append(str(our_data["text"][i]))
        mylabels.append(1)
    elif str(our_data["relevance"][i]) == "no":
        mytexts.append(str(our_data["text"][i]))
        mylabels.append(0)
    else:
        print("skipping")
len(mytexts)
len(mylabels)

skipping
skipping
skipping
skipping
skipping
skipping
skipping


6637
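
The loop above can also be written as a lookup from the relevance string to an integer label. The following is a minimal sketch in plain Python on a few made-up rows (the `rows` list and the `LABEL_MAP` name are illustrative assumptions, not part of the dataset or the notebook). Anything that is not exactly "yes" or "no" is skipped, as in the original loop.

```python
# Hypothetical toy rows standing in for (text, relevance) pairs from the CSV.
rows = [
    ("Stocks rallied after the jobs report.", "yes"),
    ("A review of a new restaurant downtown.", "no"),
    ("Unclear snippet.", "not sure"),
    ("Row with a missing label.", float("nan")),
]

# Map the two labels we keep to integers; everything else is skipped.
LABEL_MAP = {"yes": 1, "no": 0}

texts, labels, skipped = [], [], 0
for text, relevance in rows:
    label = LABEL_MAP.get(str(relevance))
    if label is None:
        skipped += 1
        continue
    texts.append(str(text))
    labels.append(label)

print(len(texts), len(labels), skipped)
```

In the run above, 6644 rows were read and 6637 were kept, so 7 rows were skipped, which matches the seven "skipping" lines.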

In [22]:
mylen

6644

In [5]:
#Split the data into train, validation, and test sets: hold out 25% for testing, then take 10% of the remainder for validation.
train_texts, test_texts, train_labels, test_labels = train_test_split(mytexts, mylabels, test_size=.25)
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.1)
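
To see what sizes these two calls produce, here is a small arithmetic sketch in plain Python. It assumes that `train_test_split` rounds the test share up to a whole number of samples and gives the rest to the training side, which is my understanding of scikit-learn's behaviour for a float `test_size`. The helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test

n_total = 6637  # number of labelled examples kept above
n_train_plus_val, n_test = split_sizes(n_total, 0.25)
n_train, n_val = split_sizes(n_train_plus_val, 0.1)
print(n_train, n_val, n_test)
```

Under that assumption the sizes agree with the lengths reported at the end of the notebook, (4479, 498, 1660), which is roughly 67.5% / 7.5% / 25% of the data.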


In [6]:
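#Preprocessing and text representation, the transformer way: tokenize with the pretrained model's own tokenizer,
#truncating long texts and padding shorter ones so all sequences in a set have the same length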
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
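
`padding=True` pads every sequence to the length of the longest one, and `truncation=True` cuts sequences that exceed the model's maximum length (512 positions for this model, per `max_position_embeddings` in its config). The sketch below imitates that behaviour on hand-written token IDs in plain Python. It is not the tokenizer's implementation, and `pad_and_truncate`, the ID values, and the tiny `max_length` are assumptions for illustration.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one."""
    # Simplification: a real tokenizer would also keep the closing special token.
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token IDs; real ones come from the tokenizer's vocabulary.
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]]
encodings = pad_and_truncate(batch, max_length=5)
print(encodings["input_ids"])
print(encodings["attention_mask"])
```

The result is column-wise: one list of `input_ids` and one list of `attention_mask` rows, with one entry per input text.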


In [7]:
#The encodings and labels need to be wrapped in a Dataset object, which is what the Trainer in
#Hugging Face's transformers library expects for training.

#Note: I am just following the online tutorial, changing only the class name.
import torch

class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
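
The tokenizer returns its output column-wise: one list per field, with one entry per example. `__getitem__` turns that into a single example by picking index `idx` from every column and attaching the label. A torch-free sketch of the same indexing, with made-up values (the class name `ToyDataset` is hypothetical):

```python
class ToyDataset:
    """Same indexing logic as MyDataset, but returning plain lists."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Pick row idx out of every column, then add the label.
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

toy_encodings = {
    "input_ids": [[101, 7, 102], [101, 9, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy = ToyDataset(toy_encodings, labels=[1, 0])
print(len(toy), toy[1])
```

The real class does the same thing but wraps each value in `torch.tensor`, so that the examples can be batched for the model.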

In [8]:
#Instantiate the class above to create the train, test, and validation datasets
train_dataset = MyDataset(train_encodings, train_labels)
test_dataset = MyDataset(test_encodings, test_labels)
val_dataset = MyDataset(val_encodings, val_labels)

In [9]:
#Import what is required for training
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_metric

In [11]:
#Specify training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,                # log the training loss every 10 steps
)

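#Specify evaluation metrics
import numpy as np  #needed by compute_metrics below

def compute_metrics(eval_preds):
    #load_metric takes one metric name; a second positional argument is read as a config name, not as another metric
    metric = load_metric("accuracy")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)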

#define the model (the classification head defaults to 2 labels, which matches our yes/no task)
model = BertForSequenceClassification.from_pretrained("bert-base-cased")

#instantiate the trainer
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
    compute_metrics=compute_metrics,     # metrics to compute at evaluation time
)


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/a8d257ba9925ef39f3036bfc338acf5283c512d9/config.json
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_vers

In [12]:
#train
trainer.train()

***** Running training *****
  Num examples = 4479
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 840


Step,Training Loss
10,0.4578
20,0.4796
30,0.4894
40,0.5578
50,0.4953
60,0.4851
70,0.5051
80,0.4939
90,0.5411
100,0.4811


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=840, training_loss=0.4087772304103488, metrics={'train_runtime': 1286.208, 'train_samples_per_second': 10.447, 'train_steps_per_second': 0.653, 'total_flos': 3535423250872320.0, 'train_loss': 0.4087772304103488, 'epoch': 3.0})
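
The numbers in the training log follow from the dataset size and the training arguments. A short arithmetic check in plain Python, assuming a single device and no gradient accumulation (as the log states) and that the last, partial batch of each epoch still counts as a step:

```python
import math

num_examples = 4479       # "Num examples" in the log
batch_size = 16           # per_device_train_batch_size
num_epochs = 3
train_runtime = 1286.208  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_share = 500 / total_steps  # warmup_steps=500

print(steps_per_epoch, total_steps)
print(round(warmup_share, 3))
print(round(total_steps / train_runtime, 3))
print(round(num_examples * num_epochs / train_runtime, 3))
```

This reproduces the 840 optimization steps and the reported throughput figures. It also shows that with `warmup_steps=500` the learning rate is still warming up for roughly 60% of this run, which may be worth revisiting for a dataset of this size.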

In [17]:
import numpy as np
#Predict on the held-out test set
predictions = trainer.predict(test_dataset)
print(predictions.predictions.shape, predictions.label_ids.shape)
preds = np.argmax(predictions.predictions, axis=-1)  #pick the higher-scoring class for each example
metric = load_metric("accuracy")  #a second positional argument would be read as a config name, not another metric
print(metric.compute(predictions=preds, references=predictions.label_ids))

***** Running Prediction *****
  Num examples = 1660
  Batch size = 64


(1660, 2) (1660,)
{'accuracy': 0.8090361445783133}
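
The prediction array has shape (1660, 2): two raw scores (logits) per test example, one per class. `np.argmax(..., axis=-1)` picks the index of the larger score in each row. The same step in plain Python on made-up logits (the values and the helper names are illustrative):

```python
def argmax_rows(logits):
    """Index of the largest score in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def accuracy(predictions, references):
    """Fraction of predictions that equal the reference label."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

toy_logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, -0.1], [1.7, 1.2]]
toy_labels = [0, 1, 0, 0]

toy_preds = argmax_rows(toy_logits)
print(toy_preds, accuracy(toy_preds, toy_labels))
```

Here the third example is misclassified, so three of the four toy predictions are correct.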


In [18]:
from sklearn.metrics import confusion_matrix
#Rows are true labels and columns are predicted labels, both ordered [1, 0]
print(confusion_matrix(predictions.label_ids, preds, labels=[1,0]))

[[ 117  172]
 [ 145 1226]]
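
Accuracy alone hides how the model does on the minority "yes" class. Since the matrix above is ordered `labels=[1,0]` with true labels on the rows, precision, recall, and F1 for class 1 can be read off directly. A plain-Python sketch using the four counts printed above (the variable names are mine):

```python
# Counts from the confusion matrix, rows = true label, columns = predicted.
tp, fn = 117, 172    # true 1: predicted 1, predicted 0
fp, tn = 145, 1226   # true 0: predicted 1, predicted 0

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the majority class (0).
majority_baseline = (fp + tn) / total

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} baseline={majority_baseline:.3f}")
```

On this run the accuracy (about 0.809) is slightly below the majority-class baseline (about 0.826), and F1 for the "yes" class is about 0.42, so accuracy by itself overstates how well the relevant articles are found.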


In [20]:
len(train_labels), len(val_labels), len(test_labels)

(4479, 498, 1660)