This notebook illustrates how we can use the transformers library to fine-tune a text classification model on our own dataset. I am using the same dataset we used with the other (relatively) old-style approaches. Keep in mind, however, that this training takes some time and needs some GPU resources.

Based on: https://huggingface.co/transformers/v3.4.0/custom_datasets.html 

In [105]:
#install the required libraries
!pip install transformers
!pip install datasets
!pip install pandas
!pip install scikit-learn

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.[0m


In [106]:
#import what we need later
import datasets
from datasets import load_dataset
from datasets import Dataset, DatasetDict

import pandas as pd

from sklearn.model_selection import train_test_split


In [107]:
#Read the csv file containing our data
our_data = pd.read_csv("Full-Economic-News-DFE-839861.csv", encoding="ISO-8859-1")

In [108]:
#Pick the two columns we need from this data (text, relevance), and take only those where relevance is either a Yes or No.
#There seem to be some NaNs.
mylen = len(our_data["text"].tolist())
mytexts = [] #will contain the text strings
mylabels = [] #will contain the label as 1 or 0 (Yes or No respectively)
for i in range(0,mylen):
    if str(our_data['relevance'][i]) == 'yes':
        mytexts.append(str(our_data["text"][i]))
        mylabels.append(1)
    elif str(our_data["relevance"][i]) == "no":
        mytexts.append(str(our_data["text"][i]))
        mylabels.append(0)
    else:
        print("skipping")
len(mytexts)
len(mylabels)

skipping
skipping
skipping
skipping
skipping
skipping
skipping
skipping
skipping


7991
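
The cell above only reports how many examples were kept, so a quick look at the label distribution can be useful before splitting. The following is a minimal sketch using only the standard library; `toy_labels` is a made-up list standing in for `mylabels`, and `label_distribution` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: (count, fraction)} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (n, n / total) for label, n in counts.items()}

# toy_labels is illustrative only; in the notebook you would pass mylabels
toy_labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
dist = label_distribution(toy_labels)
for label, (n, frac) in sorted(dist.items()):
    print(f"label {label}: {n} examples ({frac:.0%})")
```

If one class dominates, plain accuracy will look good even for a model that always predicts that class.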

In [109]:
#Split the data into train, validation, and test sets. train_test_split only makes a two-way split, so we call it twice.
train_texts, test_texts, train_labels, test_labels = train_test_split(mytexts, mylabels, test_size=.25)
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.1)
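
The two calls above split at random, so the proportion of relevant and non-relevant texts may differ slightly across the three sets, and the split changes on every run. One common option (not used in this notebook) is to pass `stratify` and a fixed `random_state` to `train_test_split`. A minimal sketch on made-up data, where `toy_texts` and `toy_labels` stand in for `mytexts` and `mylabels`:

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for mytexts / mylabels: 80 negatives, 20 positives
toy_texts = [f"text {i}" for i in range(100)]
toy_labels = [0] * 80 + [1] * 20

# First split off the test set, keeping the label ratio the same in both parts
train_x, test_x, train_y, test_y = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42
)
# Then carve a validation set out of the remaining training data
train_x, val_x, train_y, val_y = train_test_split(
    train_x, train_y, test_size=0.1, stratify=train_y, random_state=42
)

print(len(train_x), len(val_x), len(test_x))
print(sum(train_y) / len(train_y), sum(test_y) / len(test_y))
```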


In [110]:
#preprocessing and text representation, transformer way
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)


loading configuration file https://huggingface.co/bert-base-cased/resolve/main/config.json from cache at /Users/Vajjalas/.cache/huggingface/transformers/a803e0468a8fe090683bdc453f4fac622804f49de86d7cecaee92365d4a0f829.a64a22196690e0e82ead56f388a3ef3a50de93335926ccfa20610217db589307
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.21.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading file https://huggingface.co/bert-base-cased/resolv

In [111]:
#Our encodings and labels should be wrapped in a Dataset object, which is what Huggingface's transformers
#library uses for training

#Note: I am just following the online tutorial, changing the class name. 
import torch

class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [112]:
#Instantiate the above class to create train, test, and validation datasets
train_dataset = MyDataset(train_encodings, train_labels)
test_dataset = MyDataset(test_encodings, test_labels)
val_dataset = MyDataset(val_encodings, val_labels)

In [113]:
#Import what is required for training
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_metric

In [114]:
#Specify training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

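#Specify evaluation metrics
import numpy as np  # needed by compute_metrics below

def compute_metrics(eval_preds):
    # load_metric loads one metric per call; a second positional argument is
    # treated as a configuration name, so only accuracy is computed here
    metric = load_metric("accuracy")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)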

#define the model
model = BertForSequenceClassification.from_pretrained("bert-base-cased")

#instantiate the trainer
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
    compute_metrics=compute_metrics,     # specify metrics
)



PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
loading configuration file https://huggingface.co/bert-base-cased/resolve/main/config.json from cache at /Users/Vajjalas/.cache/huggingface/transformers/a803e0468a8fe090683bdc453f4fac622804f49de86d7cecaee92365d4a0f829.a64a22196690e0e82ead56f388a3ef3a50de93335926ccfa20610217db589307
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads":
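
Note that `load_metric` loads a single metric per call, so the `compute_metrics` function in this cell reports accuracy only. If F1 is also wanted, one possibility is to compute both with scikit-learn. The sketch below assumes the same `(logits, labels)` input as above; the name `compute_metrics_with_f1` and the toy arrays are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics_with_f1(eval_preds):
    """Hypothetical alternative to compute_metrics: accuracy plus F1."""
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        # F1 for the positive class (label 1); zero_division=0 avoids a warning
        # when the model never predicts that class
        "f1": f1_score(labels, predictions, pos_label=1, zero_division=0),
    }

# Toy logits for four examples; the model picks class 0 every time
toy_logits = np.array([[2.0, 0.1], [1.5, 0.3], [0.9, 0.2], [1.1, 0.4]])
toy_labels = np.array([0, 0, 0, 1])
scores = compute_metrics_with_f1((toy_logits, toy_labels))
print(scores)
```

In the toy case accuracy is high while F1 for the positive class is zero, which is the kind of gap accuracy alone would hide.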

In [115]:
#train
trainer.train()


***** Running training *****
  Num examples = 5393
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 338


Step,Training Loss
10,0.6237
20,0.5984
30,0.5837
40,0.5075
50,0.5225
60,0.501
70,0.422
80,0.4722
90,0.5143
100,0.4382




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=338, training_loss=0.4652364486773339, metrics={'train_runtime': 4282.6009, 'train_samples_per_second': 1.259, 'train_steps_per_second': 0.079, 'total_flos': 1418957921556480.0, 'train_loss': 0.4652364486773339, 'epoch': 1.0})
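
The numbers in the log can be checked by hand: 5393 training examples at a batch size of 16 give 338 optimization steps for one epoch, which is fewer than the `warmup_steps=500` set in the training arguments. Assuming the Trainer's default linear schedule and a default peak learning rate of 5e-5 (neither is set explicitly in this notebook), the learning rate would still be ramping up when training stops. A small sketch of that arithmetic; `linear_warmup_lr` is a hypothetical helper written for illustration, not a library function.

```python
import math

def linear_warmup_lr(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup followed by linear decay to zero (sketch of a linear schedule)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

num_examples = 5393   # from the training log above
batch_size = 16       # per_device_train_batch_size
total_steps = math.ceil(num_examples / batch_size)  # one epoch

peak_lr = 5e-5        # assumed default; not set in the notebook
last_lr = linear_warmup_lr(total_steps, peak_lr, warmup_steps=500, total_steps=total_steps)
print(total_steps, last_lr)
```

Under these assumptions the learning rate never reaches its peak within the single epoch, so lowering `warmup_steps` or training for longer may be worth trying.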

In [116]:
import numpy as np
#predict
predictions = trainer.predict(test_dataset)
print(predictions.predictions.shape, predictions.label_ids.shape)
preds = np.argmax(predictions.predictions, axis=-1)
#only accuracy is computed here; load_metric loads one metric per call
metric = load_metric('accuracy')
print(metric.compute(predictions=preds, references=predictions.label_ids))

***** Running Prediction *****
  Num examples = 1998
  Batch size = 64


(1998, 2) (1998,)
{'accuracy': 0.8113113113113113}


In [119]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(predictions.label_ids, preds, labels=[1,0]))


[[   0  377]
 [   0 1621]]


Whaat?? After all that time, did it just learn to predict the majority class, for which we don't need any learning at all?? :O The model predicted 0 for every single test example.
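
To read the matrix: with `labels=[1,0]`, the rows are the true labels (1, then 0) and the columns are the predicted labels in the same order. The short sketch below recomputes accuracy and per-class recall from the four counts printed above, using plain Python only.

```python
# Counts copied from the confusion matrix above (rows: true 1, true 0;
# columns: predicted 1, predicted 0)
cm = [[0, 377],
      [0, 1621]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total   # correct predictions on the diagonal
recall_pos = cm[0][0] / sum(cm[0])         # share of true 1s predicted as 1
recall_neg = cm[1][1] / sum(cm[1])         # share of true 0s predicted as 0
majority_baseline = max(sum(row) for row in cm) / total

print(f"accuracy={accuracy:.4f} recall(1)={recall_pos:.2f} recall(0)={recall_neg:.2f}")
print(f"always predicting the majority class gives {majority_baseline:.4f}")
```

The accuracy reported earlier is therefore identical to what a constant "always 0" predictor would score on this test set.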

Note that I trained for only one epoch, whereas we usually train for more.

See https://discuss.huggingface.co/t/dealing-with-imbalanced-datasets/4328/2
for some discussion on why this could have happened.
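
A common remedy for imbalanced data in general is to weight the loss by inverse class frequency. As a rough sketch (not something this notebook does), such weights could be computed as below. The counts are taken from the test-set confusion matrix above purely for illustration, since the training-set counts are not printed, and `balanced_class_weights` is a hypothetical helper. Weights like these could then be passed to a loss such as `torch.nn.CrossEntropyLoss(weight=...)`.

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency weights: n_samples / (n_classes * count) per class."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Illustrative counts (from the test set above; training counts are not shown)
counts = {0: 1621, 1: 377}
weights = balanced_class_weights(counts)
for label in sorted(weights):
    print(f"class {label}: weight {weights[label]:.3f}")
```

The rarer class gets the larger weight, so mistakes on it cost more during training.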