
# Text Classification

We will use the [distilled version of the BERT base model](https://huggingface.co/distilbert-base-uncased) on a [dataset with news articles](https://huggingface.co/datasets/ag_news) from HuggingFace.

The dataset consists of 120,000 training and 7,600 test samples divided into 4 classes: `World` (0), `Sports` (1), `Business` (2), and `Sci/Tech` (3).

In [1]:
!pip install -qq transformers[torch] datasets

In [2]:
DATASET = 'ag_news'
NUM_LABELS = 4
MODEL = 'distilbert-base-uncased'

Load the dataset with news articles:

In [3]:
from datasets import load_dataset

dataset = load_dataset(DATASET)
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

Check the format of one sample from our dataset:

In [4]:
dataset['train'][0]

{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.",
 'label': 2}

Check whether our dataset is balanced (get the number of samples from each class):

In [5]:
import numpy as np

def check_class_balance(class_labels):
  values, counts = np.unique(class_labels, return_counts=True)
  return values, counts

check_class_balance(dataset['train']['label'])

(array([0, 1, 2, 3]), array([30000, 30000, 30000, 30000]))

Load the tokenizer and have a look at its special tokens:

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

*What do these tokens mean?*
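
As a rough orientation for the question above, the sketch below labels a short ID sequence using the special-token IDs from the tokenizer printout (`[PAD]`=0, `[UNK]`=100, `[CLS]`=101, `[SEP]`=102, `[MASK]`=103). The helper `annotate` and the example sequence are hypothetical, and the role descriptions follow common BERT usage rather than anything stated in this notebook.

```python
# Special-token IDs copied from the tokenizer printout above.
SPECIAL_TOKENS = {
    0: ("[PAD]", "filler that brings every sequence to the same length"),
    100: ("[UNK]", "stands in for text the vocabulary cannot represent"),
    101: ("[CLS]", "start-of-sequence marker; its output feeds the classifier"),
    102: ("[SEP]", "marks the end of a sequence or separates two segments"),
    103: ("[MASK]", "hides a token during masked-language-model pretraining"),
}

def annotate(input_ids):
    """Label each ID as a special token or an ordinary vocabulary entry."""
    return [SPECIAL_TOKENS.get(i, ("<word piece>", ""))[0] for i in input_ids]

# Shortened, illustrative sequence: [CLS], three word pieces, [SEP], two pads.
example_ids = [101, 2813, 2358, 1012, 102, 0, 0]
print(annotate(example_ids))
```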

Check exactly what the tokenizer returns when applied to one sample:

In [7]:
first_sample_text = dataset['train'][0]['text']
first_sample_text

"Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again."

In [None]:
# TODO
# hint: use tokenizer.tokenize(), tokenizer.convert_tokens_to_ids(), tokenizer.decode()
# Split the first sample into WordPiece tokens (no special tokens are added here)
tok = tokenizer.tokenize(first_sample_text)
tok

In [None]:
# Map each token to its integer ID in the vocabulary
tok_ids = tokenizer.convert_tokens_to_ids(tok)
tok_ids

In [None]:
# Turn the IDs back into a (lowercased) string
tok_decode = tokenizer.decode(tok_ids)
tok_decode

Compare it to what is returned when we use the `preprocess_function`:



In [11]:
def preprocess_function(examples):
  # https://huggingface.co/docs/transformers/pad_truncation
  # truncation=True and padding='max_length' -> pads sequences with [PAD] token to given max sequence length
  return tokenizer(examples['text'], truncation=True, padding='max_length', return_tensors='pt')

first_sample_tokenized = preprocess_function(dataset['train'][0])
first_sample_tokenized

{'input_ids': tensor([[  101,  2813,  2358,  1012,  6468, 15020,  2067,  2046,  1996,  2304,
          1006, 26665,  1007, 26665,  1011,  2460,  1011, 19041,  1010,  2813,
          2395,  1005,  1055,  1040, 11101,  2989,  1032,  2316,  1997, 11087,
          1011, 22330,  8713,  2015,  1010,  2024,  3773,  2665,  2153,  1012,
           102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,  
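
The long run of zeros above is `[PAD]` (ID 0) filling the sequence up to the tokenizer's `model_max_length` of 512. A minimal pure-Python sketch of that padding and truncation step follows. `pad_or_truncate` is a hypothetical helper, not the library's implementation, and it ignores details such as keeping `[SEP]` as the final token when a sequence is cut.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(input_ids)[:max_length]           # cut sequences that are too long
    mask = [1] * len(ids)                        # 1 = real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 = padding

ids, mask = pad_or_truncate([101, 2813, 2358, 1012, 102], max_length=8)
print(ids)   # [101, 2813, 2358, 1012, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

In the output above, 41 of the 512 positions hold real tokens, so 471 are padding. Padding each batch only to its longest sequence would likely reduce that overhead.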

Preprocess more samples from our dataset at once:

In [12]:
# training on the whole dataset would take more than 5 hours :(
# train_dataset = dataset['train'].map(preprocess_function, batched=True)
# test_dataset = dataset['test'].map(preprocess_function, batched=True)

train_dataset = dataset['train'].shuffle(seed=42).select(range(1000)).map(preprocess_function, batched=True)
test_dataset = dataset['test'].shuffle(seed=42).select(range(500)).map(preprocess_function, batched=True)

In [13]:
check_class_balance(train_dataset['label'])

(array([0, 1, 2, 3]), array([244, 243, 242, 271]))

In [14]:
check_class_balance(test_dataset['label'])

(array([0, 1, 2, 3]), array([120, 121, 134, 125]))
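
The shuffled subsets are only approximately balanced (for example, 271 `Sci/Tech` against 242 `Business` training samples). If exact balance matters, one option is to draw the same number of indices from each class. The sketch below does this with the standard library only. `stratified_indices` is a hypothetical helper, and the toy label list stands in for `dataset['train']['label']`; the resulting indices could then be passed to `select()`.

```python
import random
from collections import defaultdict

def stratified_indices(labels, per_class, seed=42):
    """Pick `per_class` random indices for every label value."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    chosen = []
    for label in sorted(by_label):
        chosen.extend(rng.sample(by_label[label], per_class))
    rng.shuffle(chosen)   # avoid blocks of identical labels
    return chosen

# Toy labels: 10 samples for each of the 4 classes.
toy_labels = [0, 1, 2, 3] * 10
picked = stratified_indices(toy_labels, per_class=3)
print(sorted(toy_labels[i] for i in picked))
```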

Load the model:

In [15]:
from transformers import AutoModelForSequenceClassification

id2label = {0: 'World', 1: 'Sports', 2: 'Business', 3: 'Sci/Tech'}
label2id = {'World': 0, 'Sports': 1, 'Business': 2, 'Sci/Tech': 3}

model = AutoModelForSequenceClassification.from_pretrained(MODEL,
                                                           num_labels=NUM_LABELS,
                                                           id2label=id2label,
                                                           label2id=label2id)
model

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

Define evaluation metrics and train our model:

In [16]:
from transformers import TrainingArguments, Trainer
from sklearn.metrics import accuracy_score
import numpy as np

def compute_metrics(p):
    logits = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(logits, axis=1)
    return {'accuracy': accuracy_score(p.label_ids, preds)}

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=16,
    evaluation_strategy='epoch',
    learning_rate=5e-5,
    weight_decay=0.0
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.429343,0.868
2,No log,0.412822,0.87


TrainOutput(global_step=126, training_loss=0.49181365966796875, metrics={'train_runtime': 111.4888, 'train_samples_per_second': 17.939, 'train_steps_per_second': 1.13, 'total_flos': 264944246784000.0, 'train_loss': 0.49181365966796875, 'epoch': 2.0})
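
The `global_step=126` in this output can be checked by hand. The training class counts printed earlier sum to 1000 samples, and with a batch size of 16 that is 63 batches per epoch, or 126 steps over 2 epochs. The sketch assumes a single device and no gradient accumulation; `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(num_samples, batch_size, num_epochs):
    """Steps = batches per epoch (last partial batch included) x epochs."""
    return math.ceil(num_samples / batch_size) * num_epochs

print(total_optimizer_steps(1000, 16, 2))  # 126
```

The same formula gives 834 for the final run in this notebook (5000 samples, batch size 18, 3 epochs). The "No log" entries are consistent with the training loss being logged only every 500 steps, which a 126-step run never reaches; that interval is an assumption about the default `logging_steps` of the installed version.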

Use the trained model to get a prediction for a sentence of your choice using `pipeline`:

https://huggingface.co/docs/transformers/main_classes/pipelines


In [17]:
from transformers import TextClassificationPipeline

# TODO

pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=0)  # device=0 selects the first CUDA GPU
pipe

<transformers.pipelines.text_classification.TextClassificationPipeline at 0x7cc962c88970>

In [18]:
pipe('Climbing is good.')

[{'label': 'Sci/Tech', 'score': 0.5061569809913635}]

In [19]:
pipe('Olive garden.')

[{'label': 'Sci/Tech', 'score': 0.40717989206314087}]

In [20]:
# Reducing size of training set to make brute force quicker

train_dataset = dataset['train'].shuffle(seed=42).select(range(500)).map(preprocess_function, batched=True)
#test_dataset = dataset['test'].shuffle(seed=42).select(range(500)).map(preprocess_function, batched=True)

In [21]:
def compute_metrics(p):
    logits = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(logits, axis=1)
    return {'accuracy': accuracy_score(p.label_ids, preds)}

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=16,
    evaluation_strategy='epoch',
    learning_rate=5e-5,
    weight_decay=0.0
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

train1 = trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.492149,0.848
2,No log,0.441905,0.882


In [27]:
train1.training_loss  # same value as train1[1]

0.2163592278957367

In [29]:
lr_values = [4e-5, 5e-5, 6e-5]
b_size_values = [14, 16, 18]
wd_values = [0.0, 0.01, 0.02]
loss = []

for lr in lr_values:
  for b_size in b_size_values:
    for wd in wd_values:
      training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=2,
        per_device_train_batch_size=b_size,
        evaluation_strategy='epoch',
        learning_rate=lr,
        weight_decay=wd
      )

      trainer = Trainer(
          model=model,
          args=training_args,
          train_dataset=train_dataset,
          eval_dataset=test_dataset,
          compute_metrics=compute_metrics
      )
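      # NOTE: `model` is the same object in every iteration, so each run continues
      # fine-tuning from the previous run's weights instead of starting fresh.
      train2 = trainer.train()
      # TrainOutput is (global_step, training_loss, metrics); store the training loss
      loss.append(train2.training_loss)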

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.989753,0.866
2,No log,2.192467,0.864


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.207983,0.868
2,No log,2.327299,0.864


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.225353,0.866
2,No log,2.203234,0.868


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.198649,0.87
2,No log,2.194458,0.87


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.194287,0.87
2,No log,2.191185,0.87


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.191211,0.87
2,No log,2.189443,0.87


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.149255,0.87
2,No log,2.119405,0.874


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.106393,0.874
2,No log,2.094481,0.874


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.090637,0.874
2,No log,2.08646,0.874


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.083107,0.876
2,No log,2.082568,0.876


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.080036,0.876
2,No log,2.079791,0.876


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.077883,0.876
2,No log,2.077841,0.876


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.079457,0.876
2,No log,2.079591,0.876


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.081124,0.876
2,No log,2.081256,0.876


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.08272,0.876
2,No log,2.082849,0.876


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.081385,0.876
2,No log,2.07903,0.876


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.077859,0.876
2,No log,2.07583,0.876


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.074886,0.876
2,No log,2.073135,0.876


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.073152,0.878
2,No log,2.073803,0.878


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.073936,0.878
2,No log,2.074595,0.878


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.074801,0.878
2,No log,2.07546,0.878


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.077315,0.878
2,No log,2.077767,0.878


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.079506,0.878
2,No log,2.079925,0.878


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.081542,0.878
2,No log,2.081927,0.878


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.081692,0.878
2,No log,2.080471,0.878


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.08031,0.878
2,No log,2.079237,0.878


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,2.079158,0.878
2,No log,2.07822,0.88


In [30]:
loss

[1.915020725896789e-05,
 2.1098069661699508e-08,
 1.3008947992905935e-09,
 2.3283062977608182e-10,
 1.1641531488804091e-10,
 0.0,
 2.280788129788951e-09,
 5.068421872676612e-10,
 2.3652633819791294e-10,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 2.3652633819791294e-10,
 1.1826317702912092e-10,
 1.1826317702912092e-10,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0]
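
The training losses above shrink to exactly 0.0 while the validation loss stays near 2.08. That pattern is what one would expect when every configuration keeps fine-tuning the same `model` object on the same 500 samples: later runs start from already-fitted weights, so the numbers mostly reflect run order rather than the hyperparameters. A common alternative is to build a fresh model for each configuration and rank configurations by a validation metric. The sketch below shows that pattern on a toy one-parameter regression so that it runs anywhere; `make_model`, `train`, and `validation_loss` are hypothetical stand-ins, not Trainer APIs.

```python
from itertools import product

train_data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]   # toy data: y = 3x
val_data = [(4.0, 12.0), (5.0, 15.0)]

def make_model():
    """Fresh, untrained model for every configuration."""
    return {"w": 0.0}

def train(model, lr, wd, epochs=20):
    """Plain SGD on squared error with an L2 weight-decay term."""
    for _ in range(epochs):
        for x, y in train_data:
            grad = 2 * (model["w"] * x - y) * x + wd * model["w"]
            model["w"] -= lr * grad
    return model

def validation_loss(model):
    """Mean squared error on the held-out pairs."""
    return sum((model["w"] * x - y) ** 2 for x, y in val_data) / len(val_data)

results = []
for lr, wd in product([0.001, 0.01, 0.05], [0.0, 0.01, 0.02]):
    model = train(make_model(), lr, wd)      # no weights carried over
    results.append((validation_loss(model), lr, wd))

best_loss, best_lr, best_wd = min(results)
print(f"best: lr={best_lr}, wd={best_wd}, val_loss={best_loss:.6f}")
```

With the Trainer, the analogous step would presumably be to reload the checkpoint with `from_pretrained` inside the loop (or to pass a `model_init` callable) and to compare the evaluation loss or accuracy rather than the training loss.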

In [31]:
# Enlarging training set

train_dataset = dataset['train'].shuffle(seed=42).select(range(5000)).map(preprocess_function, batched=True)

# Setting the chosen parameters (lr = 6e-5, wd = 0.02, bs = 18) and increasing the number of epochs

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=18,
    evaluation_strategy='epoch',
    learning_rate=6e-5,
    weight_decay=0.02
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

# Training of final model

trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.35559,0.898
2,0.337400,0.362378,0.89
3,0.337400,0.368905,0.922


TrainOutput(global_step=834, training_loss=0.24193972763683586, metrics={'train_runtime': 705.016, 'train_samples_per_second': 21.276, 'train_steps_per_second': 1.183, 'total_flos': 1987081850880000.0, 'train_loss': 0.24193972763683586, 'epoch': 3.0})

What happens when we try to predict the label of a sentence that actually belongs to a class that wasn't in our data?

Is it correct behaviour?
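
A classifier with a softmax output, as is usual for single-label classification, always distributes a total probability of 1 over the four known classes. An out-of-scope sentence therefore still receives one of the four labels, often with a lower score (0.51 and 0.41 in the pipeline outputs above). One possible mitigation, sketched below with made-up logits and a hypothetical threshold of 0.6, is to refuse to predict when the top probability is low. This is only a heuristic, since a model can also be confidently wrong.

```python
import math

LABELS = ["World", "Sports", "Business", "Sci/Tech"]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_or_reject(logits, threshold=0.6):
    """Return (label, score), or ('unknown', score) below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = LABELS[best] if probs[best] >= threshold else "unknown"
    return label, probs[best]

print(predict_or_reject([0.2, 0.1, 0.3, 0.9]))   # made-up, nearly flat logits
print(predict_or_reject([0.1, 4.0, 0.2, 0.3]))   # made-up, confident logits
```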

How can we improve the performance of our model?