<a href="https://www.kaggle.com/code/morisdibil/notebook1b71ce664d?scriptVersionId=91377046" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
! pip install datasets
! pip install transformers

In [None]:
from datasets import load_dataset
dataset = load_dataset('yahoo_answers_topics') 

In [3]:
from transformers import (ElectraTokenizer, ElectraForSequenceClassification,
                          get_scheduler, pipeline, ElectraForMaskedLM, ElectraModel)

import torch

from torch.utils.data import DataLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
MODEL_NAME = "google/electra-small-generator"
TOKENIZER_NAME = "google/electra-small-generator"

In [43]:
fill_mask = pipeline(
    "fill-mask",
    model=MODEL_NAME,
    tokenizer=MODEL_NAME
)

In [44]:
ex = ["Why don't you ask [MASK]?",
      "What is [MASK]",
      "Let's talk about [MASK] physics"]

for text in ex:
    # print the top-scoring completion for each prompt
    print(fill_mask(text)[0]['sequence'])

why don't you ask me?
what is?
let's talk about quantum physics


### Model downloading and data preparation

In [11]:
from transformers import ElectraForSequenceClassification, ElectraConfig, ElectraTokenizer

config = ElectraConfig()
tokenizer = ElectraTokenizer.from_pretrained(TOKENIZER_NAME)
model = ElectraForSequenceClassification.from_pretrained(MODEL_NAME)

# Replace the classification head: 256 -> 64 -> 10 logits (one per topic)
model.classifier.dense = torch.nn.Linear(256, 64)
model.classifier.out_proj = torch.nn.Sequential(
    torch.nn.LeakyReLU(),
    torch.nn.Linear(64, 10))
model.to(device)

Some weights of the model checkpoint at google/electra-small-generator were not used when initializing ElectraForSequenceClassification: ['generator_predictions.LayerNorm.bias', 'generator_predictions.LayerNorm.weight', 'generator_lm_head.bias', 'generator_predictions.dense.bias', 'generator_predictions.dense.weight', 'generator_lm_head.weight']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-generator and are newly initializ

ElectraForSequenceClassification(
  (electra): ElectraModel(
    (embeddings): ElectraEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (embeddings_project): Linear(in_features=128, out_features=256, bias=True)
    (encoder): ElectraEncoder(
      (layer): ModuleList(
        (0): ElectraLayer(
          (attention): ElectraAttention(
            (self): ElectraSelfAttention(
              (query): Linear(in_features=256, out_features=256, bias=True)
              (key): Linear(in_features=256, out_features=256, bias=True)
              (value): Linear(in_features=256, out_features=256, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): ElectraSelfOutput(
              (dense): Linear(in_

In [18]:
from sklearn.model_selection import train_test_split
import numpy as np

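# Sample 80,000 distinct row indices (no replacement), so that no question
# ends up in both the training and the validation split
ids = np.random.choice(1400000, size=80000, replace=False)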
x = np.array(dataset['train']['question_title'])[ids]
y = np.array(dataset['train']['topic'])[ids]
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.1, stratify=y)
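
For a sense of scale, the expected number of repeated rows when indices are drawn with replacement (which is how `np.random.randint` samples) can be estimated in closed form. The sketch below assumes 1,400,000 training rows and 80,000 draws, the numbers used in the cell above; the helper name `expected_distinct` is made up for illustration.

```python
def expected_distinct(population: int, draws: int) -> float:
    """Expected number of distinct items after `draws` uniform draws with replacement."""
    # Each item is missed by all draws with probability (1 - 1/population) ** draws.
    return population * (1.0 - (1.0 - 1.0 / population) ** draws)

n_rows, n_draws = 1_400_000, 80_000
distinct = expected_distinct(n_rows, n_draws)
repeats = n_draws - distinct
print(f"expected distinct rows: {distinct:.0f}, expected repeats: {repeats:.0f}")
```

A repeated row can land in both the training and the validation split, which would make the validation score slightly optimistic. Sampling without replacement avoids this.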

In [21]:
from collections import Counter
Counter(y)
# classes are almost balanced

Counter({2: 7956,
         8: 7887,
         3: 7971,
         0: 8148,
         4: 8003,
         9: 7878,
         5: 8104,
         7: 7927,
         6: 8111,
         1: 8015})
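
To put a number on "almost balanced", the class shares can be computed directly from the counts printed above. This is only a small sketch; the counts are copied by hand from that output, and the `imbalance` ratio is just one possible way to summarise the spread.

```python
from collections import Counter

# Class counts copied from the Counter output above
counts = Counter({2: 7956, 8: 7887, 3: 7971, 0: 8148, 4: 8003,
                  9: 7878, 5: 8104, 7: 7927, 6: 8111, 1: 8015})

total = sum(counts.values())
shares = {label: n / total for label, n in sorted(counts.items())}
# Ratio of the most frequent class to the least frequent one
imbalance = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"class {label}: {share:.3%}")
print(f"largest / smallest class: {imbalance:.3f}")
```

Every class holds close to 10% of the sample, so plain accuracy and weighted F1 are both reasonable metrics here.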

In [22]:
x_train = tokenizer(x_train.tolist(), padding="max_length", truncation=True, return_tensors='pt')
x_val = tokenizer(x_val.tolist(), padding="max_length", truncation=True, return_tensors='pt')

In [23]:
from torch.utils.data import DataLoader, Dataset

class EDataset(Dataset):
    def __init__(self, x, y):
        super().__init__()
        self.x = x
        self.y = y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, index):
        input_ids = self.x['input_ids'][index]
        token_type_ids = self.x['token_type_ids'][index]
        att_mask = self.x['attention_mask'][index]
        y_out = self.y[index]
        
        out = {'input_ids': input_ids,
               'token_type_ids': token_type_ids,
               'attention_mask': att_mask}

        return out, y_out

train_dataset = EDataset(x_train, y_train)
val_dataset = EDataset(x_val, y_val)

In [30]:
from tqdm import tqdm
import numpy as np
from sklearn.metrics import f1_score

def fit_epoch(model, train_loader, criterion, optimizer):
    model.train()
    running_loss = []
    acc = []
    for inputs, labels in train_loader:
        input_ids = inputs['input_ids'].to(device)
        attention_mask = inputs['attention_mask'].to(device)
        token_type_ids = inputs['token_type_ids'].to(device)
        labels = labels.to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        loss = criterion(outputs['logits'], labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        preds = torch.argmax(outputs['logits'], 1)
        # fraction of correct predictions in this batch
        acc.append((preds == labels).float().mean().item())

        running_loss.append(loss.item())
    return np.mean(running_loss), np.mean(acc)

def eval_epoch(model, val_loader, criterion):
    model.eval()
    running_loss = []
    acc = []

    for inputs, labels in val_loader:
        input_ids = inputs['input_ids'].to(device)
        attention_mask = inputs['attention_mask'].to(device)
        token_type_ids = inputs['token_type_ids'].to(device)
        labels = labels.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
            loss = criterion(outputs['logits'], labels)

            preds = torch.argmax(outputs['logits'], 1)
            # weighted F1 of this batch (the epoch value is the mean over batches)
            f1s = f1_score(labels.cpu(), preds.cpu(), average='weighted')

            acc.append(f1s)
            running_loss.append(loss.item())

    return np.array(running_loss).mean(), np.array(acc).mean()

def train(train_dataset, val_dataset, model, epochs, 
          batch_size, opt, criterion, scheduler=None, save_best=None):
    
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    # evaluation does not need shuffling
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

    history = []
    best_val_loss = float('inf')
    log_template = "\nEpoch {ep:03d} train_loss: {t_loss:0.4f} train_acc: {t_acc:0.4f} val_loss {v_loss:0.4f} val_f1 {v_acc:0.4f}"
    with tqdm(desc="epoch", total=epochs) as pbar_outer:

        for epoch in range(epochs):
            train_loss, acc = fit_epoch(model, train_loader, criterion, opt)
            
            val_loss, v_acc = eval_epoch(model, val_loader, criterion)
            history.append((train_loss, acc, val_loss, v_acc))
            
            if scheduler:
                scheduler.step()

            if save_best:
                if val_loss < best_val_loss:
                    best_val_loss = val_loss
                    torch.save(model.state_dict(), save_best)
            
            pbar_outer.update(1)
            tqdm.write(log_template.format(ep=epoch+1, t_loss=train_loss, t_acc=acc, v_loss=val_loss, v_acc=v_acc))
            print('')
            
    if save_best:
        # restore the weights with the lowest validation loss
        model.load_state_dict(torch.load(save_best, map_location=device))

    return history
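
One caveat about `val_f1`: `eval_epoch` averages a weighted F1 computed separately on each batch, and this is not the same number as the weighted F1 of the whole validation set. The toy sketch below illustrates the gap with a hand-written `weighted_f1` helper (a hypothetical pure-Python stand-in that is meant to mirror `f1_score(..., average='weighted')`, not the scikit-learn implementation itself) and made-up labels.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (pure-Python sketch)."""
    total = 0.0
    for c in set(y_true):  # classes absent from y_true have zero support
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * y_true.count(c)
    return total / len(y_true)

# Made-up labels, only for illustration
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

# F1 over the whole set at once
global_f1 = weighted_f1(y_true, y_pred)

# Mean of per-batch F1 with batch size 2, the way eval_epoch aggregates it
batches = [(y_true[i:i + 2], y_pred[i:i + 2]) for i in range(0, len(y_true), 2)]
batch_mean_f1 = sum(weighted_f1(t, p) for t, p in batches) / len(batches)

print(f"whole set: {global_f1:.4f}, mean over batches: {batch_mean_f1:.4f}")
```

Collecting all predictions and labels first and calling the metric once would give the whole-set value. The per-batch mean is still usable for comparing runs, as long as the batch size stays the same.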

### Fine-tuning the whole model

In [31]:
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 2, gamma=0.1, 
                                            last_epoch=-1, verbose=True)

history = train(train_dataset, val_dataset, model, epochs=10, batch_size=32, 
                opt=optimizer, criterion=criterion, scheduler=scheduler) 

Adjusting learning rate of group 0 to 1.0000e-03.


epoch:  10%|█         | 1/10 [14:16<2:08:29, 856.57s/it]

Adjusting learning rate of group 0 to 1.0000e-03.

Epoch 001 train_loss: 2.3036 train_acc: 0.1003 val_loss 2.3026 val_f1 0.0229



epoch:  20%|██        | 2/10 [28:33<1:54:15, 856.89s/it]

Adjusting learning rate of group 0 to 1.0000e-04.

Epoch 002 train_loss: 2.3028 train_acc: 0.1001 val_loss 2.3026 val_f1 0.0223



epoch:  30%|███       | 3/10 [42:50<1:39:59, 857.05s/it]

Adjusting learning rate of group 0 to 1.0000e-04.

Epoch 003 train_loss: 2.3026 train_acc: 0.0991 val_loss 2.3026 val_f1 0.0226



epoch:  40%|████      | 4/10 [57:08<1:25:43, 857.27s/it]

Adjusting learning rate of group 0 to 1.0000e-05.

Epoch 004 train_loss: 2.3026 train_acc: 0.0993 val_loss 2.3025 val_f1 0.0225



epoch:  50%|█████     | 5/10 [1:11:25<1:11:26, 857.23s/it]

Adjusting learning rate of group 0 to 1.0000e-05.

Epoch 005 train_loss: 2.3025 train_acc: 0.1008 val_loss 2.3025 val_f1 0.0235



epoch:  60%|██████    | 6/10 [1:25:41<57:07, 856.76s/it]  

Adjusting learning rate of group 0 to 1.0000e-06.

Epoch 006 train_loss: 2.3025 train_acc: 0.1000 val_loss 2.3025 val_f1 0.0225



epoch:  70%|███████   | 7/10 [1:39:58<42:50, 856.73s/it]

Adjusting learning rate of group 0 to 1.0000e-06.

Epoch 007 train_loss: 2.3025 train_acc: 0.1010 val_loss 2.3025 val_f1 0.0223



epoch:  80%|████████  | 8/10 [1:54:15<28:33, 856.76s/it]

Adjusting learning rate of group 0 to 1.0000e-07.

Epoch 008 train_loss: 2.3025 train_acc: 0.1010 val_loss 2.3025 val_f1 0.0228



epoch:  90%|█████████ | 9/10 [2:08:30<14:16, 856.32s/it]

Adjusting learning rate of group 0 to 1.0000e-07.

Epoch 009 train_loss: 2.3025 train_acc: 0.1014 val_loss 2.3025 val_f1 0.0234



epoch: 100%|██████████| 10/10 [2:22:45<00:00, 856.54s/it]

Adjusting learning rate of group 0 to 1.0000e-08.

Epoch 010 train_loss: 2.3025 train_acc: 0.1014 val_loss 2.3025 val_f1 0.0227
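
The "Adjusting learning rate" lines in this log follow a step decay: the rate is multiplied by `gamma` once every `step_size` epochs. The sketch below reproduces that sequence in plain Python under the settings of this run (`lr=0.001`, `step_size=2`, `gamma=0.1`); the function `step_lr` is a hypothetical helper written for illustration, not part of PyTorch.

```python
def step_lr(base_lr: float, step_size: int, gamma: float, epoch: int) -> float:
    """Learning rate in effect after `epoch` scheduler steps under step decay."""
    return base_lr * gamma ** (epoch // step_size)

# Settings of the run above: lr=0.001, step_size=2, gamma=0.1, 10 epochs
schedule = [step_lr(1e-3, 2, 0.1, epoch) for epoch in range(11)]
for epoch, lr in enumerate(schedule):
    print(f"after {epoch:2d} epochs: lr = {lr:.0e}")
```

With these settings the rate is already down to 1e-5 after four epochs and reaches 1e-8 at the end, so the later epochs can barely move the weights, whatever state the model is in by then.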






Fine-tuning failed: the loss never moved away from 2.3026, and training accuracy stayed at chance level (about 0.10).
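
The value 2.3026 in the log is not arbitrary. It equals ln(10), the cross-entropy of a classifier that assigns the same probability to each of the 10 topics. The sketch below checks this with a small hand-written `cross_entropy` (a hypothetical helper that computes the same quantity as `torch.nn.CrossEntropyLoss` for a single example).

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one example, computed with a stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]

num_classes = 10
uniform_logits = [0.0] * num_classes  # every class gets probability 1/10

loss = cross_entropy(uniform_logits, target=3)
print(f"{loss:.4f}")          # 2.3026
print(f"{math.log(10):.4f}")  # 2.3026
```

So the model most likely collapsed to an input-independent, near-uniform output, which would also explain why the validation F1 stays near zero.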

In [32]:
from sklearn.metrics import f1_score
# 1,000 distinct rows from the 60,000-row test split
ids = np.random.choice(60000, size=1000, replace=False)
x_test = np.array(dataset['test']['question_title'])[ids]
y_test = np.array(dataset['test']['topic'])[ids]
x_test = tokenizer(x_test.tolist(), padding="max_length", truncation=True, return_tensors='pt')
test_dataset = EDataset(x_test, y_test)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
model.eval()
f1 = []
for i, l in test_loader:
    with torch.no_grad():
        input_ids = i['input_ids'].to(device)
        attention_mask = i['attention_mask'].to(device)
        token_type_ids = i['token_type_ids'].to(device)
        labels = l.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        preds = torch.argmax(outputs['logits'], 1)
        f1s = f1_score(labels.cpu().data, preds.cpu(), average='weighted')
        f1.append(f1s)
        
print(sum(f1)/len(f1))

0.015390935479702888


### Transfer learning with frozen layers

In [33]:
config = ElectraConfig()
tokenizer = ElectraTokenizer.from_pretrained(TOKENIZER_NAME)
model = ElectraForSequenceClassification.from_pretrained(MODEL_NAME)

Some weights of the model checkpoint at google/electra-small-generator were not used when initializing ElectraForSequenceClassification: ['generator_predictions.LayerNorm.bias', 'generator_predictions.LayerNorm.weight', 'generator_lm_head.bias', 'generator_predictions.dense.bias', 'generator_predictions.dense.weight', 'generator_lm_head.weight']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-generator and are newly initializ

In [35]:
# Freeze every pretrained parameter; the new head added below stays trainable
for param in model.parameters():
    param.requires_grad = False

In [36]:
model.classifier.dense = torch.nn.Linear(256, 64)
model.classifier.out_proj = torch.nn.Sequential(
    torch.nn.LeakyReLU(),
    torch.nn.Linear(64, 10))
model.to(device)

ElectraForSequenceClassification(
  (electra): ElectraModel(
    (embeddings): ElectraEmbeddings(
      (word_embeddings): Embedding(30522, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (embeddings_project): Linear(in_features=128, out_features=256, bias=True)
    (encoder): ElectraEncoder(
      (layer): ModuleList(
        (0): ElectraLayer(
          (attention): ElectraAttention(
            (self): ElectraSelfAttention(
              (query): Linear(in_features=256, out_features=256, bias=True)
              (key): Linear(in_features=256, out_features=256, bias=True)
              (value): Linear(in_features=256, out_features=256, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): ElectraSelfOutput(
              (dense): Linear(in_

In [41]:
params_to_update = [param for param in model.classifier.parameters()]
optimizer = torch.optim.AdamW(params_to_update, lr=0.001)
criterion = torch.nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 3, gamma=0.2, 
                                            last_epoch=-1, verbose=True)

history = train(train_dataset, val_dataset, model, epochs=10, batch_size=32, 
                opt=optimizer, criterion=criterion) 
# Note: I forgot to pass the scheduler to train(), so this run used a constant learning rate of 1e-3

Adjusting learning rate of group 0 to 1.0000e-03.


epoch:  10%|█         | 1/10 [05:22<48:18, 322.05s/it]


Epoch 001 train_loss: 1.7281 train_acc: 0.4116 val_loss 1.4979 val_f1 0.5000



epoch:  20%|██        | 2/10 [10:44<42:55, 321.99s/it]


Epoch 002 train_loss: 1.7248 train_acc: 0.4133 val_loss 1.4785 val_f1 0.5375



epoch:  30%|███       | 3/10 [16:06<37:33, 322.00s/it]


Epoch 003 train_loss: 1.7189 train_acc: 0.4155 val_loss 1.4888 val_f1 0.5177



epoch:  40%|████      | 4/10 [21:28<32:12, 322.13s/it]


Epoch 004 train_loss: 1.7218 train_acc: 0.4153 val_loss 1.4612 val_f1 0.5397



epoch:  50%|█████     | 5/10 [26:50<26:51, 322.24s/it]


Epoch 005 train_loss: 1.7202 train_acc: 0.4156 val_loss 1.4429 val_f1 0.5467



epoch:  60%|██████    | 6/10 [32:13<21:29, 322.30s/it]


Epoch 006 train_loss: 1.7143 train_acc: 0.4167 val_loss 1.4836 val_f1 0.5305



epoch:  70%|███████   | 7/10 [37:35<16:06, 322.31s/it]


Epoch 007 train_loss: 1.7137 train_acc: 0.4178 val_loss 1.4699 val_f1 0.5326



epoch:  80%|████████  | 8/10 [42:57<10:44, 322.32s/it]


Epoch 008 train_loss: 1.7137 train_acc: 0.4177 val_loss 1.4724 val_f1 0.5332



epoch:  90%|█████████ | 9/10 [48:20<05:22, 322.33s/it]


Epoch 009 train_loss: 1.7168 train_acc: 0.4178 val_loss 1.4630 val_f1 0.5102



epoch: 100%|██████████| 10/10 [53:42<00:00, 322.25s/it]


Epoch 010 train_loss: 1.7134 train_acc: 0.4167 val_loss 1.4351 val_f1 0.5363






In [40]:
model.eval()
f1 = []
for i, l in test_loader:
    with torch.no_grad():
        input_ids = i['input_ids'].to(device)
        attention_mask = i['attention_mask'].to(device)
        token_type_ids = i['token_type_ids'].to(device)
        labels = l.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        preds = torch.argmax(outputs['logits'], 1)
        f1s = f1_score(labels.cpu().data, preds.cpu(), average='weighted')
        f1.append(f1s)
        
print(sum(f1)/len(f1))

0.5255830394329934


Well, this looks better: the weighted F1 on the test sample rises from about 0.015 to about 0.53.

## Results

I spent a lot of time fine-tuning the whole model. I tried the Adam and AdamW optimizers (AdamW is said to be more suitable for BERT-based models) and also a learning rate scheduler, but none of this improved the learning process at all. I guess Electra has to be trained in a different way, since after training it performed even worse than before training.

With frozen weights, training was much faster (about 5 minutes per epoch instead of about 14) and the model performed quite well. I froze all parameters of the model except the classifier layers.
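
To give an idea of how little is actually trained in the frozen setup, the size of the replacement head can be worked out by hand. The sketch below only does the arithmetic for the two `Linear` layers defined earlier (256 -> 64 and 64 -> 10); `linear_params` is a hypothetical helper, and the count assumes both layers keep their bias terms, which is the PyTorch default.

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Number of weights (plus biases) in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

# Layer sizes taken from the replacement head defined earlier
dense = linear_params(256, 64)     # model.classifier.dense
out_proj = linear_params(64, 10)   # Linear inside model.classifier.out_proj
trainable = dense + out_proj

print(dense, out_proj, trainable)  # 16448 650 17098
```

Only these roughly 17 thousand parameters receive gradients. That fits the shorter epoch time, and it also means the high learning rate cannot damage the pretrained encoder.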