# Project Outline
- Step 1: Introduction (this section)

- Step 2: Exploratory Data Analysis and Preprocessing

- Step 3: Without Finetuning

- Step 4: Loading Tokenizer and Encoding our Data

- Step 5: Setting up BERT Pretrained Model

- Step 6: Creating Data Loaders

- Step 7: Setting Up Optimizer and Scheduler

- Step 8: Defining our Performance Metrics

- Step 9: Finetune BERT and SciBERT

## Introduction

[BERT](https://huggingface.co/docs/transformers/model_doc/bert) is a large-scale transformer-based language model that can be finetuned for a variety of tasks. For more information, see the [original paper](https://arxiv.org/abs/1810.04805).

[SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased) is a BERT model pretrained on scientific text. For more information, see the [original paper](https://aclanthology.org/D19-1371/).

## Exploratory Data Analysis and Preprocessing

In [9]:
from datasets import load_dataset
import pandas as pd
import torch
from tqdm import tqdm

sst2 = load_dataset("stanfordnlp/sst2")
sst2

DatasetDict({
    train: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 872
    })
    test: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 1821
    })
})

In [10]:
# Build dataframes
df_train = pd.DataFrame(sst2["train"])
df_val   = pd.DataFrame(sst2["validation"])

In [11]:
print(df_train.head(10).to_markdown(index=False))  # show the first 10 training examples

|   idx | sentence                                                                                                                                             |   label |
|------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------|--------:|
|     0 | hide new secretions from the parental units                                                                                                          |       0 |
|     1 | contains no wit , only labored gags                                                                                                                  |       0 |
|     2 | that loves its characters and communicates something rather beautiful about human nature                                                             |       1 |
|     3 | remains utterly satisfied to remain the same throughout                                                                                

In [12]:
split_idx = int(0.8 * len(df_val))

df_train_small = df_val.iloc[:split_idx].reset_index(drop=True)
df_val_small  = df_val.iloc[split_idx:].reset_index(drop=True)

print("Train:", len(df_train_small))
print("Test :", len(df_val_small))

Train: 697
Test : 175
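
The split above takes the first 80% of the 872 validation rows as a small training set and keeps the remainder for evaluation, in order and without shuffling. A minimal sketch of the same arithmetic on a plain list follows; the helper name `sequential_split` is hypothetical and not part of the notebook.

```python
def sequential_split(items, train_fraction=0.8):
    """Split a sequence in order: the first part for training, the rest held out."""
    split_idx = int(train_fraction * len(items))  # int() truncates toward zero
    return items[:split_idx], items[split_idx:]


rows = list(range(872))  # stand-in for the 872 SST-2 validation rows
train_rows, heldout_rows = sequential_split(rows)
print(len(train_rows), len(heldout_rows))
```

Because the split is sequential, the class balance of each part depends on the row order of the source data, which is why the next cell checks the label counts.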


In [13]:
df_train_small.label.value_counts()
# Note: if the labels are categorical, convert them to numerical labels first, e.g. positive --> 1, negative --> 0.

label  count
1        357
0        340


In [14]:
possible_labels = df_train_small.label.unique()
print("possible_labels",possible_labels)

possible_labels [1 0]


In [15]:
# Optional: map categorical labels to numerical indices
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [16]:
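label_dict  # for string labels this would look like {"Positive": 1, "Negative": 0}; here the labels are already 0/1, so the dict is only used for its length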

{np.int64(1): 0, np.int64(0): 1}

In [17]:
label_dict_inverse = {1: "Positive", 0:"Negative"}
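
Enumerating `unique()` assigns indices in order of first appearance, which is why the dictionary above maps label 1 to index 0. A minimal sketch of a deterministic alternative is shown here, assuming string labels; `build_label_maps` and the example labels are hypothetical and only illustrate the idea.

```python
def build_label_maps(labels):
    """Return (label -> index, index -> label) using a sorted, deterministic order."""
    label_to_index = {label: index for index, label in enumerate(sorted(set(labels)))}
    index_to_label = {index: label for label, index in label_to_index.items()}
    return label_to_index, index_to_label


# Hypothetical string labels, used only for illustration.
label_to_index, index_to_label = build_label_maps(["positive", "negative", "positive"])
print(label_to_index)
print(index_to_label)
```

Sorting first means the mapping does not change when the rows are reordered, so saved models and reports stay comparable across runs.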

## Without Finetuning BERT

In [1]:
from transformers import AutoTokenizer, AutoModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

bert.to(device)
bert.eval()   # inference mode: disables dropout, the encoder weights stay frozen

BertModel LOAD REPORT from: bert-base-uncased
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
cls.predictions.transform.dense.weight     | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED |  | 
cls.seq_relationship.weight                | UNEXPECTED |  | 
cls.predictions.bias                       | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED |  | 
cls.seq_relationship.bias                  | UNEXPECTED |  | 
cls.predictions.transform.dense.bias       | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [2]:
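def get_bert_cls_embeddings(sentences, batch_size=32, max_len=128):
    """Encode sentences in batches with the global `tokenizer` and `bert`; return the [CLS] vectors."""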

    all_embeddings = []

    with torch.no_grad():
        for i in tqdm(range(0, len(sentences), batch_size)):

            batch = sentences[i:i+batch_size]

            enc = tokenizer(
                batch,
                padding=True,
                truncation=True,
                max_length=max_len,
                return_tensors="pt"
            )

            enc = {k: v.to(device) for k, v in enc.items()}

            outputs = bert(**enc)

            # CLS token representation
            cls_emb = outputs.last_hidden_state[:, 0, :]  # (B, 768)

            all_embeddings.append(cls_emb.cpu())

    return torch.cat(all_embeddings, dim=0)


In [46]:
X_train = get_bert_cls_embeddings(
    df_train_small["sentence"].tolist()
)

X_val = get_bert_cls_embeddings(
    df_val_small["sentence"].tolist()
)

y_train = df_train_small["label"].values
y_val   = df_val_small["label"].values

print(X_train.shape, X_val.shape)


100%|██████████| 22/22 [00:00<00:00, 26.82it/s]
100%|██████████| 6/6 [00:00<00:00, 27.56it/s]

torch.Size([697, 768]) torch.Size([175, 768])
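
The progress bars above report 22 and 6 batches. A small sketch of the slicing that `range(0, len(sentences), batch_size)` performs makes those numbers explicit; `batch_slices` is a hypothetical helper, not part of the notebook.

```python
def batch_slices(n_items, batch_size):
    """Yield (start, end) index pairs that cover n_items in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


train_batches = list(batch_slices(697, 32))
val_batches = list(batch_slices(175, 32))
print(len(train_batches), len(val_batches))
print(train_batches[-1])  # the last batch is smaller than batch_size
```

Python slicing already stops at the end of the list, so the notebook does not need the explicit `min`; it is included here only to show the true size of the final batch.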





In [47]:
import torch.nn as nn
#768  →  256  →  1

class MLPClassifier(nn.Module):
    def __init__(self, in_dim, hidden_dim=256):
        super().__init__()

        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x.squeeze(-1)
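
To get a feel for the size of this classifier head, the parameter count of the two linear layers (768 to 256 to 1) can be worked out by hand. This is a sketch of the arithmetic only; `linear_param_count` is a hypothetical helper.

```python
def linear_param_count(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


hidden_params = linear_param_count(768, 256)  # fc1
output_params = linear_param_count(256, 1)    # fc2
total_params = hidden_params + output_params
print(hidden_params, output_params, total_params)
```

The head is tiny compared with the frozen encoder, which is what makes training on cached [CLS] embeddings so fast.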


In [48]:
model = MLPClassifier(768,256).to(device)

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

In [49]:
Xtr = X_train.to(device)
ytr = torch.tensor(y_train, dtype=torch.float32).to(device)

for epoch in range(10):
    model.train()

    optimizer.zero_grad()
    logits = model(Xtr)

    loss = criterion(logits, ytr)
    loss.backward()
    optimizer.step()

    print(f"Epoch {epoch} loss: {loss.item():.4f}")


Epoch 0 loss: 0.6967
Epoch 1 loss: 0.6585
Epoch 2 loss: 0.6238
Epoch 3 loss: 0.5899
Epoch 4 loss: 0.5567
Epoch 5 loss: 0.5252
Epoch 6 loss: 0.4960
Epoch 7 loss: 0.4691
Epoch 8 loss: 0.4445
Epoch 9 loss: 0.4221
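
The first loss value is close to ln 2 ≈ 0.693, which is what binary cross-entropy gives when every logit is near zero. The sketch below uses a standard numerically stable form of binary cross-entropy on logits; it illustrates the math and is not PyTorch's source code.

```python
import math


def bce_with_logits(logit, target):
    """Stable binary cross-entropy for one example: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def mean_bce_with_logits(logits, targets):
    """Average the per-example loss over the batch."""
    losses = [bce_with_logits(x, y) for x, y in zip(logits, targets)]
    return sum(losses) / len(losses)


untrained_loss = mean_bce_with_logits([0.0, 0.0, 0.0, 0.0], [1, 0, 1, 0])
confident_loss = mean_bce_with_logits([4.0, -4.0, 4.0, -4.0], [1, 0, 1, 0])
print(untrained_loss, confident_loss)
```

As the classifier becomes more confident in the correct direction, the loss falls toward zero, matching the downward trend in the log above.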


In [50]:
model.eval()
with torch.no_grad():
    logits = model(X_val.to(device))
    probs = torch.sigmoid(logits)
    preds = (probs > 0.5).long().cpu().numpy()

In [51]:
from sklearn.metrics import accuracy_score, f1_score, classification_report

print("Accuracy:", accuracy_score(y_val, preds))
print("F1:", f1_score(y_val, preds))
print(classification_report(
        y_val, preds
    ))

Accuracy: 0.7828571428571428
F1: 0.7934782608695652
              precision    recall  f1-score   support

           0       0.82      0.73      0.77        88
           1       0.75      0.84      0.79        87

    accuracy                           0.78       175
   macro avg       0.79      0.78      0.78       175
weighted avg       0.79      0.78      0.78       175
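
The accuracy and F1 printed above can be reproduced from four confusion-matrix counts. The counts used below are not printed by the notebook; they are inferred from the reported accuracy, F1 and class supports (88 and 87), so treat them as an assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Inferred counts for class 1 (positive); an assumption, see the note above.
accuracy, precision, recall, f1 = binary_metrics(tp=73, fp=24, fn=14, tn=64)
print(accuracy, precision, recall, f1)
```

By default `f1_score` reports the F1 of the positive class only, whereas the macro and weighted rows of the report average over both classes.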



## Without Finetuning SciBERT

In [68]:
from transformers import AutoTokenizer, AutoModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
bert = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

bert.to(device)
bert.eval()   # inference mode: disables dropout, the encoder weights stay frozen

BertModel LOAD REPORT from: allenai/scibert_scivocab_uncased
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
cls.predictions.transform.dense.weight     | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED |  | 
cls.seq_relationship.weight                | UNEXPECTED |  | 
cls.predictions.bias                       | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED |  | 
cls.predictions.transform.dense.bias       | UNEXPECTED |  | 
cls.seq_relationship.bias                  | UNEXPECTED |  | 
cls.predictions.decoder.weight             | UNEXPECTED |  | 
cls.predictions.decoder.bias               | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(31090, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [69]:
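def get_bert_cls_embeddings(sentences, batch_size=32, max_len=128):
    """Encode sentences in batches with the global `tokenizer` and `bert`; return the [CLS] vectors."""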

    all_embeddings = []

    with torch.no_grad():
        for i in tqdm(range(0, len(sentences), batch_size)):

            batch = sentences[i:i+batch_size]

            enc = tokenizer(
                batch,
                padding=True,
                truncation=True,
                max_length=max_len,
                return_tensors="pt"
            )

            enc = {k: v.to(device) for k, v in enc.items()}

            outputs = bert(**enc)

            # CLS token representation
            cls_emb = outputs.last_hidden_state[:, 0, :]  # (B, 768)

            all_embeddings.append(cls_emb.cpu())

    return torch.cat(all_embeddings, dim=0)


In [70]:
X_train = get_bert_cls_embeddings(
    df_train_small["sentence"].tolist()
)

X_val = get_bert_cls_embeddings(
    df_val_small["sentence"].tolist()
)

y_train = df_train_small["label"].values
y_val   = df_val_small["label"].values

print(X_train.shape, X_val.shape)


100%|██████████| 22/22 [00:00<00:00, 27.18it/s]
100%|██████████| 6/6 [00:00<00:00, 29.46it/s]

torch.Size([697, 768]) torch.Size([175, 768])





In [71]:
import torch.nn as nn
#768  →  256  →  1

class MLPClassifier(nn.Module):
    def __init__(self, in_dim, hidden_dim=256):
        super().__init__()

        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x.squeeze(-1)


In [72]:
model = MLPClassifier(768,256).to(device)

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

In [73]:
Xtr = X_train.to(device)
ytr = torch.tensor(y_train, dtype=torch.float32).to(device)

for epoch in range(10):
    model.train()

    optimizer.zero_grad()
    logits = model(Xtr)

    loss = criterion(logits, ytr)
    loss.backward()
    optimizer.step()

    print(f"Epoch {epoch} loss: {loss.item():.4f}")


Epoch 0 loss: 0.6945
Epoch 1 loss: 0.6807
Epoch 2 loss: 0.6277
Epoch 3 loss: 0.6232
Epoch 4 loss: 0.5815
Epoch 5 loss: 0.5700
Epoch 6 loss: 0.5557
Epoch 7 loss: 0.5302
Epoch 8 loss: 0.5236
Epoch 9 loss: 0.5110


In [74]:
model.eval()
with torch.no_grad():
    logits = model(X_val.to(device))
    probs = torch.sigmoid(logits)
    preds = (probs > 0.5).long().cpu().numpy()

In [76]:
from sklearn.metrics import accuracy_score, f1_score, classification_report

print("Accuracy:", accuracy_score(y_val, preds))
print("F1:", f1_score(y_val, preds))
print(classification_report(
        y_val, preds
    ))

Accuracy: 0.72
F1: 0.7262569832402235
              precision    recall  f1-score   support

           0       0.73      0.69      0.71        88
           1       0.71      0.75      0.73        87

    accuracy                           0.72       175
   macro avg       0.72      0.72      0.72       175
weighted avg       0.72      0.72      0.72       175



## Loading Tokenizer and Encoding our Data

In [91]:
from transformers import AutoTokenizer
from torch.utils.data import TensorDataset
import torch

In [92]:
# Initialise tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased",do_lower_case=True)

In [93]:
# -------- train split --------
encoded_data_train = tokenizer(
    df_train_small["sentence"].values.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding=True,       # pad to the longest sentence in this split
    truncation=True,    # cut sentences longer than max_length
    max_length=256,
    return_tensors='pt'
)

input_ids_train = encoded_data_train["input_ids"]
attention_masks_train = encoded_data_train["attention_mask"]
labels_train = torch.tensor(df_train_small["label"].values)


# -------- validation split --------
encoded_data_val = tokenizer(
    df_val_small["sentence"].values.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding=True,       # pad to the longest sentence in this split
    truncation=True,    # cut sentences longer than max_length
    max_length=256,
    return_tensors='pt'
)

input_ids_val = encoded_data_val["input_ids"]
attention_masks_val = encoded_data_val["attention_mask"]
labels_val = torch.tensor(df_val_small["label"].values)

In [94]:
# -------- Wrap the tensors in TensorDatasets --------
dataset_train = TensorDataset(
    input_ids_train,
    attention_masks_train,
    labels_train
)

dataset_val = TensorDataset(
    input_ids_val,
    attention_masks_val,
    labels_val
)

In [95]:
print(len(dataset_train))

print(len(dataset_val))

697
175
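
With `padding=True`, the tokenizer pads every sentence in the call to the length of the longest one and returns an attention mask that marks real tokens with 1 and padding with 0. The sketch below reproduces that behaviour on plain lists; `pad_batch` is a hypothetical helper, and the token ids are made up apart from 101 (`[CLS]`), 102 (`[SEP]`) and 0 (`[PAD]`), which follow the `bert-base-uncased` vocabulary.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token id lists to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Two toy sentences of different length (ids in the middle are illustrative only).
batch = [[101, 2023, 102], [101, 2023, 2003, 2307, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

The model uses the mask to ignore padded positions in attention, so padding does not change the representation of the real tokens.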


## Setting up BERT Pretrained Model

In [96]:
from transformers import BertForSequenceClassification

In [97]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

BertForSequenceClassification LOAD REPORT from: bert-base-uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
classifier.weight                          | MISSING    | 
classifier.bias                            | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


## Creating Data Loaders

In [98]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [99]:
batch_size = 8

dataloader_train = DataLoader(dataset_train,
                              sampler=RandomSampler(dataset_train),
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val,
                                   sampler=SequentialSampler(dataset_val),
                                   batch_size=batch_size)

## Setting Up Optimizer and Scheduler

In [100]:
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

In [101]:
# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

In [102]:
epochs = 10

scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)
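
With no warmup, the linear schedule decays the learning rate from its initial value to zero over all training steps. The sketch below reimplements the multiplier in plain Python as an illustration of the schedule's shape, not as the library's code; `linear_schedule_factor` is a hypothetical name, and 88 steps per epoch comes from 697 examples in batches of 8.

```python
import math


def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
steps_per_epoch = math.ceil(697 / 8)  # 88 batches per epoch
total_steps = steps_per_epoch * 10    # 10 epochs

lr_start = base_lr * linear_schedule_factor(0, 0, total_steps)
lr_middle = base_lr * linear_schedule_factor(total_steps // 2, 0, total_steps)
lr_end = base_lr * linear_schedule_factor(total_steps, 0, total_steps)
print(total_steps, lr_start, lr_middle, lr_end)
```

Because the scheduler is stepped once per batch in the training loop, `num_training_steps` has to count batches, not epochs.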

## Defining our Performance Metrics

In [103]:
import numpy as np
from sklearn.metrics import (
    classification_report,
    accuracy_score,
    f1_score,
    confusion_matrix
)

In [104]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')
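
`f1_score_func` first turns each row of logits into a class index with argmax, then averages the per-class F1 scores weighted by class support. A pure-Python sketch of both steps is given below under made-up logits and labels; all helper names are hypothetical.

```python
def argmax_rows(logits):
    """Index of the largest logit in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def class_f1(y_true, y_pred, label):
    """F1 for one class, treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def averaged_f1(y_true, y_pred, weighted=True):
    """Support-weighted (or plain macro) average of the per-class F1 scores."""
    labels = sorted(set(y_true))
    scores = [class_f1(y_true, y_pred, label) for label in labels]
    if not weighted:
        return sum(scores) / len(scores)
    supports = [y_true.count(label) for label in labels]
    return sum(s * n for s, n in zip(scores, supports)) / len(y_true)


# Made-up logits and labels, for illustration only.
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.2], [1.5, 0.2]]
y_true = [0, 1, 1, 1]
y_pred = argmax_rows(logits)
f1_weighted = averaged_f1(y_true, y_pred, weighted=True)
f1_macro = averaged_f1(y_true, y_pred, weighted=False)
print(y_pred, f1_weighted, f1_macro)
```

On balanced data such as this subset the weighted and macro averages are nearly identical; they diverge when one class dominates.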

In [105]:
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}

    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    print("\nClassification report:")
    print(classification_report(
        labels_flat,
        preds_flat
    ))

    overall_acc = accuracy_score(labels_flat, preds_flat)

    f1_macro    = f1_score(labels_flat, preds_flat, average="macro")

    print("Overall accuracy:", overall_acc)
    print("F1 macro       :", f1_macro)

    # for label in np.unique(labels_flat):
    #     y_preds = preds_flat[labels_flat==label]
    #     y_true = labels_flat[labels_flat==label]
    #     print(f'Class: {label_dict_inverse[label]}')
    #     print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')

## Finetuning BERT

In [106]:
import random

seed_val = 0
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [107]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


In [108]:
def evaluate(dataloader_val):

    model.eval()

    loss_val_total = 0
    predictions, true_vals = [], []

    for batch in dataloader_val:

        batch = tuple(b.to(device) for b in batch)

        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():
            outputs = model(**inputs)

        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)

    loss_val_avg = loss_val_total/len(dataloader_val)

    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)

    return loss_val_avg, predictions, true_vals

In [109]:
for epoch in tqdm(range(1, epochs+1)):

    model.train()

    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()

        batch = tuple(b.to(device) for b in batch)

        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        outputs = model(**inputs)

        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()

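        # The loss returned by the model is already averaged over the batch;
        # len(batch) is the number of tensors in the tuple (3), not the batch size.
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})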


    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')

    tqdm.write(f'\nEpoch {epoch}')

    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')

    # val_loss, predictions, true_vals = evaluate(dataloader_validation)
    # val_f1 = f1_score_func(predictions, true_vals)
    # tqdm.write(f'Validation loss: {val_loss}')
    # tqdm.write(f'F1 Score (Weighted): {val_f1}')

Epoch 1
Training loss: 0.4800108624622226

Epoch 2
Training loss: 0.22081708727108146

Epoch 3
Training loss: 0.11829078564981253

Epoch 4
Training loss: 0.029996472400274466

Epoch 5
Training loss: 0.029805286398517306

Epoch 6
Training loss: 0.011531064500476614

Epoch 7
Training loss: 0.00340939437327589

Epoch 8
Training loss: 0.004329151278480739

Epoch 9
Training loss: 0.0003585194734279701

Epoch 10
Training loss: 0.0003215950610782866





In [110]:
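# Optional: run this cell when reloading a saved (uploaded) checkpoint of the fine-tuned model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

model.to(device)

# map_location loads the checkpoint tensors onto the CPU; load_state_dict then copies them into the model on `device`
model.load_state_dict(torch.load('/content/finetuned_BERT_epoch_2.model', map_location=torch.device('cpu')))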

Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

BertForSequenceClassification LOAD REPORT from: bert-base-uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
classifier.weight                          | MISSING    | 
classifier.bias                            | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


<All keys matched successfully>
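The `<All keys matched successfully>` line is the return value of `load_state_dict`. It reports that the checkpoint had no missing or unexpected keys, so the classifier head that the load report flagged as MISSING (newly initialized) has been overwritten by the saved weights. A minimal sketch of that round trip, using a small `torch.nn.Linear` as a hypothetical stand-in for the classifier and an in-memory buffer in place of the `.model` file:

```python
import io
import torch

# Hypothetical stand-in for the fine-tuned classifier.
trained = torch.nn.Linear(4, 2)

# Save only the weights (the state dict), not the whole model object.
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

# Rebuild the same architecture with fresh weights, then load the checkpoint.
restored = torch.nn.Linear(4, 2)
state = torch.load(buffer, map_location=torch.device("cpu"))
result = restored.load_state_dict(state)

print(result)  # reports missing and unexpected keys, if any
weights_match = torch.equal(trained.weight, restored.weight)
```

Because only the state dict is stored, the architecture passed to `from_pretrained` has to match the one that was trained, including `num_labels`.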

In [111]:
_, predictions, true_vals = evaluate(dataloader_validation)

In [112]:
accuracy_per_class(predictions, true_vals)


Classification report:
              precision    recall  f1-score   support

           0       0.90      0.86      0.88        88
           1       0.87      0.91      0.89        87

    accuracy                           0.89       175
   macro avg       0.89      0.89      0.89       175
weighted avg       0.89      0.89      0.89       175

Overall accuracy: 0.8857142857142857
F1 macro       : 0.8856806898353802


In [113]:
predictions[:3]

array([[ 0.43597564, -0.8802067 ],
       [ 1.647924  , -1.6333351 ],
       [-1.7186362 ,  1.7910877 ]], dtype=float32)
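
Each row of `predictions` holds the raw logits for the two classes, not probabilities. A short sketch, using NumPy only, of how these three rows could be turned into predicted labels (argmax) and class probabilities (softmax). The helper name `softmax` is introduced here for illustration:

```python
import numpy as np

# The three logit rows printed above (columns: class 0, class 1).
logits = np.array([[ 0.43597564, -0.8802067 ],
                   [ 1.647924  , -1.6333351 ],
                   [-1.7186362 ,  1.7910877 ]], dtype=np.float32)

def softmax(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = x - x.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

pred_labels = logits.argmax(axis=1)   # index of the larger logit in each row
probs = softmax(logits)               # each row sums to 1

print(pred_labels)
print(probs.round(3))
```

The first two examples are assigned to class 0 and the third to class 1.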

## Finetuning SciBERT

In [114]:
model = BertForSequenceClassification.from_pretrained("allenai/scibert_scivocab_uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

BertForSequenceClassification LOAD REPORT from: allenai/scibert_scivocab_uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.predictions.decoder.weight             | UNEXPECTED | 
cls.predictions.decoder.bias               | UNEXPECTED | 
classifier.weight                          | MISSING    | 
classifier.bias                            | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.

In [115]:
import random

seed_val = 0
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [116]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


In [117]:
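def evaluate(dataloader_val):
    """Run the global `model` over `dataloader_val` without computing gradients.

    Returns the average loss, the stacked logits, and the true labels.
    """

    model.eval()  # disable dropout for evaluation

    loss_val_total = 0
    predictions, true_vals = [], []

    for batch in dataloader_val:

        # Move every tensor in the batch to the model's device
        batch = tuple(b.to(device) for b in batch)

        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():
            outputs = model(**inputs)

        # With labels supplied, the model returns the loss first, then the logits
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)

    loss_val_avg = loss_val_total/len(dataloader_val)

    # Stack the per-batch arrays into one array per output
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)

    return loss_val_avg, predictions, true_vals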

In [118]:
for epoch in tqdm(range(1, epochs+1)):

    model.train()

    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()

        batch = tuple(b.to(device) for b in batch)

        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        outputs = model(**inputs)

        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

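        # Clip the gradient norm to 1.0 to guard against exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # `optimizer` and `scheduler` must have been created for this model's parameters
        optimizer.step()
        scheduler.step()

        # Note: `batch` is the 3-tuple (input_ids, attention_mask, labels), so len(batch) is 3,
        # not the batch size; the value displayed here is the batch loss divided by 3.
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item()/len(batch))})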


    torch.save(model.state_dict(), f'finetuned_SciBERT_epoch_{epoch}.model')

    tqdm.write(f'\nEpoch {epoch}')

    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')

    # val_loss, predictions, true_vals = evaluate(dataloader_validation)
    # val_f1 = f1_score_func(predictions, true_vals)
    # tqdm.write(f'Validation loss: {val_loss}')
    # tqdm.write(f'F1 Score (Weighted): {val_f1}')

Epoch 1
Training loss: 0.7129781232638792



Epoch 2
Training loss: 0.7101707214658911



Epoch 3
Training loss: 0.7115101895549081



Epoch 4
Training loss: 0.7108092721212994



Epoch 5
Training loss: 0.7026745548302477



Epoch 6
Training loss: 0.7089288444681601



Epoch 7
Training loss: 0.7115154097026045



Epoch 8
Training loss: 0.7195551964369687



Epoch 9
Training loss: 0.7103574540127408



Epoch 10
Training loss: 0.7077298868786205
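
The average training loss stays near 0.71 for all ten epochs, close to ln 2 ≈ 0.693, the cross-entropy of an uninformed guess between two classes, so the model does not appear to be learning. One possible cause is that `optimizer` and `scheduler` still belong to the earlier BERT model. This is an assumption, since the cells shown here never rebuild them. An optimizer only updates the parameters it was constructed with. The sketch below illustrates the effect with two small `torch.nn.Linear` layers as hypothetical stand-ins for the two classifiers:

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for the BERT and SciBERT classifiers.
old_model = torch.nn.Linear(4, 2)
new_model = torch.nn.Linear(4, 2)

# Optimizer built for the first model, as it would be before the first run.
stale_optimizer = torch.optim.AdamW(old_model.parameters(), lr=1e-2)

def train_step(model, optimizer):
    """Run one forward/backward pass on random data and step the optimizer."""
    x = torch.randn(8, 4)
    y = torch.randint(0, 2, (8,))
    model.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()

before = new_model.weight.detach().clone()
train_step(new_model, stale_optimizer)
# The stale optimizer only knows old_model's parameters, so new_model is unchanged.
unchanged = torch.equal(before, new_model.weight.detach())

# Rebuilding the optimizer over new_model's parameters lets the weights move.
fresh_optimizer = torch.optim.AdamW(new_model.parameters(), lr=1e-2)
train_step(new_model, fresh_optimizer)
changed = not torch.equal(before, new_model.weight.detach())

print(unchanged, changed)
```

If this is the cause, creating a new optimizer and scheduler from the SciBERT model's parameters before the training loop would address it. Another thing worth checking, also an assumption, is whether the inputs were encoded with the SciBERT tokenizer rather than BERT's.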





In [120]:
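# Optional: run this cell when reloading a saved (uploaded) checkpoint of the fine-tuned model
model = BertForSequenceClassification.from_pretrained("allenai/scibert_scivocab_uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

model.to(device)

# map_location loads the checkpoint tensors onto the CPU; load_state_dict then copies them into the model on `device`
model.load_state_dict(torch.load('/content/finetuned_SciBERT_epoch_10.model', map_location=torch.device('cpu')))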

Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

BertForSequenceClassification LOAD REPORT from: allenai/scibert_scivocab_uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.predictions.decoder.weight             | UNEXPECTED | 
cls.predictions.decoder.bias               | UNEXPECTED | 
classifier.weight                          | MISSING    | 
classifier.bias                            | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.

<All keys matched successfully>

In [121]:
_, predictions, true_vals = evaluate(dataloader_validation)

In [122]:
accuracy_per_class(predictions, true_vals)


Classification report:
              precision    recall  f1-score   support

           0       0.51      0.86      0.64        88
           1       0.56      0.17      0.26        87

    accuracy                           0.52       175
   macro avg       0.53      0.52      0.45       175
weighted avg       0.53      0.52      0.45       175

Overall accuracy: 0.52
F1 macro       : 0.4536128456735058
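
The macro F1 is the unweighted mean of the per-class F1 scores, and each F1 is the harmonic mean of precision and recall. As a quick check, the arithmetic can be reproduced from the rounded values in the report. Because the inputs are rounded to two decimals, the result only approximates the reported 0.4536:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded per-class precision and recall from the classification report above.
f1_class0 = f1(0.51, 0.86)
f1_class1 = f1(0.56, 0.17)

# Macro average: unweighted mean over the classes.
macro_f1 = (f1_class0 + f1_class1) / 2

print(round(f1_class0, 2), round(f1_class1, 2), round(macro_f1, 2))
```

The low recall for class 1 (0.17) shows that the model labels most examples as class 0.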


In [123]:
predictions[:3]

array([[ 0.22120705, -0.09085   ],
       [ 0.16976723, -0.10131858],
       [ 0.37934768,  0.00634021]], dtype=float32)