Reference:  
https://huggingface.co/docs/transformers/v4.17.0/en/tasks/sequence_classification  
https://huggingface.co/course/chapter3/3?fw=pt

## Preparing Variables

In [None]:
# Optional
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
seq_length = 256
pad_approach = "max_length"
bert_huggingface = "indolem/indobertweet-base-uncased"  # uncased, because all text data is lowercased
num_epoch = 3

files = {
    "train" : "/content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Data/4. Train, Test/train/train_4.csv",
    "eval" : "/content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Data/4. Train, Test/validation/validation_4.csv",
    "test" : "/content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Data/4. Train, Test/test/test_3.csv",
}

save_model_path = "/content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Model/IndoBERTweet/v1"

load_model = True
load_path = "/content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Model/IndoBERTweet/v1"

## Installing Prerequisite Libraries

In [None]:
!pip install transformers
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstallin

## Preparing Datasets for Training
Each CSV file must contain a "text" column and a "labels" column.
  
The "labels" column encodes sentiment as 0 for negative and 1 for positive.
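
As a minimal sketch of the expected layout, the check below reads a CSV and verifies that it has "text" and "labels" columns and that every label is 0 or 1. The helper name `check_sentiment_csv`, the `LABEL_NAMES` mapping, and the sample rows are hypothetical and used only for illustration; the real files are still loaded with `load_dataset`.

```python
import csv
import io

# Hypothetical mapping that mirrors the label convention described above.
LABEL_NAMES = {0: "negative", 1: "positive"}

def check_sentiment_csv(handle):
    """Return the number of rows per label; raise ValueError on a bad layout."""
    reader = csv.DictReader(handle)
    missing = {"text", "labels"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    counts = {label: 0 for label in LABEL_NAMES}
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        label = int(row["labels"])
        if label not in LABEL_NAMES:
            raise ValueError(f"line {line_no}: unexpected label {label}")
        counts[label] += 1
    return counts

# Made-up sample rows, not taken from the real dataset.
sample = io.StringIO("text,labels\nvaksin aman,1\nvaksin bikin sakit,0\nsudah vaksin,1\n")
print(check_sentiment_csv(sample))
```

The per-label counts also give a quick view of class balance before training.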

In [None]:
from datasets import load_dataset
dataset = load_dataset('csv', data_files=files)

Using custom data configuration default-86ecf9b7890fd58a


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-86ecf9b7890fd58a/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-86ecf9b7890fd58a/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.



## Tokenizing Datasets for Training and Evaluation

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(load_path if load_model else bert_huggingface, use_fast=True)

In [None]:
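def tokenize_text(dataset_row):
    # Tokenize one row, truncating or padding it to exactly seq_length tokens.
    return tokenizer(str(dataset_row["text"]), max_length=seq_length,
                     truncation=True, padding=pad_approach)
    
    # Unused draft for mapping overflowing chunks back to their source rows
    # (it would need return_overflowing_tokens=True and a batched map):
    # # Extract mapping between new and old indices
    # sample_map = tokenize.pop("overflow_to_sample_mapping")
    # for key, values in dataset_row.items():
    #     tokenize[key] = [values[i] for i in sample_map]
    # return tokenize
    
# Drop the "text" column once it has been tokenized.
token_datasets = dataset.map(tokenize_text, remove_columns=["text"])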

  0%|          | 0/9758 [00:00<?, ?ex/s]

  0%|          | 0/2440 [00:00<?, ?ex/s]

  0%|          | 0/536 [00:00<?, ?ex/s]
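
To make the effect of `truncation=True` and `padding="max_length"` concrete, here is a toy sketch that pads or truncates a list of token ids to a fixed length and builds the matching attention mask. It is an illustration only, not the tokenizer's implementation: the function `pad_or_truncate`, the pad id 0, and the example ids are assumptions, and a real tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy fixed-length padding: returns (input_ids, attention_mask)."""
    kept = token_ids[:max_length]                    # truncation: drop the overflow
    n_pad = max_length - len(kept)                   # padding needed to reach max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 7, 8, 9, 10, 11, 102], max_length=6)
print(short_ids, short_mask)  # [101, 7, 8, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 7, 8, 9, 10, 11] [1, 1, 1, 1, 1, 1]
```

With `seq_length = 256`, every tokenized row ends up with exactly 256 input ids, regardless of how short the original tweet was.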

## Importing Models

In [None]:
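from transformers import BertForSequenceClassification

# Two output labels: 0 = negative, 1 = positive.
model = BertForSequenceClassification.from_pretrained(
    load_path if load_model else bert_huggingface, num_labels=2)

# When loading the base checkpoint (load_model = False), a warning about newly
# initialized classifier weights is expected: that checkpoint comes from a
# pretraining model and has no classification head.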

## Preparing Training

In [None]:
import numpy as np
from datasets import load_metric
from transformers import TrainingArguments, Trainer

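def compute_metrics(eval_preds):
    # GLUE/MRPC metric: reports accuracy and F1 for binary classification.
    metric = load_metric("glue", "mrpc")
    logits, labels = eval_preds
    # Pick the highest-scoring class for each example.
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Evaluate on the validation split at the end of every epoch.
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch",
                                  num_train_epochs=num_epoch)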

trainer = Trainer(
    model,
    training_args,
    train_dataset=token_datasets["train"],
    eval_dataset=token_datasets["eval"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
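
The `glue`/`mrpc` metric used in `compute_metrics` reports accuracy and the F1 score of the positive class (label 1). The sketch below recomputes both by hand from raw logits in plain Python, only to show what those two numbers mean; the function name `accuracy_and_f1` and the logits are made up for illustration.

```python
def accuracy_and_f1(logits, labels):
    """Argmax each row of logits, then score against labels (binary F1, positive = 1)."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    pairs = list(zip(preds, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == 1 and y == 1 for p, y in pairs)  # true positives
    fp = sum(p == 1 and y == 0 for p, y in pairs)  # false positives
    fn = sum(p == 0 and y == 1 for p, y in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}

# Made-up logits for four examples; columns are [negative, positive].
logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3]]
labels = [0, 1, 0, 1]
scores = accuracy_and_f1(logits, labels)
print(scores)  # {'accuracy': 0.5, 'f1': 0.5}
```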

In [None]:
trainer.train()
# TODO: balance the validation data

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: raw. If raw are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 9758
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 3660


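| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----------|
| 1 | 0.1463 | 0.550451 | 0.902459 | 0.898029 |
| 2 | 0.0821 | 0.581411 | 0.89959 | 0.898718 |
| 3 | 0.0206 | 0.625261 | 0.909016 | 0.906723 |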
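
The per-epoch results can be compared programmatically. The sketch below copies the reported numbers into a list and checks which epoch is best by validation loss and which by F1; the variable names are hypothetical. A validation loss that rises while training loss falls is often read as a sign of overfitting, although accuracy and F1 still improve slightly in this run.

```python
# Per-epoch results copied from the training output.
results = [
    {"epoch": 1, "train_loss": 0.1463, "val_loss": 0.550451, "accuracy": 0.902459, "f1": 0.898029},
    {"epoch": 2, "train_loss": 0.0821, "val_loss": 0.581411, "accuracy": 0.899590, "f1": 0.898718},
    {"epoch": 3, "train_loss": 0.0206, "val_loss": 0.625261, "accuracy": 0.909016, "f1": 0.906723},
]

best_by_loss = min(results, key=lambda r: r["val_loss"])["epoch"]
best_by_f1 = max(results, key=lambda r: r["f1"])["epoch"]

# Compare each epoch with the next one.
consecutive = list(zip(results, results[1:]))
val_loss_rising = all(a["val_loss"] < b["val_loss"] for a, b in consecutive)
train_loss_falling = all(a["train_loss"] > b["train_loss"] for a, b in consecutive)

print(best_by_loss, best_by_f1)             # 1 3
print(val_loss_rising, train_loss_falling)  # True True
```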


Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-1000
Configuration saved in test-trainer/checkpoint-1000/config.json
Model weights saved in test-trainer/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1000/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1000/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: raw. If raw are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num exam


Saving model checkpoint to test-trainer/checkpoint-1500
Configuration saved in test-trainer/checkpoint-1500/config.json
Model weights saved in test-trainer/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1500/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-2000
Configuration saved in test-trainer/checkpoint-2000/config.json
Model weights saved in test-trainer/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-2000/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-2000/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: raw. If raw are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num

TrainOutput(global_step=3660, training_loss=0.08227723967182181, metrics={'train_runtime': 1516.835, 'train_samples_per_second': 19.299, 'train_steps_per_second': 2.413, 'total_flos': 3851156517304320.0, 'train_loss': 0.08227723967182181, 'epoch': 3.0})
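
The step count and throughput figures in the log can be reproduced from the logged values. This is a small arithmetic check, assuming the final partial batch of each epoch is kept (which is consistent with the reported 3660 steps) and that checkpoints are written every 500 steps, as the `checkpoint-500`, `checkpoint-1000`, ... folder names suggest.

```python
import math

num_examples = 9758        # "Num examples" in the training log
batch_size = 8             # "Total train batch size"
num_epochs = 3             # "Num Epochs"
train_runtime = 1516.835   # seconds, from TrainOutput
checkpoint_interval = 500  # assumed from the checkpoint-N folder names

steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch is partial
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime
num_checkpoints = total_steps // checkpoint_interval

print(steps_per_epoch, total_steps)  # 1220 3660
print(round(samples_per_second, 3))  # 19.299
print(round(steps_per_second, 3))    # 2.413
print(num_checkpoints)               # 7
```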

In [None]:
trainer.save_model(save_model_path)

Saving model checkpoint to /content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Model/IndoBERTweet/v1
Configuration saved in /content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Model/IndoBERTweet/v1/config.json
Model weights saved in /content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Model/IndoBERTweet/v1/pytorch_model.bin
tokenizer config file saved in /content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Model/IndoBERTweet/v1/tokenizer_config.json
Special tokens file saved in /content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Model/IndoBERTweet/v1/special_tokens_map.json
