Reference:  
https://huggingface.co/docs/transformers/v4.17.0/en/tasks/sequence_classification  
https://huggingface.co/course/chapter3/3?fw=pt

## Preparing Variables

In [1]:
# Optional: mount Google Drive (only needed when running on Colab)
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
seq_length = 256
pad_approach = "max_length"
bert_huggingface = "indolem/indobertweet-base-uncased"  # uncased, because all text data is lowercased
num_epoch = 10

files = {
    "train" : "/content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Data/4. Train, Test/train/train_4.csv",
    "eval" : "/content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Data/4. Train, Test/validation/validation_4.csv",
    "test" : "/content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Data/4. Train, Test/test/test_3.csv",
}

save_model_path = "/content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Model/IndoBERTweet/v1"

load_model = True
load_path = "/content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Model/IndoBERTweet/v1"

## Install Prerequisite Library

In [3]:
!pip install transformers
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYA

## Preparing Datasets for Training
Each dataset must contain a "text" column and a "labels" column.

The "labels" column must be encoded as 0 for negative sentiment and 1 for positive sentiment.
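A minimal sketch of a pre-flight check for these requirements, using only the standard library. The helper name `check_sentiment_csv` and the in-memory sample rows are assumptions made for illustration; they are not part of the original notebook.

```python
import csv
import io

REQUIRED_COLUMNS = {"text", "labels"}
VALID_LABELS = {"0", "1"}  # 0 = negative sentiment, 1 = positive sentiment


def check_sentiment_csv(handle):
    """Return the number of valid rows; raise ValueError on a malformed file."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    count = 0
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        label = (row["labels"] or "").strip()
        if label not in VALID_LABELS:
            raise ValueError(f"line {line_no}: unexpected label {label!r}")
        count += 1
    return count


# Hypothetical sample standing in for one of the CSV files.
sample = io.StringIO("text,labels\nvaksin aman,1\nvaksin bikin sakit,0\n")
print(check_sentiment_csv(sample))
```

Running a check like this on each split before calling `load_dataset` can catch a misnamed column or an unmapped label early, before any training time is spent.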

In [4]:
from datasets import load_dataset
dataset = load_dataset('csv', data_files=files)

Using custom data configuration default-86ecf9b7890fd58a


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-86ecf9b7890fd58a/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-86ecf9b7890fd58a/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.



## Tokenizing Datasets for Training and Evaluation

In [5]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(load_path if load_model else bert_huggingface, use_fast=True)

In [6]:
def tokenize_text(dataset_row):
    # Encode one row to a fixed length: truncate long texts, pad short ones.
    return tokenizer(str(dataset_row["text"]), max_length=seq_length,
                     truncation=True, padding=pad_approach)

token_datasets = dataset.map(tokenize_text, remove_columns=["text"])

  0%|          | 0/9758 [00:00<?, ?ex/s]

  0%|          | 0/2440 [00:00<?, ?ex/s]

  0%|          | 0/536 [00:00<?, ?ex/s]
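To make the effect of `truncation=True` and `padding="max_length"` concrete, here is a simplified, library-free sketch of what happens to a list of token ids. The function `pad_or_truncate` and the pad id of 0 are illustrative assumptions; the real tokenizer also preserves special tokens when truncating and returns additional fields such as `token_type_ids`.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Mimic fixed-length encoding: cut long inputs, right-pad short ones."""
    kept = token_ids[:max_length]
    padding = [pad_id] * (max_length - len(kept))
    return {
        "input_ids": kept + padding,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * len(padding),
    }


short = pad_or_truncate([101, 7, 8, 102], max_length=6)
long = pad_or_truncate(list(range(10)), max_length=6)
print(short["input_ids"])       # [101, 7, 8, 102, 0, 0]
print(short["attention_mask"])  # [1, 1, 1, 1, 0, 0]
print(long["input_ids"])        # [0, 1, 2, 3, 4, 5]
```

With `seq_length = 256`, every example therefore ends up exactly 256 tokens long, regardless of the length of the original text.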

## Importing Models

In [7]:
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained(
    load_path if load_model else bert_huggingface, num_labels=2
)

# When loading the base checkpoint, a warning about newly initialized weights is
# expected: the checkpoint comes from a pretraining model (BertForPreTraining)
# and has no classification head yet.

## Preparing Training

In [8]:
import numpy as np
from datasets import load_metric
from transformers import TrainingArguments, Trainer

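# Load the metric once instead of on every evaluation call.
# The GLUE/MRPC metric reports accuracy and F1 for binary labels.
metric = load_metric("glue", "mrpc")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)  # predicted class per example
    return metric.compute(predictions=predictions, references=labels)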

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch",
                                  num_train_epochs=num_epoch)

trainer = Trainer(
    model,
    training_args,
    train_dataset=token_datasets["train"],
    eval_dataset=token_datasets["eval"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
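As a sanity check on what `compute_metrics` reports, accuracy and binary F1 can be reproduced with plain NumPy. This sketch assumes the positive class is label 1; the function name `accuracy_and_f1` is hypothetical, and the logits and labels below are made-up toy values, not results from this run.

```python
import numpy as np


def accuracy_and_f1(logits, labels):
    """Compute accuracy and binary F1 (positive class = 1) from raw logits."""
    preds = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)
    tp = int(np.sum((preds == 1) & (labels == 1)))  # true positives
    fp = int(np.sum((preds == 1) & (labels == 0)))  # false positives
    fn = int(np.sum((preds == 0) & (labels == 1)))  # false negatives
    accuracy = float(np.mean(preds == labels))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}


toy_logits = np.array([[2.0, -1.0], [-0.5, 0.5], [0.3, 0.1], [-2.0, 3.0]])
toy_labels = [0, 1, 1, 1]
result = accuracy_and_f1(toy_logits, toy_labels)
print(result)  # {'accuracy': 0.75, 'f1': 0.8}
```

Here three of four predictions are correct, and one positive example is missed, which gives a precision of 1.0, a recall of 2/3, and an F1 of 0.8.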

In [9]:
trainer.train()
# Validation data => needs to be balanced

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: raw. If raw are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 9758
  Num Epochs = 10
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 12200


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.1707 | 0.545678 | 0.892623 | 0.89056 |
| 2 | 0.1049 | 0.501432 | 0.893033 | 0.894032 |
| 3 | 0.0839 | 0.703096 | 0.893852 | 0.893019 |
| 4 | 0.0484 | 0.75526 | 0.881148 | 0.884646 |
| 5 | 0.0431 | 0.666594 | 0.90041 | 0.897252 |
| 6 | 0.0301 | 0.801317 | 0.895082 | 0.895168 |
| 7 | 0.0157 | 0.847664 | 0.893443 | 0.894395 |
| 8 | 0.0143 | 0.779792 | 0.89959 | 0.896581 |
| 9 | 0.0123 | 0.872922 | 0.895492 | 0.895277 |
| 10 | 0.0092 | 0.797764 | 0.90041 | 0.898708 |
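The logged values show training loss falling steadily while validation loss is lowest early on. A small sketch that picks out the relevant epochs; the numbers are copied from the training log above, and the variable names are illustrative.

```python
# (epoch, training_loss, validation_loss, accuracy, f1), copied from the log
history = [
    (1, 0.1707, 0.545678, 0.892623, 0.890560),
    (2, 0.1049, 0.501432, 0.893033, 0.894032),
    (3, 0.0839, 0.703096, 0.893852, 0.893019),
    (4, 0.0484, 0.755260, 0.881148, 0.884646),
    (5, 0.0431, 0.666594, 0.900410, 0.897252),
    (6, 0.0301, 0.801317, 0.895082, 0.895168),
    (7, 0.0157, 0.847664, 0.893443, 0.894395),
    (8, 0.0143, 0.779792, 0.899590, 0.896581),
    (9, 0.0123, 0.872922, 0.895492, 0.895277),
    (10, 0.0092, 0.797764, 0.900410, 0.898708),
]

best_by_loss = min(history, key=lambda row: row[2])  # lowest validation loss
best_by_f1 = max(history, key=lambda row: row[4])    # highest validation F1
print(best_by_loss[0], best_by_f1[0])  # 2 10
```

Validation loss bottoms out at epoch 2, while F1 peaks at epoch 10, so which checkpoint counts as "best" depends on the criterion chosen. With the `TrainingArguments` used here, the model held by the trainer after training is the one from the final epoch.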


Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-1000
Configuration saved in test-trainer/checkpoint-1000/config.json
Model weights saved in test-trainer/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1000/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1000/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: raw. If raw are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num exam


Saving model checkpoint to test-trainer/checkpoint-1500
Configuration saved in test-trainer/checkpoint-1500/config.json
Model weights saved in test-trainer/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1500/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-2000
Configuration saved in test-trainer/checkpoint-2000/config.json
Model weights saved in test-trainer/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-2000/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-2000/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: raw. If raw are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num

TrainOutput(global_step=12200, training_loss=0.05206816607322849, metrics={'train_runtime': 4945.4748, 'train_samples_per_second': 19.731, 'train_steps_per_second': 2.467, 'total_flos': 1.28371883910144e+16, 'train_loss': 0.05206816607322849, 'epoch': 10.0})
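The step count and throughput figures in the log can be cross-checked with a few lines of arithmetic. This sketch only uses values printed above (9758 examples, batch size 8, 10 epochs, a runtime of 4945.4748 seconds); the variable names are illustrative.

```python
import math

num_examples = 9758
batch_size = 8
num_epochs = 10
train_runtime = 4945.4748  # seconds, from TrainOutput

# The last, partially filled batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = round(num_examples * num_epochs / train_runtime, 3)
steps_per_second = round(total_steps / train_runtime, 3)

print(steps_per_epoch, total_steps)  # 1220 12200
print(samples_per_second)            # 19.731
print(steps_per_second)              # 2.467
```

The results match `global_step=12200`, `train_samples_per_second` and `train_steps_per_second` reported by the trainer.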

In [None]:
trainer.save_model(save_model_path)

Saving model checkpoint to /content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Model/IndoBERTweet/v1
Configuration saved in /content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Model/IndoBERTweet/v1/config.json
Model weights saved in /content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Model/IndoBERTweet/v1/pytorch_model.bin
tokenizer config file saved in /content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Model/IndoBERTweet/v1/tokenizer_config.json
Special tokens file saved in /content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Model/IndoBERTweet/v1/special_tokens_map.json


## Predict Test Data

In [None]:
# New test data

In [10]:
prediction = trainer.predict(token_datasets["test"])
prediction

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: raw. If raw are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 536
  Batch size = 8


PredictionOutput(predictions=array([[ 4.673685 , -4.7360315],
       [-4.685712 ,  4.9413323],
       [-2.6386654,  2.816788 ],
       ...,
       [-4.8565326,  5.151706 ],
       [ 4.888393 , -5.1074696],
       [ 4.6153975, -4.645399 ]], dtype=float32), label_ids=array([0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1,
       1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
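The `predictions` array holds raw logits, one pair per example. A minimal sketch of turning them into class labels and probabilities with a softmax, using only the standard library. The three rows are the first three logit pairs shown in the output above; the `softmax` helper is written here for illustration and is not part of the notebook.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# First three logit pairs from the PredictionOutput above.
logits = [
    [4.673685, -4.7360315],
    [-4.685712, 4.9413323],
    [-2.6386654, 2.816788],
]

predicted_labels = []
for row in logits:
    probs = softmax(row)
    label = probs.index(max(probs))  # 0 = negative, 1 = positive
    predicted_labels.append(label)
    print(label, probs[label])

print(predicted_labels)  # [0, 1, 1]
```

Comparing these labels with the corresponding entries of `label_ids` shows which individual examples were classified correctly.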
    

In [11]:
# Old test data
dataset_lama =  load_dataset('csv', data_files={"test_lama" : "/content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Data/4. Train, Test/test/test_lama.csv"})
tokenize_lama = dataset_lama.map(tokenize_text,remove_columns=["text"])

Using custom data configuration default-ba03b4502f6bf24b


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-ba03b4502f6bf24b/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-ba03b4502f6bf24b/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.



  0%|          | 0/80 [00:00<?, ?ex/s]

In [12]:
prediction = trainer.predict(tokenize_lama["test_lama"])
prediction

***** Running Prediction *****
  Num examples = 80
  Batch size = 8


PredictionOutput(predictions=array([[-4.843047 ,  5.132696 ],
       [-4.8828015,  5.196537 ],
       [-4.9018536,  5.242657 ],
       [-4.8339405,  5.1053925],
       [-4.8965545,  5.260356 ],
       [ 4.5067987, -4.438685 ],
       [-2.3142107,  2.2554526],
       [-4.340966 ,  4.5708933],
       [ 4.865789 , -5.053497 ],
       [-3.7040396,  3.910992 ],
       [-4.8994665,  5.244673 ],
       [-4.822777 ,  5.1111975],
       [-4.904949 ,  5.2261295],
       [-4.6893163,  4.923004 ],
       [-4.907751 ,  5.240487 ],
       [-4.2641234,  4.399062 ],
       [-4.8990664,  5.2203465],
       [-4.488004 ,  4.667173 ],
       [-4.903232 ,  5.227977 ],
       [-4.8929696,  5.2257543],
       [-4.5384636,  4.8713202],
       [ 4.7399783, -4.8540697],
       [-4.904352 ,  5.2510896],
       [-4.903208 ,  5.243996 ],
       [-4.9024935,  5.2272243],
       [-4.890687 ,  5.2199473],
       [-4.9032245,  5.2397013],
       [-4.905522 ,  5.2470555],
       [-4.7852116,  5.012883 ],
       [-4.855