Reference:  
https://huggingface.co/docs/transformers/v4.17.0/en/tasks/sequence_classification  
https://huggingface.co/course/chapter3/3?fw=pt

## Preparing Variables

In [1]:
# Optional
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
seq_length = 256
pad_approach = "max_length"
bert_huggingface = "indolem/indobertweet-base-uncased"  # uncased, because all text data is lowercased
num_epoch = 5

files = {
    "train" : "/content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Data/4.1 Train Test Baru/25_75 Manual/train.csv",
    "test" : "/content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Data/4.1 Train Test Baru/25_75 Manual/test.csv"
}

save_model_path = "/content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Model/IndoBERTweet/v2"

load_model = False
load_path = "/content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Model/IndoBERTweet/v1"

## Install Prerequisite Library

In [3]:
!pip install transformers
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyY

## Preparing Datasets for Training
The dataset must contain a "text" column and a "labels" column.

The "labels" column must be encoded as 0 for negative sentiment and 1 for positive sentiment.
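As a rough illustration of that layout, the sketch below checks a CSV for the two required columns and for binary labels before it is handed to `load_dataset`. The helper name `validate_sentiment_csv` and the sample rows are hypothetical and not part of the original notebook; only the standard library is used.

```python
import csv
import io

REQUIRED_COLUMNS = {"text", "labels"}
VALID_LABELS = {"0", "1"}  # 0 = negative sentiment, 1 = positive sentiment


def validate_sentiment_csv(handle):
    """Return the number of data rows after checking columns and labels.

    Raises ValueError if a required column is missing or a label is not 0/1.
    Extra columns are allowed and ignored.
    """
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    count = 0
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        label = (row["labels"] or "").strip()
        if label not in VALID_LABELS:
            raise ValueError(f"line {line_no}: unexpected label {label!r}")
        count += 1
    return count


# Hypothetical two-row sample standing in for train.csv
sample = io.StringIO("text,labels\nvaksin aman,1\nvaksin bikin sakit,0\n")
print(validate_sentiment_csv(sample))  # 2
```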

In [4]:
from datasets import load_dataset
dataset = load_dataset('csv', data_files=files)

Using custom data configuration default-c4ad40dfb3ca8742


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-c4ad40dfb3ca8742/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-c4ad40dfb3ca8742/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.



## Tokenizing Datasets for Training and Evaluation

In [5]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(load_path if load_model else bert_huggingface, use_fast=True)


In [6]:
def tokenize_text(dataset_row):
    # Truncate or pad every example to exactly `seq_length` tokens.
    # `pad_approach` is set to "max_length" in the variables cell above.
    return tokenizer(str(dataset_row["text"]), max_length=seq_length,
                     truncation=True, padding=pad_approach)

# Tokenize every split and drop the "text" column, which the model does not consume.
token_datasets = dataset.map(tokenize_text, remove_columns=["text"])
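To make the effect of `truncation=True` combined with `padding="max_length"` concrete, here is a simplified, pure-Python sketch of what happens to a single sequence of token ids. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, and the ids used for `[CLS]`, `[SEP]` and `[PAD]` are placeholder assumptions (a real tokenizer reads them from the model's vocabulary).

```python
def pad_or_truncate(token_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Mimic truncation plus max_length padding for one sequence.

    cls_id, sep_id and pad_id are placeholder values for illustration only.
    """
    # Reserve two positions for [CLS] and [SEP], then cut the content.
    content = token_ids[: max_length - 2]
    input_ids = [cls_id] + content + [sep_id]
    attention_mask = [1] * len(input_ids)

    # Right-pad up to max_length; padded positions get attention mask 0.
    n_pad = max_length - len(input_ids)
    input_ids += [pad_id] * n_pad
    attention_mask += [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short = pad_or_truncate([7, 8, 9], max_length=8)
long = pad_or_truncate(list(range(1000, 1020)), max_length=8)
print(short["input_ids"])       # [101, 7, 8, 9, 102, 0, 0, 0]
print(short["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
print(len(long["input_ids"]))   # 8
```

With this padding strategy every example ends up with exactly `seq_length` positions, so batches can be stacked directly. The trade-off is that short tweets still cost a full 256-token forward pass.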


## Importing Models

In [7]:
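from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained(
    load_path if load_model else bert_huggingface, num_labels=2)

# The warning below is expected: the checkpoint comes from a pretraining model,
# so its language-modeling head weights (cls.predictions.*) are dropped and the
# classification head is newly initialized and still has to be trained.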


Some weights of the model checkpoint at indolem/indobertweet-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at indolem/ind

## Preparing Training

In [9]:
import numpy as np
from datasets import load_metric
from transformers import TrainingArguments, Trainer

# Load the metric once instead of on every evaluation call.
# The GLUE/MRPC metric reports accuracy and F1.
metric = load_metric("glue", "mrpc")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    # Predicted class = index of the largest logit in each row
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Evaluate at the end of every epoch; checkpoints are written to ./test-trainer
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch",
                                  num_train_epochs=num_epoch)

trainer = Trainer(
    model,
    training_args,
    train_dataset=token_datasets["train"],
    eval_dataset=token_datasets["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
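For reference, the sketch below spells out in plain Python what `compute_metrics` reports: the arg-max over each row of logits becomes the predicted class, and accuracy and binary F1 (with label 1 as the positive class) are computed from those predictions. This is an illustrative re-derivation, not the metric library's code, and the example logits are made up.

```python
def argmax_rows(logits):
    """Index of the largest value in each row (the predicted class)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy_and_f1(predictions, references, positive=1):
    """Accuracy and binary F1, treating `positive` as the positive class."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    denom = 2 * tp + fp + fn
    return {
        "accuracy": correct / len(pairs),
        "f1": 2 * tp / denom if denom else 0.0,
    }


# Made-up logits for four examples: column 0 = negative, column 1 = positive
logits = [[2.0, -1.0], [-0.5, 0.7], [0.3, 0.1], [-3.0, 3.0]]
labels = [0, 1, 1, 1]
preds = argmax_rows(logits)
print(preds)                           # [0, 1, 0, 1]
print(accuracy_and_f1(preds, labels))  # {'accuracy': 0.75, 'f1': 0.8}
```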

In [10]:
trainer.train()
# TODO: balance the validation data

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: raw. If raw are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 11916
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 7450


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.3608 | 0.865143 | 0.70398 | 0.72 |
| 2 | 0.2205 | 0.951622 | 0.716418 | 0.688525 |
| 3 | 0.1208 | 1.618968 | 0.721393 | 0.678161 |
| 4 | 0.0527 | 1.700948 | 0.733831 | 0.705234 |
| 5 | 0.0197 | 1.914068 | 0.721393 | 0.708333 |
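The per-epoch numbers above can be compared programmatically. The sketch below copies the logged values into a list and picks the epoch with the lowest validation loss, the highest accuracy, and the highest F1. The variable names are illustrative, and which criterion to use for model selection is a judgment call that the notebook does not specify.

```python
# (epoch, training loss, validation loss, accuracy, F1), copied from the log
history = [
    (1, 0.3608, 0.865143, 0.703980, 0.720000),
    (2, 0.2205, 0.951622, 0.716418, 0.688525),
    (3, 0.1208, 1.618968, 0.721393, 0.678161),
    (4, 0.0527, 1.700948, 0.733831, 0.705234),
    (5, 0.0197, 1.914068, 0.721393, 0.708333),
]

best_by_loss = min(history, key=lambda row: row[2])
best_by_acc = max(history, key=lambda row: row[3])
best_by_f1 = max(history, key=lambda row: row[4])

print(best_by_loss[0])  # 1
print(best_by_acc[0])   # 4
print(best_by_f1[0])    # 1
```

Training loss falls every epoch while validation loss rises from about 0.87 to 1.91, a pattern that is consistent with overfitting after the first epoch.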


Saving model checkpoint to test-trainer/checkpoint-500
Configuration saved in test-trainer/checkpoint-500/config.json
Model weights saved in test-trainer/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-500/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-1000
Configuration saved in test-trainer/checkpoint-1000/config.json
Model weights saved in test-trainer/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1000/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1000/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: raw. If raw are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num exam


Saving model checkpoint to test-trainer/checkpoint-1500
Configuration saved in test-trainer/checkpoint-1500/config.json
Model weights saved in test-trainer/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1500/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1500/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-2000
Configuration saved in test-trainer/checkpoint-2000/config.json
Model weights saved in test-trainer/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-2000/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-2000/special_tokens_map.json
Saving model checkpoint to test-trainer/checkpoint-2500
Configuration saved in test-trainer/checkpoint-2500/config.json
Model weights saved in test-trainer/checkpoint-2500/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-2500/tokenizer_config.json
Special tokens file 

TrainOutput(global_step=7450, training_loss=0.16180589291873393, metrics={'train_runtime': 2903.1895, 'train_samples_per_second': 20.522, 'train_steps_per_second': 2.566, 'total_flos': 7838078339174400.0, 'train_loss': 0.16180589291873393, 'epoch': 5.0})
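The step count and throughput figures in the log follow from the dataset size, batch size, and runtime. A quick arithmetic check, using only numbers reported above (the variable names are ours):

```python
import math

num_examples = 11916       # "Num examples" in the training log
batch_size = 8             # total train batch size
num_epochs = 5
train_runtime = 2903.1895  # seconds, from TrainOutput

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 1490 7450

# Throughput figures, consistent with train_steps_per_second and
# train_samples_per_second in TrainOutput.
print(round(total_steps / train_runtime, 3))                # 2.566
print(round(num_examples * num_epochs / train_runtime, 3))  # 20.522
```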

In [11]:
trainer.save_model("/content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Data/4.1 Train Test Baru/25_75 Manual/model")

Saving model checkpoint to /content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Data/4.1 Train Test Baru/25_75 Manual/model
Configuration saved in /content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Data/4.1 Train Test Baru/25_75 Manual/model/config.json
Model weights saved in /content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Data/4.1 Train Test Baru/25_75 Manual/model/pytorch_model.bin
tokenizer config file saved in /content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Data/4.1 Train Test Baru/25_75 Manual/model/tokenizer_config.json
Special tokens file saved in /content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Data/4.1 Train Test Baru/25_75 Manual/model/special_tokens_map.json


## Predict Test Data

In [None]:
# New test data

In [None]:
prediction = trainer.predict(token_datasets["test"])
prediction

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: raw. If raw are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 536
  Batch size = 8


PredictionOutput(predictions=array([[ 4.168479 , -4.1727095],
       [-4.1338816,  4.250479 ],
       [-3.5601375,  3.9232821],
       ...,
       [-4.2821107,  4.359684 ],
       [ 4.6179876, -4.7419157],
       [ 4.5954313, -4.636561 ]], dtype=float32), label_ids=array([0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1,
       1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
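The `predictions` array holds raw logits, one row per example and one column per class. A minimal sketch of converting such rows into class labels and probabilities with a softmax follows. The helper name `softmax` is ours, and the three rows are the first three logits printed above.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# First three rows of the printed `predictions` array
logits = [
    [4.168479, -4.1727095],
    [-4.1338816, 4.250479],
    [-3.5601375, 3.9232821],
]

for row in logits:
    probs = softmax(row)
    label = probs.index(max(probs))  # 0 = negative, 1 = positive
    print(label, round(probs[label], 4))
```

Compared with the first three entries of `label_ids` (0, 0, 1), the second example is predicted as positive with a probability above 0.99 although its label is negative, so a large logit gap does not guarantee a correct prediction.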
    

In [None]:
# Old test data
dataset_lama = load_dataset('csv', data_files={
    "test_lama": "/content/drive/Shareddrives/Riset Sentimen Vaksin COVID: Yaudahlah/Riset/Data/4. Train, Test/test/test_lama.csv"
})
tokenize_lama = dataset_lama.map(tokenize_text, remove_columns=["text"])

Using custom data configuration default-ba03b4502f6bf24b


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-ba03b4502f6bf24b/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-ba03b4502f6bf24b/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.



In [None]:
prediction = trainer.predict(tokenize_lama["test_lama"])
prediction

***** Running Prediction *****
  Num examples = 80
  Batch size = 8


PredictionOutput(predictions=array([[-4.2783346,  4.3750334],
       [-4.550526 ,  4.70738  ],
       [-4.601518 ,  4.6784077],
       [-4.5784426,  4.717128 ],
       [-4.6905413,  4.792999 ],
       [ 4.3675117, -4.4603066],
       [-4.0715456,  4.0525465],
       [-4.077451 ,  4.223304 ],
       [ 4.4408407, -4.4879465],
       [-3.472335 ,  3.6252003],
       [-4.6091247,  4.690289 ],
       [-4.5809994,  4.7177377],
       [-4.6385183,  4.7554607],
       [-4.5873775,  4.5945077],
       [-4.6637087,  4.713147 ],
       [-1.681029 ,  1.725981 ],
       [-4.5813546,  4.704983 ],
       [-3.7557862,  3.8869555],
       [-4.628789 ,  4.65861  ],
       [-4.244247 ,  4.2899947],
       [ 1.3193146, -1.3727512],
       [ 3.798824 , -3.7335513],
       [-4.6536126,  4.76008  ],
       [-4.672168 ,  4.755148 ],
       [-4.6481004,  4.736709 ],
       [-3.7333608,  3.6642554],
       [-4.6619644,  4.7627797],
       [-4.661133 ,  4.805429 ],
       [-4.4316525,  4.413816 ],
       [-4.401