# SEP532 Artificial Intelligence: Theory and Practice
## Deep Learning Practice 
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

## Advanced Models
### BERT

BERT and other Transformer encoder architectures have been highly successful on a wide range of NLP (natural language processing) tasks. They compute vector-space representations of natural language that are suitable for use in deep learning models. The BERT family of models uses the Transformer encoder architecture to process each token of input text in the full context of all tokens before and after it, hence the name: Bidirectional Encoder Representations from Transformers.

BERT models are usually pre-trained on a large corpus of text, then fine-tuned for specific tasks.

![BERT model](images/bert.png)
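
To make "full context of all tokens before and after" concrete, the following minimal sketch compares which positions a token may attend to in an encoder-style (bidirectional) model and in a left-to-right (causal) model. The function name `visible_positions` and the example sentence are hypothetical and chosen only for illustration; real models express this as an attention mask inside the Transformer layers.

```python
def visible_positions(num_tokens, bidirectional=True):
    """For each position i, list the positions it is allowed to attend to."""
    if bidirectional:
        # Encoder-style (BERT): every token sees the whole sequence.
        return [list(range(num_tokens)) for _ in range(num_tokens)]
    # Causal (left-to-right): token i sees only positions 0..i.
    return [list(range(i + 1)) for i in range(num_tokens)]


# Hypothetical example sentence, already split into tokens.
tokens = ["[CLS]", "the", "movie", "was", "great", "[SEP]"]
bert_view = visible_positions(len(tokens), bidirectional=True)
causal_view = visible_positions(len(tokens), bidirectional=False)

idx = tokens.index("movie")
print("bidirectional:", [tokens[j] for j in bert_view[idx]])
print("causal:       ", [tokens[j] for j in causal_view[idx]])
```

In the bidirectional case the representation of "movie" can depend on "great", which appears later in the sentence; in the causal case it cannot.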

#### Masked Language Modeling
Masked Language Modeling is a fill-in-the-blank task, where a model uses the context words surrounding a mask token to try to predict what the masked word should be. Masked language modeling is a great way to train a language model in a self-supervised setting (without human-annotated labels). 

![Masked language model](images/masked-language-model.png)
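
The fill-in-the-blank setup can be sketched in a few lines. The example below is a simplified illustration, not the exact BERT recipe: it replaces each token with `[MASK]` with a fixed probability and keeps the original word as the prediction target. The helper name `mask_tokens`, the 15% default, and the example sentence are assumptions made for this sketch; the original BERT procedure additionally replaces some selected tokens with random tokens or leaves them unchanged.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a simplified masked-LM example."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)  # the model must recover this word
        else:
            masked.append(token)
            labels.append(None)   # position does not contribute to the loss
    return masked, labels


# Hypothetical sentence; a higher mask_prob makes the effect visible on short input.
sentence = "the movie was really great".split()
masked, labels = mask_tokens(sentence, mask_prob=0.4, seed=3)
print(masked)
print(labels)
```

No human annotation is needed: the labels come from the text itself, which is what makes the task self-supervised.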

### Setup
#### Hugging Face Transformers
In this notebook, we will use 🤗 Transformers, which provides many Transformer architectures together with their pre-trained weights.

> 🤗 Transformers provides APIs to easily download and train state-of-the-art pretrained models. 
> Using pretrained models can reduce your compute costs, carbon footprint, and save you time from training a model from scratch. 
> The models can be used across different modalities such as:
> - 📝 Text: text classification, information extraction, question answering, summarization, translation, and text generation in over 100 languages.
> - 🖼️ Images: image classification, object detection, and segmentation.
> - 🗣️ Audio: speech recognition and audio classification.
> - 🐙 Multimodal: table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.

All models currently supported by Hugging Face Transformers can be found at [this link](https://huggingface.co/docs/transformers/en/index#supported-models).

In [2]:
!pip install \
    transformers \
    datasets \
    sentencepiece \
    "git+https://github.com/SKTBrain/KoBERT.git#egg=kobert_tokenizer&subdirectory=kobert_hf"

Collecting kobert_tokenizer
  Cloning https://github.com/SKTBrain/KoBERT.git to /tmp/pip-install-yd4bte3c/kobert-tokenizer_a4ace08c827b4f109dd05c23ab1ec315
  Running command git clone --filter=blob:none --quiet https://github.com/SKTBrain/KoBERT.git /tmp/pip-install-yd4bte3c/kobert-tokenizer_a4ace08c827b4f109dd05c23ab1ec315
  Resolved https://github.com/SKTBrain/KoBERT.git to commit e1f2f37055e7460d8427f6912579c0162cb69831
  Preparing metadata (setup.py) ... done

### Sentiment analysis
This notebook trains a sentiment analysis model to classify movie reviews as positive or negative, based on the text of the review.

We will use the [Naver sentiment movie corpus](https://github.com/e9t/nsmc) that contains the text of 200,000 movie reviews.

### Download the NSMC dataset
Let's download and extract the dataset. Thanks to 🤗 datasets, we can access the NSMC dataset by just calling the function `load_dataset`.

In [3]:
from datasets import load_dataset

raw_datasets = load_dataset('nsmc')

Using custom data configuration default
Reusing dataset nsmc (/root/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3)



Each item in the NSMC dataset consists of:
- `id`: The review id, provided by Naver
- `document`: The actual review
- `label`: The sentiment class of the review. (`0`: negative, `1`: positive)

In [4]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 150000
    })
    test: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 50000
    })
})

In [5]:
raw_datasets['train'][0]

{'id': '9976970', 'document': '아 더빙.. 진짜 짜증나네요 목소리', 'label': 0}
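
Before training a classifier it is worth checking how the two sentiment classes are balanced, since accuracy is easier to interpret when the classes are roughly even. The sketch below computes the share of each label; `label_distribution` and `toy_labels` are hypothetical names and values used only for illustration. In the notebook, one would presumably pass `raw_datasets['train']['label']` instead.

```python
from collections import Counter


def label_distribution(labels):
    """Return the fraction of examples carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


# Hypothetical labels (0: negative, 1: positive), not taken from NSMC.
toy_labels = [0, 1, 1, 0, 1, 0, 0, 1]
print(label_distribution(toy_labels))
```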

### Loading pre-trained models
BERT is typically used by fine-tuning a pre-trained model on the downstream task of interest. In this notebook, we use KoBERT, a BERT model pre-trained on a Korean corpus by SKT.

In [6]:
from kobert_tokenizer import KoBERTTokenizer
from transformers import BertForSequenceClassification

tokenizer = KoBERTTokenizer.from_pretrained('skt/kobert-base-v1')
model = BertForSequenceClassification.from_pretrained('skt/kobert-base-v1', num_labels=len(set(raw_datasets['train']['label'])))

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'XLNetTokenizer'. 
The class this function is called from is 'KoBERTTokenizer'.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at skt/kobert-base-v1 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Preprocessing dataset
Text inputs need to be transformed into numeric token ids and arranged in several tensors before being fed to BERT. To do that, we will use the `tokenizer` that comes with the BERT model. To process the whole dataset in one step, use the 🤗 Datasets `map` method to apply a preprocessing function to every example:

In [7]:
def tokenize(examples):
    return tokenizer(examples['document'], padding='max_length', max_length=256, truncation=True)

datasets = raw_datasets.map(tokenize, batched=True)

Loading cached processed dataset at /root/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3/cache-9023e8fd499afd9f.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3/cache-01430c6c797aa584.arrow


In [8]:
datasets['train'][0]['input_ids'][:32]

[2,
 3093,
 1698,
 6456,
 54,
 54,
 4368,
 4396,
 7316,
 5655,
 5703,
 2073,
 3,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1]
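
The ids printed above end in a long run of `1`s because `padding='max_length'` fills every example up to `max_length`. The following sketch reproduces that behavior in plain Python and also builds the matching attention mask. It is an illustration under stated assumptions, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the padding id `1` is inferred from the output above, and the naive truncation here would drop the closing special token, which a real tokenizer handles more carefully.

```python
def pad_or_truncate(token_ids, max_length, pad_id=1):
    """Cut or pad a list of token ids to exactly max_length entries."""
    ids = token_ids[:max_length]               # naive truncation
    attention_mask = [1] * len(ids)            # 1 marks real tokens
    num_pad = max_length - len(ids)
    # 0 in the attention mask tells the model to ignore padding positions.
    return ids + [pad_id] * num_pad, attention_mask + [0] * num_pad


# The non-padding ids of the first training example, copied from the output above.
review_ids = [2, 3093, 1698, 6456, 54, 54, 4368, 4396, 7316, 5655, 5703, 2073, 3]
input_ids, attention_mask = pad_or_truncate(review_ids, max_length=32)
print(input_ids)
print(attention_mask)
```

Padding every review to 256 tokens is simple but wasteful for short reviews like this one, where most positions carry no information.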

### Train the model
Similar to Keras's `compile()` and `fit()` in TensorFlow, 🤗 Transformers provides a [`Trainer`](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer) class to train the model. The behavior of the `Trainer` is configured through `TrainingArguments`.

In [9]:
from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir='kobert-nsmc',
    seed=31414,
    num_train_epochs=10,
    per_device_train_batch_size=128,
    gradient_accumulation_steps=2,
    per_device_eval_batch_size=128,
    learning_rate=5e-05,
    warmup_steps=500,
    evaluation_strategy='steps',
    eval_steps=300,
    save_strategy='steps',
    save_steps=300,
    save_total_limit=10,
    load_best_model_at_end=True,
    fp16=True,
)
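
Several of these arguments interact. The sketch below works through the arithmetic for this configuration, assuming a single device and the default linear learning-rate schedule (warmup followed by linear decay). The function names are hypothetical, and the exact rounding used by `Trainer` may differ between library versions; the result can be compared with the `Total optimization steps` line in the training log further below.

```python
import math


def total_optimization_steps(num_examples, per_device_batch_size,
                             accumulation_steps, num_epochs, num_devices=1):
    """Number of optimizer updates over the whole run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One optimizer update happens every `accumulation_steps` batches.
    updates_per_epoch = max(batches_per_epoch // accumulation_steps, 1)
    return updates_per_epoch * num_epochs


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


total_steps = total_optimization_steps(150000, 128, 2, 10)
print("effective batch size:", 128 * 2)
print("total optimization steps:", total_steps)
for step in (0, 250, 500, 3000, total_steps):
    print(step, linear_schedule_lr(step, 5e-05, 500, total_steps))
```

With 150,000 training examples, a per-device batch size of 128, and 2 accumulation steps, each epoch has 586 updates, so 10 epochs give 5,860 updates in total.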

#### Metrics
By default, `Trainer` reports only the loss during evaluation. To track a task metric such as accuracy, we need to pass `Trainer` a function that computes and reports it. The 🤗 Datasets library provides a simple accuracy function you can load with the `load_metric` function:

In [10]:
import numpy as np
from datasets import load_metric

metric_accuracy = load_metric("accuracy")

def compute_metrics(logits_and_labels):
    logits, labels = logits_and_labels
    predictions = np.argmax(logits, axis=-1)
    return metric_accuracy.compute(predictions=predictions, references=labels)
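
To see what `compute_metrics` does with the model outputs, here is a dependency-free sketch of the same two steps: take the index of the largest logit in each row, then compare with the reference labels. The helper names and the toy logits are hypothetical; the notebook itself relies on `np.argmax` and the loaded accuracy metric.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for four reviews, one column per class (negative, positive).
toy_logits = [[2.1, -0.3], [-1.0, 0.7], [0.2, 0.1], [-0.5, 1.5]]
toy_labels = [0, 1, 1, 1]
print(accuracy(toy_logits, toy_labels))  # {'accuracy': 0.75}
```

The third review is the only miss: its logits favor class 0 while its label is 1.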

#### Trainer
Create a `Trainer` object with your model, training arguments, training and test datasets, and evaluation function:

In [11]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=datasets['train'],
    eval_dataset=datasets['test'],
    compute_metrics=compute_metrics,
)

Using amp half precision backend


Then fine-tune your model by calling `train()`:

In [12]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, document. If id, document are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 150000
  Num Epochs = 10
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 2
  Total optimization steps = 5860


| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 300 | No log | 0.32457 | 0.8635 |
| 600 | 0.414300 | 0.288268 | 0.87722 |
| 900 | 0.414300 | 0.262462 | 0.89108 |
| 1200 | 0.262800 | 0.27677 | 0.89528 |
| 1500 | 0.201000 | 0.265517 | 0.89666 |
| 1800 | 0.201000 | 0.310674 | 0.89656 |
| 2100 | 0.155200 | 0.301916 | 0.89534 |
| 2400 | 0.155200 | 0.36263 | 0.89634 |
| 2700 | 0.115100 | 0.357797 | 0.89584 |
| 3000 | 0.081500 | 0.420623 | 0.89448 |
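
The evaluation log above shows the training loss falling steadily while the validation loss starts rising after the first few hundred steps, a common sign of overfitting. The sketch below picks the best evaluation step from the rows shown, once by validation loss and once by accuracy. The variable names are hypothetical. As far as the library defaults go, `load_best_model_at_end=True` without a `metric_for_best_model` selects by evaluation loss, so the two criteria need not agree.

```python
# (step, validation_loss, accuracy), copied from the evaluation log above.
eval_log = [
    (300, 0.32457, 0.8635),
    (600, 0.288268, 0.87722),
    (900, 0.262462, 0.89108),
    (1200, 0.27677, 0.89528),
    (1500, 0.265517, 0.89666),
    (1800, 0.310674, 0.89656),
    (2100, 0.301916, 0.89534),
    (2400, 0.36263, 0.89634),
    (2700, 0.357797, 0.89584),
    (3000, 0.420623, 0.89448),
]

best_by_loss = min(eval_log, key=lambda row: row[1])
best_by_accuracy = max(eval_log, key=lambda row: row[2])

print("lowest validation loss at step", best_by_loss[0])
print("highest accuracy at step", best_by_accuracy[0])
```

Among the rows shown, validation loss is lowest at step 900 and accuracy is highest at step 1500, which suggests that training for the full 10 epochs is not necessary for this task.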


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: id, document. If id, document are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 50000
  Batch size = 128
Saving model checkpoint to kobert-nsmc/checkpoint-300
Configuration saved in kobert-nsmc/checkpoint-300/config.json
Model weights saved in kobert-nsmc/checkpoint-300/pytorch_model.bin
Deleting older checkpoint [kobert-nsmc/checkpoint-900] due to args.save_total_limit
Saving model checkpoint to kobert-ns

TrainOutput(global_step=5860, training_loss=0.12304460034028662, metrics={'train_runtime': 5873.2764, 'train_samples_per_second': 255.394, 'train_steps_per_second': 0.998, 'total_flos': 1.9733329152e+17, 'train_loss': 0.12304460034028662, 'epoch': 10.0})

In [13]:
from transformers import AutoModelForSequenceClassification

# A hard-coded path such as './kobert-nsmc/checkpoint-300' may no longer exist,
# because older checkpoints are deleted according to save_total_limit.
# Load the best checkpoint recorded by the Trainer instead.
best_checkpoint = trainer.state.best_model_checkpoint
model = AutoModelForSequenceClassification.from_pretrained(best_checkpoint).to('cuda')

### Evaluate the model

In [None]:
trainer.evaluate(datasets['test'].shuffle().select(range(10000)))

In [None]:
import torch

model.eval()  # disable dropout for inference
for example in datasets['test'].shuffle().select(range(8)):
    # Add a batch dimension of size 1 and move the tensors to the GPU.
    input_ids = torch.as_tensor([example['input_ids']]).to('cuda')
    attention_mask = torch.as_tensor([example['attention_mask']]).to('cuda')

    with torch.no_grad():  # gradients are not needed for prediction
        output = model(input_ids=input_ids, attention_mask=attention_mask)
    print('Text:', example['document'])
    print('Predicted:', torch.argmax(output.logits, dim=-1).item())
    print('Actual:', example['label'])
    print()
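
The loop above prints only the predicted class index. If a confidence score and a readable label are also wanted, the logits can be passed through a softmax. The sketch below does this in plain Python; `softmax`, `describe_prediction`, and the example logits are hypothetical, while the label mapping (`0`: negative, `1`: positive) follows the dataset description earlier in this notebook.

```python
import math

LABEL_NAMES = {0: "negative", 1: "positive"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def describe_prediction(logits):
    """Return the predicted label name and its probability."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_NAMES[label], probs[label]


# Hypothetical logits for one review, not produced by the trained model.
name, confidence = describe_prediction([-1.2, 2.3])
print(f"{name} ({confidence:.3f})")
```

With a PyTorch model, the same probabilities could be obtained with `torch.softmax(output.logits, dim=-1)`.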