# Sakamoto BERT

Fine-tune a Japanese BERT model to classify Sakamoto tweets into two classes (`aozam3` and `sksk_sskn`). 🐳

## Install Transformers

In [0]:
!git clone https://github.com/huggingface/transformers
!cd transformers;pip install .

## Download a pre-trained BERT model

In [0]:
!wget http://nlp.ist.i.kyoto-u.ac.jp/nl-resource/JapaneseBertPretrainedModel/Japanese_L-12_H-768_A-12_E-30_BPE_transformers.zip
!unzip -d bert Japanese_L-12_H-768_A-12_E-30_BPE_transformers.zip

## Prepare data

In [0]:
# !wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
# !python download_glue_data.py --data_dir glue_data --tasks all

In [0]:
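# The custom dataset replaces the GLUE CoLA files, so run_glue.py can be used unchanged.
# Google Drive must already be mounted at ./drive for the copy to work.
!mkdir -p glue_data/CoLA
!cp drive/My\ Drive/sakamotcha-bot/data/sakamoto_cola/* glue_data/CoLA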
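
The files copied above have to follow the CoLA column layout that `run_glue.py` reads. The sketch below writes that layout, assuming the standard GLUE CoLA format: `train.tsv` and `dev.tsv` have no header and four columns (source, label, unused annotation, sentence), while `test.tsv` has a header row and two columns (index, sentence). The helper names, the `tweet` source tag, and the placeholder sentences are hypothetical; the label order is inferred from the class list used at inference time.

```python
import tempfile
from pathlib import Path


def clean(text):
    # Tabs and newlines would break the one-example-per-line TSV layout.
    return " ".join(text.split())


def write_cola_split(path, examples, is_test=False):
    """Write (label, sentence) pairs in the assumed GLUE CoLA column layout."""
    lines = []
    if is_test:
        lines.append("index\tsentence")  # test.tsv carries a header row
        for i, (_, sentence) in enumerate(examples):
            lines.append(f"{i}\t{clean(sentence)}")
    else:
        for label, sentence in examples:
            # Columns: source, label, original annotation (left empty), sentence
            lines.append(f"tweet\t{label}\t\t{clean(sentence)}")
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Hypothetical examples: label 0 = aozam3, label 1 = sksk_sskn
examples = [(0, "placeholder tweet one"), (1, "placeholder tweet two")]

out_dir = Path(tempfile.mkdtemp())
write_cola_split(out_dir / "train.tsv", examples)
write_cola_split(out_dir / "dev.tsv", examples)
write_cola_split(out_dir / "test.tsv", examples, is_test=True)
```

Labels must be `0` or `1`, since the CoLA task is binary.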

## Fine-tune the model

In [0]:
!pip install -U future
!cd transformers;pip install -r ./examples/requirements.txt

In [0]:
!python transformers/examples/text-classification/run_glue.py \
    --model_name_or_path bert/Japanese_L-12_H-768_A-12_E-30_BPE_transformers \
    --task_name CoLA \
    --do_train \
    --do_eval \
    --data_dir glue_data/CoLA \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size 8 \
    --per_gpu_train_batch_size 8 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir sakamoto_bert/
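
With `--task_name CoLA`, the evaluation step should report the Matthews correlation coefficient (`mcc`) rather than plain accuracy. The sketch below shows how that score is computed for binary labels. `binary_mcc` is a hypothetical local helper, not a Transformers or scikit-learn function, and the labels and predictions are made up for illustration.

```python
import math


def binary_mcc(labels, preds):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the score is 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical dev-set labels and predictions, for illustration only.
labels = [0, 0, 1, 1, 1, 0]
preds = [0, 1, 1, 1, 0, 0]
score = binary_mcc(labels, preds)
print(f"mcc = {score:.3f}")  # prints "mcc = 0.333"
```

A score of 1 means perfect agreement, 0 is chance level, and -1 is total disagreement, which makes the metric more informative than accuracy when the two classes are unbalanced.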

In [0]:
!cp -r sakamoto_bert drive/My\ Drive/sakamotcha-bot/data/

## Do the task

In [0]:
import torch

In [0]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('sakamoto_bert')
model = AutoModelForSequenceClassification.from_pretrained('sakamoto_bert')

In [0]:
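# Each line of test.tsv holds an index and the tweet text, separated by a tab.
sequences = []
with open('glue_data/CoLA/test.tsv', encoding='utf-8') as f:
    for line in f:
        # Split on the first tab only, so a tab inside the text cannot break unpacking
        _, sequence = line.strip().split('\t', 1)
        sequences.append(sequence)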

In [0]:
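classes = ['aozam3', 'sksk_sskn']

model.eval()  # make sure dropout is disabled for inference
with torch.no_grad():  # gradients are not needed for inference
    for sequence in sequences:
        # Tokenize one tweet; the result already has a batch dimension of 1
        sequence_tensor = tokenizer.encode(sequence, return_tensors="pt")
        # The first output element holds the logits, shape (1, num_labels)
        classification_logits = model(sequence_tensor)[0]

        # Convert the logits to class probabilities
        results = torch.softmax(classification_logits, dim=1).tolist()[0]

        print(sequence)
        for name, prob in zip(classes, results):
            print('{}: {}'.format(name, prob))
        print()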
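
The softmax step in the loop above turns the two raw logits into probabilities that sum to 1. Here is a minimal plain-Python sketch of the same computation for a single tweet, followed by picking the most likely class. The logit values are hypothetical, and `softmax` here is a local helper, not the PyTorch function.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


classes = ['aozam3', 'sksk_sskn']
logits = [1.2, -0.3]  # hypothetical model output for one tweet

probs = softmax(logits)
# The predicted class is the one with the highest probability.
predicted = classes[probs.index(max(probs))]

for name, prob in zip(classes, probs):
    print('{}: {:.3f}'.format(name, prob))
print('predicted:', predicted)
```

Because softmax preserves the order of the logits, the predicted class is also the index of the largest raw logit.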

## References

- Transformers
    - https://github.com/huggingface/transformers
    - https://huggingface.co/transformers/index.html
    - https://colab.research.google.com/github/pytorch/pytorch.github.io/blob/master/assets/hub/huggingface_pytorch-transformers.ipynb

- CoLA
    - https://nyu-mll.github.io/CoLA/

- BERT Japanese pretrained model
    - http://nlp.ist.i.kyoto-u.ac.jp/index.php?BERT%E6%97%A5%E6%9C%AC%E8%AA%9EPretrained%E3%83%A2%E3%83%87%E3%83%AB

- Qiita
    - https://qiita.com/neonsk/items/27424d6122e00fe632b0
    - https://qiita.com/nekoumei/items/7b911c61324f16c43e7e
    - https://qiita.com/kenta1984/items/7f3a5d859a15b20657f3
    - https://qiita.com/knok/items/9e3b4505d6b8f813943d