# Fine-tune ruGPT3Small on a fan fiction collection



## Install environment

In [None]:
!pip3 install urllib3==1.25.4

In [None]:
!pip3 install transformers==2.8.0

In [None]:
!wget https://raw.githubusercontent.com/sberbank-ai/ru-gpts/master/pretrain_transformers.py

In [None]:
!wget https://raw.githubusercontent.com/sberbank-ai/ru-gpts/master/generate_transformers.py

In [None]:
%%writefile setup.sh

export CUDA_HOME=/usr/local/cuda-10.1
git clone https://github.com/NVIDIA/apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex

Writing setup.sh


In [None]:
!sh setup.sh

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [None]:
data_path = "drive/My Drive/GPT/fanfics.txt"
!ls "$data_path"

In [None]:
DATA_HOME = "drive/My Drive/GPT/data"
!ls "$DATA_HOME"

## Prepare data

In [None]:
# Read the corpus as UTF-8; each line of the file is treated as one sample.
with open(data_path, "r", encoding="utf-8") as file:
    text = file.readlines()

In [None]:
import pandas as pd
import numpy as np

In [None]:
texts_df = pd.DataFrame(text, columns=['text'])
train_df = texts_df.sample(frac=.8)
test_df = texts_df[~texts_df.index.isin(train_df.index)]
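
The split above works at the line level: 80% of the lines go to the training set and the rest to the test set. A minimal sketch of the same idea using only the standard library, with a fixed seed so the split can be reproduced between runs. The helper name `split_lines`, the seed value, and the demo lines are hypothetical and not part of the notebook.

```python
import random

def split_lines(lines, train_frac=0.8, seed=42):
    """Split lines into train/test parts using a seeded shuffle of the indices."""
    rng = random.Random(seed)          # local RNG, so global random state is untouched
    indices = list(range(len(lines)))
    rng.shuffle(indices)
    cut = int(len(lines) * train_frac)
    # Sort each part so lines keep their original relative order.
    train_idx = sorted(indices[:cut])
    test_idx = sorted(indices[cut:])
    return [lines[i] for i in train_idx], [lines[i] for i in test_idx]

# Hypothetical stand-in for the lines read from fanfics.txt.
demo_lines = [f"story {i}\n" for i in range(10)]
train_lines, test_lines = split_lines(demo_lines)
print(len(train_lines), len(test_lines))
```

With pandas, passing `random_state` to `DataFrame.sample` gives the same kind of reproducibility.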


In [None]:
def save_text_df(df, filename):
    """Write the 'text' column of df to filename; each line keeps its own newline."""
    with open(filename, "w", encoding="utf-8") as f:
        for idx in df.index:
            f.write(df.loc[idx, 'text'])

In [None]:
save_text_df(train_df, DATA_HOME + "/train.txt")
save_text_df(test_df, DATA_HOME + "/test.txt")

## Run finetuning
The following code loads the model and tokenizer through transformers and fine-tunes the model on the fan fiction texts.

This took around ten minutes and reached a perplexity of 13-16.

In [None]:
!unzip "$DATA_HOME/fanfics_1_epoch.zip"

In [None]:
!python pretrain_transformers.py \
    --output_dir=fanfics_model \
    --model_type=gpt2 \
    --model_name_or_path=fanfics_model/checkpoint-17000 \
    --do_train \
    --train_data_file="$DATA_HOME/train.txt" \
    --do_eval \
    --fp16 \
    --eval_data_file="$DATA_HOME/test.txt" \
    --per_gpu_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 5 \
    --block_size 2048 \
    --eval_all_checkpoints \
    --overwrite_output_dir
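
To see what the batch settings in this command imply, here is a small sketch that computes how many sequences and tokens contribute to one optimizer update. It assumes the usual convention that the effective batch is the per-GPU batch size times the gradient accumulation steps times the number of GPUs; the single-GPU default and the function name are assumptions, not something the script output confirms.

```python
def tokens_per_optimizer_step(per_gpu_batch, grad_accum_steps, block_size, n_gpu=1):
    """Return (sequences, tokens) consumed by one optimizer update."""
    effective_batch = per_gpu_batch * grad_accum_steps * n_gpu
    return effective_batch, effective_batch * block_size

# Values taken from the command above; n_gpu=1 is an assumption about the runtime.
effective_batch, tokens = tokens_per_optimizer_step(
    per_gpu_batch=1, grad_accum_steps=1, block_size=2048
)
print(effective_batch, tokens)
```

Under these assumptions each update sees a single 2048-token block, so raising `--gradient_accumulation_steps` is the memory-neutral way to get a larger effective batch.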

In [None]:
!tensorboard --inspect --event_file=/content/runs/Dec20_20-31-15_c81ee3f61962/events.out.tfevents.1608496275.c81ee3f61962.4314.0

2020-12-20 21:29:12.777217: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
Processing event files... (this can take a few minutes)

These tags are in /content/runs/Dec20_20-31-15_c81ee3f61962/events.out.tfevents.1608496275.c81ee3f61962.4314.0:
audio -
histograms -
images -
scalars
   loss
   lr
tensor -

Event statistics for /content/runs/Dec20_20-31-15_c81ee3f61962/events.out.tfevents.1608496275.c81ee3f61962.4314.0:
audio -
graph -
histograms -
images -
scalars
   first_step           14500
   last_step            17000
   max_step             17000
   min_step             14500
   num_steps            6
   outoforder_steps     []
sessionlog:checkpoint -
sessionlog:start -
sessionlog:stop -
tensor -



## Check our model

In [None]:
from tensorflow.python.summary.summary_iterator import summary_iterator

In [None]:
for summary in summary_iterator("/content/runs/Dec20_20-31-15_c81ee3f61962/events.out.tfevents.1608496275.c81ee3f61962.4314.0"):
    print(summary)

Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
wall_time: 1608496275.9209788
file_version: "brain.Event:2"

wall_time: 1608496597.4696038
step: 14500
summary {
  value {
    tag: "lr"
    simple_value: 4.970543159288354e-05
  }
}

wall_time: 1608496597.4697087
step: 14500
summary {
  value {
    tag: "loss"
    simple_value: 3.4174423217773438
  }
}

wall_time: 1608496932.0207539
step: 15000
summary {
  value {
    tag: "lr"
    simple_value: 4.9410864448873326e-05
  }
}

wall_time: 1608496932.020882
step: 15000
summary {
  value {
    tag: "loss"
    simple_value: 3.423922538757324
  }
}

wall_time: 1608497266.7529678
step: 15500
summary {
  value {
    tag: "lr"
    simple_value: 4.911629730486311e-05
  }
}

wall_time: 1608497266.7531455
step: 15500
summary {
  value {
    tag: "loss"
    simple_value: 3.414184808731079
  }
}

wall_time: 1608497601.7439253
step: 16000
summary {
  value {
    tag: "lr"
    simple_value: 4.882172652287409e-05
  }
}
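
The `loss` scalars in the event log above can be related to perplexity. A minimal sketch, assuming the logged value is the mean per-token cross-entropy in nats (the usual convention for this kind of training script, but not confirmed by the log itself), in which case perplexity is simply the exponential of the loss. Note that these are training losses, not the evaluation loss.

```python
import math

# Training loss values copied from the event log above.
logged_loss = {
    14500: 3.4174423217773438,
    15000: 3.423922538757324,
    15500: 3.414184808731079,
}

def perplexity(cross_entropy_loss):
    """Perplexity is exp() of the mean per-token cross-entropy (natural log)."""
    return math.exp(cross_entropy_loss)

for step, loss in logged_loss.items():
    print(f"step {step}: loss {loss:.3f} -> perplexity {perplexity(loss):.1f}")
```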

In [None]:
!zip -r fanfics_1_epoch.zip fanfics_model/checkpoint-17000
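
The generation cell that follows passes `--k=5` and `--p=0.95`, which presumably map to top-k and nucleus (top-p) sampling. The sketch below illustrates the idea on a plain list of probabilities: keep the `k` most likely tokens, then keep the smallest prefix of them whose renormalized cumulative probability reaches `p`. It is an illustration only, not the library's implementation; the function name and the toy distribution are hypothetical.

```python
import random

def top_k_top_p_filter(probs, k=5, p=0.95):
    """Return {token_index: renormalized_prob} after top-k, then top-p filtering."""
    # Rank token indices from most to least probable and keep the top k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in ranked)
    # Nucleus step: accumulate renormalized probability until it reaches p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Hypothetical next-token distribution over a 7-token vocabulary.
probs = [0.5, 0.2, 0.1, 0.1, 0.05, 0.03, 0.02]
filtered = top_k_top_p_filter(probs, k=5, p=0.95)

# Draw one token index from the filtered distribution.
rng = random.Random(0)
next_token = rng.choices(list(filtered), weights=list(filtered.values()))[0]
print(sorted(filtered), next_token)
```

A small `k` such as 5 already restricts sampling heavily, so with these settings the `p` threshold only matters when the top five tokens are very unevenly weighted.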

In [None]:
!python generate_transformers.py \
    --model_type=gpt2 \
    --model_name_or_path=fanfics_model/checkpoint-17000 \
    --k=5 \
    --p=0.95 \
    --length=200

2020-12-21 14:51:19.792265: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
12/21/2020 14:51:22 - INFO - transformers.tokenization_utils -   Model name 'fanfics_model/checkpoint-17000' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). Assuming 'fanfics_model/checkpoint-17000' is a path, a model identifier, or url to a directory containing tokenizer files.
12/21/2020 14:51:22 - INFO - transformers.tokenization_utils -   Didn't find file fanfics_model/checkpoint-17000/added_tokens.json. We won't load it.
12/21/2020 14:51:22 - INFO - transformers.tokenization_utils -   loading file fanfics_model/checkpoint-17000/vocab.json
12/21/2020 14:51:22 - INFO - transformers.tokenization_utils -   loading file fanfics_model/checkpoint-17000/merges.txt
12/21/2020 14:51:22 - INFO - transformers.tokenization_utils -   loading file None
12/21/2020 14:51:22 - INFO - transformers.tokenization_ut