[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jonasengelmann/kafka_tagebuch_bot/blob/master/Kafka_Tagebuch_BOT.ipynb)

Note: You can skip training/fine-tuning and use my pretrained model for generation; just jump ahead to chapter 3.

# 0 Check Prerequisites


We need a GPU with a lot of GPU RAM, so either a P4, T4 or P100. K80s unfortunately will not work.

In [None]:
!nvidia-smi
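The same check can be done programmatically. The following is a minimal sketch, assuming that matching the GPU name is enough to tell a K80 apart from the supported cards; the helper names `parse_gpu_line`, `is_supported` and `list_gpus` are hypothetical and not part of the notebook:

```python
import shutil
import subprocess

# GPU families the text above reports as unsuitable
# (assumption: a simple substring match on the device name is sufficient).
UNSUPPORTED = ("K80",)


def parse_gpu_line(line):
    """Split one 'name, memory' CSV line from nvidia-smi into (name, MiB)."""
    name, memory = (part.strip() for part in line.rsplit(",", 1))
    return name, int(memory)


def is_supported(name):
    """Return False for GPU names listed in UNSUPPORTED."""
    return not any(tag in name for tag in UNSUPPORTED)


def list_gpus():
    """Query name and total memory of every visible GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


if shutil.which("nvidia-smi"):
    try:
        for name, mib in list_gpus():
            status = "ok" if is_supported(name) else "will not work here"
            print(f"{name}: {mib} MiB ({status})")
    except subprocess.CalledProcessError:
        print("nvidia-smi is installed but could not query a GPU.")
else:
    print("nvidia-smi not found; no GPU runtime seems to be attached.")
```

If the runtime reports a K80, requesting a fresh runtime may yield a different card.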

# 1 Installation

In [None]:
!git clone https://github.com/gooofy/transformer-lm

Check out the latest tested revision so that the notebook keeps working in the future:

In [None]:
%cd transformer-lm
!git checkout eded3a7

Install transformer-lm:

In [None]:
!pip install -r requirements.txt
!python setup.py develop
!pip install json_log_plots

Download Zamia's pretrained German model and unpack it:

In [None]:
!wget https://goofy.zamia.org/zamia-speech/brain/gpt2-german-345M-r20191119.tar.xz -P ../model
!tar xf ../model/gpt2-german-345M-r20191119.tar.xz -C ../model

The model's parameter *seen_tokens* has to be reset to allow fine-tuning:

In [None]:
import torch
from pathlib import Path
model_path = Path('../model') / 'de345-root' / 'model.pt'
state = torch.load(model_path)
state['seen_tokens'] = 0
torch.save(state, model_path)
del state
torch.cuda.empty_cache()

# 2 Fine-Tuning

### 2.1 On all of Kafka's work

Upload your dataset here. It should have the following folder structure:


In [None]:
# kafka_dataset/
# |-----valid/
#       |----valid.txt
# |-----test/
#       |----test.txt
# |-----train/
#       |----train.txt
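If the corpus starts out as a single text file, the folder layout above could be produced with a small script like the following. This is a minimal sketch: `write_splits`, the 90/5/5 proportions and the contiguous line-based split are illustrative assumptions, not the procedure used for the original Kafka dataset.

```python
import tempfile
from pathlib import Path


def write_splits(source_file, dataset_dir, valid_frac=0.05, test_frac=0.05):
    """Split a text file by lines into train/valid/test folders.

    Returns the number of lines written to each split.
    """
    lines = Path(source_file).read_text(encoding="utf-8").splitlines(keepends=True)
    if len(lines) < 3:
        raise ValueError("need at least three lines to build three splits")
    n_valid = max(1, round(len(lines) * valid_frac))
    n_test = max(1, round(len(lines) * test_frac))
    n_train = len(lines) - n_valid - n_test
    splits = {
        "train": lines[:n_train],
        "valid": lines[n_train:n_train + n_valid],
        "test": lines[n_train + n_valid:],
    }
    for name, chunk in splits.items():
        target = Path(dataset_dir) / name
        target.mkdir(parents=True, exist_ok=True)
        (target / f"{name}.txt").write_text("".join(chunk), encoding="utf-8")
    return {name: len(chunk) for name, chunk in splits.items()}


# Demo on a throwaway corpus of 100 lines in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus.txt"
    corpus.write_text("".join(f"Zeile {i}\n" for i in range(100)), encoding="utf-8")
    counts = write_splits(corpus, Path(tmp) / "kafka_dataset")
    created = sorted(
        str(p.relative_to(tmp)) for p in Path(tmp).rglob("*.txt") if p != corpus
    )
    print(counts)
    print(created)
```

For real data, call `write_splits` with the path of the corpus file and `kafka_dataset` as the target directory, then zip that folder for upload.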

In [None]:
from google.colab import files
uploaded = files.upload()

Unzip dataset:

In [None]:
!unzip -a "kafka_dataset.zip"

Preprocessing: the dataset needs to be encoded with the sentencepiece model:

In [None]:
!sp-encode kafka_dataset ../model/de345-root/sp.model kafka_dataset/encoded

To avoid an error, we have to rename the sentencepiece model: transformer-lm copies it into the model folder as *sp.model*, and a file with that name already exists there:


In [None]:
!mv ../model/de345-root/sp.model ../model/de345-root/sp_old.model

Fine-tuning: most parameters are already fixed by Zamia's German model; we can only change training settings such as *batch-size* and *epochs*:


In [None]:
!gpt-2 \
    ../model/de345-root \
    kafka_dataset/encoded/ \
    ../model/de345-root/sp_old.model \
    --batch-size 1 \
    --g-accum-gradients 2 \
    --n-ctx 1024 \
    --n-embed 1024 \
    --n-hidden 1024 \
    --n-head 16 \
    --n-layer 24 \
    --epochs 4 \
    --lr=2e-4
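A batch size of 1 keeps memory usage low. Assuming that `--g-accum-gradients` accumulates gradients over that many batches before each optimizer update, and that every sample fills the whole context window (both are assumptions about transformer-lm, not verified here), the effective batch size works out as follows:

```python
# Values taken from the command above.
batch_size = 1      # --batch-size
accum_steps = 2     # --g-accum-gradients
n_ctx = 1024        # --n-ctx

# Assumption: one optimizer update happens after `accum_steps` batches.
effective_batch = batch_size * accum_steps

# Assumption: each sample contains exactly n_ctx tokens.
tokens_per_update = effective_batch * n_ctx

print(f"effective batch size: {effective_batch}")
print(f"tokens per optimizer update: {tokens_per_update}")
```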

Plot loss over steps:

In [None]:
import json_log_plots
json_log_plots.plot("../model/de345-root")

### 2.2 Fine-Tuning II: On Kafka's diaries

Upload your dataset here. It should have the following folder structure:

In [None]:
# kafka_diaries_dataset/
# |-----valid/
#       |----valid.txt
# |-----test/
#       |----test.txt
# |-----train/
#       |----train.txt

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
!unzip -a "kafka_diaries_dataset.zip"

In [None]:
!sp-encode kafka_diaries_dataset ../model/de345-root/sp.model kafka_diaries_dataset/encoded

In [None]:
!mv ../model/de345-root/sp.model ../model/de345-root/sp_old2.model

In [None]:
!rm ../model/de345-root/json-log-plots.log

In [None]:
import torch
from pathlib import Path
model_path = Path('../model') / 'de345-root' / 'model.pt'
state = torch.load(model_path)
state['seen_tokens'] = 0
torch.save(state, model_path)
del state
torch.cuda.empty_cache()

In [None]:
!gpt-2 \
    ../model/de345-root \
    kafka_diaries_dataset/encoded/ \
    ../model/de345-root/sp_old2.model \
    --batch-size 1 \
    --g-accum-gradients 2 \
    --n-ctx 1024 \
    --n-embed 1024 \
    --n-hidden 1024 \
    --n-head 16 \
    --n-layer 24 \
    --epochs 10 \
    --lr=1.5e-5

In [None]:
import json_log_plots
json_log_plots.plot("../model/de345-root")

Quick test:

In [None]:
!gpt-2-gen ../model/de345-root "10. August. "

# 3 Text Generation

In [None]:
%cd /content

### A) Use the model trained in chapter 2

In [None]:
!mv model/de345-root model/kafka_diary_model

### B) Use my pretrained model

In [None]:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1qBpljKf0odtwZkG9V8hivOXhVG1g190C' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1qBpljKf0odtwZkG9V8hivOXhVG1g190C" -O model/kafka_diary_model.tar.xz && rm -rf /tmp/cookies.txt

In [None]:
!tar xf model/kafka_diary_model.tar.xz

We will use another fork, which has more generation parameters available:

In [None]:
!git clone https://github.com/lopuhin/transformer-lm

In [None]:
%cd transformer-lm
!git checkout c369833 #last tested revision

In [None]:
!pip install -r requirements.txt
!python setup.py develop
!pip install babel

## Generating a single entry:

Some adjustments are needed to allow for batch generation and to omit *unk* tokens from the output. The code was also edited so that it continues generating tokens until it reaches the end of a sentence. However, this can sometimes cause it to get stuck in a loop, especially with low values for top_k or temperature.

In [None]:
# Third party imports:
from pathlib import Path
import numpy as np
import torch

# Local application imports:
from lm import inference
from lm.common import END_OF_LINE, END_OF_TEXT

class ModelWrapperModified(inference.ModelWrapper):
    END_OF_LINE = END_OF_LINE
    END_OF_TEXT = END_OF_TEXT
    
    def __init__(self, model, sp_model, params):
        super().__init__(model, sp_model, params)

    def generate_tokens(self, tokens_prefix, tokens_to_generate, top_k, top_p=1, temperature=1):
        tokens = list(tokens_prefix)
        output_tokens = []
        past = None

        i = 0
        while True:

            if top_p <= 0.0:
                # generate TOP_K potential next tokens
                ntk, presents = self._get_next_top_k(tokens, top_k, past=past)

                # Remove unk tokens:
                ntk = [
                        x
                        for x in ntk
                        if x[1] != '<unk>'
                ]

                # convert log probs to real probs
                logprobs = np.array(list(map(lambda a: a[0], ntk)))
                logprobs /= temperature
                probs = np.exp(logprobs) / np.exp(logprobs).sum()

                # pick next token randomly according to probs distribution
                next_token_n = np.random.choice(len(ntk), p=probs)
                next_token = ntk[next_token_n][1]
            else:
                filtered_logprobs, presents = self._get_next_top_p_nucleus(
                    tokens, top_p, past=past)
          
                filtered_logprobs /= temperature
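                # Sample with replacement so the draw cannot fail when fewer
                # than 4 tokens survive the nucleus filter:
                next_token_n = torch.multinomial(torch.nn.functional.softmax(
                    filtered_logprobs, dim=-1), num_samples=4, replacement=True)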
                
                for x in next_token_n:
                  next_token = self.id_to_token(x)
                  if next_token != '<unk>':
                    break

            if past is None:
                past = presents
            else:
                past = torch.cat([past, presents], dim=-2)

            tokens = [next_token]
            output_tokens.append(next_token)

            i += 1

            # Terminate generation when end of sentence is reached:
            if i >= tokens_to_generate and next_token in ['!', '.', '?']:
                break
            elif next_token == '<endoftext>':
                break

        return output_tokens

class BatchTextGeneration:
    def __init__(self, model_path):
        print("loading model from %s" % model_path)
        self.mw = ModelWrapperModified.load(Path(model_path))

    def generate_text(self, prefix, tokens_to_generate, top_k, top_p=0, temperature=1):
        tokens = self.mw.tokenize(prefix)
        tokens_gen = self.mw.generate_tokens(tokens, tokens_to_generate, top_k, top_p, temperature)
        return self.mw.sp_model.DecodePieces(tokens_gen)

In [None]:
model = BatchTextGeneration('../model/kafka_diary_model')

In [None]:
model.generate_text("19. August. ", tokens_to_generate=30, top_k=30, top_p=0, temperature=1)
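Because generation only stops at sentence-ending punctuation, it can run indefinitely when the model never produces one. One way to bound this is a hard upper limit on the number of tokens. The following standalone sketch illustrates the idea with the standard library only. It is not the wrapper code above: `sample_top_k`, `generate`, `next_candidates`, `min_tokens` and `max_tokens` are hypothetical names, and the toy candidate functions stand in for the real model.

```python
import math
import random

SENTENCE_END = {"!", ".", "?"}


def sample_top_k(candidates, temperature=1.0, rng=random):
    """Pick one token from (logprob, token) pairs, never returning '<unk>'."""
    kept = [(lp, tok) for lp, tok in candidates if tok != "<unk>"]
    scaled = [lp / temperature for lp, _ in kept]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices([tok for _, tok in kept], weights=weights, k=1)[0]


def generate(next_candidates, min_tokens, max_tokens, temperature=1.0, rng=random):
    """Generate until a sentence ends after min_tokens, or max_tokens is hit."""
    output = []
    while len(output) < max_tokens:  # hard cap prevents endless loops
        token = sample_top_k(next_candidates(output), temperature, rng)
        if token == "<endoftext>":
            break  # the end-of-text marker is dropped in this sketch
        output.append(token)
        if len(output) >= min_tokens and token in SENTENCE_END:
            break
    return output


# Toy model that never ends a sentence: the cap stops it after 10 tokens.
looping = generate(lambda out: [(-0.1, "und"), (-2.0, "<unk>")],
                   min_tokens=3, max_tokens=10)
print(looping)

# Toy model that only emits '.': generation stops once min_tokens is reached.
print(generate(lambda out: [(-0.5, ".")], min_tokens=3, max_tokens=10))
```

In the wrapper above, the equivalent change would be an additional upper bound on `i` in the termination check.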

## Generating entries for a whole year:

In [None]:
tokens_to_generate = 30
top_k = 30
top_p = 0
temperature = 1

# Iterate over dates:
from datetime import timedelta, date
from babel.dates import format_datetime

def daterange(start_date, end_date):
    for n in range(int((end_date - start_date).days)):
        yield start_date + timedelta(n)

start_date = date(2020, 1, 1)
end_date = date(2021, 1, 1)
for single_date in daterange(start_date, end_date):
    prompt = format_datetime(single_date, "d. MMMM. ", locale='de_DE')
    print(prompt + model.generate_text(
        prompt,
        tokens_to_generate=tokens_to_generate,
        top_k=top_k,
        top_p=top_p,
        temperature=temperature
    ))
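To inspect the date prompts without loading the model or installing babel, they can also be built with the standard library alone. This is a minimal sketch: `GERMAN_MONTHS` and `german_date_prompts` are hypothetical names, and the output is meant to mirror the "10. August. " style used above rather than to reproduce babel's formatting rules exactly.

```python
from datetime import date, timedelta

# Hard-coded German month names (assumption: these match the names
# the de_DE locale produces for the full-month pattern).
GERMAN_MONTHS = [
    "Januar", "Februar", "März", "April", "Mai", "Juni",
    "Juli", "August", "September", "Oktober", "November", "Dezember",
]


def german_date_prompts(year):
    """Yield one prompt such as '1. Januar. ' for every day of the year."""
    day = date(year, 1, 1)
    while day.year == year:
        yield f"{day.day}. {GERMAN_MONTHS[day.month - 1]}. "
        day += timedelta(days=1)


prompts = list(german_date_prompts(2020))
print(len(prompts), repr(prompts[0]), repr(prompts[-1]))
```

Each prompt can then be passed to `model.generate_text` in place of the babel-formatted string.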