# Meet your Artificial Self - AMLD 2020 Workshop
### Task 1
In this task we will explore the power of modern language models. We will use the [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple/) library by Max Woolf to fine-tune OpenAI's GPT-2 model to generate text that has the same style as the training samples.

### Important resources
* [Workshop GitHub repo](https://github.com/mar-muel/artificial-self-AMLD-2020/tree/master/3)
* [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple/)


### Approach
We will use example data sets to fine-tune a model and explore how the generated text depends on a few parameters. We can also provide a seed sequence to the model and see how it influences the generated text.
For fine-tuning, the model takes the path to a single plain text file with one text sample per line.
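
If you want to bring your own samples, the expected layout can be produced with a few lines of Python. The following is a minimal sketch, not part of the workshop code: `write_samples` is a hypothetical helper, and it assumes that line breaks inside a sample can safely be collapsed into spaces (which is fine for tweets, but not for source code, where line breaks carry meaning).

```python
import os
import tempfile


def write_samples(samples, path):
    """Write one sample per line (hypothetical helper, not part of the workshop repo)."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            # Assumption: internal line breaks and repeated whitespace can be collapsed.
            line = " ".join(sample.split())
            if line:  # skip samples that are empty after cleaning
                f.write(line + "\n")
                written += 1
    return written


samples = ["First sample.", "A sample\nspanning two lines.", "   "]
out_path = os.path.join(tempfile.mkdtemp(), "data.txt")
n_written = write_samples(samples, out_path)
print(n_written, "samples written to", out_path)
```

In the notebook you would write to a file inside the `data` folder instead of a temporary directory.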

# Setting things up
The following cell will clone the repository and install all the necessary dependencies.

In [0]:
!nvidia-smi | grep -q 'failed' && echo "STOP! You are using a runtime without a GPU. Change the runtime type before going further!"
%tensorflow_version 1.x
!git clone https://github.com/mar-muel/artificial-self-AMLD-2020.git
%cd artificial-self-AMLD-2020/1
!pip install -r requirements-colab.txt

The next cell will prepare the data sets we can use in this task.

In [0]:
!python prepare.py all --short-filename true --preserve-lines true
!cp -r data ../..
%cd ../..

Import the packages needed for this task.

In [0]:
import gpt_2_simple as gpt2
import os
import requests
import glob
import pickle
import pandas as pd
import re
import unicodedata
import argparse
import logging
from functools import partial
import ipywidgets as widgets
from IPython.display import display

logging.basicConfig(level=logging.INFO, format='%(asctime)s [%(levelname)-5.5s] [%(name)-12.12s]: %(message)s')
log = logging.getLogger(__name__)

We define a few helpers that we will use later.

In [0]:
class color:
    BLUE = '\033[94m'
    BOLD = '\033[1m'
    END = '\033[0m'

def print_params(title, **kwargs):
    print(color.BOLD + title + color.END)
    print(30*'-')
    for key, value in kwargs.items():
        print(key, "=", color.BLUE + str(value) + color.END)
    print(30*'-')

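def start_session(sess):
    # Reset the previous TensorFlow session if there is one; on the first call
    # `sess` is None and there is nothing to reset.
    try:
        gpt2.reset_session(sess)
    except Exception:
        pass
    return gpt2.start_tf_sess()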

# Specification of input data set we will use

In [0]:
#@title Choose existing or define own dataset { run: "auto" }

#@markdown Specify run_name and input file (located in 'data/').
run_name = 'run1' #@param {type: "string"}
data_file = 'data.txt' #@param {type: "string"}

#@markdown Choose 'custom…' to use the settings above. Any other use case will override 'run_name' and 'data_file'.
usecase = 'javascript' #@param ["custom…","chess","tweets","music","shakespeare","javascript","typescript","json","html"] {allow-input: true}
model_name = '124M' #@param ["124M"] {allow-input: true}

if usecase != 'custom…':
    run_name = usecase
    data_file = usecase + '.txt'
data_path = os.path.join('data', data_file)
sess = None

print_params(usecase, run_name=run_name, model_name=model_name, data_path=data_path)

# Let's fine-tune the GPT-2 model!
Choose the number of steps the model will be fine-tuned for. You can adjust the parameters on the right to specify how often you get updates on the training progress, how often samples from the current model are printed, and how often the model is saved.

Apart from the number of steps, these parameters do not influence the training. The model is saved automatically once fine-tuning has run for the specified number of steps. You can stop the fine-tuning at any time and the current training state of the model will be saved.
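
To get a feel for how the three reporting intervals interleave, the sketch below lists which events would be due at which step. It assumes that an event fires whenever the step counter is a multiple of its interval and that an interval of 0 means "never"; both are simplifications for illustration, and the exact rule inside `gpt2.finetune` may differ. `reporting_schedule` is a hypothetical helper, not part of gpt-2-simple.

```python
def reporting_schedule(steps, **intervals):
    """Map each step to the events due at that step (hypothetical helper)."""
    schedule = {}
    for step in range(1, steps + 1):
        # Assumption: an event is due on every multiple of its interval; 0 disables it.
        due = [name for name, every in intervals.items()
               if every > 0 and step % every == 0]
        if due:
            schedule[step] = due
    return schedule


schedule = reporting_schedule(100, print_every=20, save_every=50, sample_every=100)
for step, events in schedule.items():
    print(step, events)
```

Small intervals give more feedback but also spend more time on printing samples and writing checkpoints.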

In [0]:
#@title Finetuning { run: "auto" }

#@markdown Number of steps the model will be finetuned for.
steps = 250 #@param {type:"slider", min:10, max:500, step:10}

#@markdown How often (in steps) samples are generated, the model is saved, and progress is printed.
sample_every = 100 #@param {type:"slider", min:10, max:200, step:20}
save_every = 50 #@param {type:"slider", min:0, max:100, step:10}
print_every = 20 #@param {type:"slider", min:0, max:50, step:5}

def finetune(sess, run_name, model_name, data_path, steps, sample_every, save_every, print_every, b):
    # `b` is the button instance that ipywidgets passes to on_click handlers; it is unused.
    log.info(f'Run fine-tuning for run {run_name} using GPT2 model {model_name}...')
    # Download the pretrained model on first use.
    if not os.path.isdir(os.path.join("models", model_name)):
        log.info(f"Downloading {model_name} model...")
        gpt2.download_gpt2(model_name=model_name)
    # Start from a fresh TensorFlow session, then fine-tune; checkpoints go to 'runs/<run_name>'.
    sess = start_session(sess)
    gpt2.finetune(sess, data_path, checkpoint_dir='runs', model_name=model_name, run_name=run_name, steps=steps, sample_every=sample_every, save_every=save_every, print_every=print_every)

print_params('Fine-tuning ' + run_name, steps=steps, sample_every=sample_every, save_every=save_every, print_every=print_every)

finetune_handler = partial(finetune, sess, run_name, model_name, data_path, steps, sample_every, save_every, print_every)
button = widgets.Button(description="Start fine-tuning")
button.on_click(finetune_handler)
display(button)

# Text generation
We can now generate text mimicking the style of the learned samples.

You can play around with the three parameters `length`, `temperature`, and `top_k` to influence the generated text. You can also provide a seed sequence that will be the beginning of the generated text.
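
As a rough intuition for `temperature` and `top_k`, the sketch below applies both to a toy set of next-token scores using only the standard library. It is not how gpt-2-simple implements sampling internally (that happens in TensorFlow); `next_token_probs` and the example scores are made up for illustration. Lower temperatures sharpen the distribution, higher ones flatten it, and `top_k` keeps only the k highest-scoring tokens, with 0 meaning no restriction.

```python
import math


def next_token_probs(logits, temperature=1.0, top_k=0):
    """Turn raw scores into sampling probabilities (illustrative only)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    if top_k > 0:
        # Keep the k largest scores; everything else gets probability zero.
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [s if s >= cutoff else -math.inf for s in scaled]
    # Numerically stable softmax over the remaining scores.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = next_token_probs(logits, temperature=t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
print("top_k=2:", [round(p, 3) for p in next_token_probs(logits, top_k=2)])
```

In practice this means that a low temperature tends to produce safer, more repetitive text, while a high temperature produces more surprising but less coherent text.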

Use the different data sets to explore how the fine-tuning works and what its limits are. You can also use custom data sets: copy them to the data folder and specify the file name above.

In [0]:
#@title Text Generation { run: "auto" }

#@markdown Parameters for text generation.
length = 800 #@param {type:"slider", min:0, max:1000, step:5}
temperature = 0.7 #@param {type:"slider", min:0, max:2, step:0.1}
top_k = 0 #@param {type:"slider", min:0, max:5, step:1}

def generate(sess, run_name, length, temperature, top_k, message, b):
    print('Input: ', message)
    output = gpt2.generate(sess, checkpoint_dir='runs', run_name=run_name, prefix=message, length=length, temperature=temperature, top_k=top_k, return_as_list=True)
    text = output[0].split("\n")[0]
    print('Output:', color.BLUE + text + color.END, '\n')

sess = start_session(sess)
gpt2.load_gpt2(sess, checkpoint_dir='runs', run_name=run_name)

text = widgets.Text(value='', placeholder='Beginning of sequence...', disabled=False)
button = widgets.Button(description="Start text generation")

generate_handler = partial(generate, sess, run_name, length, temperature, top_k)
button.on_click(lambda b : generate_handler(text.value, b))

print()
print_params(usecase, length=length, temperature=temperature, top_k=top_k)
box = widgets.GridBox([text, button], layout=widgets.Layout(grid_template_columns="repeat(2, 350px)"))
display(box)
print()

### Save model to Google Drive
If you are happy with your model, consider saving it to your Google Drive. All data in this notebook will be lost after a certain period of inactivity. The model is quite large (~500MB), so make sure you have enough space in your Google Drive.

This will save only your final model state (from your `run_name` directory).
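
One thing to keep in mind is that `shutil.copytree` raises an error if the target directory already exists, so saving the same run a second time will fail. The sketch below shows one possible way around this; `copy_run` is a hypothetical helper (not part of the workshop code) that appends a numeric suffix until it finds an unused target name. Temporary directories stand in for `runs/<run_name>` and the mounted Google Drive.

```python
import os
import shutil
import tempfile


def copy_run(source_directory, target_directory):
    """Copy a run directory, adding a numeric suffix if the target exists (hypothetical helper)."""
    candidate = target_directory
    suffix = 1
    while os.path.exists(candidate):
        candidate = f"{target_directory}_{suffix}"
        suffix += 1
    shutil.copytree(source_directory, candidate)
    return candidate


# Demo: temporary directories stand in for the run folder and Google Drive.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "runs", "run1")
os.makedirs(source)
with open(os.path.join(source, "checkpoint"), "w") as f:
    f.write("dummy")

first = copy_run(source, os.path.join(workdir, "drive", "run1"))
second = copy_run(source, os.path.join(workdir, "drive", "run1"))
print(first)
print(second)
```

Each copy takes up the full model size, so remove older copies from your Google Drive if space is tight.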

In [0]:
#@title Save model to Google Drive { run: "auto" }

#@markdown Directory (within your Google Drive) where you want to save the model to.
drive_location = "My Drive/AMLD/models/task1/" #@param {type:"string"}

#@markdown Do you want to mount your Google Drive?
mount_drive = True #@param {type:"boolean"}
#@markdown Follow instructions below this cell for mounting Google Drive and click the **[Save to Google Drive]** button when it appears.

if mount_drive:
    from google.colab import drive
    import shutil

    mount_location = "./drive/"
    log.info('Mounting Google Drive...')
    drive.mount(mount_location)

    source_directory = os.path.join('./runs', run_name)
    target_directory = os.path.join(mount_location, drive_location, run_name)

    def save_model(mount_location, source_directory, target_directory, b):
        log.info(f'Copying from {source_directory} to {target_directory}...')
        shutil.copytree(source_directory, target_directory)
        log.info('Successfully copied your model!')

    print_params('Save model to Google Drive', source_directory=source_directory, target_directory=target_directory)

    save_model_handler = partial(save_model, mount_location, source_directory, target_directory)
    button = widgets.Button(description="Save to Google Drive")
    button.on_click(save_model_handler)
    display(button)