
This tutorial shows how to generate text with distilgpt2, a distilled version of OpenAI's GPT-2 that is lighter and faster at generation than the original model. By following it, you can fine-tune a language generation model to produce English text on a subject of your choice. Here, we will generate movie reviews by fine-tuning distilgpt2 on a sample of IMDB movie reviews.


Click on the link below to download a file containing a sample of the IMDB dataset (1000 reviews):

http://files.fast.ai/data/examples/imdb_sample.tgz

Upload this file to this Colab notebook using the upload button at the top left.

In [0]:
### Extract the csv file from the uploaded tgz file

import tarfile
with tarfile.open('imdb_sample.tgz', 'r:gz') as tar:
    tar.extractall()
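
Note that extractall() trusts the member paths stored inside the archive. For an archive from a source you do not control, one cautious variant checks that every member stays inside the target directory before extracting anything. This is a minimal sketch; the helper name safe_extract is hypothetical and not part of the tarfile module.

```python
import os
import tarfile


def safe_extract(archive_path, target_dir="."):
    """Extract a .tgz archive, refusing members that would land outside target_dir."""
    target_root = os.path.realpath(target_dir)
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            destination = os.path.realpath(os.path.join(target_root, member.name))
            # commonpath equals target_root only when destination lies inside it
            if os.path.commonpath([target_root, destination]) != target_root:
                raise ValueError("Unsafe path in archive: " + member.name)
        tar.extractall(target_root)
    return target_root
```

In this notebook it would be called as safe_extract('imdb_sample.tgz').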

In [0]:
import pandas as pd

In [0]:
data = pd.read_csv('imdb_sample/texts.csv')

In [20]:
### This is what the CSV looks like
data

Unnamed: 0,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False
...,...,...,...
995,negative,There are many different versions of this one ...,True
996,positive,Once upon a time Hollywood produced live-actio...,True
997,negative,Wenders was great with Million $ Hotel.I don't...,True
998,negative,Although a film with Bruce Willis is always wo...,True


Let's get the number of samples

In [55]:
data.shape

(1000, 3)

For fine-tuning distilgpt2, we only need the text field. Passing it through set() also drops any duplicate reviews.

In [0]:
texts = list(set(data['text']))

In [23]:
len(texts)

1000
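
Note that set() does not preserve the original row order, so the order of texts can differ between runs. If a reproducible order matters, one option is dict.fromkeys, which drops duplicates while keeping first-seen order. A small sketch, with made-up reviews standing in for data['text']:

```python
# Made-up reviews standing in for data['text'] (illustrative only)
reviews = ["great film", "terrible plot", "great film", "loved it"]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_reviews = list(dict.fromkeys(reviews))

print(unique_reviews)  # ['great film', 'terrible plot', 'loved it']
```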

Store the reviews in a text file with one review per line, separated by the |EndOfText| marker, which is used later as the stop token during generation.

In [0]:
file_name = 'testing.txt'
with open(file_name, 'w') as f:
    f.write(" |EndOfText|\n".join(texts))
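
Since training later treats every line of this file as one example, a review that itself contains a newline would be split into several examples, and with join() the last review gets no end marker. A hedged sketch of a stricter writer follows; the helper name write_training_file is hypothetical.

```python
def write_training_file(texts, file_name, end_marker="|EndOfText|"):
    """Write one review per line, each followed by the end marker."""
    with open(file_name, "w", encoding="utf-8") as f:
        for text in texts:
            # Collapse newlines and runs of whitespace so one review stays on one line
            single_line = " ".join(text.split())
            f.write(single_line + " " + end_marker + "\n")
    return file_name
```

It could be called as write_training_file(texts, 'testing.txt') in place of the cell above.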

Now, let's come to Transformers by Huggingface, and unleash the Transformers (Autobots... just kidding)

In [25]:
!pip install transformers
!git clone https://github.com/huggingface/transformers.git

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/13/33/ffb67897a6985a7b7d8e5e7878c3628678f553634bd3836404fef06ef19b/transformers-2.5.1-py3-none-any.whl (499kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/a6/b4/7a41d630547a4afd58143597d5a49e07bfd4c42914d8335b2a5657efc14b/sacremoses-0.0.38.tar.gz (860kB)
Collecting tokenizers==0.5.2
  Downloading https://files.pythonhosted.org/packages/d1/3f/73c881ea4723e43c1e9acf317cf407fab3a278daab3a69c98dcac511c04f/tokenizers-0.5.2-cp36-cp36m-manylinux1_x86_64.whl (3.7MB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/74/f4/2d5214cbf13d06e7cb2c20d84115ca25b53ea76fa1f0ade0e3c9749de214/sentencepiece-0.1.85-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)

Make 2 directories. 

1) weights - for storing the weights of distilgpt2

2) tokenizer - for storing the tokenizer of distilgpt2

In [39]:
dir_ = "models/"
!mkdir {dir_}
dir_ = "models/gpt2/"
!mkdir {dir_}
weights_dir = "models/gpt2/weights"
tokenizer_dir = "models/gpt2/tokenizer"
!mkdir {weights_dir}
!mkdir {tokenizer_dir}

mkdir: cannot create directory ‘models/’: File exists
mkdir: cannot create directory ‘models/gpt2/’: File exists
mkdir: cannot create directory ‘models/gpt2/tokenizer’: File exists
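
The "File exists" messages above appear whenever the cell is re-run. They are harmless, but one way to avoid them is to create the directories from Python with os.makedirs and exist_ok=True. A minimal sketch; the helper name make_model_dirs and its base_dir parameter are illustrative assumptions.

```python
import os


def make_model_dirs(base_dir="models/gpt2"):
    """Create the weights and tokenizer directories; safe to call repeatedly."""
    weights_dir = os.path.join(base_dir, "weights")
    tokenizer_dir = os.path.join(base_dir, "tokenizer")
    for path in (weights_dir, tokenizer_dir):
        # exist_ok=True suppresses the error when the directory is already there
        os.makedirs(path, exist_ok=True)
    return weights_dir, tokenizer_dir
```

Calling weights_dir, tokenizer_dir = make_model_dirs() gives the same two paths used in the rest of the notebook.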


Store the tokenizer files in tokenizer_dir

In [27]:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
tokenizer.save_pretrained(tokenizer_dir)

('models/gpt2/tokenizer/vocab.json',
 'models/gpt2/tokenizer/merges.txt',
 'models/gpt2/tokenizer/special_tokens_map.json',
 'models/gpt2/tokenizer/added_tokens.json')

Now it's time to train (fine-tune) distilgpt2 on the IMDB reviews.
Given below is a command with a few parameters that tell Transformers how to fine-tune distilgpt2. First, let's understand what these parameters mean:

1) output_dir: The weights_dir we created, where the fine-tuned model is stored in the form of checkpoints

2) tokenizer_name: The tokenizer_dir we created, where the tokenizer for distilgpt2 is stored

3) line_by_line: Treats each line of the input text file as a separate training example

4) model_name_or_path: The pretrained checkpoint to start from (here distilgpt2); the architecture itself is given by model_type

5) per_gpu_train_batch_size: The batch size for each GPU

6) do_train: Tells the script to run training

7) train_data_file: The input text file to train on

8) num_train_epochs: The number of epochs for fine-tuning

The rest of the parameters are self-explanatory. For more information, see https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py

Now, let the training begin...

In [0]:
cmd = '''
python transformers/examples/run_language_modeling.py \
--output_dir {0} \
--tokenizer_name {1} \
--line_by_line \
--model_type gpt2 \
--overwrite_cache \
--model_name_or_path distilgpt2 \
--per_gpu_train_batch_size 2 \
--do_train \
--overwrite_output_dir \
--train_data_file testing.txt \
--num_train_epochs 3.0 \
--logging_steps 50 \
--save_steps 100 \
--save_total_limit 2 \
--seed 100
'''.format(weights_dir,tokenizer_dir)
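
The same command can also be assembled as an argument list, which avoids shell quoting problems if a path contains spaces. The sketch below uses only the flags from the command above; the helper name build_finetune_command is hypothetical.

```python
import sys


def build_finetune_command(weights_dir, tokenizer_dir, train_file="testing.txt"):
    """Return the fine-tuning command as a list of arguments."""
    return [
        sys.executable, "transformers/examples/run_language_modeling.py",
        "--output_dir", weights_dir,
        "--tokenizer_name", tokenizer_dir,
        "--line_by_line",
        "--model_type", "gpt2",
        "--overwrite_cache",
        "--model_name_or_path", "distilgpt2",
        "--per_gpu_train_batch_size", "2",
        "--do_train",
        "--overwrite_output_dir",
        "--train_data_file", train_file,
        "--num_train_epochs", "3.0",
        "--logging_steps", "50",
        "--save_steps", "100",
        "--save_total_limit", "2",
        "--seed", "100",
    ]


command = build_finetune_command("models/gpt2/weights", "models/gpt2/tokenizer")
print(" ".join(command))

# To launch training from Python instead of `!{cmd}`:
# import subprocess
# subprocess.run(command, check=True)
```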

In [41]:
!{cmd}

03/07/2020 14:50:17 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/distilgpt2-config.json from cache at /root/.cache/torch/transformers/eb0f77b3f095880586731f57e2fe19060d71d1036ef8daf727bd97a17fb66a43.a41f80bd12c111d611dcd5546611b7e47c16a0a995f83df2f7b437a20b6849b5
03/07/2020 14:50:17 - INFO - transformers.configuration_utils -   Model config GPT2Config {
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": null,
  "do_sample": false,
  "embd_pdrop": 0.1,
  "eos_token_ids": null,
  "finetuning_task": null,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "length_penalty": 1.0,
  "max_length": 20,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 6,
  "n_positions": 1024,
  "num_beams": 1,
  "num_labels": 1,
  "num_return_sequen

Huggingface provides a run_generation.py script for language generation. However, because it takes its input from the command line, every run reloads the model and the tokenizer, which slows down generation. To reduce this I/O overhead, I have restructured run_generation.py into the following code, which loads the model and the tokenizer once into objects that can then be reused to generate text over and over again.

In [0]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def get_model_tokenizer(weights_dir, device = 'cuda'):
    # Load the fine-tuned weights and the tokenizer from weights_dir
    print("Loading Model ...")
    model = GPT2LMHeadModel.from_pretrained(weights_dir)
    model.to(device)  # use the device argument instead of hard-coding 'cuda'
    print("Model Loaded ...")
    tokenizer = GPT2Tokenizer.from_pretrained(weights_dir)
    return model, tokenizer

def generate_messages(
    model,
    tokenizer,
    prompt_text,
    stop_token,
    length,
    num_return_sequences,
    temperature = 0.7,
    k=20,
    p=0.9,
    repetition_penalty = 1.0,
    device = 'cuda'
):

    MAX_LENGTH = int(10000)
    def adjust_length_to_model(length, max_sequence_length):
        if length < 0 and max_sequence_length > 0:
            length = max_sequence_length
        elif 0 < max_sequence_length < length:
            length = max_sequence_length  # No generation bigger than model size
        elif length < 0:
            length = MAX_LENGTH  # avoid infinite loop
        return length
        
    length = adjust_length_to_model(length=length, max_sequence_length=model.config.max_position_embeddings)

    encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")

    encoded_prompt = encoded_prompt.to(device)

    output_sequences = model.generate(
            input_ids=encoded_prompt,
            max_length=length + len(encoded_prompt[0]),
            temperature=temperature,
            top_k=k,
            top_p=p,
            repetition_penalty=repetition_penalty,
            do_sample=True,
            num_return_sequences=num_return_sequences,
        )

    if len(output_sequences.shape) > 2:
        output_sequences.squeeze_()

    generated_sequences = []

    for generated_sequence_idx, generated_sequence in enumerate(output_sequences):
        #print("=== GENERATED SEQUENCE {} ===".format(generated_sequence_idx + 1))
        generated_sequence = generated_sequence.tolist()

        # Decode text
        text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)

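        # Remove all text after the stop token; find() returns -1 when the
        # token is absent, in which case the text is left unchanged
        stop_idx = text.find(stop_token) if stop_token else -1
        if stop_idx != -1:
            text = text[:stop_idx]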

        # Add the prompt at the beginning of the sequence. Remove the excess text that was used for pre-processing
        total_sequence = (
            prompt_text + text[len(tokenizer.decode(encoded_prompt[0], clean_up_tokenization_spaces=True)) :]
        )

        generated_sequences.append(total_sequence)
    return generated_sequences
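
Two small pieces of generate_messages are easy to check in isolation: the length clamping and the stop-token truncation. The sketch below restates them as standalone functions (the name truncate_at_stop is hypothetical) and uses 1024 as the maximum sequence length, matching n_positions in the distilgpt2 config printed earlier. The truncation variant here returns the text unchanged when the marker is absent, since str.find returns -1 in that case.

```python
MAX_LENGTH = 10000


def adjust_length_to_model(length, max_sequence_length):
    """Clamp the requested generation length to what the model supports."""
    if length < 0 and max_sequence_length > 0:
        length = max_sequence_length
    elif 0 < max_sequence_length < length:
        length = max_sequence_length  # no generation longer than the model's context
    elif length < 0:
        length = MAX_LENGTH  # avoid an unbounded loop
    return length


def truncate_at_stop(text, stop_token):
    """Cut text at the first stop token; leave it unchanged if the token is absent."""
    index = text.find(stop_token) if stop_token else -1
    return text if index == -1 else text[:index]


print(adjust_length_to_model(1000, 1024))   # 1000
print(adjust_length_to_model(2000, 1024))   # 1024
print(adjust_length_to_model(-1, 1024))     # 1024
print(repr(truncate_at_stop("good movie. |EndOfText| next", "|EndOfText|")))  # 'good movie. '
print(repr(truncate_at_stop("no marker here", "|EndOfText|")))                # 'no marker here'
```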

In [43]:
model, tokenizer = get_model_tokenizer(weights_dir, device = 'cuda')

Loading Model ...
Model Loaded ...
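
One way to make sure the expensive load happens only once per weights directory, even if the loading cell is re-run, is to memoise the loader. The sketch below shows the pattern with functools.lru_cache. The loader body is a stand-in that only records calls (in the notebook it would call get_model_tokenizer), and the names cached_load and load_calls are hypothetical.

```python
from functools import lru_cache

load_calls = []  # records every real load, for demonstration only


@lru_cache(maxsize=None)
def cached_load(weights_dir, device="cuda"):
    """Stand-in loader: replace the body with get_model_tokenizer(weights_dir, device)."""
    load_calls.append((weights_dir, device))
    return ("model for " + weights_dir, "tokenizer for " + weights_dir)


first = cached_load("models/gpt2/weights")
second = cached_load("models/gpt2/weights")  # served from the cache, no second load

print(first is second)   # True
print(len(load_calls))   # 1
```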


These are the hyper-parameters that control how tokens are sampled (the model generates one token at a time):

In [0]:
temperature = 1.0
k=400
p=0.9
repetition_penalty = 1.0
num_return_sequences = 5
length = 1000
stop_token = '|EndOfText|'
prompt_text = "this is"
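
To make these settings concrete, here is a minimal pure-Python sketch of how temperature, top-k, and top-p (nucleus) filtering narrow down the candidates for a single next token. It illustrates the general technique on a toy vocabulary with made-up logits. It is not the code that model.generate runs, and the function name filter_next_token is hypothetical.

```python
import math


def filter_next_token(logits, temperature=1.0, k=0, p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before the softmax (<1 sharpens, >1 flattens)
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}

    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    if k > 0:
        ranked = ranked[:k]  # top-k: keep only the k most likely tokens

    # Normalise the surviving weights into probabilities
    total = sum(weight for _, weight in ranked)
    ranked = [(tok, weight / total) for tok, weight in ranked]

    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:  # top-p: stop once the kept mass reaches p
            break

    kept_total = sum(prob for _, prob in kept)
    return {tok: prob / kept_total for tok, prob in kept}  # renormalise


toy_logits = {"good": 2.0, "bad": 1.5, "boring": 0.5, "purple": -3.0}
candidates = filter_next_token(toy_logits, temperature=1.0, k=3, p=0.9)
print(candidates)
```

The next token is then drawn at random from the surviving candidates. With the settings above (k=400, p=0.9, temperature=1.0), the same steps apply over the model's full vocabulary at every position.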

In [54]:
%%time
generate_messages(
    model,
    tokenizer,
    prompt_text,
    stop_token,
    length,
    num_return_sequences,
    temperature = temperature,
    k=k,
    p=p,
    repetition_penalty = repetition_penalty
)

CPU times: user 12.7 s, sys: 2.03 s, total: 14.7 s
Wall time: 14.7 s


["this is an amazing film, but i thought they had this film. It would be perfect if the cast was more independent. The only reason i was scared is because if they didn't have a director, they would've done the same. I'm not a big fan of any actors, but this movie is one of the better ones. The acting is very good, and the dialogue is nice. The script is always very touching and it makes my heart feel fresh. Overall it was a good movie. ",
 'this is still a problem. I read this from the review and found it out.<br /><br />I am a woman who is tired of seeing this horror flick. Although its a very bad movie i do think it is really bad and this movie is definitely worth watching. ',
 "this is, of course, bad. I also like watching a movie with an audience. The actors were all wasted, but if I had a screenwriting opportunity, I would have liked to see more like this, but the director was stupid. He was an idiot, but that being said, I really don't care. And I get the feeling from watching th