# Fine-tuning a language model with huggingface


This notebook explores generating fake podcast transcripts from "Welcome to Night Vale", a favourite podcast of mine from 2012.

The source code for this notebook, along with any extra bits I created while getting it working, is available on GitHub. Feel free to take a look, although it is much less presentable than this notebook: https://github.com/mathematiguy/welcome-to-nightvale

---

In [1]:
# Install packages
! pip3 install -r requirements.txt



## Welcome to Night Vale

"Welcome to Night Vale" is a podcast presented as a radio show for the fictional town of Night Vale, reporting on the strange events that occur within it. It was created in 2012 by Joseph Fink and Jeffrey Cranor, and is full of Lovecraftian cosmic horror with a comical twist.

The host of "Welcome to Night Vale" is Cecil Palmer, who is played by American voice actor Cecil Baldwin.

In this notebook, we will be training a language model to generate fake podcast transcripts from the show using data collected from https://cecilspeaks.tumblr.com/, which contains transcripts for 191 episodes of the show.

## Listen to the podcast

For the curious, you can listen to an episode here:

In [2]:
from IPython.display import IFrame

IFrame("https://www.youtube.com/embed/due3u22Licw", width=560, height=315)

---

## Get the data

To collect the transcripts, I wrote a web scraper using scrapy (https://scrapy.org/) and stored the files on Google Drive. The cell below downloads them.

In [3]:
# Delete the data if it already exists
! rm -rf wtnv.zip transcripts

# Download the data
! gdown --id "1szUGhMsH9SFF52AZKRse_gZs7zB7ke8F"

# Unzip the data
! unzip wtnv.zip -d transcripts

Downloading...
From: https://drive.google.com/uc?id=1szUGhMsH9SFF52AZKRse_gZs7zB7ke8F
To: /home/jovyan/welcome-to-nightvale/wtnv.zip
100%|██████████████████████████████████████| 1.24M/1.24M [00:00<00:00, 1.87MB/s]
Archive:  wtnv.zip
  inflating: transcripts/1-pilot     
  inflating: transcripts/2-glow-cloud  
  inflating: transcripts/3-station-management  
  inflating: transcripts/4-pta-meeting  
  inflating: transcripts/5-the-shape-in-grove-park  
  inflating: transcripts/6-the-drawbridge  
  inflating: transcripts/7-history-week  
  inflating: transcripts/8-the-lights-in-radon-canyon  
  inflating: transcripts/9-pyramid   
  inflating: transcripts/10-feral-dogs  
  inflating: transcripts/11-wheat-amp-wheat-by-products  
  inflating: transcripts/12-the-candidate  
  inflating: transcripts/13-a-story-about-you  
  inflating: transcripts/14-the-man-in-the-tan-jacket  
  inflating: transcripts/15-street-cleaning-day  
  inflating: transcripts/16-the-phone-call  
  inflating: transcripts/

## Take a peek at a transcript

We have just downloaded 191 podcast transcripts, which are listed above. Next we can inspect one of the files using `head`, which displays the first 10 lines of a file.

If you would like to see the rest of the file, or more files, then feel free to explore the `transcripts/` directory on the left.

In [4]:
# Look at the data
! head transcripts/1-pilot

1 - Pilot

A friendly desert community, where the sun is hot, the moon is beautiful, and mysterious lights pass overhead while we all pretend to sleep. Welcome to Night Vale. 

Hello listeners. To start things off, I’ve been asked to read to read this brief notice. The City Council announces the opening of a new dog park at the corner of Earl and Summerset, near the Ralphs. They would like to remind everyone that dogs are not allowed in the dog park. People are not allowed in the dog park. It is possible you will see hooded figures in the dog park. Do not approach them. Do not approach the dog park. The fence is electrified and highly dangerous. Try not to look at the dog park and especially do not look for any period of time at the hooded figures. The dog park will not harm you.

And now the news. Old Woman Josie, out near the car lot, says the Angels revealed themselves to her. Said they were ten feet tall, radiant, one of them was black. Said they helped her with various household c

## Listen to the transcript

Here's another iFrame with episode 1 in it so you can compare the recording to the transcript above:

In [5]:
IFrame("https://www.youtube.com/embed/Ujksjzqrhys", width=560, height=315)

---

## Split the data into train/test sets

Splitting data into separate sets is standard practice when training any machine learning model. Because the model learns from the training data, we need to set aside some data that the language model never sees during training and use it for evaluation. This helps us to detect overfitting.

There is a trade-off between providing more training data and getting a reliable evaluation. More training data generally gives a better model. But the smaller your test set, the noisier your evaluation becomes, and the harder it is to tell whether the model is overfitting.

For this project, we'll take the first 90% of the podcast transcripts and add them to a training set, while the remaining 10% are written to a test set. This decision is somewhat arbitrary, and you could argue for putting almost everything in the training set for this application.

### How we chose to split the data

There are 191 transcripts, so we can take the first 171 and write them to a file which we'll call `train.txt`, and take the remaining 20 transcripts and write them to `test.txt`.

In the cell below, I used `bash` commands to concatenate the training and test text files automatically.

### An unnecessary bash diversion

If you wanna know how this works at a high level, here's a summary (but feel free to ignore it):

- `ls` lists the files in the `transcripts` directory, and `-t` sorts them by modification time, newest first. For this archive, that ordering lists the episodes starting from the pilot.
- Then we use `head` to grab the first 171 files for the training set, or `tail` to grab the last 20 files for the test set.
- Then `xargs` is a bit like a for loop. It runs `cat` on each file, which prints the file contents to stdout.
- Finally, we redirect the file contents to the output file (`train.txt` or `test.txt`) using the `>` symbol.

In [6]:
# Save the first 171 transcripts to train.txt
! ls transcripts -t | head -n 171 | xargs -I {} cat transcripts/{} > train.txt

# Save the last 20 transcripts to test.txt
! ls transcripts -t | tail -n 20 | xargs -I {} cat transcripts/{} > test.txt

Once this cell has been run, the files `train.txt` and `test.txt` should now exist. We will inspect them in the next step just to make sure we did it right, but you can always check them directly using the panel on the left.
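
The same split can also be sketched in Python. This is an illustrative alternative, not the code used in this notebook: the helper names `episode_number` and `split_transcripts` are made up here, and the sketch assumes every file name begins with an episode number followed by a hyphen (as in `1-pilot`). It sorts on that number instead of relying on file modification times.

```python
from pathlib import Path


def episode_number(path):
    """Return the leading integer of a file name such as '12-the-candidate'."""
    return int(path.name.split("-", 1)[0])


def split_transcripts(transcript_dir, n_train, train_path="train.txt", test_path="test.txt"):
    """Concatenate the first n_train transcripts into train_path and the rest into test_path.

    Assumes every file name starts with '<episode number>-'.
    Returns the number of files written to each output.
    """
    # Sort numerically so that '10-feral-dogs' comes after '9-pyramid'.
    files = sorted(Path(transcript_dir).iterdir(), key=episode_number)
    train_files, test_files = files[:n_train], files[n_train:]

    for out_path, group in ((train_path, train_files), (test_path, test_files)):
        with open(out_path, "w", encoding="utf-8") as out:
            for path in group:
                out.write(path.read_text(encoding="utf-8"))

    return len(train_files), len(test_files)
```

Calling `split_transcripts("transcripts", n_train=171)` would mirror the two shell commands above, provided the file-name assumption holds for every transcript.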

## Inspect the training + test data

We can take a quick look at `train.txt` and `test.txt`, and count the number of lines in each. Alternatively, open the files in a separate tab or window and look at them for yourself.

In [7]:
# Show the first 10 lines of train.txt
! head train.txt

1 - Pilot

A friendly desert community, where the sun is hot, the moon is beautiful, and mysterious lights pass overhead while we all pretend to sleep. Welcome to Night Vale. 

Hello listeners. To start things off, I’ve been asked to read to read this brief notice. The City Council announces the opening of a new dog park at the corner of Earl and Summerset, near the Ralphs. They would like to remind everyone that dogs are not allowed in the dog park. People are not allowed in the dog park. It is possible you will see hooded figures in the dog park. Do not approach them. Do not approach the dog park. The fence is electrified and highly dangerous. Try not to look at the dog park and especially do not look for any period of time at the hooded figures. The dog park will not harm you.

And now the news. Old Woman Josie, out near the car lot, says the Angels revealed themselves to her. Said they were ten feet tall, radiant, one of them was black. Said they helped her with various household c

In [8]:
# Show the first 10 lines of test.txt
! head test.txt

169 - The Whittler

[

]

Let us go then you and I,

when the evening is spread out against the sky,



In [9]:
# Show the number of lines in train.txt and test.txt
! wc -l train.txt test.txt

  23928 train.txt
   3008 test.txt
  26936 total


### Train test split completed!

Now that we have successfully prepared our dataset for training, we are ready to start building our fine-tuning pipeline.

----

## Fine-tuning a language model

In this step, we are going to fine-tune a language model using the `huggingface` (https://huggingface.co) team's excellent `transformers` package.

`transformers` allows you to download, use and manipulate a wide range of language models trained for different applications. They provide tools for downloading model weights, using them for inference, training from scratch and fine-tuning. It's also pretty easy to use.

More information about `transformers` is available here: https://huggingface.co/transformers/

## GPT-2

We are going to be using the `GPT-2` language model, which was pre-trained by OpenAI, who first published it in 2019. You can read more about `GPT-2` here: https://openai.com/blog/gpt-2-1-5b-release/

We are going to use the `GPT2Tokenizer` and `GPT2LMHeadModel` classes.
- A `tokenizer` breaks up a string of text into words or word-fragments in a way that the language model understands.
- An `LMHeadModel` exposes the Language Model of the GPT-2 model, which makes text generation possible.
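
To make the tokenizer's role concrete, here is a toy sketch of subword tokenization. It is not GPT-2's actual byte-pair-encoding algorithm: the vocabulary `TOY_VOCAB` and the greedy longest-match rule are invented for illustration. It only shows the general idea that text is mapped to integer IDs for words or word fragments, and that the mapping can be reversed.

```python
# A made-up vocabulary of words and word fragments (illustration only).
TOY_VOCAB = ["Welcome", " to", " Night", " Vale", " night", "mare", "s", " "]
TOKEN_TO_ID = {token: i for i, token in enumerate(TOY_VOCAB)}


def toy_encode(text):
    """Greedily match the longest vocabulary entry at each position."""
    ids = []
    pos = 0
    while pos < len(text):
        # Find the longest vocabulary entry that starts at the current position.
        match = max(
            (tok for tok in TOY_VOCAB if text.startswith(tok, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"No token covers {text[pos:]!r}")
        ids.append(TOKEN_TO_ID[match])
        pos += len(match)
    return ids


def toy_decode(ids):
    """Map IDs back to their text fragments and join them."""
    return "".join(TOY_VOCAB[i] for i in ids)
```

With this toy vocabulary, `toy_encode("Welcome to Night Vale")` gives one ID per word, while `toy_encode("Welcome to nightmares")` splits the last word into three fragments, so three words become five IDs.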

There are other kinds of Language Models which are specialised for different tasks. You can find a bunch of summaries for all the models supported by `transformers` here: https://huggingface.co/transformers/model_summary.html

In [10]:
# Import transformers stuff
from transformers import (
    GPT2Tokenizer,
    GPT2LMHeadModel,
    TextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

### Load pre-trained GPT-2 models

Now that we have imported the library, we need to download the `GPT-2` model weights so we can start using them. The cell below downloads them straight to disk from `huggingface`'s servers.

In [11]:
# Load a tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Load GPT2 model
model = GPT2LMHeadModel.from_pretrained("gpt2")

## Load our training datasets

Next, we need to load our train and test datasets into a format that `GPT-2` can use. Notice we pass the `train.txt` and `test.txt` files here.

In [12]:
train_dataset = TextDataset(
    tokenizer=tokenizer, file_path='train.txt', block_size=128
)

test_dataset = TextDataset(tokenizer=tokenizer, file_path='test.txt', block_size=128)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False, # GPT-2 only supports mlm=False
)
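
As a rough sketch of what `block_size` means: the tokenised text is cut into consecutive chunks of `block_size` token IDs, and each chunk becomes one training example. With `mlm=False`, the model is trained to predict each token from the ones before it. The helper names below are hypothetical, and dropping a trailing partial block is an assumption made for this sketch rather than a copy of the library's code.

```python
def chunk_into_blocks(token_ids, block_size):
    """Split a flat list of token IDs into consecutive blocks of exactly block_size.

    A trailing partial block is dropped (an assumption made for this sketch).
    """
    n_full = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_full)]


def next_token_pairs(block):
    """Pair each token with the token that follows it.

    In causal language modelling, the target at position i is the token at i + 1.
    """
    return list(zip(block[:-1], block[1:]))
```

For example, `chunk_into_blocks(list(range(10)), 4)` yields two blocks of four IDs and discards the last two, and `next_token_pairs([5, 6, 7])` yields the (input, target) pairs `(5, 6)` and `(6, 7)`.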



## Setting hyperparameters

In the cell below, we set hyperparameters which will affect how our model will run. This table gives a summary of what each hyperparameter does, and how it affects the model if you set it too high or too low.

| Hyperparameter name | What it does | What if it's too big | What if it's too small |
| --- | --- | --- | --- |
| `num_train_epochs` | The number of times the language model will review the text during training. | The model takes a long time to train. | The model will not learn much and the results will lean towards the pre-trained model instead of the new data you have provided. |
| `per_device_train_batch_size` | The size of the batches in which the training loop will consume the training data. | You will run out of GPU memory. | Training will take a long time. |
| `per_device_eval_batch_size` | The size of the batches in which the training loop will consume the test data. | You will run out of GPU memory. | The model evaluation will take a long time. |
| `eval_steps` | The number of training steps between evaluation runs (that is, runs of inference on the test data to check for overfitting). | You may overfit by a lot before you are able to notice. | You will run evaluation too often, which will slow down the training loop. |
| `save_steps` | The number of training steps between checkpoints, which the model writes to disk in order to save its progress. | If training fails you will need to re-run a lot of computation. | You will create too many checkpoints and run out of disk space. |

In actuality, there are _many_ more hyperparameters than this. You can read more about them here: https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments
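
To get a feel for how these values interact, the arithmetic below estimates how many optimisation steps a run takes and how many evaluations and checkpoints fall inside it. This is a back-of-the-envelope sketch: the function name `training_schedule` and the example count of 10,000 training blocks are made up, and it assumes a single device with no gradient accumulation.

```python
import math


def training_schedule(num_examples, batch_size, num_epochs, eval_steps, save_steps):
    """Estimate step counts for a training run (single device, no gradient accumulation)."""
    # The last batch of an epoch may be smaller, so round up.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        # One evaluation (or checkpoint) each time the step counter hits a multiple.
        "evaluations": total_steps // eval_steps,
        "checkpoints": total_steps // save_steps,
    }
```

With the hypothetical 10,000 blocks, `training_schedule(10_000, 18, 1, 400, 800)` gives 556 steps, which is enough for one evaluation at step 400 but finishes before the first checkpoint at step 800. Raising `num_epochs` to 20 gives 11,120 steps, 27 evaluations and 13 checkpoints.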

Now we set the hyperparameter values in the cell below:

In [13]:
num_train_epochs = 1  ## You can set this higher, but it will take longer

# Mostly these can be left as they are - feel free to play with them and see what happens however
per_device_train_batch_size = 18
per_device_eval_batch_size = 16
eval_steps = 400
save_steps = 800

training_args = TrainingArguments(
    output_dir='wtnv_model',                                 # The folder where we save the model
    overwrite_output_dir=True,                               # overwrite the content of the output directory
    num_train_epochs=num_train_epochs,                       # number of training epochs
    per_device_train_batch_size=per_device_train_batch_size, # batch size for training
    per_device_eval_batch_size=per_device_eval_batch_size,   # batch size for evaluation
    eval_steps=eval_steps,                                   # number of update steps between two evaluations (only used when step-based evaluation is enabled)
    save_steps=save_steps,                                   # after # steps model is saved
    prediction_loss_only=True,
)

## Train the model

Now that we have done all that work, we can finally train the model.

In [14]:
# Initialise a Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Run the training loop
trainer.train()

# Save the model when it's done
trainer.save_model()



## Generate some text

Now that we have trained a model, it's time to load it into memory and run some text through it to see the results for ourselves.

In order to generate text, we need to do the following:

- Load the trained model into memory
- Feed some text to start the new transcript
- Convert the text to a vector of token IDs that the model understands

In [16]:
# Load the model into memory
wtnv_model = GPT2LMHeadModel.from_pretrained("wtnv_model", local_files_only=True)

If you want the new transcript to start with different text, change the `seed_text` variable:

In [17]:
# Feed some text to start the new transcript
seed_text = 'Welcome to Night Vale'  #@param {type:"string"}

`seed_ids` is a vector of IDs which represent the tokens in the seed text.

The `tokenizer` object provides a map which converts tokens to IDs and IDs back to tokens. A token can be a whole word or a word fragment, so the number of IDs will not always match the number of words in the `seed_text`, although it does for the default seed. If the same token appears more than once, it gets the same ID each time.

In [18]:
# Convert the sentence to a tensor of token IDs
seed_ids = tokenizer.encode(seed_text, return_tensors='pt')
print(seed_ids)

tensor([[14618,   284,  5265, 31832]])


You can decode the `seed_ids` using the `tokenizer.decode` method as follows:

In [19]:
# Convert the seed_ids back to text
print(tokenizer.decode(seed_ids[0]))

Welcome to Night Vale


In the following cell we will actually generate text using the new model's `generate` method. There are a lot of interesting details in how sampling from language models works, but the general idea is that the language model provides probabilities for the next token given the tokens so far.

Then you sample from these probabilities using one of a number of strategies, and the strategy you choose affects the text that comes out at the end.

For more information on how `generate` works, take a look at this blog post: https://huggingface.co/blog/how-to-generate
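
As a minimal sketch of one such strategy, the function below combines temperature scaling with nucleus (top-p) filtering to pick a single next token from a list of raw scores. It is a plain-Python illustration of the idea, not the code that `generate` runs internally: the function name `sample_next_token` and the `rng` parameter are made up here, and the scores are just a list of floats rather than real model output.

```python
import math
import random


def sample_next_token(logits, temperature=0.9, top_p=0.92, rng=random):
    """Sample one token index from raw scores using temperature and top-p filtering."""
    # Temperature rescales the scores: below 1 sharpens the distribution, above 1 flattens it.
    scaled = [score / temperature for score in logits]

    # Softmax, shifted by the maximum for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the most probable tokens until their cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Sample from the surviving tokens in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Lowering `top_p` shrinks the set of candidate tokens, which makes the output more predictable. Passing `rng=random.Random(0)` makes the draw reproducible.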

In [20]:
%%time
output = wtnv_model.generate(
    
    seed_ids,       # The seed text IDs
    do_sample=True,
    
    # Tweak these values and see what they do:
    max_length=500,
    top_k=0,
    top_p=0.92,
    temperature=0.9
    
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


CPU times: user 55min 17s, sys: 2min 42s, total: 57min 59s
Wall time: 7min 21s


In [21]:
print(tokenizer.decode(output[0]))

Welcome to Night Vale.

More on this in a minute.

SOMETHING:

Night Vale is in trouble with another apocalyptic story that's brought you a good look at the missing people, and also some hope.

Friday night's Night Vale Play House hosted by the non-profit Daily Welcome to Night Vale, which has planned another national carnival, this time in the middle of nowhere, at the site of the world-famous prison complex, and they're taking place downtown next to the Main Street Mall. Fun Fact: you just have to go where it isn't obvious to you.

And then this afternoon at 6:00pm, Night Vale City Council will vote on what to do with the missing, the last couple of hours of life. As you know, folks with powers have been at the main ballroom of The City Hall, wondering what the whole thing is up to. Well, that’s not a good thing. That’s not a good idea. The city’s new mayor, Dolores Umbridge, told us that she’s gonna vote no on bringing the Night Vale government to town. She will instead say that she

## Here's one I prepared earlier

Now we can download a model that I trained separately for 20 epochs on the same data. It took about 20 minutes to train on an Nvidia RTX 2080 Ti GPU. Because this one has been trained for considerably longer, the results should be noticeably different.

The model weights are about 500MB zipped, so downloading them will take just a few seconds on a fast connection.

In [22]:
# Download the model
! gdown --id "1igtAPMjk-fDFCSC2BS_W3FyPyJSK2FSJ"

# Unzip the model
! unzip wtnv20_model.zip -d wtnv20_model

Downloading...
From: https://drive.google.com/uc?id=1igtAPMjk-fDFCSC2BS_W3FyPyJSK2FSJ
To: /home/jovyan/welcome-to-nightvale/wtnv20_model.zip
463MB [00:12, 36.4MB/s] 
Archive:  wtnv20_model.zip
  inflating: wtnv20_model/config.json  
  inflating: wtnv20_model/pytorch_model.bin  
  inflating: wtnv20_model/training_args.bin  


Now we can quickly re-initialise the new model and generate another transcript.

In [None]:
%%time
# Load the 20-epoch model into memory
wtnv_model = GPT2LMHeadModel.from_pretrained("wtnv20_model", local_files_only=True)

# Feed some text to start the new transcript
seed_text = 'Welcome to Night Vale'  #@param {type:"string"}

# Convert the sentence to a tensor of token IDs
seed_ids = tokenizer.encode(seed_text, return_tensors='pt')

output = wtnv_model.generate(
    
    seed_ids,       # The seed text IDs
    do_sample=True,
    
    # Tweak these values and see what they do:
    max_length=1000,
    top_k=0,
    top_p=0.92,
    temperature=0.9
    
)

print(tokenizer.decode(output[0]))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


## Stuff to do next:

- Google some of the characters and places to see if they exist in the world of Welcome to Night Vale or not.
- Take some time to generate new texts with different variables and get a feel for how the output changes.
- Make some comments on the weaknesses of the model output.
- See if you can make a transcript that you think is pretty good.
- See if you can make a transcript that is pretty terrible.