<a href="https://colab.research.google.com/github/pthoehne/DialoGPT/blob/master/Learning_Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Getting Started with Machine Learning

---


*An Introduction to Google Colab, Python and aitextgen*



**Before we get started**: Save a copy of this notebook in your own Google Drive to keep and edit ('File' > 'Save a Copy in Drive')

# Step 1: Working in Google Colab

Google Colab is an incredibly useful and user-friendly resource that allows us to write, execute, and share code all in our browsers. These notebooks are cloud-based, meaning all the code execution happens on Google's own servers, **not on your local machine!** So, whenever we download a library or execute code, have no fear that we are downloading strange and arcane code onto your own personal computer; it all stays safely in the cloud. We will download some files onto our actual machines during this session, but nothing will be installed through the execution of code.

One of the biggest advantages of Google Colab for machine learning is that it allows us to use a cloud-based GPU! A GPU is a "Graphics Processing Unit," as opposed to a "Central Processing Unit" or CPU. While GPUs were made to handle graphical operations, their ability to run many simple calculations in parallel makes them well suited to machine learning. GPUs are not always available on Colab due to usage limits, but when they are, they allow us to train our models at an accelerated rate. However, this introduction will work for both CPU and GPU training. To try to use a GPU, navigate to 'Runtime' > 'Change runtime type' and select 'GPU.'
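
To see which kind of runtime you ended up with, a small check like the one below can help. It is only a sketch: it assumes that a GPU runtime exposes NVIDIA's `nvidia-smi` command-line tool, and the function name `gpu_available` is made up for this example.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if the NVIDIA driver tool reports at least one GPU."""
    # nvidia-smi is only on the PATH when an NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # -L lists the attached GPUs, one per line
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU runtime" if gpu_available() else "CPU runtime")
```

If this reports a CPU runtime, the rest of the notebook should still work, only more slowly.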

Because this is a cloud-based resource, the "instance" will reset after a period of time or when interrupted. **This means that the models and tokenizers that we generate here are temporary unless we save them locally to our own computers or to our Google Drives.** So, be sure to save any generated files you want to keep, and be sure to re-execute all your code if the runtime is interrupted!
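
One way to keep generated files is to copy them somewhere that outlives the runtime. The sketch below uses only the Python standard library; the helper name `back_up` and the destination folder are hypothetical, and mounting Google Drive (shown in the comments) is assumed to have been done separately.

```python
import shutil
from pathlib import Path


def back_up(paths, destination):
    """Copy each existing file in `paths` into `destination`; return the copies."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in map(Path, paths):
        if path.is_file():  # silently skip files that were never created
            copied.append(Path(shutil.copy2(path, destination / path.name)))
    return copied


# In Colab, Google Drive is typically mounted first with:
#   from google.colab import drive
#   drive.mount("/content/drive")
# after which a folder such as "/content/drive/MyDrive/cookbook_model"
# (a hypothetical name) could serve as `destination`.
```

For example, `back_up(["aitextgen.tokenizer.json"], "/content/drive/MyDrive/cookbook_model")` would copy the tokenizer once it exists.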

Google Colab works with Python, a very handy programming language. You will see a lot of Python code in the course of this tutorial, but again, no fear. **You need no coding experience whatsoever to complete this lesson and begin training your own model!** All of the code is pre-written, and simply requires you to execute it by pressing the 'Run Cell' brackets to the left of the code. Still, I will explain what the code is doing as we make our way through this introduction. **Be sure to run all the cells in order**, as some depend on previous cells having been executed! We will progress through the cells together, so do not worry about having to rush through them all at once.

Now, before we go any further, **let's execute some code.**

In [1]:
print("Hello World!")

Hello World!


You did it! You executed code, and are now ready to press bravely on into the world of machine learning.

# Step 2: Creating a Corpus 

That was a lot of technical information. Now, let's dive right into preparing the data on which we will train our model.

Today, we will be attempting to train a model on a corpus drawn from early 20th-century cookbooks! *The goal is to try to train a model to generate what it thinks a 20th-century recipe should look like, and then to evaluate the promise and limitations of such an approach.*

First, however, we need that training material. GPT-2, which is the machine learning model we are using today, will be trained using .txt files. We will need to create a .txt file large enough to give the model plenty of language on which to train.

To do this, we will navigate the [*Early American Cookbooks*](https://babel.hathitrust.org/cgi/mb?a=listis&c=1934413200) collection, curated by Gioia Stevens of NYU Libraries. The collection is hosted on HathiTrust. There are two major advantages to this collection.

> 1) The works can be downloaded as .txt files. To avoid the delay of logging in, you can download and inspect one such file [here](https://drive.google.com/file/d/1rBT_6qxp5PvNHnevZkIoV_K42vFEORkj/view?usp=sharing). If you would like to download your own texts, sign in, navigate to 'Download' to the left of the text, select '.txt', click 'Whole book,' and hit download. 

> 2) Critically, this downloaded .txt is full of 'clean' text. The .txt files for many historical texts are produced through OCR, or optical character recognition. This automates the work of creating searchable text, but can result in messy and inaccurate outputs. See [this example](https://drive.google.com/file/d/1DsZ7wC4jE6_UJmWP2CXrEoAqUfXZUxqa/view?usp=sharing) from a 1923 issue of the *Omaha Morning Bee* sourced from *Chronicling America*.

> Our model will be attempting to learn and replicate whatever text we train it on. It cannot tell 'good' text from 'bad.' So, if we feed it messy input we will get a messy output. Luckily, however, our cookbooks are in [good shape](https://drive.google.com/file/d/1qLrw8DVX_LYk_ebSzrq3A7i_rxXcCZRv/view?usp=sharing). 

> As you can see, there are some minor issues, but the text looks pretty good.
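
A rough way to compare 'clean' and 'messy' text is to measure how many tokens look like ordinary words. The function below is a hypothetical heuristic written for this illustration, not part of any library, and it will undercount clean recipe text that contains many numbers.

```python
import re


def wordlike_ratio(text: str) -> float:
    """Share of whitespace-separated tokens made only of letters."""
    tokens = text.split()
    if not tokens:
        return 0.0
    # Strip common punctuation from both ends, then require letters only
    # (allowing internal hyphens and apostrophes, as in "well-beaten").
    wordlike = sum(
        1
        for token in tokens
        if re.fullmatch(r"[A-Za-z]+(?:[-'][A-Za-z]+)*", token.strip(".,;:!?()\"'"))
    )
    return wordlike / len(tokens)


print(wordlike_ratio("Add a cup of flour."))
print(wordlike_ratio("t|1e c0ok ~~ 4$"))
```

A noticeably lower ratio for one file than for another suggests heavier OCR damage, though the number is only a rough signal.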

Now, GPT2 benefits from *lots* of text, so our next step would be to compile a number of similar cookbooks together into one big .txt file that will serve as our **corpus** for training. We would then go through and clean that corpus, deleting the page numbers, indexes, and whatever else we do not want our model learning. In our case, that is everything but the recipes themselves. However, as we only have a limited time together, I have gone ahead and done this for us. Find the cleaned corpus [here](https://drive.google.com/file/d/1IYbfUR7nepQbnYvo3_f_4N1_FBTqxMff/view?usp=sharing). Go ahead and download that file now, and open it up. You will notice I also added '<|endoftext|>' between the recipes. This serves as a token during training to let the model know where each recipe begins and ends.
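
The cleaning and joining described above could be sketched as follows. This is not the script used to build the workshop corpus: the function names, the page-number rule, the choice to put the token on its own line, and the two sample recipes are all assumptions made for illustration.

```python
import re

END_OF_TEXT = "<|endoftext|>"


def clean_recipe(text: str) -> str:
    """Drop lines that hold only a page number and trim surrounding whitespace."""
    kept = [
        line.rstrip()
        for line in text.splitlines()
        if not re.fullmatch(r"\s*\d+\s*", line)
    ]
    return "\n".join(kept).strip()


def build_corpus(recipes) -> str:
    """Join cleaned recipes, placing the end-of-text token between them."""
    cleaned = [clean_recipe(recipe) for recipe in recipes]
    return f"\n{END_OF_TEXT}\n".join(text for text in cleaned if text)


# Two made-up recipes; the lone "12" stands in for a stray page number.
sample = ["Toast.\nBrown the bread.\n12\n", "Tea.\nSteep five minutes."]
print(build_corpus(sample))
```

Real cookbooks need more cleaning than this (indexes, running headers, and so on), which is why hand-checking the result is still worthwhile.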

Now, we have our corpus and it is ready for training! Let's begin.



# Step 3: aitextgen and Training

This code will take a little while to execute, so go ahead and run it while I talk a little bit about aitextgen and what we are doing here.

In [2]:
!pip uninstall -qqy torch torchvision torchtext torchaudio fastai 
!pip install -qq torch==1.9.0 pytorch-lightning==1.7.7 aitextgen gdown

import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )

from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset, merge_datasets
from aitextgen.utils import build_gpt2_config
from aitextgen.tokenizers import train_tokenizer
import gdown



Today, we will be using Max Woolf's [aitextgen](https://docs.aitextgen.io/) to train our model and generate outputs from that trained model. Woolf is a data scientist who created aitextgen in hopes of making GPT2 simpler and more accessible to work with.

aitextgen is a Python library, so we will need to install it before we start training. Remember, this installation happens in the cloud and not on your personal machine.

If you have not already done so, run the installation cell above to install the necessary libraries.

**Next**, we need to get our corpus into the notebook and identify it. To upload the file yourself, click the folder icon on the left side of your screen. This should expand into a window entitled 'Files.' Right click in this window, select 'Upload,' and upload the corpus. The cell below also downloads the same file with gdown, so it will run even if you skip the manual upload.

Then, we use the corpus file to train a tokenizer. GPT2 does not read letters and words directly; the tokenizer turns the text into tokens for training. This will generate a new file under 'Files' named 'aitextgen.tokenizer.json.'

In [3]:
!gdown 1IYbfUR7nepQbnYvo3_f_4N1_FBTqxMff

cookbook_corpus = "cookbooks_corpus.txt"

train_tokenizer(cookbook_corpus)


Downloading...
From: https://drive.google.com/uc?id=1IYbfUR7nepQbnYvo3_f_4N1_FBTqxMff
To: /content/cookbooks_corpus.txt
  0% 0.00/764k [00:00<?, ?B/s]100% 764k/764k [00:00<00:00, 175MB/s]
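
GPT-2 tokenizers are built with byte-pair encoding (BPE), which repeatedly merges the most frequent pair of neighbouring symbols into a new token. The toy below shows a single merge step on three made-up words. It works on characters rather than bytes and is only meant to convey the idea; it is not how `train_tokenizer` is implemented.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair that occurs most often across all words."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged_words = []
    for symbols in words:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        merged_words.append(merged)
    return merged_words


# Start with each word split into single characters.
words = [list(word) for word in ["stir", "stirred", "stew"]]
pair = most_frequent_pair(words)
print(pair, merge_pair(words, pair))
```

A real tokenizer keeps merging until it reaches the requested vocabulary size, so common cookbook words end up as single tokens.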


Now we will specify the configurations for our GPT2 build. Most of this refers to the size of the vocabulary used to train the model, the maximum length for the model, token embeddings, and so on. We do not need to adjust this for an introductory session, but if you would like to learn more feel free to explore Woolf's [site](https://docs.aitextgen.io/tutorials/model-from-scratch/).

*Note*: `to_gpu` is set to `True`. If you are not able to use a GPU, change `True` to `False`.



In [6]:
config = build_gpt2_config(vocab_size=5000, max_length=512, dropout=0.0, n_embd=256, n_layer=8, n_head=8)

ai = aitextgen(config=config,
               tokenizer_file="aitextgen.tokenizer.json",
               to_gpu=True)

INFO:aitextgen:Constructing model from provided config.
INFO:aitextgen:GPT2 loaded with 7M parameters.
INFO:aitextgen:Using a custom tokenizer.
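
The "7M parameters" figure in the log can be sanity-checked by hand. The sketch below counts the weights of a standard GPT-2 layout from the settings passed to `build_gpt2_config`. It assumes that `max_length` sets the number of position embeddings and that the output layer shares its weights with the token embeddings; the function name is made up for this example.

```python
def gpt2_parameter_count(vocab_size, max_length, n_embd, n_layer):
    """Count weights in a standard GPT-2 layout with tied input/output embeddings."""
    token_embeddings = vocab_size * n_embd
    position_embeddings = max_length * n_embd
    layer_norm = 2 * n_embd  # one scale and one shift value per dimension

    # Attention: a combined query/key/value projection plus an output projection.
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward: expand to 4 * n_embd, then project back down.
    feed_forward = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    per_layer = attention + feed_forward + 2 * layer_norm

    # The final term is the layer norm applied after the last block.
    return token_embeddings + position_embeddings + n_layer * per_layer + layer_norm


total = gpt2_parameter_count(vocab_size=5000, max_length=512, n_embd=256, n_layer=8)
print(f"{total:,} parameters")
```

Under these assumptions the count comes to about 7.7 million, in line with the log. Note that `n_head` does not appear: it changes how the attention weights are split up, not how many there are.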


With that all done, we can try to generate some text. Run the cell below.

In [7]:
ai.generate(5)

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


&heeseise st�(Jtoeason ucup pin�ne standpe(R ofchooreFinala veal fine choress pinbreadredspoonfulhickenixture the " cornbread thickenoneave p�
hardmmerTvery W cookpppp�4oft� beef� whicherer i��ong�us cook Letho buttered�{endpoonhicken wellPEgeason wine��ide�hicken water� 1 oz cho tom
carefullyJro� oun " F� sesof/ pudding or 5ableable] teaspoonful�flour� ; carefullyissolchowellzzid� allarm�choixtureleanissolved� softell hours_�easina
vealhopten72 water through beat beatJelly 14sauceallyallyally lb crumbstenetwitheasful tea qeas water off When sugarJFrinkleatf ;red{eas�� ozew corn li cooked ixture� hard Then
he cookeas choeasina cookatoesightly have have chooneher fill but butoilarageut slooeason�with�boileasonasseatenatoesissolved ozasseasg oz ozafro fine tablespoonful tablespoonful ;eason "�urresssalt


That does not look great...because we have not trained our model! 

It is time to begin training. Run the cell below.

This will take a while, even though we have only set this training to 5,000 steps. Every 1,000 steps, the model will generate some sample text so we can see how it is progressing.

Some useful notes:

**'num_steps'** refers to how many 'steps' will occur over the course of training. More steps mean longer training and, up to a point, a better model, since the model sees more of the corpus.

**'Learning rate'** refers to how much the model's weights are adjusted at each training step. We will leave this alone for this introduction.

**'batch_size'** refers to the number of training samples the model processes at each step of training. Higher batch sizes tend to cause Out of Memory (OOM) errors unless you have access to lots of memory, so we will keep this at '1.'

**'Loss'** and **'average loss'** tell us how our model is doing. Over the course of training, this number should fall. The lower the loss, the better the trained model is performing.
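
To give the loss number some intuition, the sketch below assumes the reported loss is the usual cross-entropy for language models, measured with natural logarithms. The function names are invented for this example.

```python
import math


def token_loss(probability_of_correct_token: float) -> float:
    """Cross-entropy for one prediction: small when the model was confident and right."""
    return -math.log(probability_of_correct_token)


def average_loss(probabilities) -> float:
    """Mean cross-entropy over a sequence of predictions."""
    losses = [token_loss(p) for p in probabilities]
    return sum(losses) / len(losses)


def typical_probability(loss: float) -> float:
    """Invert the loss: the (geometric-mean) probability given to the correct token."""
    return math.exp(-loss)


for p in (0.01, 0.5, 0.9):
    print(f"p={p:<4} loss={token_loss(p):.3f}")
```

Read this way, a falling loss means the model is assigning more and more probability to the token that actually comes next in the corpus.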

For a deeper dive into some of these settings and the concepts behind them, see Chantal Brousseau's excellent [article](https://programminghistorian.org/en/lessons/interrogating-national-narrative-gpt) in the *Programming Historian* (a wonderful resource for learning about many DH tools). 

In [8]:
ai.train(cookbook_corpus,
         line_by_line=False,
         from_cache=False,
         num_steps=5000,
         generate_every=1000,
         save_every=1000,
         save_gdrive=False,
         learning_rate=1e-3,
         batch_size=1,
         )

INFO:aitextgen:Loading text from cookbooks_corpus.txt with generation length of 512.


  0%|          | 0/20104 [00:00<?, ?it/s]

INFO:aitextgen.TokenDataset:Encoding 20,104 sets of tokens from cookbooks_corpus.txt.
  rank_zero_deprecation(
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
  rank_zero_deprecation(
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/5000 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


1,000 steps reached: saving model to /trained_model
1,000 steps reached: generating sample texts.
 Season with a little fine chopped



A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


2,000 steps reached: saving model to /trained_model
2,000 steps reached: generating sample texts.
dfted cayen.-FromFrom “




A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


3,000 steps reached: saving model to /trained_model
3,000 steps reached: generating sample texts.
Toast.
Scher ätipe stripe a
aise fire) end of cold water, 14 of flour and a few minutes).
add 12 small
adding the
add 72 a cup of salt and strain it boil 10 minutes;
baking powder. Let it a little
Put i
y to a few
t and stir in a stiff
therol, add 1 cup of cream ; stirring
the mould
chopped onion. Add i pint of a buttered
and cook 10 minutes and a few



A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


4,000 steps reached: saving model to /trained_model
4,000 steps reached: generating sample texts.
Rocoanut Soup.
Cover Ice of water with I
tablespoonfuls water; add ½ a teaspoonful of pepper and cook in
b it up, and stirring contwarm water), when
until quick. Whip in a half a bring
melted sugar. Then add a little cold
a pint of sugar, and if
mouned in a quitead of gelatine.



A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=5000` reached.


5,000 steps reached: saving model to /trained_model
5,000 steps reached: generating sample texts.
–Strawberry wine.



INFO:aitextgen:Saving trained model pytorch_model.bin to /trained_model


**The initial training is complete!** Run the cell below to generate text from this short training.

In [9]:
model_folder = "./trained_model"

tokenizer_file = "aitextgen.tokenizer.json"

config = "config.json"

ai2 = aitextgen(tokenizer_file=tokenizer_file, model_folder=model_folder, model="pytorch_model.bin", config=config)

ai2.generate(n=5,
            max_length=512,
            temperature=0.7
            )

INFO:aitextgen:Loading model from provided weights and config in /./trained_model.
INFO:aitextgen:GPT2 loaded with 7M parameters.
INFO:aitextgen:Using a custom tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


-Strientiled Rice Cuttripe.
Boil 6 or 3 or vegetables,
for ten minutes, and the
just before serving.
Add a little cold water, salt and
and add a little vanilla, a pinch of salt,
and stir over the yolks of the eggs
well together, and add the whites of 5 eggs
and serve cold.

—Broquettes.
When very hot stir in a double boiler with the fire until the
for readyr; add i pint of cream
well beaten yolks of 4 of 4 eggs and the whites of
eggs, a pinch of salt and white wine; then add
well to the mixture; stir in the fire and add 1/2 cup-
ful of milk and 3 tablespoonfuls of butter ; let boil
it gently until tender and then add a few
minutes. Stir until the yolks of 4 eggs
cream ; then add 1/2 cup of the yolks of 2 tablespoonfuls of
lemon and the juice of 12 lemon-
lemon and the whites of the eggs and
from the whites of the whites of the eggs.
Shels and dissolve orange with the piece of
and bake in a moderate oven until done.

–Ston of Liled Oyster.

Sh Real Sauce.
Take a pint of paragusated suga

This generated text looks *much* better than the pre-training generated text. Some of it is even beginning to resemble a cookbook. Still, the text is fragmented and full of errors.
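
The generation cell above passes `temperature=0.7`. As a rough illustration of what that setting does, the sketch below applies a temperature to three made-up scores before turning them into probabilities. The scores and the function name are assumptions for this example, and the real sampling code inside the library is more involved.

```python
import math


def softmax_with_temperature(scores, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [score / temperature for score in scores]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


scores = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.7):
    print(t, [round(p, 3) for p in softmax_with_temperature(scores, t)])
```

Lower temperatures push more probability onto the top candidate, which tends to give safer but more repetitive text. Higher values give more varied and more error-prone output.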

**To properly train a robust model takes a *long* time.** We only have an hour together today, and that would not be nearly enough time to train a proper model. 

So, in the interest of time, I trained a model on our corpus in advance. For good measure, I used a GPU cluster to train this model over the course of **500,000 steps**. This long training brought the loss down to an average of just 0.081.

Go ahead and download the tokenizer [here](https://drive.google.com/file/d/1zhUChyyoFmkF8JNDPAL75MPzUWLZewsU/view?usp=sharing) and the model folder [here](https://drive.google.com/drive/folders/18kOhXcavq4FaGQj6uyKP-KT5RHT726uJ?usp=sharing). Download both files from this folder.

**Now upload the tokenizer into the 'Files' window as you did with the corpus earlier. Then, right click in the 'Files' window and select 'New Folder.' Name this folder 'trained_model_cookbooks' and upload both files from the model folder.**

# Step 4: Exploring a Fully Trained Model 

Go ahead and run this cell as many times as you want, exploring the generated text! The text will not be perfect, but it will help us evaluate the promises and limitations of using GPT2 and machine learning models as tools for analysis. Share your findings in this [Google Doc](https://docs.google.com/document/d/1kghwCiXj49TJM6mmCpOIir9nPaYAUHjVOUHSmeF_ST0/edit?usp=sharing), and we will explore them together.



In [10]:
!mkdir -p trained_model_cookbooks

In [11]:
!gdown 1zhUChyyoFmkF8JNDPAL75MPzUWLZewsU

Downloading...
From: https://drive.google.com/uc?id=1zhUChyyoFmkF8JNDPAL75MPzUWLZewsU
To: /content/aitextgen.tokenizer(cookbooks).json
  0% 0.00/31.0k [00:00<?, ?B/s]100% 31.0k/31.0k [00:00<00:00, 49.3MB/s]


In [21]:
%cd trained_model_cookbooks

!gdown 1jKmdGlT_kYFImY2aHQtIYTVbHwOdGocy 
!gdown 1zZQRrAvF8DKriIDqHmvT5YEwIbrC4ye3

%cd /content

/content/trained_model_cookbooks
Downloading...
From: https://drive.google.com/uc?id=1jKmdGlT_kYFImY2aHQtIYTVbHwOdGocy
To: /content/trained_model_cookbooks/config.json
100% 780/780 [00:00<00:00, 1.57MB/s]
Downloading...
From: https://drive.google.com/uc?id=1zZQRrAvF8DKriIDqHmvT5YEwIbrC4ye3
To: /content/trained_model_cookbooks/pytorch_model.bin
100% 35.6M/35.6M [00:00<00:00, 335MB/s]
/content
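
Before loading the model, it can help to confirm that the folder holds both downloaded files and that the tokenizer sits beside it, since a missing file is a common reason for the load to fail. The helper below is a hypothetical convenience written for this check and is not part of aitextgen.

```python
from pathlib import Path


def missing_files(model_folder, tokenizer_file):
    """Return the expected files that are not present, as a list of paths."""
    folder = Path(model_folder)
    expected = [
        folder / "pytorch_model.bin",
        folder / "config.json",
        Path(tokenizer_file),
    ]
    return [path for path in expected if not path.is_file()]


missing = missing_files("./trained_model_cookbooks", "aitextgen.tokenizer(cookbooks).json")
print("All files found." if not missing else f"Missing: {[str(p) for p in missing]}")
```

If anything is listed as missing, re-run the download cells above (or repeat the manual upload) before moving on.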


In [None]:
model_folder = "./trained_model_cookbooks"

tokenizer_file = "aitextgen.tokenizer(cookbooks).json"

config = "config.json"

ai2 = aitextgen(tokenizer_file=tokenizer_file, model_folder=model_folder, model="pytorch_model.bin", config=config)

ai2.generate(n=5,
            max_length=512,
            temperature=0.7
            )


# Conclusion

You did it! You made it through the steps needed to produce a corpus, train a model, and explore the output. Feel free to return to this guide and tweak any of the settings to your liking. You may also replace our cookbook corpus with a corpus of your own.

*Bon Appétit!*
