Commit

update docs for v0.5.0
minimaxir committed Apr 18, 2021
1 parent 7c7888a commit ad6e3b2
Showing 13 changed files with 143 additions and 94 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -4,7 +4,7 @@ A robust Python tool for text-based AI training and generation using [OpenAI's](

aitextgen is a Python package that leverages [PyTorch](https://pytorch.org), [Hugging Face Transformers](https://github.com/huggingface/transformers) and [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) with specific optimizations for text generation using GPT-2, plus _many_ added features. It is the successor to [textgenrnn](https://github.com/minimaxir/textgenrnn) and [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple), taking the best of both packages:

- Finetunes on a pretrained 124M/355M/774M GPT-2 model from OpenAI or a 1325M/355M GPT Neo model from EleutherAI...or create your own GPT-2/GPT Neo model + tokenizer and train from scratch!
- Finetunes on a pretrained 124M/355M/774M GPT-2 model from OpenAI or a 125M/350M GPT Neo model from EleutherAI...or create your own GPT-2/GPT Neo model + tokenizer and train from scratch!
- Generates text faster than gpt-2-simple and with better memory efficiency!
- With Transformers, aitextgen preserves compatibility with the base package, allowing you to use the model for other NLP tasks, download custom GPT-2 models from the HuggingFace model repository, and upload your own models! Also, it uses the included `generate()` function to allow a massive amount of control over the generated text.
- With pytorch-lightning, aitextgen trains models not just on CPUs and GPUs, but also _multiple_ GPUs and (eventually) TPUs! It also includes a pretty training progress bar, with the ability to add optional loggers.
@@ -35,7 +35,7 @@ Here's how you can quickly test out aitextgen on your own computer, even if you

For generating text from a pretrained GPT-2 model:

```python
```py3
from aitextgen import aitextgen

# Without any parameters, aitextgen() will download, cache, and load the 124M GPT-2 "small" model
@@ -56,7 +56,7 @@ aitextgen generate --prompt "I believe in unicorns because" --to_file False

Want to train your own mini GPT-2 model on your own computer? You can follow along [in this Jupyter Notebook](/notebooks/training_hello_world.ipynb) or download this [text file of Shakespeare's plays](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt), cd to that directory in a Terminal, open up a `python3` console, and go:

```python
```py3
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2ConfigCPU
33 changes: 23 additions & 10 deletions docs/dataset.md
@@ -1,54 +1,67 @@
# TokenDataset

aitextgen has a special class, `TokenDataset`, used for managing tokenized datasets to be fed into model training. (this is in contrast with other GPT-2 finetuning approaches, which tokenizes at training time although you can still do that if you want)
aitextgen has a special class, `TokenDataset`, used for managing tokenized datasets to be fed into model training. (This is in contrast with other GPT-2 finetuning approaches, which tokenize at training time, although you can still do that by passing a `file_path` and other relevant parameters to `ai.train()`.)

This has a few nice bonuses, including:

- Tokenize a dataset on a local machine ahead of time and compress it, saving time/bandwidth transporting data to a remote machine
- Supports reading a dataset either line-by-line (including single-column CSVs) or as bulk texts.
- Debug and log the loaded texts.
- Merge datasets together without using external libraries
- Cross-train on multiple datasets to "blend" them together.

## Creating a TokenDataset For GPT-2 Finetuning

The easiest way to create a TokenDataset is to provide a target file. If no `vocab_file` and `merges_file` are provided, it will use the default GPT-2 tokenizer.
The easiest way to create a TokenDataset is to provide a target file. If no `tokenizer_file` is provided, it will use the default GPT-2 tokenizer.

```python
```py3
from aitextgen.TokenDataset import TokenDataset

data = TokenDataset("shakespeare.txt")
```

If you pass a single-column CSV and specify `line_by_line=True`, the TokenDataset will parse it row-by-row; this is the recommended way to handle multiline texts.

```python
```py3
data = TokenDataset("politics.csv", line_by_line=True)
```

You can also manually pass a list of texts to `texts` instead if you've processed them elsewhere.

```python
```py3
data = TokenDataset(texts = ["Lorem", "Ipsum", "Dolor"])
```

## Block Size

`block_size` is another parameter that can be passed when creating a TokenDataset, and is mostly useful for custom models. It should match the model's context window (e.g. the `n_positions` or `max_position_embeddings` config parameters). By default, it will choose `1024`: the GPT-2 context window.

When implicitly loading a dataset via `ai.train()`, the `block_size` will be set to what is supported by the corresponding model `config`.
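
For example, a minimal sketch for a small custom model with a 64-token context window (the file name and value are placeholders):

```py3
from aitextgen.TokenDataset import TokenDataset

# block_size should match the model's context window
# (the n_positions / max_position_embeddings config parameters)
data = TokenDataset("shakespeare.txt", block_size=64)
```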

## Debugging a TokenDataset

When loading a dataset, a progress bar will appear showing how many texts have been loaded.

If you want to see what exactly is input to the model during training, you can access a slice via `data[0]`.
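
For example, a minimal sketch:

```py3
from aitextgen.TokenDataset import TokenDataset

data = TokenDataset("shakespeare.txt")
print(data[0])  # the encoded tokens for the first subset fed to the model
```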

## Saving/Loading a TokenDataset

When creating a TokenDataset, you can automatically save it as a compressed gzipped numpy array when completed.

```python
```py3
data = TokenDataset("shakespeare.txt", save_cache=True)
```

Or save it after you've loaded it with the `save()` function.

```python
```py3
data = TokenDataset("shakespeare.txt")
data.save()
```

By default, it will save to `dataset_cache.tar.gz`. You can then reload that into another Python session by specifying the cache.

```python
```py3
data = TokenDataset("dataset_cache.tar.gz", from_cache=True)
```

@@ -58,7 +71,7 @@ data = TokenDataset("dataset_cache.tar.gz", from_cache=True)

## Using TokenDatasets with a Custom GPT-2 Model

The default TokenDataset has a `block_size` of `1024`, which corresponds to the _context window of the default GPT-2 model_. If you're using a custom model w/ a different maximum. Additionally, you must explicitly provide the vocab and merges files to rebuild the tokenizer, as the tokenizer will be different than the normal GPT-2 one.
The default TokenDataset has a `block_size` of `1024`, which corresponds to the _context window of the default GPT-2 model_. If you're using a custom model with a different maximum context length, set `block_size` to match it. Additionally, you must explicitly provide the tokenizer file to rebuild the tokenizer, as the tokenizer will be different than the normal GPT-2 one.
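
As a sketch, assuming a custom tokenizer saved as `aitextgen.tokenizer.json` and a model with a 64-token context window (both values are placeholders):

```py3
from aitextgen.TokenDataset import TokenDataset

data = TokenDataset("shakespeare.txt",
                    tokenizer_file="aitextgen.tokenizer.json",
                    block_size=64)
```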

See the [Model From Scratch](tutorials/model-from-scratch.md) docs for more info.

@@ -72,7 +85,7 @@ Merging processed TokenDatasets can be done with the `merge_datasets()` function
!!! note "About Merging"
The current implementation merges by subset count, so equalization may not be perfect, but it will not significantly impact training.

```python
```py3
from aitextgen.TokenDataset import TokenDataset, merge_datasets

data1 = TokenDataset("politics1000.csv", line_by_line=True) # 1000 samples
6 changes: 3 additions & 3 deletions docs/generate-performance.md
@@ -10,7 +10,7 @@ PyTorch has the ability to quantize models on the CPU. Currently, it will only q

To quantize a model after it's loaded, just run:

```python
```py3
ai.quantize()
```

@@ -22,13 +22,13 @@ Certain GPUs, notably the cheap T4 and the expensive V100, support the ability t

Assuming you are using a compatible GPU and already have [apex](https://github.com/NVIDIA/apex) installed, you can convert a model to the "half" FP16 mode with this:

```python
```py3
ai.to_fp16()
```

If you want to convert the model _before_ loading it into GPU memory (which may help avoid memory leaks), you can instantiate the model like this:

```python
```py3
ai = aitextgen(to_gpu=True, to_fp16=True)
```

18 changes: 11 additions & 7 deletions docs/generate.md
@@ -7,14 +7,18 @@ Thanks to the base Transformers package, aitextgen has more options for generati
See [this article](https://huggingface.co/blog/how-to-generate) by Huggingface engineer Patrick von Platen for how sampling and these parameters are used in practice.

- `n`: Number of texts generated.
- `max_length`: Maximum length of the generated text (default: 200; for GPT-2, the maximum is 1024.)
- `prompt`: Prompt that starts the generated text and is included in the generate text. (used to be `prefix` in previous tools)
- `max_length`: Maximum length of the generated text (default: 200; for GPT-2, the maximum is 1024; for GPT Neo, the maximum is 2048)
- `prompt`: Prompt that starts the generated text and is included in the generated text.
- `temperature`: Controls the "craziness" of the text (default: 0.7)
- `top_k`: If nonzero, limits the sampled tokens to the top _k_ values. (default: 0)
- `top_p`: If nonzero, limits the sampled tokens to the smallest set whose cumulative probability exceeds this value (nucleus sampling).
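
For example, a minimal sketch combining several of these parameters on a loaded `aitextgen` object `ai` (the prompt and values are placeholders):

```py3
ai.generate(n=3,
            max_length=100,
            prompt="I believe in unicorns because",
            temperature=0.7,
            top_p=0.9)
```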

Some lesser-known-but-still-useful-parameters that are unique to Transformers:

<!--prettier-ignore-->
!!! warning "Performance"
Enabling these parameters may slow down generation.

- `num_beams`: If greater than 1, executes beam search for cleaner text.
- `repetition_penalty`: If greater than 1.0, penalizes repetition in a text to avoid infinite loops.
- `length_penalty`: If greater than 1.0, penalizes text proportional to the length
@@ -30,17 +34,17 @@ Given an `aitextgen` object with a loaded model + tokenizer named `ai`:
want to generate on the GPU, make sure you call `ai.to_gpu()` beforehand, or
load the model into the GPU using `ai = aitextgen(to_gpu=True)`

- `ai.generate()`: Generates and prints text to console. If `prompt` is used, the `prompt` is bolded. (a la [Talk to Transformer](https://talktotransformer.com))
- `ai.generate()`: Generates and prints text to console. If `prompt` is used, the `prompt` is **bolded**.
- `ai.generate_one()`: A helper function which generates a single text and returns as a string (good for APIs)
- `ai.generate_samples()`: Generates multiple samples at specified temperatures: great for debugging.
- `ai.generate_to_file()`: Generates a bulk amount of texts to file. (this accepts a `batch_size` parameter which is useful if using on a GPU)
- `ai.generate_to_file()`: Generates a bulk amount of texts to file. (this accepts a `batch_size` parameter which is useful if using on a GPU, as it can generate texts in parallel with no performance loss)
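
For example, a minimal sketch of bulk generation (the counts are placeholders):

```py3
# write 300 texts to a file, generating 25 at a time
ai.generate_to_file(n=300, batch_size=25)
```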

<!-- prettier-ignore -->
!!! note "Cleanup"
By default, the `cleanup` parameter is set to True, which automatically removes texts that are blatantly malformed (e.g. only 2 characters long). Therefore, there may be less than `n` results returned. You can disabled this behavior by setting `cleanup=False`.
!!! note "lstrip and nonempty_output"
By default, the `lstrip` and `nonempty_output` parameters to `generate` are set to `True`, which alters the behavior of the generated text in a way that is most likely preferable. `lstrip`: Removes all whitespace at the beginning of the generated text. `nonempty_output`: If the output is empty (possible on shortform content), skip it if generating multiple texts, or try again if it's a single text. If `min_length` is specified, the same behavior occurs for texts below the minimum length after processing.
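
If you prefer the raw output, a minimal sketch that disables both behaviors:

```py3
ai.generate(lstrip=False, nonempty_output=False)
```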

## Seed

aitextgen has a new `seed` parameter for generation. Using any generate function with a `seed` parameter (must be an integer) and all other models/parameters the same, and the generated text will be identical. This allows for reproducible generations.
aitextgen has a new `seed` parameter for generation. If you call any generate function with a `seed` parameter (which must be an integer) while keeping all other models/parameters the same, the generated text will be identical. This allows for reproducible generations in case someone accuses you of faking the AI output.
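
For example, a minimal sketch (any integer works as the seed):

```py3
ai.generate(prompt="I believe in unicorns because", seed=42)
```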

For `generate_to_file()`, the 8-digit number at the end of the file name will be the seed used to generate the file, making reproducibility easy.
12 changes: 7 additions & 5 deletions docs/index.md
@@ -1,12 +1,14 @@
# aitextgen

A robust tool for advanced AI text generation via [GPT-2](https://openai.com/blog/better-language-models/).
_Last Updated: April 18th, 2021 (aitextgen v0.5.0)_

aitextgen is a Python package that leverages [PyTorch](https://pytorch.org), [Huggingface Transformers](https://github.com/huggingface/transformers) and [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) with specific optimizations for text generation using GPT-2, plus _many_ added features. It is the successor to [textgenrnn](https://github.com/minimaxir/textgenrnn) and [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple), taking the best of both packages:
A robust Python tool for text-based AI training and generation using [OpenAI's](https://openai.com) [GPT-2](https://openai.com/blog/better-language-models/) and [EleutherAI's](https://www.eleuther.ai) [GPT Neo/GPT-3](https://github.com/EleutherAI/gpt-neo) architecture.

- Finetunes on a pretrained 124M GPT-2 model from OpenAI...or create your own GPT-2 model + tokenizer and train from scratch!
- Generates text faster than gpt-2-simple and with better memory efficiency! (even [from the 1.5B GPT-2 model](tutorials/generate_1_5b/)!)
- With Transformers, aitextgen preserves compatibility with the base package, allowing you to use the model for other NLP tasks, download custom GPT-2 models from the Huggingface model repository, and upload your own models! Also, it uses the included `generate()` function to allow a massive amount of control over the generated text.
aitextgen is a Python package that leverages [PyTorch](https://pytorch.org), [Hugging Face Transformers](https://github.com/huggingface/transformers) and [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) with specific optimizations for text generation using GPT-2, plus _many_ added features. It is the successor to [textgenrnn](https://github.com/minimaxir/textgenrnn) and [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple), taking the best of both packages:

- Finetunes on a pretrained 124M/355M/774M GPT-2 model from OpenAI or a 125M/350M GPT Neo model from EleutherAI...or create your own GPT-2/GPT Neo model + tokenizer and train from scratch!
- Generates text faster than gpt-2-simple and with better memory efficiency!
- With Transformers, aitextgen preserves compatibility with the base package, allowing you to use the model for other NLP tasks, download custom GPT-2 models from the HuggingFace model repository, and upload your own models! Also, it uses the included `generate()` function to allow a massive amount of control over the generated text.
- With pytorch-lightning, aitextgen trains models not just on CPUs and GPUs, but also _multiple_ GPUs and (eventually) TPUs! It also includes a pretty training progress bar, with the ability to add optional loggers.
- The input dataset is its own object, allowing you to not only easily encode megabytes of data in seconds, cache, and compress it on a local computer before transporting to a remote server, but you are able to _merge_ datasets without biasing the resulting dataset, or _cross-train_ on multiple datasets to create blended output.

49 changes: 33 additions & 16 deletions docs/load-model.md
@@ -8,45 +8,62 @@ There are several ways to load models.

## Loading an aitextgen model

The closer to the default 124M GPT-2 model, the fewer files you need!
For the base case, loading the default 124M GPT-2 model via Huggingface:

For the base case, loading the default model via Huggingface:

```python
```py3
ai = aitextgen()
```

The model will be downloaded to `cache_dir`: `/aitextgen` by default.
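
To cache the model somewhere else, a minimal sketch (the folder name is a placeholder):

```py3
ai = aitextgen(cache_dir="my_models")
```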

If you've finetuned a 124M GPT-2 model using aitextgen, you can pass the generated `pytorch_model.bin` to aitextgen:
If you're loading a custom model for a different GPT-2/GPT-Neo architecture _from scratch_ but with the normal GPT-2 tokenizer, you can pass only a config.

```python
ai = aitextgen(model="pytorch_model.bin")
```py3
from aitextgen.utils import GPT2ConfigCPU
config = GPT2ConfigCPU()
ai = aitextgen(config=config)
```

If you're loading a finetuned model of a different GPT-2 architecture, you'll must also pass the generated `config.json` to aitextgen:
While training/finetuning a model, two files will be created: the `pytorch_model.bin` which contains the weights for the model, and a `config.json` illustrating the architecture for the model. Both of these files are needed to reload the model.

If you've finetuned a model using aitextgen (the default model), you can pass the **folder name** containing the generated `pytorch_model.bin` and `config.json` to aitextgen (e.g. `trained_model`, which is where trained models will be saved by default).

<!--prettier-ignore-->
!!! note "Same Directory"
If both files are in the current directory, you can pass `model_folder="."`.

```python
ai = aitextgen(model="pytorch_model.bin", config=config)
```py3
ai = aitextgen(model_folder="trained_model")
```

If you want to download an alternative GPT-2 model from Huggingface's repository of models, pass that model name to `model`.
These examples assume you are using the default GPT-2 tokenizer. If you have a _custom tokenizer_, you'll need to pass that along with loading the model.

```py3
ai = aitextgen(model_folder="trained_model",
               tokenizer_file="aitextgen.tokenizer.json")
```

```python
If you want to download an alternative GPT-2 model from Hugging Face's repository of models, pass that model name to `model`.

```py3
ai = aitextgen(model="minimaxir/hacker-news")
```

The model and associated config + tokenizer will be downloaded into `cache_dir`.

## Loading TensorFlow-based GPT-2 models
This can also be used to download the [pretrained GPT Neo models](https://huggingface.co/EleutherAI) from EleutherAI.

aitextgen lets you download the models from Google's servers that OpenAI had uploaded back when GPT-2 was first released in 2019. These models are then converted to a PyTorch format.
```py3
ai = aitextgen(model="EleutherAI/gpt-neo-125M")
```

## Loading TensorFlow-based GPT-2 models

It's counterintuitive, but it's _substantially_ faster than downloading from Huggingface's servers, especially if you are running your code on Google Cloud Platform (e.g. Colab notebooks).
aitextgen lets you download the models from Microsoft's servers that OpenAI had uploaded back when GPT-2 was first released in 2019. These models are then converted to a PyTorch format.

To use this workflow, pass the corresponding model number to `tf_gpt2`:

```python
```py3
ai = aitextgen(tf_gpt2="124M")
```

4 changes: 2 additions & 2 deletions docs/loggers.md
@@ -6,14 +6,14 @@ You can create loggers with popular tools such as [TensorBoard](https://www.tens

For example, if you want to create a TensorBoard logger, you can create it:

```python
```py3
from pytorch_lightning import loggers

tb_logger = loggers.TensorBoardLogger('logs/')
```

Then pass it to the `loggers` parameter for `ai.train()`.

```python
```py3
ai.train(train_data=data, loggers=tb_logger)
```
10 changes: 5 additions & 5 deletions docs/save-model.md
@@ -2,15 +2,15 @@

There are multiple ways to save models.

Whenever a model is saved, two files are generated: `pytorch_model.bin` which contains the model weights, and `config.json` which is needed to load the model if it is not the base 124M GPT-2.
Whenever a model is saved, two files are generated: `pytorch_model.bin` which contains the model weights, and `config.json` which is needed to load the model.

Assuming we have an aitextgen model `ai`:

## Ad Hoc saving

The aitextgen model can be saved at any time using `save`.

```python
```py3
ai.save()
```

@@ -24,7 +24,7 @@ If you are using Google Colaboratory, you can mount your personal Google Drive t

First mount your Google Drive using `mount_gdrive()`:

```python
```py3
from aitextgen.colab import mount_gdrive, copy_file_to_gdrive
mount_gdrive()
```
@@ -33,7 +33,7 @@ You'll be asked for an auth code; input it and press enter, and a `My Drive` fol

You can drag and drop the model files into the Google Drive, or use `copy_file_to_gdrive` to copy them programmatically.

```python
```py3
copy_file_to_gdrive("pytorch_model.bin")
copy_file_to_gdrive("config.json")
```
@@ -48,7 +48,7 @@ Concerned about timeouts in Google Colab? aitextgen has a feature that will copy

As long as your drive is mounted as above, pass `save_gdrive = True` to the `train()` function:

```python
```py3
ai.train(save_gdrive=True)
```

