Commit

update docs for v0.5.0
minimaxir committed Apr 18, 2021
1 parent 7c7888a commit ad6e3b2
Showing 13 changed files with 143 additions and 94 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -4,7 +4,7 @@ A robust Python tool for text-based AI training and generation using [OpenAI's](

aitextgen is a Python package that leverages [PyTorch](https://pytorch.org), [Hugging Face Transformers](https://github.com/huggingface/transformers) and [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) with specific optimizations for text generation using GPT-2, plus _many_ added features. It is the successor to [textgenrnn](https://github.com/minimaxir/textgenrnn) and [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple), taking the best of both packages:

- Finetunes on a pretrained 124M/355M/774M GPT-2 model from OpenAI or a 1325M/355M GPT Neo model from EleutherAI...or create your own GPT-2/GPT Neo model + tokenizer and train from scratch!
- Finetunes on a pretrained 124M/355M/774M GPT-2 model from OpenAI or a 125M/350M GPT Neo model from EleutherAI...or create your own GPT-2/GPT Neo model + tokenizer and train from scratch!
- Generates text faster than gpt-2-simple and with better memory efficiency!
- With Transformers, aitextgen preserves compatibility with the base package, allowing you to use the model for other NLP tasks, download custom GPT-2 models from the HuggingFace model repository, and upload your own models! Also, it uses the included `generate()` function to allow a massive amount of control over the generated text.
- With pytorch-lightning, aitextgen trains models not just on CPUs and GPUs, but also _multiple_ GPUs and (eventually) TPUs! It also includes a pretty training progress bar, with the ability to add optional loggers.
@@ -35,7 +35,7 @@ Here's how you can quickly test out aitextgen on your own computer, even if you

For generating text from a pretrained GPT-2 model:

```python
```py3
from aitextgen import aitextgen

# Without any parameters, aitextgen() will download, cache, and load the 124M GPT-2 "small" model
@@ -56,7 +56,7 @@ aitextgen generate --prompt "I believe in unicorns because" --to_file False

Want to train your own mini GPT-2 model on your own computer? You can follow along [in this Jupyter Notebook](/notebooks/training_hello_world.ipynb) or download this [text file of Shakespeare's plays](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt), cd to that directory in a Terminal, open up a `python3` console, and go:

```python
```py3
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2ConfigCPU
33 changes: 23 additions & 10 deletions docs/dataset.md
@@ -1,54 +1,67 @@
# TokenDataset

aitextgen has a special class, `TokenDataset`, used for managing tokenized datasets to be fed into model training. (this is in contrast with other GPT-2 finetuning approaches, which tokenizes at training time although you can still do that if you want)
aitextgen has a special class, `TokenDataset`, used for managing tokenized datasets to be fed into model training. (This is in contrast with other GPT-2 finetuning approaches, which tokenize at training time, although you can still do that by passing a `file_path` and other relevant parameters to `ai.train()`.)

This has a few nice bonuses, including:

- Tokenize a dataset on a local machine ahead of time and compress it, saving time/bandwidth transporting data to a remote machine
- Supports reading a dataset either line-by-line (including single-column CSVs) or as bulk texts.
- Debug and log the loaded texts.
- Merge datasets together without using external libraries
- Cross-train on multiple datasets to "blend" them together.

## Creating a TokenDataset For GPT-2 Finetuning

The easiest way to create a TokenDataset is to provide a target file. If no `vocab_file` and `merges_file` are provided, it will use the default GPT-2 tokenizer.
The easiest way to create a TokenDataset is to provide a target file. If no `tokenizer_file` is provided, it will use the default GPT-2 tokenizer.

```python
```py3
from aitextgen.TokenDataset import TokenDataset

data = TokenDataset("shakespeare.txt")
```

If you pass a single-column CSV and specify `line_by_line=True`, the TokenDataset will parse it row-by-row; this is the recommended way to handle multiline texts.

```python
```py3
data = TokenDataset("politics.csv", line_by_line=True)
```

You can also manually pass a list of texts to `texts` instead if you've processed them elsewhere.

```python
```py3
data = TokenDataset(texts = ["Lorem", "Ipsum", "Dolor"])
```

## Block Size

`block_size` is another parameter that can be passed when creating a TokenDataset, and is mostly useful for custom models. It should match the model's context window (e.g. the `n_positions` or `max_position_embeddings` config parameters). By default, it will choose `1024`: the GPT-2 context window.

When implicitly loading a dataset via `ai.train()`, the `block_size` will be set to what is supported by the corresponding model `config`.
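
For example, a minimal sketch for a small custom model with a 64-token context window (the file name and value are placeholders):

```py3
from aitextgen.TokenDataset import TokenDataset

# block_size should match the model's context window
# (the n_positions / max_position_embeddings config parameters)
data = TokenDataset("shakespeare.txt", block_size=64)
```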

## Debugging a TokenDataset

When loading a dataset, a progress bar will appear showing how many texts have been loaded.

If you want to see what exactly is input to the model during training, you can access a slice via `data[0]`.
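
For example, a minimal sketch:

```py3
from aitextgen.TokenDataset import TokenDataset

data = TokenDataset("shakespeare.txt")
print(data[0])  # the encoded tokens for the first subset fed to the model
```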

## Saving/Loading a TokenDataset

When creating a TokenDataset, you can automatically save it as a compressed gzipped numpy array when completed.

```python
```py3
data = TokenDataset("shakespeare.txt", save_cache=True)
```

Or save it after you've loaded it with the `save()` function.

```python
```py3
data = TokenDataset("shakespeare.txt")
data.save()
```

By default, it will save to `dataset_cache.tar.gz`. You can then reload that into another Python session by specifying the cache.

```python
```py3
data = TokenDataset("dataset_cache.tar.gz", from_cache=True)
```

@@ -58,7 +71,7 @@ data = TokenDataset("dataset_cache.tar.gz", from_cache=True)

## Using TokenDatasets with a Custom GPT-2 Model

The default TokenDataset has a `block_size` of `1024`, which corresponds to the _context window of the default GPT-2 model_. If you're using a custom model w/ a different maximum. Additionally, you must explicitly provide the vocab and merges files to rebuild the tokenizer, as the tokenizer will be different than the normal GPT-2 one.
The default TokenDataset has a `block_size` of `1024`, which corresponds to the _context window of the default GPT-2 model_. If you're using a custom model with a different maximum context length, set `block_size` to match it. Additionally, you must explicitly provide the tokenizer file to rebuild the tokenizer, as the tokenizer will be different than the normal GPT-2 one.
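
As a sketch, assuming a custom tokenizer saved as `aitextgen.tokenizer.json` and a model with a 64-token context window (both values are placeholders):

```py3
from aitextgen.TokenDataset import TokenDataset

data = TokenDataset("shakespeare.txt",
                    tokenizer_file="aitextgen.tokenizer.json",
                    block_size=64)
```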

See the [Model From Scratch](tutorials/model-from-scratch.md) docs for more info.

@@ -72,7 +85,7 @@ Merging processed TokenDatasets can be done with the `merge_datasets()` function
!!! note "About Merging"
The current implementation merges by subset count, so equalization may not be perfect, but it will not significantly impact training.

```python
```py3
from aitextgen.TokenDataset import TokenDataset, merge_datasets

data1 = TokenDataset("politics1000.csv", line_by_line=True) # 1000 samples
6 changes: 3 additions & 3 deletions docs/generate-performance.md
@@ -10,7 +10,7 @@ PyTorch has the ability to quantize models on the CPU. Currently, it will only q

To quantize a model after it's loaded, just run:

```python
```py3
ai.quantize()
```

@@ -22,13 +22,13 @@ Certain GPUs, notably the cheap T4 and the expensive V100, support the ability t

Assuming you are using a compatible GPU and already have [apex](https://github.com/NVIDIA/apex) installed, you can convert a model to the "half" FP16 mode with this:

```python
```py3
ai.to_fp16()
```

If you want to convert the model _before_ loading it into GPU memory (which may help avoid memory leaks), you can instantiate the model like this:

```python
```py3
ai = aitextgen(to_gpu=True, to_fp16=True)
```

18 changes: 11 additions & 7 deletions docs/generate.md
@@ -7,14 +7,18 @@ Thanks to the base Transformers package, aitextgen has more options for generati
See [this article](https://huggingface.co/blog/how-to-generate) by Huggingface engineer Patrick von Platen for how sampling and these parameters are used in practice.

- `n`: Number of texts generated.
- `max_length`: Maximum length of the generated text (default: 200; for GPT-2, the maximum is 1024.)
- `prompt`: Prompt that starts the generated text and is included in the generate text. (used to be `prefix` in previous tools)
- `max_length`: Maximum length of the generated text (default: 200; for GPT-2, the maximum is 1024; for GPT Neo, the maximum is 2048)
- `prompt`: Prompt that starts the generated text and is included in the generated text.
- `temperature`: Controls the "craziness" of the text (default: 0.7)
- `top_k`: If nonzero, limits the sampled tokens to the top _k_ values. (default: 0)
- `top_p`: If nonzero, limits the sampled tokens to the smallest set whose cumulative probability exceeds this value (nucleus sampling).
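
For example, a minimal sketch combining several of these parameters on a loaded `aitextgen` object `ai` (the prompt and values are placeholders):

```py3
ai.generate(n=3,
            max_length=100,
            prompt="I believe in unicorns because",
            temperature=0.7,
            top_p=0.9)
```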

Some lesser-known-but-still-useful-parameters that are unique to Transformers:

<!--prettier-ignore-->
!!! warning "Performance"
Enabling these parameters may slow down generation.

- `num_beams`: If greater than 1, executes beam search for cleaner text.
- `repetition_penalty`: If greater than 1.0, penalizes repetition in a text to avoid infinite loops.
- `length_penalty`: If greater than 1.0, penalizes text proportional to the length
@@ -30,17 +34,17 @@ Given an `aitextgen` object with a loaded model + tokenizer named `ai`:
want to generate on the GPU, make sure you call `ai.to_gpu()` beforehand, or
load the model into the GPU using `ai = aitextgen(to_gpu=True)`

- `ai.generate()`: Generates and prints text to console. If `prompt` is used, the `prompt` is bolded. (a la [Talk to Transformer](https://talktotransformer.com))
- `ai.generate()`: Generates and prints text to console. If `prompt` is used, the `prompt` is **bolded**.
- `ai.generate_one()`: A helper function which generates a single text and returns as a string (good for APIs)
- `ai.generate_samples()`: Generates multiple samples at specified temperatures: great for debugging.
- `ai.generate_to_file()`: Generates a bulk amount of texts to file. (this accepts a `batch_size` parameter which is useful if using on a GPU)
- `ai.generate_to_file()`: Generates a bulk amount of texts to file. (this accepts a `batch_size` parameter which is useful if using on a GPU, as it can generate texts in parallel with no performance loss)
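
For example, a minimal sketch of bulk generation (the counts are placeholders):

```py3
# write 300 texts to a file, generating 25 at a time
ai.generate_to_file(n=300, batch_size=25)
```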

<!-- prettier-ignore -->
!!! note "Cleanup"
By default, the `cleanup` parameter is set to True, which automatically removes texts that are blatantly malformed (e.g. only 2 characters long). Therefore, there may be less than `n` results returned. You can disabled this behavior by setting `cleanup=False`.
!!! note "lstrip and nonempty_output"
By default, the `lstrip` and `nonempty_output` parameters to `generate` are set to `True`, which alters the behavior of the generated text in a way that is most likely preferable. `lstrip`: Removes all whitespace at the beginning of the generated text. `nonempty_output`: If the output is empty (possible on shortform content), skip it if generating multiple texts, or try again if it's a single text. If `min_length` is specified, the same behavior occurs for texts below the minimum length after processing.
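
If you prefer the raw output, a minimal sketch that disables both behaviors:

```py3
ai.generate(lstrip=False, nonempty_output=False)
```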

## Seed

aitextgen has a new `seed` parameter for generation. Using any generate function with a `seed` parameter (must be an integer) and all other models/parameters the same, and the generated text will be identical. This allows for reproducible generations.
aitextgen has a new `seed` parameter for generation. If you call any generate function with a `seed` parameter (which must be an integer) while keeping all other models/parameters the same, the generated text will be identical. This allows for reproducible generations in case someone accuses you of faking the AI output.
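
For example, a minimal sketch (any integer works as the seed):

```py3
ai.generate(prompt="I believe in unicorns because", seed=42)
```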

For `generate_to_file()`, the 8-digit number at the end of the file name will be the seed used to generate the file, making reproducibility easy.
12 changes: 7 additions & 5 deletions docs/index.md
@@ -1,12 +1,14 @@
# aitextgen

A robust tool for advanced AI text generation via [GPT-2](https://openai.com/blog/better-language-models/).
_Last Updated: April 18th, 2021 (aitextgen v0.5.0)_

aitextgen is a Python package that leverages [PyTorch](https://pytorch.org), [Huggingface Transformers](https://github.com/huggingface/transformers) and [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) with specific optimizations for text generation using GPT-2, plus _many_ added features. It is the successor to [textgenrnn](https://github.com/minimaxir/textgenrnn) and [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple), taking the best of both packages:
A robust Python tool for text-based AI training and generation using [OpenAI's](https://openai.com) [GPT-2](https://openai.com/blog/better-language-models/) and [EleutherAI's](https://www.eleuther.ai) [GPT Neo/GPT-3](https://github.com/EleutherAI/gpt-neo) architecture.

- Finetunes on a pretrained 124M GPT-2 model from OpenAI...or create your own GPT-2 model + tokenizer and train from scratch!
- Generates text faster than gpt-2-simple and with better memory efficiency! (even [from the 1.5B GPT-2 model](tutorials/generate_1_5b/)!)
- With Transformers, aitextgen preserves compatibility with the base package, allowing you to use the model for other NLP tasks, download custom GPT-2 models from the Huggingface model repository, and upload your own models! Also, it uses the included `generate()` function to allow a massive amount of control over the generated text.
aitextgen is a Python package that leverages [PyTorch](https://pytorch.org), [Hugging Face Transformers](https://github.com/huggingface/transformers) and [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) with specific optimizations for text generation using GPT-2, plus _many_ added features. It is the successor to [textgenrnn](https://github.com/minimaxir/textgenrnn) and [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple), taking the best of both packages:

- Finetunes on a pretrained 124M/355M/774M GPT-2 model from OpenAI or a 125M/350M GPT Neo model from EleutherAI...or create your own GPT-2/GPT Neo model + tokenizer and train from scratch!
- Generates text faster than gpt-2-simple and with better memory efficiency!
- With Transformers, aitextgen preserves compatibility with the base package, allowing you to use the model for other NLP tasks, download custom GPT-2 models from the HuggingFace model repository, and upload your own models! Also, it uses the included `generate()` function to allow a massive amount of control over the generated text.
- With pytorch-lightning, aitextgen trains models not just on CPUs and GPUs, but also _multiple_ GPUs and (eventually) TPUs! It also includes a pretty training progress bar, with the ability to add optional loggers.
- The input dataset is its own object, allowing you to not only easily encode megabytes of data in seconds, cache, and compress it on a local computer before transporting to a remote server, but you are able to _merge_ datasets without biasing the resulting dataset, or _cross-train_ on multiple datasets to create blended output.

49 changes: 33 additions & 16 deletions docs/load-model.md
@@ -8,45 +8,62 @@ There are several ways to load models.

## Loading an aitextgen model

The closer to the default 124M GPT-2 model, the fewer files you need!
For the base case, loading the default 124M GPT-2 model via Huggingface:

For the base case, loading the default model via Huggingface:

```python
```py3
ai = aitextgen()
```

The model will be downloaded to `cache_dir`: `/aitextgen` by default.
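
To cache the model somewhere else, a minimal sketch (the folder name is a placeholder):

```py3
ai = aitextgen(cache_dir="my_models")
```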

If you've finetuned a 124M GPT-2 model using aitextgen, you can pass the generated `pytorch_model.bin` to aitextgen:
If you're loading a custom model for a different GPT-2/GPT-Neo architecture _from scratch_ but with the normal GPT-2 tokenizer, you can pass only a config.

```python
ai = aitextgen(model="pytorch_model.bin")
```py3
from aitextgen.utils import GPT2ConfigCPU
config = GPT2ConfigCPU()
ai = aitextgen(config=config)
```

If you're loading a finetuned model of a different GPT-2 architecture, you'll must also pass the generated `config.json` to aitextgen:
While training/finetuning a model, two files will be created: the `pytorch_model.bin` which contains the weights for the model, and a `config.json` illustrating the architecture for the model. Both of these files are needed to reload the model.

If you've finetuned a model using aitextgen (the default model), you can pass the **folder name** containing the generated `pytorch_model.bin` and `config.json` to aitextgen (e.g. `trained_model`, which is where trained models will be saved by default).

<!--prettier-ignore-->
!!! note "Same Directory"
If both files are in the current directory, you can pass `model_folder="."`.

```python
ai = aitextgen(model="pytorch_model.bin", config=config)
```py3
ai = aitextgen(model_folder="trained_model")
```

If you want to download an alternative GPT-2 model from Huggingface's repository of models, pass that model name to `model`.
These examples assume you are using the default GPT-2 tokenizer. If you have a _custom tokenizer_, you'll need to pass that along with loading the model.

```py3
ai = aitextgen(model_folder="trained_model",
               tokenizer_file="aitextgen.tokenizer.json")
```

```python
If you want to download an alternative GPT-2 model from Hugging Face's repository of models, pass that model name to `model`.

```py3
ai = aitextgen(model="minimaxir/hacker-news")
```

The model and associated config + tokenizer will be downloaded into `cache_dir`.

## Loading TensorFlow-based GPT-2 models
This can also be used to download the [pretrained GPT Neo models](https://huggingface.co/EleutherAI) from EleutherAI.

aitextgen lets you download the models from Google's servers that OpenAI had uploaded back when GPT-2 was first released in 2019. These models are then converted to a PyTorch format.
```py3
ai = aitextgen(model="EleutherAI/gpt-neo-125M")
```

## Loading TensorFlow-based GPT-2 models

It's counterintuitive, but it's _substantially_ faster than downloading from Huggingface's servers, especially if you are running your code on Google Cloud Platform (e.g. Colab notebooks).
aitextgen lets you download the models from Microsoft's servers that OpenAI had uploaded back when GPT-2 was first released in 2019. These models are then converted to a PyTorch format.

To use this workflow, pass the corresponding model number to `tf_gpt2`:

```python
```py3
ai = aitextgen(tf_gpt2="124M")
```

4 changes: 2 additions & 2 deletions docs/loggers.md
@@ -6,14 +6,14 @@ You can create loggers with popular tools such as [TensorBoard](https://www.tens

For example, if you want to create a TensorBoard logger, you can create it:

```python
```py3
from pytorch_lightning import loggers

tb_logger = loggers.TensorBoardLogger('logs/')
```

Then pass it to the `loggers` parameter for `ai.train()`.

```python
```py3
ai.train(train_data=data, loggers=tb_logger)
```
10 changes: 5 additions & 5 deletions docs/save-model.md
@@ -2,15 +2,15 @@

There are multiple ways to save models.

Whenever a model is saved, two files are generated: `pytorch_model.bin` which contains the model weights, and `config.json` which is needed to load the model if it is not the base 124M GPT-2.
Whenever a model is saved, two files are generated: `pytorch_model.bin` which contains the model weights, and `config.json` which is needed to load the model.

Assuming we have an aitextgen model `ai`:

## Ad Hoc saving

The aitextgen model can be saved at any time using `save`.

```python
```py3
ai.save()
```

@@ -24,7 +24,7 @@ If you are using Google Colaboratory, you can mount your personal Google Drive t

First mount your Google Drive using `mount_gdrive()`:

```python
```py3
from aitextgen.colab import mount_gdrive, copy_file_to_gdrive
mount_gdrive()
```
@@ -33,7 +33,7 @@ You'll be asked for an auth code; input it and press enter, and a `My Drive` fol

You can drag and drop the model files into the Google Drive, or use `copy_file_to_gdrive` to copy them programmatically.

```python
```py3
copy_file_to_gdrive("pytorch_model.bin")
copy_file_to_gdrive("config.json")
```
@@ -48,7 +48,7 @@ Concerned about timeouts in Google Colab? aitextgen has a feature that will copy

As long as your drive is mounted as above, pass `save_gdrive = True` to the `train()` function:

```python
```py3
ai.train(save_gdrive=True)
```

