# Natural Language Processing Walkthrough
Please make a copy of this notebook to work on in Colab or your favorite Jupyter environment.

Note that this is an advanced tutorial walking through a specific use case of aitextgen. It is highly recommended that you visit Max Woolf's [tutorial](https://colab.research.google.com/drive/144MdX5aLqrQ3-YW-po81CQMrD6kpgpYh?usp=sharing#scrollTo=LdpZQXknFNY3) on aitextgen custom model training before continuing forward with this tutorial.

If you find any typos or have questions, enhancement requests, or suggestions, please reach out to [mitchellbcutts@gmail.com](mailto:mitchellbcutts@gmail.com).


# Setup
This tutorial will walk you through the Python and machine learning side of a GPT text generation project. The dataset for this project is taken from [Kaggle; you can click here to check it out](https://www.kaggle.com/neisse/scrapped-lyrics-from-6-genres?select=artists-data.csv). The overall project this tutorial will build is a *Rock Lyric Generator*!

Disclaimer: This tutorial was built using three sources that are worth checking out:
  - Max Woolf's [tutorial](https://colab.research.google.com/drive/144MdX5aLqrQ3-YW-po81CQMrD6kpgpYh?usp=sharing#scrollTo=LdpZQXknFNY3) on aitextgen custom model training
  - Max Woolf's [tutorial](https://colab.research.google.com/drive/15qBZx5y9rdaQSyWpsreMDnTiZ5IlN0zD?usp=sharing) on aitextgen GPT-2 fine tuning
  - Francois St-Amant's [tutorial](https://towardsdatascience.com/how-to-fine-tune-gpt-2-for-text-generation-ae2ea53bc272) on Fine Tuning GPT-2 and evaluation

For CPU training see: https://github.com/minimaxir/aitextgen/blob/master/notebooks/training_hello_world.ipynb

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd /content/drive/MyDrive/LYRICBOT
!pip3 install -q aitextgen

[Errno 2] No such file or directory: '/content/drive/MyDrive/LYRICBOT'
/content

In [None]:
#imports
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )
from sklearn.model_selection import train_test_split
from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset
from aitextgen.utils import build_gpt2_config
from aitextgen.tokenizers import train_tokenizer
import pandas as pd
import numpy as np

03/18/2022 19:52:01 — INFO — numexpr.utils — NumExpr defaulting to 2 threads.


# Linux, Command Line and Pandas Dataset Trimming
Since Colab and Cocalc run on Linux, I'm going to take advantage of some Linux commands to download my dataset.

I then will use pandas to subset my dataset as shown below:

- _original dataset:_ over 190,000 song lyrics in multiple languages and genres 

- _finalized dataset:_ about 95,000 song lyrics (94,992) that are in English and belong to the "rock" genre. 

Now our model will have much more specific data to train on, and we can expect it to output English text in the style of rock songs.

**Note: because this is a comprehensive tutorial, I wanted to include some preprocessing steps and options. If your data is simply a text file, you will not need these steps.**
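
The filtering described above boils down to two steps: collect the links of artists tagged as rock, then keep the English lyrics whose artist link is in that set. Here is a minimal sketch of that logic in plain Python so it can be read without pandas. The rows are invented stand-ins for the two CSV files; only the column names (`Link`, `Genres`, `ALink`, `language`, `Lyric`) come from the notebook.

```python
# Toy stand-ins for artists-data.csv and lyrics-data.csv (all values are invented).
artists = [
    {"Link": "/band-a/", "Genres": "Rock; Hard Rock"},
    {"Link": "/band-b/", "Genres": "Pop"},
    {"Link": "/band-c/", "Genres": None},  # missing genre, treated as "not rock"
]
lyrics = [
    {"ALink": "/band-a/", "language": "en", "Lyric": "first song"},
    {"ALink": "/band-a/", "language": "pt", "Lyric": "segunda musica"},
    {"ALink": "/band-b/", "language": "en", "Lyric": "third song"},
]

# Step 1: links of artists whose genre string mentions "Rock".
rock_links = {a["Link"] for a in artists if a["Genres"] and "Rock" in a["Genres"]}

# Step 2: English lyrics whose artist link is in that set.
rock_english = [
    row for row in lyrics
    if row["language"] == "en" and row["ALink"] in rock_links
]

print([row["Lyric"] for row in rock_english])
```

The pandas cells below do the same thing with `str.contains("Rock", na=False)` for step 1 and `isin(artist_links)` for step 2.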

In [None]:
# Downloads the dataset archive into the current directory. Only needed the first time you run this notebook; comment it out afterwards.
!gdown 13WCZxed3GcGMvcGX69y5D2rfhFtA8w0a
# Unzips the dataset archive. Only needed the first time you run this notebook; comment it out afterwards.
!unzip lyrics_dataset.zip
# Lists the current directory so you can check that the CSV files are there.
%ls

In [None]:
artists = pd.read_csv("artists-data.csv")
lyrics = pd.read_csv("lyrics-data.csv")

In [None]:
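# Artists with a missing genre are excluded by na=False, so no separate dropna() call is needed
# (the original unassigned artists.dropna(axis=0) call had no effect).
artists = artists[artists["Genres"].str.contains("Rock", na=False)]
artists.head() #now we have all the artists that make rock music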

In [None]:
# List of artist links (the key shared with the lyrics table) for artists that make rock music,
# so we can cross-reference the lyrics data and keep songs that belong to rock or an adjacent genre.
artist_links = artists["Link"].tolist()

In [None]:
# Keep only English lyrics; rows with a missing language fail this comparison and are dropped too.
lyrics = lyrics[lyrics['language'] == 'en']
lyrics.head() #now we have all English lyrics

In [None]:
master_lyrics = lyrics[lyrics["ALink"].isin(artist_links)]
len(master_lyrics) #our improved dataset is 94992 rock or rock-adjacent songs.

In [None]:
master_lyrics.head()

# Train/Test Split and Preprocessing for Training
Now that we have our data frame subsetted into 94992 samples of rock lyrics, we will perform a train/test split on our data so we can perform BLEU metric evaluation later in the project.

Note that performing a train/test split is not something you will always have to do, and you should avoid it if your dataset is simply a text file that can't be broken up into several observations (such as the classic example of training on [the complete Shakespeare works](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt)). If you don't perform a train/test split, you will be evaluating your model qualitatively by human inspection. 

This dataset is a good candidate for a train/test split because it consists of tens of thousands of separate songs rather than a single text file. After the split, we will also load the training data into a `TokenDataset` from aitextgen, which encodes the data for training!

**Notes:** we will be manipulating the test dataframe later when we get to evaluation. For now, we will leave it as is. Also, if your dataset is a text file, you will not need to perform these steps and will perform manual evaluation. 
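
To make the effect of the `test_size` fraction concrete, here is a small standard-library sketch of a shuffled hold-out split. It is not scikit-learn's implementation; `holdout_split` is a hypothetical helper, and rounding the held-out count up is an assumption chosen to match my understanding of how `train_test_split` treats a fractional `test_size`.

```python
import math
import random

def holdout_split(samples, test_size, seed=0):
    """Shuffle a copy of `samples` and hold out a `test_size` fraction of it."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))  # assumed rounding: up
    return shuffled[n_test:], shuffled[:n_test]

songs = [f"song {i}" for i in range(1000)]
train_part, test_part = holdout_split(songs, test_size=0.25)
print(len(train_part), len(test_part))  # 750 250

# With the sizes used in this notebook: 94992 songs and test_size=0.002.
big_train, big_test = holdout_split(range(94992), test_size=0.002)
print(len(big_train), len(big_test))  # 94802 190
```

A small test fraction keeps almost all songs for training while still leaving a couple of hundred unseen songs to score against.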

In [None]:
%pwd

In [None]:
# sklearn's train/test split: test_size=0.002 holds out 0.2% of the songs (about 190) for evaluation and trains on the rest
train, test = train_test_split(master_lyrics, test_size=0.002)
print("training shape: ", train.shape, " testing shape: ", test.shape)

training_samples = train.Lyric.values.tolist()
# print("\n\nprinting part of a training sample: \n\n" + training_samples[0][15:240]) # if you want to inspect output

# Training

Now here is where the magic happens! This next section will guide you through training your model. 

The main decision you will have to make is whether you will train your own custom tokenizer + model from scratch or simply tune an existing model to fit your needs. Here is a snippet from `aitextgen`'s documentation that should help you decide.
    
    The original GPT-2 model released by OpenAI was trained on English webpages linked to from Reddit. 
    It has a strong bias toward longform content (multiple paragraphs). 
    
    If that is *not* your use case, you may get a better generation quality *and* speed by training your own model and Tokenizer.
    Examples of good use cases for training your own tokenizer:
    
    - Short-form content (e.g. Tweets, Reddit post titles)
    - Non-English Text
    - Heavily Encoded Text
    
    It still will require a *massive* amount of training time (several hours) but will be more flexible.

In this tutorial, we will cover both [training our own custom model and tokenizer](https://docs.aitextgen.io/tutorials/model-from-scratch/) as well as [fine tuning a GPT-2 model](https://colab.research.google.com/drive/15qBZx5y9rdaQSyWpsreMDnTiZ5IlN0zD?usp=sharing). Note that you only need to do one in your project. 

In [None]:
!nvidia-smi -L #figure out what type of GPU we are using

## Training your own custom model and tokenizer
Follow these steps to train your own custom model and tokenizer. Remember, here are the use cases for training your own tokenizer.

- Short-form content (e.g. Tweets, Reddit post titles)
- Non-English Text
- Heavily Encoded Text

**IMPORTANT NOTE FOR TIME MANAGEMENT:** Note that this will require a *massive* amount of training time (several hours) but will be more flexible so **if you are working on short-form text generation, use the fine-tuning method first** (next section) so you at very least have a working model before taking all the time needed to train your own tokenizer and custom model.



### Training the Tokenizer
The `train_tokenizer()` function wraps the training method of the tokenizers package from Hugging Face. It needs a file to train on, so we will simply export our training lyrics to `.txt` format to satisfy the `train_tokenizer()` argument requirements.

After the training is completed, this will save one file: aitextgen.tokenizer.json, which is needed to rebuild the tokenizer (in other words save it for later).
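
To give a feel for what "training a tokenizer" means, here is a toy sketch of the merge step at the heart of byte-pair encoding (BPE). This assumes the tokenizer trained here is BPE-style, which is my understanding of aitextgen but is not stated in this notebook. The word counts are made up, and `most_frequent_pair` and `merge_pair` are hypothetical helpers, not part of any library.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes a single symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Word frequencies from a made-up corpus, each word split into characters.
words = {tuple("rock"): 5, tuple("rocks"): 3, tuple("roll"): 4}
for _ in range(2):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", list(words))
```

A real tokenizer repeats this merge until it reaches the requested vocabulary size, then stores the learned merges so the same splits can be reproduced later.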

In [None]:
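file_name = "/content/drive/MyDrive/LYRICBOT/tokenizer_input.txt"
os.makedirs(os.path.dirname(file_name), exist_ok=True) # create the folder if it does not exist yet
# Write the raw text of every training song, separated by newlines.
# Mode 'w' overwrites the file, so re-running this cell does not duplicate the lyrics.
with open(file_name, 'w', encoding="utf-8") as f:
    f.write("\n".join(train["Lyric"].astype(str)))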

In [None]:
train_tokenizer(file_name)
tokenizer_file = "aitextgen.tokenizer.json"

In [None]:
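# save_cache=True writes the encoded dataset to dataset_cache.tar.gz in the current working directory,
# so run this cell from /content/drive/MyDrive/LYRICBOT for training_file below to point at that file.
training_dataset = TokenDataset(texts=training_samples, tokenizer_file=tokenizer_file, save_cache=True)
training_file = "/content/drive/MyDrive/LYRICBOT/dataset_cache.tar.gz"
# training_dataset = TokenDataset("dataset_cache.tar.gz", from_cache=True) #if you want to load the cache later!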

### Specify a model configuration
You can use `build_gpt2_config` to specify a model configuration. You will most likely want to adjust `max_length` (context window size) and `n_embd` (embedding size). The config used here is the one used to build a demo Reddit model; I recommend you experiment with these parameters and read up on context windows and embedding size before camp!
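
To see how these settings translate into model size, here is a rough parameter count for a GPT-2 style decoder. This is a sketch under stated assumptions: the standard GPT-2 block layout (MLP hidden size of 4 × `n_embd`, biases on every linear layer) and an output layer that shares weights with the token embedding. `gpt2_param_count` is a hypothetical helper, not an aitextgen function.

```python
def gpt2_param_count(vocab_size, max_length, n_embd, n_layer):
    """Approximate parameter count for a GPT-2 style decoder with tied output embedding."""
    token_emb = vocab_size * n_embd
    position_emb = max_length * n_embd
    # Per block: attention (4*d*d weights + 4*d biases), an MLP with a 4x hidden
    # layer (8*d*d weights + 5*d biases), and two LayerNorms (4*d in total).
    per_block = 12 * n_embd * n_embd + 13 * n_embd
    final_norm = 2 * n_embd
    return token_emb + position_emb + n_layer * per_block + final_norm

# The configuration used in this notebook.
print(gpt2_param_count(vocab_size=5000, max_length=32, n_embd=256, n_layer=8))  # 7606784

# Doubling the embedding size roughly quadruples the size of each block.
print(gpt2_param_count(vocab_size=5000, max_length=32, n_embd=512, n_layer=8))
```

Note that `n_head` does not appear in the count: it changes how the attention weights are split, not how many there are.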

In [None]:
config = build_gpt2_config(vocab_size=5000, max_length=32, dropout=0.0, n_embd=256, n_layer=8, n_head=8)
config

### Instantiating Your Custom GPT-2 Model

In [None]:
ai_custom = aitextgen(config=config, tokenizer_file=tokenizer_file)

### Training Your Custom GPT-2 Model

The next cell will start the actual training of GPT-2 in aitextgen. It runs for `num_steps`, and a progress bar will appear to show training progress, current loss (the lower the better the model), and average loss (to give a sense on loss trajectory).

The model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will _also_ be saved there in a unique folder.

The training might time out after 4ish hours; if you did not mount to Google Drive, make sure you end training and save the results so you don't lose them! (if this happens frequently, you may want to consider using [Colab Pro](https://colab.research.google.com/signup))

Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally (as noted in the previous section) and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM. _Unlike finetuning, since you are using a small model, you can massively increase the batch size to normalize the training_.
- **`fp16`**: Enables half-precision training for faster/more memory-efficient training. Only works on a T4 or V100 GPU.

Here are other important parameters for `train()` that are useful but you likely do not need to change.

- **`learning_rate`**: Learning rate of the model training.
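
To get a sense of what a given combination of these parameters asks of the GPU session, the arithmetic below estimates how many sequences and tokens a run touches and at which steps a checkpoint is written. It assumes one batch of `batch_size` sequences per step (no gradient accumulation), sequences of `block_size` tokens equal to the `max_length` of 32 from the config, and checkpoints at exact multiples of `save_every`. `training_budget` is a hypothetical helper for illustration only.

```python
def training_budget(num_steps, batch_size, block_size, save_every):
    """Summarise how much data a run touches and when checkpoints are written."""
    sequences_seen = num_steps * batch_size
    tokens_seen = sequences_seen * block_size
    checkpoints = list(range(save_every, num_steps + 1, save_every))
    return sequences_seen, tokens_seen, checkpoints

seqs, tokens, ckpts = training_budget(
    num_steps=10000, batch_size=256, block_size=32, save_every=500
)
print(seqs)        # 2560000 sequences
print(tokens)      # 81920000 tokens
print(len(ckpts))  # 20 checkpoints, the last one at step 10000
```

If a session is likely to time out, a smaller `save_every` means less work lost between checkpoints at the cost of more frequent writes.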


In [None]:
ai_custom.train(training_file,
                line_by_line=False,
                from_cache=True,
                num_steps=10000,
                generate_every=500,
                save_every=500,
                save_gdrive=True,
                learning_rate=1e-3,
                output_dir = "/content/drive/MyDrive/LYRICBOT/custom_model",
                batch_size=256)

## Fine-Tuning GPT-2  
Follow these steps to fine-tune an existing model. This will perform best on longer texts, but it can be used as a baseline model for shorter texts as well. 

For this project, training a custom model is probably the better choice, since song lyrics are short pieces of text. Think about what would be best for your project!



In [None]:
training_dataset = TokenDataset(texts=training_samples, save_cache=True) #no custom tokenizer here, so we rebuild the dataset with the default GPT-2 tokenizer
training_file = "/content/drive/MyDrive/LYRICBOT/dataset_cache.tar.gz"
# Below are some options for models you can use on this project. We have found distilgpt2 to be faster than GPT-Neo and
# GPT-Neo to be faster than GPT-2 124M. GPT-2 and GPT-Neo produce output of similar quality, with GPT-Neo being the faster of the two.
# Uncomment the model you want to use!
# ai_tuned = aitextgen(tf_gpt2="124M", to_gpu=True) #will download model into current directory.
# ai_tuned = aitextgen(model="EleutherAI/gpt-neo-125M", to_gpu=True) #will download model into current directory.
ai_tuned = aitextgen(model="distilgpt2", to_gpu=True) #will download model into current directory.

### Finetune GPT-2

The next cell will start the actual finetuning of GPT-2 in aitextgen. It runs for `num_steps`, and a progress bar will appear to show training progress, current loss (the lower the better the model), and average loss (to give a sense on loss trajectory).

The model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will _also_ be saved there in a unique folder.

The training might time out after 4ish hours; if you did not mount to Google Drive, make sure you end training and save the results so you don't lose them! (if this happens frequently, you may want to consider using [Colab Pro](https://colab.research.google.com/signup))

Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally (as noted in the previous section) and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells
- **`fp16`**: Enables half-precision training for faster/more memory-efficient training. Only works on a T4 or V100 GPU.

Here are other important parameters for `train()` that are useful but you likely do not need to change.

- **`learning_rate`**: Learning rate of the model training.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM. (if using `fp16`, you can increase the batch size more safely)

In [None]:
ai_tuned.train(training_file,
         line_by_line=False,
         from_cache=True,
         num_steps=3000,
         generate_every=500,
         save_every=500,
         save_gdrive=True,
         learning_rate=1e-3,
         batch_size=1,
         output_dir = "tuned_model")

# Inference, Generation and Results
Now that our models are trained, let's see what their output looks like in terms of lyric generation! The following code is very simple but will show how to load in models and generate text in seconds!

Your best resource for learning how to generate text can be found on [aitextgen's documentation](https://docs.aitextgen.io/generate/).
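
Two of the generation settings used in this section, `temperature` and `top_p`, control how the next token is sampled. The sketch below shows the general idea on a made-up four-token distribution, using only the standard library. It is not aitextgen's or Hugging Face's code; `apply_temperature` and `top_p_filter` are hypothetical helpers.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalised."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.5, -1.0]  # invented scores for four candidate tokens
for t in (0.7, 1.0):
    probs = apply_temperature(logits, t)
    survivors = sorted(top_p_filter(probs, 0.9))
    print(t, [round(p, 3) for p in probs], survivors)
```

Lower temperatures and smaller `top_p` values both make the output more predictable; higher values make it more varied and more error-prone.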

### Loading in Models

In [None]:
#LOAD IN A MODEL, let's use the custom model
ai = aitextgen(model_folder="custom_model", tokenizer_file="/content/drive/MyDrive/LYRICBOT/aitextgen.tokenizer.json", to_gpu=True)

### Generating Lyrics

In [None]:
ai.generate(n=1, batch_size=5, prompt="I wanna dance with somebody", max_length=500, temperature=1, top_p=0.9)

# Evaluation
In classification, we use metrics like accuracy, precision, recall, F1-score, the list goes on...

In text generation, it is very hard to quantify how well a model is generating text based on a prompt. For this reason, most use cases will require manual inspection of output (reading and judging the quality), and this evaluation method will be the norm for many teams at camp. 

With that said, there are metrics that do exist, and this example will walk you through how to use the BLEU metric to see how well your model generates text given a prompt. 

Note: This code was adapted from https://towardsdatascience.com/how-to-fine-tune-gpt-2-for-text-generation-ae2ea53bc272. You can read more about the logic on that page!

***NOTE: BLEU (BiLingual Evaluation Understudy) is a metric for automatically evaluating machine-translated text. The BLEU score is a number between zero and one that measures the similarity of the machine-translated text to a set of high quality reference translations.***
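
To make the metric less of a black box, here is a simplified BLEU computed by hand with the standard library. It is a sketch, not NLTK's implementation: it uses a single reference, n-grams only up to bigrams (NLTK defaults to 4-grams), and no smoothing. `simple_bleu` is a hypothetical helper and the example lines are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(reference, candidate, max_n=2):
    """BLEU with uniform weights over 1..max_n grams and one tokenized reference."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped overlap: a candidate n-gram only counts as often as it appears in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty punishes candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "we will rock you tonight".split()
print(simple_bleu(reference, reference))                     # 1.0
print(simple_bleu(reference, "we will rock you".split()))    # every n-gram matches, but the candidate is too short
print(simple_bleu(reference, "something else entirely".split()))  # 0.0
```

Both inputs are lists of tokens, not raw strings; the same holds for NLTK's `sentence_bleu`, which additionally expects the references wrapped in a list.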

In [None]:
def text_generation(ai, test_data):
  generated_lyrics = []
  # Iterate over the values so the loop works whatever index the data frame carries.
  for prompt in test_data["Lyric"]:
    if len(prompt) == 0:
      # Keep the output aligned with the input rows; missing values are dropped later with dropna().
      generated_lyrics.append(np.nan)
      continue
    try:
      x = ai.generate(n=1, prompt=prompt, max_length=2000, temperature=0.7, return_as_list=True, nonempty_output=False)
      generated_lyrics.append(x[0])
    except Exception as exception:
      # Covers AssertionError as well: record a missing value and move on.
      generated_lyrics.append(np.nan)
      # print("Exception thrown: ", exception)
  return generated_lyrics

In [None]:
# this code was adapted from https://towardsdatascience.com/how-to-fine-tune-gpt-2-for-text-generation-ae2ea53bc272
import statistics
from nltk.translate.bleu_score import sentence_bleu

# Reset the index so rows can be addressed by position 0..n-1 in the cells below.
test_set = pd.DataFrame(test["Lyric"]).reset_index(drop=True)
# Words 20-39 of each song are the reference text; words 0-9 are the prompt given to the model.
test_set['True_end_lyrics'] = test_set['Lyric'].str.split().str[20:40].apply(' '.join)
test_set['Lyric'] = test_set['Lyric'].str.split().str[0:10].apply(' '.join)

generated_lyrics = text_generation(ai, test_set)


In [None]:
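test_set = test_set.reset_index(drop=True) # align row labels with the positions in generated_lyrics
test_set['Generated_lyrics'] = pd.Series(generated_lyrics)
test_set.dropna(inplace=True)
final_set = test_set.reset_index(drop=True)
final_set.head()

scores = []
samples_count = len(final_set)

for i in range(len(final_set)):
  # sentence_bleu expects a list of tokenized references and a tokenized candidate, not raw strings.
  reference = [final_set['True_end_lyrics'][i].split()]
  candidate = final_set['Generated_lyrics'][i].split()
  scores.append(sentence_bleu(reference, candidate))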

score = statistics.mean(scores)

print(f'\n\nOverall BLEU score for this model, ran on a set of {samples_count} testing samples: {score}')