# Building an AWD-LSTM Model for Story Generation

Text generation is the task of producing new text from an input prompt. A well-known example is GitHub Copilot, which generates code. Beyond code, text generation covers tasks such as:

> - **Story Generation**: For instance, inputting "Once upon a time" into a GPT-2 model can lead to the creation of imaginative stories.
>
> - **Poetry Generation**: Models can also be used to compose poetry.
>
> - **Paragraph Completion**: They are capable of filling in missing parts of a paragraph, maintaining the flow and context.
>
> - **Article Summarization**: These models can succinctly summarize lengthy articles.
>
> - **Question Answering**: Given a context, models can be designed to answer various questions.

In this tutorial, we train an AWD-LSTM, an LSTM language model regularized with several forms of dropout, to generate stories. We train it as a causal language model: given the tokens seen so far, it predicts the next token, and in doing so it learns the patterns and structure of the training text.
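
To make the next-token objective concrete, here is a minimal sketch in plain Python, independent of fastai. The whitespace tokenization and the helper name `next_token_pairs` are illustrative assumptions; the actual pipeline uses fastai's own tokenizer.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs for a causal language model.

    At position i the model sees tokens[:i + 1] and must predict tokens[i + 1].
    """
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]


# Naive whitespace tokenization, used only for illustration.
tokens = "once upon a time there was".split()
pairs = next_token_pairs(tokens)

for context, target in pairs:
    print(f"{' '.join(context):<28} -> {target}")
```

Every position in a story yields one training example, which is why a language model can learn from raw text without separate labels.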

**Our Workflow Includes:**

> - **Dataset Loading**: First, we'll load our data from CSV files, prepping it for the model.
>
> - **Text Tokenization and Preprocessing**: We'll process the text, converting it into a tokenized format suitable for our model.
>
> - **Batch Creation**: Next, we organize the tokenized text into batches for efficient training.
>
> - **Training the AWD-LSTM Model**: We fine-tune the AWD-LSTM model, adapting it to the task of story generation.
>
> - **Model Evaluation**: Finally, we evaluate the trained model using metrics such as accuracy and perplexity.


## Parameters for notebook execution:

We keep every parameter the notebook needs in one place, so they are easy to find and change.

In [None]:
# parameters

# SAMPLE
TRAIN_ROWS = 50000
TEST_ROWS = 5000

# PATH OF CSV FILES
DIR_PATH = './Downloads'
TRAIN_PATH = DIR_PATH + "/data/train_df.csv"
VALID_PATH = DIR_PATH + "/data/valid_df.csv"
TEST_PATH = DIR_PATH + "/data/test_df.csv"

# DATA PROCESSING
CONTEXT_LEN = 256

# HYPERPARAMETERS
TRAIN_BS = 64
TEST_BS = 64
EPOCHS = 5

### Loading the Data

The first step is to load the dataset from CSV files. These files hold the raw story text the model learns from, so we load the training, validation, and test splits before moving on to preprocessing and training.

In [None]:
# import required libraries
from fastai.text.all import *
import pandas as pd

In [None]:
# load the data
train_df = pd.read_csv(TRAIN_PATH)
test_df = pd.read_csv(TEST_PATH)
valid_df = pd.read_csv(VALID_PATH)

Let's drop the unnecessary columns:
- `Unnamed: 0`
- `prompts`


In [None]:
train_df.drop(columns=['Unnamed: 0', 'prompts'], inplace=True)
test_df.drop(columns=['Unnamed: 0', 'prompts'], inplace=True)
valid_df.drop(columns=['Unnamed: 0', 'prompts'], inplace=True)

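Next, we sample a fixed number of rows from the training and validation sets (`TRAIN_ROWS` and `TEST_ROWS`). Setting `random_state=42` keeps the sample reproducible.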

In [None]:
train_df = train_df.sample(n=TRAIN_ROWS, random_state=42)
valid_df = valid_df.sample(n=TEST_ROWS, random_state=42)

fastai needs to know which rows belong to the training set and which to the validation set. We therefore add an `is_valid` column that is `False` for training rows and `True` for validation rows, and then concatenate both frames into a single DataFrame.

In [None]:
train_df['is_valid']=False
valid_df['is_valid']=True

In [None]:
all_df = pd.concat([train_df, valid_df],ignore_index=True)

### Creating DataBlock and Dataloaders

Following the data loading phase, we proceed to set up the `DataBlock` and `Dataloaders`:

1. **DataBlock Configuration**: `TextBlock.from_df` tokenizes the `stories` column and, with `is_lm=True`, prepares the data for language modeling. `ColSplitter` assigns rows to the training or validation set based on the `is_valid` column.

2. **Preprocessing Steps**: Tokenization and numericalization (mapping tokens to integer ids) happen inside the `DataBlock`, so every row is processed consistently.

3. **Dataloaders Initialization**: Using the `DataBlock`, we create `Dataloaders` that batch the data with batch size `TRAIN_BS` and sequence length `CONTEXT_LEN`.

4. **Efficient Data Handling**: The `Dataloaders` manage tasks such as shuffling and parallel data loading, enhancing training efficiency.

5. **Training Readiness**: With `Dataloaders`, the data is structured and presented in a way that is ideal for the training of our model.

In [None]:
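# Tokenize the 'stories' column for language modeling (is_lm=True).
# fastai stores the tokenized output in a column named 'text',
# which is why get_x reads 'text' rather than 'stories'.
dsets = DataBlock(blocks=TextBlock.from_df('stories', min_freq=3, is_lm=True),
                  get_x=ColReader('text'),
                  splitter=ColSplitter())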
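
The `min_freq=3` argument means that tokens seen fewer than three times are left out of the vocabulary and mapped to an unknown token. The sketch below illustrates that idea in plain Python. `build_vocab` and `numericalize` are hypothetical helpers written for this example; fastai's real vocabulary also contains its other special tokens and a maximum size, which are omitted here.

```python
from collections import Counter

UNK = "xxunk"  # fastai's unknown-token marker


def build_vocab(tokens, min_freq=3):
    """Keep tokens that occur at least `min_freq` times; UNK gets index 0."""
    counts = Counter(tokens)
    kept = sorted(tok for tok, c in counts.items() if c >= min_freq)
    return [UNK] + kept


def numericalize(tokens, vocab):
    """Map tokens to integer ids, sending out-of-vocabulary tokens to UNK."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, index[UNK]) for tok in tokens]


tokens = "the cat saw the dog and the dog saw the cat and the dog ran".split()
vocab = build_vocab(tokens, min_freq=3)
ids = numericalize(tokens, vocab)

print(vocab)
print(ids)
```

Only `the` and `dog` occur three or more times in this toy text, so every other word collapses to the unknown id. A higher `min_freq` gives a smaller vocabulary at the cost of more unknown tokens.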

In [None]:
dls = dsets.dataloaders(source=all_df,bs=TRAIN_BS,seq_len=CONTEXT_LEN)
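
For a language model, `bs` and `seq_len` control how one long token stream is cut into batches. The following is a simplified sketch of that layout in plain Python, not fastai's actual loader (which also shuffles the documents and handles edge cases differently). The function name `lm_batches` is an assumption made for this example.

```python
def lm_batches(token_ids, bs, seq_len):
    """Yield (inputs, targets) batches from one long token stream.

    The stream is cut into `bs` contiguous rows; each batch takes the next
    `seq_len` tokens from every row, and targets are the inputs shifted by one.
    """
    row_len = (len(token_ids) - 1) // bs  # keep one extra token for the last target
    rows = [token_ids[r * row_len : r * row_len + row_len + 1] for r in range(bs)]
    for start in range(0, row_len, seq_len):
        end = min(start + seq_len, row_len)
        inputs = [row[start:end] for row in rows]
        targets = [row[start + 1 : end + 1] for row in rows]
        yield inputs, targets


stream = list(range(21))  # stand-in for numericalized story tokens
batches = list(lm_batches(stream, bs=2, seq_len=5))
first_inputs, first_targets = batches[0]

print(first_inputs)
print(first_targets)
```

Because each row continues where it left off in the previous batch, the LSTM's hidden state can carry context across batches.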

## Training the Model:
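With the data prepared, we create a learner around fastai's AWD-LSTM, look for a suitable learning rate, and train with the one-cycle policy. We track accuracy and perplexity during training.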

In [None]:
learn = language_model_learner(dls, AWD_LSTM, drop_mult=0.3, 
    metrics=[accuracy, Perplexity()]
    ).to_fp16()

In [None]:
# reset the LSTM hidden state before training
learn.model.reset()

In [None]:
# find the optimal learning rate
learn.lr_find()

In [None]:
# train the language model
learn.fit_one_cycle(2, 1e-2)

In [None]:
# train for one more epoch at a lower learning rate
learn.fit_one_cycle(1, 2e-3)
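
`fit_one_cycle` varies the learning rate during training: it warms up to the maximum value and then anneals towards zero. The sketch below approximates that schedule in plain Python. The defaults `div=25`, `div_final=1e5`, and `pct_start=0.25` are assumed to mirror fastai's defaults, and the accompanying momentum schedule is left out.

```python
import math


def one_cycle_lr(pct, lr_max, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress `pct` in [0, 1].

    Warm up from lr_max / div to lr_max with cosine interpolation over the
    first `pct_start` of training, then anneal down to lr_max / div_final.
    """
    def cos_interp(start, end, p):
        return start + (1 + math.cos(math.pi * (1 - p))) * (end - start) / 2

    if pct < pct_start:
        return cos_interp(lr_max / div, lr_max, pct / pct_start)
    return cos_interp(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))


# Learning rate at a few points of a run with lr_max = 1e-2.
for pct in (0.0, 0.125, 0.25, 0.5, 1.0):
    print(f"{pct:>5.3f}  {one_cycle_lr(pct, 1e-2):.6f}")
```

The short warm-up lets training start gently, while the long annealing phase helps the model settle towards the end of the run.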

In [None]:
learn.save("story_awd_lstm")

Path('models/story_awd_lstm.pth')

## Perplexity

In [None]:
# validate() returns [loss, accuracy, perplexity]; index 2 is the perplexity
valid_perplexity = learn.validate()[2]

In [None]:
valid_perplexity

25.749881744384766
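
Perplexity is the exponential of the average cross-entropy loss, so a value of roughly 25.7 corresponds to a validation loss of about 3.25 nats per token. Informally, the model is about as uncertain as if it were choosing uniformly among 26 tokens at each step. The sketch below shows the relationship in plain Python; the helper name `perplexity` and the toy probabilities are illustrative only.

```python
import math


def perplexity(target_probs):
    """Perplexity from the probabilities a model assigned to the true next tokens."""
    nll = [-math.log(p) for p in target_probs]
    mean_loss = sum(nll) / len(nll)  # average cross-entropy in nats
    return math.exp(mean_loss)


# A model that always gives the true token probability 1/4 has perplexity 4.
uniform = perplexity([0.25] * 8)
print(uniform)

# Converting the reported validation perplexity back to a loss.
loss = math.log(25.749881744384766)
print(loss)
```

Lower perplexity means the model assigns higher probability to the tokens that actually occur.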

In [None]:
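def data(n=10):
    """Yield the first 10 words of each of the first `n` test stories as prompts."""
    # NOTE: `dataset` is not defined in this notebook; it is assumed to be a
    # Hugging Face-style dataset with a 'test' split and a 'stories' column.
    for i in range(n):
        yield " ".join(dataset['test'][i]['stories'].split(" ")[:10])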
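
The evaluation splits each story into a prompt (its first 10 words) and a reference continuation (the words that follow, up to word 100). A small sketch of that split on a toy string is shown here; `split_prompt_reference` is a hypothetical helper and not part of the notebook.

```python
def split_prompt_reference(story, prompt_words=10, max_words=100):
    """Split a story into a prompt and the reference continuation after it."""
    words = story.split(" ")
    prompt = " ".join(words[:prompt_words])
    reference = " ".join(words[prompt_words:max_words])
    return prompt, reference


# Toy story made of 12 placeholder words: w0 ... w11.
story = " ".join(f"w{i}" for i in range(12))
prompt, reference = split_prompt_reference(story)

print(prompt)
print(reference)
```

The model is given only the prompt, and its generated continuation is later compared with the reference.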

In [None]:
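predictions = []
references = []

# NOTE: `pipe` and `tokenizer` are not defined in this notebook; they are assumed
# to be a Hugging Face text-generation pipeline and its tokenizer.
for i, out in enumerate(pipe(data(), num_return_sequences=1, num_beams=3,
                             do_sample=True, max_new_tokens=100,
                             pad_token_id=tokenizer.eos_token_id)):
    # reference: words 10-99 of the original story (the part after the prompt)
    references.append(" ".join(dataset['test'][i]['stories'].split(" ")[10:100]))
    # prediction: the same word range of the generated text, which starts with the prompt
    predictions.append(" ".join(out[0]['generated_text'].split(" ")[10:100]))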