# Data Preparation
* Before the model receives the data, we need to pre-process and tokenize it. This notebook explores how to do that. Reference: https://pytorch.org/tutorials/beginner/transformer_tutorial.html

In [10]:
import torch
import pandas as pd
import numpy as np
import sys

### Dataset Selection
* We use a more closely related dataset, prepared by Stanford for their Alpaca model: a series of instructions and responses generated with OpenAI's text-davinci-003
* 52k unique instructions
* https://github.com/tatsu-lab/stanford_alpaca

### Modify Data for our Needs
* The original data is split into three columns: instruction, input, and output
* We only need two columns, so we merge input into instruction and keep output as the answer

In [None]:
# df = pd.read_json("datasets/alpaca/alpaca_data.json")
# df["instruction"] = df["instruction"].fillna("") + " " + df["input"].fillna("")
# df = df.drop(columns=["input"])

# split = int(len(df) * 0.80)
# train_df = df.iloc[:split]
# test_df = df.iloc[split:]
# train_df.to_csv("datasets/alpaca/train.csv")
# test_df.to_csv("datasets/alpaca/test.csv")
# train_df.head()

### Prep Dataframe as Iterator for vocab
* We combine train and test to ensure all words in each are included in the vocabulary

In [4]:
train_df = pd.read_csv("datasets/alpaca/train.csv")
test_df = pd.read_csv("datasets/alpaca/test.csv")
frames = [train_df, test_df]
comb_df = pd.concat(frames)
comb_df = comb_df.drop(comb_df.columns[0], axis=1)
comb_df.head()

| | instruction | output |
|---|---|---|
| 0 | Give three tips for staying healthy. | 1.Eat a balanced diet and make sure to include... |
| 1 | What are the three primary colors? | The three primary colors are red, blue, and ye... |
| 2 | Describe the structure of an atom. | An atom is made up of a nucleus, which contain... |
| 3 | How can we reduce air pollution? | There are a number of ways to reduce air pollu... |
| 4 | Describe a time when you had to make a difficu... | I had to make a difficult decision when I was ... |


In [5]:
print(f"Lengths match? {len(train_df) + len(test_df) == len(comb_df)}")

Lengths match? True


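* Next, we merge the instruction and output into a single `text` column. Building the vocabulary only requires that every word appears somewhere in the text, so the question/answer split does not matter here.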

In [6]:
comb_df["instruction"] = comb_df["instruction"].fillna("") + " " + comb_df["output"].fillna("")
comb_df = comb_df.drop(columns=["output"]).rename(columns={"instruction": "text"})

In [7]:
comb_df.iloc[0, 0]

'Give three tips for staying healthy.  1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'

* Turn df into iterator

In [8]:
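def gen_rows(df):
    # Yield the raw text of each row. str(row) would yield the namedtuple
    # repr ("Pandas(text='...')"), adding stray tokens and escaped newlines.
    for row in df.itertuples(index=False):
        yield str(row.text)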

combined_text = gen_rows(comb_df)
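
* Note that a generator like `combined_text` can only be consumed once. The sketch below uses a toy list (`rows`, a stand-in for the DataFrame) to show the effect: a second pass over the same generator yields nothing, so the generator should be re-created before re-running any cell that iterates over it.

```python
def gen_rows(texts):
    # Stand-in for the notebook's generator: yields one string per row.
    for text in texts:
        yield str(text)

rows = ["Give three tips.", "What are the primary colors?"]  # toy data

combined_text = gen_rows(rows)
first_pass = list(combined_text)   # consumes the generator
second_pass = list(combined_text)  # nothing left to yield

print(len(first_pass), len(second_pass))
```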

### Tokenizing
* torchtext has a built-in tokenizer (`basic_english`), but it is fairly naive. By combining it with another library (spaCy), we get more nuanced tokenization: rather than a simple split, spaCy understands that "don't" should be split into "do" and "n't".

In [9]:
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer('spacy', language='en_core_web_sm')
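
* To illustrate the difference, here is a toy comparison in plain Python. `toy_tokenize` is a hypothetical helper written for this example; it only mimics the contraction and punctuation handling and is not spaCy's actual algorithm.

```python
import re

text = "I don't know."

# Naive approach: split on whitespace only.
naive_tokens = text.split()

def toy_tokenize(s):
    # Toy illustration only (NOT spaCy's algorithm):
    # split off "n't" and separate punctuation from words.
    s = re.sub(r"n't\b", " n't", s)
    return re.findall(r"n't|\w+|[^\w\s]", s)

print(naive_tokens)
print(toy_tokenize(text))
```

* The naive split keeps "don't" and "know." as single tokens, so "know" and "know." would end up as separate vocabulary entries.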

### Vocabulary
* The vocabulary associates an integer index with each token extracted by our tokenizer. Building it can take a while, since we make an entry for every distinct token in the dataset.

In [10]:
from torchtext.vocab import build_vocab_from_iterator
def yield_tokens(data_iter):
    for text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(combined_text), min_freq=1, specials=["<unk>", "<pad>", "<sos>", "<eos>"])
vocab.set_default_index(vocab["<unk>"])
torch.save(vocab, 'vocabs/alpaca/vocab.pth')
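
* As a rough mental model of what the vocabulary does, here is a simplified pure-Python stand-in. `build_toy_vocab` and `lookup` are hypothetical helpers for this sketch, not torchtext's implementation; the ordering rule (specials first, then by descending frequency) is an assumption that mirrors the default behaviour.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials, min_freq=1):
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Specials come first, so they get the lowest indices.
    itos = list(specials)
    # Then tokens by descending frequency, ties broken alphabetically.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    return {tok: idx for idx, tok in enumerate(itos)}

specials = ["<unk>", "<pad>", "<sos>", "<eos>"]
stoi = build_toy_vocab([["the", "cat"], ["the", "dog"]], specials)
unk = stoi["<unk>"]

def lookup(token):
    # Unknown tokens fall back to <unk>, similar to set_default_index.
    return stoi.get(token, unk)

ids = [lookup(t) for t in ["the", "cat", "bird"]]
print(stoi)
print(ids)
```

* With this specials order, `<pad>` maps to 1 and `<sos>` maps to 2, which is consistent with the padded tensors shown in the sample batch at the end of the notebook.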

* Because building the vocabulary is slow, we save the object after the initial calculation. In later runs we can skip the build step and simply load the saved object.

In [11]:
vocab = torch.load('vocabs/alpaca/vocab.pth')
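
* The save-then-load pattern can be wrapped in a small helper so the expensive step only runs when no cached file exists. This is a minimal sketch: `load_or_build` and `slow_build` are hypothetical names, and `pickle` is used as a stand-in for `torch.save` / `torch.load` so the example runs without extra dependencies.

```python
import os
import pickle
import tempfile

def load_or_build(path, build_fn):
    """Return the cached object at `path`, building and saving it if missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj

calls = []

def slow_build():
    calls.append(1)  # record how often the expensive step runs
    return {"<unk>": 0, "<pad>": 1}

cache_path = os.path.join(tempfile.mkdtemp(), "vocab.pkl")
first = load_or_build(cache_path, slow_build)   # builds and saves
second = load_or_build(cache_path, slow_build)  # loads from disk
print(len(calls))
```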

In [26]:
type(vocab)

torchtext.vocab.vocab.Vocab

### Dataset
* The dataset class also cleans the data, removing entries whose question or answer is too long
* It pads each entry up to the maximum length so that every tensor has the same shape

In [12]:
sys.path.append("datasets/alpaca")
from alpaca_dataset import Alpaca_Dataset

In [13]:
train_dataset = Alpaca_Dataset("datasets/alpaca/train.csv", vocab, tokenizer)
test_dataset = Alpaca_Dataset("datasets/alpaca/test.csv", vocab, tokenizer)
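
* The internals of `Alpaca_Dataset` are not shown here, so the following is only a sketch of the filter-and-pad idea under stated assumptions. `encode` is a hypothetical helper; appending `<eos>` and adding `<sos>` only to the answer are assumptions (the latter is suggested by the sample batch, where only the second tensor's rows start with 2). The special indices follow from the specials order used when building the vocabulary.

```python
PAD, SOS, EOS = 1, 2, 3  # indices implied by the specials order in the vocab

def encode(token_ids, max_len, add_sos=False):
    """Return a fixed-length list of ids, or None if the entry is too long."""
    ids = ([SOS] if add_sos else []) + list(token_ids) + [EOS]
    if len(ids) > max_len:
        return None  # caller drops this entry
    return ids + [PAD] * (max_len - len(ids))

# Toy (question, answer) pairs of token ids; the second question is too long.
pairs = [([10, 11], [20, 21, 22]), ([10] * 50, [20])]
max_len = 6

dataset = []
for q, a in pairs:
    q_ids = encode(q, max_len)
    a_ids = encode(a, max_len, add_sos=True)
    if q_ids is not None and a_ids is not None:
        dataset.append((q_ids, a_ids))

print(dataset)
```

* Because every surviving entry has the same length, a DataLoader can stack the entries into a single tensor per batch without a custom collate function.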

In [14]:
torch.save(train_dataset, "datasets/alpaca/train_dataset.pth")
torch.save(test_dataset, "datasets/alpaca/test_dataset.pth")

### Dataloader

In [15]:
loader = torch.utils.data.DataLoader(train_dataset, batch_size=8, shuffle=True)

In [16]:
next(iter(loader))

[tensor([[  876,     5,    38,  ...,     1,     1,     1],
         [18889,     8,   240,  ...,     1,     1,     1],
         [ 6041,     8,  3874,  ...,     1,     1,     1],
         ...,
         [  754,     8,  5810,  ...,     1,     1,     1],
         [ 3176,     5,   229,  ...,     1,     1,     1],
         [ 8023,    24,   933,  ...,     1,     1,     1]]),
 tensor([[   2,   17,  674,  ...,    1,    1,    1],
         [   2,   17,  240,  ...,    1,    1,    1],
         [   2,   17, 3874,  ...,    1,    1,    1],
         ...,
         [   2,   50,  548,  ...,    1,    1,    1],
         [   2, 1201,   39,  ...,    1,    1,    1],
         [   2,   50, 3253,  ...,    1,    1,    1]])]