# Poem Generation using FastAI


In [1]:
pip install fastai


Note: you may need to restart the kernel to use updated packages.


You should consider upgrading via the 'C:\Users\Nada\AppData\Local\Programs\Python\Python310\python.exe -m pip install --upgrade pip' command.


In [2]:
pip install transformers


Note: you may need to restart the kernel to use updated packages.


You should consider upgrading via the 'C:\Users\Nada\AppData\Local\Programs\Python\Python310\python.exe -m pip install --upgrade pip' command.


In this tutorial you will see how to fine-tune a pretrained transformer model from the transformers library by HuggingFace. It can be very simple with FastAI's data loaders. It's possible to use any of the pretrained models from HuggingFace. Below we will experiment with GPT2. 

## Import Libraries


In [3]:
# from fastbook import *
from fastai.text.all import *
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

In [4]:
pretrained_weights = 'gpt2'
tokenizer = GPT2TokenizerFast.from_pretrained(pretrained_weights)
model = GPT2LMHeadModel.from_pretrained(pretrained_weights)

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


## Read Data
This data is organized by folder. There are two main folders: forms (e.g. haiku, sonnet, etc.) and topics (e.g. love, peace, etc.). Those main folders contain subfolders for the subcategories and then the poem txt files are contained in those.
With fastai, it's quite easy to read the data with the the get_text_files function. You can select all folders or select specific ones.

In [8]:
path = 'C:/Users/Nada/OneDrive/Desktop/gl3/2eme semestre/ppp/hedhi nhessha bech tenjah'

In [9]:
poems = get_text_files(path, folders = ['forms','topics'])
print("There are",len(poems),"poems in the dataset")

There are 20657 poems in the dataset


We'll start off with training the model on ballads. There are only 100 ballads so it won't take as long to train. However you can add more poem forms. For instance, a haiku would be very cool to experiment with and to see if it maintains the 5,7,5 syllable structure. You can also change the path to the topics folder instead of poem forms and you can try out a bunch of poem topics like love, anger, depression, etc.. 

In [10]:
ballads = get_text_files(path+'/forms', folders = ['ballad'])
print("There are",len(ballads),"ballads in the dataset")

There are 100 ballads in the dataset


In [11]:
txt = poems[0].open().read(); #read the first file
print(txt)

2 ABC of H.k. and China revised vision.
Barrels tears are wines and salts.
With a whisk on goody tails!
Wiggle maces to fix the heads.
Heads in jack on boxes are ceased.
Cry to paranoid truly bosses.
Bosses are jokers take your boys.
Studs are bogs with fire apples.
True predicates worth cases.â€™
Descents wash in badly bands.
Wholly sales are smart with cats.
Who got tenth honors in China?
Homage grand to play and plays!
Trim the times of hearts then cry.
Tanks in steels but voice wail.
Bossy dragged by tails that whisked.
Go very timid and love the wise.
Hands are lent but laws are ends.
Cases on courts are borrowed lands.
Length long with treads to retch!
Straps on times and watch here.
Arrays tanks but all are men.
Cross all suctions steal the ends.
Cave on minds are cages on objects.
Rouser rockets powers holes.
Confine curses to stop our wounds.
Whirl your bodies and jump on grounds.
Crouch of soldiers after kicks with flings.
Block one leg and hit the middle.
Cauchy3 know the tr

## Prepare the Data



In [14]:
ballads = L(o.open(encoding='utf-8').read() for o in ballads) # to make things easy we will gather all texts in one numpy array

In [15]:
def flatten(A):
    rt = []
    for i in A:
        if isinstance(i,list): rt.extend(flatten(i))
        else: rt.append(i)
    return rt
  
all_ballads = flatten(ballads)

In [16]:
class TransformersTokenizer(Transform):
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x): 
        toks = self.tokenizer.tokenize(x)
        return tensor(self.tokenizer.convert_tokens_to_ids(toks))
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))

In [17]:
splits = [range_of(70), range(100)] # use a 70/30 split
tls = TfmdLists(all_ballads, TransformersTokenizer(tokenizer), splits=splits, dl_type=LMDataLoader)

In [18]:
show_at(tls.train, 0)

The burden of hard hitting. Slug away
Like Honus Wagner or like Tyrus Cobb.
Else fandom shouteth: "Who said you could play?
Back to the jasper league, you minor slob!"
Swat, hit, connect, line out, goet on the job.
Else you shall feel the brunt of fandom's ire
Biff, bang it, clout it, hit it on the knob -
This is the end of every fan's desire.
The burden of good pitching. Curved or straight.
Or in or out, or haply up or down,
To puzzle him that standeth by the plate,
To lessen, so to speak, his bat-renown:
Like Christy Mathewson or Miner Brown,
So pitch that every man can but admire
And offer you the freedom of the town -
This is the end of every fan's desire.
The burden of loud cheering. O the sounds!
The tumult and the shouting from the throats
Of forty thousand at the Polo Grounds
Sitting, ay, standing sans their hats and coats.
A mighty cheer that possibly denotes
That Cub or Pirate fat is in the fire;
Or, as H. James would say, We've got their goats -
This is the end of every fan'

In [19]:
bs,sl = 4,256
dls = tls.dataloaders(bs=bs, seq_len=sl)

Token indices sequence length is longer than the specified maximum sequence length for this model (1214 > 1024). Running this sequence through the model will result in indexing errors


In [20]:
dls.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"For God, our God is a gallant foe\nThat playeth behind the veil.\nI have loved my God as a child at heart\nThat seeketh deep bosoms for rest,\nI have loved my God as a maid to man—\nBut lo, this thing is best:\nTo love your God as a gallant foe that plays behind the veil;\nTo meet your God as the night winds meet beyond Arcturus' pale.\nI have played with God for a woman,\nI have staked with my God for truth,\nI have lost to my God as a man, clear-eyed—\nHis dice be not of ruth.\nFor I am made as a naked blade,\nBut hear ye this thing in sooth:\nWho loseth to God as man to man\nShall win at the turn of the game.\nI have drawn my blade where the lightnings meet\nBut the ending is the same:\nWho loseth to God as the sword blades lose\nShall win at the end of the game.\nFor God, our God is","God, our God is a gallant foe\nThat playeth behind the veil.\nI have loved my God as a child at heart\nThat seeketh deep bosoms for rest,\nI have loved my God as a maid to man—\nBut lo, this thing is best:\nTo love your God as a gallant foe that plays behind the veil;\nTo meet your God as the night winds meet beyond Arcturus' pale.\nI have played with God for a woman,\nI have staked with my God for truth,\nI have lost to my God as a man, clear-eyed—\nHis dice be not of ruth.\nFor I am made as a naked blade,\nBut hear ye this thing in sooth:\nWho loseth to God as man to man\nShall win at the turn of the game.\nI have drawn my blade where the lightnings meet\nBut the ending is the same:\nWho loseth to God as the sword blades lose\nShall win at the end of the game.\nFor God, our God is"
1,"with a kiss.\nWhales in the wake like capes and Alps\nQuaked the sick sea and snouted deep,\nDeep the great bushed bait with raining lips\nSlipped the fins of those humpbacked tons\nAnd fled their love in a weaving dip.\nOh, Jericho was falling in their lungs!\nShe nipped and dived in the nick of love,\nSpun on a spout like a long-legged ball\nTill every beast blared down in a swerve\nTill every turtle crushed from his shell\nTill every bone in the rushing grave\nRose and crowed and fell!\nGood luck to the hand on the rod,\nThere is thunder under its thumbs;\nGold gut is a lightning thread,\nHis fiery reel sings off its flames,\nThe whirled boat in the burn of his blood\nIs crying from nets to knives,\nOh the shearwater birds and their boatsized brood\nOh the bulls of Biscay and their calves\nAre making under the green, laid veil\nThe long-legged beautiful bait their wives.\nBreak the black news and paint on a sail\nHuge","a kiss.\nWhales in the wake like capes and Alps\nQuaked the sick sea and snouted deep,\nDeep the great bushed bait with raining lips\nSlipped the fins of those humpbacked tons\nAnd fled their love in a weaving dip.\nOh, Jericho was falling in their lungs!\nShe nipped and dived in the nick of love,\nSpun on a spout like a long-legged ball\nTill every beast blared down in a swerve\nTill every turtle crushed from his shell\nTill every bone in the rushing grave\nRose and crowed and fell!\nGood luck to the hand on the rod,\nThere is thunder under its thumbs;\nGold gut is a lightning thread,\nHis fiery reel sings off its flames,\nThe whirled boat in the burn of his blood\nIs crying from nets to knives,\nOh the shearwater birds and their boatsized brood\nOh the bulls of Biscay and their calves\nAre making under the green, laid veil\nThe long-legged beautiful bait their wives.\nBreak the black news and paint on a sail\nHuge weddings"


## Fine-tuning the model

In [21]:
class DropOutput(Callback):
    def after_pred(self): self.learn.pred = self.pred[0]

In [22]:
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), cbs=[DropOutput], metrics=Perplexity()).to_fp16()

In [23]:
learn.validate()



(#2) [4.1808180809021,65.41934967041016]

In [24]:
learn.lr_find()

KeyboardInterrupt: 

In [None]:
learn.fit_one_cycle(1, 1e-4)

In [None]:
model.save_pretrained('C:\Users\Nada\OneDrive\Desktop\gl3\2eme semestre\ppp\hedhi nhessha bech tenjah\model')

## Poem Generation Example

In [None]:
prompt = 'love is ridiculous' # create an initial text prompt to start your generated text
prompt_ids = tokenizer.encode(prompt)
inp = tensor(prompt_ids)[None].cuda()
inp.shape

Adding the `num_beams` and `no_repeat_ngram_size` arguments make a huge difference. This can be explained [here](https://huggingface.co/blog/how-to-generate). Basically beam search reduces the risk of missing hidden high probability word sequences by keeping the most likely num_beams of hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability. Without beam search you will obtain a more greedy search. Beam search will always find an output sequence with higher probability than greedy search, but is not guaranteed to find the most likely output. Moreover, without the `no_repeat_ngram_size` you will likely obtain a repeated output. Thus we add a penalty that makes sure that no n-gram appears twice by manually setting the probability of next words that could create an already seen n-gram to 0.

In [None]:
preds = learn.model.generate(inp, max_length=60, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(preds[0].cpu().numpy(), skip_special_tokens=True))

In [None]:
prompt = "I don't know what I would do"
prompt_ids = tokenizer.encode(prompt)
inp = tensor(prompt_ids)[None].cuda()
preds = learn.model.generate(inp, max_length=60, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(preds[0].cpu().numpy(), skip_special_tokens=True))