# Creating the finetuning training dataset

This notebook prepares a dataset for fine-tuning a model that, given a single word,
generates a line in the style of a Shakespearean sonnet.

See [./finetuning_1_adaptertune.ipynb](./finetuning_1_adaptertune.ipynb) for the actual training and inference.

## Set up

Install the necessary packages and load the API keys.

In [1]:
#%pip install --quiet -r requirements.txt

In [1]:
from dotenv import load_dotenv
load_dotenv("../keys.env");

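## Download stopwords and obscene words

These are words that we should not index. Stopwords are skipped as index words, and any line that contains an obscene word is dropped entirely.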

In [2]:
#!wget --quiet https://raw.githubusercontent.com/stopwords-iso/stopwords-en/master/stopwords-en.txt

In [3]:
#!wget --quiet https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en -O obscene.txt

In [4]:
def get_as_set(filename):
    """Read a file with one word per line and return the words as a set."""
    with open(filename) as ifp:
        return {line.strip() for line in ifp}

In [22]:
stopwords = get_as_set('stopwords-en.txt')
# add a few more Shakespearean words
stopwords.update(['thy', 'thine', 'tis', 'thou'])
obscene = get_as_set('obscene.txt')

In [23]:
#print(obscene)
print(list(stopwords)[:10])

['asking', 'nu', 'gotten', 'past', 'way', 'ir', 'behind', 'gives', 'gmt', 'same']

## Download word frequencies

Unigram frequencies from Google Books. They are used later to pick the rarest words in each line.


In [7]:
#!wget --quiet https://raw.githubusercontent.com/orgtre/google-books-ngram-frequency/main/ngrams/1grams_english.csv

In [8]:
import pandas as pd
word_freq = pd.read_csv('1grams_english.csv').set_index('ngram')

In [9]:
word_freq.loc['remember']['freq']

35043274.0
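
A `.loc` lookup raises a `KeyError` for a word that is not in the table. As a minimal sketch, assuming only the `ngram` and `freq` columns used above, the same data can be held in a plain dict, where `.get` supplies a fallback. The toy CSV content and the name `freq_by_word` are illustrative, not part of the notebook's data.

```python
import csv
import io

# Toy stand-in for 1grams_english.csv; only the ngram and freq columns are assumed.
toy_csv = io.StringIO("ngram,freq\nthe,1000\nremember,35\nwinter,12\n")

# Build a plain dict so that each lookup is a single hash access.
freq_by_word = {row["ngram"]: float(row["freq"]) for row in csv.DictReader(toy_csv)}

print(freq_by_word.get("winter"))         # 12.0
print(freq_by_word.get("buriest", -1.0))  # -1.0 (fallback for a missing word)
```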

## Download sonnets

The text of Shakespeare's sonnets comes from Project Gutenberg.

In [10]:
#!wget --quiet https://www.gutenberg.org/cache/epub/1041/pg1041.txt -O sonnets.txt

In [12]:
!head sonnets.txt

﻿The Project Gutenberg eBook of Shakespeare's Sonnets
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.



In [19]:
sonnet_lines = []
with open('sonnets.txt', 'r') as ifp:
    # skip the first 25 or so lines, which hold the Project Gutenberg header
    for _ in range(25):
        next(ifp)
    for line in ifp:
        line = line.strip()
        # keep only lines longer than 30 characters; this drops blank lines
        # and short lines such as headings
        if len(line) > 30:
            sonnet_lines.append(line)

sonnet_lines[:20]

['From fairest creatures we desire increase,',
 'That thereby beauty’s rose might never die,',
 'But as the riper should by time decease,',
 'His tender heir might bear his memory:',
 'But thou, contracted to thine own bright eyes,',
 'Feed’st thy light’s flame with self-substantial fuel,',
 'Making a famine where abundance lies,',
 'Thyself thy foe, to thy sweet self too cruel:',
 'Thou that art now the world’s fresh ornament,',
 'And only herald to the gaudy spring,',
 'Within thine own bud buriest thy content,',
 'And tender churl mak’st waste in niggarding:',
 'Pity the world, or else this glutton be,',
 'To eat the world’s due, by the grave and thee.',
 'When forty winters shall besiege thy brow,',
 'And dig deep trenches in thy beauty’s field,',
 'Thy youth’s proud livery so gazed on now,',
 'Will be a tatter’d weed of small worth held:',
 'Then being asked, where all thy beauty lies,',
 'Where all the treasure of thy lusty days;']
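
The fixed offset of 25 lines depends on the current layout of the file, and the length filter does not by itself exclude long lines of license text that Project Gutenberg files typically carry at the end. A possible alternative, sketched here on toy input, is to keep only the text between the start and end markers. The marker prefixes `*** START OF` and `*** END OF` are an assumption about the file, and `extract_body` is a hypothetical helper.

```python
def extract_body(lines):
    """Return the lines between the (assumed) Gutenberg START and END markers."""
    body = []
    inside = False
    for line in lines:
        if line.startswith("*** START OF"):
            inside = True
            continue
        if line.startswith("*** END OF"):
            break
        if inside:
            body.append(line.rstrip("\n"))
    return body

# Toy input that mimics the layout of a Project Gutenberg text file.
sample = [
    "The Project Gutenberg eBook of Shakespeare's Sonnets\n",
    "*** START OF THE PROJECT GUTENBERG EBOOK SHAKESPEARE'S SONNETS ***\n",
    "From fairest creatures we desire increase,\n",
    "*** END OF THE PROJECT GUTENBERG EBOOK SHAKESPEARE'S SONNETS ***\n",
    "This license text is long enough to pass a length filter.\n",
]
print(extract_body(sample))  # ['From fairest creatures we desire increase,']
```

The length filter can then be applied to the returned lines exactly as before.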

## Create the training dataset

In [24]:
import re
import sys

def is_obscene(words):
    """Return True if any of the words is in the obscene list."""
    return any(word in obscene for word in words)

def get_index_words(text):
    """Return up to 3 non-stopword index words for a line of text."""
    # lowercase, replace every non-letter with a space, then split into words
    words = re.sub(r'[^a-zA-Z]', ' ', text.lower()).split()
    if is_obscene(words):
        return [] # prune out the obscene text by not indexing them
    indexes = [word for word in words if word not in stopwords]
    # no more than 3 index words: keep the 3 rarest by frequency;
    # words missing from the frequency table sort last
    if len(indexes) > 3:
        freq = [int(word_freq.loc[word]['freq']) if word in word_freq.index else sys.maxsize for word in indexes]
        zipped = sorted(zip(freq, indexes))
        indexes = [x for _, x in zipped[:3]]
    return indexes

get_index_words(sonnet_lines[4])

['contracted', 'bright', 'eyes']
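
The line above has exactly three non-stopwords, so the frequency ranking is not triggered. To illustrate what happens when there are more than three, here is a small sketch with made-up frequencies: the words are sorted by ascending frequency and the first three are kept, so the result comes back in frequency order rather than line order. The dict `toy_freq` and the helper `rarest` are hypothetical.

```python
import sys

# Hypothetical frequencies, for illustration only.
toy_freq = {"summer": 500, "day": 9000, "lovely": 300, "temperate": 20}

def rarest(words, freq, k=3):
    """Return the k words with the lowest frequency; unknown words sort last."""
    ranked = sorted(words, key=lambda w: (freq.get(w, sys.maxsize), w))
    return ranked[:k]

print(rarest(["summer", "day", "lovely", "temperate", "compare"], toy_freq))
# ['temperate', 'lovely', 'summer']
```

Here `compare` is missing from the toy table, so it is treated as the most common word and dropped first.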

In [26]:
indexed_lines = []
for line in sonnet_lines:
    index_words = get_index_words(line) # will prune out any line containing words that the LLMs might reject
    for word in index_words:
        indexed_lines.append({
            "input": word,
            "output": line
        })

In [27]:
indexed_lines[190]

{'input': 'winter',
 'output': 'But flowers distill’d, though they with winter meet,'}
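
Since every index word becomes its own example, the same input word can map to several different lines. One way to see how often that happens is to count examples per input word, sketched here on hand-written toy records in the same `input`/`output` shape (they are not actual rows of the dataset).

```python
from collections import Counter

# Toy records in the same {"input", "output"} shape as indexed_lines.
records = [
    {"input": "lies", "output": "Making a famine where abundance lies,"},
    {"input": "famine", "output": "Making a famine where abundance lies,"},
    {"input": "lies", "output": "Then being asked, where all thy beauty lies,"},
]

# Count how many training examples share each input word.
per_word = Counter(record["input"] for record in records)
print(per_word.most_common(1))  # [('lies', 2)]
```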

In [29]:
import json
with open('indexed_sonnets.json', 'w') as ofp:
    json.dump(indexed_lines, ofp, indent=2)

In [30]:
!head indexed_sonnets.json

[
  {
    "input": "creatures",
    "output": "From fairest creatures we desire increase,"
  },
  {
    "input": "desire",
    "output": "From fairest creatures we desire increase,"
  },
  {


In [31]:
!ls -lh indexed_sonnets.json

-rw-r--r-- 1 jupyter jupyter 598K Aug 14 00:22 indexed_sonnets.json
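
With its default settings, `json.dump` writes non-ASCII characters such as the typographic apostrophe as `\uXXXX` escapes, so the file may look different from the in-memory strings. The following sketch writes one record to a temporary file (the path is illustrative) and checks that loading it restores the original text.

```python
import json
import os
import tempfile

# One record from the dataset; \u2019 is the typographic apostrophe.
records = [{"input": "winter",
            "output": "But flowers distill\u2019d, though they with winter meet,"}]

path = os.path.join(tempfile.mkdtemp(), "toy_sonnets.json")
with open(path, "w", encoding="utf-8") as ofp:
    json.dump(records, ofp, indent=2)

with open(path, encoding="utf-8") as ifp:
    raw = ifp.read()
reloaded = json.loads(raw)

print("\\u2019" in raw)     # True: the apostrophe is stored as an escape
print(reloaded == records)  # True: loading restores the original text
```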
