# Creating the finetuning training dataset

This notebook uses a dataset of about 200K jokes scraped from Reddit r/dadjokes
and indexes each joke by keyword.  The idea is that, given a word, the fine-tuned LLM should generate
a joke that contains that word.

## Set up

Install the necessary packages, load the API keys, and choose the model provider.

In [1]:
#%pip install --quiet -r requirements.txt

In [1]:
from dotenv import load_dotenv
load_dotenv("../keys.env");

PROVIDER = "Google"
#PROVIDER = "OpenAI"

if PROVIDER == "Google":
    from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
    model = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0.1)
else:
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    model = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.1)

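## Download stopwords and obscene words

These are words that we should not index: stopwords are too common to be useful as index words, and any joke that contains an obscene word is dropped entirely.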

In [2]:
#!wget --quiet https://raw.githubusercontent.com/stopwords-iso/stopwords-en/master/stopwords-en.txt

In [3]:
#!wget --quiet https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en -O obscene.txt

In [4]:
def get_as_set(filename):
    """Read a file with one word per line and return the words as a set."""
    with open(filename) as ifp:
        return {line.strip() for line in ifp}

In [5]:
stopwords = get_as_set('stopwords-en.txt')
obscene = get_as_set('obscene.txt')

In [6]:
#print(obscene)
print(list(stopwords)[:10])

['formerly', 'cs', 'suggest', 'except', "that'll", 'whoever', 'fr', 'sb', 'v', "we're"]
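
To make the role of the two word lists concrete, here is a minimal sketch of how they might be applied to a tokenized joke. The small `toy_stopwords` and `toy_obscene` sets and the `filter_tokens` helper are hypothetical stand-ins for illustration, not the downloaded files or the notebook's own function.

```python
import re

# Hypothetical stand-ins for the downloaded word lists.
toy_stopwords = {"a", "the", "is", "what", "s", "and", "for", "your"}
toy_obscene = {"darn"}

def filter_tokens(text, stopwords, obscene):
    """Return candidate index words, or [] if the text contains an obscene word."""
    # Lowercase, turn every non-letter into a space, then split on whitespace.
    tokens = re.sub(r"[^a-zA-Z]", " ", text.lower()).split()
    if any(token in obscene for token in tokens):
        return []  # the whole text is rejected, not just the offending word
    return [token for token in tokens if token not in stopwords]

print(filter_tokens("What's red and bad for your teeth?", toy_stopwords, toy_obscene))
# ['red', 'bad', 'teeth']
```

Note that the apostrophe in "What's" splits the word into `what` and `s`, which is why single letters need to be covered by the stopword list.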


In [22]:
#!wget --quiet https://raw.githubusercontent.com/orgtre/google-books-ngram-frequency/main/ngrams/1grams_english.csv

In [24]:
import pandas as pd
word_freq = pd.read_csv('1grams_english.csv').set_index('ngram')

In [34]:
word_freq.loc['remember']['freq']

35043274.0

## Download joke data and clean up the data

Note: this is a scraped dataset, so some of the jokes might be offensive. Hopefully, after dropping the low-scoring jokes and the jokes that contain obscene words, the remainder are relatively clean.

Download the jokes:

In [7]:
#!wget --quiet https://raw.githubusercontent.com/taivop/joke-dataset/master/reddit_jokes.json

In [8]:
!head -500 reddit_jokes.json | tail -10

        "score": 10,
        "title": "Manager : So do you think you'd be a good waiter?"
    },
    {
        "body": "I have a couple of ideas:\n\n1: Dinner\n2: Movies\n\n1 or 2? 1.. 2..? 1..... or 2?",
        "id": "5txs4x",
        "score": 14,
        "title": "An optometrist asks a woman out on a date"
    },
    {


In [9]:
import json
with open('reddit_jokes.json') as ifp:
    reddit_jokes = json.load(ifp)

In [10]:
good_jokes = [f"{joke['title']} ... {joke['body']}" for joke in reddit_jokes if joke['score'] > 5]
good_jokes[15]

"Manager : So do you think you'd be a good waiter? ... Me : well, you could say I bring a lot to the table."

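The filter-and-join step above can be tried on a couple of hand-written records. This is only a sketch: the first record in `toy_jokes` reuses the sample shown earlier, the second is invented to show a rejection, and `MIN_SCORE` is a name introduced here for the threshold of 5.

```python
toy_jokes = [
    {"title": "Manager : So do you think you'd be a good waiter?",
     "body": "Me : well, you could say I bring a lot to the table.",
     "score": 10},
    # Hypothetical low-scoring record, invented for this sketch.
    {"title": "A placeholder title", "body": "A placeholder punchline.", "score": 2},
]

MIN_SCORE = 5  # keep only jokes that score strictly above this value

# Join the title (setup) and body (punchline) with " ... " as the separator.
kept = [f"{j['title']} ... {j['body']}" for j in toy_jokes if j["score"] > MIN_SCORE]
print(len(kept))
# 1
```
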
## Create the training dataset

In [48]:
import re, sys

def is_obscene(words):
    # True if any word of the joke appears in the obscene-word list
    return any(word in obscene for word in words)

def get_index_words(joke):
    # lowercase, replace every non-letter with a space, and split into words
    words = re.sub(r'[^a-zA-Z]', ' ', joke.lower()).split()
    if is_obscene(words):
        return [] # prune out the obscene jokes by not indexing them
    else:
        indexes = [word for word in words if word not in stopwords]
        # no more than 3 index words: keep the 3 with the lowest frequency
        if len(indexes) > 3:
            # words missing from the frequency table sort last (treated as most common)
            freq = [int(word_freq.loc[word]['freq']) if word in word_freq.index else sys.maxsize for word in indexes]
            zipped = sorted(zip(freq, indexes))
            indexes = [x for _, x in zipped[:3]]
        return indexes

get_index_words(good_jokes[0])

['destroy', 'kid', 'housing']
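
The selection of the three rarest words can be illustrated in isolation. In this sketch a plain dict, `toy_freq`, stands in for the `word_freq` DataFrame, its counts are made up, and `rarest_words` is a hypothetical helper rather than the function above. It ranks by frequency only, whereas the notebook's version sorts `(freq, word)` pairs and so breaks ties alphabetically.

```python
import sys

# Hypothetical frequency counts; a plain dict stands in for the word_freq DataFrame.
toy_freq = {"remember": 900, "kid": 300, "cried": 120, "housing": 400, "destroy": 200}

def rarest_words(words, freq, limit=3):
    """Keep at most `limit` words, preferring those with the lowest frequency."""
    if len(words) <= limit:
        return list(words)
    # Words missing from the table get sys.maxsize, so they rank as the most common.
    ranked = sorted(words, key=lambda w: freq.get(w, sys.maxsize))
    return ranked[:limit]

candidates = ["remember", "kid", "cried", "gunna", "destroy", "housing"]
print(rarest_words(candidates, toy_freq))
# ['cried', 'destroy', 'kid']
```

One consequence of the `sys.maxsize` fallback is that slang or misspelled words such as "gunna", which are absent from the frequency table, are chosen as index words only when a joke has few other candidates.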

In [49]:
indexed_jokes = []
for joke in good_jokes:
    index_words = get_index_words(joke) # will prune out the obscene jokes
    for word in index_words:
        indexed_jokes.append({
            "input": word,
            "output": joke
        })

In [50]:
indexed_jokes[190]

{'input': 'teeth', 'output': "What's red and bad for your teeth? ... A Brick."}
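
Since the goal is for the model to produce a joke containing the input word, one possible sanity check is to confirm that each record's `input` actually occurs in its `output`. The sketch below is an assumption about how such a check could look, not part of the original pipeline; `sample_records` and `input_in_output` are hypothetical, and the second record is deliberately wrong.

```python
import re

# Records in the same {"input", "output"} shape as indexed_jokes.
sample_records = [
    {"input": "teeth", "output": "What's red and bad for your teeth? ... A Brick."},
    # Hypothetical bad record, invented to show a failing check.
    {"input": "dentist", "output": "What's red and bad for your teeth? ... A Brick."},
]

def input_in_output(record):
    """True if the input word is one of the lowercased words of the output joke."""
    tokens = re.sub(r"[^a-zA-Z]", " ", record["output"].lower()).split()
    return record["input"] in tokens

print([input_in_output(r) for r in sample_records])
# [True, False]
```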

In [51]:
# use a context manager so the file is flushed and closed before it is read back
with open('indexed_jokes.json', "w") as ofp:
    json.dump(indexed_jokes, ofp, indent=2)

In [52]:
!head indexed_jokes.json

[
  {
    "input": "destroy",
    "output": "Remember when you were a kid and when you cried your parents said, \"I'll give you a reason to cry\"? ... I always thought they were gunna hit me, not that they were going to destroy the housing market 20 years later."
  },
  {
    "input": "kid",
    "output": "Remember when you were a kid and when you cried your parents said, \"I'll give you a reason to cry\"? ... I always thought they were gunna hit me, not that they were going to destroy the housing market 20 years later."
  },
  {


In [54]:
!ls -lh indexed_jokes.json

-rw-r--r-- 1 jupyter jupyter 58M Aug 13 18:53 indexed_jokes.json
