<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/ex9_intro_to_hlt_2023_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Watch out: with n=5 this notebook stretches Colab's memory, so you may need to use "Restart and run all" for full re-runs, since the old data stays in memory during a rerun.**

In this exercise, you'll try to generate text with an n-gram model. During generation, we use the last n-1 generated words as the prefix and the n-gram counts to establish the distribution of possible continuations. We might run this off the following data structure:

* A master dictionary, where the keys are (n-1)-grams
* Each value is another dictionary
* In this dictionary, the key is a word
* And the value is its count

So, when generating, we can take the last n-1 words, look them up in the master dictionary, and we get a dictionary of all seen continuations and their counts.
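
As a concrete illustration, here is a minimal sketch of that structure for a trigram model (n=3). It is built from a made-up toy token list rather than from IMDB, and the names `toy_tokens` and `master` are chosen only for this example.

```python
# Toy illustration only: a made-up token list, not the IMDB data.
toy_tokens = ["the", "cat", "sat", "on", "the", "cat", "slept"]
n = 3  # trigram model, so the keys are 2-grams

master = {}  # master dictionary: (n-1)-gram tuple -> {next word: count}
for i in range(len(toy_tokens) - n + 1):
    prefix = tuple(toy_tokens[i:i + n - 1])
    word = toy_tokens[i + n - 1]
    continuations = master.setdefault(prefix, {})
    continuations[word] = continuations.get(word, 0) + 1

# Looking up the last n-1 words gives every seen continuation with its count.
print(master[("the", "cat")])  # {'sat': 1, 'slept': 1}
```

The keys are tuples because lists are not hashable and so cannot be dictionary keys.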

Let us divide it into the following tasks:

1. Generate n-grams from a corpus of text, e.g. the IMDB dataset
2. Count the n-grams, i.e. build the master dictionary

With these data structures, the generation can proceed quite easily. Say, we have a 4-gram model.

* Given a prior context $w_1w_2w_3$
* Look up the word-count dictionary of possible words $w_4$
* The counts, once normalized to sum up to 1, form a distribution over words that can continue $w_1w_2w_3$ and we can sample the next word from this distribution.
* Then we append this generated word to our list of already generated words, and repeat the process
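
The steps above can be sketched on a tiny, hand-written 4-gram table. Everything here (`toy_ngrams`, `next_word`, the counts themselves) is hypothetical and only meant to show the lookup, normalize, sample, append cycle; it samples directly from the normalized counts, without any temperature.

```python
import random

# Hypothetical hand-written 4-gram counts, for illustration only.
toy_ngrams = {
    ("<bos>", "<bos>", "<bos>"): {"this": 3, "a": 1},
    ("<bos>", "<bos>", "this"): {"film": 4},
    ("<bos>", "<bos>", "a"): {"film": 1},
    ("<bos>", "this", "film"): {"rocks": 2, "<eos>": 2},
    ("<bos>", "a", "film"): {"<eos>": 1},
    ("this", "film", "rocks"): {"<eos>": 2},
}

def next_word(context, rng):
    """Sample one continuation of `context` in proportion to its count."""
    counts = toy_ngrams[tuple(context)]
    words = list(counts)
    total = sum(counts.values())
    probs = [counts[w] / total for w in words]  # normalized to sum to 1
    return rng.choices(words, weights=probs, k=1)[0]

rng = random.Random(0)  # seeded so the sketch is repeatable
generated = ["<bos>"] * 3
while generated[-1] != "<eos>" and len(generated) < 20:
    # the last 3 words are the context w1 w2 w3 for the next lookup
    generated.append(next_word(generated[-3:], rng))
print(" ".join(generated))
```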


Other remarks:

* We want to pad all texts with `<bos>` (beginning of sequence) and `<eos>` (end of sequence). We add `<bos>` n-1 times, so we can use it as the initial prompt and let the model learn how sequences start. The `<eos>` lets us stop generating and prevents a crash on unknown n-grams at the very end of a sequence: if an n-gram $w_1w_2w_3w_4$ was seen only once, at the end of a "training" sequence, then an attempt to continue it during generation would crash, since our simple, unsmoothed model has no known n-gram that continues $w_2w_3w_4$ :)
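
To make the padding concrete, here is a small plain-Python sketch for n=3 on a two-word toy text. The helper name `padded_ngrams` is made up for this illustration and uses simple slicing rather than any particular library.

```python
def padded_ngrams(tokens, n):
    """Yield n-gram tuples from `tokens` after adding n-1 <bos> and one <eos>."""
    padded = ["<bos>"] * (n - 1) + list(tokens) + ["<eos>"]
    for i in range(len(padded) - n + 1):
        yield tuple(padded[i:i + n])

for ngram in padded_ngrams(["great", "movie"], n=3):
    print(ngram)
# ('<bos>', '<bos>', 'great')
# ('<bos>', 'great', 'movie')
# ('great', 'movie', '<eos>')
```

Note how the first n-gram starts from the all-`<bos>` prefix, which is exactly the initial prompt used at generation time, and how the last n-gram gives the final prefix a known continuation, `<eos>`.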


# Task A: Generate n-grams

* Write a generator function (using `yield` rather than `return`) which yields n-grams as tuples $(w_1,...,w_n)$ from all sections of the IMDB dataset
* a vectorizer from `sklearn` can be used as a trivial tokenizer
* `more-itertools` is a nifty library to achieve the n-gram generation
* remember to pad with n-1 `<bos>` symbols at the beginning, and one `<eos>` symbol at the end

You can give this a shot, or simply use the code below.

In [2]:
!pip3 install datasets more-itertools





In [3]:
import datasets
import sklearn.feature_extraction
from pprint import pprint



In [4]:
dset=datasets.load_dataset("imdb")

In [5]:
# A few remarks here:
# 1. we don't need the vectorizer per se, we just want its analyzer function, which basically tokenizes the text for us, and somewhat unfortunately drops punctuation
# 2. the default token pattern in sklearn drops 1-letter words (like "I" and "a") so I modify it a bit
# 3. it's a pretty lousy tokenizer, but it will do for this toy exercise
cvectorizer=sklearn.feature_extraction.text.CountVectorizer(lowercase=False,stop_words=None,token_pattern=r"(?u)\b\w+\b" )
analyzer=cvectorizer.build_analyzer()
analyzer("I have a dog at home, it likes to shred newspapers.")

['I',
 'have',
 'a',
 'dog',
 'at',
 'home',
 'it',
 'likes',
 'to',
 'shred',
 'newspapers']

In [6]:
# Now we tokenize the IMDB dataset the usual way
def tokenize(ex):
    return {"tokenized":analyzer(ex["text"])}

dset=dset.map(tokenize)

In [7]:
from collections import Counter
from more_itertools import sliding_window #more-itertools is an awesome library!
import tqdm

def generate_ngrams(dset,n):
    for ex in tqdm.tqdm(dset):
        tokens=["<bos>"]*(n-1)+ex["tokenized"]+["<eos>"]
        for ngram in sliding_window(tokens,n):
            yield ngram



# Task B

* Now we can combine the different sections of the IMDB dataset and count our n-grams


In [8]:
# Here we can concatenate all the individual datasets (train,test,unlabeled) in IMDB
# the "master" dataset is a dictionary of these, so dset.values() has the datasets of the individual sections (train,test,unlabeled)
combined_dataset=datasets.concatenate_datasets(list(dset.values()))

In [20]:
n=4 #let's start with 4-grams, you can try 2- 3- and 5- grams too!
ngrams={} #This is the master dictionary
for ngram in generate_ngrams(combined_dataset,n):
    prefix,word=ngram[:-1],ngram[-1] #the (n-1)-gram key and the continuation word, for any n
    if prefix not in ngrams:
        ngrams[prefix]={}
    ngrams[prefix][word]=ngrams[prefix].get(word,0)+1

100%|██████████| 100000/100000 [01:09<00:00, 1440.72it/s]
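
The same counting can also be written more compactly with `collections.defaultdict` and `Counter`. This is only an alternative sketch on a toy list of n-grams; the function name `count_ngrams` is made up here, and it assumes the n-grams arrive as tuples, as they do from the generator above.

```python
from collections import Counter, defaultdict

def count_ngrams(ngram_iter):
    """Build {(n-1)-gram: Counter({next word: count})} from n-gram tuples."""
    table = defaultdict(Counter)
    for ngram in ngram_iter:
        # everything but the last word is the key, the last word is counted
        table[ngram[:-1]][ngram[-1]] += 1
    return table

toy = [("a", "b", "c"), ("a", "b", "d"), ("a", "b", "c")]
table = count_ngrams(toy)
print(table[("a", "b")].most_common(1))  # [('c', 2)]
```

One caveat with this design: indexing a `defaultdict` with an unseen key silently inserts an empty `Counter`, so at generation time it is safer to test `prefix in table` before looking it up.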


In [37]:
# an example n-1-gram with possible continuation words
pprint(sorted(ngrams[('this', 'film', 'is')].items(), key=lambda x : x[1], reverse=True)[0:10])

[('a', 440),
 ('that', 334),
 ('the', 318),
 ('not', 218),
 ('so', 115),
 ('just', 93),
 ('very', 80),
 ('for', 77),
 ('about', 73),
 ('an', 72)]


Looks good :)

# Task C

* Generate new text, starting from `<bos> <bos> ...` (n-1 times) and ending after, say, 40 words, or once `<eos>` is generated
* I will give you a support function `sample_from` which receives a list of counts and a temperature parameter, samples according to the resulting distribution, and returns the index of the sampled item in the list
* The temperature sampling is described here: https://towardsdatascience.com/how-to-sample-from-language-models-682bceb97277
* By all means, if you want to try, do try writing this function yourself!


In [10]:
import numpy

def softmax(x):
    return numpy.exp(x)/sum(numpy.exp(x))

def sample_from(counts,temperature=1.0):
    """
    counts: list of counts that form the distribution
    temperature: the "how wild the generation should be" parameter, numbers close
                 to 0 are very conservative, numbers close to or above 1 lead to quite
                 wild generations
    """

    counts_array=numpy.array(counts)
    #Make these sum up to 1.
    counts_array_norm=counts_array/counts_array.sum()
    #Divide by temperature, that is what the algorithm does
    counts_array_norm/=temperature
    #Renormalize into a distribution using the softmax function, that is what the algorithm does
    final_distribution=softmax(counts_array_norm)
    #A good way to sample from a distribution is the following function from numpy
    x=numpy.random.multinomial(n=1,pvals=final_distribution)
    selected_word=numpy.argmax(x).flatten()
    return selected_word[0]

sample_from([1,1,1,17],temperature=1.0) #Try running this several times each, with temps 0.1, 0.5, 1.0 ... see how temp 0.1 sticks to picking the max value, but higher temps don't?

3
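
Note that `sample_from` applies the softmax to the normalized counts divided by the temperature, so even at temperature 1.0 the resulting distribution is flatter than the raw relative frequencies. A common alternative formulation, sketched below as an assumption and not as this notebook's method, applies the temperature to the log-counts instead; temperature 1.0 then recovers the counts divided by their sum. The name `sample_from_logits` and its optional `rng` argument are hypothetical, and the sketch assumes all counts are at least 1.

```python
import numpy

def sample_from_logits(counts, temperature=1.0, rng=None):
    """Hypothetical variant of sample_from: temperature applied to log-counts.

    Returns the sampled index and the distribution it was drawn from.
    """
    rng = numpy.random.default_rng() if rng is None else rng
    logits = numpy.log(numpy.asarray(counts, dtype=float)) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = numpy.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs

_, probs = sample_from_logits([1, 1, 1, 17], temperature=1.0)
print(probs)  # at temperature 1.0 these are the counts divided by their sum
```

With this variant, low temperatures still concentrate on the most frequent word and high temperatures still flatten the distribution towards uniform.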

# Task D: piece it all together

* Again, I will give you the skeleton

In [88]:

def generate(ngrams,n,max_len=40,temperature=1.0,prompt=None):
    """
    ngrams: the master dictionary
    n: the n in n-gram
    max_len: how many words max?
    temperature: the generation temperature
    prompt: the initial prompt, as a tuple, if not given n-1 <bos> symbols will be used
    """

    if prompt is None:
        prompt=["<bos>"]*(n-1)

    generated=list(prompt) #this list will grow with words
    for _ in range(max_len):
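        context=tuple(generated[-(n-1):]) #the last n-1 words form the lookup key
        if context not in ngrams: #unseen context, e.g. from a custom prompt: nothing to sample from
            break
        words,counts=zip(*ngrams[context].items())
        sample_pos=sample_from(counts,temperature=temperature) #pass the temperature on, otherwise it is ignored
        generated.append(words[sample_pos])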
        if generated[-1]=="<eos>": #stop on end of sequence
            break
    return generated

# Now we can test it!

# make sure to match the n below to the n which was used to create
# the master dictionary
for temp in (0.1,0.5,1.0,2.0,5.0,10.0):
    generated=generate(ngrams=ngrams,n=4,max_len=60,temperature=temp)
    print(f"Temp={temp}:")
    pprint(" ".join(generated))
    print("-----------")



Temp=0.1:
'<bos> <bos> <bos> ELEPHANT WALK may not be anywhere I d love this <eos>'
-----------
Temp=0.5:
('<bos> <bos> <bos> Eskimo is a skilled production that contains a debate I m '
 'especially fond of Nestor Serrano s work in PUMPKINHEAD but effectively '
 'grotesque when they pop up on his offer The result is weird And finally I '
 'might have considered casting the beloved couple s real life discovery '
 'Maureen O Hara looks lovely but seemingly has no connection or storyline')
-----------
Temp=1.0:
('<bos> <bos> <bos> Anybody who wants to rediscover himself and the funds '
 'inside the armored truck no one cares anymore Their daughter has been doing '
 'his mojo since the 70s when it was screening at the Sundance film Festival '
 'where I went when I was walking down the endless pristine and very white '
 'corridors of a prison that looks much worse than The')
-----------
Temp=2.0:
('<bos> <bos> <bos> Martha Plimpton has done some magnificent scores for films '
 'of this era 

# Done!

Ok, the generations are quite funny. Clearly, this is no ChatGPT, but it is also not entirely bad for a model that is basically two dictionaries...