<a href="https://colab.research.google.com/github/heinohen/tko_7095_i2hlt/blob/main/week5_ex_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# In this exercise, you'll try to generate text with an n-gram model. During generation, the last n-1 generated words serve as the prefix, and the n-gram counts define the distribution over possible continuations.

## REFLECTIONS AT THE END OF THE NOTEBOOK

In [1]:
!pip3 install datasets more-itertools



In [2]:
import datasets
import sklearn.feature_extraction

In [3]:
data = datasets.load_dataset("imdb")



In [4]:
cvec = sklearn.feature_extraction.text.CountVectorizer(lowercase = False, stop_words = None, token_pattern=r"(?u)\b\w+\b" )
analyzer = cvec.build_analyzer()
analyzer('I have a dog at home, it likes to shred newspapers.')

['I',
 'have',
 'a',
 'dog',
 'at',
 'home',
 'it',
 'likes',
 'to',
 'shred',
 'newspapers']
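As far as I understand, with `lowercase = False` and no stop words the analyzer above boils down to finding every match of the token pattern. The sketch below reproduces that with the standard `re` module only; `simple_tokenize` is a hypothetical helper name, not part of scikit-learn.

```python
import re

# Same pattern that was passed to CountVectorizer above.
TOKEN_PATTERN = re.compile(r"(?u)\b\w+\b")

def simple_tokenize(text):
    """Return all maximal runs of word characters, keeping the original case."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("I have a dog at home, it likes to shred newspapers.")
print(tokens)

# Punctuation is dropped, so contractions are split into two tokens.
print(simple_tokenize("hasn't"))
```

The split contractions are the likely reason why generated text later contains pieces such as `hasn t`.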

## TASK A

In [5]:
# Tokenize the IMDB dataset

def tokenize(ex) -> dict:
  return {"tokenized":analyzer(ex["text"])}

data = data.map(tokenize, num_proc = 4)


In [6]:
from collections import Counter
from more_itertools import sliding_window
import tqdm

def generate_ngrams(dset, n ):
  for ex in tqdm.tqdm(dset):
    tokens = ["<bos>"] * (n-1) + ex["tokenized"]+["<eos>"]
    for ngram in sliding_window(tokens,n):
      yield ngram
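To see what the padding and the sliding window produce, here is a small sketch on a single toy sentence. It uses plain slicing instead of `more_itertools.sliding_window` so that it runs without extra packages; `toy_ngrams` is a hypothetical name used only for this illustration.

```python
def toy_ngrams(tokens, n):
    """Pad with n-1 <bos> and one <eos>, then yield every window of n tokens."""
    padded = ["<bos>"] * (n - 1) + list(tokens) + ["<eos>"]
    for i in range(len(padded) - n + 1):
        yield tuple(padded[i:i + n])

trigrams = list(toy_ngrams(["I", "like", "dogs"], 3))
for trigram in trigrams:
    print(trigram)
```

The first window consists mostly of `<bos>` symbols, which is what lets the model learn how sequences start, and the last window ends in `<eos>`.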

## TASK B

In [7]:
# Concatenate all individual datasets (train, test, unlabelled) in IMDB
# The "master" dataset is a dict of these, so data.values() has the datasets of the individual sections
combined_dataset = datasets.concatenate_datasets(list(data.values()))

In [8]:
ngrams = {}
for ngram in generate_ngrams(combined_dataset, 4):
    # ngram is a tuple of 4 items (n comes from the argument), e.g. ('got', 'back', 'then', '<eos>')
    # prefix is a tuple of the first 3 items, e.g. ('got', 'back', 'then')
    prefix = ngram[:-1]
    # ngram[-1] is the last item, i.e. the word that follows the prefix, as a string
    word = ngram[-1]
    # setdefault(key[, default]): if key is in the dictionary, return its value;
    # if not, insert key with a value of default and return default.
    d = ngrams.setdefault(prefix, {})
    # d is a frequency counter of the words seen after this prefix
    d[word] = d.get(word, 0) + 1

100%|██████████| 100000/100000 [01:57<00:00, 850.67it/s]


*** TRIED THIS WITH MANY DIFFERENT COMBINATIONS ***

* converting the n-gram to a string and storing it with its frequency
* storing the n-gram as a tuple with its frequency
* storing individual words with their frequencies (I know... not an n-gram)

Could not get it to work, so I had to resort to looking at the model solution.
I did a line-by-line read-through to understand the fundamentals.

But why do I need the prefix?
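One way to see why the prefix is needed: during generation we only know the last n-1 words, so the counts must be retrievable by exactly those words. A minimal sketch on a made-up three-sentence corpus (the corpus and the name `toy_counts` are assumptions for illustration only):

```python
toy_corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "cat", "sat"]]
n = 3

toy_counts = {}
for tokens in toy_corpus:
    padded = ["<bos>"] * (n - 1) + tokens + ["<eos>"]
    for i in range(len(padded) - n + 1):
        ngram = tuple(padded[i:i + n])
        prefix, word = ngram[:-1], ngram[-1]
        # One inner frequency dict per prefix, as in TASK B.
        continuations = toy_counts.setdefault(prefix, {})
        continuations[word] = continuations.get(word, 0) + 1

# Given the last two words, one lookup returns every seen continuation with its count.
print(toy_counts[("the", "cat")])
```

If the full n-gram were the key instead, finding the continuations of `("the", "cat")` would mean scanning every key in the dictionary at every generation step.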

## TASK C

In [9]:
import numpy as np

def softmax(x):
  """
  Turn an array of scores into a probability distribution:
  exponentiate every element, then divide by the sum of the exponentials.
  https://numpy.org/doc/stable/reference/generated/numpy.exp.html#numpy.exp
  """
  exps = np.exp(x)
  return exps / np.sum(exps)

def sample_from(counts, temperature):
  """
  counts: List of counts that form the distribution
  temperature: The "how wild the generation should be" parameter; numbers close to 0 are very conservative,
  numbers close to or above 1 lead to quite wild generations
  Returns the index of the sampled item in counts.
  """

  counts_array = np.array(counts)

  # Make these sum up to 1
  counts_array_norm = counts_array / counts_array.sum()

  # Divide by temperature, that is what the algorithm does
  counts_array_norm /= temperature # in-place (augmented) floating point division

  # Renormalize into a distribution using the softmax function, that is what the algorithm does
  final_distribution = softmax(counts_array_norm)

  # A good way to sample from a distribution is the following function from numpy:
  # a single draw returns a one-hot vector
  x = np.random.multinomial(n = 1, pvals = final_distribution)
  # The position of the single 1 is the index of the selected word
  return int(np.argmax(x))

"""
from this sample_from([1,1,1,17], temperature = 0.5) call:

counts_array type --> <class 'numpy.ndarray'>
counts_array contains --> [ 1  1  1 17]

counts_array_norm = counts_array / counts_array.sum() --> [0.05 0.05 0.05 0.85]
counts_array_norm /= temperature --> [0.1 0.1 0.1 1.7]
final_distribution = softmax(counts_array_norm) --> [0.1257382  0.1257382  0.1257382  0.62278539]
"""

sample_from([1,1,1,17], temperature = 0.5)


0
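To get a feel for what the temperature does, the sketch below recomputes the final distribution for the same counts at several temperatures. It follows the same steps as `sample_from` (normalize, divide by temperature, softmax) but uses only the standard `math` module and returns the distribution instead of sampling; `temperature_distribution` is a hypothetical helper name.

```python
import math

def temperature_distribution(counts, temperature):
    """Normalize counts, divide by temperature, then apply softmax."""
    total = sum(counts)
    scaled = [c / total / temperature for c in counts]
    # Subtracting the maximum does not change the result, it only avoids overflow.
    biggest = max(scaled)
    exps = [math.exp(s - biggest) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]

for temp in (0.1, 0.5, 1.0, 5.0):
    dist = temperature_distribution([1, 1, 1, 17], temp)
    print(temp, [round(p, 3) for p in dist])
```

Lower temperatures concentrate the mass on the most frequent item and higher ones flatten the distribution. Because the softmax here is applied to normalized frequencies (values between 0 and 1), at temperature 1.0 the dominant item ends up with noticeably less than its raw share of 0.85.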

## TASK D

In [13]:
from pprint import pprint

def generate(ngrams, n, max_len, temperature, prompt = None) -> list: # RETURNS A LIST WITH ALL THE WORDS IN IT
  """
  ngrams: the master dict
  n: the n-gram order that was used when building the master dict
  max_len: how many words at maximum
  temperature: the generation temperature
  prompt: the initial prompt, as a tuple, if not given n-1 <bos> symbols will be used
  """

  # The <bos> we want to have there n-1 times, so we can use it as the initial prompt and let the model learn how the sequences start.
  if prompt is None:
    sequence_start = ["<bos>"] * (n-1)
  else:
    sequence_start = list(prompt)

  final_product = list(sequence_start) # in here lies the final text

  for i in range(max_len):
    """
    PLAN INSIDE THE LOOP
    1) take the last n-1 words of what we have already produced
    2) look up their dict of continuations
    3) separate words and counts
    4) select a new word with "TASK C" codeblock
    5) append to final return list
    """

    # So, when generating, we can take the last n-1 words, look them up in the master dictionary,
    # and we get a dictionary of all seen continuations and their counts.

    # n IS A VARIABLE, SO final_product[-n+1:] IS THE SAME AS final_product[-(n-1):], I.E. THE LAST n-1 WORDS
    prefix=tuple(final_product[-n+1:])
    dict_of_prefix = ngrams[prefix] # THESE WERE GENERATED IN TASK B "MASTER DICTIONARY"
    words_from_dict = [] # INSERT ALL WORDS INSIDE THE DICT
    counts_from_dict = [] # INSERT ALL COUNTS OF CORRESPONDING WORDS INSIDE THE DICT

    for k,v in dict_of_prefix.items():
      words_from_dict.append(k)
      counts_from_dict.append(v)

    # GET THE INDEX OF THE NEW WORD USING THE "TASK C" FUNCTION
    # counts_from_dict IS A LIST OF INTS, temperature IS PASSED THROUGH UNCHANGED
    index_of_a_new_word = sample_from(counts_from_dict, temperature)
    final_product.append(words_from_dict[index_of_a_new_word])
    # The <eos> allows us to stop generating, and prevents a crash on unknown n-grams at the very end of a sequence.
    # (if an n-gram was seen only once at the end of a "training" sequence,
    # then an attempt to continue it during generation, would lead to a crash,
    # since we have no known n-gram to continue the sequence with our simple, unsmoothed model :)
    if final_product[-1] == "<eos>":
      break

  return final_product


for temp in (0.1,0.5,1.0,2.0,5.0):
    generated=generate(ngrams=ngrams,n=4,max_len=60,temperature=temp)
    print(f"Temp={temp}:")
    pprint(" ".join(generated))
    print("-----------")


Temp=0.1:
('<bos> <bos> <bos> Zeoy101 Really this has to be evaluated for what it speaks '
 'about a terrorist dirty bomb attack in London The film is kind of dumb I '
 'guess it required a woman to a secluded island that the Thornberries are '
 'filming on the adults and have them dwarfed by the system The boy never '
 'forgets about what happened to each person')
-----------
Temp=0.5:
('<bos> <bos> <bos> Ginger Snaps 2 there is no audience The nightclub scene '
 'has Nanette Workman singing Call girl and Les Fros Brossuere performing a '
 'Mikado like take off The writer hasn t stop for a minute reckons that MaxÕs '
 'sights are on a class reunion This is a Brit formerly of the NYPD since '
 'their interest in American broadcasts <eos>')
-----------
Temp=1.0:
('<bos> <bos> <bos> Marjorie a young woman through the matrimonial ads of a '
 'newspaper Office fool Charley at first gets the stigmata she also defies the '
 'laws of crap by producing new levels of low This is truly an innova
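The generation loop from TASK D can also be traced by hand on a tiny dictionary. The sketch below is not the sampling version used above: it swaps in a greedy choice (always the most frequent continuation) so that the result is deterministic, and it stops instead of crashing when a prefix is unknown. The toy dictionary and the name `greedy_generate` are assumptions for illustration.

```python
toy_ngrams = {
    ("<bos>", "<bos>"): {"the": 3},
    ("<bos>", "the"): {"cat": 3},
    ("the", "cat"): {"sat": 2, "ran": 1},
    ("cat", "sat"): {"<eos>": 2},
    ("cat", "ran"): {"<eos>": 1},
}

def greedy_generate(ngrams, n, max_len):
    """Generate by always picking the most frequent continuation of the last n-1 words."""
    sequence = ["<bos>"] * (n - 1)
    for _ in range(max_len):
        prefix = tuple(sequence[-(n - 1):])
        continuations = ngrams.get(prefix)
        if not continuations:
            # Unseen prefix: nothing to continue with, so stop instead of raising KeyError.
            break
        sequence.append(max(continuations, key=continuations.get))
        if sequence[-1] == "<eos>":
            break
    return sequence

generated_toy = greedy_generate(toy_ngrams, n=3, max_len=10)
print(" ".join(generated_toy))
```

With a user-supplied prompt an unknown prefix becomes possible even in the middle of a sequence, which is where a `.get` lookup like this one would matter.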

# REFLECTION ON THE EXERCISE

* Not much of the code is my own, as a whole lot was given
* Took hints from the model solution, as this was a completely new task for me
* The concept was hard, the coding relatively easy
* It took time to fully understand the concept
* Time used for the exercise: roughly 4-5 hours