In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com

# Experimenting with HuggingFace - Text Generation



**In this notebook, we will explore different decoding methods such as Beam Search, Top-K sampling, and Top-P sampling, demonstrating their performance along the way. This project is a work in progress and I will update it regularly as I learn more about text generation. Feel free to comment with any questions or suggestions.**



In [2]:
# for reproducibility
SEED = 70

# maximum number of tokens in the output text (the prompt counts toward this limit)
MAX_LEN = 108

# I. Intro

**GPT-2 is capable of next-word prediction on a gigantic scale: the `gpt2-large` checkpoint used below has about 774 million parameters.**

**Transformers makes it very easy to import this model with both PyTorch and TensorFlow. In this notebook we will be using TensorFlow, but it is just as easy in PyTorch. Both the model and its tokenizer can be imported from the `transformers` library, which anyone can install by running `!pip install transformers`. We begin with our input sequence:**

In [3]:
input_sequence = "after such a tiring day at work, i come back and "

In [4]:
!pip install --upgrade pip
!pip install --upgrade jax jaxlib

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting pip
  Downloading pip-22.1.1-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.1
    Uninstalling pip-22.1:
      Successfully uninstalled pip-22.1
Successfully installed pip-22.1.1
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


In [5]:
import requests

r = requests.get("http://google.com")       
print(r.status_code)

# 200

200


**Uncomment the `gpt2-medium` or `gpt2` (small) lines below, and comment out the `gpt2-large` ones, if you want to run text generation with a smaller version of the model.**

In [7]:
#get transformers
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

#get large GPT2 tokenizer and GPT2 model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
GPT2 = TFGPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id =tokenizer.eos_token_id)

#tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
#GPT2 = TFGPT2LMHeadModel.from_pretrained("gpt2-medium", pad_token_id=tokenizer.eos_token_id)

#tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
#GPT2 = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

#view model parameters
GPT2.summary()

2022-05-26 06:44:02.430507: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLaye  multiple                 774030080 
 r)                                                              
                                                                 
Total params: 774,030,080
Trainable params: 774,030,080
Non-trainable params: 0
_________________________________________________________________


# II. Different Decoding Methods

## First Pass (Greedy Search)

**For background on search strategies such as depth-first search and backtracking, GeeksforGeeks has good introductions. None of that is needed to follow this section.**

**With Greedy search, the word with the highest probability is predicted as the next word i.e. the next word is updated via:**

$$w_t = \arg\max_{w} P(w|w_{1:t-1})$$

**at each timestep $t$. Let's see how this naive approach performs:**
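
**As a minimal sketch of this rule, assuming a tiny hand-written next-word table (`TOY_MODEL` and its probabilities are hypothetical, not GPT-2 outputs), greedy decoding simply takes the argmax at every step:**

```python
# Hypothetical next-word probabilities keyed by the previous word.
TOY_MODEL = {
    "i":     {"am": 0.6, "was": 0.4},
    "am":    {"tired": 0.7, "happy": 0.3},
    "was":   {"tired": 0.5, "happy": 0.5},
    "tired": {"<eos>": 1.0},
    "happy": {"<eos>": 1.0},
}

def greedy_decode(model, start, max_len=10):
    """Repeatedly append the single most probable next word."""
    words = [start]
    while len(words) < max_len:
        next_probs = model.get(words[-1])
        if not next_probs:
            break  # no known continuation
        best = max(next_probs, key=next_probs.get)  # argmax over candidates
        if best == "<eos>":
            break  # end-of-sequence marker reached
        words.append(best)
    return words

print(greedy_decode(TOY_MODEL, "i"))  # ['i', 'am', 'tired']
```

**The `generate` call in the cells that follow applies the same idea over GPT-2's full vocabulary.**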

In [8]:
#get deep learning basics
import tensorflow as tf
tf.random.set_seed(SEED)

In [9]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode(input_sequence, return_tensors='tf')

# generate text until the output length (which includes the context length) reaches MAX_LEN tokens
greedy_output = GPT2.generate(input_ids, max_length = MAX_LEN)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens = True))

Output:
----------------------------------------------------------------------------------------------------
after such a tiring day at work, i come back and ive been working on my new project. i have been working on a new game called "The Last of Us" and i have been working on it for about a year now. i have been working on this game for about a year now. i have been working on this game for about a year now. i have been working on this game for about a year now. i have been working on this game for about a year now. i have been working on this


**And there we go: generating text is that easy. The results are not great, though. As we can see, the model starts repeating itself rather quickly, as if it were stuck in a loop. The main issue with Greedy Search is that a high-probability word can be hidden behind a low-probability word that precedes it, so the model never gets to explore those combinations of words. We can mitigate this with Beam Search, which compares several alternative paths, combined with a penalty on repeated n-grams:**


## Beam Search with N-Gram Penalties

**Beam Search is essentially Greedy Search, except that the model keeps the `num_beams` most likely hypotheses at each time step, so it can compare alternative paths as it generates text. We can also include an n-gram penalty by setting `no_repeat_ngram_size = 2`, which ensures that no 2-gram appears twice. We will also set `num_return_sequences = 5` so we can see what all 5 beams look like.**
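
**To make the idea concrete, here is a minimal sketch of beam search on a toy next-word table. The table `TOY_MODEL`, its probabilities, and the helper `beam_search` are hypothetical and only illustrate the technique; this is not how `transformers` implements it. With one beam the search reduces to Greedy Search, while two beams find a more probable sentence hidden behind a less probable first word:**

```python
import math

# Hypothetical next-word probabilities keyed by the previous word.
TOY_MODEL = {
    "the":  {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog":  {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car":  {"is": 0.6, "drives": 0.4},
}

def beam_search(model, start, steps, num_beams):
    """Keep the `num_beams` highest-scoring partial sequences at each step."""
    beams = [([start], 0.0)]  # (words, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for words, score in beams:
            for word, prob in model.get(words[-1], {}).items():
                candidates.append((words + [word], score + math.log(prob)))
        if not candidates:
            break  # no beam can be extended any further
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

greedy = beam_search(TOY_MODEL, "the", steps=2, num_beams=1)[0]
best = beam_search(TOY_MODEL, "the", steps=2, num_beams=2)[0]
print(greedy[0], round(math.exp(greedy[1]), 2))  # ['the', 'nice', 'woman'] 0.2
print(best[0], round(math.exp(best[1]), 2))      # ['the', 'dog', 'has'] 0.36
```

**Scores are summed in log space so that long sequences do not underflow.**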

**To use Beam Search, we need only modify some parameters in the `generate` function:**

In [10]:
# set num_return_sequences > 1 to get several beams back
beam_outputs = GPT2.generate(
    input_ids, 
    max_length = MAX_LEN, 
    num_beams = 5, 
    no_repeat_ngram_size = 2, 
    num_return_sequences = 5, 
    early_stopping = True
)

print('')
print("Output:\n" + 100 * '-')

# now we have 5 output sequences
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))


Output:
----------------------------------------------------------------------------------------------------
0: after such a tiring day at work, i come back and ive never been so happy in my life.

This is the best gift i have ever received. Thank you so much!
1: after such a tiring day at work, i come back and ive never been so happy in my life.

This is the best gift i have ever received. Thank you so much.
2: after such a tiring day at work, i come back and ive never been so happy in my life.

This is the best gift i have ever received. Thank you so much. I love it.
3: after such a tiring day at work, i come back and ive never been so happy in my life.

This is the best gift i have ever received. Thank you so very much.
4: after such a tiring day at work, i come back and ive never been so happy in my life.

This is the best gift i have ever received. Thank you so much. I love it!


**It would make much more sense to use sampling if we want variety between the outputs. We will get to that shortly.**

**Now that's much better! The 5 beam hypotheses are pretty much all the same, but if we increased `num_beams`, we would likely see more variation between the separate beams. Of course, Beam Search is not perfect either. It works well when the length of the generated text is more or less predictable, as in translation or summarization, but not so well for open-ended problems like dialog or story generation (because it is much harder to find a balance between `num_beams` and `no_repeat_ngram_size`).**

**Furthermore, [research](https://arxiv.org/abs/1904.09751) shows that human language does not follow this 'high probability word next' distribution. This makes sense: if my words were exactly what you expected them to be, I would be quite a boring person, and most people don't want to be boring! The graph below compares the word probabilities of Beam Search output with those of actual human text: ![alt text](https://blog.fastforwardlabs.com/images/2019/05/Screen_Shot_2019_05_08_at_3_06_36_PM-1557342561886.png)**

Taken from original paper [here](https://arxiv.org/abs/1904.09751)
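
**As a closing note on this section, the effect of `no_repeat_ngram_size` can be sketched as a lookup of which next words would complete an n-gram that has already been generated. The helper `banned_next_tokens` is hypothetical and works on plain words rather than GPT-2 token ids:**

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the words that would complete an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last (n - 1) words form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        if tuple(tokens[i:i + ngram_size - 1]) == prefix:
            banned.add(tokens[i + ngram_size - 1])
    return banned

history = "i have been working on this game and i have".split()
print(banned_next_tokens(history, 2))  # {'been'}
```

**With 2-grams banned, "have been" cannot appear a second time, which is the kind of loop we saw in the Greedy Search output.**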

## Basic Sampling

**Now we will explore non-deterministic decoding: sampling. Instead of following a strict path to find the output text with the highest probability, we randomly pick the next word according to its conditional probability distribution:**

$$w_t \sim P(w|w_{1:t-1})$$

**However, when we include this randomness, the generated text tends to be incoherent (see more [here](https://arxiv.org/pdf/1904.09751.pdf)). To counter this, we can use the `temperature` parameter: a value below 1 increases the chances of high-probability words and decreases the chances of low-probability words in the sampling.**
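
**A minimal sketch of what temperature does, assuming it divides the raw scores (logits) before the softmax. The logits here are hypothetical numbers, not GPT-2 outputs:**

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate words
for t in (1.0, 0.8, 0.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

**As the temperature drops below 1, probability mass moves toward the highest-scoring word; as it approaches 0, sampling behaves more and more like Greedy Search.**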

**We just need to set `do_sample = True` to implement sampling and for demonstration purposes (you'll shortly see why) we set `top_k = 0`:**

In [11]:
# use temperature to decrease the sensitivity to low probability candidates
sample_output = GPT2.generate(
                             input_ids, 
                             do_sample = True, 
                             max_length = MAX_LEN, 
                             top_k = 0, 
                             temperature = 0.8
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True))

Output:
----------------------------------------------------------------------------------------------------
after such a tiring day at work, i come back and ive got a sand tan and no colors from my car. i run to the car wash and get some leopard skin leather and paint it over the whole car for my friends to see. so now i have a fake tan and a fake paint job on my car. then i get the black and red sunbed and put it over the fake tan and the fake paint job on my car. then i install the sunbed over the fake tan and the fake paint


## Top-K Sampling

**In Top-K sampling, the k most likely next words are selected and the entire probability mass is redistributed among these k words. So instead of increasing the chances of high-probability words occurring and decreasing the chances of low-probability words, we just remove low-probability words altogether.**
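
**A minimal sketch of that filtering step on a toy distribution. The helper `top_k_filter` and the probabilities are hypothetical and stand in for what happens to GPT-2's vocabulary at each step:**

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable words and renormalise their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(prob for _, prob in top)
    return {word: prob / total for word, prob in top}

# Hypothetical next-word distribution.
probs = {"sleep": 0.4, "eat": 0.3, "cry": 0.15, "dance": 0.1, "yodel": 0.05}
filtered = top_k_filter(probs, k=2)
print(filtered)

# Sample the next word from the filtered distribution.
rng = random.Random(70)
print(rng.choices(list(filtered), weights=list(filtered.values()), k=5))
```

**Only "sleep" and "eat" can ever be drawn here, however many times we sample.**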

**We just need to set `top_k` to however many of the top words we want to consider for our conditional probability distribution:**

In [12]:
#sample from only top_k most likely words
sample_output = GPT2.generate(
                             input_ids, 
                             do_sample = True, 
                             max_length = MAX_LEN, 
                             top_k = 50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')

Output:
----------------------------------------------------------------------------------------------------
after such a tiring day at work, i come back and ive been wondering how much longer i can take the bus by myself from the office, and how many more years will it take before i just give up on the bus, and live by myself for a while. i have had a nice trip to a resort in the northern region of china two months ago, but i was just too tired to move after 4 hours of walking. so i went to a hotel and waited. and then i decided to get on the ...


**Top-K Sampling seems to generate more sensible text than our unrestricted sampling before. But we can do even better:**

## Top-P Sampling

**Top-P sampling (also known as nucleus sampling) is similar to Top-K, but instead of choosing the top k most likely words, we choose the smallest set of words whose total probability reaches p, and then the entire probability mass is redistributed among the words in this set.**
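
**A minimal sketch of the nucleus filter on two toy distributions. The helper `top_p_filter` and the probabilities are hypothetical, and the exact handling of the cut-off in `transformers` may differ slightly:**

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable words whose total probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}

# Hypothetical distributions: one peaked, one flat.
peaked = {"sleep": 0.85, "eat": 0.1, "cry": 0.05}
flat = {"sleep": 0.3, "eat": 0.25, "cry": 0.2, "dance": 0.15, "yodel": 0.1}
print(len(top_p_filter(peaked, 0.8)))  # 1
print(len(top_p_filter(flat, 0.8)))    # 4
```

**The same value of p keeps one word when the model is confident and four when it is not.**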

**The main difference here is that with Top-K sampling, the size of the set of words is static (obviously) whereas in Top-P sampling, the size of the set can change. To use this sampling method, we just set `top_k = 0` and choose a value `top_p`:**

In [13]:
# sample only from the smallest set of words whose cumulative probability reaches 0.8
sample_output = GPT2.generate(
                             input_ids, 
                             do_sample = True, 
                             max_length = MAX_LEN, 
                             top_p = 0.8, 
                             top_k = 0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')

Output:
----------------------------------------------------------------------------------------------------
after such a tiring day at work, i come back and ive got to tell you all about it. i don't know how it will help you but i promise you it will. i found the best sex toy on the market.

he masturbates almost every day.

halloween

I have a new friend for Halloween! We talk everyday and he still doesn't know what to wear. When he wore his costume we went shopping and we found this fake penis from the grocery store. i gave it ...


**Now let's do something more ambitious: combine Top-K with the dynamic selection size of Top-P.**

## Top-K and Top-P Sampling

**As you could have probably guessed, we can use both Top-K and Top-P sampling here. This reduces the chances of us getting weird (low-probability) words while still allowing for a dynamic selection size. We need only set a value for both `top_k` and `top_p`. We can even include the temperature parameter from before if we want to. Let's see how our model performs after adding everything together. We will return 5 sequences to see how diverse the outputs are.**
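
**Before running the full model, here is a minimal sketch of the combined filter on a toy distribution. It assumes Top-K is applied first and Top-P is then applied to the renormalised survivors; the helper `top_k_top_p_filter` and the probabilities are hypothetical:**

```python
def top_k_top_p_filter(probs, k, p):
    """Apply a top-k cut, then a nucleus (top-p) cut on the renormalised survivors."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total_k = sum(prob for _, prob in ranked)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob / total_k  # probability after the top-k renormalisation
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}

# Hypothetical distributions: one flat, one peaked.
flat = {"sleep": 0.3, "eat": 0.25, "cry": 0.2, "dance": 0.15, "yodel": 0.1}
peaked = {"sleep": 0.85, "eat": 0.1, "cry": 0.05}
print(len(top_k_top_p_filter(flat, k=3, p=0.8)))    # 3
print(len(top_k_top_p_filter(peaked, k=3, p=0.8)))  # 1
```

**Top-K caps the candidate set at 3 words for the flat distribution, while Top-P shrinks it to a single word when the distribution is peaked.**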

In [14]:
#combine both sampling techniques
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True, 
                              max_length = 2*MAX_LEN,                              # doubled to test how long the text can get while staying coherent
                              #temperature = .7,
                              top_k = 50, 
                              top_p = 0.85, 
                              num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')

Output:
----------------------------------------------------------------------------------------------------
0: after such a tiring day at work, i come back and ive just got to make a point and say the word 'bitch'. I really have no idea what to say or do so i just stand there and stare into space. I can feel a wave of disgust roll over me. And then the light goes on.

I look up and see that my co-worker has come over and is in tears. "I thought you were going to say something nice about my mother," she says. "It was a horrible day." I say "I'm sorry." I'm so happy she's okay. I'm glad she's on her feet again. I look at the coworker and say "I'm sorry too" and walk away.

When I'm with my colleagues, I'm often reminded that it's okay to be mad. It's good to be angry at those who hurt us. Anger is healthy. It's good to be angry, it's healthy to be angry.

But anger isn't the only emotion to...

1: after such a tiring day at work, i come back and ive never felt so much better than in tha

# III. Benchmark Prompts

**Here, we will see how well the model does when given some more interesting inputs.**

In [15]:
MAX_LEN = 150

In [16]:
prompt1 = 'In a shocking finding, scientist discovered a herd of monsters with three gentials living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the monsters spoke perfect English.'

input_ids = tokenizer.encode(prompt1, return_tensors='tf')

In [17]:
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True, 
                              max_length = MAX_LEN,                              #to test how long we can generate and it be coherent
                              #temperature = .8,
                              top_k = 50, 
                              top_p = 0.85 
                              #num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')

Output:
----------------------------------------------------------------------------------------------------
0: In a shocking finding, scientist discovered a herd of monsters with three gentials living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the monsters spoke perfect English.

The study was published in the journal Current Biology.

"When they spoke, they were perfectly articulate," said Professor Paul Eberhard, a co-author of the study from the School of Earth Sciences at the University of Colorado at Boulder. "They were saying the things that we say when we're thinking of people."

The researchers, who are calling the creatures "bizarre and unusual," said the monsters lived in a region of South America known as the Santa Rosa Pampa,...



**Well, next I'll let the model take a stab at my friend's dissertation project, using its opening line on cross-cultural management as the prompt:**

In [18]:
prompt2 = "Cross-cultural management helps understanding the intricasies of business"

input_ids = tokenizer.encode(prompt2, return_tensors='tf')

In [19]:
sample_outputs = GPT2.generate(
                              input_ids,
                              do_sample = True, 
                              max_length = MAX_LEN,                              #to test how long we can generate and it be coherent
                              #temperature = .8,
                              top_k = 50, 
                              top_p = 0.85 
                              #num_return_sequences = 5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
    print('')

Output:
----------------------------------------------------------------------------------------------------
0: Cross-cultural management helps understanding the intricasies of business, but it doesn't do it alone. It doesn't tell you how to solve problems, or how to build new ones. The same is true for the concept of intercultural communication. It's not easy to learn to communicate in different cultures and cultures, and that's just the way it is.

One way I've come to understand that is by trying to be more comfortable with my Spanish and having conversations with people in my culture, particularly with people who are also bilingual. For me, the Spanish is like the Italian to English: I can't speak either of them fluently, but I can understand both.

As a result, I find it very easy...

