<a href="https://colab.research.google.com/github/kiransoorya/Prodigy_Infotech/blob/PRODIGY_GA_01/Generate_Text_From_A_Line.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This code is a replica of https://huggingface.co/blog/how-to-generate

# Thanks to Patrick von Platen for this amazing article

# See https://huggingface.co/ to learn more about the transformers module.

# This module is mostly used for text classification, text generation (which is used here), and named entity recognition

In [2]:
!pip install -q transformers


In [3]:
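#AutoModelForCausalLM -> loads a causal language model, which predicts each token from the tokens before it
#AutoTokenizer -> loads the matching tokenizer, which converts text to token IDs and back
#Auto classes -> high-level interface that picks the right classes from the checkpoint name: simpler code, flexibility, consistency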
from transformers import AutoModelForCausalLM, AutoTokenizer

#used for developing and training neural network models
import torch

#This line checks if CUDA (NVIDIA's parallel computing architecture) is available, indicating that a compatible GPU is present.
#If a GPU is available, torch_device is set to "cuda", otherwise it falls back to "cpu".
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

#This line loads a pre-trained tokenizer for the GPT-2 model. The tokenizer is responsible for converting text into tokens (numerical representations) that the model can understand.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
#to(torch_device) method moves the model to the appropriate device (cuda for GPU or cpu), enabling efficient computation
model = AutoModelForCausalLM.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id).to(torch_device)




In [4]:
#Greedy search -> always picks the single most likely next token. It tends to produce repetitive text, and it misses
#a high-probability token that is hidden behind a lower-probability one, because it never looks ahead.
# encode the context the generation is conditioned on, as PyTorch tensors
user_prompt = input("Enter text to generate prompt:")
model_inputs = tokenizer(user_prompt, return_tensors='pt').to(torch_device)

# generate 40 new tokens (tokens are subword units, not necessarily whole words)
greedy_output = model.generate(**model_inputs, max_new_tokens=40)

#output
print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))


Enter text to generate prompt:Hi my name is Kiran , it's a nice weather today .


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
Hi my name is Kiran, it's a nice weather today. I'm going to go to the beach and I'm going to go to the beach and I'm going to go to the beach and I'm going to go to the beach and I'm going to
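
Why greedy search can miss a better continuation is easier to see on a tiny example. The sketch below is not how `model.generate` is implemented; `TOY_MODEL` and `greedy_decode` are hypothetical names, and the probabilities are made up for illustration.

```python
# Toy next-word probabilities (made-up values, for illustration only).
TOY_MODEL = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "and": 0.05, "runs": 0.05},
    ("The", "car"): {"is": 0.5, "drives": 0.3, "turns": 0.2},
}


def greedy_decode(prefix, steps):
    """Pick the most likely next word at each step; return (sequence, probability)."""
    sequence, probability = tuple(prefix), 1.0
    for _ in range(steps):
        candidates = TOY_MODEL[sequence]
        word = max(candidates, key=candidates.get)
        probability *= candidates[word]
        sequence += (word,)
    return sequence, probability


greedy_sequence, greedy_probability = greedy_decode(("The",), steps=2)
# Probability of the path that greedy search never explores.
hidden_probability = TOY_MODEL[("The",)]["dog"] * TOY_MODEL[("The", "dog")]["has"]
print(greedy_sequence, greedy_probability, hidden_probability)
```

Greedy search commits to "nice" because it is the most likely first word, so it never reaches "has", which would have made "The dog has" the more probable sequence overall.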


In [5]:
#Beam search -> keeps several candidate sequences at each step instead of one, but repetitions still occur
# activate beam search and early_stopping
beam_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5, #focus here -> keeps the 5 most likely candidate sequences at each step; larger values cost more computation
    early_stopping=True #focus here -> stops the search as soon as num_beams complete candidates have been found; output length is still capped by max_new_tokens
)

#output
print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
Hi my name is Kiran, it's a nice weather today.

My name is Kiran, it's a nice weather today.

My name is Kiran, it's a nice weather today.

My name is Kiran, it's
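
The idea behind beam search can be sketched on a toy model. This is a simplified illustration, not the library's implementation: `TOY_MODEL` and `beam_search` are hypothetical names, the probabilities are made up, and real beam search works on log-probabilities and handles end-of-sequence tokens.

```python
# Toy next-word probabilities (made-up values, for illustration only).
TOY_MODEL = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "and": 0.05, "runs": 0.05},
    ("The", "car"): {"is": 0.5, "drives": 0.3, "turns": 0.2},
}


def beam_search(prefix, steps, num_beams):
    """Keep the `num_beams` most probable sequences after every step."""
    beams = [(tuple(prefix), 1.0)]
    for _ in range(steps):
        expanded = []
        for sequence, probability in beams:
            for word, p in TOY_MODEL[sequence].items():
                expanded.append((sequence + (word,), probability * p))
        # Sort all extensions by probability and keep only the best ones.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:num_beams]
    return beams


two_beams = beam_search(("The",), steps=2, num_beams=2)
one_beam = beam_search(("The",), steps=2, num_beams=1)
print(two_beams[0])
print(one_beam[0])
```

With two beams the search keeps "The dog" alive after the first step and ends up preferring "The dog has". With a single beam it behaves like greedy search.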


In [6]:
# set no_repeat_ngram_size to 2
beam_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    no_repeat_ngram_size=2, #focus here -> no 2-gram (pair of consecutive tokens) may appear twice; this is a hard ban, not a soft penalty
    early_stopping=True
)

#output
print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
Hi my name is Kiran, it's a nice weather today. I'm going to go out and buy some food for my family and I'll be back soon.

I have been looking for a place to stay for the past few months. I've been
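
A rough sketch of what an n-gram ban has to compute at each step: which next tokens would complete an n-gram that already occurred. `banned_next_tokens` is a hypothetical helper, and words stand in for token IDs here; the library's own logic may differ in detail.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already present in `tokens`.

    Assumes ngram_size >= 1.
    """
    prefix_length = ngram_size - 1
    if len(tokens) < ngram_size:
        return set()  # no complete n-gram exists yet
    # The last (n - 1) tokens form the start of the n-gram being generated.
    current_prefix = tuple(tokens[len(tokens) - prefix_length:])
    banned = set()
    for start in range(len(tokens) - prefix_length):
        if tuple(tokens[start:start + prefix_length]) == current_prefix:
            banned.add(tokens[start + prefix_length])
    return banned


banned_after_to = banned_next_tokens(["going", "to", "go", "to"], ngram_size=2)
banned_after_the = banned_next_tokens(["the", "beach", "and", "the"], ngram_size=2)
print(banned_after_to, banned_after_the)
```

In the first call "to go" has already appeared, so "go" may not follow "to" again. A decoder would give the banned tokens zero probability before choosing the next token.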


In [7]:
# set num_return_sequences > 1 (it must not exceed num_beams)
beam_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5, #focus here -> return the 5 highest-scoring beams instead of only the best one
    early_stopping=True
)

# now we have 5 output sequences
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
0: Hi my name is Kiran, it's a nice weather today. I'm going to go out and buy some food for my family and I'll be back soon.

I have been looking for a place to stay for the past few months. I've been
1: Hi my name is Kiran, it's a nice weather today. I'm going to go out and buy some food for my family and I'll be back soon.

I have been looking for a place to stay for some time now. I've been staying
2: Hi my name is Kiran, it's a nice weather today. I'm going to go out and buy some food for my family and I'll be back soon.

I have been looking for a place to stay for some time now. I've been in
3: Hi my name is Kiran, it's a nice weather today. I'm going to go out and buy some food for my family and I'll be back soon.

I have been looking for a place to stay for the past few months. I have a
4: Hi my name is Kiran, it's a nice weather today. I'm going to go out and buy some f

In [8]:
#Creating randomness
# set seed to reproduce results. Feel free to change the seed though to get different results
from transformers import set_seed
set_seed(42) #fix the random seed so that sampling gives the same result on every run

# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
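    do_sample=True, #focus here -> draw the next token at random from the model's probability distribution instead of taking the most likely one
    top_k=0 #focus here -> disables top-k filtering, so any token in the vocabulary can be drawn, including very unlikely ones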
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
Hi my name is Kiran, it's a nice weather today.
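
Sampling and seeding can be illustrated with the standard library alone. This is only an analogy for what `do_sample=True` and `set_seed` do: `sample_words` is a hypothetical helper, and the distribution is made up.

```python
import random


def sample_words(distribution, count, seed):
    """Draw `count` words at random, weighted by probability, from a seeded generator."""
    rng = random.Random(seed)
    words = list(distribution)
    weights = list(distribution.values())
    return rng.choices(words, weights=weights, k=count)


distribution = {"nice": 0.5, "dog": 0.4, "car": 0.1}
first_run = sample_words(distribution, count=10, seed=42)
second_run = sample_words(distribution, count=10, seed=42)
print(first_run)
print(first_run == second_run)
```

The same seed gives the same draws, which is why the cells here reset the seed before each sampling call.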


In [9]:
# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(42)

# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=0,
    temperature=0.6, #focus here -> the logits are divided by the temperature before the softmax; a value below 1 sharpens the distribution,
    #i.e. likely tokens become more likely and unlikely tokens less likely, which reduces incoherent output
)

#output
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
Hi my name is Kiran, it's a nice weather today. It's so warm. I can't wait to see what you're doing with that sunburn.

Rated 5 out of 5 by Mics from Great for the price I bought this for my
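
The effect of temperature can be checked numerically. In this minimal sketch the logits are made up and `softmax_with_temperature` is a hypothetical helper; it shows that a temperature below 1 concentrates probability on the most likely token, while a temperature above 1 flattens the distribution.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exponentials = [math.exp(value - peak) for value in scaled]
    total = sum(exponentials)
    return [value / total for value in exponentials]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
plain = softmax_with_temperature(logits, 1.0)
sharp = softmax_with_temperature(logits, 0.6)
flat = softmax_with_temperature(logits, 1.5)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in plain])
print([round(p, 3) for p in flat])
```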


In [10]:
# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(42)

# set top_k to 50
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50 #focus here -> only the 50 most likely next tokens are kept, and the next token is sampled from among them
)

#output
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
Hi my name is Kiran, it's a nice weather today.
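
Top-k filtering itself is a small operation: keep the k most likely tokens and rescale their probabilities so they sum to 1. A minimal sketch, with a made-up distribution and a hypothetical helper `top_k_filter`:

```python
def top_k_filter(probabilities, k):
    """Keep the k most likely words and renormalize their probabilities."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(p for _, p in kept)
    return {word: p / total for word, p in kept}


distribution = {"nice": 0.5, "dog": 0.3, "car": 0.15, "boat": 0.05}
filtered = top_k_filter(distribution, k=2)
print(filtered)
```

Sampling then draws only from the filtered distribution, so the unlikely words "car" and "boat" can no longer be chosen.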


In [11]:
# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(42)

# set top_p to 0.92 and deactivate top_k
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.92, #nucleus (top-p) sampling -> sample only from the smallest set of most likely tokens whose cumulative probability reaches 0.92
    top_k=0
)

#output
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
Hi my name is Kiran, it's a nice weather today.
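
Unlike top-k, the number of tokens kept by top-p depends on the shape of the distribution. The sketch below uses made-up distributions and a hypothetical helper `top_p_filter`; the library's handling of the boundary case may differ slightly.

```python
def top_p_filter(probabilities, top_p):
    """Keep the smallest set of most likely words whose cumulative probability reaches top_p."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, p in ranked:
        kept.append((word, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {word: p / cumulative for word, p in kept}


flat_distribution = {"nice": 0.5, "dog": 0.3, "car": 0.15, "boat": 0.05}
peaked_distribution = {"nice": 0.95, "dog": 0.03, "car": 0.02}
flat_nucleus = top_p_filter(flat_distribution, top_p=0.92)
peaked_nucleus = top_p_filter(peaked_distribution, top_p=0.92)
print(sorted(flat_nucleus), sorted(peaked_nucleus))
```

The flatter distribution keeps three candidates, while the peaked one keeps a single word, so the candidate set grows and shrinks with the model's confidence.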


In [12]:
# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(42)

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
)

#output
print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
0: Hi my name is Kiran, it's a nice weather today.
1: Hi my name is Kiran, it's a nice weather today.

I would love to come out to you in person, we'll meet again and see how it goes.

Hope to see you in heaven, we'll meet again and see how it
2: Hi my name is Kiran, it's a nice weather today.

You should get in there and bring your bike and your friends to join you on the journey.

I've ridden my bike for about 1 week.

I hope to see you
