In [1]:
!pip install rich transformers torch accelerate bitsandbytes peft > /dev/null

Task: Text Generation

Each task has its own default model in the pipeline.

Language models are commonly trained with one of two objectives:

- Causal Language Modeling

- Masked Language Modeling

Another way to group generation models is

- Text Generation

- Text-to-Text Generation models


Variety of LMs in HuggingFace

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard


Some text generation tasks

Code Generation: Models trained to generate code

https://huggingface.co/spaces/bigcode/bigcode-playground

Instruction Models: Models fine-tuned to follow instructions

Story Generation: The model continues a story from an opening prompt


Quantization with BitsAndBytes is touched on

- Understanding how the models are shrunk and loaded.
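
As a rough illustration of how quantization shrinks a model, the sketch below rounds a handful of weights to 4-bit integers with a single absmax scale and estimates weight-only memory at two precisions. It is a simplified stand-in and not the blockwise scheme that bitsandbytes applies; the weight values and the 7-billion parameter count are made up for illustration.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [q * scale for q in q_weights]


def weight_memory_gb(n_params, bits_per_param):
    """Weights-only estimate; activations and caches add to this."""
    return n_params * bits_per_param / 8 / 1e9


weights = [0.7, -0.33, 0.12, -0.04]   # hypothetical weights
q, scale = quantize_absmax(weights, bits=4)
restored = dequantize(q, scale)

print(q)                                   # small integers that fit in 4 bits
print([round(r, 3) for r in restored])     # close to, but not equal to, the originals
print(weight_memory_gb(7e9, 16), weight_memory_gb(7e9, 4))
```

Measured VRAM usage will differ from these weight-only numbers, since loading overhead, offloading and inference buffers are not modelled here.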



In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [6]:
from transformers import pipeline

path = "distilbert/distilgpt2"

generator = pipeline('text-generation',
                     model = path)

generator("Hello, I'm a language model",
          max_length = 30,
          num_return_sequences=3)

# Will provide 3 generated statements

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model.\n\nI do intend to use these projects to improve the language over time. This also means that I am"},
 {'generated_text': "Hello, I'm a language model I'm used to, but it means a lot of time. Being really good at it means you don't need"},
 {'generated_text': "Hello, I'm a language model, I do not have to reinvent the wheel as the whole language. This whole thing about a programming language has always"}]

In [None]:
[{'generated_text': "Hello, I'm a language model. I wrote a tutorial for what I used to write last month, so I'm going to focus on making that"},
 {'generated_text': "Hello, I'm a language model. I really am not in an all-encompassing way. I like to write complex code, write complex"},
 {'generated_text': "Hello, I'm a language model, I'm a programmer, and so on. My new, very big, new library is: my library!"}]

In [None]:
# - Text-to-Text generation models have a separate pipeline called text2text-generation.

# - This pipeline takes an input containing the sentence including the task and returns the output of the accomplished task.

In [None]:
# Instruction-tuned (instruct) model ---> agent-like use cases; given a lot of instructions/context
# Chat model ---> chatbots

In [8]:
from transformers import pipeline

text2text_generator = pipeline("text2text-generation")

text2text_generator("question: What is 42 ? context: 42 is the answer to life, the universe and everything")

No model was supplied, defaulted to google-t5/t5-base and revision 686f1db (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'generated_text': 'the answer to life, the universe and everything'}]

In [None]:
# Encoder ==> syntactic(positional) + semantic(embeddings), # decoder

#### How pipeline works: Dive into AutoClasses

The pipeline function is built on the AutoModel and AutoTokenizer classes to produce the generated text. We will look at how it could be implemented.
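
Before using the real AutoClasses, the three stages a text-generation pipeline goes through (tokenize, generate, decode) can be sketched with toy stand-ins. `ToyTokenizer`, `ToyModel` and `toy_pipeline` are hypothetical names invented for this sketch; they only mimic the flow and are not the transformers implementation.

```python
class ToyTokenizer:
    """Whitespace tokenizer over a tiny fixed vocabulary (hypothetical)."""

    def __init__(self, vocab):
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_token = dict(enumerate(vocab))

    def encode(self, text):
        return [self.token_to_id[tok] for tok in text.split()]

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)


class ToyModel:
    """Stands in for a causal LM: looks up the id that follows the last token."""

    def __init__(self, next_id):
        self.next_id = next_id

    def generate(self, input_ids, max_new_tokens):
        ids = list(input_ids)
        for _ in range(max_new_tokens):
            ids.append(self.next_id[ids[-1]])
        return ids


def toy_pipeline(prompt, tokenizer, model, max_new_tokens=3):
    input_ids = tokenizer.encode(prompt)                      # 1. text -> token ids
    output_ids = model.generate(input_ids, max_new_tokens)    # 2. ids -> longer ids
    return tokenizer.decode(output_ids)                       # 3. ids -> text


vocab = ["hello", "world", "again", "and"]
tokenizer = ToyTokenizer(vocab)
model = ToyModel(next_id={0: 1, 1: 3, 3: 2, 2: 0})
print(toy_pipeline("hello", tokenizer, model))
```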

In [9]:
from transformers import AutoModelForCausalLM
path = "distilbert/distilgpt2"
model = AutoModelForCausalLM.from_pretrained(path)
model.generation_config

GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256
}

In [10]:
prompt = "Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer:"

In [11]:
tokenizer = AutoTokenizer.from_pretrained(path)

In [12]:
model = AutoModelForCausalLM.from_pretrained(path,
                                             pad_token_id=0)

In [None]:
# Homework: run the cell below line by line and comment on each step

In [13]:
# Let's first look at how to pass the model into the pipeline

pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer)

result = pipe(prompt,
              max_new_tokens=60)[0]["generated_text"][len(prompt):]

result


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


' Do you think you know how to compute the number of times a second after a second? Or did the math to add a value from 0 to 1 every 1 seconds? Or is there a way to calculate the number of times a second after a second of a second?'
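
The `[len(prompt):]` slice in the cell above works because the text-generation pipeline returns the prompt followed by the continuation. A minimal sketch with a made-up result (`fake_result` and its contents are hypothetical, not real model output):

```python
prompt = "Question: What is 2 + 2?\n\nAnswer:"

# Hypothetical pipeline result: the prompt followed by the continuation.
fake_result = [{"generated_text": prompt + " 4"}]

# Dropping the first len(prompt) characters leaves only the new text.
continuation = fake_result[0]["generated_text"][len(prompt):]
print(repr(continuation))
```

The text-generation pipeline also accepts `return_full_text=False`, which should return only the continuation without manual slicing.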

In [None]:
# max_new_tokens: the maximum number of tokens to generate. In other words, the size of the output sequence,
# not including the tokens in the prompt. As an alternative to using the output’s length as a stopping criterion,
# you can choose to stop generation whenever the full generation exceeds some amount of time. To learn more, check StoppingCriteria.

# num_beams: by specifying a number of beams higher than 1, you are effectively switching from greedy
# search to beam search. This strategy evaluates several hypotheses at each time step and eventually
# chooses the hypothesis that has the overall highest probability for the entire sequence. This has
# the advantage of identifying high-probability sequences that start with lower-probability initial
# tokens and would’ve been ignored by greedy search. Visualize how it works in the beam search visualizer linked below.

# do_sample: if set to True, this parameter enables decoding strategies such as multinomial sampling,
# beam-search multinomial sampling, Top-K sampling and Top-p sampling. All these strategies select the
# next token from the probability distribution over the entire vocabulary with various strategy-specific adjustments.

# num_return_sequences: the number of sequence candidates to return for each input. This option is only
# available for the decoding strategies that support multiple sequence candidates, e.g. variations of
# beam search and sampling. Decoding strategies like greedy search and contrastive search
# return a single output sequence.
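
The `num_beams` note above can be made concrete with a toy example in which beam search finds a more probable sequence than greedy search because the best sequence starts with a lower-probability token. The `PROBS` table is invented for this sketch, and the functions are a simplified illustration rather than the transformers implementation.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
PROBS = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.55, "dog": 0.45},
    ("a",): {"bird": 0.9, "fish": 0.1},
}


def greedy(steps):
    """Always take the single most probable next token."""
    seq, logp = (), 0.0
    for _ in range(steps):
        token, p = max(PROBS[seq].items(), key=lambda kv: kv[1])
        seq, logp = seq + (token,), logp + math.log(p)
    return seq, logp


def beam_search(steps, num_beams):
    """Track the num_beams best partial sequences at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + (token,), logp + math.log(p))
            for seq, logp in beams
            for token, p in PROBS[seq].items()
        ]
        # Keep only the num_beams highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]


g_seq, g_logp = greedy(2)
b_seq, b_logp = beam_search(2, num_beams=2)
print(g_seq, round(math.exp(g_logp), 2))
print(b_seq, round(math.exp(b_logp), 2))
```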

In [None]:
# https://huggingface.co/spaces/m-ric/beam_search_visualizer

# https://huggingface.co/blog/optimize-llm

In [14]:
from transformers import AutoModelForSeq2SeqLM, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")

model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")


In [None]:
# Read the available options in the GenerationConfig definition used below (F12 jumps to the definition in the editor)

In [15]:
translation_generation_config = GenerationConfig(
    num_beams=4,
    early_stopping=True,
    decoder_start_token_id=0,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
)

In [17]:
from rich import print
print(type(translation_generation_config))
print(translation_generation_config)

In [18]:
# Tip: add `push_to_hub=True` to push to the Hub
translation_generation_config.save_pretrained("/tmp", "translation_generation_config.json")

# You could then use the named generation config file to parameterize generation
generation_config = GenerationConfig.from_pretrained("/tmp", "translation_generation_config.json")

In [19]:
inputs = tokenizer("translate English to French: Configuration files are easy to use!", return_tensors="pt")

In [21]:
print(inputs)

In [22]:
outputs = model.generate(**inputs, generation_config=generation_config)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:1 for open-end generation.


##### Practice with Gemma Model

Explore the model's inference process and its methods by executing the cells below. The decoding strategies are explored afterwards.

##### All the decoding strategies

https://huggingface.co/blog/how-to-generate

https://huggingface.co/docs/transformers/generation_strategies

Greedy Search

Contrastive Search

Multinomial Sampling

Beam-Search Decoding

Beam-Search Multinomial Sampling

Diverse Beam Search Decoding


In [23]:
# Gemma model will require your read access token.

from huggingface_hub import notebook_login

notebook_login()


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    torch_dtype=torch.bfloat16
)

In [None]:
torch.cuda.is_available()

True

In [None]:
model.to('cuda')

In [None]:
# not required to run this cell.

# llama_path = "/home/kamal/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/c1b0db933684edbfe29a06fa47eb19cc48025e93/"
import torch
gemma_path = "google/gemma-2b-it"

gemma = AutoModelForCausalLM.from_pretrained(
    # pretrained_model_name_or_path='/home/aicoder/.cache/huggingface/hub/models--google--gemma-2b-it/snapshots/718cb189da9c5b2e55abe86f2eeffee9b4ae0dad/
    gemma_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    # local_files_only=True  # this will stop the function from calling the hub for the model
) # takes 11GB of VRAM

In [None]:
from rich import print

In [None]:
gemma_tokenizer = AutoTokenizer.from_pretrained(gemma_path)
# print(llama_tokenizer.default_chat_template) # rich's print fails due to tag
print(gemma_tokenizer.default_chat_template)

In [None]:
print(gemma_tokenizer.special_tokens_map)

In [None]:
prompt = "Where there is a will"
gemma_input = gemma_tokenizer(prompt, return_tensors='pt').to('cuda')
# Review the line above: .to('cuda') raises an error when no CUDA device is available

In [None]:
gemma_input

{'input_ids': tensor([[   2, 6006, 1104,  603,  476,  877]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}

In [None]:
# First level of decoding customisation
gemma_output = model.generate(
    **gemma_input,
    max_new_tokens=100,
    do_sample=True,
    # pad_token_id = llama_tokenizer.eos_token_id
)

In [None]:
gemma_output

In [None]:
output = gemma_tokenizer.decode(gemma_output[0],skip_special_tokens=True)
output

"Where there is a will, there is a way.\n\nThis phrase expresses the concept that with a clear desire or purpose, it is possible to overcome obstacles and achieve one's goals. It reinforces the idea that hard work and persistence are essential for achieving success.\n\nHere are some examples of how this phrase can be used:\n\n- When faced with a difficult task, remind yourself that there is a will to succeed.\n- Set SMART goals and create a plan to achieve them.\n- If you have a dream"

In [None]:
from transformers import AutoModelForCausalLM, GenerationConfig

model = AutoModelForCausalLM.from_pretrained("my_account/my_model")
generation_config = GenerationConfig(
    max_new_tokens=50, do_sample=True, top_k=50, eos_token_id=model.config.eos_token_id
)
generation_config.save_pretrained("my_account/my_model", push_to_hub=True)

In [None]:
translation_generation_config = GenerationConfig(
    num_beams=4,
    early_stopping=True,
    decoder_start_token_id=0,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
)

# Tip: add `push_to_hub=True` to push to the Hub
translation_generation_config.save_pretrained("/tmp", "translation_generation_config.json")

# You could then use the named generation config file to parameterize generation
generation_config = GenerationConfig.from_pretrained("/tmp", "translation_generation_config.json")
inputs = tokenizer("translate English to French: Configuration files are easy to use!", return_tensors="pt")
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

In [None]:
# Greedy Generation
outputs = model.generate(**inputs)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

In [None]:
# Contrastive Search
prompt = "Hugging Face Company is"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=100)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
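
As a rough picture of what `penalty_alpha` does in the cell above: contrastive search scores each of the top-k candidates by its probability minus a penalty for being too similar to the context already generated. The sketch below uses made-up probabilities and two-dimensional "hidden states"; `contrastive_pick` is a hypothetical helper, not a transformers function.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_pick(candidates, context_states, penalty_alpha):
    """candidates: list of (token, probability, hidden_state) for the top-k tokens."""
    def score(cand):
        _, prob, state = cand
        # Degeneration penalty: highest similarity to any earlier token state.
        degeneration = max(cosine(state, h) for h in context_states)
        return (1 - penalty_alpha) * prob - penalty_alpha * degeneration
    return max(candidates, key=score)[0]


context_states = [[1.0, 0.0], [0.9, 0.1]]       # hypothetical states of earlier tokens
candidates = [
    ("again", 0.6, [1.0, 0.05]),    # likely, but nearly identical to the context
    ("instead", 0.4, [0.0, 1.0]),   # less likely, but dissimilar to the context
]

print(contrastive_pick(candidates, context_states, penalty_alpha=0.0))
print(contrastive_pick(candidates, context_states, penalty_alpha=0.6))
```

With `penalty_alpha=0.0` the choice reduces to picking the most probable candidate; a larger value trades probability for novelty.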

In [None]:
# Multinomial Sampling
from transformers import set_seed

set_seed(0)  # For reproducibility

prompt = "Today was an amazing day because"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
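
Sampling with `do_sample=True` draws the next token from the probability distribution instead of taking the maximum. The sketch below shows, on a made-up 4-token vocabulary, how temperature, top-k and top-p reshape that distribution before the draw. The helper names are hypothetical and the filtering is a simplified version of what the library's logits processors do.

```python
import math
import random


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Return the token indices that survive top-k and then top-p filtering."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    probs = softmax(logits, temperature)
    kept = filter_top_k_top_p(probs, top_k, top_p)
    # choices() normalizes the surviving weights before drawing.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]   # hypothetical scores for a 4-token vocabulary
rng = random.Random(0)
print(filter_top_k_top_p(softmax(logits), top_k=3))
print(filter_top_k_top_p(softmax(logits), top_p=0.8))
print([sample(logits, temperature=0.7, top_k=3, rng=rng) for _ in range(5)])
```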

In [None]:
# Beam Search Decoding

prompt = "It is astonishing how one can"

inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, num_beams=5, max_new_tokens=50)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

In [None]:
# Beam Search Multinomial Sampling

set_seed(0)  # For reproducibility

prompt = "translate English to German: The house is wonderful."

inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, num_beams=5, do_sample=True)
tokenizer.decode(outputs[0], skip_special_tokens=True)

In [None]:
# Diverse beam search decoding

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

prompt = (
    "The Permaculture Design Principles are a set of universal design principles "
    "that can be applied to any location, climate and culture, and they allow us to design "
    "the most efficient and sustainable human habitation and food production systems. "
    "Permaculture is a design system that encompasses a wide variety of disciplines, such "
    "as ecology, landscape design, environmental science and energy conservation, and the "
    "Permaculture design principles are drawn from these various disciplines. Each individual "
    "design principle itself embodies a complete conceptual framework based on sound "
    "scientific principles. When we bring all these separate  principles together, we can "
    "create a design system that both looks at whole systems, the parts that these systems "
    "consist of, and how those parts interact with each other to create a complex, dynamic, "
    "living system. Each design principle serves as a tool that allows us to integrate all "
    "the separate parts of a design, referred to as elements, into a functional, synergistic, "
    "whole system, where the elements harmoniously interact and work together in the most "
    "efficient way possible."
)

inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, num_beams=5, num_beam_groups=5, max_new_tokens=30, diversity_penalty=1.0)

tokenizer.decode(outputs[0], skip_special_tokens=True)
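
The `diversity_penalty` argument above discourages later beam groups from repeating the token that earlier groups picked at the same step. A minimal sketch of that idea, assuming a Hamming-style penalty and made-up scores (`apply_diversity_penalty` is a hypothetical helper):

```python
from collections import Counter


def apply_diversity_penalty(scores, previous_group_tokens, diversity_penalty):
    """Lower the score of every token already chosen by earlier groups at this step."""
    counts = Counter(previous_group_tokens)
    return {tok: s - diversity_penalty * counts[tok] for tok, s in scores.items()}


# Hypothetical log-probabilities shared by every group at one decoding step.
step_scores = {"design": -0.4, "system": -0.9, "garden": -1.2}

chosen = []
for group in range(3):
    penalized = apply_diversity_penalty(step_scores, chosen, diversity_penalty=1.0)
    chosen.append(max(penalized, key=penalized.get))

print(chosen)
```

Without the penalty every group would pick the same top token; with it, each group is pushed toward a different continuation.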

In [None]:
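# Speculative (assisted) decoding
# assistant_model must already be loaded: typically a smaller model that uses the same tokenizer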
prompt = "Alice and Bob"

inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, assistant_model=assistant_model)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
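
The idea behind passing `assistant_model` can be sketched with two toy "models": a cheap draft model proposes a few tokens, and the main model verifies them, keeping the matching prefix and correcting the first mismatch. Both models here are hypothetical lookup functions that depend only on position, and the loop is a simplified greedy variant, not the library's algorithm.

```python
def target_next(seq):
    """Hypothetical large model: its greedy choice after seq."""
    text = "alice and bob went to the market".split()
    return text[len(seq)] if len(seq) < len(text) else "<eos>"


def draft_next(seq):
    """Hypothetical small model: agrees with the target except at one position."""
    text = "alice and bob walked to the market".split()
    return text[len(seq)] if len(seq) < len(text) else "<eos>"


def assisted_generate(prompt, max_new_tokens, lookahead=3):
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < max_new_tokens:
        # 1. The draft model cheaply proposes a few tokens.
        draft = []
        for _ in range(lookahead):
            draft.append(draft_next(seq + draft))
        # 2. The target model checks them (one forward pass in a real model).
        target_passes += 1
        for i, token in enumerate(draft):
            expected = target_next(seq + draft[:i])
            if token != expected:
                draft = draft[:i] + [expected]   # keep the target's correction
                break
        seq.extend(draft)
    return seq[: len(prompt) + max_new_tokens], target_passes


tokens, passes = assisted_generate(["alice", "and", "bob"], max_new_tokens=4)
print(" ".join(tokens), passes)
```

The output matches what the target model alone would produce, but with fewer target passes than generated tokens whenever the draft guesses correctly.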

In [None]:
del gemma
torch.cuda.empty_cache()  # this time the model is not offloading
# restarting the server to release the memory

### Llama observation

- Llama provides the output of 10 riddles

- Llama encoding and decoding work the same way as with RoBERTa models

In [None]:
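import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

code_llama_path = "codellama/CodeLlama-7b-hf"

code_tokenizer = AutoTokenizer.from_pretrained(code_llama_path)

# Assumption: a 4-bit BitsAndBytes config; quant_config is not defined earlier in this notebook
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

codellama = AutoModelForCausalLM.from_pretrained(
    code_llama_path,
    device_map="auto",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16
)  # takes around 11.5GB of VRAM without quantization
# with 4-bit quantization 5GB of VRAM is consumed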

In [None]:
code_tokenizer.default_chat_template

"{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif false == true and not '<<SYS>>' in messages[0]['content'] %}{% set loop_messages = messages %}{% set system_message = 'You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\\n\\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don\\'t know the answer to a question, please don\\'t share false information.' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must

In [None]:
# code_message must be defined beforehand as a list of {"role": ..., "content": ...} chat messages
code_input = code_tokenizer.apply_chat_template(code_message,return_tensors='pt').to("cuda")

Using sep_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using mask_token, but it is not set yet.
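
To make the chat template above easier to read, here is a plain-Python sketch of a Llama-2-style `[INST]` layout. It is a simplified approximation written for illustration: `render_llama2_chat` is a hypothetical helper, the example `code_message` is made up, and the real Jinja template may differ in whitespace and special-token handling.

```python
def render_llama2_chat(messages, bos="<s>", eos="</s>"):
    """Render role/content messages in a Llama-2-style [INST] layout (simplified)."""
    system = None
    if messages and messages[0]["role"] == "system":
        system, messages = messages[0]["content"], messages[1:]

    rendered = ""
    for index, message in enumerate(messages):
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError("roles must alternate between user and assistant")
        content = message["content"].strip()
        if message["role"] == "user":
            # The system message is folded into the first user turn.
            if index == 0 and system is not None:
                content = f"<<SYS>>\n{system}\n<</SYS>>\n\n{content}"
            rendered += f"{bos}[INST] {content} [/INST]"
        else:
            rendered += f" {content} {eos}"
    return rendered


code_message = [  # hypothetical conversation
    {"role": "system", "content": "You write short Python snippets."},
    {"role": "user", "content": "Show how to create a dictionary."},
]
rendered = render_llama2_chat(code_message)
print(rendered)
```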


In [None]:
code_output = codellama.generate(
    code_input,
    max_new_tokens=500,
    do_sample=True,
    pad_token_id=code_tokenizer.eos_token_id
)

In [None]:
code_output = codellama.generate(
    code_input,
    max_new_tokens=500,
    do_sample=True,
    pad_token_id=code_tokenizer.eos_token_id,
    temperature=0.2
)

In [None]:
code_output = codellama.generate(
    code_input,
    max_new_tokens=500,
    # do_sample=True,
    pad_token_id=code_tokenizer.eos_token_id,
    # temperature=0.2
)

In [None]:
code_output = codellama.generate(
    code_input,
    max_new_tokens=500,
    # do_sample=True,
    pad_token_id=code_tokenizer.eos_token_id,
    # temperature=0.2
    repetition_penalty=0.5  # values below 1.0 encourage repetition; values above 1.0 penalize it
)

In [None]:
code_output = codellama.generate(
    code_input,
    max_new_tokens=500,
    do_sample=True,
    pad_token_id=code_tokenizer.eos_token_id,
    # temperature=0.2
    repetition_penalty=0.5,
    top_p=0.9,  # nucleus sampling threshold; a probability between 0 and 1
    top_k=20
)
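
The effect of `repetition_penalty` in the cells above can be sketched as follows, assuming the CTRL-style rule in which the logits of already-generated tokens are rescaled. `apply_repetition_penalty` is a hypothetical helper and the logits are made up.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Rescale the logits of tokens that were already generated."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Positive logits are divided and negative logits multiplied, so a
        # penalty above 1.0 always pushes a seen token's logit down.
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


logits = [2.0, 1.0, -1.0]   # hypothetical scores for a 3-token vocabulary
seen = [0, 2]               # tokens 0 and 2 already appear in the output

discouraged = apply_repetition_penalty(logits, seen, penalty=1.2)
encouraged = apply_repetition_penalty(logits, seen, penalty=0.5)
print(discouraged)
print(encouraged)
```

Under this rule a value of 0.5 raises the scores of tokens that were already generated, which may contribute to repetitive output.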

In [None]:
output = code_output[0][len(code_input[0]):]
code_tokenizer.decode(output, skip_special_tokens=True)

### Code Llama observation

- Model inference led to OOM with a 500-token request when loaded in bfloat16

- Model inference worked with 500 tokens when quant_config used 4-bit quantization.

- With quantisation, 1GB of VRAM is consumed for inference


- **The model output was gibberish**

- The request was for code related to dictionaries

- Even after reviewing the generation configs, only gibberish was generated