In [None]:
!pip install rich transformers torch accelerate bitsandbytes peft > /dev/null

Task: Text Generation

Each task has its own default model in the pipeline.

- Causal Language Modeling

- Masked Language Modeling

Another type of variation is

- Text Generation

- Text-to-Text Generation models


Variety of LMs in HuggingFace

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard


Some of the Text generation Tasks

Code Generation: Trained to generate code

https://huggingface.co/spaces/bigcode/bigcode-playground

Instruction Model: Those that are trained on instruction

Stories generation: A prompt starts the Stories Generation


Quantization Using BitsAndBytes is touched

- Grokking how the models are shrunk and loaded.



In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
from transformers import pipeline

path = "distilbert/distilgpt2"

generator = pipeline('text-generation',
                     model = path)

generator("Hello, I'm a language model",
          max_length = 30,
          num_return_sequences=3)

# Will provide 3 generated statements

In [None]:
# - Text-to-Text generation models have a separate pipeline called text2text-generation.

# - This pipeline takes an input containing the sentence including the task and returns the output of the accomplished task.

In [4]:
from transformers import pipeline

text2text_generator = pipeline("text2text-generation")

text2text_generator("question: What is 42 ? context: 42 is the answer to life, the universe and everything")

No model was supplied, defaulted to google-t5/t5-base and revision 686f1db (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]



[{'generated_text': 'the answer to life, the universe and everything'}]

#### How pipeline works: Dive into AutoClasses

The pipeline function is built using the AutoModel, AutoTokenizer to create the generated text. We will look into, how it could be implemented

In [5]:
model = AutoModelForCausalLM.from_pretrained(path)

In [6]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(path)
model.generation_config

GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256
}

In [7]:
prompt = "Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer:"

In [8]:
tokenizer = AutoTokenizer.from_pretrained(path)

In [9]:
model = AutoModelForCausalLM.from_pretrained(path,
                                             pad_token_id=0)

In [10]:
# Lets first look at how to provide the model into the pipeline

pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer)

result = pipe(prompt,
              max_new_tokens=60)[0]["generated_text"][len(prompt):]

result


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


' Please write a function and make it transparently available and transparent as well as making it transparently available.\nAnswer: Please write a function as an expression or function whose code is equivalent to the one that is in the giga. It will read and write the function in a standard giga.'

In [13]:
# max_new_tokens: the maximum number of tokens to generate. In other words, the size of the output sequence,
# not including the tokens in the prompt. As an alternative to using the output’s length as a stopping criteria,
# you can choose to stop generation whenever the full generation exceeds some amount of time. To learn more, check StoppingCriteria.

# num_beams: by specifying a number of beams higher than 1, you are effectively switching from greedy
# search to beam search. This strategy evaluates several hypotheses at each time step and eventually
# chooses the hypothesis that has the overall highest probability for the entire sequence. This has
# the advantage of identifying high-probability sequences that start with a lower probability initial
# tokens and would’ve been ignored by the greedy search. Visualize how it works in the beam search visualizer below.

# do_sample: if set to True, this parameter enables decoding strategies such as multinomial sampling,
# beam-search multinomial sampling, Top-K sampling and Top-p sampling. All these strategies select the
# next token from the probability distribution over the entire vocabulary with various strategy-specific adjustments.

# num_return_sequences: the number of sequence candidates to return for each input. This option is only
# available for the decoding strategies that support multiple sequence candidates, e.g. variations of
# beam search and sampling. Decoding strategies like greedy search and contrastive search
# return a single output sequence.

In [14]:
# https://huggingface.co/spaces/m-ric/beam_search_visualizer

# https://huggingface.co/blog/optimize-llm

In [15]:
from transformers import AutoModelForSeq2SeqLM, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")

model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")

In [16]:
translation_generation_config = GenerationConfig(
    num_beams=4,
    early_stopping=True,
    decoder_start_token_id=0,
    eos_token_id=model.config.eos_token_id,
    pad_token=model.config.pad_token_id,
)

In [17]:
# Tip: add `push_to_hub=True` to push to the Hub
translation_generation_config.save_pretrained("/tmp", "translation_generation_config.json")

# You could then use the named generation config file to parameterize generation
generation_config = GenerationConfig.from_pretrained("/tmp", "translation_generation_config.json")

In [18]:
inputs = tokenizer("translate English to French: Configuration files are easy to use!", return_tensors="pt")b

In [19]:
inputs

{'input_ids': tensor([[13959,  1566,    12,  2379,    10, 25306,   257,  2073,    33,   514,
            12,   169,    55,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [20]:
outputs = model.generate(**inputs, generation_config=generation_config)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:1 for open-end generation.


['Les fichiers de configuration sont faciles à utiliser!']


##### Practice with Gemma Model

Explore the model inference processes, its methods by executing below cells.

In [25]:
# Gemma model will require your read access token.

from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    torch_dtype=torch.bfloat16
)

In [28]:
torch.cuda.is_available()

True

In [None]:
model.to('cuda')

In [None]:
# not required to run this cell.

# llama_path = "/home/kamal/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/c1b0db933684edbfe29a06fa47eb19cc48025e93/"
import torch
gemma_path = "google/gemma-2b-it"

gemma = AutoModelForCausalLM.from_pretrained(
    # pretrained_model_name_or_path='/home/aicoder/.cache/huggingface/hub/models--google--gemma-2b-it/snapshots/718cb189da9c5b2e55abe86f2eeffee9b4ae0dad/
    gemma_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    # local_files_only=True  # this will stop the function from calling the hub for the model
) # takes 11GB of VRAM

In [31]:
from rich import print

In [32]:
gemma_tokenizer = AutoTokenizer.from_pretrained(gemma_path)
# print(llama_tokenizer.default_chat_template) # rich's print fails due to tag
print(gemma_tokenizer.default_chat_template)

In [33]:
print(gemma_tokenizer.special_tokens_map)

In [42]:
prompt = "Where there is a will"
gemma_input = gemma_tokenizer(prompt, return_tensors='pt').to('cuda')
# The above command needs to be reviewed for the errors it can create

In [40]:
gemma_input

{'input_ids': tensor([[   2, 6006, 1104,  603,  476,  877]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}

In [43]:
gemma_output = model.generate(
    **gemma_input,
    max_new_tokens=100,
    do_sample=True,
    # pad_token_id = llama_tokenizer.eos_token_id
)

In [44]:
gemma_output

tensor([[     2,   6006,   1104,    603,    476,    877, 235269,   1104,    603,
            476,   1703, 235265,    109,   1596,  20911,  55620,    573,   7819,
            674,    675,    476,   3110,  12005,    689,   6187, 235269,    665,
            603,   3077,    577,  21944,  36638,    578,   7914,    974, 235303,
         235256,   9082, 235265,   1165, 144118,    573,   4268,    674,   2811,
           1160,    578,  61642,    708,   8727,    604,  32379,   3361, 235265,
            109,   4858,    708,   1009,   8944,    576,   1368,    736,  20911,
            798,    614,   1671, 235292,    109, 235290,   3194,  20360,    675,
            476,   5988,   6911, 235269,  30903,   5804,    674,   1104,    603,
            476,    877,    577,  21252, 235265,    108, 235290,   4218,  82648,
           9082,    578,   3104,    476,   1780,    577,   7914,   1174, 235265,
            108, 235290,   1927,    692,    791,    476,   6523]],
       device='cuda:0')

In [45]:
# llama_output = llama_output[len(llama_input[0]):]
# llama_output
output = gemma_tokenizer.decode(gemma_output[0],skip_special_tokens=True)
output

"Where there is a will, there is a way.\n\nThis phrase expresses the concept that with a clear desire or purpose, it is possible to overcome obstacles and achieve one's goals. It reinforces the idea that hard work and persistence are essential for achieving success.\n\nHere are some examples of how this phrase can be used:\n\n- When faced with a difficult task, remind yourself that there is a will to succeed.\n- Set SMART goals and create a plan to achieve them.\n- If you have a dream"

##### All the NLP Tasks as Generation Tasks

We will review how the NLP tasks like classification, QnA can be converted to Text-Gen task

In [None]:
# working on getting the riddle based on the 10 riddles input
gemma_10_input = gemma_tokenizer.apply_chat_template(messages_to_model,
                                                     return_tensors='pt').to('cuda')
gemma_10_input

tensor([[     2,    106,   1645,    108,  33501,    603,    476, 133326, 235265,
          12542,    908,    675, 235248, 235274, 235276,    978, 235265,    108,
           6140,   1317,    573, 193130, 235265,   1307,  96085,    578,   1453,
         235303, 235251,   5033,   4341,   1354, 235285,  25469,    578,  10084,
            578,  47331,   2003,    575,    861,   3142, 235265,    590,   1144,
            793,  12100, 235269,    578,    590,   1453, 235303, 235251,   8044,
          24306, 235265,    109,   1969,   2174,  11807,  28294, 235269,    665,
            603,  24048,   1154,    476,  75735, 235265,   1165,    603,  10545,
            675,    573,  25023, 235269,    578,    573,  25023,    603,  14471,
         235265,    109, 159960, 235267,    685,    476,  12425,    575,    573,
           5455,    576,   3354, 235269,  82056,    901,  90892,   1013,   2764,
            476,  26911, 235265,  13227,  14987,    901,   3695,   8829, 235265,
           2625,    970,   7

In [None]:
gemma_tokenizer.decode(llama_10_input[0])

"<bos><start_of_turn>user\nBelow is a riddle. Come up with 10 more.\nOutput just the riddles. No numbering and don't output anything elseI bubble and laugh and spit water in your face. I am no lady, and I don't wear lace.\n\nAn open ended barrel, it is shaped like a hive. It is filled with the flesh, and the flesh is alive.\n\nStealthy as a shadow in the dead of night, cunning but affectionate if given a bite. Never owned but often loved. At my sport considered cruel, but that's because you never know me at all.\n\nI am a fire's best friend. When fat, my body fills with wind. When pushed to thin, through my nose I blow. Then you can watch the embers glow.\n\nI crawl on the earth. And rise on a pillar.\n\nA box without hinges, lock or key, yet golden treasure lies within. \n\nAs a whole, I am both safe and secure. Behead me, I become a place of meeting. Behead me again, I am the partner of ready. Restore me, I become the domain of beasts.\n\nWho is he that runs without a leg. And his ho

In [None]:
gemma_10_output = gemma.generate(
    llama_10_input,
    max_new_tokens=500,
    do_sample=True,
    pad_token_id=llama_tokenizer.eos_token_id
)

In [None]:
output_10 = gemma_tokenizer.decode(gemma_10_output[0], skip_special_tokens=True)
output_10

"user\nBelow is a riddle. Come up with 10 more.\nOutput just the riddles. No numbering and don't output anything elseI bubble and laugh and spit water in your face. I am no lady, and I don't wear lace.\n\nAn open ended barrel, it is shaped like a hive. It is filled with the flesh, and the flesh is alive.\n\nStealthy as a shadow in the dead of night, cunning but affectionate if given a bite. Never owned but often loved. At my sport considered cruel, but that's because you never know me at all.\n\nI am a fire's best friend. When fat, my body fills with wind. When pushed to thin, through my nose I blow. Then you can watch the embers glow.\n\nI crawl on the earth. And rise on a pillar.\n\nA box without hinges, lock or key, yet golden treasure lies within. \n\nAs a whole, I am both safe and secure. Behead me, I become a place of meeting. Behead me again, I am the partner of ready. Restore me, I become the domain of beasts.\n\nWho is he that runs without a leg. And his house on his back?\n\n

In [None]:
del gemma
torch.cuda.empty_cache()  # this time the model is not offloadin
# restarting the server to release the memory

### Llama observation

- Llama provides the output of 10 riddles

- Llama encoding and decoding is working same as roberta models

In [None]:
import torch

code_llama_path = "/home/kamal/.cache/huggingface/hub/models--codellama--CodeLlama-7b-hf/snapshots/bc5283229e2fe411552f55c71657e97edf79066c/"
code_tokenizer = AutoTokenizer.from_pretrained(code_llama_path)
codellama = AutoModelForCausalLM.from_pretrained(
    code_llama_path,
    device_map="auto",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16
)  # takes around 11.5GB of VRAM
# with 4-bit quantization 5GB of VRAM is consumed

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
code_tokenizer.default_chat_template

"{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif false == true and not '<<SYS>>' in messages[0]['content'] %}{% set loop_messages = messages %}{% set system_message = 'You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\\n\\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don\\'t know the answer to a question, please don\\'t share false information.' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must

In [None]:
simple_request = "Generate 10 riddles that you know"

simple_message = [{"role":"user", "content":simple_request}]

In [None]:
code_request = "Generate 10 dictionary related problems"

code_message = [{"role":"user", "content":code_request}]

In [None]:
code_input = code_tokenizer.apply_chat_template(simple_message,return_tensors='pt').to("cuda")

Using sep_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using mask_token, but it is not set yet.


In [None]:
code_input = code_tokenizer.apply_chat_template(code_message,return_tensors='pt').to("cuda")

Using sep_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using mask_token, but it is not set yet.


In [None]:
code_output = codellama.generate(
    code_input,
    max_new_tokens=500,
    do_sample=True,
    pad_token_id=code_tokenizer.eos_token_id
)

In [None]:
code_output = codellama.generate(
    code_input,
    max_new_tokens=500,
    do_sample=True,
    pad_token_id=code_tokenizer.eos_token_id,
    temperature=0.2
)

In [None]:
code_output = codellama.generate(
    code_input,
    max_new_tokens=500,
    # do_sample=True,
    pad_token_id=code_tokenizer.eos_token_id,
    # temperature=0.2
)

In [None]:
code_output = codellama.generate(
    code_input,
    max_new_tokens=500,
    # do_sample=True,
    pad_token_id=code_tokenizer.eos_token_id,
    # temperature=0.2
    repetition_penalty=0.5
)

In [None]:
code_output = codellama.generate(
    code_input,
    max_new_tokens=500,
    do_sample=True,
    pad_token_id=code_tokenizer.eos_token_id,
    # temperature=0.2
    repetition_penalty=0.5,
    top_p=10,
    top_k=20
)

In [None]:
output = code_output[0][len(code_input[0]):]
code_tokenizer.decode(output, skip_special_tokens=True)

'ayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaSSLayaayaayaayaayaayaayaSSLayaSSLayaayaayaayaayaayaayaayaayaayaayaayaayaSSLayaayaayaayaayaayaayaayaayaayaSSLayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaSSLayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaSSLayaayaayaayaayaayaSSLayaayaayaayaayaayaayaayaSSLayaSSLayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaSSLayaayaayaayaayaayaSSLayaayaayaayaayaSSLayaayaayaayaayaayaayaayaayaSSLayaayaayaayaayaayaayaayaayaayaayaayaSSLayaayaayaayaayaayaayaSSLayaayaSSLayaSSLayaayaSSLayaayaayaayaayaayaayaayaayaayaayaayaayaayaayaSSLayaayaayaayaayaayaayaayaaya

### Code Llama observation

- Model inference lead to OOM with 500 tokens request, when loaded with bfloat16

- Model inference worked with 500 Tokens, with quant_config done with 4-bit.

- In quantisation 1GB of Vram is consumed for inference


- **The model out was gibberish**

- Requested for the code related to dictionary

- Even after reviewing the generation configs, only gibberish was generated