In [1]:
!pip install rich transformers torch accelerate bitsandbytes peft > /dev/null

Task: Text Generation

Each pipeline task has its own default model.

Models vary by training objective:

- Causal Language Modeling

- Masked Language Modeling

They also vary by generation task:

- Text Generation

- Text-to-Text Generation


Variety of LMs on Hugging Face

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard


Some text generation tasks

Code generation: models trained to generate code

https://huggingface.co/spaces/bigcode/bigcode-playground

Instruction models: models trained to follow instructions

Story generation: a prompt starts the story and the model continues it


Quantization using BitsAndBytes is touched on

- Grokking how the models are shrunk and loaded.
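
To give a feel for how weights are shrunk, here is a toy absmax quantizer that maps floats to 4-bit integers and back. It is a simplified illustration only: the function names are made up for this note, and BitsAndBytes itself uses block-wise 4-bit formats rather than this exact scheme.

```python
def quantize_absmax_4bit(weights):
    """Map floats to signed 4-bit integers in [-8, 7] using one shared scale."""
    # The largest magnitude is mapped to 7; fall back to 1.0 for all-zero input.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the 4-bit integers."""
    return [q * scale for q in quantized]


weights = [0.7, -0.3, 0.12, 0.0]
q, scale = quantize_absmax_4bit(weights)
restored = dequantize(q, scale)
print(q)         # small integers, 4 bits each
print(restored)  # close to, but not exactly, the original weights
```

Each weight now needs 4 bits plus a shared scale, at the cost of a rounding error of at most half a quantization step.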



In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [3]:
from transformers import pipeline

path = "distilbert/distilgpt2"

generator = pipeline('text-generation',
                     model = path)

generator("Hello, I'm a language model",
          max_length = 30,
          num_return_sequences=3)

# Returns 3 generated continuations of the prompt

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model for creating my own user interface, and for working collaboratively with others. By collaborating on this project on my own"},
 {'generated_text': "Hello, I'm a language model with a little bit of a sense of what my environment is going to look like when I open your browser, you"},
 {'generated_text': "Hello, I'm a language model\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"}]

In [None]:
# - Text-to-Text generation models have a separate pipeline called text2text-generation.

# - The input string carries both the task and the text to process, and the pipeline returns the result of that task.

In [5]:
from transformers import pipeline

text2text_generator = pipeline("text2text-generation")
text2text_generator("question: What is 42 ?")
# text2text_generator("question: What is 42 ? context: 42 is the answer to life, the universe and everything")

No model was supplied, defaulted to google-t5/t5-base and revision 686f1db (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'generated_text': 'not_duplicate'}]

# How pipeline works: Dive into AutoClasses



The pipeline function is built on the AutoModel and AutoTokenizer classes to produce the generated text. We will look at how this could be implemented.
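
Roughly, a text-generation pipeline performs three steps: tokenize the prompt, call `generate`, and decode the result. The sketch below spells that out; `generate_text` is a hypothetical helper, and it assumes `model` and `tokenizer` are already loaded and sit on the same device.

```python
def generate_text(model, tokenizer, prompt, **generate_kwargs):
    """Tokenize -> generate -> decode, the core of a text-generation pipeline."""
    # 1. Turn the prompt into tensors (input_ids, attention_mask).
    inputs = tokenizer(prompt, return_tensors="pt")
    # 2. Let the model extend the token sequence.
    output_ids = model.generate(**inputs, **generate_kwargs)
    # 3. Convert the first returned sequence back into text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage with objects loaded via AutoModelForCausalLM / AutoTokenizer
# (requires the model weights, so it is left commented out here):
# generate_text(model, tokenizer, "Hello, I'm a language model", max_new_tokens=30)
```

For a causal model the decoded text includes the prompt itself, which is why later cells slice it off with `[len(prompt):]`.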

In [None]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(path)
model.generation_config

GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256
}

In [None]:
prompt = "Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer:"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(path)

In [None]:
model = AutoModelForCausalLM.from_pretrained(path,
                                             pad_token_id=0)

In [None]:
# Let's first look at how to pass the model into the pipeline

pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer)

result = pipe(prompt,
              max_new_tokens=60)[0]["generated_text"][len(prompt):]

result


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


' "Giga bytes are defined by the Python module in src/giga.py.\nIf you are using the interpreter in python 2 or above please note in the corresponding module that it\'s compiled in the "GigaBytes.py" folder.'

In [None]:
# max_new_tokens: the maximum number of tokens to generate. In other words, the size of the output sequence,
# not including the tokens in the prompt. As an alternative to using the output’s length as a stopping criterion,
# you can choose to stop generation whenever the full generation exceeds some amount of time. To learn more, check StoppingCriteria.

# num_beams: by specifying a number of beams higher than 1, you are effectively switching from greedy
# search to beam search. This strategy evaluates several hypotheses at each time step and eventually
# chooses the hypothesis that has the overall highest probability for the entire sequence. This has
# the advantage of identifying high-probability sequences that start with lower-probability initial
# tokens and would’ve been ignored by the greedy search. Visualize how it works in the beam search visualizer below.

# do_sample: if set to True, this parameter enables decoding strategies such as multinomial sampling,
# beam-search multinomial sampling, Top-K sampling and Top-p sampling. All these strategies select the
# next token from the probability distribution over the entire vocabulary with various strategy-specific adjustments.

# num_return_sequences: the number of sequence candidates to return for each input. This option is only
# available for the decoding strategies that support multiple sequence candidates, e.g. variations of
# beam search and sampling. Decoding strategies like greedy search and contrastive search
# return a single output sequence.
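
To make the `num_beams` note concrete, here is a toy comparison of greedy search and beam search on a hand-written next-token table. The table, the probabilities and the function names are invented for illustration; real beam search in `transformers` also handles end-of-sequence tokens and length penalties, which are left out here.

```python
import math

# Hypothetical toy "language model": prefix (tuple of tokens) -> next-token probabilities.
TOY_LM = {
    (): {"The": 0.6, "A": 0.4},
    ("The",): {"dog": 0.4, "cat": 0.35, "end": 0.25},
    ("A",): {"bird": 0.9, "end": 0.1},
}


def next_probs(prefix):
    return TOY_LM.get(tuple(prefix), {})


def greedy_search(steps):
    """Pick the single most probable token at every step."""
    seq, logp = [], 0.0
    for _ in range(steps):
        probs = next_probs(seq)
        if not probs:
            break
        token = max(probs, key=probs.get)
        seq.append(token)
        logp += math.log(probs[token])
    return seq, math.exp(logp)


def beam_search(steps, num_beams):
    """Keep the `num_beams` best partial sequences at every step."""
    beams = [([], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            probs = next_probs(seq)
            if not probs:
                candidates.append((seq, logp))  # nothing to expand, carry over
                continue
            for token, p in probs.items():
                candidates.append((seq + [token], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    best_seq, best_logp = beams[0]
    return best_seq, math.exp(best_logp)


greedy_seq, greedy_p = greedy_search(steps=2)
beam_seq, beam_p = beam_search(steps=2, num_beams=2)
print(greedy_seq, round(greedy_p, 2))
print(beam_seq, round(beam_p, 2))
```

Greedy search commits to "The" because it is the best first token, while beam search keeps "A" alive and finds the sequence with the higher overall probability.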

In [None]:
# https://huggingface.co/spaces/m-ric/beam_search_visualizer

# https://huggingface.co/blog/optimize-llm

In [6]:
from transformers import AutoModelForSeq2SeqLM, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")

model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")


In [7]:
translation_generation_config = GenerationConfig(
    num_beams=4,
    early_stopping=True,
    decoder_start_token_id=0,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
)

In [8]:
# Tip: add `push_to_hub=True` to push to the Hub
translation_generation_config.save_pretrained("/tmp", "translation_generation_config.json")

# You could then use the named generation config file to parameterize generation
generation_config = GenerationConfig.from_pretrained("/tmp", "translation_generation_config.json")

In [11]:
inputs = tokenizer("translate English to French: Configuration files are easy to use!", return_tensors="pt")

In [12]:
inputs

{'input_ids': tensor([[13959,  1566,    12,  2379,    10, 25306,   257,  2073,    33,   514,
            12,   169,    55,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [13]:
outputs = model.generate(**inputs, generation_config=generation_config)
print(outputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:1 for open-end generation.


tensor([[    0,   622, 13785,     7,    20,  5298,   527,  9912,     7,     3,
            85, 11503,    55,     1]])
['Les fichiers de configuration sont faciles à utiliser!']


# Practice with Gemma model


Explore the model's inference process and its methods by executing the cells below; the decoding strategies are explored afterwards.



In [5]:
# The Gemma model requires your Hugging Face read access token.

from huggingface_hub import notebook_login

notebook_login()


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    torch_dtype=torch.bfloat16
)

In [None]:
torch.cuda.is_available()

True

In [None]:
model.to('cuda')

In [None]:
# Loads Gemma with device_map="auto"; the later cells use the `gemma` object and `gemma_path` defined here.

# llama_path = "/home/kamal/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/c1b0db933684edbfe29a06fa47eb19cc48025e93/"
import torch
gemma_path = "google/gemma-2b-it"

gemma = AutoModelForCausalLM.from_pretrained(
    # pretrained_model_name_or_path='/home/aicoder/.cache/huggingface/hub/models--google--gemma-2b-it/snapshots/718cb189da9c5b2e55abe86f2eeffee9b4ae0dad/
    gemma_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # reduces size from ~11 GB (float32) to ~5 GB (bfloat16); bfloat16 = "brain float", from Google Brain
    # local_files_only=True  # this will stop the function from calling the hub for the model
) # takes 11GB of VRAM
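
The sizes in the comments above follow from simple arithmetic: number of parameters times bytes per parameter. A back-of-the-envelope sketch, assuming roughly 2.5 billion parameters for `gemma-2b-it` and counting weights only (activations, KV cache and framework overhead come on top):

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 2.5e9  # assumed parameter count, rounded

for dtype in ("float32", "bfloat16", "4bit"):
    print(dtype, round(weight_memory_gb(NUM_PARAMS, dtype), 2), "GB")
```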

In [7]:
from rich import print

In [8]:
gemma_tokenizer = AutoTokenizer.from_pretrained(gemma_path)
# print(llama_tokenizer.default_chat_template) # rich's print fails due to tag
print(gemma_tokenizer.default_chat_template)



No chat template is defined for this tokenizer - using a default chat template that implements the ChatML format (without BOS/EOS tokens!). If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.



In [9]:
print(gemma_tokenizer.special_tokens_map)

In [16]:
prompt = "Where there is a will"
gemma_input = gemma_tokenizer(prompt, return_tensors='pt').to('cuda')
# Note: `.to('cuda')` raises an error when no CUDA device is available; check `torch.cuda.is_available()` first

In [13]:
gemma_input

{'input_ids': tensor([[   2, 6006, 1104,  603,  476,  877]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}

In [17]:
gemma_input

{'input_ids': tensor([[   2, 6006, 1104,  603,  476,  877]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]], device='cuda:0')}

In [18]:
# First level of decoding customisation
gemma_output = gemma.generate(
    **gemma_input,
    max_new_tokens=100,
    do_sample=True,
    # pad_token_id = llama_tokenizer.eos_token_id
)

In [19]:
gemma_output

tensor([[     2,   6006,   1104,    603,    476,    877,   1104,    603,    476,
           1703, 235265,    109,   1596,  14232,    603,  30665,    577,   7536,
          54338, 235269,    901,   1277,   5467,    603,  26945, 235265,   1165,
            919,   1125,   1671,    731,  41687,   1461,   1163,    573,  24357,
         235269,   3359,  20363,  52041, 235269,  73862,  46543, 235269,    578,
          23053, 101722, 235265,    109,    651,  14232,   3454,    674,   1104,
            603,   2593,    476,   1703,    577,   7914,    476,   6789, 235269,
           1693,   1013,    665,   4930,  11623, 235265,   1165,    603,    476,
          31285,    674,    675,  17564,    578,  99597, 235269,   4341,    603,
           3077, 235265,    109,   4858,    708,   1009,   8944,    576,   1368,
            573,  14232,    798,    614,   7936, 235292,    109, 235287,   3194,
            692,    708,  15853,    476,   3210, 235269,   1453]],
       device='cuda:0')

In [20]:
output = gemma_tokenizer.decode(gemma_output[0],skip_special_tokens=True)
output

# Where there is a will, there is a way.
# This phrase expresses the concept that with a clear desire or purpose, it is possible to overcome obstacles and achieve one's goals. It reinforces the idea that hard work and persistence are essential for achieving success.
# Here are some examples of how this phrase can be used:
# - When faced with a difficult task, remind yourself that there is a will to succeed.
# - Set SMART goals and create a plan to achieve them.
# - If you have a dream

'Where there is a will there is a way.\n\nThis quote is attributed to Thomas Fuller, but its origin is uncertain. It has been used by countless people over the centuries, including Albert Einstein, Eleanor Roosevelt, and Nelson Mandela.\n\nThe quote means that there is always a way to achieve a goal, even if it seems impossible. It is a reminder that with determination and perseverance, anything is possible.\n\nHere are some examples of how the quote can be applied:\n\n* When you are facing a problem, don'

In [None]:
# do not run
from transformers import AutoModelForCausalLM, GenerationConfig

model = AutoModelForCausalLM.from_pretrained("my_account/my_model")
generation_config = GenerationConfig(
    max_new_tokens=50, do_sample=True, top_k=50, eos_token_id=model.config.eos_token_id
)
generation_config.save_pretrained("my_account/my_model", push_to_hub=True)

In [22]:
translation_generation_config = GenerationConfig(
    num_beams=4,
    early_stopping=True,
    decoder_start_token_id=0,
    eos_token_id=gemma.config.eos_token_id,
    pad_token_id=gemma.config.pad_token_id,
)

# Tip: add `push_to_hub=True` to push to the Hub
translation_generation_config.save_pretrained("/tmp", "translation_generation_config.json")

# You could then use the named generation config file to parameterize generation
generation_config = GenerationConfig.from_pretrained("/tmp", "translation_generation_config.json")

In [None]:
inputs = gemma_tokenizer("translate English to French: Configuration files are easy to use!", return_tensors="pt").to("cuda")
outputs = gemma.generate(**inputs, generation_config=generation_config)

In [26]:
inputs

{'input_ids': tensor([[     2,  25509,   4645,    577,   6987, 235292,  27115,   6630,    708,
           3980,    577,   1281, 235341]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

In [25]:
outputs

tensor([[     2,  25509,   4645,    577,   6987, 235292,  27115,   6630,    708,
           3980,    577,   1281, 235341,   2365,   2765,    692,    577,  29432,
            861,   1812]], device='cuda:0')

In [24]:
print(gemma_tokenizer.batch_decode(outputs, skip_special_tokens=True))

# All the decoding strategies



https://huggingface.co/blog/how-to-generate

https://huggingface.co/docs/transformers/generation_strategies

Greedy Search

Contrastive Search

Multinomial Sampling

Beam-Search Decoding

Beam-Search Multinomial Sampling

Diverse Beam Search Decoding




In [None]:
# - *greedy decoding* if `num_beams=1` and `do_sample=False`
# - *contrastive search* if `penalty_alpha>0.` and `top_k>1`
# - *multinomial sampling* if `num_beams=1` and `do_sample=True`
# - *beam-search decoding* if `num_beams>1` and `do_sample=False`
# - *beam-search multinomial sampling* if `num_beams>1` and `do_sample=True`
# - *diverse beam-search decoding* if `num_beams>1` and `num_beam_groups>1`
# - *constrained beam-search decoding* if `constraints!=None` or `force_words_ids!=None`
# - *assisted decoding* if `assistant_model` or `prompt_lookup_num_tokens` is passed to `.generate()`
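
The rules above can be restated as a small lookup function, which is handy for checking which strategy a given `generate()` call will use. This is a simplified restatement of the list, not the library's own dispatch code; the function name and the precedence between overlapping rules are assumptions.

```python
def decoding_strategy(num_beams=1, do_sample=False, penalty_alpha=None, top_k=None,
                      num_beam_groups=1, constraints=None, force_words_ids=None,
                      assistant_model=None, prompt_lookup_num_tokens=None):
    """Return the strategy name implied by the listed rules (simplified)."""
    if assistant_model is not None or prompt_lookup_num_tokens is not None:
        return "assisted decoding"
    if constraints is not None or force_words_ids is not None:
        return "constrained beam-search decoding"
    if num_beams > 1:
        if num_beam_groups > 1:
            return "diverse beam-search decoding"
        if do_sample:
            return "beam-search multinomial sampling"
        return "beam-search decoding"
    if do_sample:
        return "multinomial sampling"
    if penalty_alpha is not None and penalty_alpha > 0 and top_k is not None and top_k > 1:
        return "contrastive search"
    return "greedy decoding"


# The argument combinations used in the cells that follow:
print(decoding_strategy())
print(decoding_strategy(penalty_alpha=0.6, top_k=4))
print(decoding_strategy(do_sample=True, num_beams=1))
print(decoding_strategy(num_beams=5))
print(decoding_strategy(num_beams=5, do_sample=True))
print(decoding_strategy(num_beams=5, num_beam_groups=5))
```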

In [27]:
# Greedy Generation
outputs = gemma.generate(**inputs)
gemma_tokenizer.batch_decode(outputs, skip_special_tokens=True)



['translate English to French: Configuration files are easy to use! They allow you to configure your system']

In [29]:
# Contrastive Search
prompt = "Hugging Face Company is"
inputs = gemma_tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = gemma.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=100) # penalty_alpha
gemma_tokenizer.batch_decode(outputs, skip_special_tokens=True)

['Hugging Face Company is a global leader in natural language processing (NLP) and machine learning (ML). They offer a wide range of products and services, including:\n\n* **Chatbots:** Hugging Face provides chatbots for businesses, education, and healthcare.\n* **Language models:** These pre-trained models are used for a variety of NLP tasks, such as sentiment analysis, text summarization, and machine translation.\n* **Data and tools:** Hugging Face offers a variety of data and tools for NLP']

In [31]:
# Multinomial Sampling
from transformers import set_seed

set_seed(0)  # For reproducibility

prompt = "Today was an amazing day because"
inputs = gemma_tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = gemma.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
gemma_tokenizer.batch_decode(outputs, skip_special_tokens=True)

['Today was an amazing day because I got to explore some new places with my friends. We headed to [Place Name] and had a blast exploring the historic architecture, learning about the local history, and enjoying fresh local food. \n\nHere are a few highlights:\n\n **The Old Town Hall:** This building is a beautiful example of Renaissance architecture and is now used as a community center and art gallery. We got to take a peek inside and admire the beautiful paintings and sculptures.\n\n**The Castle Hill:** We had a picnic']

In [32]:
# Beam Search Decoding

prompt = "It is astonishing how one can"

inputs = gemma_tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = gemma.generate(**inputs, num_beams=5, max_new_tokens=50)
gemma_tokenizer.batch_decode(outputs, skip_special_tokens=True)

["It is astonishing how one can get lost in the vastness of the universe, yet remain anchored in the present moment. This is the essence of mindfulness meditation, a practice that transcends mere relaxation and delves into the depths of one's being.\n\nHere's how mindfulness"]

In [34]:
# Beam Search Multinomial Sampling

set_seed(0)  # For reproducibility

prompt = "translate English to German: The house is wonderful."

inputs = gemma_tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = gemma.generate(**inputs, num_beams=5, do_sample=True)
gemma_tokenizer.decode(outputs[0], skip_special_tokens=True)



"translate English to German: The house is wonderful. It's spacious, well-maintained,"

In [39]:
# Diverse beam search decoding

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

prompt = (
    "The Permaculture Design Principles are a set of universal design principles "
    "that can be applied to any location, climate and culture, and they allow us to design "
    "the most efficient and sustainable human habitation and food production systems. "
    "Permaculture is a design system that encompasses a wide variety of disciplines, such "
    "as ecology, landscape design, environmental science and energy conservation, and the "
    "Permaculture design principles are drawn from these various disciplines. Each individual "
    "design principle itself embodies a complete conceptual framework based on sound "
    "scientific principles. When we bring all these separate  principles together, we can "
    "create a design system that both looks at whole systems, the parts that these systems "
    "consist of, and how those parts interact with each other to create a complex, dynamic, "
    "living system. Each design principle serves as a tool that allows us to integrate all "
    "the separate parts of a design, referred to as elements, into a functional, synergistic, "
    "whole system, where the elements harmoniously interact and work together in the most "
    "efficient way possible."
)

inputs = gemma_tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = gemma.generate(**inputs, num_beams=5, num_beam_groups=5, max_new_tokens=30, diversity_penalty=1.0)

print(gemma_tokenizer.decode(outputs[0], skip_special_tokens=True))

In [37]:
outputs

tensor([[     2,    651,   2399,  11963,   3735,   6307,  33797,    708,    476,
           1142,    576,  15660,   2480,  12555,    674,    798,    614,   7936,
            577,   1089,   5201, 235269,  11662,    578,   6981, 235269,    578,
            984,   2765,    917,    577,   2480,    573,   1546,  10878,    578,
          19496,   3515, 121480,    578,   2960,   4584,   5188, 235265,   2399,
          11963,   3735,    603,    476,   2480,   1812,    674,  95164,    476,
           5396,   8080,    576,  50416, 235269,   1582,    685,  55396, 235269,
          15487,   2480, 235269,  10868,   8042,    578,   4134,  19632, 235269,
            578,    573,   2399,  11963,   3735,   2480,  12555,    708,  11288,
            774,   1450,   4282,  50416, 235265,   9573,   3811,   2480,  12854,
           5344, 131474,    476,   3407,  36640,  15087,   3482,    611,   4835,
          12132,  12555, 235265,   3194,    783,   4347,    832,   1450,   8905,
            139, 165352,   3

In [None]:
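# Speculative (assisted) decoding
# `assistant_model` must be loaded beforehand, typically a smaller causal LM that shares the tokenizer;
# it is not defined in this notebook.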
prompt = "Alice and Bob"

inputs = gemma_tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = gemma.generate(**inputs, assistant_model=assistant_model)
gemma_tokenizer.batch_decode(outputs, skip_special_tokens=True)

In [41]:
del gemma
torch.cuda.empty_cache()  # this time the model is not offloading
# restarting the server to release the memory

# Llama observation



- Llama provides the output of 10 riddles

- Llama encoding and decoding work the same way as for RoBERTa models

In [42]:
import torch
from transformers import BitsAndBytesConfig

code_llama_path = "codellama/CodeLlama-7b-hf"

code_tokenizer = AutoTokenizer.from_pretrained(code_llama_path)

# Assumed 4-bit setup; the notebook's own `quant_config` definition is not shown
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

codellama = AutoModelForCausalLM.from_pretrained(
    code_llama_path,
    device_map="auto",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16
)  # takes around 11.5GB of VRAM
# with 4-bit quantization 5GB of VRAM is consumed

In [43]:
code_tokenizer.default_chat_template


No chat template is defined for this tokenizer - using the default template for the CodeLlamaTokenizerFast class. If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.



"{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif false == true and not '<<SYS>>' in messages[0]['content'] %}{% set loop_messages = messages %}{% set system_message = 'You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\\n\\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don\\'t know the answer to a question, please don\\'t share false information.' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must

In [None]:
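# `code_message` is expected to be a list of {"role": ..., "content": ...} dicts defined beforehand (not shown here)
code_input = code_tokenizer.apply_chat_template(code_message,return_tensors='pt').to("cuda")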

Using sep_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using mask_token, but it is not set yet.
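
`apply_chat_template` turns a list of role/content messages into a single prompt before tokenizing. The sketch below renders such a list in a simplified Llama-2 style, following the visible part of the template above (optional leading system message, strictly alternating user/assistant roles). The function name is hypothetical, and the exact whitespace and special tokens of the real template may differ.

```python
def render_llama2_style(messages, bos="<s>", eos="</s>"):
    """Simplified Llama-2-style rendering of [{"role": ..., "content": ...}, ...]."""
    system = None
    if messages and messages[0]["role"] == "system":
        system, messages = messages[0]["content"], messages[1:]

    parts = []
    for index, message in enumerate(messages):
        # Roles must alternate: user, assistant, user, ...
        expected = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected:
            raise ValueError("Conversation roles must alternate user/assistant")
        content = message["content"].strip()
        if message["role"] == "user":
            if index == 0 and system is not None:
                # The system message is folded into the first user turn.
                content = f"<<SYS>>\n{system}\n<</SYS>>\n\n{content}"
            parts.append(f"{bos}[INST] {content} [/INST]")
        else:
            parts.append(f" {content} {eos}")
    return "".join(parts)


example = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
]
print(render_llama2_style(example))
```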


In [None]:
code_output = codellama.generate(
    code_input,
    max_new_tokens=500,
    do_sample=True,
    pad_token_id=code_tokenizer.eos_token_id
)

In [None]:
code_output = codellama.generate(
    code_input,
    max_new_tokens=500,
    do_sample=True,
    pad_token_id=code_tokenizer.eos_token_id,
    temperature=0.2
)

In [None]:
code_output = codellama.generate(
    code_input,
    max_new_tokens=500,
    # do_sample=True,
    pad_token_id=code_tokenizer.eos_token_id,
    # temperature=0.2
)

In [None]:
code_output = codellama.generate(
    code_input,
    max_new_tokens=500,
    # do_sample=True,
    pad_token_id=code_tokenizer.eos_token_id,
    # temperature=0.2
    repetition_penalty=0.5  # values below 1.0 reward repetition; use > 1.0 to penalise it
)
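
What `repetition_penalty` does to the scores can be shown on a toy logit vector. A minimal sketch, assuming the usual rule (logits of already generated tokens are divided by the penalty when positive and multiplied by it when negative); the function name is made up for this note.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Rescale the logits of tokens that were already generated."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Positive scores shrink and negative scores grow more negative when penalty > 1.
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


logits = [2.0, -1.0, 0.5]   # scores for token ids 0, 1, 2
already_generated = [0, 1]  # tokens 0 and 1 appeared earlier

discouraged = apply_repetition_penalty(logits, already_generated, penalty=1.2)
encouraged = apply_repetition_penalty(logits, already_generated, penalty=0.5)
print(discouraged)
print(encouraged)
```

With a penalty below 1.0 the already seen tokens become more likely, not less, which tends to produce loops.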

In [None]:
code_output = codellama.generate(
    code_input,
    max_new_tokens=500,
    do_sample=True,
    pad_token_id=code_tokenizer.eos_token_id,
    # temperature=0.2
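    repetition_penalty=0.5,  # values below 1.0 reward repetition; use > 1.0 to penalise it
    top_p=0.9,  # cumulative-probability threshold; valid range is (0, 1]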
    top_k=20
)
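
The sampling parameters used in these cells (`temperature`, `top_k`, `top_p`) all reshape the next-token distribution before a token is drawn. A pure-Python sketch on a four-token toy distribution; the helper names are hypothetical and the real implementation works on logits tensors.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalise(probs, keep):
    """Zero out everything outside `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalise(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalise(probs, keep)


probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, k=2))
print(top_p_filter(probs, p=0.9))
print(softmax([2.0, 1.0, 0.0], temperature=0.2))
```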

In [None]:
output = code_output[0][len(code_input[0]):]
code_tokenizer.decode(output, skip_special_tokens=True)

### Code Llama observation

- Model inference led to OOM on a 500-token request when the model was loaded in bfloat16

- Model inference worked with 500 tokens when the model was loaded with the 4-bit quant_config

- With quantisation, 1GB of VRAM is consumed for inference


- **The model output was gibberish**

- The request was for code related to dictionaries

- Even after reviewing the generation configs, only gibberish was generated

# Open AI

# Groq

In [6]:
!pip install groq -q


In [11]:
from groq import Groq
client = Groq()
chat_completion = client.chat.completions.create(
    # required parameters
    messages = [
        {"role":"system",
         "content":"You are a helpful assistant"},
        {"role":"user",
         "content":"Explain importance of fast language model"}
    ],
    model = "mixtral-8x7b-32768",

    # optional parameter
    temperature = 0.6,
    top_p = 1,
    max_tokens=1024,
    stop = None,
    stream = False
)

print(chat_completion.choices[0].message.content)

A fast language model is important for several reasons:

1. Improved User Experience: A fast language model can quickly process user inputs and provide real-time responses, leading to a smoother and more responsive user experience.
2. Increased Productivity: Fast language models can handle large volumes of data and requests, making them ideal for use cases that require high-throughput processing, such as chatbots, virtual assistants, and language translation services.
3. Reduced Costs: Fast language models can process data more efficiently, reducing the computational resources required to perform tasks, leading to cost savings.
4. Enhanced Accuracy: Fast language models can quickly learn and adapt to new data, allowing them to provide more accurate and relevant responses over time.
5. Scalability: Fast language models can easily scale to handle increasing volumes of data and requests, making them well-suited for use in large-scale applications and systems.

Overall, a fast language mod

In [3]:
import os
from google.colab import userdata

# Copy the key from Colab secrets into the environment, where Groq() looks for it
os.environ['GROQ_API_KEY'] = userdata.get('GROQ_API_KEY')
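
When checking that a key such as `GROQ_API_KEY` is loaded, it is safer to display only a masked form than the secret itself. A small sketch; `mask_secret` is a hypothetical helper and not part of `groq` or Colab.

```python
import os


def mask_secret(value, visible=4):
    """Show only the first `visible` characters of a secret."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)


# Prints a masked key, or "<not set>" when the variable is missing.
print(mask_secret(os.environ.get("GROQ_API_KEY")))
```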

In [4]:
# Confirm the key is set without printing the secret itself
print('GROQ_API_KEY' in os.environ)


In [None]:
# create a class, object and save the object to the pkl file