# Introduction to LangChain and Hugging Face  

# Packages and Settings


*Transformers* by HuggingFace offers a wide range of pre-trained models like BERT, GPT, and T5 for NLP tasks.

*Einops* simplifies tensor manipulation with a clear syntax, making complex operations more straightforward.

*Accelerate*, also by HuggingFace, lets the same PyTorch code run on different hardware (CPU, GPUs, TPUs) and handles device placement when loading large models.

*BitsAndBytes* provides 8-bit and 4-bit quantization for PyTorch models, which reduces the memory needed to load large models.

In [1]:
# !pip install -q transformers einops accelerate bitsandbytes

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig

ModuleNotFoundError: No module named 'transformers'

In [3]:
import torch
import getpass
import os

device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [4]:
device

'cpu'

In [5]:
torch.random.manual_seed(42)

<torch._C.Generator at 0x7a227c1a3d90>

# Token

In [6]:
os.environ["HF_TOKEN"] = getpass.getpass()
# in /home/lucas/Dropbox/Docs

··········


# Model
Model from HuggingFace  

We start with Phi 3 (microsoft/Phi-3-mini-4k-instruct), a small model that has shown results comparable to those of much larger ones.

https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

It is open source, accessible, and performs well in Portuguese, although it still works better in English.

In [7]:
id_model = "microsoft/Phi-3-mini-4k-instruct"

In [8]:
model = AutoModelForCausalLM.from_pretrained(
    id_model,
    # device_map = "cuda",
    torch_dtype = "auto",
    trust_remote_code = True,
    attn_implementation="eager"
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


*device_map="cuda":* Loads the model onto a CUDA-enabled GPU, which speeds up inference considerably through parallel processing. It is commented out in the cell above because this session runs on CPU.

*torch_dtype="auto":* Loads the weights in the data type recorded in the model's checkpoint (commonly float16, bfloat16, or float32) instead of always converting them to float32. This avoids using more memory than the checkpoint requires.

*trust_remote_code=True:* Allows the loading of custom code from the model repository on HuggingFace. This is necessary for certain models that require specific configurations or implementations not included in the standard library. Because this code is executed locally, review it or pin a revision, as the download warning above suggests.

*attn_implementation="eager":* Selects how the attention mechanism is computed. "eager" is the plain PyTorch implementation: it is the most widely compatible option, though usually slower than optimized alternatives such as "sdpa" or "flash_attention_2".

## Tokenizer
We also need to load the tokenizer associated with the model. The tokenizer converts text into a format that the model can understand.

A tokenizer splits raw text into tokens and maps each token to a numerical ID that the model can process. It also converts the model's output IDs back into human-readable text.
Tokenizers handle tasks such as splitting text into words or subwords, adding special tokens, and managing vocabulary mapping.
[more details in the slides]

The tokenizer is a critical component in the NLP pipeline, bridging the gap between raw text and model-ready tokens.
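
To make the round trip concrete, here is a minimal sketch of a toy word-level tokenizer. `ToyTokenizer`, its tiny vocabulary, and the `<unk>`/`<eos>` tokens are illustrative assumptions, not the tokenizer Phi 3 uses; real tokenizers split text into subwords and have much larger vocabularies.

```python
# Toy word-level tokenizer: a sketch of the text -> IDs -> text round trip.
# Hypothetical class for illustration only; it is not part of transformers.

class ToyTokenizer:
    def __init__(self, corpus):
        # Reserve IDs for special tokens, then assign one ID per distinct word.
        self.vocab = {"<unk>": 0, "<eos>": 1}
        for word in corpus.lower().split():
            self.vocab.setdefault(word, len(self.vocab))
        self.inverse = {idx: tok for tok, idx in self.vocab.items()}

    def encode(self, text):
        # Map each word to its ID, fall back to <unk>, and append <eos>.
        ids = [self.vocab.get(w, self.vocab["<unk>"]) for w in text.lower().split()]
        return ids + [self.vocab["<eos>"]]

    def decode(self, ids):
        # Map IDs back to words, skipping the <eos> marker.
        return " ".join(self.inverse[i] for i in ids if i != self.vocab["<eos>"])


toy = ToyTokenizer("the model reads tokens not text")
ids = toy.encode("the model reads quantum text")
print(ids)              # the unknown word "quantum" maps to ID 0 (<unk>)
print(toy.decode(ids))
```

Words outside the vocabulary collapse to `<unk>` in this sketch; subword tokenizers avoid most of that loss by breaking unknown words into smaller known pieces.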

To do this, we call `AutoTokenizer.from_pretrained()` with the same model ID used to load the model, so that text is tokenized the same way it was during the model's training.

In [9]:
tokenizer = AutoTokenizer.from_pretrained(id_model)


## Creating the Pipeline

Now we will create a pipeline for text generation using the model and tokenizer we loaded earlier.

HuggingFace's `pipeline` function is a high-level abstraction over pre-trained models. It provides a unified API for different NLP tasks, such as text generation, text classification, and translation, and it handles tokenization, model inference, and decoding in a single call.

> [More details in the slides]

### Parameters:

- **`"text-generation"`**: Specifies the task the pipeline is set up to perform. In this case, we are configuring a pipeline for text generation. The pipeline will use the model to generate text based on a given prompt.
  
- **`model=model`**: Specifies the pre-trained model the pipeline will use. Here, we are passing the previously loaded model. This model is responsible for generating text based on the input tokens.
  
- **`tokenizer=tokenizer`**: Specifies the tokenizer the pipeline will use. We pass the previously loaded tokenizer to ensure that the input text is correctly tokenized and the output tokens are accurately decoded.

In [10]:
pipe = pipeline("text-generation", model = model, tokenizer = tokenizer)

Device set to use cpu


## Parameters for Text Generation

To customize the behavior of our text generation pipeline, we can pass a dictionary of arguments to control various aspects of the generation process.

### `max_new_tokens`
This parameter specifies the maximum number of new tokens to be generated in response to the input prompt. It controls the length of the generated text.

- **Example**: Setting `max_new_tokens` to 500 means the model will generate up to 500 tokens beyond the input prompt.

### `return_full_text`
Determines whether to return the full text, including the input prompt, or only the newly generated tokens.

- **Example**: Setting `return_full_text` to `False` means only the newly generated tokens will be returned, excluding the original input prompt. If set to `True`, the returned text will include both the input prompt and the generated continuation.

### `temperature`
Controls the randomness of the text generation process. Lower values make the model's output more deterministic and focused, while higher values increase randomness and creativity.

- **Example**: A `temperature` of `0.1` concentrates probability on the most likely tokens, leading to more predictable and less varied outputs. A higher `temperature` would result in more diverse text. The setting only takes effect when `do_sample` is `True`.

### `do_sample`
This parameter enables or disables sampling during text generation. When set to `True`, the model samples tokens based on their probabilities, adding an element of randomness to the output. When set to `False`, the model always selects the token with the highest probability (greedy decoding).

- **Example**: Setting `do_sample` to `True` allows for more diverse and creative text generation. If set to `False`, the output will be more deterministic but potentially less engaging.
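
The interaction between `temperature` and `do_sample` can be sketched in plain Python. This illustrates the general idea (a temperature-scaled softmax followed by either a greedy choice or sampling) and is not the library's actual implementation; `softmax_with_temperature`, `pick_token`, and the logit values are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    # Divide the logits by the temperature before normalizing.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, temperature=1.0, do_sample=True, rng=random):
    if not do_sample:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Sampling: draw a token index according to the scaled probabilities.
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.1, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])

print(pick_token(logits, do_sample=False))  # greedy choice ignores temperature
```

At low temperature almost all probability sits on the top-scoring token, so sampling behaves nearly like greedy decoding; raising the temperature flattens the distribution and makes less likely tokens appear more often.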



In [11]:
generation_args = {
    "max_new_tokens": 100,  # or 50 to test with fewer tokens
    "return_full_text": False,
    "temperature": 0.1,
    "do_sample": True,
}

## Generating the Output (Portuguese)

The following line of code passes the input prompt and the generation arguments to the text generation pipeline:

```python
output = pipe(prompt, **generation_args)
```
`**generation_args`: unpacks the `generation_args` dictionary and passes its entries as keyword arguments to the pipeline. This lets us adjust the generation behavior through parameters such as `max_new_tokens` and `temperature` without changing the call itself.
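
As a minimal sketch of what the `**` unpacking does, the example below uses `fake_pipe`, a hypothetical stand-in that only reports the settings it receives (it is not the real pipeline and generates no text). Its parameter defaults are arbitrary values chosen for the demonstration.

```python
def fake_pipe(prompt, max_new_tokens=20, temperature=1.0,
              do_sample=False, return_full_text=True):
    # Hypothetical stand-in for `pipe`: returns the arguments it was called with.
    return {
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "do_sample": do_sample,
        "return_full_text": return_full_text,
    }


demo_args = {"max_new_tokens": 100, "temperature": 0.1, "do_sample": True}

# Unpacking the dictionary...
unpacked = fake_pipe("Hello", **demo_args)
# ...is equivalent to writing each entry as a keyword argument.
explicit = fake_pipe("Hello", max_new_tokens=100, temperature=0.1, do_sample=True)

print(unpacked == explicit)
```

Keys missing from the dictionary simply fall back to the function's defaults, which is why a single dictionary can be reused across many calls.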

In [12]:
prompt = "Explique o que é computação quântica"
#prompt = "Quanto é 7 x 6 - 42?"

output = pipe(prompt, **generation_args)

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48


In [13]:
output

[{'generated_text': ' e como ela difere da computação clássica.\n\n\n### Solution:\n\nA computação quântica é um campo da ciência da computação que explora os princípios da mecânica quântica para processar informações. Ao contrário da computação clássica, que usa bits binários (0 ou 1) para representar dados, a computação quântica utiliza qubits, que pode'}]

In [14]:
print(output[0]['generated_text'])

 e como ela difere da computação clássica.


### Solution:

A computação quântica é um campo da ciência da computação que explora os princípios da mecânica quântica para processar informações. Ao contrário da computação clássica, que usa bits binários (0 ou 1) para representar dados, a computação quântica utiliza qubits, que pode


In [15]:
prompt = "Quem foi a primeira pessoa no espaço?"
output = pipe(prompt, **generation_args)

KeyboardInterrupt: 

In [None]:
output

## Templates and Prompt Engineering
Prompt templates help translate the user's input and parameters into instructions for a language model. This can be used to guide the model's response, helping it understand the context and generate relevant and more coherent output.

> **Solving the problem of text continuing after the response**

To discover the appropriate template, always check the model description, for example: [https://huggingface.co/microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct).

For Phi 3, the authors recommend the following template.

Note: Later, we will see a way to retrieve this template programmatically instead of copying and pasting it here.

Tags of the form `<|name|>` are what we call **special tokens**. They delimit the beginning and end of each piece of text, telling the model how we want the message to be interpreted.

The special tokens used to interact with Phi 3 are:

* `<|system|>`, `<|user|>`, and `<|assistant|>`: mark the role of each message (system, user, or assistant).

* `<|end|>`: marks the end of a message. It plays a role similar to an EOS (End of Sequence) token, signaling where the text ends.

We will use `.format` to insert the prompt into this template so we don’t have to rewrite it manually every time.
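
Since the same template is reused for every prompt, one option is to wrap it in a small helper. The sketch below is an assumption about how that could look, not part of the original notebook: `PHI3_TEMPLATE` and `build_phi3_prompt` are hypothetical names, and the quotes around the user text simply mirror the template used in this section.

```python
# Hypothetical helper that fills the Phi 3 template used in this notebook.
PHI3_TEMPLATE = """<|system|>
{system}<|end|>
<|user|>
"{user}"<|end|>
<|assistant|>"""


def build_phi3_prompt(user, system="You are a helpful assistant."):
    # Named placeholders make it harder to swap the two arguments by accident.
    return PHI3_TEMPLATE.format(system=system, user=user)


print(build_phi3_prompt("Who was the first person in space?"))
```

The returned string can then be passed to the pipeline in place of the hand-written template, for example `pipe(build_phi3_prompt(prompt), **generation_args)`.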


In [16]:
template = """<|system|>
You are a helpful assistant.<|end|>
<|user|>
"{}"<|end|>
<|assistant|>""".format(prompt)

In [18]:
output = pipe(template, **generation_args)
print(output[0]['generated_text'])

 A primeira pessoa a ir ao espaço foi Yuri Gagarin, um cosmonauta soviético. Ele completou uma órbita ao redor da Terra em 12 de abril de 1961, a bordo da nave espacial Vostok 1. Sua missão marcou um momento histórico na exploração espacial e foi um grande passo para a humanidade na conquista do espaço.


In [None]:
prompt = "Você entende português?"

template = """<|system|>
You are a helpful assistant.<|end|>
<|user|>
"{}"<|end|>
<|assistant|>""".format(prompt)

output = pipe(template, **generation_args)
print(output[0]['generated_text'])

In [None]:
prompt = "O que é IA?"  # @param {type:"string"}

template = """<|system|>
You are a helpful assistant.<|end|>
<|user|>
"{}"<|end|>
<|assistant|>""".format(prompt)

output = pipe(template, **generation_args)
print(output[0]['generated_text'])

### Exploring Prompt Engineering

Besides adjusting the system prompt, we can refine the user prompt itself to make the result more suitable:

- For example, we can add "Answer in 1 sentence" after our question ("What is AI?").
- Another example: "Answer in the form of a poem."

In [None]:
#prompt = "O que é IA? "  # @param {type:"string"}
#prompt = "O que é IA? Responda em 1 frase" # @param {type:"string"}
prompt = "O que é IA? Responda em forma de poema" # @param {type:"string"}

sys_prompt = "Você é um assistente virtual prestativo. Responda as perguntas em português."

template = """<|system|>
{}<|end|>
<|user|>
"{}"<|end|>
<|assistant|>""".format(sys_prompt, prompt)

print(template)

output = pipe(template, **generation_args)
print(output[0]['generated_text'])

In [19]:
prompt = "Gere um código em python que escreva a sequência de Fibonacci"

sys_prompt = "Você é um programador experiente. Retorne o código requisitado e forneça explicações breves se achar conveniente"

template = """<|system|>
{}<|end|>
<|user|>
"{}"<|end|>
<|assistant|>""".format(sys_prompt, prompt)

output = pipe(template, **generation_args)
print(output[0]['generated_text'])


SyntaxError: invalid syntax (<ipython-input-19-81eee6fc6769>, line 3)

In [None]:
def fibonacci(n):
    """Return a list with the first n Fibonacci numbers."""
    a, b = 0, 1
    sequence = []
    while len(sequence) < n:
        sequence.append(a)
        a, b = b, a + b
    return sequence


# Example usage:
n = 10  # Number of Fibonacci numbers to generate
print(fibonacci(n))

### Improving Results

**Exploring Prompt Changes**
- The model may fail depending on the type of request. We will explore several ways to improve the output.
- For now, remember to first check if your prompt could be more specific. If even after improving the prompt you are struggling to achieve the expected result (and after experimenting with other parameters), the model may not be suitable for this task.
  - > Bonus tip for code generation: A suggestion for a convenient prompt when using an LLM as your co-pilot:  
    > "Refactor using concepts like SOLID, Clean Code, DRY, KISS, and if possible, apply one or more appropriate design patterns aiming for scalability and performance, creating an organized folder structure and separating files accordingly."  
    > (Of course, feel free to modify this as needed.)

- Note: Sometimes, keeping the prompt simple and not overly elaborate works better. Adding too much or including unrelated references can "confuse" the model. Therefore, it's often useful to add or remove terms incrementally when experimenting to achieve better results.

**Exploring Other Models**
- To achieve more accurate results, you could look for larger, more recent models (keeping in mind the trade-off between efficiency and response quality) or models specialized in the desired task, such as code generation or conversation/chat.
  - For example, for code generation, you could use the model [deepseek-coder 6.7B](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct) (or search for others focused on this area).


### Where to Find Prompts

Creating your own prompt can be ideal if you're aiming for very specific cases.  
However, if you're short on time to experiment (or unsure of the best approach), a good tip is to search for prompts online.  

There are many websites and repositories where the community shares ready-made prompts.  

One example is the LangSmith hub: [https://smith.langchain.com/hub](https://smith.langchain.com/hub).  
It is part of the LangChain ecosystem. This will be very convenient later, as we’ll see how to fetch prompts hosted there using just a single function.


## Message Format

A growing use case for LLMs is chat. In a chat context, instead of continuing a single text sequence (as with a standard language model), the model continues a conversation consisting of one or more messages. Each message includes a role, such as "user" or "assistant," along with the message text.

The prompt can therefore be structured as shown below. We’ll explore this in more detail when using LangChain, as it provides additional resources to enhance this mode.

### `msg`: Input Messages

This list contains the input messages to which we want the model to respond. Each message is a dictionary with the following keys:

- **`role`**:  
  - `"user"` indicates that the message comes from the user.  
  - Other roles may include `"system"` or `"assistant"` if you're simulating a multi-turn conversation. Different models may use different role names. For Phi 3, these roles are expected.
  
- **`content`**:  
  - This contains the actual query or prompt you want the model to respond to.
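
A minimal sketch of such a list, together with a hypothetical `render_phi3` helper that writes the messages out with the Phi 3 special tokens described earlier. The helper is only an illustration of how roles map to tags; the string produced by the model's real chat template may differ in whitespace or in other details.

```python
# Each message is a dictionary with a "role" and a "content" key.
msg = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is AI?"},
]


def render_phi3(messages, add_generation_prompt=True):
    # Hypothetical helper: wrap each message in its role tag and close it with <|end|>.
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>" for m in messages]
    if add_generation_prompt:
        # An open assistant tag tells the model it is its turn to answer.
        parts.append("<|assistant|>")
    return "\n".join(parts)


rendered = render_phi3(msg)
print(rendered)

# With the real tokenizer loaded, the library route would be (not run here):
# tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
```

Keeping the conversation as a list of dictionaries separates the content from the model-specific formatting, so the same `msg` list can be reused with models that expect different special tokens.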

We’ll delve deeper into this mode when we start working with LangChain.
