# Class 3 - Solving Tasks with an LLM (Large Language Model) in a Zero- and Few-shot Manner

[Unicamp - IA368DD: Deep Learning applied to search systems.](https://www.cpg.feec.unicamp.br/cpg/lista/caderno_horario_show.php?id=1779)

Author: Marcus Vinícius Borela de Castro

[GitHub repository](https://github.com/marcusborela/deep_learning_em_buscas_unicamp)

[Link to the supporting chat with WebChatGPT](https://github.com/marcusborela/deep_learning_em_buscas_unicamp/blob/main/chat/aula3_resolvendo_tarefas_com_llm_de_maneira_zero_e_few_shot.md)

[![Open In Colab latest github version](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/marcusborela/deep_learning_em_buscas_unicamp/blob/main/code/aula_2_classificacao_de_texto_e_reranqueador.ipynb)

# Exercise statement
The student will choose a task to solve in a zero-shot or few-shot manner. Suggestions:
- Text classification (e.g., sentiment analysis (IMDB))
- Predicting whether a passage/paragraph is relevant to a question/query
- Deciding whether an answer predicted by a QA system or summarizer is semantically equivalent to the ground-truth answer


It is important to have a function that evaluates the quality of the few-shot model's answers, for example accuracy.


A small test dataset can be created manually (e.g., with 10 to 100 examples).
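As a minimal sketch of such an evaluation function, the block below computes accuracy over a small hand-made test set. The dataset, the label names, and `keyword_baseline` (a stand-in for the real LLM call so the example runs offline) are all hypothetical and not part of the exercise.

```python
from typing import Callable, List, Tuple

# Hypothetical hand-made test set: (review text, gold label)
TEST_SET: List[Tuple[str, str]] = [
    ("A wonderful, moving film.", "positive"),
    ("Two hours of my life I will never get back.", "negative"),
    ("The cast is great and the script is sharp.", "positive"),
    ("Dull plot and wooden acting.", "negative"),
]


def normalize_label(raw_answer: str) -> str:
    """Maps a free-text answer to a label: trims spaces and punctuation, lowercases."""
    return raw_answer.strip().strip(".!\"'").lower()


def accuracy(predict: Callable[[str], str], dataset: List[Tuple[str, str]]) -> float:
    """Fraction of examples whose normalized prediction equals the gold label."""
    if not dataset:
        raise ValueError("dataset is empty")
    hits = sum(normalize_label(predict(text)) == gold for text, gold in dataset)
    return hits / len(dataset)


def keyword_baseline(text: str) -> str:
    """Stand-in for an LLM call (assumption), so the sketch runs without an API key."""
    return "Positive." if ("great" in text or "wonderful" in text) else "Negative."


print(accuracy(keyword_baseline, TEST_SET))
```

In practice `predict` would wrap the API call and return the model's raw answer; normalizing before comparing avoids counting "Positive." and "positive" as different labels.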


- Use the LLAMA API provided by us (license exclusively for research). [Colab demo of the LLAMA API](https://colab.research.google.com/drive/1zZ-ch29LTicNPA62t2MaOwMROywnqUxf?usp=sharing) (thanks, Thales Rogério)
- Optionally, use the code-davinci-002 API, which is free and gives very good results.
CAUTION: DO NOT USE TEXT-DAVINCI-002/003, which are paid.

- Optionally, use the ChatGPT API (gpt-3.5-turbo), which is cheap: about 1 centavo of a Brazilian real (R$ 0.01) per 1000 tokens (roughly one page)
- Optionally, use Alpaca: https://alpaca-ai.ngrok.io/


Tips:
- Test with zero-shot AND few-shot.
- In few-shot, test with and without instructions in the header (an explanation of the task, e.g., "Translate from English to Portuguese"). The model may even work better without the instruction.
- Always follow a consistent pattern when creating the few-shot examples. This page has tips on prompt engineering: https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api
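The tips above can be sketched as a single prompt builder that keeps one fixed pattern for every example and makes the instruction header optional, so the zero-shot, few-shot-with-header, and few-shot-without-header variants are easy to compare. The `Review:`/`Sentiment:` pattern and the example texts are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    query: str,
    instruction: Optional[str] = None,
) -> str:
    """Builds a completion-style prompt that repeats one pattern per example."""
    blocks = []
    if instruction:
        blocks.append(instruction)
    for text, label in examples:
        blocks.append(f"Review: {text}\nSentiment: {label}")
    # The query repeats the same pattern and leaves the label open
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


examples = [
    ("I loved every minute.", "positive"),
    ("A complete waste of time.", "negative"),
]
query = "The ending was disappointing."

# Zero-shot: instruction only, no examples
zero_shot = build_few_shot_prompt(
    [], query, "Classify the sentiment of the review as positive or negative."
)
# Few-shot without the instruction header
few_shot_no_header = build_few_shot_prompt(examples, query)
print(few_shot_no_header)
```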


# Tips

[Prompt examples](https://platform.openai.com/examples)


# Setting up the environment

In [20]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Not connected to a GPU


In [21]:
from psutil import virtual_memory


In [22]:
def mostra_memoria(lista_mem=('cpu',)):
  """
  Displays CPU and/or GPU memory information, according to the given parameters.

  Parameters:
  -----------
  lista_mem : list or tuple, optional
      Sequence with the strings 'cpu' and/or 'gpu'.
      'cpu' - displays CPU memory information.
      'gpu' - displays GPU memory information (if available).
      The default value is ('cpu',).

  Output:
  -------
  The function returns nothing; it only prints the information.

  Usage example:
  ---------------
  To display CPU memory information:
      mostra_memoria(['cpu'])

  To display CPU and GPU memory information:
      mostra_memoria(['cpu', 'gpu'])

  Author: Marcus Vinícius Borela de Castro

  """
  if 'cpu' in lista_mem:
    # Take a single snapshot so that all values refer to the same instant
    vm = virtual_memory()
    ram = {}
    ram['total'] = round(vm.total / 1e9, 2)
    ram['available'] = round(vm.available / 1e9, 2)
    ram['used'] = round(vm.used / 1e9, 2)
    ram['free'] = round(vm.free / 1e9, 2)
    # The fields below are only available on Linux (the case on Colab)
    ram['active'] = round(vm.active / 1e9, 2)
    ram['inactive'] = round(vm.inactive / 1e9, 2)
    ram['buffers'] = round(vm.buffers / 1e9, 2)
    ram['cached'] = round(vm.cached / 1e9, 2)
    print(f"Your runtime RAM in GB: \n total {ram['total']}\n available {ram['available']}\n used {ram['used']}\n free {ram['free']}\n cached {ram['cached']}\n buffers {ram['buffers']}")
  if 'gpu' in lista_mem:
    # Query the GPU only when requested, so gpu_info is always defined here
    print('\nGPU')
    gpu_info = !nvidia-smi
    gpu_info = '\n'.join(gpu_info)
    if gpu_info.find('failed') >= 0:
      print('Not connected to a GPU')
    else:
      print(gpu_info)


In [23]:
mostra_memoria(['cpu'])

Your runtime RAM in GB:
 total 13.62
 available 12.44
 used 0.88
 free 9.32
 cached 3.06
 buffers 0.37


### Mounting a Google Drive folder to save data

In [24]:
import os

In [25]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Fixing the seeds

In [71]:
import random
import numpy as np
import torch

In [69]:
def inicializa_seed(num_semente:int=123):
  """
  Initializes the seeds to make the model's results reproducible.
  This is a recommended practice, since random number generation can influence the model's results.
  The function also seeds the GPU generators and configures cuDNN for deterministic behavior when GPU acceleration is used.

  Args:
      num_semente (int): seed value used to initialize the libraries' random number generators.

  References:
      http://nlp.seas.harvard.edu/2018/04/03/attention.html
      https://github.com/CyberZHG/torch-multi-head-attention/blob/master/torch_multi_head_attention/multi_head_attention.py#L15
  """
  # Seed the random, numpy and pytorch libraries
  random.seed(num_semente)
  np.random.seed(num_semente)
  torch.manual_seed(num_semente)

  # Seed every CUDA device (silently ignored when CUDA is not available)
  torch.cuda.manual_seed_all(num_semente)

  # Make cuDNN select deterministic algorithms and disable its auto-tuner
  torch.backends.cudnn.deterministic = True
  torch.backends.cudnn.benchmark = False


In [72]:
num_semente=123
inicializa_seed(num_semente)
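
A quick way to see what seeding buys is to draw a few numbers twice with the same seed and compare. The sketch below uses only the standard-library `random` module so that it runs anywhere; `draw` is a hypothetical helper, and the same idea applies to the numpy and PyTorch generators seeded above.

```python
import random


def draw(seed: int, n: int = 3):
    """Re-seeds the stdlib generator and returns n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw(123)
second = draw(123)

# Same seed: the two sequences are identical
print(first == second)
# Different seed: the sequence changes
print(first == draw(124))
```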

## Preparing for debugging and display

In [68]:
import pandas as pd
import os

In [None]:
!pip install transformers -q

In [None]:
import transformers

Tips at https://zohaib.me/debugging-in-google-collab-notebook/

In [None]:
!pip install -Uqq ipdb
import ipdb
# %pdb off # disables the debugger on exceptions
# %pdb on  # enables the debugger on exceptions
# ipdb.set_trace(context=8)  # pauses execution at this point

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires ipython~=7.9.0, but you have ipython 8.11.0 which is incompatible.

In [73]:
def config_display():
  """
  Configures the Pandas display options.
  """

  # Configuring the Pandas output format
  # maximum number of columns displayed (None = no limit)
  pd.options.display.max_columns = None

  # maximum width of a line, in characters
  pd.options.display.width = 1000

  # maximum number of rows displayed
  pd.options.display.max_rows = 100

  # maximum number of characters displayed per column cell
  pd.options.display.max_colwidth = 50

  # whether to display the number of rows and columns of a DataFrame
  pd.options.display.show_dimensions = True

  # number of digits after the decimal point displayed for floats
  pd.options.display.precision = 7


In [None]:
def config_debug():
  """
  Configures the debug options of PyTorch, of the transformers
  package, and of the IPython exception display.
  """

  # Print tensors in scientific notation.
  # Very large or very small values are shown in scientific notation: for
  # example, 0.0000012345 is printed as 1.2345e-06. This makes tensors whose
  # values have extreme magnitudes easier to read.
  torch.set_printoptions(sci_mode=True)

  # Enable anomaly detection in PyTorch autograd.
  # With this option on, the backward pass raises an error as soon as a
  # backward function produces NaN values, and the error includes the
  # traceback of the forward operation that created it. This helps locate the
  # origin of numerical stability problems, such as exploding gradients,
  # before they turn into a bigger problem.
  # Anomaly detection can slow execution down significantly, especially in
  # large and complex models, so it should be used with care and only for
  # debugging.
  torch.autograd.set_detect_anomaly(True)

  # Make CUDA API calls synchronous (blocking).
  # Python waits for each CUDA call to finish before issuing the next one.
  # This helps debug GPU code because the error is reported where it occurs,
  # not after a sequence of operations that hides its origin. The variable
  # must be set before CUDA is initialized to take effect.
  # This mode is significantly slower than asynchronous execution, which is
  # the CUDA default, so use it only while debugging and remove it once the
  # problem is solved.
  os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

  # Set the verbosity level of the transformers package to info.
  # The Hugging Face Transformers library then prints informational log
  # messages about the progress of the execution, which helps to understand
  # what is happening during the task and supports debugging. The volume of
  # messages can be large, which may slow things down and bury the relevant
  # information, so adjust the level to the needs of each task.
  transformers.utils.logging.set_verbosity_info()
  # To reduce the number of messages, comment out the line above and
  # uncomment one of the two lines below (error or warning level):
  # transformers.utils.logging.set_verbosity_error()
  # transformers.utils.logging.set_verbosity_warning()

  # Set the exception reporting mode (xmode) to Verbose, used when debugging.
  # %xmode is a Jupyter/IPython magic that controls how exceptions are shown.
  # Verbose mode prints the full call stack and the values of the variables
  # involved in each frame at the moment of the exception, which helps to find
  # the cause of the error.
  %xmode Verbose
  # To use plain mode instead, comment out the line above and uncomment:
  # %xmode Plain

  # Tips:
  # 1. pdb (Python Debugger)
  #    When an exception occurs, the program stops and shows an error message
  #    with the line where the error happened and the type of the exception.
  #    To inspect variables and run other operations at the moment of the
  #    exception, run the %debug magic right after the exception occurs. It
  #    opens pdb at the line that raised the exception, where you can explore
  #    the state of the variables and examine the call stack.
  #
  # 2. ipdb
  #    ipdb exposes the same commands as pdb on top of the IPython debugger,
  #    adding conveniences such as tab completion and syntax highlighting.
  #    Insert ipdb.set_trace() anywhere in the code where you want to pause;
  #    when execution reaches that line the debugger starts, letting you
  #    examine the current state of the program.
  #    Useful commands during debugging:
  #      next     (runs the next line of code)
  #      step     (steps into a function called on the next line)
  #      continue (resumes execution until the next breakpoint)
  #    While debugging you can also inspect variables, set additional
  #    breakpoints, and evaluate Python expressions in the program's context.


In [74]:
config_display()

In [None]:
# config_debug()

Exception reporting mode: Verbose


# Experimenting with LLM calls

In [44]:
dict_modelos = {'gpt-3.5-turbo': {'max_tokens': 4096},  # $0.002 / 1K tokens
                'code-davinci-002': {'max_tokens': 8001}} # free

## ChatGPT (gpt-3.5-turbo model)

For gpt-3.5-turbo, we used as a reference the notebook from [openai: How_to_format_inputs_to_ChatGPT_models.ipynb](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_format_inputs_to_ChatGPT_models.ipynb)

### How to format inputs to ChatGPT models

ChatGPT is powered by `gpt-3.5-turbo`, OpenAI's most advanced model.

You can build your own applications with `gpt-3.5-turbo` using the OpenAI API.

Chat models take a series of messages as input, and return an AI-written message as output.

This guide illustrates the chat format with a few example API calls.

### Import the openai library

In [1]:
# if needed, install and/or upgrade to the latest version of the OpenAI Python library
%pip install --upgrade openai


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.27.2-py3-none-any.whl (70 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting frozenlist>=1.1.1
  Downloading frozenlist-1.3.3-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (158 kB)
Collecting multidict<7.0,>=4.5
  Downloading multidict-6.0.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB)

In [2]:
# import the OpenAI Python library for calling the OpenAI API
import openai


### An example chat API call

A chat API call has two required inputs:
- `model`: the name of the model you want to use (e.g., `gpt-3.5-turbo`)
- `messages`: a list of message objects, where each object has at least two fields:
    - `role`: the role of the messenger (either `system`, `user`, or `assistant`)
    - `content`: the content of the message (e.g., `Write me a beautiful poem`)

Typically, a conversation will start with a system message, followed by alternating user and assistant messages, but you are not required to follow this format.
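
The typical ordering described above can be sketched as a small helper that assembles the `messages` list: an optional system message, then alternating user and assistant turns, then the new user message. `build_messages` and its parameters are hypothetical names for illustration and are not part of the OpenAI library.

```python
from typing import Dict, List, Optional, Tuple


def build_messages(
    user_prompt: str,
    system_prompt: Optional[str] = None,
    history: Optional[List[Tuple[str, str]]] = None,
) -> List[Dict[str, str]]:
    """Returns a list of {"role", "content"} dicts in the typical chat order."""
    messages: List[Dict[str, str]] = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    # Each history item is one (user text, assistant text) exchange
    for user_text, assistant_text in history or []:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": user_prompt})
    return messages


messages = build_messages(
    "Marcus.",
    system_prompt="You are a helpful assistant.",
    history=[("Knock knock.", "Who's there?")],
)
for message in messages:
    print(message)
```

The resulting list can be passed as the `messages` argument of the chat call.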

Let's look at an example chat API call to see how the chat format works in practice.

In [6]:
import getpass

In [7]:
openai.api_key = getpass.getpass("Enter the OPENAI_API_KEY")

Enter the OPENAI_API_KEY··········


In [3]:
MODEL = "gpt-3.5-turbo"

In [8]:
response = openai.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Knock knock."},
        {"role": "assistant", "content": "Who's there?"},
        {"role": "user", "content": "Marcus."},
    ],
    temperature=0,
)

In [9]:
response


<OpenAIObject chat.completion id=chatcmpl-6vrjUfjDVqeIcZV49F7tHbcgFMiEx at 0x7ff7b51548b0> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Marcus who?",
        "role": "assistant"
      }
    }
  ],
  "created": 1679249264,
  "id": "chatcmpl-6vrjUfjDVqeIcZV49F7tHbcgFMiEx",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 4,
    "prompt_tokens": 38,
    "total_tokens": 42
  }
}

As you can see, the response object has a few fields:
- `id`: the ID of the request
- `object`: the type of object returned (e.g., `chat.completion`)
- `created`: the timestamp of the request
- `model`: the full name of the model used to generate the response
- `usage`: the number of tokens used to generate the replies, counting prompt, completion, and total
- `choices`: a list of completion objects (only one, unless you set `n` greater than 1)
    - `message`: the message object generated by the model, with `role` and `content`
    - `finish_reason`: the reason the model stopped generating text (either `stop`, or `length` if `max_tokens` limit was reached)
    - `index`: the index of the completion in the list of choices

Extract just the reply with:

In [10]:
response['choices'][0]['message']['content']

'Marcus who?'
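
When the reply feeds an evaluation loop, it can help to check `finish_reason` before trusting the text, since a reply cut off by `max_tokens` may be missing the label. The sketch below works on a plain dict that mirrors the response printed above; `extract_reply` is a hypothetical helper, and raising on truncation is one possible policy rather than a requirement of the API.

```python
from typing import Any, Dict


def extract_reply(response: Dict[str, Any], index: int = 0) -> str:
    """Returns the text of one choice, raising if generation was cut off."""
    choice = response["choices"][index]
    if choice["finish_reason"] == "length":
        raise RuntimeError("reply was truncated because max_tokens was reached")
    return choice["message"]["content"]


# Plain dict with the same structure as the response shown above
sample_response = {
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {"content": "Marcus who?", "role": "assistant"},
        }
    ],
    "usage": {"completion_tokens": 4, "prompt_tokens": 38, "total_tokens": 42},
}

print(extract_reply(sample_response))
```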

Even non-conversation-based tasks can fit into the chat format, by placing the instruction in the first user message.

For example, to ask the model to explain asynchronous programming in the style of the pirate Blackbeard, we can structure the conversation as follows:

In [None]:
# example with a system message
response = openai.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain asynchronous programming in the style of the pirate Blackbeard."},
    ],
    temperature=0,
)

print(response['choices'][0]['message']['content'])


Ahoy matey! Asynchronous programming be like havin' a crew o' pirates workin' on different tasks at the same time. Ye see, instead o' waitin' for one task to be completed before startin' the next, we can have multiple tasks runnin' at once. It be like havin' me crew hoistin' the sails while others be swabbin' the deck and loadin' the cannons. Each task be workin' independently, but they all be contributin' to the overall success o' the ship. And just like how me crew communicates with each other to make sure everything be runnin' smoothly, asynchronous programming uses callbacks and promises to coordinate the different tasks and make sure they all be finished in the right order. Arrr, it be a powerful tool for any programmer lookin' to optimize their code and make it run faster.


In [None]:
# example without a system message
response = openai.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": "Explain asynchronous programming in the style of the pirate Blackbeard."},
    ],
    temperature=0,
)

print(response['choices'][0]['message']['content'])




Ahoy mateys! Let me tell ye about asynchronous programming, arrr! 

Ye see, in the world of programming, sometimes we need to wait for things to happen before we can move on to the next task. But with asynchronous programming, we can keep working on other tasks while we wait for those things to happen. 

It's like when we're sailing the high seas and we need to wait for the wind to change direction. We don't just sit there twiddling our thumbs, do we? No, we keep busy with other tasks like repairing the ship or checking the maps. 

In programming, we use something called callbacks or promises to keep track of those things we're waiting for. And while we wait for those callbacks or promises to be fulfilled, we can keep working on other parts of our code. 

So, me hearties, asynchronous programming is like being a pirate on the high seas, always ready to tackle the next task while we wait for the winds to change. Arrr!


### Tips for instructing gpt-3.5-turbo-0301

Best practices for instructing models may change from model version to model version. The advice that follows applies to `gpt-3.5-turbo-0301` and may not apply to future models.

#### System messages

The system message can be used to prime the assistant with different personalities or behaviors.

However, the model does not generally pay as much attention to the system message, and therefore we recommend placing important instructions in the user message instead.

In [None]:
# An example of a system message that primes the assistant to explain concepts in great depth
response = openai.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a friendly and helpful teaching assistant. You explain concepts in great depth using simple terms, and you give examples to help people learn. At the end of each explanation, you ask a question to check for understanding"},
        {"role": "user", "content": "Can you explain how fractions work?"},
    ],
    temperature=0,
)

print(response["choices"][0]["message"]["content"])


Sure! Fractions are a way of representing a part of a whole. The top number of a fraction is called the numerator, and it represents how many parts of the whole we are talking about. The bottom number is called the denominator, and it represents how many equal parts the whole is divided into.

For example, if we have a pizza that is divided into 8 equal slices, and we take 3 slices, we can represent that as the fraction 3/8. The numerator is 3 because we took 3 slices, and the denominator is 8 because the pizza was divided into 8 slices.

To add or subtract fractions, we need to have a common denominator. This means that the denominators of the fractions need to be the same. To do this, we can find the least common multiple (LCM) of the denominators and then convert each fraction to an equivalent fraction with the LCM as the denominator.

To multiply fractions, we simply multiply the numerators together and the denominators together. To divide fractions, we multiply the first fraction 

In [None]:
# An example of a system message that primes the assistant to give brief, to-the-point answers
response = openai.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a laconic assistant. You reply with brief, to-the-point answers with no elaboration."},
        {"role": "user", "content": "Can you explain how fractions work?"},
    ],
    temperature=0,
)

print(response["choices"][0]["message"]["content"])


Fractions represent a part of a whole. They consist of a numerator (top number) and a denominator (bottom number) separated by a line. The numerator represents how many parts of the whole are being considered, while the denominator represents the total number of equal parts that make up the whole.


#### Few-shot prompting

In some cases, it's easier to show the model what you want rather than tell the model what you want.

One way to show the model what you want is with faked example messages.

For example:

In [None]:
# An example of a faked few-shot conversation to prime the model into translating business jargon to simpler speech
response = openai.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful, pattern-following assistant."},
        {"role": "user", "content": "Help me translate the following corporate jargon into plain English."},
        {"role": "assistant", "content": "Sure, I'd be happy to!"},
        {"role": "user", "content": "New synergies will help drive top-line growth."},
        {"role": "assistant", "content": "Things working well together will increase revenue."},
        {"role": "user", "content": "Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage."},
        {"role": "assistant", "content": "Let's talk later when we're less busy about how to do better."},
        {"role": "user", "content": "This late pivot means we don't have time to boil the ocean for the client deliverable."},
    ],
    temperature=0,
)

print(response["choices"][0]["message"]["content"])


We don't have enough time to complete everything perfectly for the client.


To help clarify that the example messages are not part of a real conversation, and shouldn't be referred back to by the model, you can instead set the `name` field of `system` messages to `example_user` and `example_assistant`.

Transforming the few-shot example above, we could write:

In [None]:
# The business jargon translation example, but with example names for the example messages
response = openai.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful, pattern-following assistant that translates corporate jargon into plain English."},
        {"role": "system", "name":"example_user", "content": "New synergies will help drive top-line growth."},
        {"role": "system", "name": "example_assistant", "content": "Things working well together will increase revenue."},
        {"role": "system", "name":"example_user", "content": "Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage."},
        {"role": "system", "name": "example_assistant", "content": "Let's talk later when we're less busy about how to do better."},
        {"role": "user", "content": "This late pivot means we don't have time to boil the ocean for the client deliverable."},
    ],
    temperature=0,
)

print(response["choices"][0]["message"]["content"])


This sudden change in plans means we don't have enough time to do everything for the client's project.


Not every attempt at engineering conversations will succeed at first.

If your first attempts fail, don't be afraid to experiment with different ways of priming or conditioning the model.

As an example, one developer discovered an increase in accuracy when they inserted a user message that said "Great job so far, these have been perfect" to help condition the model into providing higher quality responses.

For more ideas on how to lift the reliability of the models, consider reading our guide on [techniques to increase reliability](../techniques_to_improve_reliability.md). It was written for non-chat models, but many of its principles still apply.

## Counting tokens for OpenAI models

More details in [OpenAI: How_to_count_tokens_with_tiktoken.ipynb](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb)

When you submit your request, the API transforms the messages into a sequence of tokens.

The number of tokens used affects:
- the cost of the request
- the time it takes to generate the response
- when the reply gets cut off from hitting the maximum token limit (4096 for `gpt-3.5-turbo`)

As of Mar 01, 2023, you can use the following function to count the number of tokens that a list of messages will use.

In [12]:
%pip install --upgrade tiktoken

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tiktoken
  Downloading tiktoken-0.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.3.2


In [13]:
import tiktoken

In [14]:
def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
    """Returns the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    if model == "gpt-3.5-turbo-0301":  # note: future models may deviate from this
        num_tokens = 0
        for message in messages:
            num_tokens += 4  # every message follows <im_start>{role/name}\n{content}<im_end>\n
            for key, value in message.items():
                num_tokens += len(encoding.encode(value))
                if key == "name":  # if there's a name, the role is omitted
                    num_tokens += -1  # role is always required and always 1 token
        num_tokens += 2  # every reply is primed with <im_start>assistant
        return num_tokens
    else:
        raise NotImplementedError(f"""num_tokens_from_messages() is not presently implemented for model {model}.
See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.""")


In [15]:
messages = [
    {"role": "system", "content": "You are a helpful, pattern-following assistant that translates corporate jargon into plain English."},
    {"role": "system", "name":"example_user", "content": "New synergies will help drive top-line growth."},
    {"role": "system", "name": "example_assistant", "content": "Things working well together will increase revenue."},
    {"role": "system", "name":"example_user", "content": "Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage."},
    {"role": "system", "name": "example_assistant", "content": "Let's talk later when we're less busy about how to do better."},
    {"role": "user", "content": "This late pivot means we don't have time to boil the ocean for the client deliverable."},
]


In [16]:
# example token count from the function defined above
print(f"{num_tokens_from_messages(messages)} prompt tokens counted.")


126 prompt tokens counted.


In [17]:
# example token count from the OpenAI API
response = openai.ChatCompletion.create(
    model=MODEL,
    messages=messages,
    temperature=0,
)


In [18]:
response

<OpenAIObject chat.completion id=chatcmpl-6vrpIjOeqDsCjjiymvv9NEtaRvphY at 0x7ff7b4fc58b0> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "This sudden change in plans means we don't have enough time to do everything for the client's project.",
        "role": "assistant"
      }
    }
  ],
  "created": 1679249624,
  "id": "chatcmpl-6vrpIjOeqDsCjjiymvv9NEtaRvphY",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 22,
    "prompt_tokens": 126,
    "total_tokens": 148
  }
}

In [19]:
print(f'{response["usage"]["prompt_tokens"]} prompt tokens used.')

126 prompt tokens used.


In [50]:
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

In [51]:
encoding.encode("tiktoken is great!")

[83, 1609, 5963, 374, 2294, 0]

In [52]:
[encoding.decode_single_token_bytes(token) for token in [83, 1609, 5963, 374, 2294, 0]]

[b't', b'ik', b'token', b' is', b' great', b'!']

In [48]:
def num_tokens_from_string(string: str, model_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.encoding_for_model(model_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [49]:
num_tokens_from_string("tiktoken is great!", "gpt-3.5-turbo")

6

## code-davinci-002

To use code-davinci-002, we took the OpenAI cookbook notebook [Unit_test_writing_using_a_multi-step_prompt.ipynb](https://github.com/openai/openai-cookbook/blob/main/examples/Unit_test_writing_using_a_multi-step_prompt.ipynb) as a reference.

Tips for interacting with this model are available at https://platform.openai.com/docs/guides/code/best-practices

In [26]:
MODEL = "code-davinci-002"

In [27]:
prompt_teste = 'Write a function in python that calculates fibonacci'

In [30]:
max_tokens: int = 1000
temperature: float = 1.0

In [32]:
response = openai.Completion.create(
        model=MODEL,
        prompt=prompt_teste,
        stop=["\n\n", "\n\t\n", "\n    \n", "```"],
        max_tokens=max_tokens,
        temperature=temperature,
        stream=False)

In [33]:
response

<OpenAIObject text_completion id=cmpl-6vsLlR8hqWEuay0f6FIulqpWnBIrD at 0x7ff7ae8b2c20> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "text": " sequence. Like f(n) = f(n-1) + f(n-2) + ... + f(3) + f(2) + f(1). For example   f(6) = f(5) + f(4) = 21."
    }
  ],
  "created": 1679251637,
  "id": "cmpl-6vsLlR8hqWEuay0f6FIulqpWnBIrD",
  "model": "code-davinci-002",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 58,
    "prompt_tokens": 10,
    "total_tokens": 68
  }
}

In [39]:
prompt_teste = 'Meu nome é Marcus. Moro no Brasil. A capital do Brasil é: '

In [40]:
response = openai.Completion.create(
        model=MODEL,
        prompt=prompt_teste,
        # stop=["\n\n", "\n\t\n", "\n    \n", "```"],
        max_tokens=10,
        temperature=temperature,
        stream=False)

In [41]:
response

<OpenAIObject text_completion id=cmpl-6vsOFHPFFhUpS9RcKHeatc55S8Duh at 0x7ff7b51544f0> JSON: {
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "text": "Bras\u00edlia. Meu e-mail \u00e9"
    }
  ],
  "created": 1679251791,
  "id": "cmpl-6vsOFHPFFhUpS9RcKHeatc55S8Duh",
  "model": "code-davinci-002",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 10,
    "prompt_tokens": 20,
    "total_tokens": 30
  }
}
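The two responses above differ in `finish_reason`: the first ended with `"stop"`, while the second reached `max_tokens` (`"length"`) and kept generating past the answer. Below is a minimal sketch of how the answer could be pulled out for a classification-style task, assuming the response has the dictionary shape printed above. `first_answer` is a hypothetical helper name, not part of the `openai` library.

```python
from typing import Tuple


def first_answer(response: dict) -> Tuple[str, bool]:
    """Return (answer, truncated) from a completion-style response.

    Assumes the shape printed above: response["choices"][0] holds
    "text" and "finish_reason". Hypothetical helper, not an openai API.
    """
    choice = response["choices"][0]
    text = choice["text"].strip()
    # Keep only what comes before the first newline and the first period.
    for separator in ("\n", "."):
        text = text.split(separator)[0]
    # "length" means generation was cut off by max_tokens.
    truncated = choice["finish_reason"] == "length"
    return text.strip(), truncated


# Sample copied from the second response shown above.
sample = {
    "choices": [
        {"finish_reason": "length", "index": 0, "logprobs": None,
         "text": "Bras\u00edlia. Meu e-mail \u00e9"}
    ]
}
print(first_answer(sample))  # ('Brasília', True)
```

Since the response object is indexed with string keys elsewhere in this notebook (`response["usage"]["prompt_tokens"]`), the same helper should work on it directly. Cutting at the first period is only reasonable for short, single-label answers.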

## LLAMA 

[Colab demo of the LLAMA API](https://colab.research.google.com/drive/1zZ-ch29LTicNPA62t2MaOwMROywnqUxf?usp=sharing) (thanks, Thales Rogério)

In [53]:
import requests

In [54]:
base_url="http://143.106.167.108/api"

In [56]:
data={
	"prompt":"""Given table, specify which rows have repeated values for both "Item number" and "Local". If no row is repeated say "no repeats".

Example 1:
|Row | Item number | Local |
|1 |  3 5 7 | New York |
|2|  5 8 2 | Madagascar |
|3|  3 4 5 | New York |
|4|  3 4 5 | Paris |

Explanation: Rows 1 and 3 have the same local "New York" and the same item number "3 4 5". Therefore they are repeated.

Answer: (1,3).

Example 2:
|Row | Item number | Local |
|1 |  0 9 2 4 | Amsterdam |
|2|  9 4 2 4 | Barcelona |
|3|  7 3 2 | London |
|4|  7 3 1 | London |
|5|  7 3 2 | London |
|6|  7 3 2 | London |
|7|  7 3 2 | London |
|8|  7 3  2 |  New York |
|9 |  0 9 2 4 | Amsterdam |

Explanation:""",

"temperature": 0.0,
"top_p": 1,
"max_length": 250
}

r=requests.post(f"{base_url}/complete", json=data)

In [57]:
if r.ok:
  response=r.json()
  print(response)

{'prompt': 'Given table, specify which rows have repeated values for both "Item number" and "Local". If no row is repeated say "no repeats".\n\nExample 1:\n|Row | Item number | Local |\n|1 |  3 5 7 | New York |\n|2|  5 8 2 | Madagascar |\n|3|  3 4 5 | New York |\n|4|  3 4 5 | Paris |\n\nExplanation: Rows 1 and 3 have the same local "New York" and the same item number "3 4 5". Therefore they are repeated.\n\nAnswer: (1,3).\n\nExample 2:\n|Row | Item number | Local |\n|1 |  0 9 2 4 | Amsterdam |\n|2|  9 4 2 4 | Barcelona |\n|3|  7 3 2 | London |\n|4|  7 3 1 | London |\n|5|  7 3 2 | London |\n|6|  7 3 2 | London |\n|7|  7 3 2 | London |\n|8|  7 3  2 |  New York |\n|9 |  0 9 2 4 | Amsterdam |\n\nExplanation:', 'temperature': 0.0, 'top_p': 1.0, 'max_length': 250, 'stopping_tokens': [], 'request_uuid': '6b526e94-986c-4519-adc5-2519f7e6705a'}


In [59]:
response

{'prompt': 'Given table, specify which rows have repeated values for both "Item number" and "Local". If no row is repeated say "no repeats".\n\nExample 1:\n|Row | Item number | Local |\n|1 |  3 5 7 | New York |\n|2|  5 8 2 | Madagascar |\n|3|  3 4 5 | New York |\n|4|  3 4 5 | Paris |\n\nExplanation: Rows 1 and 3 have the same local "New York" and the same item number "3 4 5". Therefore they are repeated.\n\nAnswer: (1,3).\n\nExample 2:\n|Row | Item number | Local |\n|1 |  0 9 2 4 | Amsterdam |\n|2|  9 4 2 4 | Barcelona |\n|3|  7 3 2 | London |\n|4|  7 3 1 | London |\n|5|  7 3 2 | London |\n|6|  7 3 2 | London |\n|7|  7 3 2 | London |\n|8|  7 3  2 |  New York |\n|9 |  0 9 2 4 | Amsterdam |\n\nExplanation:',
 'temperature': 0.0,
 'top_p': 1.0,
 'max_length': 250,
 'stopping_tokens': [],
 'request_uuid': '6b526e94-986c-4519-adc5-2519f7e6705a'}

We will use the request_uuid to check whether the completion job is done.

In [62]:
request_uuid=response["request_uuid"]

In [63]:
import time

In [64]:
ready = False
while not ready:
    r = requests.get(f"{base_url}/get_result/{request_uuid}")
    response = r.json()
    ready = response['ready']
    if ready:
        print(response['generated_text'])
        break
    # Wait 10 seconds before checking again
    time.sleep(10)

 Rows 3-7 all have the same local "London", but their item numbers differ. Therefore there are no repeats in this example.


When checking the result, you may find three scenarios:
- Your job has not run yet; try again in a few seconds (Ready=False, message=None)
- Your job ran and everything worked (Ready=True, message=your response)
- Your job ran but failed (Ready=True, message=None)
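The three scenarios can be told apart with a small function. This is a sketch only: it assumes the payload uses the `ready` and `generated_text` keys read by the polling loop above, and that a failed job reports `generated_text` as `None` (the list above calls this field "message"). `interpret_result` is a hypothetical name.

```python
from typing import Optional, Tuple


def interpret_result(payload: dict) -> Tuple[str, Optional[str]]:
    """Map a get_result payload to "pending", "done" or "failed".

    Assumption: keys "ready" and "generated_text", with
    generated_text set to None when the job failed.
    """
    if not payload.get("ready"):
        return "pending", None  # job has not run yet, poll again later
    text = payload.get("generated_text")
    if text is None:
        return "failed", None  # job ran but produced nothing
    return "done", text  # job ran and returned a completion


print(interpret_result({"ready": False, "generated_text": None}))
print(interpret_result({"ready": True, "generated_text": " no repeats"}))
print(interpret_result({"ready": True, "generated_text": None}))
```

Handling the failed case explicitly avoids printing `None` as if it were a model answer.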

# Rate limiting

We may adjust this during the week, but due to computational constraints we will apply a rate limit of about 2 requests per 5 seconds. If you exceed this limit, you will receive a 429 error. You should adjust your code accordingly.

Please remember that the whole class is using a shared resource, so avoid excessive requests even if they are under the rate limit.

If you encounter any errors or problems, let us know in the classroom.
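One way to adjust client code to the limit is to wait and retry whenever a 429 comes back, doubling the wait each time. The sketch below is an assumption about how this could be done, not the course's prescribed method. `call_with_backoff` and its parameters are hypothetical names, and the 2.5-second starting delay is simply 5 seconds divided by 2 requests. The request itself is passed in as a function so the example runs without network access.

```python
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, dict]],
    max_attempts: int = 5,
    base_delay: float = 2.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call send() until it returns a status other than 429.

    send is any zero-argument function returning (status_code, payload).
    The delay doubles after each 429: base_delay, 2 * base_delay, ...
    """
    for attempt in range(max_attempts):
        status, payload = send()
        if status != 429:
            return payload
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")


# Offline demonstration with a fake sender: two 429s, then success.
statuses = iter([429, 429, 200])
waits = []
result = call_with_backoff(
    lambda: (next(statuses), {"ready": True}),
    sleep=waits.append,  # record the delays instead of actually sleeping
)
print(result, waits)  # {'ready': True} [2.5, 5.0]
```

With `requests`, `send` could wrap `requests.get(f"{base_url}/get_result/{request_uuid}")` and return `(r.status_code, r.json() if r.ok else {})`.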

In [65]:
for i in range(30):
  r=requests.get(f"{base_url}/get_result/{request_uuid}")
  print(i, "->", r.status_code)

0 -> 200
1 -> 200
2 -> 200
3 -> 429
4 -> 429
5 -> 429
6 -> 429
7 -> 429
8 -> 429
9 -> 429
10 -> 429
11 -> 429
12 -> 200
13 -> 429
14 -> 429
15 -> 429
16 -> 429
17 -> 429
18 -> 429
19 -> 429
20 -> 429
21 -> 429
22 -> 429
23 -> 200
24 -> 429
25 -> 429
26 -> 429
27 -> 429
28 -> 429
29 -> 429
