# Lesson 6 - Doc2Query

[Unicamp - IA368DD: Deep Learning applied to search systems.](https://www.cpg.feec.unicamp.br/cpg/lista/caderno_horario_show.php?id=1779)

Author: Marcus Vinícius Borela de Castro

[GitHub repository](https://github.com/marcusborela/deep_learning_em_buscas_unicamp)

Stage: expanding texts with queries generated by a fine-tuned t5-base doc2query model

# Setting up the environment

In [1]:
import os

In [2]:
DIRETORIO_TRABALHO = '/home/borela/fontes/deep_learning_em_buscas_unicamp/local/doc2query'


In [3]:
assert os.path.exists(DIRETORIO_TRABALHO), f"Path {DIRETORIO_TRABALHO} does not exist!"

In [4]:
from psutil import virtual_memory

In [5]:
def mostra_memoria(lista_mem=('cpu',)):
  """
  Displays CPU and/or GPU memory information, depending on the given parameter.

  Parameters:
  -----------
  lista_mem : list or tuple, optional
      Sequence with the strings 'cpu' and/or 'gpu'.
      'cpu' - displays CPU memory information.
      'gpu' - displays GPU memory information (if available).
      The default value is ('cpu',).

  Output:
  -------
  The function returns nothing; it only prints the information.

  Usage example:
  --------------
  To display CPU memory information:
      mostra_memoria(['cpu'])

  To display CPU and GPU memory information:
      mostra_memoria(['cpu', 'gpu'])

  Author: Marcus Vinícius Borela de Castro

  """
  if 'cpu' in lista_mem:
    # Take a single snapshot so that all values are consistent with each other
    vm = virtual_memory()
    ram = {}
    ram['total'] = round(vm.total / 1e9, 2)
    ram['available'] = round(vm.available / 1e9, 2)
    ram['used'] = round(vm.used / 1e9, 2)
    ram['free'] = round(vm.free / 1e9, 2)
    # buffers and cached are platform-specific fields (available on Linux)
    ram['buffers'] = round(vm.buffers / 1e9, 2)
    ram['cached'] = round(vm.cached / 1e9, 2)
    print(f"Your runtime RAM in gb: \n total {ram['total']}\n available {ram['available']}\n used {ram['used']}\n free {ram['free']}\n cached {ram['cached']}\n buffers {ram['buffers']}")
  if 'gpu' in lista_mem:
    print('\nGPU')
    # IPython shell escape: only works inside a notebook/IPython session
    gpu_info = !nvidia-smi
    gpu_info = '\n'.join(gpu_info)
    if gpu_info.find('failed') >= 0:
      print('Not connected to a GPU')
    else:
      print(gpu_info)


In [6]:
mostra_memoria(['cpu','gpu'])

Your runtime RAM in gb: 
 total 67.35
 available 54.02
 used 12.15
 free 8.6
 cached 44.59
 buffers 2.01

GPU
Mon Apr 10 19:46:46 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.39.01    Driver Version: 510.39.01    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:02:00.0 Off |                  N/A |
| 85%   71C    P2   167W / 370W |  15033MiB / 24576MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                         

## Fixing the seeds

In [7]:
import random
import torch
import numpy as np

In [8]:
def inicializa_seed(num_semente:int=123):
  """
  Initializes the seeds to make the model's results reproducible.
  This is a recommended practice, since random number generation can influence the model's results.
  The function also configures cuDNN so that runs with GPU acceleration are reproducible.
  
  Args:
      num_semente (int): seed value used to initialize the libraries' random number generators.
  
  References:
      http://nlp.seas.harvard.edu/2018/04/03/attention.html
      https://github.com/CyberZHG/torch-multi-head-attention/blob/master/torch_multi_head_attention/multi_head_attention.py#L15
  """
  # Set the seeds of the random, numpy and pytorch libraries
  random.seed(num_semente)
  np.random.seed(num_semente)
  # torch.manual_seed seeds the random number generators on all devices, including CUDA
  torch.manual_seed(num_semente)
  
  # Make cuDNN use deterministic algorithms and disable its auto-tuner
  torch.backends.cudnn.deterministic = True
  torch.backends.cudnn.benchmark = False


In [9]:
num_semente=123
inicializa_seed(num_semente)
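
To see why fixing the seed matters, here is a minimal sketch that uses only Python's `random` module (an assumption made to keep the example small; the function above also seeds NumPy and PyTorch). `draw_numbers` is a hypothetical helper, not part of the notebook.

```python
import random

def draw_numbers(seed: int, n: int = 3) -> list:
    """Seed Python's RNG and draw n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw_numbers(123)
second = draw_numbers(123)
other = draw_numbers(124)

print(first == second)  # same seed, same sequence
print(first == other)   # a different seed gives a different sequence
```

The same idea applies to the sampling cells later in this notebook: without re-seeding the generator that is actually used for sampling, each run yields different sentences.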

## Preparing for debug and display

In [10]:
import pandas as pd

In [11]:
#!pip install transformers -q

In [12]:
import transformers

In [13]:
# https://zohaib.me/debugging-in-google-collab-notebook/
# !pip install -Uqq ipdb
import ipdb
# %pdb off # disables the debugger on exceptions
# %pdb on  # enables the debugger on exceptions
# ipdb.set_trace(context=8)  # pauses execution at this point

In [14]:
def config_display():
  """
  Configures the Pandas display options.
  """

  # Pandas output format
  # maximum number of columns displayed (None = no limit)
  pd.options.display.max_columns = None

  # maximum width of a line, in characters
  pd.options.display.width = 1000

  # maximum number of rows displayed
  pd.options.display.max_rows = 100

  # maximum number of characters per column
  pd.options.display.max_colwidth = 50

  # whether to display the number of rows and columns of a DataFrame
  pd.options.display.show_dimensions = True

  # number of digits after the decimal point displayed for floats
  pd.options.display.precision = 7


In [15]:
def config_debug():
  """
  Configures the debug options of PyTorch and CUDA. The settings for the
  transformers package and for IPython's %xmode are left commented out.
  """

  # Print tensors in scientific notation
  torch.set_printoptions(sci_mode=True) 
  """
    With sci_mode=True, floating-point tensor values are printed in scientific notation.
    For example, instead of printing the number 0.0000012345 as 0.0000,
    it is printed as 1.2345e-06. This is useful when the tensors involved in the
    operations hold very large or very small values, since scientific notation
    keeps their magnitude visible.
  """

  # Enable anomaly detection in PyTorch autograd
  torch.autograd.set_detect_anomaly(True)
  """
    Helps to track down numerical-stability problems, such as exploding gradients.
    With this option enabled, the forward pass records a traceback for each operation
    and any backward computation that produces NaN raises an exception that points to
    the forward operation responsible, so the error can be fixed at its origin.

    Note that anomaly detection can have a significant impact on performance,
    especially for large and complex models. For that reason,
    it should be used with care and only for debugging.
  """

  # Set the environment variable that makes CUDA kernel launches synchronous (blocking).
  os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
  """
    Each CUDA call finishes before the next one is issued.
    This is useful to debug code that runs on the GPU, because the error is reported
    at the point where it occurs, and not after a sequence of operations that makes its origin harder to find.
    The variable is read when CUDA is initialized, so it should be set before the first CUDA call.
    Keep in mind that this mode is significantly slower than asynchronous execution,
    which is the default CUDA behavior. It is therefore recommended only for debugging
    and should be removed once the problem is solved.
  """

  # Set the verbosity level of the transformers package to info
  # transformers.utils.logging.set_verbosity_info() 
  
  
  """
    Sets the level of detail of the log messages produced by the Hugging Face Transformers library
    to info. At this level the library prints informative messages about
    the progress of the run.

    This information can help to understand what is happening during execution
    and to debug it. In some cases the volume of messages can be very large,
    which makes the relevant information harder to find.
    Adjust the level of detail according to the needs of each task.
  
    To reduce the number of messages, keep the line above commented out and
      use one of the two lines below to set the verbosity to error or warning
  
    transformers.utils.logging.set_verbosity_error()
    transformers.utils.logging.set_verbosity_warning()
  """


  # Set the verbose mode of xmode, which is used when debugging
  # %xmode Verbose 

  """
    IPython/Jupyter magic command that controls how exception information is displayed.
    Verbose mode prints additional details with each exception: the full call stack
    and the values of the local and global variables at the moment of the exception.
    This can help to debug and find the cause of exceptions in your code.

    To disable verbose mode and use plain mode,
    keep the line above commented out and use the line below:
    %xmode Plain
  """

  """
    Tips:
    1.  pdb (Python Debugger)
      When an exception occurs, the program stops and displays an error message
      with information about the exception, such as the line where the error occurred and the exception type.

      If you want to examine the state of the variables and run other operations
      at the moment the exception occurred, you can use pdb (Python Debugger). To do so, run the %debug magic
      right after the exception occurs. This opens the debugger in post-mortem mode at the line that raised the exception,
      letting you explore the state of the variables, examine the call stack and run other operations to debug the code.


    2. ipdb
      ipdb exposes the same commands as pdb but uses the IPython debugger,
      which adds conveniences such as tab completion and syntax highlighting.
      
      You can start debugging by inserting ipdb.set_trace() at any point of
      your code where you want to pause execution. When execution reaches that line,
      the debugger takes over, letting you examine the current state of your program and run
      commands to investigate its behavior.

      During debugging, you can use the commands:
        next (to execute the next line of code),
        step (to step into a function called on the next line of code),
        continue (to resume normal execution until the next breakpoint).

      As in pdb, you can also inspect variables, set additional breakpoints
      and evaluate Python expressions in the context of your program.
  """


In [16]:
config_display()

In [17]:
config_debug()

## Imports

In [18]:
from transformers import pipeline

In [19]:
from transformers import T5Tokenizer, T5ForConditionalGeneration


# Experiments

## Testing question generation with the trained model

Tips at https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/02_how_to_generate.ipynb#scrollTo=AZ6xs-KLi9jT 

In [19]:
tokenizer = T5Tokenizer.from_pretrained("t5-base")

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [107]:
model = T5ForConditionalGeneration.from_pretrained(f'{DIRETORIO_TRABALHO}/base-checkpoint-2340')

In [22]:

# Note: the sentences are concatenated without a separating space, as the printed text below shows
text = "Python is an interpreted, high-level and general-purpose programming language."
text += "Python's design philosophy emphasizes code readability with its notable use of significant whitespace."
text += "Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."


input_ids = tokenizer.encode(text, max_length=320, truncation=True, return_tensors='pt')
outputs = model.generate(
    input_ids=input_ids,
    max_length=64,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=5)

print(f"Text: {text}")


Text: Python is an interpreted, high-level and general-purpose programming language.Python's design philosophy emphasizes code readability with its notable use of significant whitespace.Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.


In [23]:
print("\nGenerated Queries:")
for i in range(len(outputs)):
    query = tokenizer.decode(outputs[i], skip_special_tokens=True)
    print(f'{i + 1}: {query}')


Generated Queries:
1: what is python design philosophy
2: what is python design philosophy
3: what is python design philosophy
4: what is python design philosophy
5: what is python design philosophy
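
All five samples above are identical, so appending every one of them to the document would only repeat the same terms. The sketch below assumes that the expansion step appends the generated queries to the end of the text (that step is not shown in this section) and keeps each distinct query once. `expand_document` is a hypothetical helper introduced for illustration.

```python
def expand_document(text, queries):
    """Append each distinct query once to the end of the document text."""
    seen = set()
    unique = []
    for query in queries:
        cleaned = query.strip()
        key = cleaned.lower()  # compare case-insensitively
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    if not unique:
        return text
    return text + " " + " ".join(unique)

doc = "Python is an interpreted, high-level and general-purpose programming language."
queries = ["what is python design philosophy"] * 5
expanded = expand_document(doc, queries)
print(expanded)
```

Whether duplicates should be dropped or kept (to weight repeated terms more heavily) is a design choice; this sketch only shows the deduplicating variant.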


## Trying the text2text-generation pipeline

In [25]:
pipe = pipeline("text2text-generation")

No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [26]:
pipe(text)

[{'generated_text': 'Python is a programming language that is interpreted, high-level and general-purpose'}]

In [39]:
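print("\nGenerated Queries:")
# do_sample is not enabled here, so decoding is deterministic: the sampling parameters
# (temperature, top_p, top_k) have no effect and all five calls return the same text.
for i in range(5):
    saida = pipe(text, return_tensors=False, temperature=0, top_p=0.5, top_k=50)[0]['generated_text']
    print(f'{i + 1}: {saida}')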


Generated Queries:
1: Python is a programming language that is interpreted, high-level and general-purpose
2: Python is a programming language that is interpreted, high-level and general-purpose
3: Python is a programming language that is interpreted, high-level and general-purpose
4: Python is a programming language that is interpreted, high-level and general-purpose
5: Python is a programming language that is interpreted, high-level and general-purpose


### num_beams

In [48]:
# Generate 5 sentences
num_sentences = 5
for i in range(num_sentences):
    sentence = pipe(text, do_sample=True, num_beams=5, early_stopping=False)[0]['generated_text']
    print("Sentença " + str(i + 1) + ": " + sentence)

Sentença 1: Python is a high-level and general-purpose programming language.
Sentença 2: Python is a high-level and general-purpose programming language. written in Python
Sentença 3: Python is a high-level, high-level and general-purpose programming language.
Sentença 4: Python is a high-level and general-purpose programming language.
Sentença 5: Python is a high-level, high-level and general-purpose programming language.


In [50]:
num_sentences = 5
for i in range(num_sentences):
    sentence = pipe(text, do_sample=True, num_beams=5, early_stopping=False, no_repeat_ngram_size=2)[0]['generated_text']
    print("Sentença " + str(i + 1) + ": " + sentence)

Sentença 1: Python is an interpreted, high-level and general-purpose programming language
Sentença 2: Python is a high-level and general-purpose programming language. written in Python
Sentença 3: Python is a high-level and general-purpose programming language.
Sentença 4: Python is a high-level and general-purpose programming language.
Sentença 5: Python is a programming language that aims to help programmers write clear, logical


In [56]:

sentences = pipe(text, do_sample=True, num_beams=6, early_stopping=False, num_return_sequences=5, no_repeat_ngram_size=2)


In [59]:
sentences

[{'generated_text': 'Python is an interpreted, high-level and general-purpose programming language'},
 {'generated_text': 'Python is an interpretive, high-level and general-purpose programming language'},
 {'generated_text': 'Python is an interpreted, high-level and general-purpose programming language. '},
 {'generated_text': 'Python is an interpretable, high-level and general-purpose programming language'},
 {'generated_text': 'Python is a high-level and general-purpose programming language. written in Python'}]

In [61]:
for i, sentence in enumerate([x['generated_text'] for x in sentences]):
    print("Sentença " + str(i + 1) + ": " + sentence)

Sentença 1: Python is an interpreted, high-level and general-purpose programming language
Sentença 2: Python is an interpretive, high-level and general-purpose programming language
Sentença 3: Python is an interpreted, high-level and general-purpose programming language. 
Sentença 4: Python is an interpretable, high-level and general-purpose programming language
Sentença 5: Python is a high-level and general-purpose programming language. written in Python
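
To make the role of `num_beams` concrete, here is a toy beam search over a hand-written next-token table. The table `NEXT` and its probabilities are invented for illustration, and the sketch is deterministic beam search only; the cells above also pass `do_sample=True`, which adds sampling on top of the beams.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token
NEXT = {
    "<s>": {"python": 0.6, "it": 0.4},
    "python": {"is": 0.5, "was": 0.5},
    "it": {"is": 0.9, "was": 0.1},
    "is": {"</s>": 1.0},
    "was": {"</s>": 1.0},
}

def beam_search(num_beams, max_steps=3):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens)
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":  # finished sequences are carried over
                candidates.append((score, tokens))
                continue
            for token, prob in NEXT[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
    return beams

greedy_score, greedy_tokens = beam_search(num_beams=1)[0]
beam_score, beam_tokens = beam_search(num_beams=2)[0]
print(" ".join(greedy_tokens), round(math.exp(greedy_score), 2))
print(" ".join(beam_tokens), round(math.exp(beam_score), 2))
```

With one beam the search commits to "python" because it is the most likely first token, while with two beams it finds "it is", whose overall probability is higher.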


### do_sample

In [63]:

sentences = pipe(text, do_sample=True, top_k=0, num_return_sequences=5, no_repeat_ngram_size=2)
for i, sentence in enumerate([x['generated_text'] for x in sentences]):
    print("Sentença " + str(i + 1) + ": " + sentence)

Sentença 1: , a purported language for general purpose language programming.. used for basic
Sentença 2: .Information-oriented programming is crucial to this new concept. a language with
Sentença 3: Python is a development language aimed at designing people's worklife and computing strategies
Sentença 4: written in high-level programming languages. it is interpreted, high level programming language
Sentença 5: and widely used programming language, primarily for advanced advanced projects. A logical studio


In [64]:
sentences = pipe(text, do_sample=True, top_k=0, temperature=0.7, num_return_sequences=5, no_repeat_ngram_size=2)
for i, sentence in enumerate([x['generated_text'] for x in sentences]):
    print("Sentença " + str(i + 1) + ": " + sentence)

Sentença 1: Python is a programming language for large-scale projects. syntax has
Sentença 2: it is primarily a low-level and general-purpose programming language. written
Sentença 3: a global language that is widely used used for low level tasks. a powerful
Sentença 4: Python was developed in the 1990s, and is still used today. language.
Sentença 5: The Python language is a highly interpreted, high-level and general-purpose programming


In [65]:
sentences = pipe(text, do_sample=True, top_k=0, temperature=0.7, num_return_sequences=5)
for i, sentence in enumerate([x['generated_text'] for x in sentences]):
    print("Sentença " + str(i + 1) + ": " + sentence)

Sentença 1: The language is well-designed, high-level and general-purpose..
Sentença 2: is a programming language that is interpreted, high-level and general-purpose.
Sentença 3: Python is an interpretable, high-level and general-purpose programming language. designed
Sentença 4: Its intended to be interpreted, high-level and general-purpose programming language.
Sentença 5: the accepted language for high-level and general-purpose programming. has been around since
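
The effect of `temperature=0.7` in the cells above can be sketched with a plain softmax: the logits are divided by the temperature before normalization, so values below 1 sharpen the distribution. The logits here are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
p_default = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.7)
print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_sharp])
```

A lower temperature moves probability mass toward the most likely token, which is consistent with the more coherent sentences obtained above once `temperature=0.7` was added.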


### Top-K Sampling

In [66]:
sentences = pipe(text, top_k=50, do_sample=True,  num_return_sequences=5)
for i, sentence in enumerate([x['generated_text'] for x in sentences]):
    print("Sentença " + str(i + 1) + ": " + sentence)

Sentença 1: ., high-level and general-purpose programming language.a simple programming language
Sentença 2: is a high-level and general purpose programming language the programming language used to create
Sentença 3: A. A. a strong design philosophy and general-purpose programming language.
Sentença 4: Python (pronounced 'py') is a programming language 
Sentença 5: Python is an interpreted programming language and general-purpose language. a programming language


In [68]:
sentences = pipe(text, top_k=50, do_sample=True,  num_return_sequences=5)
for i, sentence in enumerate([x['generated_text'] for x in sentences]):
    print("Sentença " + str(i + 1) + ": " + sentence)

Sentença 1: . language aimed at programming. is a programming language intended for small
Sentença 2: Python is a written, high-level, and general-purpose programming
Sentença 3: py is a programming language that works for various applications. is an interpreted
Sentença 4: This program is designed especially for user and business developers. application programming language is 
Sentença 5: The Python language is interpreted, high-level and general-purpose programming language language


In [69]:
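# Note: random.seed only seeds Python's `random` module. Sampling in generate() uses
# PyTorch's RNG, so torch.manual_seed(1) would be needed to make this cell reproducible.
random.seed(1)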
sentences = pipe(text, top_k=50, do_sample=True,  num_return_sequences=5)
for i, sentence in enumerate([x['generated_text'] for x in sentences]):
    print("Sentença " + str(i + 1) + ": " + sentence)

Sentença 1: a distributed language for general-purpose applications. is an interpreted, high-level
Sentença 2: is a broader programming language, primarily used for higher-level and larger projects
Sentença 3: , written and distributed with minimal documentation. a high-level and general purpose programming
Sentença 4: it is not just a functional language.. in general use.,
Sentença 5: is a general-purpose programming language. an interpretive programming language that is extremely


In [74]:
sentences = pipe(text, top_k=50, do_sample=True,  num_return_sequences=5)
for i, sentence in enumerate([x['generated_text'] for x in sentences]):
    print("Sentença " + str(i + 1) + ": " + sentence)

Sentença 1: a programming language developed by Apple Inc., an American, based in London,
Sentença 2: Python (plugins), a high level programable language, is 
Sentença 3: Python is a high-level, general-purpose programming language. a high
Sentença 4: Python is a highly-interpreted, high-level programming language. a common
Sentença 5: , high-level and general-purpose programming language. an interpretive, high-


In [75]:
sentences = pipe(text, top_k=50, do_sample=True,  num_return_sequences=5, no_repeat_ngram_size=2)
for i, sentence in enumerate([x['generated_text'] for x in sentences]):
    print("Sentença " + str(i + 1) + ": " + sentence)

Sentença 1: written and used using Python to write complex and logical programs.. intended to
Sentença 2: Python is an interpretable, high-level and general-purpose programming language. widely
Sentença 3: In which it is intended to build and interpret programs, Python is the preferred programming language for
Sentença 4: It's design philosophy emphasizes code readability with its notable use of significant whitespace
Sentença 5: . a specialized programming language and object-oriented programming style.,
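
As a rough picture of what `top_k=50` does, the sketch below keeps only the k most probable tokens and renormalizes them before sampling. The toy distribution is invented, and `top_k_filter` is a hypothetical helper rather than the library's implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}

probs = {"language": 0.5, "tool": 0.3, "snake": 0.15, "studio": 0.05}
filtered = top_k_filter(probs, 2)
print(filtered)
```

Tokens outside the top k can never be sampled, which removes the long tail of unlikely words seen in the `top_k=0` outputs of the previous section.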


### Top-p (nucleus) sampling

In [76]:
sentences = pipe(text, top_k=0, top_p=0.92, do_sample=True,  num_return_sequences=5, no_repeat_ngram_size=2)
for i, sentence in enumerate([x['generated_text'] for x in sentences]):
    print("Sentença " + str(i + 1) + ": " + sentence)

Sentença 1: Python is designed to run Python code and is an editor that works with Python.
Sentença 2: By default, Python is an interpreted, high-level and general-purpose programming
Sentença 3: Written by C++++, employed by professional developers in many fields and fields.
Sentença 4: Python. A programming language for general-purpose and non-interpretable tasks
Sentença 5: PLY is an interpretive, high-level and general-purpose programming language.
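
Nucleus sampling, as used with `top_p=0.92` above, keeps the smallest set of most probable tokens whose cumulative probability reaches `top_p`. A minimal sketch with an invented distribution follows; `top_p_filter` is a hypothetical helper.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

probs = {"language": 0.5, "tool": 0.3, "snake": 0.15, "studio": 0.05}
nucleus = top_p_filter(probs, 0.92)
print(nucleus)
```

Unlike top-k, the number of kept tokens adapts to the shape of the distribution: a peaked distribution keeps few tokens, a flat one keeps many.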


In [77]:
sentences = pipe(text, top_k=50, top_p=0.92, do_sample=True,  num_return_sequences=5, no_repeat_ngram_size=2)
for i, sentence in enumerate([x['generated_text'] for x in sentences]):
    print("Sentença " + str(i + 1) + ": " + sentence)

Sentença 1: it's design philosophy emphasizes code readability with its notable use of significant whitespace
Sentença 2: Its design philosophy emphasizes code readability with its notable use of significant whitespace.
Sentença 3: is a high-level programming language. general-purpose, high level and
Sentença 4: Python is a high-level programming language . is generally used in high level
Sentença 5: a high-level and general-purpose programming language. a general purpose Python language


In [81]:
sentences = pipe(text, top_k=50, top_p=0.92, do_sample=True, min_length=20, num_return_sequences=5, repetition_penalty=1.2)
for i, sentence in enumerate([x['generated_text'] for x in sentences]):
    print("Sentença " + str(i + 1) + ": " + sentence)

Sentença 1: is a high-level programming language.'s design philosophy emphasizes code read
Sentença 2: .Python is a programming language for computer systems, machines and general-
Sentença 3: programming software developed primarily for non-technical uses. language that includes many
Sentença 4: Python is a programming language intended to help programmers write clear, 
Sentença 5: , the most widely used programming language in the world. for all kinds of programming applications


In [83]:
sentences = pipe(text, top_k=0, top_p=0.95, do_sample=True,  num_return_sequences=5, no_repeat_ngram_size=2)
for i, sentence in enumerate([x['generated_text'] for x in sentences]):
    print("Sentença " + str(i + 1) + ": " + sentence)

Sentença 1: Python is a programming language that has been compiled to fast, robust
Sentença 2: The Python language is named after Gary Perry and his wife, Ashley. a widely
Sentença 3: it is an interpretable language intended for high-level programminggeneral-purpose programming
Sentença 4: includes several modules. written in high-level, high encoding
Sentença 5: Python is an interpreted programming language. usually aimed at beginners. a


In [84]:
sentences = pipe(text, top_k=0, top_p=0.95, do_sample=True,  num_return_sequences=5, repetition_penalty=1.2)
for i, sentence in enumerate([x['generated_text'] for x in sentences]):
    print("Sentença " + str(i + 1) + ": " + sentence)

Sentença 1: is a programming language that aims to help programmers write clear and logical code
Sentença 2: py and a general-purpose programming language.4.python
Sentença 3: it is considered an languages library. a programming language. and intended for general
Sentença 4: .Python is a programming language for the general user and intermediate level development
Sentença 5: available for unix,  primarily used for high-level programmers general


In [85]:
sentences = pipe(text, top_k=50, top_p=0.95, do_sample=True,  num_return_sequences=5, repetition_penalty=1.2)
for i, sentence in enumerate([x['generated_text'] for x in sentences]):
    print("Sentença " + str(i + 1) + ": " + sentence)

Sentença 1: By default code is a text buffer., high-level and general purpose
Sentença 2: The Python language aims to improve your project code. interpreted, high-level
Sentença 3: Python was a programming language from the 1960s-80s, a style for
Sentença 4: It is a programming language that emphasizes code readability with significant whitespace.
Sentença 5: it is a low-level programming language. a high-level, general


In [90]:
sentences = pipe(text, top_k=50, top_p=0.95, temperature=0.8, do_sample=True,  num_return_sequences=5, repetition_penalty=1.2)
for i, sentence in enumerate([x['generated_text'] for x in sentences]):
    print("Sentença " + str(i + 1) + ": " + sentence)

Sentença 1: Python is a programming language for general-purpose projects. designed to help programmers
Sentença 2: . a well-known and widely used programming language that is interpreted,
Sentença 3: Python is a high-level and general-purpose programming language.
Sentença 4: a programming language designed to improve code readability and code reuse., high-
Sentença 5: The Python programming language is interpreted, high-level and general-purpose programming language.


Conclusion: I found that the sentences generated with top_p=0.95, top_k=50 and do_sample=True, combined with repetition_penalty, were the best.
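
For intuition about `repetition_penalty`, the sketch below assumes the CTRL-style rule: the score of every token that was already generated is divided by the penalty when positive and multiplied by it when negative, so repeating it becomes less likely. The logits are made up and `apply_repetition_penalty` is a hypothetical helper.

```python
def apply_repetition_penalty(logits, generated, penalty):
    """Lower the scores of tokens that already appear in the generated sequence."""
    adjusted = dict(logits)
    for token in set(generated):
        if token in adjusted:
            score = adjusted[token]
            # Positive scores shrink, negative scores become more negative
            adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = {"python": 2.4, "language": -1.0, "code": 0.5}
penalized = apply_repetition_penalty(logits, ["python", "language"], penalty=1.2)
print(penalized)
```

A penalty of 1.0 leaves the scores unchanged; values such as the 1.2 used above discourage repetition without forbidding it, in contrast to `no_repeat_ngram_size`.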

## Trying to call the pipeline in batches

In [96]:
# List of strings
lista_texto = [
    "Example 1: This is an example of a text with more than 20 words for testing.",
    "Example 2: Here is another example of a text with more than 20 words for validation.",
    "Example 3: This is a third example of a text with more than 20 words for verification.",
    "Example 4: Another example of a text with more than 20 words for analysis.",
    "Example 5: Sample text with more than 20 words for evaluation.",
    "Example 6: Example of a text with more than 20 words for use.",
    "Example 7: Test text with more than 20 words for demonstration.",
    "Example 8: Example of a long text with more than 20 words for experimentation.",
    "Example 9: Sample text with more than 20 words for comparison.",
    "Example 10: Example of a text with more than 20 words for reference.",
    "Example 11: Test text with more than 20 words for illustrative purposes.",
    "Example 12: Example of a text with more than 20 words for proof.",
    "Example 13: Sample text with more than 20 words for performance analysis.",
    "Example 14: Example of a text with more than 20 words for feature demonstration.",
    "Example 15: Test text with more than 20 words for validation.",
    "Example 16: Example of a text with more than 20 words for educational purposes.",
    "Example 17: Sample text with more than 20 words for usability testing.",
    "Example 18: Example of a text with more than 20 words for research purposes.",
    "Example 19: Test text with more than 20 words for quality evaluation.",
    "Example 20: Example of a text with more than 20 words for example purposes."
]



In [97]:
# Function that calls the pipe on a batch of texts
def generate_text_batch(batch):
    return pipe(batch, num_return_sequences=5, top_k=50, top_p=0.95, do_sample=True, repetition_penalty=1.3)


In [103]:

# Generate text sequences in batches
batch_size = 5
for i in range(0, len(lista_texto), batch_size):
    print(f'i {i}')
    batch = lista_texto[i:i+batch_size]
    print(f"batch: {batch}")
    generated_text = generate_text_batch(batch)
    # print(f"generated_text {generated_text}")
    for j, text in enumerate(batch):
        print(f'dentro do batch j {j}')
        print(f"Frase de entrada {i+j+1}: {text}")
        print(f"Frase gerada[0]: {generated_text[j][0]['generated_text']}\n")
        print(f"Frase gerada[-1]: {generated_text[j][-1]['generated_text']}\n")


i 0
batch: ['Example 1: This is an example of a text with more than 20 words for testing.', 'Example 2: Here is another example of a text with more than 20 words for validation.', 'Example 3: This is a third example of a text with more than 20 words for verification.', 'Example 4: Another example of a text with more than 20 words for analysis.', 'Example 5: Sample text with more than 20 words for evaluation.']
dentro do batch j 0
Frase de entrada 1: Example 1: This is an example of a text with more than 20 words for testing.
Frase gerada[0]: 2: This is an example of a text with more than 20 words for testing.

Frase gerada[-1]: 1: Below is an example. Example 2: This is 2: This is an

dentro do batch j 1
Frase de entrada 2: Example 2: Here is another example of a text with more than 20 words for validation.
Frase gerada[0]: Example 1: Here is another example of a text with more than 20 words for validation.

Frase gerada[-1]: Example 1: This is an example of a text with more than 20 wo
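
The manual loop above slices the list by index and recovers the global position with `i+j+1`. A minimal sketch of the same slicing as a reusable generator follows; `iter_batches` is a hypothetical helper name, not something defined in this notebook.

```python
def iter_batches(items, batch_size):
    """Yield (start_index, batch) pairs that cover `items` in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        # The last batch may be shorter than batch_size.
        yield start, items[start:start + batch_size]
```

With such a helper, the loop body could read `for start, batch in iter_batches(lista_texto, batch_size): ...`, using `start + j + 1` for the input number.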

In [99]:
from datasets import Dataset

In [100]:

# Create the dataset
dataset = Dataset.from_dict({"text": lista_texto})


In [104]:
# Generate text sequences in batches
batch_size = 5
for i in range(0, len(lista_texto), batch_size):
    print(f'i {i}')
    batch = dataset[i:i+batch_size]['text']
    print(f"batch: {batch}")
    generated_text = generate_text_batch(batch)
    # print(f"generated_text {generated_text}")
    for j, text in enumerate(batch):
        print(f'dentro do batch j {j}')
        print(f"Frase de entrada {i+j+1}: {text}")
        print(f"Frase gerada[0]: {generated_text[j][0]['generated_text']}\n")
        print(f"Frase gerada[-1]: {generated_text[j][-1]['generated_text']}\n")

i 0
batch: ['Example 1: This is an example of a text with more than 20 words for testing.', 'Example 2: Here is another example of a text with more than 20 words for validation.', 'Example 3: This is a third example of a text with more than 20 words for verification.', 'Example 4: Another example of a text with more than 20 words for analysis.', 'Example 5: Sample text with more than 20 words for evaluation.']
dentro do batch j 0
Frase de entrada 1: Example 1: This is an example of a text with more than 20 words for testing.
Frase gerada[0]: 2: This is an example of a text with more than 20 words for testing.

Frase gerada[-1]: 1: This is an example of a text with more than 20 words for testing.

dentro do batch j 1
Frase de entrada 2: Example 2: Here is another example of a text with more than 20 words for validation.
Frase gerada[0]: Example 3: Here is a second example of text with more than 20 words for validation.

Frase gerada[-1]: Example 3: This is another example of a text with

In [106]:
# Configure the parameters for the pipeline call
batch_size = 5
num_workers = 8

# Call the pipeline with batch_size and num_workers (the device is chosen when the pipeline is created).
# Note: temperature only affects sampling; unless do_sample is enabled it has no effect on the output.
generated_text = pipe(lista_texto, num_workers=num_workers, batch_size=batch_size, return_tensors=False, temperature=0.8)

# Iterate over the generated texts
for i, text in enumerate(generated_text):
    print(f'Texto de entrada {i + 1}: {lista_texto[i]}')
    print(f'Texto gerado: {text}\n')


Texto de entrada 1: Example 1: This is an example of a text with more than 20 words for testing.
Texto gerado: {'generated_text': '2: This is an example of a text with more than 20 words for testing.'}

Texto de entrada 2: Example 2: Here is another example of a text with more than 20 words for validation.
Texto gerado: {'generated_text': 'Example 1: Here is another example of a text with more than 20 words for validation.'}

Texto de entrada 3: Example 3: This is a third example of a text with more than 20 words for verification.
Texto gerado: {'generated_text': '2: This is a third example of a text with more than 20 words for'}

Texto de entrada 4: Example 4: Another example of a text with more than 20 words for analysis.
Texto gerado: {'generated_text': 'Example 5: Another example of a text with more than 20 words for analysis.'}

Texto de entrada 5: Example 5: Sample text with more than 20 words for evaluation.
Texto gerado: {'generated_text': 'Example 4: Sample text with more than

## Trying the pipeline with the fine-tuned model


In [20]:
tokenizer = T5Tokenizer.from_pretrained("t5-base")

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [21]:
model = T5ForConditionalGeneration.from_pretrained(f'{DIRETORIO_TRABALHO}/base-checkpoint-2340')

In [50]:
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device=0, repetition_penalty=1.3)

In [38]:
# List of input strings
lista_texto = [
    "Example 1: This is an example of a text with more than 20 words for testing.",
    "Example 2: Here is another example of a text with more than 20 words for validation.",
    "Example 3: This is a third example of a text with more than 20 words for verification.",
    "Example 4: Another example of a text with more than 20 words for analysis.",
    "Example 5: Sample text with more than 20 words for evaluation.",
    "Example 6: Example of a text with more than 20 words for use.",
    "Example 7: Test text with more than 20 words for demonstration.",
    "Example 8: Example of a long text with more than 20 words for experimentation.",
    "Example 9: Sample text with more than 20 words for comparison.",
    "Example 10: Example of a text with more than 20 words for reference.",
    "Example 11: Test text with more than 20 words for illustrative purposes.",
    "Example 12: Example of a text with more than 20 words for proof.",
    "Example 13: Sample text with more than 20 words for performance analysis.",
    "Example 14: Example of a text with more than 20 words for feature demonstration.",
    "Example 15: Test text with more than 20 words for validation.",
    "Example 16: Example of a text with more than 20 words for educational purposes.",
    "Example 17: Sample text with more than 20 words for usability testing.",
    "Example 18: Example of a text with more than 20 words for research purposes.",
    "Example 19: Test text with more than 20 words for quality evaluation.",
    "Example 20: Example of a text with more than 20 words for example purposes."
]



In [58]:
def remove_duplicatas(generated_text, max_sequences):
    # For each input text, keep at most max_sequences distinct generated sentences.
    frases_filtradas = []
    for retornos in generated_text:
        frases_geradas = []  # distinct sentences, in generation order
        vistas = set()       # sentences already seen, for fast membership checks
        for retorno in retornos:
            frase = retorno["generated_text"]
            if frase not in vistas:  # skip sentences that were already generated
                vistas.add(frase)
                frases_geradas.append(frase)
                if len(frases_geradas) == max_sequences:  # stop once the limit is reached
                    break
        frases_filtradas.append(frases_geradas)
    return frases_filtradas



In [61]:
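def gerar_texto(lista_texto, parm_num_return_sequences:int=5):
    # Sample twice as many sequences as requested, to leave room for duplicates that get removed.
    # Relies on the notebook-level variables batch_size and num_workers.
    generated_text = pipe(lista_texto, num_workers=num_workers, batch_size=batch_size, top_k=50, do_sample=True, num_return_sequences=parm_num_return_sequences*2)
    return remove_duplicatas(generated_text, max_sequences=parm_num_return_sequences)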
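
The stage named at the top of this notebook is expanding texts with generated queries. A minimal sketch of that step follows, assuming expansion means appending the generated queries to the document text, as in the usual doc2query recipe. The name `expand_document` and the `separator` parameter are hypothetical and do not come from this notebook.

```python
def expand_document(text, queries, separator=" "):
    """Return the document text followed by its generated queries.

    Empty or whitespace-only queries are skipped so they do not add stray separators.
    """
    parts = [text.strip()]
    parts.extend(q.strip() for q in queries if q and q.strip())
    return separator.join(parts)
```

For example, the list returned by `gerar_texto` for one input could be passed as `queries` together with that input's text.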


In [62]:
# Configure the parameters for the pipeline call
batch_size = 5
num_workers = 8

# Earlier attempt: calling the pipeline directly with batch_size and num_workers
#generated_text = pipe(lista_texto, num_workers=num_workers, batch_size=batch_size, top_k=50, do_sample=True, temperature=0.2, num_return_sequences=5)

lista_texto_gerado = gerar_texto(lista_texto, 5)
# Iterate over the generated texts
for i, text in enumerate(lista_texto_gerado):
    print(f'Texto de entrada {i + 1}: {lista_texto[i]}')
    print(f'Texto gerado: {text}\n')


Texto de entrada 1: Example 1: This is an example of a text with more than 20 words for testing.
Texto gerado: ['how many words are in an example of a math', 'how many words in an example of graph', 'how many words are in an example of a test', 'how many words in an example of rich text', 'how many words in an example of a text']

Texto de entrada 2: Example 2: Here is another example of a text with more than 20 words for validation.
Texto gerado: ['how many words on an example of data', 'how many words in an example of dynamic text', 'how many words in an example of texting', 'how many words in an example of validation', 'how many words in an example of validation?']

Texto de entrada 3: Example 3: This is a third example of a text with more than 20 words for verification.
Texto gerado: ['how many words in an example of a math', 'how many words in an example of cod', 'how many words in an example of rich text', 'how many words in an example of such text', 'how many words in an example
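
In the output above, the second input yields both 'how many words in an example of validation' and the same sentence with a trailing '?'. Exact string comparison treats these as distinct. A minimal sketch of a more lenient deduplication follows, assuming that case, extra whitespace and trailing punctuation should be ignored. `normalize_query` and `dedupe_normalized` are hypothetical helpers, not part of this notebook.

```python
import string


def normalize_query(query):
    """Lowercase, collapse whitespace, and strip trailing punctuation."""
    collapsed = " ".join(query.lower().split())
    return collapsed.rstrip(string.punctuation + " ")


def dedupe_normalized(queries, max_queries):
    """Keep the first occurrence of each normalized query, up to max_queries."""
    seen = set()
    kept = []
    for query in queries:
        key = normalize_query(query)
        if key and key not in seen:
            seen.add(key)
            kept.append(query)  # keep the original surface form
            if len(kept) == max_queries:
                break
    return kept
```

Whether such near-duplicates should be merged depends on how the expanded text is indexed, so this is a design choice rather than a correction.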

# Downloading the data

In [63]:
!wget https://huggingface.co/datasets/BeIR/trec-covid/resolve/main/corpus.jsonl.gz

--2023-04-10 20:26:35--  https://huggingface.co/datasets/BeIR/trec-covid/resolve/main/corpus.jsonl.gz
Resolving huggingface.co (huggingface.co)... 108.158.122.51, 108.158.122.64, 108.158.122.28, ...
Connecting to huggingface.co (huggingface.co)|108.158.122.51|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/a8/10/a810e88b0e7b233be82b89c1fa6ec2d75efc6d55784c2ada9dcac8434a634f3a/e9e97686e3138eaff989f67c04cd32e8f8f4c0d4857187e3f180275b23e24e85?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27corpus.jsonl.gz%3B+filename%3D%22corpus.jsonl.gz%22%3B&response-content-type=application%2Fgzip&Expires=1681428195&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZG4tbGZzLmh1Z2dpbmdmYWNlLmNvL3JlcG9zL2E4LzEwL2E4MTBlODhiMGU3YjIzM2JlODJiODljMWZhNmVjMmQ3NWVmYzZkNTU3ODRjMmFkYTlkY2FjODQzNGE2MzRmM2EvZTllOTc2ODZlMzEzOGVhZmY5ODlmNjdjMDRjZDMyZThmOGY0YzBkNDg1NzE4N2UzZjE4MDI3NWIyM2UyNGU4NT9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0

In [None]:
if not os.path.exists(f"{DIRETORIO_TRABALHO}/corpus.jsonl.gz"):
    !wget https://huggingface.co/datasets/BeIR/trec-covid/resolve/main/corpus.jsonl.gz
    !mv corpus.jsonl.gz {DIRETORIO_TRABALHO}

In [64]:
import gzip

In [69]:
import json

In [71]:
# Path to the compressed file
arquivo_gz = f'{DIRETORIO_TRABALHO}/corpus.jsonl.gz'

In [67]:
# Decompress the file into memory
with gzip.open(arquivo_gz, 'rt', encoding='utf-8') as f:
    # Parse each line of the decompressed file as a JSON object
    corpus_original = [json.loads(line) for line in f]
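
The list comprehension above loads the whole corpus into memory at once. A minimal sketch of a streaming alternative follows, assuming the file holds one JSON object per line. `iter_jsonl_gz` is a hypothetical helper name, not part of this notebook.

```python
import gzip
import json


def iter_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)
```

Streaming lets documents be processed in batches without holding the full corpus as a list. `list(iter_jsonl_gz(arquivo_gz))` would still produce the same in-memory list when that is preferred.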

In [68]:
# Display the loaded data
print(f"{type(queries)} len(queries): {len(queries)}")

<class 'list'> len(queries): 50
