# Overview

- RAG (Retrieval Augmented Generation) with [Llama-3](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
- Data: [100 LLM Papers to explore](https://www.kaggle.com/datasets/ruchi798/100-llm-papers-to-explore)
- Runs on a single P100 or on dual T4 GPUs

# Imports

In [1]:
! nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-2e3899ba-3ea1-3211-0854-e26a32e746c7)
GPU 1: Tesla T4 (UUID: GPU-ee37cd52-0891-ed9b-2c84-49f2ca94cb21)


In [2]:
%%time

from IPython.display import clear_output

! pip install -qq -U langchain
! pip install -qq -U tiktoken
! pip install -qq -U pypdf
! pip install -qq -U faiss-gpu

! pip install sentence_transformers==2.2.2
! pip install -qq -U InstructorEmbedding

! pip install -qq -U transformers 
! pip install -qq -U accelerate
! pip install -qq -U bitsandbytes

clear_output()

CPU times: user 1.88 s, sys: 462 ms, total: 2.34 s
Wall time: 2min 31s


In [3]:
%%time

import warnings
warnings.filterwarnings("ignore")

import os
import glob
import gc

### format output
import textwrap
import time

import langchain

### loaders
from langchain.document_loaders import PyPDFLoader, DirectoryLoader

### splits
from langchain.text_splitter import RecursiveCharacterTextSplitter

### prompts
from langchain.prompts import PromptTemplate

### vector stores
from langchain_community.vectorstores import FAISS

### models
from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceInstructEmbeddings

### retrievers
from langchain.chains import RetrievalQA

import torch

import transformers
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    BitsAndBytesConfig,
    pipeline
)

clear_output()

CPU times: user 11 s, sys: 1.77 s, total: 12.7 s
Wall time: 22.3 s


In [4]:
print('langchain:', langchain.__version__)
print('torch:', torch.__version__)
print('transformers:', transformers.__version__)

langchain: 0.1.16
torch: 2.1.2
transformers: 4.40.1


In [5]:
len(glob.glob('/kaggle/input/100-llm-papers-to-explore/*'))

100

# Keys

- You need to accept the terms and request access to the Llama-3 weights on the [model page on Hugging Face](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
- Once access is granted, you can use your Hugging Face token to download the weights
- Add your HF token to your Kaggle Notebook Secrets (Add-ons --> Secrets)
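For readers running outside Kaggle, the token lookup can be sketched as a small helper that checks the environment first and falls back to a secrets getter. `resolve_hf_token` is a hypothetical name, not part of any library; on Kaggle the getter would be `UserSecretsClient().get_secret`.

```python
import os


def resolve_hf_token(secret_getter=None, env=None,
                     secret_name="huggingfacehub_api_token"):
    """Return a Hugging Face token (hypothetical helper, not a library API).

    Looks at the HF_TOKEN environment variable first, then asks
    `secret_getter(secret_name)` if a getter was supplied.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if not token and secret_getter is not None:
        token = secret_getter(secret_name)
    if not token:
        raise RuntimeError("No Hugging Face token found; set HF_TOKEN or add a secret")
    return token


# Example with a stand-in getter instead of the Kaggle client
print(resolve_hf_token(secret_getter=lambda name: "hf_dummy", env={}))
```

Failing early with a clear message is a design choice here: a missing token otherwise only surfaces later, as an authorization error while downloading the weights.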

In [6]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
os.environ["HF_TOKEN"] = user_secrets.get_secret("huggingfacehub_api_token")

# CFG

- A single CFG class keeps every tunable setting in one place, which makes experiments easier to organize and compare

In [7]:
class CFG:
    DEBUG = False
    
    # LLM
    model_name = 'meta-llama/Meta-Llama-3-8B-Instruct'
    temperature = 0.4
    top_p = 0.90
    repetition_penalty = 1.15
    max_len = 8192
    max_new_tokens = 512

    # splitting
    split_chunk_size = 800
    split_overlap = 400
    
    # embeddings
    embeddings_model_repo = 'BAAI/bge-base-en-v1.5'

    # similar passages
    k = 6
    
    # paths
    PDFs_path = '/kaggle/input/100-llm-papers-to-explore/'
    Embeddings_path =  '/kaggle/input/faiss-ml-papers-st'
    Output_folder = './ml-papers-vectordb'
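One way to use a config class like this for experiments is to subclass it and override only the values under test. The sketch below uses a trimmed stand-in called `BaseCFG` so it does not clobber the notebook's `CFG`; `SmallChunksCFG` and `describe` are hypothetical names introduced for illustration.

```python
class BaseCFG:
    # Trimmed stand-in for the notebook's CFG (same values as above)
    DEBUG = False
    split_chunk_size = 800
    split_overlap = 400
    k = 6


class SmallChunksCFG(BaseCFG):
    # Experiment variant: override only what changes
    split_chunk_size = 400
    split_overlap = 100


def describe(cfg):
    """Collect the public settings of a config class into a dict."""
    return {name: getattr(cfg, name) for name in dir(cfg) if not name.startswith("_")}


print(describe(SmallChunksCFG))
```

Settings that are not overridden, such as `k`, are inherited unchanged, so each variant documents exactly what differs from the baseline.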

# Load data

In [8]:
%%time

loader = DirectoryLoader(
    CFG.PDFs_path,
    glob = "./*3215v3.pdf" if CFG.DEBUG else "./*.pdf",
    loader_cls = PyPDFLoader,
    show_progress = True,
    use_multithreading = True
)

documents = loader.load()

100%|██████████| 100/100 [06:02<00:00,  3.63s/it]

CPU times: user 6min 2s, sys: 7.32 s, total: 6min 9s
Wall time: 6min 2s





In [9]:
print(f'We have {len(documents)} pages in total')

We have 2871 pages in total


# Splitter

In [10]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = CFG.split_chunk_size,
    chunk_overlap = CFG.split_overlap
)

texts = text_splitter.split_documents(documents)

print(f'We have created {len(texts)} chunks from {len(documents)} pages')

We have created 23268 chunks from 2871 pages
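To see how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-window splitter. It is only a sketch: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this version ignores. `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text, chunk_size=800, overlap=400):
    """Cut `text` into windows of `chunk_size` characters that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


page = "x" * 2000  # stand-in for the text of one PDF page
print(len(sliding_chunks(page)))
```

With a 400-character overlap on 800-character chunks, each window advances by only 400 characters, so most of the text lands in two chunks. This helps keep sentences that straddle a boundary retrievable, at the cost of a larger index.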


# Create Embeddings

In [11]:
%%time

### we create the embeddings if they do not already exist in the input folder
if not os.path.exists(CFG.Embeddings_path + '/index.faiss'):
    
    print('Creating embeddings...\n\n')

    ### download embeddings model
    embeddings = HuggingFaceInstructEmbeddings(
        model_name = CFG.embeddings_model_repo,
        model_kwargs = {"device": "cuda"}
    )

    ### create embeddings and DB
    vectordb = FAISS.from_documents(
        documents = texts, 
        embedding = embeddings
    )

    ### persist vector database
    vectordb.save_local(f"{CFG.Output_folder}/faiss_index_ml_papers") # save in output folder
#     vectordb.save_local(f"{CFG.Embeddings_path}/faiss_index_ml_papers") # save in input folder

clear_output()

CPU times: user 9min 23s, sys: 3.68 s, total: 9min 27s
Wall time: 9min 15s


# Load Embeddings

- If you have already created and saved the embeddings, you can skip the creation step and load them directly

In [12]:
%%time

### download embeddings model
embeddings = HuggingFaceInstructEmbeddings(
    model_name = CFG.embeddings_model_repo,
    model_kwargs = {"device": "cuda"}
)

### load vector DB embeddings
vectordb = FAISS.load_local(
#     CFG.Embeddings_path, # from input folder
    CFG.Output_folder + '/faiss_index_ml_papers', # from output folder
    embeddings,
    allow_dangerous_deserialization = True,
)

clear_output()

CPU times: user 404 ms, sys: 313 ms, total: 717 ms
Wall time: 519 ms
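The creation cell checks `CFG.Embeddings_path` for `index.faiss`, while the load cell reads from `CFG.Output_folder`. A small helper can pick whichever folder actually holds a saved index. This is a sketch; `pick_index_dir` is a hypothetical name, and it only assumes that a saved index folder contains a file called `index.faiss`, as the check in the creation cell does.

```python
import os


def pick_index_dir(candidates):
    """Return the first directory in `candidates` that contains index.faiss, else None."""
    for path in candidates:
        if os.path.exists(os.path.join(path, "index.faiss")):
            return path
    return None


# Paths taken from CFG above; the result depends on what exists on disk
index_dir = pick_index_dir([
    "/kaggle/input/faiss-ml-papers-st",
    "./ml-papers-vectordb/faiss_index_ml_papers",
])
print(index_dir)
```

The returned path could then be passed to `FAISS.load_local` in place of the hard-coded folder.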


In [13]:
%%time

### test if vector DB was loaded correctly
vectordb.similarity_search('scaling laws')

CPU times: user 158 ms, sys: 8 ms, total: 166 ms
Wall time: 206 ms


[Document(page_content='itory , 2020. doi: 10.48550/arXiv.2010.14701. URL\nhttps://arxiv.org/abs/2010.14701v2 . Ver-\nsion 2.\nHernandez, D., Kaplan, J., Henighan, T., and McCandlish,\nS. Scaling laws for transfer. Computing Research Repos-\nitory , 2021. doi: 10.48550/arXiv.2102.01293. URL\nhttps://arxiv.org/abs/2102.01293v1 . Ver-\nsion 1.\nHernandez, D., Brown, T., Conerly, T., DasSarma, N.,\nDrain, D., El-Showk, S., Elhage, N., Hatfield-Dodds,\nZ., Henighan, T., Hume, T., Johnston, S., Mann, B., Olah,\nC., Olsson, C., Amodei, D., Joseph, N., Kaplan, J., and\nMcCandlish, S. Scaling laws and interpretability of learn-\ning from repeated data. Computing Research Reposi-\ntory, 05 2022. doi: 10.48550/arXiv.2205.10487. URL\nhttps://arxiv.org/abs/2205.10487v1 . Ver-\nsion 1.', metadata={'source': '/kaggle/input/100-llm-papers-to-explore/2304.01373.pdf', 'page': 11}),
 Document(page_content='Contents\n1 Introduction 3\n2 Scaling law experiments 7\n2.1 Scaling laws . . . . . . . . . . . . 
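Conceptually, a similarity search embeds the query and returns the stored vectors closest to it. The sketch below does this by brute force with squared Euclidean distance on toy 2-D vectors. It assumes an exhaustive (flat) index; real embeddings have hundreds of dimensions and FAISS performs the search far more efficiently. `nearest` is a hypothetical helper.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vec, doc_vecs, k=4):
    """Return the indices of the k stored vectors closest to the query."""
    ranked = sorted(range(len(doc_vecs)), key=lambda i: squared_l2(query_vec, doc_vecs[i]))
    return ranked[:k]


# Toy vectors standing in for chunk embeddings
docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(nearest([1.0, 0.0], docs, k=2))
```

Note that closeness in embedding space is not the same as usefulness: the first hit above is a bibliography fragment that merely mentions "scaling laws" many times.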

# Model

In [14]:
def build_model(model_repo = CFG.model_name):

    print('\nDownloading model: ', model_repo, '\n\n')

    ### tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_repo)

    ### quantization
    bnb_config = BitsAndBytesConfig(
        load_in_4bit = True,
        bnb_4bit_quant_type = "nf4",
        bnb_4bit_compute_dtype = torch.float16,
        bnb_4bit_use_double_quant = True,
    )        

    ### model
    model = AutoModelForCausalLM.from_pretrained(
        model_repo,
        quantization_config = bnb_config,
        device_map = 'auto',
        low_cpu_mem_usage = True,
    )

    return tokenizer, model

In [15]:
%%time

tokenizer, model = build_model(model_repo = CFG.model_name)

clear_output()

CPU times: user 39.9 s, sys: 38.8 s, total: 1min 18s
Wall time: 2min 36s


In [16]:
gc.collect()

51

In [17]:
model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): Ll
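The layer shapes printed above are enough for a rough estimate of how much memory the quantized weights need. The arithmetic below counts only the `Linear4bit` weights and the embedding table shown in the printout; the output is truncated before `lm_head`, so that layer is left out. It also ignores quantization constants, normalization weights, activations, and the KV cache, and it assumes the embedding table is stored in 16-bit precision, so treat the result as a lower bound.

```python
HIDDEN, KV_DIM, MLP_DIM = 4096, 1024, 14336
VOCAB, LAYERS = 128256, 32

# Weights per decoder layer, read off the printed shapes
attn_params = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q, o + k, v projections
mlp_params = 3 * HIDDEN * MLP_DIM                        # gate, up, down projections
per_layer = attn_params + mlp_params

linear4bit_params = LAYERS * per_layer
embed_params = VOCAB * HIDDEN

# 4 bits = 0.5 bytes per quantized weight; 2 bytes per 16-bit embedding weight (assumption)
linear_gib = linear4bit_params * 0.5 / 2**30
embed_gib = embed_params * 2 / 2**30

print(f"4-bit linear weights: {linear4bit_params:,} params, {linear_gib:.2f} GiB")
print(f"embedding table:      {embed_params:,} params, {embed_gib:.2f} GiB")
```

An estimate in this range is consistent with the model fitting on a 16 GB P100 or across two T4s, with room left for activations and the KV cache.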

In [18]:
### check how Accelerate split the model across the available devices (GPUs)
model.hf_device_map

{'model.embed_tokens': 0,
 'model.layers.0': 0,
 'model.layers.1': 0,
 'model.layers.2': 0,
 'model.layers.3': 0,
 'model.layers.4': 0,
 'model.layers.5': 0,
 'model.layers.6': 0,
 'model.layers.7': 0,
 'model.layers.8': 1,
 'model.layers.9': 1,
 'model.layers.10': 1,
 'model.layers.11': 1,
 'model.layers.12': 1,
 'model.layers.13': 1,
 'model.layers.14': 1,
 'model.layers.15': 1,
 'model.layers.16': 1,
 'model.layers.17': 1,
 'model.layers.18': 1,
 'model.layers.19': 1,
 'model.layers.20': 1,
 'model.layers.21': 1,
 'model.layers.22': 1,
 'model.layers.23': 1,
 'model.layers.24': 1,
 'model.layers.25': 1,
 'model.layers.26': 1,
 'model.layers.27': 1,
 'model.layers.28': 1,
 'model.layers.29': 1,
 'model.layers.30': 1,
 'model.layers.31': 1,
 'model.norm': 1,
 'lm_head': 1}
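A quick way to summarize a device map like the one above is to count modules per device. The sketch below rebuilds the same mapping by hand so it runs without the model loaded; with the model in memory, `model.hf_device_map` could be passed instead. `modules_per_device` is a hypothetical helper.

```python
from collections import Counter


def modules_per_device(device_map):
    """Count how many top-level modules were placed on each device."""
    return Counter(device_map.values())


# Hand-built copy of the mapping printed above
device_map = {"model.embed_tokens": 0}
device_map.update({f"model.layers.{i}": 0 for i in range(0, 8)})
device_map.update({f"model.layers.{i}": 1 for i in range(8, 32)})
device_map.update({"model.norm": 1, "lm_head": 1})

print(modules_per_device(device_map))
```

The split is uneven: GPU 0 holds the embedding table and 8 decoder layers, while GPU 1 holds the remaining 24 layers, the final norm, and the output head.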

# Pipeline

In [19]:
terminators = [
    tokenizer.eos_token_id, # 128001
    tokenizer.convert_tokens_to_ids("<|eot_id|>") # 128009
]


### hugging face pipeline
pipe = pipeline(
    task = "text-generation",
    
    model = model,
    
    tokenizer = tokenizer,
#     pad_token_id = tokenizer.eos_token_id,
    eos_token_id = terminators,
    
    do_sample = True,
#     max_length = CFG.max_len,
    max_new_tokens = CFG.max_new_tokens,
    
    
    temperature = CFG.temperature,
    top_p = CFG.top_p,
    repetition_penalty = CFG.repetition_penalty,
)

### langchain pipeline
llm = HuggingFacePipeline(pipeline = pipe)
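The effect of `temperature` and `top_p` can be illustrated on a toy distribution. This is a conceptual sketch of temperature scaling followed by nucleus (top-p) filtering, not the `transformers` implementation, and it leaves out `repetition_penalty`. `nucleus` is a hypothetical helper.

```python
import math


def nucleus(logits, temperature=0.4, top_p=0.9):
    """Return {token_index: probability} for the tokens kept by top-p filtering."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalize over the kept tokens


logits = [2.0, 1.0, 0.0]
print(nucleus(logits, temperature=1.0))
print(nucleus(logits, temperature=0.4))
```

A lower temperature sharpens the distribution, so fewer tokens survive the top-p cut. With the values in `CFG`, sampling stays close to the most likely continuation while still allowing some variation.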

# Prompt Template

[Prompt format for Llama-3 chat](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/)

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_message }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
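The format above can be assembled with a small helper. This is a sketch that follows the template shown, with a blank line after each header; `build_llama3_prompt` is a hypothetical name.

```python
def build_llama3_prompt(system_prompt, user_message):
    """Assemble a single-turn Llama-3 chat prompt from the template shown above."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


print(build_llama3_prompt("You are a helpful assistant.", "What is RAG?"))
```

The prompt ends with the assistant header so that the model continues with its answer and emits `<|eot_id|>` when it is done, which is why that token is listed among the terminators in the pipeline.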

In [20]:
prompt_template = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant who receives excerpted parts of a context along with a question.

Use only the following parts of the context to answer the question at the end.

<|eot_id|><|start_header_id|>user<|end_header_id|>

Context: {context}

Answer the following question:

{question}

<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""


PROMPT = PromptTemplate(
    template = prompt_template, 
    input_variables = ["context", "question"]
)

# Retrieval chain

- Retrieval chain using previously created vector database

In [21]:
retriever = vectordb.as_retriever(
    search_type = "similarity",
    search_kwargs = {"k": CFG.k}
)

qa_chain = RetrievalQA.from_chain_type(
    llm = llm,
    chain_type = "stuff", # map_reduce, map_rerank, stuff, refine
    retriever = retriever, 
    chain_type_kwargs = {"prompt": PROMPT},
    return_source_documents = True,
    verbose = False
)
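The `"stuff"` chain type can be pictured as: retrieve `k` passages, join them into one context string, and fill the prompt. The sketch below mimics that step in plain Python. It is not LangChain's code, and the `"\n\n"` separator is an assumption about its default. `stuff_prompt` is a hypothetical helper.

```python
def stuff_prompt(template, passages, question, separator="\n\n"):
    """Join the retrieved passages and substitute them into the prompt template."""
    context = separator.join(passages)
    return template.format(context=context, question=question)


template = "Context: {context}\n\nAnswer the following question:\n\n{question}"
passages = ["Passage one about scaling laws.", "Passage two about compute budgets."]
print(stuff_prompt(template, passages, "What are scaling laws?"))
```

Because every retrieved chunk goes into a single prompt, `k` times the chunk size has to fit in the model's context window together with the question and the generated answer.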

# Post-process outputs

In [22]:
def wrap_text_preserve_newlines(text, width=700):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text


def process_llm_response(llm_response):
    
    sources_used = ' \n'.join(
        [
            source.metadata['source'].split('/')[-1][:-4]
            + ' - page: '
            + str(source.metadata['page'])
            for source in llm_response['source_documents']
        ]
    )
    
    ans = wrap_text_preserve_newlines(llm_response['result'])
    
    ans = ans + '\n\nSources: \n' + sources_used
    
    ### return only the text after the pattern
    pattern = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    index = ans.find(pattern)
    if index != -1:
        ans = ans[index + len(pattern):]    
    
    return ans

def llm_ans(query):
    start = time.time()
    
    llm_response = qa_chain.invoke(query)
    ans = process_llm_response(llm_response)
    
    end = time.time()

    time_elapsed = int(round(end - start, 0))
    time_elapsed_str = f'\n\nTime elapsed: {time_elapsed} s'
    return ans + time_elapsed_str
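The source labels are built by slicing `.pdf` off the file name with `[:-4]`. A slightly more defensive variant, sketched below, uses `os.path` so that other extensions would also be handled. `source_label` is a hypothetical helper, and the metadata dict is copied from the similarity search output earlier in the notebook.

```python
import os


def source_label(metadata):
    """Format chunk metadata as '<file name without extension> - page: <n>'."""
    name = os.path.splitext(os.path.basename(metadata["source"]))[0]
    return f"{name} - page: {metadata['page']}"


meta = {"source": "/kaggle/input/100-llm-papers-to-explore/2304.01373.pdf", "page": 11}
print(source_label(meta))
```

The answers below include entries such as `page: 0`, so the page numbers are zero-based.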

# Ask Questions

In [23]:
query = "Tell me about the scaling laws for language models"
result = llm_ans(query)
clear_output()
print(result)



According to the provided context, there have been studies on understanding the scaling behavior of language models and their transfer properties. Specifically, researchers have identified unified scaling laws for various types of language models, such as those mentioned in references [22], [24], [25], and [27]. These scaling laws aim to describe how the performance of language models changes as they grow in size. Some notable examples include "Unified scaling laws for routed language models" by Hernandez et al. (2022) and "Scaling laws for autoregressive generative modeling" by Henighan et al. (2020). Additionally, other papers have explored specific aspects of scaling laws, such as the
relationship between model size and performance, or the impact of scaling on different tasks and applications.

Sources: 
2203.15556 - page: 2 
2203.15556 - page: 16 
2303.18223 - page: 87 
2101.03961 - page: 36 
2203.15556 - page: 17 
2005.14165 - page: 40

Time elapsed: 16 s


In [24]:
query = "what is grouped query attention and its advantages?"
result = llm_ans(query)
clear_output()
print(result)



According to the provided context, Grouped Query Attention (GQA) is mentioned as a method that makes a trade-off between multi-query attention and multi-head attention. It assigns heads into different groups, and those heads that belong to the same group will share the same transformation matrices. The advantage of GQA is that it has been adopted and empirically tested in the recently released LLaMA 2 model.

Sources: 
2303.18223 - page: 24 
2204.02311 - page: 52 
2303.18223 - page: 81 
2211.05102 - page: 8 
1911.02150 - page: 6 
2307.03172 - page: 13

Time elapsed: 9 s


In [25]:
query = "Explain Rotary Position Embeddings to me as if I'm 10 years old"
result = llm_ans(query)
clear_output()
print(result)



So, you know how sometimes we need to remember things like what's happening before or after something else? Like, "I had breakfast before I went to school"? Well, when computers try to understand sentences or stories, they also need to keep track of what comes before and after certain words. That's where "Rotary Position Embeddings" come in!

Imagine you're playing a game where you spin around in a circle. You can move your arms and legs around, and everything looks different because you've moved. But, if someone asked you about what happened yesterday, you could still tell them because you remembered what you did. It's kind of like that! When we talk about Rotary Position Embeddings, we're talking about making sure the computer remembers what came before and after each word, so it can make sense of the whole sentence.

The way it works is like taking all the words and twisting them around in a special way. This makes it easier for the computer to figure out what's going on. It's lik

In [26]:
query = "can you give me some tips on how to properly prune neural networks?"
result = llm_ans(query)
clear_output()
print(result)



Based on the provided text, here's what can be inferred about proper pruning of neural networks according to the authors' method:

* Parameters of each individual layer should be pruned independently based on second-order derivatives of a layer-wise error function with respect to the corresponding parameters.
* The final prediction performance drop after pruning is bounded by a linear combination of the reconstructed errors caused at each layer. This means that by controlling layer-wise errors properly, only a light retraining process may be necessary to resume the original prediction performance.
* To prune effectively, it seems important to consider the reconstructed errors for each layer and ensure that they do not exceed certain thresholds, thereby maintaining the overall performance of the network.

These points suggest that the key to proper pruning lies in carefully considering the impact of pruning on each layer individually and ensuring that the resulting errors remain withi

In [27]:
query = "What is Sequence to Sequence Learning in Neural Networks?"
result = llm_ans(query)
clear_output()
print(result)



Based on the provided context, Sequence to Sequence Learning in Neural Networks refers to the ability of neural networks to map sequences to sequences, where the network learns to transform one sequence into another.

Sources: 
1803.02155 - page: 0 
1409.3215v3 - page: 1 
1409.3215v3 - page: 0 
radford2018improving - page: 10 
1706.03762 - page: 1 
1901.02860 - page: 8

Time elapsed: 5 s


In [28]:
query = "what is the difference between DistilBERT and BERT?"
result = llm_ans(query)
clear_output()
print(result)



According to the provided text, DistilBERT has 40% fewer parameters than BERT and is 60% faster than BERT.

Sources: 
2211.09718 - page: 11 
1910.01108 - page: 3 
1910.01108 - page: 0 
1910.01108 - page: 2 
2211.09718 - page: 9 
1910.01108 - page: 3

Time elapsed: 4 s


# pt-br

- Testing in another language (Brazilian Portuguese)
- It was hard to find a generic prompt that makes the model detect the language of the question on the fly and answer in that same language
- What worked better was writing the prompt itself in the language we want the answer in. This is less flexible, but it usually works well
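One way to organize the per-language approach described above is a small lookup keyed by a language code chosen by the caller, rather than detected automatically. The sketch below stores only the instruction line of each prompt, copied from the two templates in this notebook; `INSTRUCTIONS` and `instruction_for` are hypothetical names.

```python
# Instruction lines taken from the English and Portuguese prompt templates
INSTRUCTIONS = {
    "en": "Answer the following question:",
    "pt-br": "Responda a pergunta abaixo em português brasileiro.",
}


def instruction_for(lang, default="en"):
    """Return the instruction line for `lang`, falling back to the default language."""
    return INSTRUCTIONS.get(lang.lower(), INSTRUCTIONS[default])


print(instruction_for("pt-br"))
print(instruction_for("fr"))  # no French prompt defined, falls back to English
```

The caller has to know the language up front, which is the loss of flexibility mentioned above, but each prompt can then be written and tested in its own language.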

In [29]:
prompt_template = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Você é um assistente útil que recebe partes extraídas de um contexto junto com uma pergunta.

Use apenas as seguintes partes do contexto para responder à pergunta no final.

<|eot_id|><|start_header_id|>user<|end_header_id|>

Context: {context}

Responda a pergunta abaixo em português brasileiro.

{question}

<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

PROMPT = PromptTemplate(
    template = prompt_template, 
    input_variables = ["context", "question"]
)

retriever = vectordb.as_retriever(
    search_type = "similarity",
    search_kwargs = {"k": CFG.k}
)

qa_chain = RetrievalQA.from_chain_type(
    llm = llm,
    chain_type = "stuff", # map_reduce, map_rerank, stuff, refine
    retriever = retriever, 
    chain_type_kwargs = {"prompt": PROMPT},
    return_source_documents = True,
    verbose = False
)

In [30]:
query = "Conte-me sobre as leis de escala para modelos de linguagem"
result = llm_ans(query)
clear_output()
print(result)



Aqui está a resposta:

As leis de escala para modelos de linguagem (LLMs) se referem às técnicas e métodos utilizados para desenvolver e utilizar esses modelos. No entanto, não há uma lei específica chamada "leis de escala" aplicável exclusivamente aos LLMs.

No entanto, o artigo discute a escalabilidade dos modelos de linguagem, mencionando trabalhos como "Scaling Instruction-Finetuned Language Models" [1] e "TyDiQA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages" [2], que abordam questões relacionadas ao tamanho e complexidade dos modelos de linguagem.

Além disso, o artigo também destaca a importância da adaptação e capacitação dos modelos de linguagem para diferentes tarefas e domínios, bem como a necessidade de avaliação cuidadosa desses modelos para garantir sua eficácia e precisão.

Portanto, embora não haja uma lei específica chamada "leis de escala", os autores do artigo enfatizam a importância da escalabilidade e adaptabilidade dos

In [31]:
query = "o que é 'grouped query attention' e quais são as suas vantagens?"
result = llm_ans(query)
clear_output()
print(result)



No artigo "Rethinking Attention with Performers" (2020), os autores Choromanski et al. discutem sobre o conceito de "Grouped Query Attention". Em resumo, a Grouped Query Attention é uma abordagem de atenção que agrupa queries relacionadas em grupos antes de calcular a atenção. Isso permite capturar relações entre queries mais efetivamente e melhorar o desempenho nos problemas de Question Answering.

As principais vantagens da Grouped Query Attention incluem:

* Melhor performance em problemas de Question Answering, especialmente quando há múltiplas queries relacionadas;
* Capacidade de capturar relações complexas entre queries;
* Redução do overhead computacional em comparação com outras abordagens de atenção;

Essa técnica tem sido utilizada em várias aplicações, incluindo o processamento de linguagem natural e o aprendizado automático.

Sources: 
2204.02311 - page: 52 
2307.03172 - page: 2 
2303.18223 - page: 59 
2304.01373 - page: 13 
1409.3215v3 - page: 6 
2305.03047 - page: 19



In [32]:
query = "Me explique Rotary Position Embeddings como se eu tivesse 10 anos"
result = llm_ans(query)
clear_output()
print(result)



Entendi!

O Rotary Position Embedding (RoPE) é uma maneira de incluir informações sobre a posição dos tokens (palavras ou símbolos) dentro de uma sequência de texto na linguagem natural. Isso ajuda o modelo a entender melhor onde cada palavra está relacionada às outras e como elas estão conectadas.

Imagine você está lendo uma história e quer saber qual personagem está falando agora. O RoPE ajuda a capturar essa informação de posição, mostrando quais palavras estão próximas ou distantes entre si.

Para fazer isso, os desenvolvedores criaram uma matriz chamada "sinusoid encoding" (ou encoding sinusoide) que transforma as posições absolutas em relativas. Isso significa que, ao invés de usar números fixos para representar a posição de cada palavra, o RoPE usa valores que mudam gradualmente ao longo da sequência.

Além disso, o RoPE também permite que os modelos aprendam a considerar a importância das palavras em diferentes posições. Por exemplo, uma palavra pode ser mais importante se e

In [33]:
query = "você pode me dar algumas dicas sobre como podar redes neurais adequadamente??"
result = llm_ans(query)
clear_output()
print(result)



Entendi!

Dessa forma, posso fornecer algumas dicas sobre como podar redes neurais adequadas:

* **Regularização**: A regularização é um método importante para evitar overfitting em redes neurais. Isso pode ser feito ajustando os parâmetros da rede, como a taxa de aprendizado ou a penalidade por complexidade.
* **Dropout**: O dropout é outro método eficaz para reduzir a complexidade das redes neurales. Ele envolve desativar aleatoriamente certos neurons durante treinamento, o que ajuda a prevenir overfitting.
* **Early stopping**: O early stopping consiste em interromper o treinamento da rede quando ela atinge um ponto de equilibrio entre precisão e generalização. Isso evita que a rede se torne muito especializada nos dados de treino.
* **L1 e L2 regularization**: Essas são técnicas de regularização que adicionam uma penalidade à função de custo da rede neural para evitar pesos grandes e complexos.
* **Batch normalization**: A normalização por batch ajuda a estabilizar o treinamento 

In [34]:
query = "o que é Sequence to Sequence Learning em redes neurais?"
result = llm_ans(query)
clear_output()
print(result)



A sequência para sequência é um tipo de aprendizado de sequências em redes neurais que envolve mapear sequências de entrada para sequências de saída. Isso pode ser aplicado em tarefas como tradução de texto, geração de texto e outros problemas de processamento de linguagem natural.

Sources: 
1409.3215v3 - page: 0 
1909.08053 - page: 9 
2005.14165 - page: 1 
10000000_662098952474184_2584067087619170692_n - page: 1 
2307.09288 - page: 1 
2203.16634 - page: 7

Time elapsed: 8 s


In [35]:
query = "qual é a diferença entre DistilBERT e BERT?"
result = llm_ans(query)
clear_output()
print(result)



Infelizmente, não há informações sobre a diferença entre DistilBERT e BERT na seção fornecida. No entanto, posso fornecer alguma informação geral sobre essas tecnologias:

* BERT (Bidirectional Encoder Representations from Transformers) é um modelo de linguagem pre-treinado por meio da técnica de treinamento auto-supervisionada, que utiliza uma rede neural baseada em transformers para aprender representações de texto.
* DistillBERT é uma versão mais leve e eficiente do modelo BERT, projetado para ser mais fácil de implementar em dispositivos móveis e computadores pessoais. Ele foi treinado usando uma técnica chamada "distillation", que envolve transferir conhecimento do modelo original BERT para um modelo menor e mais rápido.

Se você deseja saber mais sobre a diferença entre os dois modelos, sugiero pesquisar outros recursos online ou consultar estudos científicos publicados sobre o assunto.

Sources: 
2005.14165 - page: 1 
2211.09110 - page: 30 
palm2techreport - page: 82 
2211.091

# Conclusion

- A simple RAG pipeline over 100 LLM papers, using the Llama-3 8B Instruct model