<a href="https://colab.research.google.com/github/jyotirmaya/Domain-Agnostic-Sentence-Specificity-Prediction/blob/master/DataPhoenix_Simple_LangChain_RAG_Pipeline_with_Llama_3_and_Arctic_Embeddings_Notebook_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG or Fine-Tuning: When and Why?

Today we'll explore when and why we would want to use fine-tuning, vs. RAG, and why the answer is often to use both at the same time!

For our fine-tuning use-case, we'll be leveraging [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) on a summarization task in a particular style for a particular domain!

For our RAG use-case, we'll be looking at doing retrieval across a recent complaint filed by Elon Musk.



# Fine-tuning Example

We'll start, as we always do, with some dependencies!

In [None]:
!pip install -qU transformers peft trl accelerate bitsandbytes datasets


Next, let's set-up some data!

## Data

We'll be using the Legal Summarization [dataset](https://github.com/lauramanor/legal_summarization) from the paper [Plain English Summarization of Contracts](https://www.aclweb.org/anthology/W19-2201) today.

This dataset contains pairs in the following format:

- Original Text: A blob of legal text, think ToS
- Reference Summary: A short natural language summary of the legal text

We'll start by cloning the repository containing our data.

In [None]:
!git clone https://github.com/lauramanor/legal_summarization.git

Cloning into 'legal_summarization'...
remote: Enumerating objects: 31, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 31 (delta 2), reused 0 (delta 0), pack-reused 25
Receiving objects: 100% (31/31), 136.60 KiB | 19.51 MiB/s, done.
Resolving deltas: 100% (10/10), done.


Let's convert this into an expected format - in this case a list of `json` objects.

In [None]:
import json

jsonl_array = []

with open('legal_summarization/tldrlegal_v1.json') as f:
  data = json.load(f)
  for key, value in data.items():
    jsonl_array.append(value)

Now we can convert that into the desired format for our fine-tuning - a Hugging Face `Dataset`!

In [None]:
from datasets import Dataset, load_dataset

legal_dataset = Dataset.from_list(jsonl_array)

Let's see how many items we're working with in our dataset.

> NOTE: Keep in mind that this is a relatively simple example meant to showcase fine-tuning - in practice, we'd want to use somewhere in the neighbourhood of ~500-50,000 examples.

In [None]:
legal_dataset

Dataset({
    features: ['doc', 'id', 'original_text', 'reference_summary', 'title', 'uid'],
    num_rows: 85
})

Let's look at an example of our original text and summary!

In [None]:
print(f"Original Text: {legal_dataset[0]['original_text']}\n\nSummary: {legal_dataset[0]['reference_summary']}")

Original Text: welcome to the pokémon go video game services which are accessible via the niantic inc niantic mobile device application the app. to make these pokémon go terms of service the terms easier to read our video game services the app and our websites located at http pokemongo nianticlabs com and http www pokemongolive com the site are collectively called the services. please read carefully these terms our trainer guidelines and our privacy policy because they govern your use of our services.

Summary: hi.


Now, we mentioned earlier we were going to fine-tune meta-llama/Meta-Llama-3-8B-Instruct, which is important for our next step: Creating the instruction format.

Let's take a look at the instructions (so meta) to generate an instruction prompt from the [repository](https://github.com/meta-llama/llama3?tab=readme-ov-file#instruction-tuned-models)


> The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in [`ChatFormat`](https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L202) needs to be followed: The prompt begins with a `<|begin_of_text|>` special token, after which one or more messages follow. Each message starts with the `<|start_header_id|>` tag, the role `system`, `user` or `assistant`, and the `<|end_header_id|>` tag. After a double newline "\n\n" the contents of the message follow. The end of each message is marked by the `<|eot_id|>` token.

Let's look at an example of how we might format our instruction - and then reproduce that in code.

```python
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Please convert the following legal content into a human-readable summary<|eot_id|><|start_header_id|>user<|end_header_id|>

[LEGAL_DOC]
welcome to the pokémon go video game services which are accessible via the niantic inc niantic mobile device application the app. to make these pokémon go terms of service the terms easier to read our video game services the app and our websites located at http pokemongo nianticlabs com and http www pokemongolive com the site are collectively called the services. please read carefully these terms our trainer guidelines and our privacy policy because they govern your use of our services.
[END_LEGAL_DOC]<|eot_id|><|start_header_id|>assistant<|end_header_id|>

hi.<|eot_id|>
```

> NOTE: We're adding our own delimiters here, `[LEGAL_DOC]` and `[END_LEGAL_DOC]`, to help the model identify the document it should summarize. These are not special tokens already known to the model, so they are tokenized as ordinary text.
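
To make the quoted format description concrete, here's a minimal sketch that renders a list of messages in that layout using plain string formatting. The helper name `format_llama3_prompt` and its `add_generation_prompt` flag are our own, hypothetical choices for illustration. In practice, the tokenizer's `apply_chat_template` method is the usual way to produce this string.

```python
def format_llama3_prompt(messages, add_generation_prompt=False):
    """Render {"role", "content"} dicts in the Llama 3 chat layout (illustrative helper)."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each message: header tags around the role, a double newline, the content, then <|eot_id|>.
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model knows it is its turn to respond.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


example_prompt = format_llama3_prompt(
    [
        {"role": "system", "content": "Please convert the following legal content into a human-readable summary"},
        {"role": "user", "content": "[LEGAL_DOC]\nsome legal text\n[END_LEGAL_DOC]"},
    ],
    add_generation_prompt=True,
)
print(example_prompt)
```

Passing `add_generation_prompt=False` and appending an `assistant` message instead gives the full training-style example, response included.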

In [None]:
INSTRUCTION_PROMPT_TEMPLATE = """\
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Please convert the following legal content into a human-readable summary<|eot_id|><|start_header_id|>user<|end_header_id|>

[LEGAL_DOC]{LEGAL_TEXT}[END_LEGAL_DOC]<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

RESPONSE_TEMPLATE = """\
{NATURAL_LANGUAGE_SUMMARY}<|eot_id|><|end_of_text|>"""

Now we can create a helper function that will convert our dataset row into the above prompt!

In [None]:
def create_instruction(sample, return_response=True):
  prompt = INSTRUCTION_PROMPT_TEMPLATE.format(LEGAL_TEXT=sample["original_text"])

  if return_response:
    prompt += RESPONSE_TEMPLATE.format(NATURAL_LANGUAGE_SUMMARY=sample["reference_summary"])

  return prompt

Let's try it out!

In [None]:
create_instruction(legal_dataset[0])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nPlease convert the following legal content into a human-readable summary<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n[LEGAL_DOC]welcome to the pokémon go video game services which are accessible via the niantic inc niantic mobile device application the app. to make these pokémon go terms of service the terms easier to read our video game services the app and our websites located at http pokemongo nianticlabs com and http www pokemongolive com the site are collectively called the services. please read carefully these terms our trainer guidelines and our privacy policy because they govern your use of our services.[END_LEGAL_DOC]<|eot_id|><|start_header_id|>assistant<|end_header_id|>hi.<|eot_id|><|end_of_text|>'

We'll partition our dataset so we can test some of the outputs after we've completed our training.

In [None]:
prepared_legal_dataset = legal_dataset.train_test_split(test_size=0.1)

In [None]:
prepared_legal_dataset

DatasetDict({
    train: Dataset({
        features: ['doc', 'id', 'original_text', 'reference_summary', 'title', 'uid'],
        num_rows: 76
    })
    test: Dataset({
        features: ['doc', 'id', 'original_text', 'reference_summary', 'title', 'uid'],
        num_rows: 9
    })
})

## Loading Our Model

Now we can move onto loading our model!

We'll rely on two key techniques to train our model with <=16GB of GPU RAM.

1. Quantization
2. LoRA

> NOTE: We've done some events on [LoRA](https://www.youtube.com/watch?v=kV8yXIUC5_4&list=PLrSHiQgy4VjGMzyXsSlvN-TjPaqFFsAGP&index=4) and [QLoRA](https://www.youtube.com/watch?v=XOb-djcw6hs&list=PLrSHiQgy4VjGMzyXsSlvN-TjPaqFFsAGP&index=5) for deeper dives into those respective technologies
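
To get a feel for why quantization matters for our 16GB budget, here's a rough back-of-the-envelope sketch. It assumes roughly 8 billion parameters (an approximation) and counts only the raw weights. Activations, optimizer state, LoRA adapters, and quantization metadata all add to the real footprint, so treat these as lower bounds.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed just to store the weights, in gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: a rough parameter count for an "8B" model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # half-precision weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16 weights: ~{fp16_gb:.0f} GB, 4-bit weights: ~{nf4_gb:.0f} GB")
```

Under these assumptions, the half-precision weights alone would fill our whole budget, while the 4-bit weights leave room for the adapters and training overhead.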

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)


We'll load our tokenizer and do a few pre-processing steps to prepare it for training.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Now we can set up our LoRA configuration - which will let the TRL library know how to create our LoRA adapters!

In [None]:
from peft import LoraConfig

peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
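
To see what `r=32` and `lora_alpha=64` buy us, here's a small sketch of the arithmetic behind a single LoRA adapter. The 4096 x 4096 layer shape is an illustrative assumption, not a value read from the model, and `lora_param_count` is a hypothetical helper. LoRA freezes the original weight matrix and trains two low-rank matrices whose product is scaled by `lora_alpha / r`.

```python
def lora_param_count(d_out, d_in, r):
    """Trainable parameters for one adapter: B is (d_out x r), A is (r x d_in)."""
    return r * (d_out + d_in)


d_out = d_in = 4096       # assumption: an illustrative square projection layer
r, lora_alpha = 32, 64    # values from our LoraConfig

frozen_params = d_out * d_in
adapter_params = lora_param_count(d_out, d_in, r)
scaling = lora_alpha / r  # multiplier applied to the low-rank update

print(f"{adapter_params:,} trainable vs {frozen_params:,} frozen ({adapter_params / frozen_params:.2%})")
print(f"scaling factor: {scaling}")
```

For a layer of this shape, the adapter trains only a small fraction of the parameters it sits next to, which is where the memory savings during training come from.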

### Fine-tuning!

Now onto the star of today's show: Fine-tuning!

We're going to use the `SFTTrainer` or "Supervised Fine-tuning Trainer" from the [TRL](https://github.com/huggingface/trl/tree/main) library today.

In essence, this is a trainer that will handle most of the fine-tuning process for us - including but not limited to:

- Dataset packing
- LoRA initialization
- Tokenizing

Let's set up some training hyper-parameters through transformers `TrainingArguments` class to get started. Here's a breakdown of which parameters are doing what:

- `output_dir` - indicates where we store the results of training locally
- `num_train_epochs` - how many epochs we will train for (somewhere ~3-4 is a good place to start)
- `per_device_train_batch_size` - how many samples go into each batch on each device. In this case, we only have one device - but we set this to a low value to keep memory consumption below 16GB GPU RAM
- `gradient_accumulation_steps` - this hyper-parameter lets us indicate how many steps we wish to accumulate our gradients for, this is a way to "cheat out" a larger batch size without extra memory consumption
- `gradient_checkpointing` - this lets us [trade off memory consumption for increased training time](https://huggingface.co/docs/transformers/v4.18.0/en/performance#gradient-checkpointing)
- `optim` - our optimizer! In this case, we're using a fused AdamW optimizer. The fused implementation is a faster version of the AdamW optimizer but relies on CUDA (GPU). More information can be read [here](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html)

The rest of the hyper-parameters are taken directly from the QLoRA [paper](https://arxiv.org/abs/2305.14314) and are discussed in more detail there!
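
As a quick sanity check on how the batch size and gradient accumulation settings interact, here's a minimal sketch of the effective batch size we end up with. The values mirror the ones used in this notebook, and `num_devices = 1` reflects our single-GPU assumption.

```python
per_device_train_batch_size = 2  # samples per forward/backward pass on each device
gradient_accumulation_steps = 2  # passes accumulated before each optimizer update
num_devices = 1                  # assumption: a single GPU

# Each optimizer update "sees" this many samples, even though only
# per_device_train_batch_size of them are held in memory at once.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

print(f"Effective batch size: {effective_batch_size}")
```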

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="leagaleasy-llama-3-instruct-v1",
    num_train_epochs=4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
    save_strategy="epoch",
    learning_rate=2e-4,
    fp16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    push_to_hub=True,
)

Because we're going to automatically push our model to the hub, thanks to `push_to_hub=True`, we'll want to provide a Hugging Face Write token.

> NOTE: You can skip this step by commenting out `push_to_hub=True`

Now, finally, we can set up our `SFTTrainer`, which is going to help us fine-tune this model on the dataset we created at the beginning of the notebook!

We'll discuss a few parameters to clarify what they're doing:

- `formatting_func` - since we created a helper function to convert our dataset row into a Llama 3-style instruction prompt, we need to let TRL know to use it to create our prompts!
- `peft_config` - TRL will automatically load our model in LoRA format with this config.
- `packing` - this will let our model "pack" the context window to ensure we're not wasting precious memory on padding tokens where possible
- `dataset_kwargs` - because we already added the special tokens to our prompts, we want to ensure we don't "re-add" them!

With those parameters set - we're clear for training!

In [None]:
from trl import SFTTrainer

max_seq_length=2048

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=prepared_legal_dataset["train"],
    formatting_func=create_instruction,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    dataset_kwargs={
        "add_special_tokens" : False,
        "append_concat_token" : False,
    }
)




All that's left to do is fine-tune our model!

In [None]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.






TrainOutput(global_step=8, training_loss=1.4547849893569946, metrics={'train_runtime': 44.2195, 'train_samples_per_second': 0.905, 'train_steps_per_second': 0.181, 'total_flos': 2984041808658432.0, 'train_loss': 1.4547849893569946, 'epoch': 3.2})

Now we can save it.

In [None]:
trainer.save_model()


Let's clear up memory so we can do inference while staying under our memory budget.

In [None]:
del model
del trainer
torch.cuda.empty_cache()

We'll need to load our model back as a PEFT model, due to the use of LoRA, and then merge the LoRA layers back into the original model for use in Hugging Face pipelines.

In [None]:
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    args.output_dir,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"
)

merged_model = model.merge_and_unload()


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Now we can load our pipeline for `text-generation`.

In [None]:
from transformers import pipeline

summary_pipe = pipeline("text-generation", merged_model, tokenizer=tokenizer, max_new_tokens=256, return_full_text=False)

## Testing Fine-tuned Model

Now that we've fine-tuned, let's see how we did!

In [None]:
prepared_legal_dataset["test"][0]["original_text"]

'to the extent permitted by applicable law neither niantic nor tpc or tpci or any other party involved in creating producing or delivering the services or content will be liable to you for any indirect incidental special punitive exemplary or consequential damages including lost profits loss of data or goodwill service interruption computer damage or system failure or the cost of substitute services arising out of or in connection with these terms or from the use of or inability to use the services or content or from any communications interactions or meetings with other users of the services or persons with whom you communicate or interact as a result of your use of the services whether based on warranty contract tort including negligence product liability or any other legal theory and whether or not niantic tpc or tpci have been advised of the possibility of such damages even if a limited remedy set forth herein is found to have failed of its essential purpose. some jurisdictions do 

In [None]:
outputs = summary_pipe(create_instruction(prepared_legal_dataset["test"][0], return_response=False), do_sample=True, max_new_tokens=256, temperature=0.1, top_k=50)

In [None]:
outputs[0]["generated_text"].split("[LEGAL_DOC]")[-1]

'niantic tpc and tpci are not liable for indirect incidental punitive or consequential damages including lost profits or data or goodwill. they are only liable for direct damages up to 1 000.'

In [None]:
prepared_legal_dataset["test"][0]["reference_summary"]

'as much as the law permits it s not our fault if people lose money or data. if it is our fault we can t owe you more than 1000.'

Another example!

In [None]:
prepared_legal_dataset["test"][5]["original_text"]

'you may not use user data from our apis for advertising purposes unless i you are explicitly authorized by google or ii you are using an advertising solution that google provides for this purpose. you may not and may not permit any third party to sell or transmit any user data received from our apis including anonymized aggregate or derivative data to any third party ad network or service data broker or other advertising or marketing provider.'

In [None]:
outputs = summary_pipe(create_instruction(prepared_legal_dataset["test"][1], return_response=False), do_sample=True, max_new_tokens=256, temperature=0.1, top_k=50)

In [None]:
outputs[0]["generated_text"].split("[LEGAL_DOC]")[-1]

'you agree that your use of the services is at your own risk. youtube disclaims all warranties express or implied and is not liable for any damages including personal injury or property damage of any kind.'

In [None]:
prepared_legal_dataset["test"][1]["reference_summary"]

'there are no warranties and we are not liable for anything bad that happens when using youtube.'

Overall, our fine-tuned model produces concise summaries that cover the same key points as the reference summaries - all with <16GB GPU memory, and 4 epochs of fine-tuning!

# Retrieval Augmented Generation



## Installing Required Libraries

One of the [key features](https://blog.langchain.dev/langchain-v0-1-0/) of LangChain v0.1.0 is the compartmentalization of the various LangChain ecosystem packages.

Instead of one all-encompassing Python package - LangChain has a `core` package and a number of additional supplementary packages.

We'll start by grabbing all of our LangChain related packages!

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!pip install -qU langchain langchain-core langchain-community sentence_transformers


Let's finally get `tiktoken` and `pymupdf` so we can leverage them later on!

In [None]:
!pip install -qU tiktoken pymupdf bitsandbytes accelerate


### Data Collection

We'll be leveraging the `PyMuPDFLoader` to load our PDF!

In [None]:
!git clone https://github.com/AI-Maker-Space/DataRepository.git

Cloning into 'DataRepository'...
remote: Enumerating objects: 62, done.
remote: Counting objects: 100% (54/54), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 62 (delta 16), reused 29 (delta 8), pack-reused 8
Receiving objects: 100% (62/62), 51.51 MiB | 13.28 MiB/s, done.
Resolving deltas: 100% (16/16), done.


In [None]:
from langchain_community.document_loaders import PyMuPDFLoader

docs = PyMuPDFLoader("DataRepository/MuskComplaint.pdf").load()

### Chunking Our Documents

We'll use the `RecursiveCharacterTextSplitter` to create our toy example.

It will split based on the following rules:

- Each chunk has a maximum size of 200 tokens
- It will try and split first on the `\n\n` character, then on the `\n`, then on the `<SPACE>` character, and finally it will split on individual characters.

Let's implement it and see the results!

In [None]:
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

def tiktoken_len(text):
    tokens = tiktoken.encoding_for_model("gpt-3.5-turbo").encode(
        text,
    )
    return len(tokens)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 200,
    chunk_overlap = 0,
    length_function = tiktoken_len,
)

split_chunks = text_splitter.split_documents(docs)

In [None]:
len(split_chunks)

189
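
To make the splitting rules concrete, here's a simplified sketch of the recursive idea in plain Python. This is not LangChain's actual implementation, and `recursive_split` is a hypothetical helper. It measures length in characters by default (we used a token-based length function above), and it doesn't handle overlap.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ", ""), length=len):
    """Split text into pieces no longer than max_len, trying coarse separators first."""
    if length(text) <= max_len:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    # The empty separator means "fall back to individual characters".
    parts = list(text) if sep == "" else text.split(sep)
    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if length(candidate) <= max_len:
            current = candidate  # keep merging small pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if length(part) <= max_len:
            current = part
        elif rest:
            # Still too long: retry this piece with the next, finer separator.
            chunks.extend(recursive_split(part, max_len, rest, length))
        else:
            chunks.append(part)  # nothing finer to split on
    if current:
        chunks.append(current)
    return chunks


toy_text = "alpha beta\n\ngamma delta epsilon\nzeta"
toy_chunks = recursive_split(toy_text, max_len=12)
print(toy_chunks)
```

Every chunk respects the limit, and text is only broken at a finer separator when a coarser one can't produce a small enough piece.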

## Embeddings and Dense Vector Search

Now that we have our individual chunks, we need a system to correctly select the relevant pieces of information to answer our query.

This sounds like a perfect job for embeddings!

If you come from an NLP background, embeddings are something you might be intimately familiar with - otherwise, you might find the topic a bit...dense. (this attempt at a joke will make more sense later)

In all seriousness, embeddings are a powerful piece of the NLP puzzle, so let's dive in!

> NOTE: While this notebook is language/NLP-centric, embeddings have uses beyond just text!

We're going to be using Snowflake's `snowflake-arctic-embed-m` today.

In order to choose our embeddings model, we'll refer to the MTEB leaderboard - which can be found [here](https://huggingface.co/spaces/mteb/leaderboard)!

The basic logic is: we sort by our desired task, in this case `Retrieval Average (15 Datasets)`, and pick a model that performs well on it. With cost in mind, we'll go with `snowflake-arctic-embed-m` over `snowflake-arctic-embed-l`: only ~5 points separate the two on this task, but the `medium` version is significantly cheaper to run.

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(
    model_name="ai-maker-space/snowflake-ft",
    model_kwargs={"device" : "cuda"}
)


Now we can set-up our `VectorStore`! We'll be using Meta's FAISS to power our dense vector search today.

In [None]:
!pip install -qU faiss-cpu


In [None]:
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(split_chunks, embedding_model)

Now we can convert our vector store into a retriever!

In [None]:
retriever = vector_store.as_retriever()
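
Conceptually, a dense retriever embeds the query and returns the chunks whose vectors sit closest to it. Here's a toy sketch of that idea with hand-written 2-D vectors and cosine similarity. The vectors, the `top_k` helper, and the choice of `k` are illustrative assumptions. A real index such as FAISS may use a different distance metric (for example Euclidean distance) and far more efficient search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy "embeddings": in the notebook these come from the embedding model.
toy_doc_vectors = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
toy_query_vector = [0.9, 0.1]

nearest = top_k(toy_query_vector, toy_doc_vectors, k=2)
print(nearest)
```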

### Setting up our RAG

We'll use the LangChain Expression Language (LCEL) to create a RAG chain.

Let's think through each part:

1. First we need to retrieve context
2. We need to pipe that context to our model
3. We need to parse that output

Let's start by setting up our model!

#### Setting up Llama 3 8B Instruct

Today, we'll be using: [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)

This is a great new model from Meta that we can use to power our RAG application!

First, we need to load our tokenizer for our model!

In [None]:
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Next, we'll load the model itself to prepare it for our Hugging Face pipeline!

In [None]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)


Next we'll be using our Hugging Face `pipeline` to load our model for inference!

In [None]:
from transformers import pipeline

text_pipe = pipeline("text-generation", model, tokenizer=tokenizer, max_new_tokens=256, return_full_text=False)

Now we can connect our LLM to LangChain to be used in our pipeline!

In [None]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

llm_pipeline = HuggingFacePipeline(pipeline=text_pipe, pipeline_kwargs={"max_new_tokens" : 256})

Let's see how this works by attaching it to a prompt!

In [None]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Please use the context provided to answer the question simply. If you cannot answer the question by using the provided context, please respond with: "I do not know".<|eot_id|><|start_header_id|>user<|end_header_id|>

CONTEXT:
{context}

QUERY:
{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

In [None]:
INSTRUCTION_PROMPT_TEMPLATE = """\
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Please convert the following legal content into a human-readable summary<|eot_id|><|start_header_id|>user<|end_header_id|>

[LEGAL_DOC]{LEGAL_TEXT}[END_LEGAL_DOC]<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

In [None]:
def summary(text: str) -> str:
  # Rewrite the RAG answer with the fine-tuned summarization pipeline from the first half of the notebook
  output = summary_pipe(INSTRUCTION_PROMPT_TEMPLATE.format(LEGAL_TEXT=text), do_sample=True, max_new_tokens=256, temperature=0.1, top_k=50)
  return output[0]["generated_text"].split("[LEGAL_DOC]")[-1]

#### Our RAG Chain

Notice how we have a bit of a more complex chain this time - that's because we want to return our sources with the response.

Let's break down the chain step-by-step:

1. We invoke the chain with the `question` item. Notice how we only need to provide `question` since both the retriever and the `"question"` object depend on it.
  - We also chain our `"question"` into our `retriever`! This is what ultimately collects the context!
2. We assign our collected context to a `RunnablePassthrough()` from the previous object. This is going to let us simply pass it through to the next step, but still allow us to run that section of the chain.
3. We finally collect our response by chaining our prompt, which expects both a `"question"` and `"context"`, into our `llm_pipeline`, and then pass the generated answer through the `summary` function so our fine-tuned model can rewrite it. We also collect the `"context"` again so we can output it in the final response object.

The key thing to keep in mind here is that we need to pass our context through *after* we've retrieved it - to populate the object in a way that doesn't require us to call it or try and use it for something else.
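
If the data flow is hard to picture, here's a plain-Python sketch of the same three steps with stand-in functions. Everything here (`fake_retriever`, `fake_llm`, `fake_summary`, `run_chain`, and the shortened prompt) is hypothetical and only mimics the shape of the data moving through the chain. It doesn't use LangChain or a real model.

```python
def fake_retriever(question):
    # Stand-in for `retriever`: returns a list of context chunks for the question.
    return ["chunk about topic A", "chunk about topic B"]


def fake_llm(prompt):
    # Stand-in for the LLM pipeline: returns a string "answer".
    return f"answer based on a prompt of {len(prompt)} characters"


def fake_summary(text):
    # Stand-in for the fine-tuned summarizer applied to the answer.
    return text.upper()


PROMPT = "CONTEXT:\n{context}\n\nQUERY:\n{question}"


def run_chain(inputs):
    # Step 1: send the question to the retriever, and keep the question itself.
    state = {"context": fake_retriever(inputs["question"]), "question": inputs["question"]}
    # Step 2: pass the retrieved context through unchanged to the next step.
    state = {**state, "context": state["context"]}
    # Step 3: build the prompt, generate, post-process, and return the sources too.
    response = fake_summary(fake_llm(PROMPT.format(**state)))
    return {"response": response, "context": state["context"]}


result = run_chain({"question": "What is topic A?"})
print(result)
```

The returned dictionary has the same two keys as our real chain's output: the final `response` and the `context` it was grounded in.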

In [None]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retrieval_augmented_qa_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | llm_pipeline | StrOutputParser() | summary | StrOutputParser(), "context": itemgetter("context")}
)

Let's test our chain out!

In [None]:
response = retrieval_augmented_qa_chain.invoke({"question" : "What issues does Elon have with OpenAI?"})

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Let's see that result!

In [None]:
response["response"]

"\n\nElon Musk is concerned that OpenAI's non-profit status and lack of transparency around intellectual property rights could hinder innovation and his own company's ability to continue developing products if OpenAI were to cease operations."

Let's look at another query!

In [None]:
response = retrieval_augmented_qa_chain.invoke({"question" : "Who is, and what is, fiduciary duty?"})

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [None]:
response["response"]

'\n\nElon Musk is owed a duty by the defendants under California law.'

In [None]:
response["context"]

[Document(page_content='Attorneys for Plaintiff Elon Musk', metadata={'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 34, 'total_pages': 46, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modDate': '', 'trapped': ''}),
 Document(page_content='exceeds this Court’s jurisdictional minimum of $35,000.', metadata={'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 29, 'total_pages': 46, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modDate': '', 'trapped': ''}),
 Document(page_content='1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \n12 \n13 \n14 \n15 \n16 \n17 \n18 \n19 \n20 \n21 \n22 \n23 \n24 \n25 \n26 \n27 \n28 \n \n \n– 31 – \nCOMPLAINT \n \nTHIRD CAUSE OF ACTION \nBreach of Fiduciary Duty  \nAgainst All Defendants \n