# RAG Project

LangChain Tutorials: https://python.langchain.com/docs/tutorials/

LangChain HowTos: https://python.langchain.com/docs/how_to/

LangChain Conceptual Guide: https://python.langchain.com/docs/concepts/#retrieval

High-Level Overview of RAG: https://python.langchain.com/docs/tutorials/rag/



In [None]:
%pip install --quiet --upgrade bitsandbytes langchain langchain-community langchain-huggingface transformers beautifulsoup4 faiss-gpu

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.4/44.4 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.6/50.6 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m122.4/122.4 MB[0m [31m7.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m48.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.4/2.4 MB[0m [31m76.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.9/9.9 MB[0m [31m100.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m85.5/85.5 MB[0m [31m8.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m407.7/407.7 kB[0m [31m20.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

## **Part 1: Setup LLM**

**Set up LLM**
- https://python.langchain.com/docs/integrations/providers/huggingface/

<br/>
<br/>

<u>**Experimentation and Improvements** (not very important)</u>
- Find a better LLM that also fits into memory
  - Mistral-Small-Instruct-2409: Positioned Between Mistral NeMo 12B and Mistral Large 123B (crashed)
    - https://www.marktechpost.com/2024/09/18/mistral-ai-released-mistral-small-instruct-2409-a-game-changing-open-source-language-model-empowering-versatile-ai-applications-with-unmatched-efficiency-and-accessibility/
    - https://huggingface.co/mistralai/Mistral-Small-Instruct-2409
  - Mistral Nemo (Mistral-Nemo-Instruct-2407) (12 mins with quantization)
      - https://www.reddit.com/r/LocalLLaMA/comments/1eg5j2t/which_small_model_12b_do_you_guys_are_using_for/
      - https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
  - Mistral 7B Quantized, Mistral 7B Instruct
    - https://www.reddit.com/r/LocalLLaMA/comments/1av779p/experiences_with_smaller_models_with_rag/

  - Qwen2–72B-Instruct, Qwen1.5-32B-Chat
      - https://medium.com/@naman1011/whats-the-best-llm-to-use-for-rag-476bec1bfa97
      - https://huggingface.co/Qwen/Qwen2.5-72B-Instruct

Quantization
- https://medium.com/@rakeshrajpurohit/model-quantization-with-hugging-face-transformers-and-bitsandbytes-integration-b4c9983e8996

In [None]:
import torch
from langchain_huggingface.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from transformers import BitsAndBytesConfig

#from huggingface_hub import notebook_login
#notebook_login()

# We load the quantized weights for faster generation
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model_id = "mistralai/Mistral-Nemo-Instruct-2407"
# model_id = "Qwen/Qwen2.5-0.5B-Instruct" # choose this for faster inference
tokenizer = AutoTokenizer.from_pretrained(model_id)
#model = AutoModelForCausalLM.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,quantization_config=quantization_config)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,max_new_tokens=1000)
llm = HuggingFacePipeline(pipeline=pipe)

tokenizer_config.json:   0%|          | 0.00/181k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.26M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/623 [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now set to True since model is quantized.


model.safetensors.index.json:   0%|          | 0.00/29.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/5 [00:00<?, ?it/s]

model-00001-of-00005.safetensors:   0%|          | 0.00/4.87G [00:00<?, ?B/s]

model-00002-of-00005.safetensors:   0%|          | 0.00/4.91G [00:00<?, ?B/s]

model-00003-of-00005.safetensors:   0%|          | 0.00/4.91G [00:00<?, ?B/s]

model-00004-of-00005.safetensors:   0%|          | 0.00/4.91G [00:00<?, ?B/s]

model-00005-of-00005.safetensors:   0%|          | 0.00/4.91G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## **Part 2: Load Document**

Load Data
- https://python.langchain.com/docs/how_to/#document-loaders

Can try loading data from web pages using the <em>Simple and Fast Parsing Approach</em> to recover one ```Document``` object per webpage
- https://python.langchain.com/docs/how_to/document_loader_web/
- https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html
- https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html

Going to use text document because the web one had too much irrelevant info
- https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.text.TextLoader.html

<br/>
<br/>

<u>Concern for Data collection</u>

- Example: When we ask the model what are the best places for hiking, and we have 10+ documents for hiking destinations, will that be a problem? Maybe we need metadata to denote the location
- Example: When we have 1 document that is all about hiking like 20 best hikes, how much is to be retrieved? -> Experiment with chunk and overlap size, parent-document retrieval approaches

<br/>
<br/>

<u>**Experimentation and Improvements**</u>

<u>1. Decide the categories of documents that we are going to put in (by country) and how many of each</u>
- Activities
- Food
- Accomodation
- Transport
- Places of interest
- etc.

<u>2. Scrape the web better</u>
- Use advanced parsing
- Use lazy loading and async for efficiency
- Specify parameters for BeautifulSoup through bs_kwargs to pick up only body text instead of extra info like navigation bars i.e. Parse web pages better

<u>3. Instead of scraping web, use other sources such as text documents</u>
- Find another way to get data

<u>4. Organize data (either local folder of text documents or through cloud or DB?)</u>

In [None]:
"""
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    web_path = "https://www.earthtrekkers.com/norway-bucket-list-best-things-to-do-in-norway/"
)

docs = loader.load()
"""

from langchain_community.document_loaders import TextLoader
# Change to reading by a directory with all our text files if we are really going ahead with reading from text files
loader = TextLoader(
    file_path = "Norway Example.txt"
)

docs = loader.load()

In [None]:
docs

[Document(metadata={'source': 'Norway Example.txt'}, page_content='15 of the Most Beautiful Fjord Hikes in Norway\n\nGlacial lagoons, verdant forests and soaring mountains - our guide to the best fjord hikes in Norway...\n\nLooking for the best fjord hikes in Norway? Hiking is second nature to Norwegians and with the vast and varied landscapes on their doorstep, it’s easy to see why. From the white-sand beaches of the Lofoten Islands to the dramatic mountains of the south, and the spectacular fjords in between, there’s something to suit all types of hiker. It\'s Norway\'s fjords we\'re going to focusing on today - because many of the best hiking in Norway happens to be along, around or above a fjord.\n\nMany fjords promise spectacular natural views of untouched verdant forests, ice cold lagoons and sweeping views over seemingly endless cliffs and mountains. They\'re also great places to chase the Northern Lights during winter. Here, we\'re going to give you an introduction to the best 

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## **Part 3: Chunking/Split Text**

Split LangChain ```Document``` objects into smaller chunks. This is useful for
- Indexing and retrieving relevant data
- LLM (cannot fit into model's finite context window)
- Overcoming embedding model size limitations
- https://python.langchain.com/docs/concepts/text_splitters/
- https://python.langchain.com/docs/how_to/#text-splitters

Try ```RecursiveCharacterTextSplitter``` first
- Overlap helps to mitigate the possibility of separating a statement from important context related to it
- ```RecursiveCharacterTextSplitter``` recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases
- Set ```add_start_index=True``` so that the character index at which each split Document starts within the initial Document is preserved as metadata attribute ```“start_index”```.
- This is a text-structured based splitting approach: Creating split that maintain natural language flow, maintain semantic coherence within split, and adapts to varying levels of text granularity

<br/>
<br/>

<u>**Experimentation and Improvements**</u>
1. Clean Data
2. Add metadata to chunks
  - This is for self query: https://python.langchain.com/docs/how_to/self_query/
  - I think some useful metadata that could be useful could be activity type and country. Not sure about other metadata
3. Experiment with Chunk size and Overlap size for recursive text splitting
4. Experiment with other splitting strategies: Document-structured based splitting, Semantic meaning based splitting (not sure if relevant)

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True # Following tutorial specs
)
all_splits = text_splitter.split_documents(docs)

In [None]:
all_splits

[Document(metadata={'source': 'Norway Example.txt', 'start_index': 0}, page_content="15 of the Most Beautiful Fjord Hikes in Norway\n\nGlacial lagoons, verdant forests and soaring mountains - our guide to the best fjord hikes in Norway...\n\nLooking for the best fjord hikes in Norway? Hiking is second nature to Norwegians and with the vast and varied landscapes on their doorstep, it’s easy to see why. From the white-sand beaches of the Lofoten Islands to the dramatic mountains of the south, and the spectacular fjords in between, there’s something to suit all types of hiker. It's Norway's fjords we're going to focusing on today - because many of the best hiking in Norway happens to be along, around or above a fjord.\n\nMany fjords promise spectacular natural views of untouched verdant forests, ice cold lagoons and sweeping views over seemingly endless cliffs and mountains. They're also great places to chase the Northern Lights during winter. Here, we're going to give you an introduction

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## **Part 4: Indexing/Storing chunks/splits**

Embed the contents of each chunk/split and insert these embeddings into a vector store
- When we want to search over our splits, we take a text search query, embed it, and perform some sort of “similarity” search to identify the stored splits with the most similar embeddings to our query embedding.
- Simplest similarity measure is cosine similarity. We measure the cosine of the angle between each pair of embeddings (which are high dimensional vectors).
- https://python.langchain.com/docs/how_to/
- https://python.langchain.com/docs/how_to/embed_text/
- https://python.langchain.com/docs/concepts/embedding_models/
- https://python.langchain.com/docs/how_to/vectorstores/
- https://python.langchain.com/docs/concepts/vectorstores/
- https://python.langchain.com/docs/integrations/vectorstores/
  - https://python.langchain.com/docs/integrations/vectorstores/faiss/
  - https://python.langchain.com/api_reference/community/vectorstores/langchain_community.vectorstores.faiss.FAISS.html

Embedding vectors (capture semantic meanings) can be compared easily
- Use ```embed_documents``` to embed multiple texts (documents)
- Use ```embed_query``` to embed a single text (query)
- As they are a set of coords in high-dim space, their similarity can be measured by
  - Cosine Similarity, Euclidean Distance, Dot Product

Vector Store
- Takes care of storing embedding vectors and performing similarity vector search against embedded query, therefore helping to retrieve relevant information based on semantic similarity
- Standard interface for working with vector stores: ```add_documents```, ```delete_documents```, ```similarity_search```
- Choice of similarity metric can sometimes be selected when initlaising vector store: refer to documentation of specific vectorstore we are using
- Choice of similarity search: Given the similarity metric to measure distance between the embedded query and embedded chunks, we need an algo to efficiently search over all embedded chunks. Many vector stores implement ```Hierarchical Navigable Small World```, a graph-based index structure that allows for eficient similarity search. See what search parameters we can add for specific choice of vector store
- Many vector stores support metadata filtering: Allows structured filters to reduce size of similarity search space
- Semantic search and Metadata search work hand in hand

<br/>

For now, use Hugging Face's ```sentence-transformers/all-mpnet-base-v2``` embedding model and ```FAISS``` as the vector store

<br/>
<br/>

<u>**Experimentation and Improvements**</u>

1. Explore other indexes used by FAISS
  - In our HW we played with Nbits for LSH
  - Compared cosine similarity, search time, precision, recall when changing Nbits
2. Explore Hierarchical Navigable Small World (HNSW), knowledge graphs
3. Explore other vector DBs like Chroma, Spotify's Annoy (not a priority unless their search methods are novel)
4. Explore other embedding models (not a priority)
5. IR System Evaluation Metrics (Must do)
  - Binary Relevance
  - Multiple Levels of Relevance

Note: Comparison of FAISS with chroma
- https://medium.com/@stepkurniawan/comparing-faiss-with-chroma-vector-stores-0953e1e619eb




In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore

embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

index = faiss.IndexFlatL2(len(embeddings_model.embed_query("hello world")))

vector_store = FAISS(
    embedding_function=embeddings_model,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

vector_store.add_documents(all_splits)

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

['d49596f6-8f31-4f21-b055-f38fba0cbbd4',
 '68d44cf3-f790-4ce4-9da4-f7d420e15da7',
 'c22be216-987d-4b46-b1a0-224ef063156b',
 'b9f94fad-72f0-4bcb-807b-364f404deaac',
 'c3f037c7-0a10-4fae-99ab-5a2c96880d5a',
 'c296da7b-157e-44aa-b29d-0f166ea04ab2',
 '6bf9b5f7-e568-41eb-af4a-951332232cde',
 '3eab7335-16c9-4be6-a00e-11a8932a026c',
 '8e26dbec-e58c-4c66-a4c6-b38ec228b2e3',
 'f9dccba1-08a5-4ae5-b75d-7421e16b1b7b',
 '52f818b5-6b49-41c6-b737-77a2031c3231',
 'dee565f9-c45c-4e7b-9b81-d75332dcc55c',
 'b1eabe8f-5319-4ae9-8411-ef41c20bef7e',
 'f73fa1f5-51ea-4d77-84c1-31c8fc815226',
 '6a269c5c-1bb7-4736-b0ce-9399cf24b938',
 'c7099369-22b7-4996-89ab-48e66bafb3b9',
 '09df9862-3468-4593-abc5-58ff7415d799',
 'fd5f31b7-2c34-4ace-8362-83060b44a313',
 'bb409ba4-f6c3-4517-8069-ba6c0ed48cbc',
 '91fbcdb2-7a55-44b2-94f0-a69bb8da3e4c',
 'b7fa1077-c404-4b75-b57d-d5f7cc8d891d',
 'e3fbae0c-d614-40cf-9f2f-d502b081a2ba',
 'f0fc0910-fe27-48e9-a6db-9043a86ad154',
 '85befd53-76af-417a-a8ca-c09715a23b78',
 'c4dd5de8-77ae-

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## **Part 5: Retrieval**

Given a user query, retrieve relevant splits from our vector store
- Experimentations (See experimentation and improvements section): https://python.langchain.com/docs/how_to/#retrievers

```Retriever``` interface wraps an index
- Uniform interface for interacting with different types of retrieval systems: Vector stores, Graph databases, Relational databases
- Input: Query string
- Output: List of LangChain Document objects
- Its a runnable (standard interface for LangChain components). Therefore, it has the method ```invoke()``` to invoke it with a query
- A vectorstore can be used as a retriever by calling the ```as_retriever()``` method
- https://python.langchain.com/docs/concepts/retrievers/

Query Analysis: Models transform or construct search queries from raw user input to optimise retrieval
- https://python.langchain.com/docs/concepts/retrieval/
- Query Re-writing
  - Multi-query
  - Decomposition
  - Step-back
  - HyDE
- Query Construction
  - Self Query (use metadata)

Information Retrieval: Search queries are used to fetch information from various retrieval systems.
- https://python.langchain.com/docs/concepts/retrieval/
- Lexical search indexes (based on keywords). Data structure to implement this is called inverted index. Lexical search algorithms include BM25
- Vector indexes (based on word embeddings)


Most common type of ```Retriever``` is the ```VectorStoreRetriever```: Uses the similarity search capabilities of a vector store to facilitate retrieval. Any VectorStore can easily be turned into a Retriever with ```VectorStore.as_retriever()```

<br/>
<br/>


<u>**Experimentation and Improvements**</u>
1. Ensemble/Hybrid Search: Combine multiple retrievers. Particularly useful when you have multiple retrievers that are good at finding different types of relevant documents. We could fist try combining a sparse retriever (BM25 that is keyword based) and dense retriever (embedding similarity)
  - https://python.langchain.com/docs/concepts/retrievers/
  - https://python.langchain.com/docs/how_to/ensemble_retriever/
  - https://python.langchain.com/docs/how_to/hybrid/

2. Source Document Retention Approaches: Retain a link between the transformed document (chunks) and original document, giving the retriever the ability to return the original document. For example, you may use small chunk size for indexing documents in a vectorstore. If you return only the chunks as the retrieval result, then the model will have lost the original document context for the chunks.
  - Multi-Vector retriever: https://python.langchain.com/docs/how_to/multi_vector/
  - ParentDocument retriever: https://python.langchain.com/docs/how_to/parent_document_retriever/

3. Query Rewriting Approaches:
  - https://python.langchain.com/docs/concepts/retrieval/
  - Multi-query
  - Decomposition
  - Step-back
  - HyDE

4. Query Construction Approaches:
  - https://python.langchain.com/docs/concepts/retrieval/
  - Self query

5. Others:
  - Contextual Compression
  - Reorder/Rank Documents
    - https://python.langchain.com/docs/how_to/long_context_reorder/
    - RSV, RSV with smoothing (assume relevance feedback)
    - IDF (assume relevance feedback)
    - BM25 (assume no relevance feedback)
  - Maximal marginal relevance?
  - Multi Vector Retrieval?: https://python.langchain.com/docs/how_to/multi_vector/

6. IR System Evaluation Metrics (Must do)
  - Binary Relevance
  - Multiple Levels of Relevance


A list of techniques in LangChain:
- https://python.langchain.com/docs/how_to/#retrievers


**Basic**

In [None]:
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 10})

retrieved_docs = retriever.invoke("What are the best hikes in Norway?")

In [None]:
retrieved_docs[0].page_content

"In many countries, the best hiking routes are only obtainable to those in the know or with high-tech equipment – this is not the case in Norway, the undisputed home of the fjords! However, with many of the best Norway fjord hikes accessible only by boat or kayak, we would recommend travelling with a guide where possible.\n\nHere are 15 fjords worth hiking on your next Norway walking adventure. These are all day hikes with breathtaking views, many of which are accessible by public transport.\n\n1. Preikestolen, Lysefjord\n\n2. Breiskrednosi Summit Hike, Naerøyfjord\n\n3. Romsdalseggen Ridge, Isfjorden\n\n4. Mount Skåla, Nordfjord\n\n5. Trolltunga (aka The Devil's Tongue), Hardangerfjord\n\n6. Aurlandsdalen Valley, Sognefjord\n\n7. Dronningstien, Hardangerfjord\n\n8. Vidasethovden, Sognefjord\n\n9. Mount Hanguren, Sognefjord/ Hardangerfjord\n\n10. Mount Fløyen, Sognefjord/ Hardangerfjord\n\n11. Mount Ulriken, Sognefjord/ Hardangerfjord\n\n12. Rimstigen, Næroyfjord\n\n13. Urkeegga, Hjøru

<br/>
<br/>
<br/>
<br/>
<br/>

## **Part 5 Experimentation**


<u>**Setup Simple Experiment Data**</u>

In [1]:
%pip install --quiet --upgrade bitsandbytes langchain langchain-community langchain-huggingface transformers beautifulsoup4 faiss-gpu rank_bm25 lark

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m122.4/122.4 MB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.4/2.4 MB[0m [31m64.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m85.5/85.5 MB[0m [31m8.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m111.0/111.0 kB[0m [31m6.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m409.3/409.3 kB[0m [31m20.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.1/3.1 MB[0m [31m54.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.5/49.5 kB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
from langchain_core.documents import Document

docs = [
    Document(
        page_content="The best hikes in Norway include the Reinebringen hike in the Lofoten islands. At a modest 448 meters high, Reinebringen is far from one of the highest peaks on the Lofoten islands. Yet this is more than made up for by the iconic view from the summit of Reine. It is not suitable for winter! Also, the trail can be quite demanding as the steps are quite steep.",
        metadata={"activity": 'Hiking', "country": 'Norway'},
    ),
    Document(
        page_content="Unique hike that can be done are volcanic hikes which can be done in Iceland. It is recommended to go with a tour of experienced people!",
        metadata={"activity": 'Hiking', "country": 'Iceland'},
    ),
    Document(
        page_content="The best seafood in the Nordic region can be found in Norway. The seafood is freshly caught from the arctic ocean. Popular choices include the famous norwegian salmon. Other delicacies include whale steak!",
        metadata={"activity": 'Food', "country": 'Norway'},
    ),
    Document(
        page_content="One of the famous street foods of Iceland is the Hotdog! It is called the Baejarins Beztu Pylsur hot dog is made of a mix of lamb, beef and pork. Other delicacies of iceland include Fish and Chips as well as Tommi's burger.",
        metadata={"activity": 'Food', "country": 'Norway'},
    ),
    Document(
        page_content="Transportation within Reykjavik is fairly convenient as there is a public bus service called BSI. All you need to do is to download their mobile app, follow the instructions, and you're good to go. Transportation to places outside Reykjavik however requires a car. Some options include car rentals as well as booking bus tours.",
        metadata={"activity": 'Transportation', "country": 'Iceland'},
    ),
    Document(
        page_content="Finland is easily accessible with its HSL public transportation services where all you need to do is to download a mobile app and follow the instructions.",
        metadata={"activity": 'Transportation', "country": 'Iceland'},
    ),
    Document(
        page_content="Finland is known for its snowy-like landscape and captivating auroras. One of the best places to stay is the Glass huts in Skyfire village in Rovaniemi, Lapland where you can admire the beautiful northern lights and snowy landscape. The village has its very own restaurant called Sky Huts Restaurant and Bar which offers tailor-made menus by a professional chef using local ingredients.",
        metadata={"activity": 'Accomodation', "country": 'Finland'},
    ),
    Document(
        page_content="A nice place to stay in Norway is the Lofoten Islands, in particlar Unstad which provides a breathtaking view of the mountain valley, ocean, and if you're lucky, northern lights.",
        metadata={"activity": 'Accomodation', "country": 'Norway'},
    ),

]

### **Experiment 1: Hybrid/Ensemble**

- https://python.langchain.com/docs/how_to/ensemble_retriever/

Combine sparse retriever like BM25 that is based on keyword search with dense retriever that is based on embedding similarity/semantic similarity

EnsembleRetrievers support ensembling of results from multiple retrievers (base type BaseRetriever)

They rerank the results of constituent retrievers based on the Reciprocal Rank Fusion algorithm

This common approach, combining a basic keyword and similarity search, could be applicable to our use case, a Nordic-Region information retrieval system, to retrieve relevant (BM25) as well as similar (LSH) results for a start.


In [4]:
from langchain.retrievers import EnsembleRetriever # Supports Ensembling of results from multiple retrievers
from langchain_community.retrievers import BM25Retriever
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore

In [5]:
# Initialise BM25 retreiver -> Ranked retrieval, Probabilistic IR Technique that is non-binary -> takes into account binary presence, but also freq info.
# https://python.langchain.com/docs/integrations/retrievers/bm25/
# https://www.kaggle.com/code/marcinrutecki/rag-ensemble-retriever-in-langchain/notebook#Key-Features-of-BM25
# ranking function used in information retrieval systems to estimate the relevance of documents to a given search query.
bm25_retriever = BM25Retriever.from_documents(
    docs
)
bm25_retriever.k = 5 # number of docs to return from BM_25

# Initialise the FAISS retriever -> Ranked retrieval, Similarity search using Locality Sensitive Hashing with Random Projections
# https://python.langchain.com/docs/integrations/vectorstores/faiss/
embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

index = faiss.IndexFlatL2(len(embeddings_model.embed_query("hello world"))) # Initialise FAISS index with the dimensionality

faiss_vector_store = FAISS(
    embedding_function=embeddings_model,
    index=index, # what index to use
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

faiss_vector_store.add_documents(docs)
faiss_retriever = faiss_vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3}) # num docs to return from FAISS

# Initialise the hybrid/ensemble retriever
# Uses RRF to sum the rankings of each doc from both retrievers, discounting rankings that are lower.
# https://medium.com/@devalshah1619/mathematical-intuition-behind-reciprocal-rank-fusion-rrf-explained-in-2-mins-002df0cc5e2a
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
)

  from tqdm.autonotebook import tqdm, trange


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [6]:
docs = ensemble_retriever.invoke("What are the best hikes?")
docs

[Document(metadata={'activity': 'Hiking', 'country': 'Norway'}, page_content='The best hikes in Norway include the Reinebringen hike in the Lofoten islands. At a modest 448 meters high, Reinebringen is far from one of the highest peaks on the Lofoten islands. Yet this is more than made up for by the iconic view from the summit of Reine. It is not suitable for winter! Also, the trail can be quite demanding as the steps are quite steep.'),
 Document(metadata={'activity': 'Hiking', 'country': 'Iceland'}, page_content='Unique hike that can be done are volcanic hikes which can be done in Iceland. It is recommended to go with a tour of experienced people!'),
 Document(metadata={'activity': 'Accomodation', 'country': 'Finland'}, page_content='Finland is known for its snowy-like landscape and captivating auroras. One of the best places to stay is the Glass huts in Skyfire village in Rovaniemi, Lapland where you can admire the beautiful northern lights and snowy landscape. The village has its v

In [7]:
docs = ensemble_retriever.invoke("What are the best food?")
docs

[Document(metadata={'activity': 'Hiking', 'country': 'Norway'}, page_content='The best hikes in Norway include the Reinebringen hike in the Lofoten islands. At a modest 448 meters high, Reinebringen is far from one of the highest peaks on the Lofoten islands. Yet this is more than made up for by the iconic view from the summit of Reine. It is not suitable for winter! Also, the trail can be quite demanding as the steps are quite steep.'),
 Document(metadata={'activity': 'Food', 'country': 'Norway'}, page_content='The best seafood in the Nordic region can be found in Norway. The seafood is freshly caught from the arctic ocean. Popular choices include the famous norwegian salmon. Other delicacies include whale steak!'),
 Document(metadata={'activity': 'Food', 'country': 'Norway'}, page_content="One of the famous street foods of Iceland is the Hotdog! It is called the Baejarins Beztu Pylsur hot dog is made of a mix of lamb, beef and pork. Other delicacies of iceland include Fish and Chips 

**Questions about Hybrid/Ensemble**
- Why does it only return 4 examples
- For the food query, why is a hiking related document being returned first?

**Investigating BM25**

In [20]:
bm25_retriever = BM25Retriever.from_documents(
    docs
)
bm25_retriever.k = 5 # number of docs to return

bm25_retriever.invoke("what are the best food?")

[Document(metadata={'activity': 'Hiking', 'country': 'Norway'}, page_content='The best hikes in Norway include the Reinebringen hike in the Lofoten islands. At a modest 448 meters high, Reinebringen is far from one of the highest peaks on the Lofoten islands. Yet this is more than made up for by the iconic view from the summit of Reine. It is not suitable for winter! Also, the trail can be quite demanding as the steps are quite steep.'),
 Document(metadata={'activity': 'Food', 'country': 'Norway'}, page_content='The best seafood in the Nordic region can be found in Norway. The seafood is freshly caught from the arctic ocean. Popular choices include the famous norwegian salmon. Other delicacies include whale steak!'),
 Document(metadata={'activity': 'Accomodation', 'country': 'Finland'}, page_content='Finland is known for its snowy-like landscape and captivating auroras. One of the best places to stay is the Glass huts in Skyfire village in Rovaniemi, Lapland where you can admire the be

In [19]:
bm25_retriever = BM25Retriever.from_documents(
    docs
)
bm25_retriever.k = 5 # number of docs to return

bm25_retriever.invoke("food")

[Document(metadata={'activity': 'Accomodation', 'country': 'Finland'}, page_content='Finland is known for its snowy-like landscape and captivating auroras. One of the best places to stay is the Glass huts in Skyfire village in Rovaniemi, Lapland where you can admire the beautiful northern lights and snowy landscape. The village has its very own restaurant called Sky Huts Restaurant and Bar which offers tailor-made menus by a professional chef using local ingredients.'),
 Document(metadata={'activity': 'Hiking', 'country': 'Iceland'}, page_content='Unique hike that can be done are volcanic hikes which can be done in Iceland. It is recommended to go with a tour of experienced people!'),
 Document(metadata={'activity': 'Food', 'country': 'Norway'}, page_content="One of the famous street foods of Iceland is the Hotdog! It is called the Baejarins Beztu Pylsur hot dog is made of a mix of lamb, beef and pork. Other delicacies of iceland include Fish and Chips as well as Tommi's burger."),
 Do

**Questions about BM25**

- Why are documents that do not contain the exact query term being returned?
  - No filtering of zero scores by LangChain's implementation
  - Other terms in the query are also considered (stopwords are not removed) which will be factored into the simlarity BM25 calculation

If we fix include the keyword food in documents we want, as well as remove words like 'best', those documents will be returned at the top

In [9]:
docs_with_query_keyword = [
    Document(
        page_content="The best hikes in Norway include the Reinebringen hike in the Lofoten islands. At a modest 448 meters high, Reinebringen is far from one of the highest peaks on the Lofoten islands. Yet this is more than made up for by the iconic view from the summit of Reine. It is not suitable for winter! Also, the trail can be quite demanding as the steps are quite steep.",
        metadata={"activity": 'Hiking', "country": 'Norway'},
    ),
    Document(
        page_content="Unique hike that can be done are volcanic hikes which can be done in Iceland. It is recommended to go with a tour of experienced people!",
        metadata={"activity": 'Hiking', "country": 'Iceland'},
    ),
    Document(
        page_content="The best food in Norway is seafood in the Nordic region can be found in Norway. The seafood is freshly caught from the arctic ocean. Popular choices include the famous norwegian salmon. Other delicacies include whale steak!",
        metadata={"activity": 'Food', "country": 'Norway'},
    ),
    Document(
        page_content="One of the famous street food of Iceland is the Hotdog! It is called the Baejarins Beztu Pylsur hot dog is made of a mix of lamb, beef and pork. Other delicacies of iceland include Fish and Chips as well as Tommi's burger.",
        metadata={"activity": 'Food', "country": 'Norway'},
    ),
    Document(
        page_content="Transportation within Reykjavik is fairly convenient as there is a public bus service called BSI. All you need to do is to download their mobile app, follow the instructions, and you're good to go. Transportation to places outside Reykjavik however requires a car. Some options include car rentals as well as booking bus tours.",
        metadata={"activity": 'Transportation', "country": 'Iceland'},
    ),
    Document(
        page_content="Finland is easily accessible with its HSL public transportation services where all you need to do is to download a mobile app and follow the instructions.",
        metadata={"activity": 'Transportation', "country": 'Iceland'},
    ),
    Document(
        page_content="Finland is known for its snowy-like landscape and captivating auroras. One of the best places to stay is the Glass huts in Skyfire village in Rovaniemi, Lapland where you can admire the beautiful northern lights and snowy landscape. The village has its very own restaurant called Sky Huts Restaurant and Bar which offers tailor-made menus by a professional chef using local ingredients.",
        metadata={"activity": 'Accomodation', "country": 'Finland'},
    ),
    Document(
        page_content="A nice place to stay in Norway is the Lofoten Islands, in particlar Unstad which provides a breathtaking view of the mountain valley, ocean, and if you're lucky, northern lights.",
        metadata={"activity": 'Accomodation', "country": 'Norway'},
    ),

]

In [11]:
bm25_retriever = BM25Retriever.from_documents(
    docs_with_query_keyword
)
bm25_retriever.k = 5 # number of docs to return

bm25_retriever.invoke("food")

[Document(metadata={'activity': 'Food', 'country': 'Norway'}, page_content='The best food in Norway is seafood in the Nordic region can be found in Norway. The seafood is freshly caught from the arctic ocean. Popular choices include the famous norwegian salmon. Other delicacies include whale steak!'),
 Document(metadata={'activity': 'Food', 'country': 'Norway'}, page_content="One of the famous street food of Iceland is the Hotdog! It is called the Baejarins Beztu Pylsur hot dog is made of a mix of lamb, beef and pork. Other delicacies of iceland include Fish and Chips as well as Tommi's burger."),
 Document(metadata={'activity': 'Accomodation', 'country': 'Norway'}, page_content="A nice place to stay in Norway is the Lofoten Islands, in particlar Unstad which provides a breathtaking view of the mountain valley, ocean, and if you're lucky, northern lights."),
 Document(metadata={'activity': 'Accomodation', 'country': 'Finland'}, page_content='Finland is known for its snowy-like landscap

Note: Other terms in the query are also considered (stopwords are not removed) which will be factored into the simlarity BM25 calculation

In [16]:
bm25_retriever = BM25Retriever.from_documents(
    docs_with_query_keyword
)
bm25_retriever.k = 5 # number of docs to return

bm25_retriever.invoke("what are the best food?")

[Document(metadata={'activity': 'Hiking', 'country': 'Norway'}, page_content='The best hikes in Norway include the Reinebringen hike in the Lofoten islands. At a modest 448 meters high, Reinebringen is far from one of the highest peaks on the Lofoten islands. Yet this is more than made up for by the iconic view from the summit of Reine. It is not suitable for winter! Also, the trail can be quite demanding as the steps are quite steep.'),
 Document(metadata={'activity': 'Hiking', 'country': 'Iceland'}, page_content='Unique hike that can be done are volcanic hikes which can be done in Iceland. It is recommended to go with a tour of experienced people!'),
 Document(metadata={'activity': 'Food', 'country': 'Norway'}, page_content='The best food in Norway is seafood in the Nordic region can be found in Norway. The seafood is freshly caught from the arctic ocean. Popular choices include the famous norwegian salmon. Other delicacies include whale steak!'),
 Document(metadata={'activity': 'Acc

**Improving BM25**
- Drop stopwords from the query
- Consider this: Do user queries more often contain words that are variations of keywords or do they more often contain the exact keyword?

**Investigating FAISS**

In [21]:
# Initialise the FAISS retriever -> Ranked retrieval, Similarity search using Locality Sensitive Hashing with Random Projections
# https://python.langchain.com/docs/integrations/vectorstores/faiss/
embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

index = faiss.IndexFlatL2(len(embeddings_model.embed_query("hello world"))) # Initialise FAISS index with the dimensionality

faiss_vector_store = FAISS(
    embedding_function=embeddings_model,
    index=index, # what index to use
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

faiss_vector_store.add_documents(docs)
faiss_retriever = faiss_vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3}) # num docs to return from FAISS

In [22]:
faiss_retriever.invoke("What are the best food?")

[Document(metadata={'activity': 'Food', 'country': 'Norway'}, page_content='The best seafood in the Nordic region can be found in Norway. The seafood is freshly caught from the arctic ocean. Popular choices include the famous norwegian salmon. Other delicacies include whale steak!'),
 Document(metadata={'activity': 'Food', 'country': 'Norway'}, page_content="One of the famous street foods of Iceland is the Hotdog! It is called the Baejarins Beztu Pylsur hot dog is made of a mix of lamb, beef and pork. Other delicacies of iceland include Fish and Chips as well as Tommi's burger."),
 Document(metadata={'activity': 'Hiking', 'country': 'Norway'}, page_content='The best hikes in Norway include the Reinebringen hike in the Lofoten islands. At a modest 448 meters high, Reinebringen is far from one of the highest peaks on the Lofoten islands. Yet this is more than made up for by the iconic view from the summit of Reine. It is not suitable for winter! Also, the trail can be quite demanding as 

**Thoughts about FAISS**

Seems like the embedding similarity works well (even though we use locality sensitive hashing with projections to increase the search speed by reducing vector resolution)

### **Experiment 2: Query Construction -> Self-Query**

- https://python.langchain.com/docs/concepts/retrieval/
- https://python.langchain.com/docs/how_to/self_query/

Self-Query is a Query Construction technique which focuses on combining semantic search using natural language user queries together with specialised metadata filters.

It may be effective because users searching for information on holidays tend to search for information relating to a category, which we have included as metadata in our documents. Therefore, the questions that they ask may be better answered by fetching documents based on metadata rather than solely based on simiarity with text.
- This uses an LLM to transform user input into two things: (1) a string to look up semantically, (2) a metadata filter to go along with it. This is useful because oftentimes questions are about the METADATA of documents (not the content itself).

Rough steps of constructing from scratch with LCEL
- Create a ```StructuredQuery``` object using an instruction template, given the doc desc, metadata info and user query, that is passed to an LLM for construction
- Translate the ```StructuredQuery``` object into a metadata filter in the syntax of the vector store we are using

In [12]:
%pip install --quiet --upgrade bitsandbytes langchain langchain-community langchain-huggingface transformers beautifulsoup4 faiss-gpu rank_bm25 lark qdrant-client langchain-chroma

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/67.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.3/67.3 kB[0m [31m4.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m615.5/615.5 kB[0m [31m21.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.4/2.4 MB[0m [31m77.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m94.9/94.9 kB[0m [31m8.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m273.8/273.8 kB[0m [31m22.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0m [31m74.9 MB/s[0m eta [36m0:00:00

In [2]:
from langchain_core.documents import Document

docs = [
    Document(
        page_content="The best hikes in Norway include the Reinebringen hike in the Lofoten islands. At a modest 448 meters high, Reinebringen is far from one of the highest peaks on the Lofoten islands. Yet this is more than made up for by the iconic view from the summit of Reine. It is not suitable for winter! Also, the trail can be quite demanding as the steps are quite steep.",
        metadata={"activity": 'Hiking', "country": 'Norway'},
    ),
    Document(
        page_content="Unique hike that can be done are volcanic hikes which can be done in Iceland. It is recommended to go with a tour of experienced people!",
        metadata={"activity": 'Hiking', "country": 'Iceland'},
    ),
    Document(
        page_content="The best seafood in the Nordic region can be found in Norway. The seafood is freshly caught from the arctic ocean. Popular choices include the famous norwegian salmon. Other delicacies include whale steak!",
        metadata={"activity": 'Food', "country": 'Norway'},
    ),
    Document(
        page_content="One of the famous street foods of Iceland is the Hotdog! It is called the Baejarins Beztu Pylsur hot dog is made of a mix of lamb, beef and pork. Other delicacies of iceland include Fish and Chips as well as Tommi's burger.",
        metadata={"activity": 'Food', "country": 'Norway'},
    ),
    Document(
        page_content="Transportation within Reykjavik is fairly convenient as there is a public bus service called BSI. All you need to do is to download their mobile app, follow the instructions, and you're good to go. Transportation to places outside Reykjavik however requires a car. Some options include car rentals as well as booking bus tours.",
        metadata={"activity": 'Transportation', "country": 'Iceland'},
    ),
    Document(
        page_content="Finland is easily accessible with its HSL public transportation services where all you need to do is to download a mobile app and follow the instructions.",
        metadata={"activity": 'Transportation', "country": 'Iceland'},
    ),
    Document(
        page_content="Finland is known for its snowy-like landscape and captivating auroras. One of the best places to stay is the Glass huts in Skyfire village in Rovaniemi, Lapland where you can admire the beautiful northern lights and snowy landscape. The village has its very own restaurant called Sky Huts Restaurant and Bar which offers tailor-made menus by a professional chef using local ingredients.",
        metadata={"activity": 'Accomodation', "country": 'Finland'},
    ),
    Document(
        page_content="A nice place to stay in Norway is the Lofoten Islands, in particlar Unstad which provides a breathtaking view of the mountain valley, ocean, and if you're lucky, northern lights.",
        metadata={"activity": 'Accomodation', "country": 'Norway'},
    ),

]

In [3]:
from langchain.chains.query_constructor.base import AttributeInfo

# Provide info about the metadata fields that our doc support and a short desc of the doc contents
metadata_field_info = [
    AttributeInfo(
        name="activity",
        description="The activities mentioned in the article",
        type="string",
    ),
    AttributeInfo(
        name="country",
        description="The country that the article is talking about",
        type="string",
    )
]

document_content_description = "Summary of particular information of a place"

In [4]:
import torch
from langchain_huggingface.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from transformers import BitsAndBytesConfig

#from huggingface_hub import notebook_login
#notebook_login()

# Initialise the LLM
# We load the quantized weights for faster generation
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model_id = "mistralai/Mistral-Nemo-Instruct-2407"
# model_id = "Qwen/Qwen2.5-0.5B-Instruct" # choose this for faster inference
tokenizer = AutoTokenizer.from_pretrained(model_id)
#model = AutoModelForCausalLM.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,quantization_config=quantization_config)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,max_new_tokens=1000)
llm = HuggingFacePipeline(pipeline=pipe)

tokenizer_config.json:   0%|          | 0.00/181k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.26M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/622 [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now default to True since model is quantized.


model.safetensors.index.json:   0%|          | 0.00/29.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/5 [00:00<?, ?it/s]

model-00001-of-00005.safetensors:   0%|          | 0.00/4.87G [00:00<?, ?B/s]

model-00002-of-00005.safetensors:   0%|          | 0.00/4.91G [00:00<?, ?B/s]

model-00003-of-00005.safetensors:   0%|          | 0.00/4.91G [00:00<?, ?B/s]

model-00004-of-00005.safetensors:   0%|          | 0.00/4.91G [00:00<?, ?B/s]

model-00005-of-00005.safetensors:   0%|          | 0.00/4.91G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

In [5]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore

# Initialise the FAISS retriever -> Ranked retrieval, Similarity search using Locality Sensitive Hashing with Random Projections
# https://python.langchain.com/docs/integrations/vectorstores/faiss/
embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

index = faiss.IndexFlatL2(len(embeddings_model.embed_query("hello world"))) # Initialise FAISS index with the dimensionality

faiss_vector_store = FAISS(
    embedding_function=embeddings_model,
    index=index, # what index to use
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

faiss_vector_store.add_documents(docs)

  from tqdm.autonotebook import tqdm, trange


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

['bfdbb8e2-82b4-43e3-a4b6-9b1f6770fca6',
 '67a3cdf9-592d-4b49-b06b-727aa365aff5',
 'd454a1f6-237d-47aa-a0d2-f9dd040ff388',
 '6600735b-c095-47b0-a44f-e0bc83c2ffb5',
 'a92dee0c-d8b7-421e-bde2-1a2775f42f39',
 'dfe80858-5e29-4998-9fa7-7d66f4465d32',
 'f5971ea3-0133-4508-8a05-b8d9d0fb12c0',
 '017723de-9be4-4a3f-aaf5-970c3d47a04b']

In [7]:
from langchain.retrievers.self_query.base import SelfQueryRetriever

retriever = SelfQueryRetriever.from_llm(
    llm,
    faiss_vector_store,
    document_content_description,
    metadata_field_info,
)

ValueError: Self query retriever with Vector Store type <class 'langchain_community.vectorstores.faiss.FAISS'> not supported.

**Use other VectorDBs because FAISS is not supported for self-query**
- https://python.langchain.com/docs/how_to/self_query/
- https://python.langchain.com/docs/integrations/retrievers/self_query/

Use ChromaDB


In [13]:
from langchain_chroma import Chroma

embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

chroma_vectorstore = Chroma.from_documents(docs, embeddings_model)

In [14]:
from langchain.retrievers.self_query.base import SelfQueryRetriever
chroma_retriever = SelfQueryRetriever.from_llm(
    llm, chroma_vectorstore, document_content_description, metadata_field_info, verbose=True
)

In [15]:
chroma_retriever.invoke("What is the best food?")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


OutputParserException: Parsing text
Your goal is to structure the user's query to match the request schema provided below.

<< Structured Request Schema >>
When responding use a markdown code snippet with a JSON object formatted in the following schema:

```json
{
    "query": string \ text string to compare to document contents
    "filter": string \ logical condition statement for filtering documents
}
```

The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.

A logical condition statement is composed of one or more comparison and logical operation statements.

A comparison statement takes the form: `comp(attr, val)`:
- `comp` (eq | ne | gt | gte | lt | lte): comparator
- `attr` (string):  name of attribute to apply the comparison to
- `val` (string): is the comparison value

A logical operation statement takes the form `op(statement1, statement2, ...)`:
- `op` (and | or): logical operator
- `statement1`, `statement2`, ... (comparison statements or logical operation statements): one or more statements to apply the operation to

Make sure that you only use the comparators and logical operators listed above and no others.
Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters only use the attributed names with its function names if there are functions applied on them.
Make sure that filters only use format `YYYY-MM-DD` when handling date data typed values.
Make sure that filters take into account the descriptions of attributes and only make comparisons that are feasible given the type of data being stored.
Make sure that filters are only used as needed. If there are no filters that should be applied return "NO_FILTER" for the filter value.

<< Example 1. >>
Data Source:
```json
{
    "content": "Lyrics of a song",
    "attributes": {
        "artist": {
            "type": "string",
            "description": "Name of the song artist"
        },
        "length": {
            "type": "integer",
            "description": "Length of the song in seconds"
        },
        "genre": {
            "type": "string",
            "description": "The song genre, one of "pop", "rock" or "rap""
        }
    }
}
```

User Query:
What are songs by Taylor Swift or Katy Perry about teenage romance under 3 minutes long in the dance pop genre

Structured Request:
```json
{
    "query": "teenager love",
    "filter": "and(or(eq(\"artist\", \"Taylor Swift\"), eq(\"artist\", \"Katy Perry\")), lt(\"length\", 180), eq(\"genre\", \"pop\"))"
}
```


<< Example 2. >>
Data Source:
```json
{
    "content": "Lyrics of a song",
    "attributes": {
        "artist": {
            "type": "string",
            "description": "Name of the song artist"
        },
        "length": {
            "type": "integer",
            "description": "Length of the song in seconds"
        },
        "genre": {
            "type": "string",
            "description": "The song genre, one of "pop", "rock" or "rap""
        }
    }
}
```

User Query:
What are songs that were not published on Spotify

Structured Request:
```json
{
    "query": "",
    "filter": "NO_FILTER"
}
```


<< Example 3. >>
Data Source:
```json
{
    "content": "Summary of particular information of a place",
    "attributes": {
    "activity": {
        "description": "The activities mentioned in the article",
        "type": "string"
    },
    "country": {
        "description": "The country that the article is talking about",
        "type": "string"
    }
}
}
```

User Query:
What is the best food?

Structured Request:
```json
{
    "query": "best food",
    "filter": "NO_FILTER"
}
```

<< Example 4. >>
Data Source:
```json
{
    "content": "Summary of particular information of a place",
    "attributes": {
    "activity": {
        "description": "The activities mentioned in the article",
        "type": "string"
    },
    "country": {
        "description": "The country that the article is talking about",
        "type": "string"
    }
}
}
```

User Query:
What are the activities in France?

Structured Request:
```json
{
    "query": "activities in France",
    "filter": "eq(\"country\", \"France\")"
}
```

<< Example 5. >>
Data Source:
```json
{
    "content": "Summary of particular information of a place",
    "attributes": {
    "activity": {
        "description": "The activities mentioned in the article",
        "type": "string"
    },
    "country": {
        "description": "The country that the article is talking about",
        "type": "string"
    },
    "date": {
        "description": "The date when the article was published",
        "type": "date"
    }
}
}
```

User Query:
What are the activities in France after 2020?

Structured Request:
```json
{
    "query": "activities in France",
    "filter": "and(eq(\"country\", \"France\"), gt(\"date\", \"2020-01-01\"))"
}
```

<< Example 6. >>
Data Source:
```json
{
    "content": "Summary of particular information of a place",
    "attributes": {
    "activity": {
        "description": "The activities mentioned in the article",
        "type": "string"
    },
    "country": {
        "description": "The country that the article is talking about",
        "type": "string"
    },
    "date": {
        "description": "The date when the article was published",
        "type": "date"
    }
}
}
```

User Query:
What are the activities in France before 2020?

Structured Request:
```json
{
    "query": "activities in France",
    "filter": "and(eq(\"country\", \"France\"), lte(\"date\", \"2020-01-01\"))"
}
```

<< Example 7. >>
Data Source:
```json
{
    "content": "Summary of particular information of a place",
    "attributes": {
    "activity": {
        "description": "The activities mentioned in the article",
        "type": "string"
    },
    "country": {
        "description": "The country that the article is talking about",
        "type": "string"
    },
    "date": {
        "description": "The date when the article was published",
        "type": "date"
    }
}
}
```

User Query:
What are the activities in France between 2015 and 2020?

Structured Request:
```json
{
    "query": "activities in France",
    "filter": "and(eq(\"country\", \"France\"), gte(\"date\", \"2015-01-01\"), lte(\"date\", \"2020-01-01\"))"
}
```

<< Example 8. >>
Data Source:
```json
{
    "content": "Summary of particular information of a place",
    "attributes": {
    "activity": {
        "description": "The activities mentioned in the article",
        "type": "string"
    },
    "country": {
        "description": "The country that the article is talking about",
        "type": "string"
    },
    "date": {
        "description": "The date when the article was published",
        "type": "date"
    }
}
}
```

User Query:
What are the activities in France between 2015 and 2020 that are related to sports?

Structured Request:
```json
{
    "query": "activities in France sports",
    "filter": "and(eq(\"country\", \"France\"), gte(\"date\", \"2015-01-01\"), lte(\"date\", \"2020-01-01\"))"
}
```

<< Example 9. >>
Data Source:
```json
{
    "content": "Summary of particular information of a place",
    "attributes": {
    "activity": {
        "description": "
 raised following error:
Got invalid JSON object. Error: Expecting value: line 2 column 14 (char 15)
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE
For troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/OUTPUT_PARSING_FAILURE

**Seems like using the LLM to construct to the JSON StructuredQuery object is prone to errors**

- Check this out later:
  - https://stackoverflow.com/questions/78489329/how-to-filter-documents-based-on-a-list-of-metadata-in-langchains-chroma-vector
  - https://github.com/langchain-ai/langchain/discussions/9645
  - https://medium.com/kx-systems/optimizing-vector-search-with-metadata-filtering-41276e1a7370

### **Experiment 3: Query Re-writing -> Decomposition**

- https://python.langchain.com/docs/concepts/retrieval/
- https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_5_to_9.ipynb

Decomposition is a Query Re-writing technique that focuses decomposing a question into a set of subquestions.

This may be effective for our system as users planning a holiday tend may string together many requests in one question.

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## **Part 6: Generate**

<u>Overview of Generate Step</u>

An LLM produces an answer using a prompt that includes the question and retrieved data

A chain in LangChain is a sequence of interconnected components that process a user's query to generate and deliver valuable output. It is like a pipeline

The components of a chain typically include:

- Prompts: Templates that guide the LLM's responses.

- LLMs or Chat Models: The engines that generate responses based on the prompts.

- Output Parsers: Tools that parse the LLM's output.

- Tools: Extensions that allow LLMs to extract additional information from APIs or run code, turning LLMs into agents.

- General Functions: Additional functionalities that can be chained together.

In LangChain, these components, known as runnables, can be combined/chained to form a comprehensive pipeline. Pipelines are a way to think of chains

An LLM chain is similar to a data pipeline. In a data pipeline, raw data is transformed into clean, structured data.

Similarly, in an LLM chain, a query is transformed into valuable output using LLM calls, functions, and additional data.

<br/>

<u>More on Prompts</u>

A prompt for a language model is a set of instructions provided by a user to guide the model's response, helping it to understand the context and generate relevant and coherent output

Prompt templates in LangChain are predefined receipes for generating prompts for language models

```
from langchain.prompts import PromptTemplate

# Create a simple prompt template
# Formulate the prompt to the LLM with a string as a placeholder, in this case {topic}
prompt_template = """
You are a helpful assistant that explains AI topics. Given the following input:
{topic}
Provide an explanation of the given topic.
"""

# Create the prompt from the prompt template from LangChain
prompt = PromptTemplate(
    input_variables=["topic"],
    template=prompt_template,
)

# Assemble the chain using the pipe operator "|", more on that later. Assemble a chain using both components
chain = prompt | llm

# Invoke the chain with an input variable
chain.invoke({"topic":"What is LangChain"})
```

LangChain rag-prompts from prompt hub:
https://smith.langchain.com/hub/rlm/rag-prompt

<br/>

<u>LangChain Expression Language (LCEL)</u>

LangChain Expression Language (LCEL) simplifies building complex chains from basic components.

It uses the pipe operator (|) to chain different components, feeding the output from one component as input to the next.

 A simple example of a chain composed this way would be a prompt combined with a model and an output parser.

 ```chain = prompt | model | output_parser```

 These components are called "runnables". Think of LangChain Expression Language(LCEL) as a declarative way of composing these runnables into chains.

 We can use the string output of this chain and send it to a new chain that will return a different output

LangChain allows multiple chains to be chained together in this manner.

```chain = chain_1 | chain_2```

The Runnable Protocol

A runnable is a unit of work that can be invoked, batched, streamed, transformed, and composed.

The chains we build with LangChain and their components (the components of those chains) are runnables.

We can also pass arbitrary functions into a chain, which will be converted into runnables.

```chain = prompt | (lambda input: {"x":input}) | model | output_parser```

These chains can be:

```
chain.invoke(...)
chain.batch([...]) if we have multiple inputs
chain.stream(...)
```

Each of these components (```retriever, prompt, llm, etc.```) are instances of Runnable. This means that they implement the same methods - such as sync and async ```.invoke```, ```.stream```, or ```.batch``` - which makes them easier to connect together. They can be connected into a ```RunnableSequence```--another Runnable--, via the ```|``` operator.

LangChain will automatically cast certain objects to runnables when met with the | operator.

Core runnable objects in LangChain:

RunnableSequence: A class that chains together multiple runnable components, ensuring each component processes its input and sequentially passes its output to the next component in the pipeline.

RunnableLambda: A class that turns a Python callable (like a function) into a runnable component, allowing integration of arbitrary functions into chains.

RunnablePassthrough: A class that either passes its input through unchanged or adds additional keys to the output. It can act as a placeholder or allow flexible integrations into sequences where we need to modify the input.

RunnableParallel: A class that runs multiple runnables concurrently, allowing branching where two chains run on the same input but return different outputs (return a mapping of their outputs).

<br/>

<u>Experimentation and Improvements</u>
1. Citations (Must do)
2. Prompt Engineering?



In [None]:
from langchain import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

template = """
You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Question: {question}
Context: {context}
Helpful Answer:
"""

prompt = PromptTemplate.from_template(template)

In [None]:
prompt

PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="\nYou are an assistant for question-answering tasks.\nUse the following pieces of retrieved context to answer the question.\nIf you don't know the answer, just say that you don't know.\nQuestion: {question}\nContext: {context}\nHelpful Answer:\n")

In [None]:
output_parser = StrOutputParser()

chain = (
    {"context": (lambda x: x["question"]) | retriever,

     "question": (lambda x: x["question"])}
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
answer = chain.invoke({"question":"What are the best things to do in Norway?"})

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


In [None]:
print(answer)


You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Question: What are the best things to do in Norway?
Context: [Document(metadata={'source': 'Norway Example.txt', 'start_index': 1786}, page_content="9. Mount Hanguren, Sognefjord/ Hardangerfjord\n\n10. Mount Fløyen, Sognefjord/ Hardangerfjord\n\n11. Mount Ulriken, Sognefjord/ Hardangerfjord\n\n12. Rimstigen, Næroyfjord\n\n13. Urkeegga, Hjørundfjord\n\n14. Himakånå, Nedstrandsfjorden\n\n15. Langfoss Waterfall Hike, Åkrafjord\n\nWe've suggested some day hikes within the fjords, but given that wild camping is permissible in Norway we also recommend embarking on a multi-day adventure.\n\nPlease note that the route descriptions we have included are summaries, designed to inspire rather than for any navigational purpose. Further mapping and planning - or an experienced local guide - will be needed to safely walk 

The best things to do in Norway include hiking in the fjords. Some of the best fjords for hiking are Nærøyfjord, Hjørundfjord, Sognefjord, and Hardangerfjord. These fjords offer excellent summit hikes with stunning views. Some popular hikes include Preikestolen, Breiskrednosi Summit Hike, Romsdalseggen Ridge, Mount Skåla, Trolltunga, Aurlandsdalen Valley, Dronningstien, Vidasethovden, Mount Hanguren, Mount Fløyen, Mount Ulriken, Rimstigen, Urkeegga, Himakånå, and Langfoss Waterfall Hike. These hikes offer a range of difficulties and are accessible by public transport. The best time to visit Norway's fjords for hiking is from spring to October, with peak season being mid-June to the end of July. However, weather can be changeable, so it's important to pack accordingly.

In [None]:
chain_skipprompt = (
    {"context": (lambda x: x["question"]) | retriever,

     "question": (lambda x: x["question"])}
    | prompt
    | llm.bind(skip_prompt=True)
    | StrOutputParser()
)

In [None]:
answer = chain_skipprompt.invoke({"question":"What are the best things to do in Norway?"})

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


In [None]:
print(answer) # Check why is this repeated for some LLMs?

The best things to do in Norway include hiking in the fjords. Some of the best fjords for hiking are Nærøyfjord, Hjørundfjord, Sognefjord, and Hardangerfjord. These fjords offer excellent summit hikes with stunning views. Some popular hikes include Preikestolen, Breiskrednosi Summit Hike, Romsdalseggen Ridge, Mount Skåla, Trolltunga, Aurlandsdalen Valley, Dronningstien, Vidasethovden, Mount Hanguren, Mount Fløyen, Mount Ulriken, Rimstigen, Urkeegga, Himakånå, and Langfoss Waterfall Hike. These hikes offer a range of difficulties and are accessible by public transport. The best time to visit Norway's fjords for hiking is from spring to October, with peak season being mid-June to the end of July. However, weather can be changeable, so it's important to pack accordingly.


Comparing it to no-RAG LLM response

In [None]:
print(llm.invoke("What are the best things to do in Norway??"))

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


What are the best things to do in Norway???

Norway is a country known for its stunning natural beauty, rich history, and vibrant culture. Here are some of the best things to do in Norway:

1. **Visit the Fjords**: Norway's fjords are one of its most iconic natural features. The most famous is the Geirangerfjord, but others like the Hardangerfjord and the Sognefjord are also breathtaking. You can explore them by boat, kayak, or even hike along their edges.

2. **Explore Bergen**: This historic city is a gateway to the fjords and offers a mix of cultural attractions, museums, and vibrant neighborhoods. Don't miss the Bryggen Hanseatic Wharf, a UNESCO World Heritage site.

3. **Experience the Northern Lights (Aurora Borealis)**: If you're lucky enough to visit Norway between September and March, you might get to see this natural light show. The best places to view them are in the north, such as Tromsø or the Lofoten Islands.

4. **Visit Oslo**: Norway's capital offers a blend of history,