### Test RAG

#### Retrieval-augmented generation (RAG)

[langchain.com](https://www.langchain.com/)

```bash
uv add langchain langchain_community langchain-huggingface faiss-cpu sentence-transformers transformers
```

#### Whole code in one cell

In [1]:
# uv add ipywidgets

# from langchain.llms import HuggingFacePipeline
from langchain_huggingface import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate
from transformers import pipeline
from langchain_community.vectorstores import FAISS


import os
import urllib.request
import zipfile

zip_url = "https://github.com/gakudo-ai/open-datasets/raw/refs/heads/main/asia_documents.zip"
zip_path = "asia_documents.zip"
extract_folder = "asia_txt_files"

print("Downloading zip file...")
urllib.request.urlretrieve(zip_url, zip_path)
print("Download complete!")

print("Extracting files...")
os.makedirs(extract_folder, exist_ok=True)
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall(extract_folder)

print(f"Files extracted to: {extract_folder}")

print("Extracted files:")
print(os.listdir(extract_folder))


# uv add -U langchain-huggingface

import os
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
# from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings

folder_path = "asia_txt_files"

documents = []
for filename in os.listdir(folder_path):
    if filename.endswith(".txt"):
        file_path = os.path.join(folder_path, filename)
        loader = TextLoader(file_path)
        documents.extend(loader.load())

text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=100)
docs = text_splitter.split_documents(documents)

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vectorstore = FAISS.from_documents(docs, embedding_model)
retriever = vectorstore.as_retriever()


from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from transformers import pipeline

llm_pipeline = pipeline("text-generation", model="gpt2", device=0, max_new_tokens=200)
llm = HuggingFacePipeline(pipeline=llm_pipeline)


from langchain_core.prompts import PromptTemplate
from langchain.chains import RetrievalQA

prompt_template = """Answer the following question based on the provided context:
{context}

Question: {question}
Answer:"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template
)

retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},  # Pass the prompt here
    verbose=True
)


def truncate_to_max_tokens(text, max_tokens=500):
    tokens = text.split()
    if len(tokens) > max_tokens:
        return " ".join(tokens[:max_tokens])
    return text


query = "What are the best Asian cuisine dishes?"

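# Optional inspection step: `retriever.invoke` replaces the deprecated `get_relevant_documents`.
# Note: `context` is not passed to the chain below; RetrievalQA performs its own retrieval.
retrieved_docs = retriever.invoke(query)[:1]  # Top-1 document
context = " ".join([doc.page_content for doc in retrieved_docs])
context = truncate_to_max_tokens(context, max_tokens=500)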

# Use `invoke` instead of `run`
response = retrieval_qa.invoke({"query": query})
print("Answer:", response["result"])  # Access the result via ["result"]

Downloading zip file...
Download complete!
Extracting files...
Files extracted to: asia_txt_files
Extracted files:
['Malaysia.txt', 'Mongolia.txt', 'Philippines.txt', 'South_Korea.txt', 'Thailand.txt', 'Japan.txt', 'Taiwan.txt', 'Indonesia.txt', 'Vietnam.txt']


Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.




> Entering new RetrievalQA chain...

> Finished chain.
Answer: Answer the following question based on the provided context:
Vietnam is a Southeast Asian country known for its rich history, diverse landscapes, and delicious cuisine. Hanoi and Ho Chi Minh City are its major urban centers, each with a unique character. Ha Long Bay’s limestone karsts and the Mekong Delta’s floating markets are famous geographical highlights. Vietnamese culture is deeply influenced by Confucian values, French colonial heritage, and indigenous traditions.

Thailand is a Southeast Asian country famous for its tropical beaches, ornate temples, and bustling street food culture. Bangkok, the capital, is known for its vibrant nightlife and historical sites like the Grand Palace and Wat Arun. Northern Thailand features mountainous landscapes and cultural cities like Chiang Mai, while the south offers world-renowned islands such as Phuket and Koh Samui.

Malaysia is a diverse country in Southeast 

#### Code split into separate cells

In [22]:
# uv add ipywidgets

# from langchain.llms import HuggingFacePipeline
from langchain_huggingface import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate
from transformers import pipeline
from langchain_community.vectorstores import FAISS

In [23]:
import os
import urllib.request
import zipfile

zip_url = "https://github.com/gakudo-ai/open-datasets/raw/refs/heads/main/asia_documents.zip"
zip_path = "asia_documents.zip"
extract_folder = "asia_txt_files"

print("Downloading zip file...")
urllib.request.urlretrieve(zip_url, zip_path)
print("Download complete!")

print("Extracting files...")
os.makedirs(extract_folder, exist_ok=True)
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall(extract_folder)

print(f"Files extracted to: {extract_folder}")

print("Extracted files:")
print(os.listdir(extract_folder))

Downloading zip file...
Download complete!
Extracting files...
Files extracted to: asia_txt_files
Extracted files:
['Malaysia.txt', 'Mongolia.txt', 'Philippines.txt', 'South_Korea.txt', 'Thailand.txt', 'Japan.txt', 'Taiwan.txt', 'Indonesia.txt', 'Vietnam.txt']


In [31]:
# uv add -U langchain-huggingface

import os
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
# from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings

folder_path = "asia_txt_files"

documents = []
for filename in os.listdir(folder_path):
    if filename.endswith(".txt"):
        file_path = os.path.join(folder_path, filename)
        loader = TextLoader(file_path)
        documents.extend(loader.load())

text_splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
)
docs = text_splitter.split_documents(documents)

# Option 1 (recommended): load by model name; downloaded once, then served from the local cache
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Option 2: explicit local snapshot path (skips the Hub lookup, so it starts faster)
# local_model_path = os.path.expanduser("~/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/c9745ed1d9f207416be6d2e6f8de32d1f16199bf")
# embedding_model = HuggingFaceEmbeddings(model_name=local_model_path)

vectorstore = FAISS.from_documents(docs, embedding_model)
retriever = vectorstore.as_retriever()
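
Conceptually, the retriever built above returns the chunks whose embedding vectors lie closest to the embedding of the query. The following is only a minimal pure-Python sketch of that idea, not how FAISS is implemented: the 3-dimensional vectors, the chunk labels, and the `l2_distance` and `top_k` helpers are all hypothetical, and ranking by Euclidean distance is an assumption about the default index.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored items closest to the query vector."""
    ranked = sorted(store, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical toy "embeddings"; all-MiniLM-L6-v2 produces 384-dimensional vectors.
toy_store = [
    ("Thailand street food", [0.9, 0.1, 0.0]),
    ("Mongolia steppe climate", [0.0, 0.2, 0.9]),
    ("Vietnam cuisine", [0.8, 0.3, 0.1]),
]

# Hypothetical embedding of a food-related query.
query_vec = [1.0, 0.2, 0.0]

nearest = top_k(query_vec, toy_store, k=2)
print(nearest)
```

The two food-related entries rank ahead of the unrelated one because their vectors sit closer to the query vector.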

In [36]:
# Ollama models specifically for embeddings/RAG
#
# 1- nomic-embed-text
# 2- mxbai-embed-large
# 3- llama3
# 4- nomic-embed 
# 5- all-MiniLM
#
# "nomic-embed-text" is a solid default Ollama model for embeddings/RAG

# Example (`OllamaEmbeddings` takes `model`, not `model_name`):
#
# from langchain_community.embeddings import OllamaEmbeddings
# embedding_model = OllamaEmbeddings(model="nomic-embed-text")

In [37]:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from transformers import pipeline

llm_pipeline = pipeline("text-generation", model="gpt2", device=0, max_new_tokens=200)
llm = HuggingFacePipeline(pipeline=llm_pipeline)

"""
                  [1]
########################################
# For limited GPU memory (8GB or less) #
########################################

# 4-bit quantized model with better memory efficiency
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Configure quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

# Load a smaller, but powerful model
model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # Alternative to Llama 2
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

llm_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

llm = HuggingFacePipeline(pipeline=llm_pipeline)
"""



"""
                  [2]
#####################################
# For using Ollama (easiest option) #
#####################################

from langchain_community.llms import Ollama

# Simple setup with Ollama - handles all memory management for you
llm = Ollama(
    model="mistral",  # or "llama2", "llama3" if available
    temperature=0.7,
    num_predict=500  # Ollama's name for the maximum number of generated tokens
)

# Note: With this approach, you don't need the pipeline creation
# The rest of your RAG code can remain the same
"""


"""
                      [3]
##################################################
# Cloud API option if local resource is an issue #
##################################################

from langchain_openai import ChatOpenAI  # or any other API-based model

# No local resources needed
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",  # Or use another API model
    temperature=0.7
)

# The rest of your RAG setup remains the same
"""


In [38]:
from langchain_core.prompts import PromptTemplate
from langchain.chains import RetrievalQA

prompt_template = """Answer the following question based on the provided context:
{context}

Question: {question}
Answer:"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template
)

retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},  # Pass the prompt here
    verbose=True
)
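
With `chain_type="stuff"`, the retrieved chunks are joined into one block of text and substituted for `{context}` in the template. A rough sketch of that step with plain string formatting, assuming a blank-line separator between chunks (which matches the printed answer in this notebook); the `build_stuff_prompt` helper and the two shortened chunks are hypothetical:

```python
prompt_template = """Answer the following question based on the provided context:
{context}

Question: {question}
Answer:"""


def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Join the chunks into a single context block and fill in the template."""
    context = separator.join(chunks)
    return prompt_template.format(context=context, question=question)


# Hypothetical, shortened stand-ins for retrieved chunks.
chunks = [
    "Vietnam is known for its delicious cuisine.",
    "Thailand is famous for its street food culture.",
]

final_prompt = build_stuff_prompt(chunks, "What are the best Asian cuisine dishes?")
print(final_prompt)
```

Because every retrieved chunk ends up in a single prompt, the total length has to fit within the context window of the model.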

In [34]:
def truncate_to_max_tokens(text, max_tokens=500):
    tokens = text.split()
    if len(tokens) > max_tokens:
        return " ".join(tokens[:max_tokens])
    return text
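
Despite its name, `truncate_to_max_tokens` counts whitespace-separated words, not model tokens, so it only approximates a token budget (GPT-2, for instance, has a 1024-token context window, and one word often maps to more than one token). A small usage sketch with made-up sample strings:

```python
def truncate_to_max_tokens(text, max_tokens=500):
    # Same helper as in the notebook: "tokens" here are whitespace-separated words.
    tokens = text.split()
    if len(tokens) > max_tokens:
        return " ".join(tokens[:max_tokens])
    return text


sample = "Bangkok offers the best seafood in the region"

# Longer than the limit: only the first four words are kept.
shortened = truncate_to_max_tokens(sample, max_tokens=4)
print(shortened)

# Within the limit: the text is returned unchanged.
unchanged = truncate_to_max_tokens(sample, max_tokens=50)
print(unchanged == sample)

# Truncation also collapses runs of whitespace, because words are re-joined with single spaces.
collapsed = truncate_to_max_tokens("a  b\nc", max_tokens=2)
print(repr(collapsed))
```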

In [35]:
query = "What are the best Asian cuisine dishes?"

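# Optional inspection step: `retriever.invoke` replaces the deprecated `get_relevant_documents`.
# Note: `context` is not passed to the chain below; RetrievalQA performs its own retrieval.
retrieved_docs = retriever.invoke(query)[:1]  # Top-1 document
context = " ".join([doc.page_content for doc in retrieved_docs])
context = truncate_to_max_tokens(context, max_tokens=500)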

# Use `invoke` instead of `run`
response = retrieval_qa.invoke({"query": query})
print("Answer:", response["result"])  # Access the result via ["result"]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.




> Entering new RetrievalQA chain...

> Finished chain.
Answer: Answer the following question based on the provided context:
Vietnam is a Southeast Asian country known for its rich history, diverse landscapes, and delicious cuisine. Hanoi and Ho Chi Minh City are its major urban centers, each with a unique character. Ha Long Bay’s limestone karsts and the Mekong Delta’s floating markets are famous geographical highlights. Vietnamese culture is deeply influenced by Confucian values, French colonial heritage, and indigenous traditions.

Thailand is a Southeast Asian country famous for its tropical beaches, ornate temples, and bustling street food culture. Bangkok, the capital, is known for its vibrant nightlife and historical sites like the Grand Palace and Wat Arun. Northern Thailand features mountainous landscapes and cultural cities like Chiang Mai, while the south offers world-renowned islands such as Phuket and Koh Samui.

Malaysia is a diverse country in Southeast 
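
The answer above starts with the prompt itself, which suggests the `text-generation` pipeline returned the prompt together with its continuation (its `return_full_text` option defaults to `True`). One way to post-process such output is to keep only what follows the first `Answer:` marker. The `extract_answer` helper and the sample string below are hypothetical, and the sketch assumes the retrieved context does not itself contain the marker:

```python
def extract_answer(generated_text, marker="Answer:"):
    """Return the text after the first marker, or the whole text if the marker is absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()


# Hypothetical model output: the echoed prompt followed by a made-up continuation.
raw = (
    "Answer the following question based on the provided context:\n"
    "Thailand is famous for its street food culture.\n\n"
    "Question: What is Thailand famous for?\n"
    "Answer: Street food."
)

print(extract_answer(raw))
```

Passing `return_full_text=False` when building the pipeline may be an alternative, although how that interacts with `HuggingFacePipeline` depends on the installed versions.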

#### Chunking sample with `langchain` [1]

In [15]:
from langchain.text_splitter import CharacterTextSplitter

text = "Your long document here... Singapore has many amazing restaurants.\n Bangkok offers the best seafood and seafood from all over the world. \n Northern Thailand features mountainous landscapes and cultural cities like Chiang Mai, while the south\n offers world-renowned islands such as Phuket and Koh Samui."
splitter = CharacterTextSplitter(chunk_size=30, 
                                 chunk_overlap=10, 
                                 separator="\n")
chunks = splitter.split_text(text)
print(f"Split into {len(chunks)} chunks!")

Created a chunk of size 66, which is longer than the specified 30
Created a chunk of size 70, which is longer than the specified 30
Created a chunk of size 103, which is longer than the specified 30


Split into 4 chunks!


In [16]:
chunks

['Your long document here... Singapore has many amazing restaurants.',
 'Bangkok offers the best seafood and seafood from all over the world.',
 'Northern Thailand features mountainous landscapes and cultural cities like Chiang Mai, while the south',
 'offers world-renowned islands such as Phuket and Koh Samui.']
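
The warnings above appear because the splitter cuts only at the separator: a piece between two newlines that is longer than `chunk_size` is kept whole. The sketch below is a simplified pure-Python approximation of that behavior, not LangChain's actual implementation; `split_on_separator` is a hypothetical helper and it ignores `chunk_overlap` entirely.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on the separator, then greedily merge neighbouring pieces that still fit."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}{separator}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging
        else:
            if current:
                chunks.append(current)
            current = piece  # an oversized piece is kept whole, never cut
    if current:
        chunks.append(current)
    return chunks


text = "Your long document here... Singapore has many amazing restaurants.\n Bangkok offers the best seafood and seafood from all over the world. \n Northern Thailand features mountainous landscapes and cultural cities like Chiang Mai, while the south\n offers world-renowned islands such as Phuket and Koh Samui."

chunks = split_on_separator(text, separator="\n", chunk_size=30)
print(len(chunks))
for chunk in chunks:
    print(len(chunk), chunk)
```

Every piece is longer than 30 characters, so nothing can be merged and each newline-delimited piece becomes its own oversized chunk.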

#### Chunking sample with `langchain` [2]

In [9]:
from langchain.text_splitter import CharacterTextSplitter

text = "Your long document here... Singapore has many amazing restaurants. Bangkok offers the best seafood and seafood from all over the world. Northern Thailand features mountainous landscapes and cultural cities like Chiang Mai, while the south offers world-renowned islands such as Phuket and Koh Samui."

# Specify separator (e.g., space) to force splitting
splitter = CharacterTextSplitter(
    chunk_size=30,
    chunk_overlap=10,
    separator=" "  # Split on spaces
)
chunks = splitter.split_text(text)
print(f"Split into {len(chunks)} chunks!")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}")

Split into 15 chunks!
Chunk 1: Your long document here...
Chunk 2: here... Singapore has many
Chunk 3: has many amazing restaurants.
Chunk 4: Bangkok offers the best
Chunk 5: the best seafood and seafood
Chunk 6: seafood from all over the
Chunk 7: over the world. Northern
Chunk 8: Northern Thailand features
Chunk 9: features mountainous
Chunk 10: landscapes and cultural cities
Chunk 11: cities like Chiang Mai, while
Chunk 12: Mai, while the south offers
Chunk 13: offers world-renowned islands
Chunk 14: islands such as Phuket and Koh
Chunk 15: and Koh Samui.
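
The effect of `chunk_overlap=10` can be checked on the printed chunks by measuring how much text each chunk shares with the next one. A minimal sketch using the first four chunks from the output above; `shared_overlap` is a hypothetical helper that finds the longest suffix of one chunk that is also a prefix of the next:

```python
def shared_overlap(prev_chunk, next_chunk):
    """Longest suffix of prev_chunk that is also a prefix of next_chunk."""
    for size in range(min(len(prev_chunk), len(next_chunk)), 0, -1):
        if prev_chunk.endswith(next_chunk[:size]):
            return next_chunk[:size]
    return ""


# First four chunks copied from the output above.
chunks = [
    "Your long document here...",
    "here... Singapore has many",
    "has many amazing restaurants.",
    "Bangkok offers the best",
]

overlaps = [shared_overlap(a, b) for a, b in zip(chunks, chunks[1:])]
for overlap in overlaps:
    print(len(overlap), repr(overlap))
```

In this sample the overlap always consists of whole words and stays within the 10-character budget; where the preceding word alone would exceed the budget, as between chunks 3 and 4, no text is repeated.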
