### Encapsulation using polymorphism
`Favor Composition Over Inheritance` principle
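
A minimal sketch of the principle, for illustration only. The names `Sender`, `EmailSender`, `SmsSender`, and `Notifier` are hypothetical and do not come from these notes. `Notifier` holds a sender object instead of inheriting from one, so the delivery mechanism can be swapped without touching `Notifier`.

```python
from typing import Protocol


class Sender(Protocol):
    """Anything that can deliver a message (hypothetical interface)."""

    def send(self, message: str) -> str: ...


class EmailSender:
    def send(self, message: str) -> str:
        return f"email: {message}"


class SmsSender:
    def send(self, message: str) -> str:
        return f"sms: {message}"


class Notifier:
    """Holds a Sender (composition) instead of subclassing one."""

    def __init__(self, sender: Sender) -> None:
        self._sender = sender  # the concrete sender stays encapsulated

    def notify(self, message: str) -> str:
        # Polymorphic call: any object with a matching send() works.
        return self._sender.send(message)


email_result = Notifier(EmailSender()).notify("build passed")
sms_result = Notifier(SmsSender()).notify("build passed")
print(email_result)
print(sms_result)
```

Adding a new delivery channel then means writing one more small class with a `send` method, rather than extending a class hierarchy.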

### Test RAG

#### Retrieval-augmented generation (RAG)

[langchain.com](https://www.langchain.com/)

```bash
uv add langchain langchain-community langchain-huggingface faiss-cpu sentence-transformers transformers
```

In [1]:
# uv add ipywidgets

# from langchain.llms import HuggingFacePipeline
from langchain_huggingface import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate
from transformers import pipeline
from langchain_community.vectorstores import FAISS

In [2]:
import os
import urllib.request
import zipfile

zip_url = "https://github.com/gakudo-ai/open-datasets/raw/refs/heads/main/asia_documents.zip"
zip_path = "asia_documents.zip"
extract_folder = "asia_txt_files"

print("Downloading zip file...")
urllib.request.urlretrieve(zip_url, zip_path)
print("Download complete!")

print("Extracting files...")
os.makedirs(extract_folder, exist_ok=True)
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall(extract_folder)

print(f"Files extracted to: {extract_folder}")

print("Extracted files:")
print(os.listdir(extract_folder))

Downloading zip file...
Download complete!
Extracting files...
Files extracted to: asia_txt_files
Extracted files:
['Malaysia.txt', 'Mongolia.txt', 'Philippines.txt', 'South_Korea.txt', 'Thailand.txt', 'Japan.txt', 'Taiwan.txt', 'Indonesia.txt', 'Vietnam.txt']


In [3]:
# uv add -U langchain-huggingface

import os
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
# from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings

folder_path = "asia_txt_files"

documents = []
for filename in os.listdir(folder_path):
    if filename.endswith(".txt"):
        file_path = os.path.join(folder_path, filename)
        loader = TextLoader(file_path)
        documents.extend(loader.load())

text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=100)
docs = text_splitter.split_documents(documents)

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vectorstore = FAISS.from_documents(docs, embedding_model)
retriever = vectorstore.as_retriever()
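
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window sketch. It is not LangChain's implementation: `CharacterTextSplitter` first splits on a separator and then merges the pieces, whereas the hypothetical `sliding_chunks` helper below cuts at fixed character offsets. The sample string and the small sizes are made up so the overlap is easy to see.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, which is the idea behind the overlap: a sentence that straddles a boundary still appears whole in at least one chunk.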

In [4]:
from langchain_huggingface import HuggingFacePipeline
from transformers import pipeline

# device=0 requests the first CUDA GPU; pass device=-1 to force CPU.
llm_pipeline = pipeline("text-generation", model="gpt2", device=0, max_new_tokens=200)
# Wrap the transformers pipeline so LangChain chains can call it as an LLM.
llm = HuggingFacePipeline(pipeline=llm_pipeline)

Device set to use cpu


In [5]:
from langchain_core.prompts import PromptTemplate
from langchain.chains import RetrievalQA

prompt_template = """Answer the following question based on the provided context:
{context}

Question: {question}
Answer:"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template
)

retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},  # Pass the prompt here
    verbose=True
)
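
Roughly speaking, the `"stuff"` chain type places all retrieved chunks into the `{context}` slot of a single prompt. The sketch below imitates that step with plain string formatting. The `stuff_prompt` helper and the sample chunks are hypothetical, and the blank-line separator between chunks is an assumption made for illustration.

```python
PROMPT_TEMPLATE = """Answer the following question based on the provided context:
{context}

Question: {question}
Answer:"""


def stuff_prompt(chunks: list[str], question: str) -> str:
    """Join every retrieved chunk and fill the template in one pass."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Made-up chunks standing in for retriever results.
sample_chunks = ["Chunk about Vietnam.", "Chunk about Thailand."]
filled = stuff_prompt(sample_chunks, "What are the best Asian cuisine dishes?")
print(filled)
```

Because every chunk lands in one prompt, the prompt grows with the number of retrieved chunks, which matters for a model with a small context window such as `gpt2`.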

In [6]:
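def truncate_to_max_tokens(text, max_tokens=500):
    """Keep at most `max_tokens` whitespace-separated words of `text`.

    Words are only a rough proxy for model tokens; a tokenizer may count more.
    """
    tokens = text.split()
    if len(tokens) > max_tokens:
        return " ".join(tokens[:max_tokens])
    return text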

In [7]:
query = "What are the best Asian cuisine dishes?"

# `invoke` replaces the deprecated `get_relevant_documents`
retrieved_docs = retriever.invoke(query)[:1]  # Top-1 document
context = " ".join([doc.page_content for doc in retrieved_docs])
# Note: `context` is only for inspection; the chain below retrieves on its own.
context = truncate_to_max_tokens(context, max_tokens=500)

# `invoke` replaces the deprecated `run`
response = retrieval_qa.invoke({"query": query})
print("Answer:", response["result"])  # The answer text is under the "result" key

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

> Entering new RetrievalQA chain...

> Finished chain.
Answer: Answer the following question based on the provided context:
Vietnam is a Southeast Asian country known for its rich history, diverse landscapes, and delicious cuisine. Hanoi and Ho Chi Minh City are its major urban centers, each with a unique character. Ha Long Bay’s limestone karsts and the Mekong Delta’s floating markets are famous geographical highlights. Vietnamese culture is deeply influenced by Confucian values, French colonial heritage, and indigenous traditions.

Thailand is a Southeast Asian country famous for its tropical beaches, ornate temples, and bustling street food culture. Bangkok, the capital, is known for its vibrant nightlife and historical sites like the Grand Palace and Wat Arun. Northern Thailand features mountainous landscapes and cultural cities like Chiang Mai, while the south offers world-renowned islands such as Phuket and Koh Samui.

Malaysia is a diverse country in Southeast 
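
The printed result above begins by echoing the prompt and its context instead of showing only the model's continuation. One way to keep just the continuation is to cut everything up to the end of the prompt. This is a sketch only: `extract_answer` is a hypothetical helper, it assumes the prompt template used in this notebook (ending in `Question: ...` followed by `Answer:`), and the sample string is made up.

```python
def extract_answer(generated: str, question: str) -> str:
    """Return the text that follows the 'Question: ... Answer:' part of the prompt."""
    marker = f"Question: {question}\nAnswer:"
    _, found, tail = generated.partition(marker)
    # If the marker is missing, fall back to the full text rather than guessing.
    return tail.strip() if found else generated.strip()


# Made-up generation that echoes its prompt before answering.
question = "What are the best Asian cuisine dishes?"
raw = (
    "Answer the following question based on the provided context:\n"
    "Some retrieved context.\n\n"
    f"Question: {question}\n"
    "Answer: Pho and pad thai."
)
answer_only = extract_answer(raw, question)
print(answer_only)
```

In the notebook this would be applied to `response["result"]` with the same `query` string that was passed to the chain.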

In [20]:
path = os.path.join('.', 'mohsen.txt')

In [21]:
obj_texts = TextLoader(path)

In [22]:
obj_texts.load()

[Document(metadata={'source': './mohsen.txt'}, page_content='Mohsen Moghimbegloo\n\n0935000000\n\nPython developer. I work it while 8 years.\nI will make AI soulusion for all of people.\n\nEnd test\n')]