# RAG Workshop

## What is RAG?
1) Combines **information retrieval** with **generative models**.
2) The retrieved context is placed in the prompt so the model's output is grounded in relevant source material.
3) No fine-tuning of the model is required.
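
The three points above can be sketched as a retrieve, augment, generate loop. This is a toy illustration, not the workshop's implementation: `DOCS`, `overlap_score`, and the placeholder `generate` are hypothetical stand-ins for an embedding model, a vector store, and an LLM.

```python
from collections import Counter

# Toy document store (hypothetical data, for illustration only).
DOCS = [
    "Barcelona beat Real Madrid 5-2 in today's football game",
    "Preheat the oven before baking a chocolate cake",
    "Reset your password from the account settings page",
]

def tokenize(text):
    """Lowercase and split on whitespace; a crude stand-in for an embedding model."""
    return text.lower().split()

def overlap_score(query, doc):
    """Count the words shared by query and doc (a crude relevance score)."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    return sum((q & d).values())

def retrieve(query, docs, k=1):
    """Retrieval: return the k documents that score highest against the query."""
    return sorted(docs, key=lambda doc: overlap_score(query, doc), reverse=True)[:k]

def augment(query, context_docs):
    """Augmentation: place the retrieved text in the prompt next to the question."""
    context = "\n".join(context_docs)
    return f"User Question: {query}\nContext:\n{context}\n\nAnswer:"

def generate(prompt):
    """Generation: placeholder for the LLM call (no model is loaded here)."""
    return f"<LLM answer conditioned on {len(prompt)} prompt characters>"

query = "What was the score in the football game"
context = retrieve(query, DOCS)
prompt = augment(query, context)
print(prompt)
print(generate(prompt))
```

The model weights never change in this loop; only the prompt does, which is why no fine-tuning is involved.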

## Common Tools & Frameworks
- **FAISS** - Efficient vector similarity search library for storing and retrieving embeddings.
- **LangChain** - Framework for building RAG pipelines, chatbots etc.
- **Other Vector Stores** - Pinecone, Chroma, Weaviate, etc.

## RAG Use Cases
RAG can support multiple retrieval and generation patterns:

- **Text-to-Text**  
- **Text-to-Image**  
- **Image-to-Text**  
- **Image-to-Image**  

# Implementation

In [1]:
!pip install -r requirements.txt -q

## FAISS Library

### DECLARING GLOBAL VARIABLES + OBJECTS

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import pandas as pd

# EMBEDDING MODEL
embedding_model = SentenceTransformer("paraphrase-mpnet-base-v2") # bert-base-nli-mean-tokens

# DATA STORE (ITS TEXT COLUMN IS ALSO EMBEDDED INTO THE VECTOR STORE)
data = [
    ['What is the weather like today?', 'general'],
    ['Can you provide the latest stock market updates?', 'finance'],
    ['Recommend a good Italian restaurant nearby', 'food'],
    ['How do I reset my password?', 'tech support'],
    ['Tell me a joke', 'entertainment'],
    ['What are the symptoms of a flu?', 'health'],
    ['Book a flight to New York', 'travel'],
    ['How to make a chocolate cake?', 'cooking'],
    ['In todays football game, Barcelona beat Real Madrid 5-2', 'sports'],
    ['Im feeling happy today', 'personal emotion']
]
df = pd.DataFrame(data, columns=['text', 'category'])

# USER QUERY
USER_QUERY = "What was the score in today's football game"

# GENERATION MODEL (LOADING MODEL+TOKENIZER)
model_id = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.eval() # inference mode: disables dropout; any batch-norm layers use their learned running mean and variance

df


### VectorDB creation

In [None]:
text = df['text'] # ["What is the weather like today?", "Can you provide the latest stock market updates?", ...]

with torch.no_grad():
  embeddings = embedding_model.encode(text)

print(embeddings.shape) # (10, 768)
print(type(embeddings[0]))

embd_dim = embeddings.shape[1] # get embedding dimension (768)

index = faiss.IndexFlatL2(embd_dim) # create a FAISS index of dimension 768 that uses L2 distance as the metric (Flat = brute-force search)
faiss.normalize_L2(embeddings) # in-place normalization of all embeddings: every vector gets magnitude 1, so only the angle matters, not the magnitude

index.add(embeddings) # normalized embeddings added into index/VectorDB

(10, 768)
<class 'numpy.ndarray'>


### Retrieval

In [None]:
with torch.no_grad():
  search_vector = embedding_model.encode(USER_QUERY)
print(search_vector.shape, type(search_vector))
new_vector = np.array([search_vector])
print(new_vector.shape)
faiss.normalize_L2(new_vector)

distances, indices = index.search(new_vector, k=1) # fetch the 1 nearest neighbour; IndexFlatL2 reports squared L2 distances
results = pd.DataFrame({'distances': distances[0], 'ann': indices[0]})
results

(768,) <class 'numpy.ndarray'>
(1, 768)


Unnamed: 0,distances,ann
0,1.404838,8


In [None]:
df_merged = pd.merge(results, df, left_on='ann', right_index=True)
df_merged.head()

Unnamed: 0,distances,ann,text,category
0,1.404838,8,"In todays football game, Barcelona beat Real M...",sports
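
Because both the stored embeddings and the query are L2-normalized, the distance FAISS reports can be read as a cosine similarity. The sketch below checks the identity `squared_L2 = 2 - 2 * cosine` on toy vectors in pure Python and then converts the distance shown above. It assumes that `IndexFlatL2` reports squared L2 distances; the helper names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (what faiss.normalize_L2 does per row)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two vectors of any length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def distance_to_cosine(d2):
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - d2 / 2.0

# Toy vectors (illustrative values only).
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
d2 = squared_l2(l2_normalize(a), l2_normalize(b))

print(math.isclose(d2, 2 - 2 * cosine(a, b)))  # True
print(round(distance_to_cosine(1.404838), 4))  # 0.2976
```

Under that assumption, the retrieved sentence has a cosine similarity of roughly 0.30 with the query: the closest match in the store, but not a near-duplicate.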


### Augmentation

In [None]:
prompt_template = """
Give output to user question based on relvant context.

User Question: {USER_QUERY}
Context:
{Context}

Answer:
""".strip()

prompt = prompt_template.format(USER_QUERY=USER_QUERY, Context=" ".join(df_merged["text"].tolist()))
prompt

"Give output to user question based on relvant context.\n\nUser Question: What was the score in today's football game\nContext:\nIn todays football game, Barcelona beat Real Madrid 5-2\n\nAnswer:"

### Generation

In [None]:
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False, # if tokenize True, it will return token ids: [151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553...]. Otherwise, string returned.
    add_generation_prompt=True
)

# WITHOUT GENERATION PROMPT:
"""
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Give output to user question based on relvant context.

User Question: What was the score in today's football game
Context:
In todays football game, Barcelona beat Real Madrid 5-2

Answer:<|im_end|>
"""

# WITH GENERATION PROMPT:
"""
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Give output to user question based on relvant context.

User Question: What was the score in today's football game
Context:
In todays football game, Barcelona beat Real Madrid 5-2

Answer:<|im_end|>
<|im_start|>assistant
"""

model_inputs = tokenizer([text],
                         return_tensors="pt"
                         ).to(model.device)

# {
#     'input_ids': tensor([[151644,   8948,    198,...]], device='cuda:0'),
#     'attention_mask': tensor([[1, 1, ..., 1]], device='cuda:0')
# }




generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# same as:
# generated_ids = model.generate(
#     input_ids=...,
#     attention_mask=...,
#     max_new_tokens=512
# )

# Output: tensor([[151644, 8948, 198,..., ]], device='cuda:0')
# Output contains input tokens + output tokens

generated_ids = [generated_ids[0][len(model_inputs.input_ids[0]):]] # remove input token ids, just keep output token ids

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
# skip_special_tokens=True (special tokens such as <|im_end|> are stripped)
# The score in today's football game between Barcelona and Real Madrid was 5-2 in favor of Barcelona.

# skip_special_tokens=False (special tokens are kept)
# The score in today's football game between Barcelona and Real Madrid was 5-2 in favor of Barcelona.<|im_end|>

response

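
To make the two renderings in the comments above concrete, here is a pure-Python sketch of the ChatML-style layout they show, followed by the prompt-trimming step on toy token ids. `render_chatml` is a hypothetical helper that only mimics the strings shown in this notebook; the real formatting comes from `tokenizer.apply_chat_template`, and the token ids are made up.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Mimic the layout shown above: one <|im_start|>role ... <|im_end|> block per message."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Opens the assistant turn so the model continues with its answer
        # instead of continuing the user's text.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What was the score?"},
]
with_prompt = render_chatml(messages, add_generation_prompt=True)
without_prompt = render_chatml(messages, add_generation_prompt=False)
print(with_prompt)

# generate() returns prompt tokens followed by new tokens; slicing by the
# prompt length keeps only the newly generated part (toy ids, not real ones).
prompt_ids = [11, 12, 13]
output_ids = prompt_ids + [21, 22]
new_ids = output_ids[len(prompt_ids):]
print(new_ids)  # [21, 22]
```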

## LangChain Framework

### DECLARING GLOBAL VARIABLES + OBJECT

In [None]:
import os
import pandas as pd
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from transformers import pipeline
import torch

# os.environ["HUGGINGFACEHUB_API_TOKEN"] = "hf_..."

data = [
    ['What is the weather like today?', 'general'],
    ['Can you provide the latest stock market updates?', 'finance'],
    ['Recommend a good Italian restaurant nearby', 'food'],
    ['How do I reset my password?', 'tech support'],
    ['Tell me a joke', 'entertainment'],
    ['What are the symptoms of a flu?', 'health'],
    ['Book a flight to New York', 'travel'],
    ['How to make a chocolate cake?', 'cooking'],
    ['In todays football game, Barcelona beat Real Madrid 5-2', 'sports'],
    ['Im feeling happy today', 'personal emotion']
]
df = pd.DataFrame(data, columns=['text', 'category'])


embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-mpnet-base-v2")

# model_id = "Qwen/Qwen2.5-1.5B-Instruct"
# generation_pipe = pipeline(
#     "text-generation",
#     model=model_id,
#     torch_dtype=torch.float32,
#     max_new_tokens=100,
#     do_sample=False,
#     temperature=0.0,
#     device = 0
#     # num_return_sequences=3,
# )

### VectorDB creation

In [None]:
vector_store = InMemoryVectorStore(embedding_model)
vector_store.add_texts(df["text"])

['33b76b88-e062-42ef-a0a4-a32e3b87f1f2',
 'a2d1eee8-5c48-4c9a-bc21-987dd964481c',
 'bc29a990-f692-4028-9b6f-0ad1f7aa3d76',
 'a3353947-96e4-47f0-8206-37c89787cab5',
 '1f13c892-1ee0-4bc6-8357-022cb86ed0d6',
 '06634e59-a839-4127-b41b-1a1fe126666f',
 'f9d6d822-f0f4-4848-9cd8-716a5960c63a',
 '3c945efd-e02d-4528-aa73-f544e93dd9ef',
 'b88f404b-e5da-4d0f-a3a2-ff8b043173ae',
 'cf276a48-85ee-4fcc-b1d1-f4e45e5fcfad']

### Retrieval

In [None]:
query = "What's the score in the latest Barcelona game?"
retrieved_docs = vector_store.similarity_search(query, k=3)
print(retrieved_docs)

[Document(id='b88f404b-e5da-4d0f-a3a2-ff8b043173ae', metadata={}, page_content='In todays football game, Barcelona beat Real Madrid 5-2'), Document(id='a2d1eee8-5c48-4c9a-bc21-987dd964481c', metadata={}, page_content='Can you provide the latest stock market updates?'), Document(id='33b76b88-e062-42ef-a0a4-a32e3b87f1f2', metadata={}, page_content='What is the weather like today?')]
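
With `k=3` the store returns three documents no matter how relevant they are, so the stock-market and weather sentences above come back alongside the football result. The sketch below shows the idea behind a brute-force ranking and how a minimum-score cut-off could drop weak matches. It is pure Python with made-up 3-dimensional vectors; `top_k` and `min_score` are hypothetical names, not LangChain API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=3, min_score=None):
    """Score every stored vector against the query and keep the best k.

    store maps text -> embedding. If min_score is given, matches below it
    are dropped before the top k are taken.
    """
    scored = sorted(
        ((cosine(query_vec, vec), text) for text, vec in store.items()),
        reverse=True,
    )
    if min_score is not None:
        scored = [(score, text) for score, text in scored if score >= min_score]
    return scored[:k]

# Toy embeddings (illustrative values only).
toy_store = {
    "football result": [0.9, 0.1, 0.0],
    "stock market": [0.1, 0.9, 0.1],
    "weather": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]

print(top_k(query_vec, toy_store, k=3))                 # all three, best first
print(top_k(query_vec, toy_store, k=3, min_score=0.5))  # only the strong match
```

A suitable threshold depends on the embedding model and the data, so it is worth tuning on real queries rather than copying the 0.5 used here.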


### Augmentation

In [None]:
prompt_template = """
Give output to user question based on relvant context.

User Question: {USER_QUERY}
Context:
{Context}

Answer:
""".strip()

context = "\n".join([doc.page_content for doc in retrieved_docs])
prompt = prompt_template.format(USER_QUERY=query, Context=context)
prompt

### Generation

In [None]:
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)

generated_ids = [generated_ids[0][len(model_inputs.input_ids[0]):]]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
response

[{'generated_text': "Give output to user question based on relvant context.\n\nUser Question: What's the score in the latest Barcelona game?\nContext:\nIn todays football game, Barcelona beat Real Madrid 5-2\nCan you provide the latest stock market updates?\nWhat is the weather like today?\n\nAnswer: The score of the latest Barcelona game was 5-2 against Real Madrid. For the most up-to-date stock market updates, please visit our financial news section. And for current weather conditions, we recommend checking a reliable weather website or app. \n\nPlease let me know if there are any other questions I can assist with! 🏫⚽️🌤️\n\nNote: This response provides the requested information while maintaining a professional tone and avoiding direct repetition from the original context. It also includes additional relevant"}]

# LangChain Framework vs FAISS Library

**LangChain** and **FAISS** are two commonly used tools in AI applications. The table below highlights their strengths and weaknesses.

| Tool        | Strengths | Weaknesses |
|------------|-----------|------------|
| **LangChain** | - Enables rapid development of LLM-based applications such as chatbots, RAG systems, and AI agents. <br> - Provides high-level abstractions, reducing the need for deep AI or programming knowledge. <br> - Integrates easily with external APIs and vector databases (like FAISS). | - Internal workings are abstracted, making it harder to fully understand or customize low-level behavior. <br> - Can introduce overhead compared to a lean, custom implementation. |
| **FAISS** | - Highly efficient and scalable library for vector similarity search. <br> - Flexible low-level control for optimized performance. | - Purely a vector search engine; does not handle LLMs, prompts, or application workflows. <br> - Requires additional effort to integrate embeddings and LLMs for complete AI applications. |

---

**Summary:**  
- **FAISS** is the engine for vector search and similarity tasks.  
- **LangChain** is a higher-level framework for building LLM-powered applications, which can leverage FAISS (or other vector stores) for retrieval.  


### Other VectorDB alternatives
1) ChromaDB
2) Qdrant
3) Pinecone
4) Weaviate

# Retrieval For Images

![Alt](diagrams/RAG%20-%20Retrieval%20For%20Images.jpg)

### Practice
1) Use the FAISS library with CLIP's image (vision) encoder to embed the images.
2) Use the cat/dog images in the "images" directory (paths are defined below).
3) Create an image store and a VectorDB, and save both in the "images" directory.
4) Use the query image (path defined below).
5) Run a similarity search and retrieve the top 2 images.

In [None]:
img_paths = {
    0: "images/german_sheperd.jpg",
    1: "images/Golden_Retriever.jpg",
    2: "images/siberian_husky.jpg",
    3: "images/persian_cat.jpg",
    4: "images/scottish_fold_cat.jpg",
    5: "images/sphynx_cat.jpg"
}

QUERY_IMG = "images/query_german_sheperd.jpg"

In [None]:
# SAMPLE CODE TO GENERATE IMAGE EMBEDDINGS USING CLIP'S IMAGE ENCODER

import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

# NOTE: this rebinds `model`, which earlier cells used for the Qwen generation model
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

model.eval() # inference mode: disables dropout

def get_img_embeddings_using_clip_img_encoder(img_path):
    img = Image.open(img_path).convert("RGB")  # force 3 channels (handles grayscale/RGBA files)

    inputs = processor(images=img, return_tensors="pt")
    image_tensor = inputs['pixel_values']  # shape: (1, 3, 224, 224)

    # Encode
    with torch.no_grad():
        embeddings = model.get_image_features(pixel_values=image_tensor)  # shape: (1, 512)
    return embeddings.cpu().numpy().astype('float32')  # NumPy float32, the dtype FAISS expects

## Cross-modal/Multimodal Retrieval

![Alt text](diagrams/RAG%20-%20Cross-Model%20RetrievalMultimodal%20Retrieval.jpg)

## Text-to-Image Retrieval

![Alt text](diagrams/RAG%20-%20Text-to-Image%20Retrieval.jpg)

### Practice
1) Use the FAISS library with CLIP's vision encoder + text encoder.
2) Load the existing image store + VectorDB created previously.
3) Use the sample query text given below.
4) Run a similarity search and retrieve the top 2 images.

In [None]:
img_paths = {
    0: "images/german_sheperd.jpg",
    1: "images/Golden_Retriever.jpg",
    2: "images/siberian_husky.jpg",
    3: "images/persian_cat.jpg",
    4: "images/scottish_fold_cat.jpg",
    5: "images/sphynx_cat.jpg"
}

QUERY_TXT = "A sphynx cat"

In [None]:
# SAMPLE CODE TO GENERATE TEXT EMBEDDINGS USING CLIP'S TEXT ENCODER

def get_text_embeddings_using_clip_text_encoder(text):
    inputs = processor(text=text, return_tensors="pt", padding=True)

    with torch.no_grad():
        embeddings = model.get_text_features(**inputs)
        embeddings = embeddings.cpu().numpy().astype('float32')  # convert to NumPy
        return embeddings

## Image-to-Text Retrieval

![Alt text](diagrams/RAG%20-%20Image-to-text%20Retrieval.jpg)

### Practice
1) Use the FAISS library with CLIP's vision encoder + text encoder.
2) Load the sentences from sentences.txt and build a document store + VectorDB from them.
3) Save the document store + VectorDB in the "texts" folder.
4) Use the query image (path defined below).
5) Run a similarity search and retrieve the top 2 sentences.

In [None]:
QUERY_IMG = "images/query_german_sheperd.jpg"

In [None]:
# Write code

# Create Streamlit App
### Create an HR Chatbot that uses RAG in backend to answer employee queries

## Running a Streamlit app locally

### Instructions:
1) The code to create the VectorDB and the Streamlit app (app.py) is given below.
2) Run the VectorDB creation code given below to create the FAISS index.
3) Copy the Streamlit app code given below into a new file: app.py
4) Run it using: streamlit run app.py

## Running a Streamlit app on Google Colab



### Instructions
**ngrok** is required when running a Streamlit app on Google Colab. ngrok is a tool that creates a secure public URL (tunnel) to your local app.

**Why it's needed on Google Colab:** Colab runs in a private environment with no direct public access. ngrok exposes your Streamlit app running inside Colab to the internet, so you can open it in a browser and share the link.

1) Sign up for an ngrok account at https://dashboard.ngrok.com/signup.
Why: ngrok now requires a verified account to create public tunnels.

2) Get your ngrok authtoken from https://dashboard.ngrok.com/get-started/your-authtoken and run the command below in Colab:

!ngrok authtoken <YOUR_AUTHTOKEN>


Why: The authtoken authenticates your Colab session so ngrok can create a public tunnel.

3) Run the VectorDB creation code to generate the FAISS index.

4) Copy the provided Streamlit code into a new file named app.py.

5) Run the app and create a public tunnel using the ngrok code given below.
Why: Streamlit runs on a local port inside Colab, and Colab does not expose local ports to the internet. The ngrok tunnel makes the app publicly accessible.

6) Copy the generated https://*.ngrok-free.app URL and open it in another browser tab.
Why: This URL forwards external traffic to the Streamlit server running in Colab.

### VectorDB creation code

In [19]:
import os
import faiss
import torch
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter

# -----------------------------
# PDF processing
# -----------------------------
FOLDER_PATH = "app_docs"
TEXT_STORE_PATH = "texts/pdf_chunks_store.txt"
VECTOR_DB_PATH = "texts/pdf_chunks.index"

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

documents = []
metadatas = []
ids = []

for file_name in os.listdir(FOLDER_PATH):
    if file_name.endswith(".pdf"):
        pdf_path = os.path.join(FOLDER_PATH, file_name)

        reader = PdfReader(pdf_path)
        file_text = "".join(page.extract_text() or "" for page in reader.pages)

        chunks = text_splitter.split_text(file_text)

        documents.extend(chunks)
        metadatas.extend([{"source": file_name}] * len(chunks))
        ids.extend([f"{file_name}_chunk_{i}" for i in range(len(chunks))])

print(f"Total chunks: {len(documents)}")

# -----------------------------
# Embeddings
# -----------------------------
embedding_model = SentenceTransformer("paraphrase-mpnet-base-v2")

with torch.no_grad():
    embeddings = embedding_model.encode(
        documents,
        convert_to_numpy=True,
        show_progress_bar=True
    )

# -----------------------------
# FAISS index
# -----------------------------
embedding_dim = embeddings.shape[1]
index = faiss.IndexFlatL2(embedding_dim)

faiss.normalize_L2(embeddings)
index.add(embeddings)

# -----------------------------
# Save FAISS index
# -----------------------------
os.makedirs(os.path.dirname(VECTOR_DB_PATH), exist_ok=True)  # make sure the "texts" folder exists
faiss.write_index(index, VECTOR_DB_PATH)

# -----------------------------
# Save chunks to TXT
# -----------------------------
with open(TEXT_STORE_PATH, "w", encoding="utf-8") as f:
    for i in range(len(documents)):
        f.write(f"{ids[i]}\n")
        f.write(f"{metadatas[i]['source']}\n")
        f.write(documents[i].replace("\n", " ") + "\n")
        f.write("---\n")

print("FAISS index and text store saved.")

Total chunks: 57


Batches:   0%|          | 0/2 [00:00<?, ?it/s]

FAISS index and text store saved.


### Streamlit app code (copy to app.py and execute using: streamlit run app.py)

In [None]:
# STREAMLIT UI CODE
import streamlit as st
import faiss
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer

# -------------------------------
# Prompt template
# -------------------------------
prompt_template = """
You are acting as a car expert.
Answer the user query using the given context.

User Query:
{user_query}

Context:
{context}
""".strip()

# -------------------------------
# Load generation+embedding model, tokenizer, FAISS
# -------------------------------
@st.cache_resource
def load_rag_components(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    faiss_index_path="texts/pdf_chunks.index",
    documents_path="texts/pdf_chunks_store.txt", 
    embedding_model_name="paraphrase-mpnet-base-v2"
):

    # Embedding model
    embedding_model = SentenceTransformer(embedding_model_name)

    # Load FAISS index
    index = faiss.read_index(faiss_index_path)

    # Load chunk store (CUSTOM FORMAT)
    documents = []
    with open(documents_path, "r", encoding="utf-8") as f:
        block = []
        for line in f:
            if line.strip() == "---":
                documents.append({
                    "id": block[0],
                    "source": block[1],
                    "text": block[2]
                })
                block = []
            else:
                block.append(line.strip())

    # Load generation model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    return tokenizer, model, index, documents, embedding_model


tokenizer, model, index, documents, embedding_model = load_rag_components()

import numpy as np  # torch is already imported above; pandas is not used in this app

# -------------------------------
# Retrieve top-k documents
# -------------------------------
def retrieve(user_query, top_k=3):
    # Encode query
    with torch.no_grad():
        query_vector = embedding_model.encode(user_query)

    query_vector = np.array([query_vector])
    faiss.normalize_L2(query_vector)

    # FAISS search
    distances, indices = index.search(query_vector, k=top_k)

    # Build one string per hit: the source file name followed by the chunk text
    retrieved_docs = [
        f"[Source: {documents[i]['source']}]\n{documents[i]['text']}"
        for i in indices[0]
    ]

    return retrieved_docs

# -------------------------------
# Create prompt with context
# -------------------------------
def augment(user_query, context_docs):
    context = "\n\n".join(context_docs)
    prompt = prompt_template.format(user_query=user_query, context=context)
    return prompt

# -------------------------------
# Generate response from model
# -------------------------------
def generate(prompt, max_new_tokens=150):
    st.subheader("Chunks Retrieved:")
    st.write(prompt)
    # System/user messages for Qwen-style chat
    messages = [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]

    # Apply Qwen chat template
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Prepare model inputs
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # Generate tokens
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=max_new_tokens
    )

    # Remove input tokens, keep only generated tokens
    generated_ids = generated_ids[0][len(model_inputs.input_ids[0]):]

    # Decode to string
    response = tokenizer.batch_decode([generated_ids], skip_special_tokens=True)[0]

    return response


# -------------------------------
# RAG Pipeline
# -------------------------------
def RAG(user_query):
    context = retrieve(user_query)
    prompt = augment(user_query, context)
    response = generate(prompt)
    return response

# -------------------------------
# Streamlit UI
# -------------------------------
st.set_page_config(page_title="RAG Chat App", page_icon="📚")

st.title("📚 RAG-powered Q&A")
st.write("Ask a question and get an answer using Retrieval-Augmented Generation.")

user_input = st.text_input("Enter your question:")

if st.button("Ask"):
    if not user_input.strip():
        st.warning("Please enter a question.")
    else:
        with st.spinner("Generating answer..."):
            response = RAG(user_input)

        st.subheader("Answer")
        st.write(response)



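
The app relies on the text store and the FAISS index staying aligned: row `i` of the index must correspond to record `i` in the file, because `retrieve` looks up `documents[i]` for each index FAISS returns. The sketch below round-trips the four-line record format (id, source, text, `---`) through an in-memory buffer to check that. `write_store` and `read_store` are hypothetical helpers that mirror the logic above, and the records are made-up examples.

```python
import io

def write_store(f, records):
    """Write each record as: id line, source line, single-line text, '---' separator."""
    for rec in records:
        f.write(f"{rec['id']}\n")
        f.write(f"{rec['source']}\n")
        f.write(rec["text"].replace("\n", " ") + "\n")  # keep the text on one line
        f.write("---\n")

def read_store(f):
    """Parse the format written by write_store back into a list of dicts."""
    records, block = [], []
    for line in f:
        line = line.rstrip("\n")
        if line == "---":
            if len(block) >= 3:  # skip malformed blocks instead of raising IndexError
                records.append({"id": block[0], "source": block[1], "text": block[2]})
            block = []
        else:
            block.append(line)
    return records

# Made-up records standing in for PDF chunks.
records = [
    {"id": "policy.pdf_chunk_0", "source": "policy.pdf", "text": "Leave policy:\n20 days per year."},
    {"id": "policy.pdf_chunk_1", "source": "policy.pdf", "text": "Remote work is allowed."},
]

buf = io.StringIO()
write_store(buf, records)
buf.seek(0)
loaded = read_store(buf)
print(loaded[0])
```

Order is preserved, so position `i` in `loaded` matches position `i` in the list that was embedded. Flattening newlines on write is what keeps each chunk on a single line for the parser.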

In [10]:
!ngrok authtoken <YOUR_AUTHTOKEN>

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [None]:
import os
from pyngrok import ngrok
import threading
import time

ngrok.kill()

# 1️⃣ Streamlit function
def run_streamlit():
    os.system("streamlit run app.py --server.port 8501 --server.headless true")

# 2️⃣ Start Streamlit in a background thread
threading.Thread(target=run_streamlit, daemon=True).start()

# 3️⃣ Wait for Streamlit to start
time.sleep(5)

# 4️⃣ Start ngrok tunnel correctly
public_url = ngrok.connect(addr=8501, proto="http")  # ✅ addr + proto explicitly
print("Streamlit app running at:", public_url)

Streamlit app running at: NgrokTunnel: "https://b030-34-186-15-205.ngrok-free.app" -> "http://localhost:8501"
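
The fixed `time.sleep(5)` above may be too short while the models are still loading, or longer than needed on a fast machine. One possible alternative is to poll the port until something accepts connections. This is a sketch using only the standard library; `wait_for_port` is a hypothetical helper, and the demo uses a throwaway socket in place of the Streamlit server.

```python
import socket
import time

def wait_for_port(port, host="127.0.0.1", timeout=30.0, interval=0.5):
    """Return True once host:port accepts TCP connections, False if timeout elapses first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # nothing is listening yet; try again shortly
    return False

# Demo: a throwaway listening socket stands in for the Streamlit server.
server = socket.socket()
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen(1)
demo_port = server.getsockname()[1]
port_ready = wait_for_port(demo_port, timeout=2.0)
print(port_ready)
server.close()
```

In the cell above, `wait_for_port(8501)` could replace the sleep, with the ngrok tunnel opened only when it returns `True`.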
