# Practice 2: Retrieval-Augmented Generation (RAG) with Azure OpenAI and FAISS
### Plan:
- initialize embeddings and LLM
- create vectors for individual queries, compare distances/similarities
- load dataset
- split into chunks, see documents created
- create vector store from chunks
- retrieve relevant chunks for a query manually
- generate answer with LLM manually
- create retriever + generation chain
- test end-to-end

In [11]:
from dotenv import load_dotenv
import os
from langchain_openai import AzureOpenAIEmbeddings, ChatOpenAI
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import numpy as np

load_dotenv()
AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT = os.getenv("EMBEDDING_DEPLOYMENT_NAME")
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY missing"

### Initialize embeddings model, LLM, and similarity functions

In [12]:
embeddings = AzureOpenAIEmbeddings(
    deployment=AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT,
    base_url=None,
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

# (Vector store will be created after chunks are generated)



In [13]:
llm = ChatOpenAI(model="gpt-4.1", temperature=0.0)
print("LLM ready.")

LLM ready.


In [14]:
def l2_distance(a, b):
    return np.linalg.norm(np.array(a) - np.array(b))

def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

### Task 1: Manually embed queries and compare distances/similarities

In [None]:
vector1 = embeddings.embed_query(...)
vector2 = embeddings.embed_query(...)
vector3 = embeddings.embed_query(...)

print("L2 distance (similar things):", l2_distance(vector1, vector2))
print("Cosine similarity (similar things):", cosine_similarity(vector1, vector2))

print("L2 distance (different things):", l2_distance(vector1, vector3))
print("Cosine similarity (different things):", cosine_similarity(vector1, vector3))
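
Before spending API tokens, it can help to see how the two measures behave on vectors small enough to check by hand. The sketch below uses only the standard library and three made-up 3-dimensional vectors (hypothetical values, not real model output); `toy_l2` and `toy_cosine` are illustrative names that mirror the NumPy helpers above.

```python
import math

def toy_l2(a, b):
    """Euclidean distance: smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def toy_cosine(a, b):
    """Cosine similarity: closer to 1 means more similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" chosen for illustration only.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.2, 0.9]

near_l2, far_l2 = toy_l2(cat, kitten), toy_l2(cat, invoice)
near_cos, far_cos = toy_cosine(cat, kitten), toy_cosine(cat, invoice)

print(f"similar:   L2={near_l2:.3f}  cosine={near_cos:.3f}")
print(f"different: L2={far_l2:.3f}  cosine={far_cos:.3f}")
```

The two measures point in opposite directions: for the similar pair the L2 distance is smaller and the cosine similarity is larger. The same pattern should appear with real embeddings in the task above.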

### Prepare dataset: load and split into chunks

Fetch up to 1000 random Wikipedia articles into `wiki_documents`; the cells below split these documents into chunks and index them.

In [26]:
import requests, time
from langchain_core.documents import Document

API_URL = "https://en.wikipedia.org/w/api.php"
UA = "Mozilla/5.0"

def fetch_random_wikipedia_articles(n=1000, per_call=20, batch_size=50, sleep=0.2):
    session = requests.Session()
    session.headers.update({"User-Agent": UA})

    titles = []
    remaining = n
    while remaining > 0:
        chunk = min(per_call, remaining)
        r = session.get(
            API_URL,
            params={
                "action": "query",
                "list": "random",
                "rnnamespace": 0,
                "rnlimit": chunk,
                "rnfilterredir": "nonredirects",  # << exclude redirects
                "format": "json",
            },
            timeout=30,
        )
        r.raise_for_status()
        pages = r.json().get("query", {}).get("random", [])
        if not pages:
            break  # stop instead of looping forever if the API returns no pages
        titles.extend(p["title"] for p in pages if "title" in p)
        remaining -= len(pages)
        time.sleep(sleep)

    documents_out = []
    for i in range(0, len(titles), batch_size):
        batch = titles[i:i+batch_size]
        resp = session.get(
            API_URL,
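            params={
                "action": "query",
                "prop": "extracts",
                "explaintext": 1,
                "exintro": 1,              # keep it small; optional
                # NOTE: the extracts API is understood to cap exlimit at 20, so a
                # batch_size above 20 may return text for only 20 titles per request.
                "exlimit": len(batch),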
                "titles": "|".join(batch),
                "redirects": 1,            # << follow redirects if any
                "format": "json",
            },
            timeout=60,
        )
        resp.raise_for_status()
        pages_dict = resp.json().get("query", {}).get("pages", {})
        for page in pages_dict.values():
            text = (page.get("extract") or "").strip()
            if not text:
                continue
            page_id = page.get("pageid")
            documents_out.append(
                Document(
                    page_content=text,
                    metadata={
                        "title": page.get("title"),
                        "pageid": page_id,
                        "article_url": f"https://en.wikipedia.org/?curid={page_id}" if page_id else None,
                    },
                )
            )
        time.sleep(sleep)

    return documents_out


print("Fetching 1000 random Wikipedia articles ")

wiki_documents = fetch_random_wikipedia_articles()
print(f"Retrieved {len(wiki_documents)} articles with non-empty text.")

Fetching 1000 random Wikipedia articles 
Retrieved 400 articles with non-empty text.
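
The recorded run returned 400 articles although 1000 titles were requested. One plausible explanation (an assumption, not verified against this run) is that the extracts endpoint returns at most 20 extracts per request, so each batch of 50 titles yields only 20 texts. The sketch below reproduces that arithmetic; `batched`, `expected_extracts`, and `MAX_EXTRACTS_PER_REQUEST` are hypothetical names introduced for illustration.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Assumption: the API returns at most this many extracts per request.
MAX_EXTRACTS_PER_REQUEST = 20

def expected_extracts(n_titles, batch_size, cap=MAX_EXTRACTS_PER_REQUEST):
    """Count the extracts that come back if every request is truncated at `cap`."""
    titles = [f"title-{i}" for i in range(n_titles)]  # placeholder titles
    return sum(min(len(batch), cap) for batch in batched(titles, batch_size))

print(expected_extracts(1000, batch_size=50))  # 400
print(expected_extracts(1000, batch_size=20))  # 1000
```

If the assumption holds, calling `fetch_random_wikipedia_articles(batch_size=20)` should return close to the full 1000 articles, at the cost of more requests.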


In [31]:
print("Showing 2 examples from the first 400 retrieved Wikipedia articles:\n")
for i, doc in enumerate(wiki_documents[:2]):
    print(f"Example {i+1}:")
    print(f"Title: {doc.metadata.get('title')}")
    print(f"Text (excerpt): {doc.page_content[:500]}")
    print(f"Metadata: {doc.metadata}\n")

Showing 2 examples from the first 400 retrieved Wikipedia articles:

Example 1:
Title: Belonocnema kinseyi
Text (excerpt): Belonocnema kinseyi is a species of gall wasp that forms galls on Quercus virginiana and Quercus fusiformis. There are both asexual and sexual generations. The asexual generation forms galls on the underside of leaves whereas the sexual generation form galls on the roots. It can be found in the United States, where it is known from Louisiana, Mississippi, Oklahoma and Texas. It, along with the other described Belonocnema species, have been used to study speciation.
Metadata: {'title': 'Belonocnema kinseyi', 'pageid': 76804159, 'article_url': 'https://en.wikipedia.org/?curid=76804159'}

Example 2:
Title: Botanist (disambiguation)
Text (excerpt): A botanist is a scientist specialized in botany.
It may also mean:

Botanist (liquor)
Botanist (band)
Metadata: {'title': 'Botanist (disambiguation)', 'pageid': 74197645, 'article_url': 'https://en.wikipedia.org/?curid=74197

### Initialize text splitter

In [34]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50
)
chunks = text_splitter.split_documents(wiki_documents)
print(f"Created {len(chunks)} chunks")

print(f"Original number of documents: {len(wiki_documents)}")
print(f"Resulting number of chunks: {len(chunks)}")

print("Showing 2 example chunks:\n")
for i, chunk in enumerate(chunks[:2]):
    print(f"--- Chunk {i} ---\n {chunk.page_content} \n")

Created 1345 chunks
Original number of documents: 400
Resulting number of chunks: 1345
Showing 2 example chunks:

--- Chunk 0 ---
 Elections to Liverpool City Council were held on 1 May 1975.  One third of the council was up for election. The terms of office of the Councillors elected in 1973 with the third highest number of 

--- Chunk 1 ---
 elected in 1973 with the third highest number of votes in each ward, expired and so these election results were compared with the 1973 results. Bill Smyth of the Liberal Party became the Leader of 
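
Chunk 1 above starts with the same words that end Chunk 0; that repetition is the effect of `chunk_overlap=50`. The sketch below shows the idea with a simplified fixed-width window. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break at separators such as paragraphs, lines, and spaces, so real overlaps vary in length); `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into fixed-width windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return windows

# 500 characters of repeating digits, so positions are easy to inspect.
demo_text = "".join(str(i % 10) for i in range(500))
demo_chunks = sliding_window_chunks(demo_text, chunk_size=200, chunk_overlap=50)

print(len(demo_chunks), [len(c) for c in demo_chunks])
print(demo_chunks[0][-50:] == demo_chunks[1][:50])
```

A larger overlap reduces the chance that a sentence is cut in half between two chunks, but it also increases the number of chunks and therefore the number of tokens sent to the embeddings model.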



### Create FAISS vector store from chunks

In [35]:
vector_store = FAISS.from_documents(chunks, embeddings)
print("FAISS vector store created.")

FAISS vector store created.


### Task 2: Manually retrieve relevant chunks for a query
Look through your Wikipedia articles and pick a topic you remember seeing. Formulate a query about that topic and use the vector store to retrieve relevant chunks. Print the retrieved chunks and check whether they are relevant to your query.

In [51]:
# score_threshold = 1.3
results = vector_store.similarity_search_with_score("...", k=3)
print(f"Retrieved {len(results)} results. Showing excerpts:\n")
for i, (doc, score) in enumerate(results):
    print(f"--- Result {i} (score: {score}) url: {doc.metadata['article_url']} ---\n {doc.page_content[:500]} \n")

Retrieved 3 results. Showing excerpts:

--- Result 0 (score: 1.5495978593826294) url: https://en.wikipedia.org/?curid=49537660 ---
 KK FMP. However, not only according to the club's official website, but also according to the official website of the Adriatic League, this club still competes in the Adriatic League. 

--- Result 1 (score: 1.5620182752609253) url: https://en.wikipedia.org/?curid=36493000 ---
 Departments: 

--- Result 2 (score: 1.567713737487793) url: https://en.wikipedia.org/?curid=774551 ---
 In a conversation with Space.com contributor Michael Paine, SCAP head Jin Zhu said that the program's allotted time to use the Schmidt telescope was significantly reduced to make room for the 
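
The scores above are distances, so lower means closer. Assuming the store uses a squared L2 metric and the embeddings are unit length (both are assumptions worth verifying for your setup), a score `d` maps to cosine similarity as `1 - d / 2`. The sketch below checks that identity on two small vectors and applies it to the scores printed above; `cosine_from_squared_l2` and `normalize` are hypothetical helpers.

```python
import math

def cosine_from_squared_l2(squared_l2):
    """For unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 1.0 - squared_l2 / 2.0

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Check the identity on two arbitrary unit vectors.
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
cosine = sum(x * y for x, y in zip(a, b))
assert math.isclose(cosine_from_squared_l2(squared_l2), cosine)

# Scores copied from the output above.
for score in (1.5495978593826294, 1.5620182752609253, 1.567713737487793):
    print(f"score={score:.4f} -> cosine ~ {cosine_from_squared_l2(score):.3f}")
```

Under those assumptions, all three results have a cosine similarity of roughly 0.22, which indicates weak matches. That is the kind of result the commented-out `score_threshold` is meant to filter out.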



### Task 3: Manually generate answer with LLM
This task completes the RAG pipeline.
In the previous cell we retrieved relevant chunks for a query. Now pass those chunks to the LLM as context and generate an answer.

In [44]:
# Reuses the `llm` initialized above.
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Use the provided context to answer the user's question. If unsure, say you don't know."),
    ("human", "Question: {question}\n\nContext:\n{context}\n\nAnswer:")
])

context = "\n\n".join(f"{d.metadata['article_url']}: \n {d.page_content}" for d, _ in results)

## YOUR CODE HERE
# query = prompt.format_prompt(...)
# response = llm.invoke(...)

...

### Now let's use built-in tools to create a RAG chain

In [47]:
def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Use the provided context to answer the user's question. If unsure, say you don't know."),
    ("human", "Question: {question}\n\nContext:\n{context}\n\nAnswer:")
])

#  "score_threshold": 1.3
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 4})

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print("RAG chain constructed.")

RAG chain constructed.
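
The `|` composition above can be hard to read at first. The sketch below is a plain-Python analogy of the same data flow, with stand-ins for the retriever and the model so it runs without any API calls. All function names here are hypothetical, and the real chain runs LangChain runnables rather than these functions.

```python
def fake_retriever(question):
    """Stand-in for `retriever`: returns canned texts instead of searching FAISS."""
    return ["Chunk about gall wasps.", "Chunk about oak trees."]

def format_docs_plain(texts):
    """Same role as `format_docs`: join chunk texts into one context string."""
    return "\n\n".join(texts)

def build_prompt(inputs):
    """Same role as `prompt`: fill the template from the input dict."""
    return f"Question: {inputs['question']}\n\nContext:\n{inputs['context']}\n\nAnswer:"

def fake_llm(prompt_text):
    """Stand-in for `llm`: reports the prompt size instead of calling a model."""
    return f"(model reply to a {len(prompt_text)}-character prompt)"

def run_chain(question):
    # Step 1: the dict sends the same input down both branches.
    inputs = {
        "context": format_docs_plain(fake_retriever(question)),  # retriever | format_docs
        "question": question,                                     # RunnablePassthrough()
    }
    # Steps 2 and 3: fill the prompt, then call the model.
    return fake_llm(build_prompt(inputs))

print(build_prompt({"question": "Wasps", "context": "..."}))
print(run_chain("Wasps"))
```

Note that `format_docs` in the chain joins only `page_content`, whereas the manual context in Task 3 also includes each chunk's `article_url`, so the chain's answers cannot cite sources unless `format_docs` is extended.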


In [49]:
question = "..."
answer = rag_chain.invoke(question)
print("Question:\n", question)
print("\nAnswer:\n", answer)

Question:
 Wasps

Answer:
 Chiloe micropteron is a species of wasp in the family Baeomorphidae, described from Chile. Belonocnema kinseyi is a gall wasp species that forms galls on oak trees (Quercus virginiana and Quercus fusiformis) and has both asexual and sexual generations.


In [None]:
# Persistence for vector store
PERSIST_DIR = "faiss_index"
os.makedirs(PERSIST_DIR, exist_ok=True)
vector_store.save_local(PERSIST_DIR)
print(f"Saved FAISS index to {PERSIST_DIR}")


In [None]:
reloaded = FAISS.load_local(PERSIST_DIR, embeddings, allow_dangerous_deserialization=True)
print("Reloaded store size:", len(reloaded.index_to_docstore_id))

# Home Assignment:
1. Complete the tasks marked as "YOUR CODE HERE" in the notebook.
2. Find a dataset of your choice (e.g. book reviews, product descriptions, news articles, wine reviews, cooking recipes, etc.)
3. Look for datasets on Kaggle, the UCI Machine Learning Repository, Hugging Face, or other open data sources. You may also scrape data from websites, but that is harder; if you do, use tools such as BeautifulSoup, Scrapy, or Selenium to extract the data.
4. Download and load the dataset into LangChain documents.
5. Split the documents into chunks.
6. Create a FAISS vector store from the chunks.
7. Create a RAG chain using the vector store and an LLM of your choice.
8. Test the RAG chain with some queries relevant to your dataset.
9. Include instructions on how to run your code and reproduce your results: where to download the dataset, how to set up environment variables, etc.
10. Reuse the above examples and code as much as possible.
11. You may use LLMs to assist you in writing code, but make sure to understand and be able to explain everything you submit.
12. Submit your completed notebook, along with any additional files required to run it, to your Google Drive folder.
13. Deadline: Next Monday (17.11.25) 23:59


## !!! The budget for Azure OpenAI usage is limited. Be mindful of the number of tokens you use to run embeddings. Try not to exceed 5 million tokens in a single run !!!
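
One way to stay within the budget is to estimate the token count before creating the vector store. The sketch below uses a rough rule of thumb of about 4 characters per token for English text; that ratio is an assumption, not an exact figure, and a real tokenizer would give more accurate counts. `estimate_tokens` and `check_budget` are hypothetical helpers.

```python
import math

CHARS_PER_TOKEN = 4        # rough rule of thumb for English text; an assumption
TOKEN_BUDGET = 5_000_000   # the per-run limit stated above

def estimate_tokens(texts, chars_per_token=CHARS_PER_TOKEN):
    """Approximate the number of tokens in a list of strings."""
    total_chars = sum(len(t) for t in texts)
    return math.ceil(total_chars / chars_per_token)

def check_budget(texts, budget=TOKEN_BUDGET):
    """Return the estimate and whether it fits within the budget."""
    estimated = estimate_tokens(texts)
    return estimated, estimated <= budget

# Stand-in data: 1345 chunks of 200 characters, matching the run above in size only.
sample_texts = ["x" * 200] * 1345
estimated, within_budget = check_budget(sample_texts)
print(f"~{estimated} tokens, within budget: {within_budget}")
```

In the notebook, the same check could be run as `check_budget([c.page_content for c in chunks])` before calling `FAISS.from_documents`.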