# Welcome to the RAG Project!

### In this project, we work through the stages from dataset selection to connecting to external artefacts (VectorDB, APIs), gaining a comprehensive understanding of RAG’s components and their integration.


### In this notebook, we set up the frameworks and connect to OpenAI.
### First, let's install the packages needed for the project.

In [9]:
pip install -U google-cloud-aiplatform google-auth google-auth-oauthlib google-auth-httplib2

Note: you may need to restart the kernel to use updated packages.


In [68]:
pip install gradio

Note: you may need to restart the kernel to use updated packages.


In [42]:
pip install ipywidgets

Note: you may need to restart the kernel to use updated packages.


In [22]:
pip install -U langchain langchain-community google-cloud-aiplatform pydantic

Collecting pydantic
  Using cached pydantic-2.10.6-py3-none-any.whl.metadata (30 kB)
Collecting pydantic-core==2.27.2 (from pydantic)
  Downloading pydantic_core-2.27.2-cp312-cp312-win_amd64.whl.metadata (6.7 kB)
Collecting typing-extensions (from google-cloud-aiplatform)
  Using cached typing_extensions-4.12.2-py3-none-any.whl.metadata (3.0 kB)
Using cached pydantic-2.10.6-py3-none-any.whl (431 kB)
Downloading pydantic_core-2.27.2-cp312-cp312-win_amd64.whl (2.0 MB)
Using cached typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Installing collected packages: typing-extensions, pydantic-core, pydantic
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.11.0
    Uninstalling typing_extensions-4.11.0:
      Successfully uninstal

In [118]:
import os
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
import pandas as pd
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Here we have the data set!
### The data set we chose is the Finance_laws data set!
### Below, we take a brief look at the text files after cleaning.

In [116]:
text_files_path = r"C:\Users\torre\Desktop\rag_project-main\data\cleaned_texts"

text_files = [
    "AMLD_EURLEX.txt", "BRRD_EURLEX.txt", "CRD_EURLEX.txt", "CRR_EURLEX.txt",
    "DGSD_EURLEX.txt", "IFD_EURLEX.txt", "IFR_EURLEX.txt", "MCD_EURLEX.txt",
    "PSD_EURLEX.txt", "SecReg_EURLEX.txt"
]

text_data = []

# Analyze each text file
for file in text_files:
    file_path = os.path.join(text_files_path, file)
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            content = f.read()
            num_words = len(content.split())
            num_chars = len(content)
            num_lines = content.count("\n")

            text_data.append({
                "File": file,
                "Words": num_words,
                "Characters": num_chars,
                "Lines": num_lines
            })
    except FileNotFoundError:
        print(f"File not found: {file_path}")

df_text_analysis = pd.DataFrame(text_data)
df_text_analysis

| | File | Words | Characters | Lines |
|---|---|---|---|---|
| 0 | AMLD_EURLEX.txt | 22303 | 145245 | 0 |
| 1 | BRRD_EURLEX.txt | 86460 | 554551 | 0 |
| 2 | CRD_EURLEX.txt | 64604 | 416530 | 0 |
| 3 | CRR_EURLEX.txt | 217635 | 1346844 | 0 |
| 4 | DGSD_EURLEX.txt | 14601 | 90371 | 0 |
| 5 | IFD_EURLEX.txt | 26612 | 173114 | 0 |
| 6 | IFR_EURLEX.txt | 34860 | 220766 | 0 |
| 7 | MCD_EURLEX.txt | 36690 | 232545 | 0 |
| 8 | PSD_EURLEX.txt | 48733 | 310048 | 0 |
| 9 | SecReg_EURLEX.txt | 29832 | 195097 | 0 |


## Here, we process the legal text files and create a FAISS vector database for retrieval-augmented generation (RAG) using HuggingFace’s MiniLM embeddings.

In [120]:
# Loading the embedding model
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Reusing the cleaned-text directory defined earlier (text_files_path)
documents = []

# Creating the splitter once: chunks of up to 1000 characters with a 200-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Looping through all .txt files in the folder
for filename in os.listdir(text_files_path):
    if filename.endswith(".txt"):
        # Reading the text content and splitting it into chunks
        with open(os.path.join(text_files_path, filename), "r", encoding="utf-8") as file:
            text = file.read()
        chunks = splitter.split_text(text)
        # Storing each chunk as a Document tagged with its source file
        for chunk in chunks:
            documents.append(Document(page_content=chunk, metadata={"source": filename}))
                
# Creating a FAISS vector database from chunks
vector_db = FAISS.from_documents(documents, embedding_model)

# Save the FAISS index locally
vector_db.save_local("regulatory_vectors")
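
To get a feel for what `chunk_size=1000` and `chunk_overlap=200` mean, here is a minimal sketch of overlapping chunking using a plain fixed-width sliding window. This is a simplification and an assumption on our side: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraphs and sentences, so its real chunk boundaries and counts will differ. The helper names `sliding_window_chunks` and `estimate_chunk_count` are hypothetical and not part of LangChain.

```python
import math


def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-width character windows that overlap by chunk_overlap.

    Simplified illustration only; not how RecursiveCharacterTextSplitter picks boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far the window moves each step
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += stride
    return chunks


def estimate_chunk_count(num_chars: int, chunk_size: int = 1000, chunk_overlap: int = 200) -> int:
    """Estimate how many sliding-window chunks a text of num_chars characters yields."""
    if num_chars <= 0:
        return 0
    if num_chars <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap
    return 1 + math.ceil((num_chars - chunk_size) / stride)


# Tiny demonstration with small numbers so the overlap is easy to see
sample = "".join(str(i % 10) for i in range(25))
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)

# Rough size of the largest file (CRR_EURLEX.txt, 1,346,844 characters) under this simplification
print(estimate_chunk_count(1346844))
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence cut at a boundary still appears whole in at least one chunk. The price is a larger index, since every chunk is embedded separately.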

## Next, we load the FAISS vector database, retrieve relevant documents, and use GPT-4 to answer a question based on the retrieved information.


In [126]:
# Setting up the OpenAI API key: read it from the environment (or prompt for it) instead of hard-coding it
from getpass import getpass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

# Loading the FAISS vector database
vector_db = FAISS.load_local(
    "regulatory_vectors",
    embeddings=embedding_model,
    allow_dangerous_deserialization=True
)

# Creating a retriever
retriever = vector_db.as_retriever()

# Asking a question!
query = "What are the capital requirements under CRR?"

# Retrieving relevant documents from FAISS
retrieved_docs = retriever.get_relevant_documents(query)

# Printing retrieved documents 
print("\n Retrieved Documents from FAISS:")
for i, doc in enumerate(retrieved_docs, 1):
    source_file = doc.metadata.get("source", "Unknown File")  # Extract source file name
    print(f"\n Document {i} (Source: {source_file}):")
    print(doc.page_content)  # Printing the actual content retrieved from FAISS

# Loading GPT-4
llm = ChatOpenAI(model_name="gpt-4")

# Create the RAG pipeline
qa_chain = RetrievalQA.from_chain_type(
    llm=llm, retriever=retriever
)

# Generating a final answer
response = qa_chain.run(query)

print("\n Response:")
print(response)


 Retrieved Documents from FAISS:

 Document 1 (Source: CRR_EURLEX.txt):
Such standards need not, however be applied where the parent under­ taking is a financial holding company or a credit insti­ tution or where the other subsidiaries are either credit or financial institutions or undertakings offering ancillary services, provided that all such undertakings are covered by the supervision of the credit institution on a consolidated basis. (61) In view of the risk-sensitivity of the rules relating to capital requirements, it is desirable to keep under review whether these have significant effects on the economic cycle. The Commission, taking into account the contribution of the European Central Bank (ECB), should report on these aspects to the European Parliament and to the Council. (62) The capital requirements for commodity dealers, including those dealers currently exempt from the requirements of Directive 2004/39/EC of the European Parliament and of the Council of 21 April 2004 on 
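
`RetrievalQA.from_chain_type` is called above without a `chain_type`, so it uses the default "stuff" strategy: the retrieved chunks are placed together into a single prompt along with the question. The sketch below shows that idea in plain Python. The `Doc` class, the `build_stuffed_prompt` function and the prompt wording are all hypothetical stand-ins for illustration; LangChain's actual prompt template is worded differently.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_stuffed_prompt(question: str, docs: list[Doc]) -> str:
    """Concatenate all retrieved chunks into one context block, then append the question."""
    context = "\n\n".join(
        f"[{d.metadata.get('source', 'Unknown')}]\n{d.page_content}" for d in docs
    )
    return (
        "Use the following context to answer the question. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Example with two made-up chunks (not real excerpts from the data set)
example_docs = [
    Doc("Example chunk about own funds.", {"source": "CRR_EURLEX.txt"}),
    Doc("Example chunk without a source."),
]
print(build_stuffed_prompt("What are the capital requirements under CRR?", example_docs))
```

Because every retrieved chunk goes into the same prompt, the number of chunks returned by the retriever and the chunk size together determine how much of the model's context window is used.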

## Evaluating the RAG System

In [153]:
# Generating test queries for the evaluation
test_queries = [
    {"query": "What are the capital requirements under CRR?", 
     "expected": "CRR requires banks to maintain a minimum capital ratio of 8% of risk-weighted assets."},
    
    {"query": "How does AMLD define money laundering?", 
     "expected": "According to AMLD, money laundering is defined as the process of concealing illicitly obtained funds."},

    {"query": "Explain the differences between PSD and PSD2.", 
     "expected": "PSD2 introduces enhanced security measures, including strong customer authentication (SCA)."}
]

# Adding a similarity model for the evaluation
from sentence_transformers import SentenceTransformer, util

similarity_model = SentenceTransformer("all-MiniLM-L6-v2")

def compute_similarity(expected, generated):
    return util.pytorch_cos_sim(
        similarity_model.encode(expected, convert_to_tensor=True),
        similarity_model.encode(generated, convert_to_tensor=True)
    ).item()
    
# Evaluating the RAG system
results = []
for test_case in test_queries:
    query = test_case["query"]
    expected_answer = test_case["expected"]

    # Retrieving documents from FAISS
    retrieved_docs_with_scores = vector_db.similarity_search_with_score(query, k=3)
    retrieved_info = []
    for doc, score in retrieved_docs_with_scores:
        retrieved_info.append(f"{doc.metadata.get('source', 'Unknown')} (Score: {score:.4f})")

    # Getting GPT-4's generated response
    response = qa_chain.run(query)
    
    similarity_score = compute_similarity(expected_answer, response)
    
    # Extracting the source files of the documents retrieved for this query
    retrieved_sources = [doc.metadata.get("source", "Unknown File") for doc, _ in retrieved_docs_with_scores]
    
    # Storing results
    results.append({
        "Query": query,
        "Expected Answer": expected_answer,
        "Retrieved Sources & Scores": retrieved_info,
        "Generated Response": response,
        "Similarity Score": similarity_score
    })

# Displaying evaluation results (pandas was already imported as pd above)
df_results = pd.DataFrame(results)
df_results

| | Query | Expected Answer | Retrieved Sources & Scores | Generated Response | Similarity Score |
|---|---|---|---|---|---|
| 0 | What are the capital requirements under CRR? | CRR requires banks to maintain a minimum capit... | [CRR_EURLEX.txt (Score: 0.8198), CRR_EURLEX.tx... | The Capital Requirements Regulation (CRR) outl... | 0.680368 |
| 1 | How does AMLD define money laundering? | According to AMLD, money laundering is defined... | [AMLD_EURLEX.txt (Score: 0.7128), AMLD_EURLEX.... | According to Article 1 of the definitions, mon... | 0.849591 |
| 2 | Explain the differences between PSD and PSD2. | PSD2 introduces enhanced security measures, in... | [CRR_EURLEX.txt (Score: 1.4864), AMLD_EURLEX.t... | I'm sorry, but the context provided does not c... | 0.519274 |
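
The table mixes two kinds of numbers, so here is a small sketch of how they relate. The "Similarity Score" column is a cosine similarity (higher means more alike). The retrieval "Score" is, as far as we understand LangChain's FAISS wrapper, a distance by default (lower means closer), which would fit the third query: its best score is 1.4864, much larger than 0.8198 and 0.7128, and the model could not answer from the context. The sketch assumes unit-length embeddings and a squared L2 distance; both are assumptions about this setup that we have not verified here. The function names are our own.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors: 0 = identical."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two toy unit-length vectors (made up for illustration, not real embeddings)
u = [0.6, 0.8]
v = [1.0, 0.0]

cos = cosine_similarity(u, v)
dist = squared_l2(u, v)

# For unit-length vectors the two measures are linked: dist = 2 - 2 * cos
print(cos, dist, 2 - 2 * cos)
```

Under these assumptions, a distance near 0 corresponds to a cosine similarity near 1, and a distance above 1 corresponds to a cosine similarity below 0.5, which is a weak match.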


## Bonus: Use an LLM as a judge to generate questions and evaluate your RAG's answers.

In [130]:
from langchain_openai import ChatOpenAI

# Loading GPT-4 Model
judge_llm = ChatOpenAI(model_name="gpt-4")

# Generate questions based on retrieved content
def generate_test_queries(doc_text, num_questions=3):
    prompt = f"""
    Based on the following document content, generate {num_questions} test questions:
    
    {doc_text}

    Ensure that the questions test key factual details from the document.
    """
    return judge_llm.invoke(prompt)

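# Example usage (using a sample retrieved document)
# Note: `retrieved_text` is not defined earlier in this notebook; as an assumption,
# we use the first document retrieved for the query above.
retrieved_text = retrieved_docs[0].page_content
test_questions = generate_test_queries(retrieved_text)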
print("\n🔹 Generated Test Questions:")
print(test_questions)



🔹 Generated Test Questions:
content="1. What does the CRR regulation require banks to maintain?\n2. What is the minimum capital adequacy ratio required for banks as per the CRR regulation?\n3. According to the CRR regulation, what is the minimum percentage a bank's capital adequacy ratio should be?" additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 58, 'prompt_tokens': 58, 'total_tokens': 116, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4-0613', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None} id='run-2948e828-a562-4c4d-ac47-438325527fd6-0' usage_metadata={'input_tokens': 58, 'output_tokens': 58, 'total_tokens': 116, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}
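
The output above is the full message object, and the questions themselves sit in its `content` field as a numbered list. To actually use the LLM as a judge, we first need the questions as a Python list, and then a grading prompt for each RAG answer. Here is a minimal sketch of both steps. The helper names, the 1 to 5 scale and the prompt wording are hypothetical choices of ours, and no API call is made in this block.

```python
import re


def parse_numbered_questions(content: str) -> list[str]:
    """Extract questions from lines that start with '1.', '2)' and so on."""
    questions = []
    for line in content.splitlines():
        match = re.match(r"^\s*\d+[.)]\s*(.+?)\s*$", line)
        if match:
            questions.append(match.group(1))
    return questions


def build_judge_prompt(question: str, reference_text: str, rag_answer: str) -> str:
    """Build a grading prompt that asks the judge model for a 1-5 score (hypothetical rubric)."""
    return (
        "You are grading an answer produced by a retrieval-augmented system.\n\n"
        f"Question: {question}\n\n"
        f"Reference text:\n{reference_text}\n\n"
        f"Answer to grade:\n{rag_answer}\n\n"
        "Rate the answer from 1 (wrong or unsupported) to 5 (correct and fully supported "
        "by the reference text). Reply with the number followed by a one-sentence reason."
    )


# The content string printed above, copied here so the example runs on its own
generated_content = (
    "1. What does the CRR regulation require banks to maintain?\n"
    "2. What is the minimum capital adequacy ratio required for banks as per the CRR regulation?\n"
    "3. According to the CRR regulation, what is the minimum percentage a bank's capital "
    "adequacy ratio should be?"
)

for q in parse_numbered_questions(generated_content):
    print(q)
```

In the notebook, the same parser could be applied to `test_questions.content`, each question sent through `qa_chain`, and the resulting prompt passed to `judge_llm.invoke(...)` to obtain a grade.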


## Bonus: Provide an interactive demo of your RAG system.

In [99]:
import os
import gradio as gr
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI


# Creating a RAG QA pipeline
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

# Function to process query
def rag_pipeline(query):
    retrieved_docs = retriever.get_relevant_documents(query)

    # Format retrieved documents
    retrieved_text = "Retrieved Documents:\n"
    for i, doc in enumerate(retrieved_docs, 1):
        retrieved_text += f"\n Document {i} ( Source: `{doc.metadata.get('source', 'Unknown')}`)\n"
        retrieved_text += f"{doc.page_content[:500]}...\n"
    
    response = qa_chain.run(query)

    return retrieved_text, f"AI Response:\n\n{response}"

# Defining Gradio UI
with gr.Blocks() as demo:
    gr.Markdown("<h1 style='text-align: center; color: #0F52BA;'>RAG System Interactive Demo</h1>")
    gr.Markdown("<p style='text-align: center;'>Ask a regulatory question and get AI-powered answers with retrieved documents!</p>")

    with gr.Row():
        query_box = gr.Textbox(label="Enter Your Question:", placeholder="Type your question here...", interactive=True)
        submit_button = gr.Button("Get Answer", variant="primary")

    output_1 = gr.Textbox(label="Retrieved Legal Documents", lines=10, interactive=False)
    output_2 = gr.Textbox(label="AI-Generated Response", lines=10, interactive=False)

    submit_button.click(fn=rag_pipeline, inputs=query_box, outputs=[output_1, output_2])

# Running Gradio UI
demo.launch(share=True)


* Running on local URL:  http://127.0.0.1:7871
* Running on public URL: https://2b62fe53988fafa8c6.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


