# RAG System for CFA Curriculum (Google Colab Version)

This notebook implements a full Retrieval-Augmented Generation (RAG) pipeline to create high-quality, exam-style questions and flashcards from the CFA curriculum. This version is specifically adapted for use in Google Colab.

### **System Architecture**

1.  **Indexing Pipeline (Offline):**
    * **Load:** Read all `.md` files from the Colab environment.
    * **Chunk:** Split documents into meaningful sections based on Markdown headers.
    * **Embed & Store:** Convert text chunks into vector embeddings and store them in a `ChromaDB` vector store in the Colab runtime.

2.  **Generation Pipeline (Online):**
    * **Query:** The user specifies a Learning Outcome Statement (LOS) ID.
    * **Retrieve:** Fetch the most relevant text chunks for that LOS from the vector store.
    * **Generate:** Prompt an LLM (Gemini 1.5 Pro on Vertex AI) with the retrieved context and a task-specific template to produce structured JSON output (an MCQ or a flashcard).
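
The two pipelines above can be sketched without any external services. The snippet below is only a toy illustration, not the notebook's implementation: it splits Markdown on `#`/`##` headers, keeps an in-memory list instead of `ChromaDB`, and ranks by shared-word count as a stand-in for embedding similarity. The names `split_on_headers`, `build_index`, and `retrieve` are hypothetical.

```python
import re
from collections import Counter

def split_on_headers(markdown_text):
    """Split Markdown into (header, body) sections at '#' and '##' headers."""
    sections, header, body = [], None, []
    for line in markdown_text.splitlines():
        if re.match(r"^#{1,2} ", line):
            if header is not None or body:
                sections.append((header, "\n".join(body).strip()))
            header, body = line.lstrip("# ").strip(), []
        else:
            body.append(line)
    sections.append((header, "\n".join(body).strip()))
    # Drop sections that have a header but no text under it
    return [(h, b) for h, b in sections if b]

def build_index(docs):
    """Indexing: docs maps los_id -> Markdown text; returns chunk dicts with metadata."""
    index = []
    for los_id, text in docs.items():
        for header, body in split_on_headers(text):
            index.append({"los_id": los_id, "header": header, "text": body})
    return index

def retrieve(index, los_id, query, k=3):
    """Retrieval: filter by los_id, then rank by number of words shared with the query."""
    query_words = Counter(query.lower().split())
    candidates = [c for c in index if c["los_id"] == los_id]

    def score(chunk):
        return sum((Counter(chunk["text"].lower().split()) & query_words).values())

    return sorted(candidates, key=score, reverse=True)[:k]
```

The metadata filter runs before ranking, which mirrors how the real chain restricts the similarity search to a single LOS.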

## 1. Setup and Installations

First, we install the necessary Python libraries for this project.

In [2]:
%pip install -q --upgrade langchain langchain_community langchain_google_vertexai google-cloud-aiplatform chromadb pydantic unstructured markdown

Note: you may need to restart the kernel to use updated packages.


### 1.1. Google Cloud Authentication

In Google Colab, we authenticate using the built-in `google.colab.auth` library. This will prompt you to log in to your Google account.

In [None]:
from google.colab import auth
auth.authenticate_user()

### 1.2. Configure Google Cloud Project

Enter your Google Cloud Project ID in the form below. Then, we initialize Vertex AI.

In [None]:
import vertexai

# --- CONFIGURATION ---
PROJECT_ID = "your-gcp-project-id"  #@param {type:"string"}
LOCATION = "us-central1"
# ---------------------

vertexai.init(project=PROJECT_ID, location=LOCATION)
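
Leaving `PROJECT_ID` at its placeholder value is an easy mistake that only surfaces later, when the first Vertex AI call fails. A minimal sketch of a fail-fast check, assuming the placeholder string used above; `check_gcp_config` is a hypothetical helper, not part of the Vertex AI SDK:

```python
def check_gcp_config(project_id, location):
    """Raise ValueError early if the project ID or location is obviously unusable."""
    placeholder = "your-gcp-project-id"  # the default value in the configuration cell
    if not project_id or project_id.strip() == placeholder:
        raise ValueError(
            "Set PROJECT_ID to a real Google Cloud project ID before calling vertexai.init()."
        )
    if not location or not location.strip():
        raise ValueError("LOCATION must be a non-empty region such as 'us-central1'.")
    return project_id.strip(), location.strip()
```

Calling `check_gcp_config(PROJECT_ID, LOCATION)` just before `vertexai.init(...)` would stop the notebook at the configuration cell rather than at the embedding step.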

## 2. The Indexing Pipeline

This is the offline process where we prepare our knowledge base. It only needs to be run once, or whenever the underlying curriculum documents are updated.

**Note on File Storage:** The files created in this section are stored in the temporary Colab runtime. They will be deleted when the runtime is disconnected. For persistent storage, you can mount your Google Drive (see next cell).

### (Optional) Mount Google Drive for Persistent Storage

The cell below is commented out by default. Uncomment it, run it, and follow the authentication prompts to mount your Google Drive. You can then change the paths used in the following cells (`curriculum_path`, `db_path`) to save your files directly to your Drive.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

# # Example of changing paths to use Google Drive
# curriculum_path = '/content/drive/MyDrive/cfa_curriculum_md/'
# db_path = "/content/drive/MyDrive/cfa_vector_db"

### 2.1. Load Markdown Documents

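First, you need to upload your curriculum files. In the Colab file browser on the left, create a folder that matches `curriculum_path` in the cell below (`./markdown/` by default) and upload your `.md` files into it.

The cell below also contains commented-out code that writes two dummy files for demonstration purposes. Note that their names (for example `Reading_42_LOS_a.md`) do not follow the `LM12LOS40` filename pattern that section 2.2 expects, so rename them or adjust the regex there if you use them.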

In [1]:
import os
from langchain_community.document_loaders import DirectoryLoader

# Use local Colab path by default
curriculum_path = "./markdown/"
db_path = "./cfa_vector_db"

# Create a dummy directory and files for demonstration purposes
# if not os.path.exists(curriculum_path):
#     os.makedirs(curriculum_path)

# dummy_content_a = """
# # Reading 42: Correlation and Regression
# ## LOS 42.a: Differentiate between correlation and covariance
# Covariance measures the directional relationship between two variables. A positive covariance means that variables move in the same direction, while a negative covariance means they move in opposite directions. The formula for population covariance is Cov(X,Y) = Σ[(Xi - μx)(Yi - μy)] / N.
# Correlation is a standardized measure of the linear relationship between two variables. It is calculated by dividing the covariance by the product of the standard deviations of the two variables. The value of correlation is always between -1 and +1.
# """
# dummy_content_b = """
# # Reading 42: Correlation and Regression
# ## LOS 42.b: Explain the properties of correlation
# The correlation coefficient has several key properties. It is a unitless measure. The correlation of a variable with itself is always 1. A correlation of 0 indicates no linear relationship, but does not rule out a non-linear one.
# """

# with open(os.path.join(curriculum_path, "Reading_42_LOS_a.md"), "w") as f:
#     f.write(dummy_content_a)
# with open(os.path.join(curriculum_path, "Reading_42_LOS_b.md"), "w") as f:
#     f.write(dummy_content_b)

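# Load all Markdown files as raw text. The default Unstructured-based loader
# returns plain text with the Markdown syntax removed, which leaves the
# header-based splitter in section 2.2 with no '#' markers to split on.
from langchain_community.document_loaders import TextLoader

loader = DirectoryLoader(
    curriculum_path,
    glob="**/*.md",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
    show_progress=True,
)
docs = loader.load()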

print(f"Loaded {len(docs)} documents.")

100%|██████████| 204/204 [00:09<00:00, 20.95it/s]

Loaded 204 documents.





### 2.2. Chunk Documents and Add Metadata

In [2]:
import re
from langchain_text_splitters import MarkdownHeaderTextSplitter

def extract_los_from_path(path):
    """Extract the learning module (LM) and LOS numbers from a filename such as 'LM12LOS40.pdf.md'.

    Returns them as 'LM:LOS' (e.g. '12:40'), or None if the path does not match.
    """
    # Two digits after 'LM' and two digits after 'LOS', case-insensitive
    match = re.search(r'LM(\d{2})LOS(\d{2})', path, re.IGNORECASE)
    if match:
        return f"{match.group(1)}:{match.group(2)}"
    return None

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

all_chunks = []
for doc in docs:
    chunks = markdown_splitter.split_text(doc.page_content)
    file_path = doc.metadata.get('source', '')
    lm_los = extract_los_from_path(file_path)
    if lm_los is None:
        # Files whose names do not match the LMxxLOSxx pattern are skipped entirely
        print(f"Warning: Could not extract LOS from file path: {file_path}")
        continue
    # lm_los has the form '12:40'; split it into the module ID ('12') and the LOS ID ('40')
    lm_id, los_id = lm_los.split(':')

    for chunk in chunks:
        chunk.metadata['lm_id'] = lm_id
        chunk.metadata['los_id'] = los_id
        chunk.metadata['source_file'] = file_path
        all_chunks.append(chunk)

print(f"Split {len(docs)} documents into {len(all_chunks)} chunks.")
print("\n--- Example Chunk ---")
print(all_chunks[0].page_content)
print("\n--- Example Metadata ---")
print(all_chunks[0].metadata)

Split 204 documents into 192 chunks.

--- Example Chunk ---
in-dex, noun (pl.in•dex.es or in•di•ces) Latin indic-, index, from indicare to indicate: an indicator, sign, or measure of something.  
ORIGIN OF MARKET INDEXES  
Investors had access to regularly published data on individual security prices in London as early as 1698, but nearly 200 years passed before they had access to a simple indicator to reflect security market information. To give readers a sense of how the US stock market in general performed on a given day, publishers Charles H. Dow and Edward D. Jones introduced the Dow Jones Average, the world's first security market index, in 1884. The index, which appeared in The Customers' Afternoon Letter, consisted of the stocks of nine railroads and two industrial companies. It eventually became the Dow Jones Transportation Average. Convinced that industrial companies, rather than railroads, would be "the great speculative market" of the future, Dow and Jones introduced a seco
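
Because files that do not match the filename pattern are skipped, it can help to check the names before chunking. The sketch below reuses the same regex as `extract_los_from_path`; `partition_by_pattern` and the sample paths are hypothetical and only illustrate which shapes of filename match.

```python
import re

# Same pattern as extract_los_from_path: two digits after 'LM' and after 'LOS'
LM_LOS_PATTERN = re.compile(r"LM(\d{2})LOS(\d{2})", re.IGNORECASE)

def partition_by_pattern(paths):
    """Return (matched, skipped): matched maps path -> (lm_id, los_id)."""
    matched, skipped = {}, []
    for path in paths:
        m = LM_LOS_PATTERN.search(path)
        if m:
            matched[path] = (m.group(1), m.group(2))
        else:
            skipped.append(path)
    return matched, skipped

# Illustrative paths only; replace with [d.metadata["source"] for d in docs]
sample_paths = [
    "markdown/LM12LOS40.pdf.md",
    "markdown/lm03los07.pdf.md",
    "markdown/Reading_42_LOS_a.md",  # no 'LM..LOS..' token, so it is skipped
    "markdown/LM1LOS4.pdf.md",       # single digits do not satisfy \d{2}
]
matched, skipped = partition_by_pattern(sample_paths)
print(f"{len(matched)} matched, {len(skipped)} skipped")
```

Note that the stored `los_id` is the two-digit string from the filename (such as `40`), so any later metadata filter has to use that same form.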

### 2.3. Embed and Store in Vector Database

In [3]:
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = VertexAIEmbeddings(model_name="text-embedding-004")

print(f"Creating and persisting vector store at: {db_path}")
vector_db = Chroma.from_documents(
    documents=all_chunks,
    embedding=embeddings,
    persist_directory=db_path
)
print("Vector store created successfully!")

GoogleAuthError: Unable to find your project. Please provide a project ID by:
- Passing a constructor argument
- Using vertexai.init()
- Setting project using 'gcloud config set project my-project'
- Setting a GCP environment variable
- To create a Google Cloud project, please follow guidance at https://developers.google.com/workspace/guides/create-project

## 3. The Generation Pipeline

Now we build the real-time part of the system. We'll load the vector store from disk and create the LangChain RAG chain to generate content.

### 3.1. Load Existing Vector Store

In [None]:
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = VertexAIEmbeddings(model_name="text-embedding-004")
vector_db_retriever = Chroma(
    persist_directory=db_path, 
    embedding_function=embeddings
)

print("Vector store loaded from disk.")

### 3.2. Prompt Engineering

In [None]:
from langchain.prompts import PromptTemplate

MCQ_PROMPT_TEMPLATE = """
You are an expert CFA exam question writer. Your task is to create a challenging, exam-style multiple-choice question based *only* on the provided context.
**Instructions:**
1. The question must directly test a key concept, formula, or definition from the context.
2. Generate two plausible distractors (incorrect options) that represent common mistakes or misunderstandings related to the topic, so that the question has exactly three options (A, B, C).
3. The correct answer must be explicitly supported by the text.
4. Provide a brief but clear explanation for why the correct answer is right and why the distractors are wrong, citing the context.
5. Output the result in a single, clean JSON object. Do not include any text outside of the JSON block.
**Context from LOS {los_id}:**
---
{retrieved_text}
---
**JSON Output Format:**
```json
{{
  "question": "The question text...",
  "options": {{
    "A": "Option A",
    "B": "Option B",
    "C": "Option C"
  }},
  "answer": "B",
  "explanation": "The correct answer is B because [...]. Option A is incorrect because [...]. Option C is incorrect because [...]."
}}
```
"""

FLASHCARD_PROMPT_TEMPLATE = """
You are an expert CFA content creator specializing in study aids. Your task is to create a concise and effective flashcard based *only* on the provided context.
**Instructions:**
1. Identify the single most important term, concept, or formula from the context. This will be the "front" of the flashcard.
2. Create a clear, concise definition or explanation for the "back" of the flashcard.
3. Ensure the content is directly derived from the provided text.
4. Output the result in a single, clean JSON object. Do not include any text outside of the JSON block.
**Context from LOS {los_id}:**
---
{retrieved_text}
---
**JSON Output Format:**
```json
{{
  "front": "Term or Concept Name",
  "back": "Concise definition, explanation, or formula."
}}
```
"""

mcq_prompt = PromptTemplate(
    input_variables=['los_id', 'retrieved_text'],
    template=MCQ_PROMPT_TEMPLATE
)

flashcard_prompt = PromptTemplate(
    input_variables=['los_id', 'retrieved_text'],
    template=FLASHCARD_PROMPT_TEMPLATE
)

print("Prompt templates created.")
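
A detail worth checking in templates that embed a JSON example: literal braces collide with placeholder syntax. The sketch below uses plain `str.format`, on the assumption that `PromptTemplate`'s default f-string style follows the same brace rules; `try_format` and the two short templates are hypothetical.

```python
def try_format(template, **values):
    """Format the template, returning a short error description instead of raising."""
    try:
        return template.format(**values)
    except (KeyError, ValueError, IndexError) as exc:
        return f"format failed: {type(exc).__name__}"

# Single braces around the JSON example are parsed as a placeholder
template_unescaped = 'Context: {retrieved_text}\nReturn: {"front": "..."}'
# Doubled braces are emitted as literal { and }
template_escaped = 'Context: {retrieved_text}\nReturn: {{"front": "..."}}'

print(try_format(template_unescaped, retrieved_text="Covariance measures direction."))
print(try_format(template_escaped, retrieved_text="Covariance measures direction."))
```

Only `{los_id}` and `{retrieved_text}` should appear with single braces in the prompt templates; every brace that belongs to the JSON example needs doubling.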

### 3.3. Build the RAG Chain

In [None]:
import json
from langchain_google_vertexai import ChatVertexAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

llm = ChatVertexAI(
    model_name="gemini-1.5-pro-preview-0409",
    temperature=0.3,
    generation_config={"response_mime_type": "application/json"}
)

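def retrieve_context(input_dict):
    """Fetch up to three chunks whose 'los_id' metadata equals the requested LOS."""
    los_id_to_query = input_dict["los_id"]
    retriever = vector_db_retriever.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 3, "filter": {"los_id": los_id_to_query}}
    )
    retrieved_docs = retriever.invoke(f"content for LOS {los_id_to_query}")
    if not retrieved_docs:
        # Without context the model would have nothing to ground its answer in
        raise ValueError(
            f"No chunks found for LOS '{los_id_to_query}'. "
            "Check that it matches the 'los_id' metadata stored during indexing."
        )
    retrieved_text = "\n\n---\n\n".join([doc.page_content for doc in retrieved_docs])
    return {"retrieved_text": retrieved_text, "los_id": los_id_to_query}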

# A parser function to safely clean and load the JSON output from the LLM
def clean_and_parse_json(text):
    # The model sometimes wraps the JSON in ```json ... ```, so we strip that
    cleaned_text = text.strip().removeprefix('```json').removesuffix('```').strip()
    return json.loads(cleaned_text)

mcq_rag_chain = (
    {"los_id": RunnablePassthrough()} 
    | RunnablePassthrough.assign(context=retrieve_context)
    | (lambda x: mcq_prompt.format(
        los_id=x["los_id"],
        retrieved_text=x["context"]["retrieved_text"]
      ))
    | llm
    | StrOutputParser()
    | clean_and_parse_json
)

flashcard_rag_chain = (
    {"los_id": RunnablePassthrough()} 
    | RunnablePassthrough.assign(context=retrieve_context)
    | (lambda x: flashcard_prompt.format(
        los_id=x["los_id"],
        retrieved_text=x["context"]["retrieved_text"]
      ))
    | llm
    | StrOutputParser()
    | clean_and_parse_json
)

print("RAG chains are ready to be used.")
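
The prefix/suffix approach in `clean_and_parse_json` only handles a fence that is tagged `json`. A slightly more tolerant variant is sketched below; it is an illustrative alternative, and `parse_llm_json` is a hypothetical name. It accepts a bare fence, a `json`-tagged fence, or no fence at all, and raises a `ValueError` with a readable message when the payload is not valid JSON.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example nests cleanly in Markdown
FENCE_RE = re.compile(
    rf"^{FENCE}(?:json)?\s*\n(.*?)\n?{FENCE}\s*$",
    re.DOTALL | re.IGNORECASE,
)

def parse_llm_json(text):
    """Parse JSON from model output, tolerating an optional code fence around it."""
    stripped = text.strip()
    match = FENCE_RE.match(stripped)
    payload = match.group(1) if match else stripped
    try:
        return json.loads(payload)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model output is not valid JSON: {exc.msg}") from exc
```

It could replace `clean_and_parse_json` as the last step of either chain without changing the rest of the pipeline.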

## 4. Execution and Generation

Let's test our system! We'll specify an LOS and run both the MCQ and Flashcard generation chains.

In [None]:
import json

# --- Specify the Target LOS ---
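# Must match the 'los_id' metadata stored during indexing: the two digits
# after 'LOS' in the filename (e.g. 'LM12LOS40.pdf.md' -> "40").
target_los = "40"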
# -----------------------------

print(f"--- Generating MCQ for LOS: {target_los} ---")
try:
    mcq_result = mcq_rag_chain.invoke(target_los)
    print(json.dumps(mcq_result, indent=2))
except Exception as e:
    print(f"An error occurred during MCQ generation: {e}")

print(f"\n--- Generating Flashcard for LOS: {target_los} ---")
try:
    flashcard_result = flashcard_rag_chain.invoke(target_los)
    print(json.dumps(flashcard_result, indent=2))
except Exception as e:
    print(f"An error occurred during Flashcard generation: {e}")
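
Valid JSON is not necessarily a valid question, so a structural check on the parsed result is a cheap extra safeguard. A minimal sketch using only the standard library, based on the MCQ output format requested in the prompt; `validate_mcq` is a hypothetical helper:

```python
def validate_mcq(mcq):
    """Return a list of problems found in a parsed MCQ dict (empty list means it passed)."""
    problems = []
    for key in ("question", "options", "answer", "explanation"):
        if key not in mcq:
            problems.append(f"missing key: {key}")
    options = mcq.get("options", {})
    if not isinstance(options, dict) or sorted(options) != ["A", "B", "C"]:
        problems.append("options must be a dict with exactly the keys A, B, C")
    elif mcq.get("answer") not in options:
        problems.append("answer must be one of A, B, C")
    return problems
```

For example, `validate_mcq(mcq_result)` can be called right after `mcq_rag_chain.invoke(...)`, and any item with a non-empty problem list can be regenerated or flagged for review.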

## 5. Next Steps and Evaluation

This notebook is a prototype of the full pipeline. Before moving it to production, consider the following:

1.  **Evaluation:** Implement a "human-in-the-loop" review process. Generate a large batch of questions and have CFA experts review them for accuracy, relevance, and clarity. 
2.  **Prompt Iteration:** Based on feedback, refine your prompt templates. 
3.  **Chunking Strategy:** If the retrieved context is often irrelevant, experiment with different chunk sizes or `MarkdownHeaderTextSplitter` settings.
4.  **UI/Application:** Wrap this logic in a web application using `Streamlit` or `FastAPI`.
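
For the human-in-the-loop review in step 1, one simple option is to export generated items to a spreadsheet-friendly file with empty columns for reviewers to fill in. The sketch below is only one possible layout: the column names `accurate`, `relevant`, `clear`, and `notes` are assumptions derived from the review criteria listed above, and `write_review_sheet` is a hypothetical helper.

```python
import csv
import io
import json

REVIEW_COLUMNS = ["los_id", "kind", "content_json", "accurate", "relevant", "clear", "notes"]

def write_review_sheet(items, handle):
    """Write one CSV row per generated item, leaving the reviewer columns blank."""
    writer = csv.DictWriter(handle, fieldnames=REVIEW_COLUMNS)
    writer.writeheader()
    for item in items:
        writer.writerow({
            "los_id": item["los_id"],
            "kind": item["kind"],  # e.g. "mcq" or "flashcard"
            "content_json": json.dumps(item["content"], sort_keys=True),
        })

# Usage with an in-memory buffer; pass an open file handle to write to disk instead
buffer = io.StringIO()
write_review_sheet(
    [{"los_id": "40", "kind": "flashcard", "content": {"front": "Term", "back": "Definition"}}],
    buffer,
)
print(buffer.getvalue())
```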