# Setup

As with any project, we first need a proper setup: the LangChain (LangSmith) API key and the API key for our LLM must both be available. We also need to enable tracing and point it at the correct project so that we can monitor the runs.

In [1]:
import os
import getpass 
from dotenv import load_dotenv

load_dotenv(dotenv_path="../.env")


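def _get_config(api_name: str) -> str:
    """Return a setting from the environment, prompting for it if it is missing."""
    if api_name in os.environ:
        print(f"Using {api_name} from environment")
        return os.environ[api_name]

    print(f"{api_name} is not set in the environment.")
    value = getpass.getpass(f"Please enter a value for {api_name}: ")
    # Store the value so that libraries reading os.environ can see it too
    os.environ[api_name] = value
    return value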



In [2]:
# Get LangSmith API Key 
LANGCHAIN_API_KEY = _get_config("LANGCHAIN_API_KEY")

# Get the Google API key for the Gemini model
GOOGLE_API_KEY = _get_config("GOOGLE_API_KEY")

# Set up tracing 
LANGCHAIN_TRACING_V2 = _get_config("LANGCHAIN_TRACING_V2")

# Set up a new project name
os.environ['LANGCHAIN_PROJECT']="RAG Application Dan Koe"
print(f"Using project: {os.environ['LANGCHAIN_PROJECT']}")

# Get User Agent for requests
USER_AGENT = _get_config("USER_AGENT")

Using LANGCHAIN_API_KEY from environment
Using GOOGLE_API_KEY from environment
Using LANGCHAIN_TRACING_V2 from environment
Using project: RAG Application Dan Koe
Using USER_AGENT from environment
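
Before moving on, it can help to confirm that every setting is present and that tracing is switched on. The sketch below is not part of the original notebook: `REQUIRED_VARS`, `find_missing`, and `tracing_enabled` are hypothetical names, and it assumes that tracing is enabled by setting `LANGCHAIN_TRACING_V2` to the string `true`.

```python
import os

# Hypothetical list of settings this notebook relies on
REQUIRED_VARS = [
    "LANGCHAIN_API_KEY",
    "GOOGLE_API_KEY",
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_PROJECT",
]


def find_missing(required, env):
    """Return the names in `required` that are absent or empty in `env`."""
    return [name for name in required if not env.get(name)]


def tracing_enabled(env):
    """Assumption: tracing is switched on by the string "true" (case-insensitive)."""
    return env.get("LANGCHAIN_TRACING_V2", "").strip().lower() == "true"


missing = find_missing(REQUIRED_VARS, os.environ)
if missing:
    print(f"Missing settings: {', '.join(missing)}")
elif not tracing_enabled(os.environ):
    print("Tracing is not enabled.")
else:
    print("Configuration looks complete.")
```

Passing the environment in as an argument keeps both helpers easy to check against a plain dictionary.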


Now that the project is set up, we will build a series of functions that lead up to actual responses. Our RAG application is based on Dan Koe's website, where he posts his letters. We start by automating the ingestion process with a [Document Loader](https://python.langchain.com/docs/integrations/document_loaders/), using SitemapLoader to discover all the articles instead of blindly guessing URLs. To trace each component, we wrap it in a function and decorate it with `@traceable`.

In [3]:
import requests
import nest_asyncio
from langsmith import traceable 
from langchain_community.document_loaders import SitemapLoader
from langchain_core.documents import Document
from bs4 import BeautifulSoup



# Apply nest_asyncio to allow nested event loops
nest_asyncio.apply()

@traceable
def load_documents_from_sitemap(sitemap_url: str, filtered_urls: list):

    """
    A traceable function that discovers the article URLs in a sitemap
    and loads the text of each matching post.

    Args:
        sitemap_url (str): The URL of the sitemap to load documents from.
        filtered_urls (list): A list of URL patterns used to filter the sitemap entries.

    Returns:
        list: A list of loaded Document objects.
    """
    # SitemapLoader automatically finds all the articles that match the filter
    loader = SitemapLoader(
        web_path=sitemap_url,
        filter_urls=filtered_urls
    )

    # loader.load() fetches the matching pages; only their source URLs are kept here
    urls = [doc.metadata['source'] for doc in loader.load()]

    loaded_documents = []

    # Now, each URL needs to be parsed
    for url in urls:
        try:
            # Fetch the web page for each URL found in the sitemap
            response = requests.get(url, headers={"User-Agent": 'MyRAGBot/1.0'})
            response.raise_for_status()

            # Parse the HTML with BeautifulSoup and select the article body
            soup = BeautifulSoup(response.content, 'html.parser')
            main_content = soup.select_one('article .body')

            # Extract and store the clean text
            if main_content:
                text = main_content.get_text(separator="\n", strip=True)
                loaded_documents.append(Document(page_content=text, metadata={"source": url}))

        except requests.RequestException as e:
            print(f"Error fetching {url}: {e}")

    # Sanity Check - Verify the number of loaded documents
    print(f"Loaded total of {len(loaded_documents)} documents")

    return loaded_documents


sitemap_url = "https://letters.thedankoe.com/sitemap.xml"
filter_urls = ["https://letters.thedankoe.com/p"]


documents = load_documents_from_sitemap(sitemap_url, filter_urls)


Fetching pages: 100%|##########| 42/42 [00:08<00:00,  5.21it/s]


Loaded total of 42 documents
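
To make the discovery step less opaque, here is a minimal standard-library sketch of what reading a sitemap amounts to: parse the XML, collect the `<loc>` entries, and keep only the URLs of interest. It is an illustration rather than SitemapLoader's actual implementation. `SAMPLE_SITEMAP` and `extract_urls` are hypothetical, the `example.com` URLs are placeholders, and a simple prefix match stands in for the loader's own filtering.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content; the example.com URLs are placeholders
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/p/first-letter</loc></url>
  <url><loc>https://example.com/p/second-letter</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
"""

# Namespace used by the sitemap protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str, prefix: str) -> list[str]:
    """Return every <loc> URL in the sitemap that starts with `prefix`."""
    root = ET.fromstring(sitemap_xml)
    locs = [
        element.text.strip()
        for element in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if element.text
    ]
    # Simplification: keep URLs by prefix match only
    return [url for url in locs if url.startswith(prefix)]


for url in extract_urls(SAMPLE_SITEMAP, "https://example.com/p/"):
    print(url)
```

With the sample above, only the two `/p/` entries are kept and the `/about` page is filtered out.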


In [4]:
# Sanity check
if documents:  # An empty list is falsy, so this checks that something was loaded
    print("\nHere are the first 20 characters of the first 5 documents:")
    for i in range(min(5, len(documents))):
        print(documents[i].page_content[:20])
else:
    print("\nStill found 0 documents. There might be another problem.")


Here are the first 20 characters of the first 5 documents:
This course was prev
Most creators will f
Before we begin, the
Most people aren't b
This post builds on 


This means that we have fetched our documents: we have a total of 42 documents to work with. As a quick check that the pages loaded correctly, we printed the first 20 characters of the first five documents. The next step is chunking, which simply means splitting the content into smaller, more focused pieces of information.

We will rely on a text splitter to turn the loaded documents into small yet precise retrieval units that are still large enough to contain a complete and coherent thought.

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document 


@traceable 
def split_documents(documents : list[Document]) -> list[Document]:

    """
    A @traceable decorated function that takes a list of documents and splits them into smaller, overlapping chunks.

    Args:
        documents (list[Document]): A list of Document objects to be split.

    Returns:
        list[Document]: A list of Document objects representing the split chunks.
    """

    # Recursively splits long text into smaller, semantically meaningful chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,       # Maximum number of characters per chunk
        chunk_overlap=200,     # Characters shared between neighbouring chunks
        add_start_index=True   # Record each chunk's start offset in its metadata
    )

    # split_documents chunks every document in the list
    splits = text_splitter.split_documents(documents)

    # Sanity Check 
    print(f"Splitting Completed. Created a total of {len(splits)} chunks.")

    return splits

all_splits = split_documents(documents)

Splitting Completed. Created a total of 735 chunks.
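
To see how `chunk_size` and `chunk_overlap` interact, consider the simplified sketch below. It uses a fixed sliding window over characters, which is not the algorithm RecursiveCharacterTextSplitter uses (that splitter prefers to break on separators such as paragraphs and sentences). `chunk_text` is a hypothetical helper, and the tiny sizes are chosen only to keep the output readable.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split `text` into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new chunk starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_text(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
# Prints:
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk repeats the last three characters of the previous one. In the real pipeline, that overlap helps a thought that straddles a boundary appear whole in at least one chunk.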


Great! The documents have been split, and now that we have 735 chunks we need to create embeddings (turning text into numbers) so that we can store them in a vector store. This will enable us to find text that is semantically close to the user's query. This is a two-step process: first, an embedding model converts each chunk into a meaningful numerical representation (a vector); second, a specialized database stores those vectors so that we can run very fast similarity searches against the query vector.

In [6]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import Chroma 

@traceable
def create_vector_store(chunks: list[Document]) -> Chroma:
    """
    A @traceable decorated function that creates a vector store from a list of documents.

    Args:
        chunks (list[Document]): A list of Document objects to be added to the vector store.

    Returns:
        Chroma: An instance of the Chroma vector store containing the document embeddings.
    """

    # This is our translator: it turns text into numbers
    embedding_model = GoogleGenerativeAIEmbeddings(model="models/embedding-001")


    # The .from_documents class method does the following:
        # Takes our list of document chunks
        # Uses the provided embedding_model to turn them into vectors
        # Creates a new in-memory Chroma database and stores the documents
    vector_store = Chroma.from_documents(documents=chunks, embedding=embedding_model)

    print("Successfully created vector store.")
    return vector_store


vector_store = create_vector_store(all_splits)


Successfully created vector store.
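
The idea behind a similarity search can be illustrated with a few toy vectors. The sketch below ranks stored vectors by cosine similarity to a query vector. Cosine similarity is one common measure, and the metric the vector store actually uses depends on its configuration, so treat this as a conceptual illustration. The three-dimensional vectors, `toy_index`, and `top_k` are all made up for the example; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k):
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        indexed.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Made-up "embeddings" for three chunks
toy_index = {
    "chunk about habits": [0.9, 0.1, 0.0],
    "chunk about writing": [0.1, 0.9, 0.1],
    "chunk about money": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.0]

print(top_k(query, toy_index, k=2))
# ['chunk about habits', 'chunk about writing']
```

A real vector store follows the same principle, but uses indexing structures so that it does not have to compare the query against every stored vector one by one.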


So to recap: we have ingested the documents from Dan Koe's letters on his website, split the content of his posts into meaningful chunks, and indexed the chunks by embedding them and storing them in a Chroma database so that we can efficiently retrieve relevant passages based on the user's query.

In the next stage we put all of these things together: we build the RAG chain that connects the vector store to the LLM, and we execute the chain by asking questions and getting responses.

In [22]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate

@traceable 
def setup_rag_chain(vector_store):

    """
    Set up and return a traceable RAG chain.

    Args:
        vector_store (Chroma): The vector store to use for the RAG chain.

    Returns:
        Runnable: The constructed RAG chain.
    """
    # Set up the LLM we are working with
    model = ChatGoogleGenerativeAI(model='gemini-2.5-flash-lite', temperature=1)

    # Set up the retriever, which returns the 6 most similar chunks per query
    retriever = vector_store.as_retriever(search_kwargs={"k": 6})

    # Setup the prompt template
    prompt_template = ChatPromptTemplate.from_template(
        """Answer the question based on the following context.
        If you don't know the answer disclose to the user that you do not know the answer.
        
        Context: 
        {context}

        Question: 
        {question}
        """
    )

    def format_docs(docs): 
        return "\n\n".join([doc.page_content for doc in docs])
    

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt_template
        | model
        | StrOutputParser()
    )

    return rag_chain

To follow what has been done here, we are defining the data transformation that takes place at each stage. The first step of `rag_chain` takes the question and passes it to the retriever, whose results are formatted into the context, while `RunnablePassthrough()` passes the original question through unchanged. The resulting dictionary of context and question is piped into the prompt, where LangChain matches the dictionary keys `context` and `question` to their respective placeholders. The formatted prompt is then piped into the Gemini model to generate the answer, and the output is parsed to pull out just the content, giving us a clean string as the final answer.
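
The same data flow can be traced in plain Python, with no LangChain involved. In the sketch below, every function is a hypothetical stand-in that mimics the role of one stage: `fake_retriever` returns canned chunks and `fake_model` returns a canned reply. Only the shape of the data passed between stages is meant to match the chain.

```python
# Shortened prompt with the same two placeholders as the chain's template
PROMPT = (
    "Answer the question based on the following context.\n\n"
    "Context:\n{context}\n\n"
    "Question:\n{question}\n"
)


def fake_retriever(question: str) -> list[str]:
    """Stand-in for the retriever: returns canned chunks regardless of the question."""
    return ["First retrieved chunk.", "Second retrieved chunk."]


def format_docs(chunks: list[str]) -> str:
    """Join the retrieved chunks into one context string."""
    return "\n\n".join(chunks)


def build_inputs(question: str) -> dict:
    """Mimic the dictionary step: one branch retrieves, the other passes through."""
    return {
        "context": format_docs(fake_retriever(question)),
        "question": question,  # passed through unchanged
    }


def fill_prompt(inputs: dict) -> str:
    """Match the dictionary keys to the placeholders in the prompt."""
    return PROMPT.format(**inputs)


def fake_model(prompt: str) -> dict:
    """Stand-in for the LLM: returns a message-like dictionary."""
    return {"content": f"(model reply to a {len(prompt)}-character prompt)"}


def parse_output(message: dict) -> str:
    """Pull out just the text content, as the output parser does."""
    return message["content"]


def run_chain(question: str) -> str:
    return parse_output(fake_model(fill_prompt(build_inputs(question))))


print(run_chain("How does the author suggest someone reinvent themselves?"))
```

Reading `run_chain` from the inside out gives the same order as reading the piped chain from top to bottom.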

In [25]:
# Build your chain using vector_store
rag_chain = setup_rag_chain(vector_store)

Next, we create a function that lets us ask questions.

In [26]:
@traceable 
def ask_question(chain, question: str):

    """
    Invoke the RAG chain with a question, returning the answer.
    """
    return chain.invoke(question)

In [27]:
user_question = "How does the author suggest someone reinvent themselves?"
answer = ask_question(rag_chain, user_question)

print("\n--- Answer ---")
print(answer)


--- Answer ---
The author suggests reinventing yourself by stepping into the unknown, which is described as being "reborn." This process involves a buffer period of high anxiety as nature tests your seriousness about your capabilities. The author outlines two steps for engineering an identity:

1.  **Recognition:** Understand that you are pursuing goals, whether consciously or not. These goals shape your interpretation of reality. Recognize that your current goals likely stem from conditioning by your environment (parents, teachers, culture), which may be leading you to a dead end.
2.  **Strategic Dissonance:** Actively widen the gap between who you are and who you want to be. Cultivate dissatisfaction with your current lifestyle by considering where you will end up if you don't change.


In [30]:
user_question = "What are the author suggest in terms of skills that would set someone up for life"
answer = ask_question(rag_chain, user_question)

print("\n--- Answer ---")
print(answer)



--- Answer ---
The author suggests that to be set up for life, one should focus on learning the "liberating arts" which are:

*   **Logic**: how to derive truth from known facts
*   **Statistics**: how to understand the implications of data
*   **Rhetoric**: how to persuade, and spot persuasion tactics
*   **Research**: how to gather information on an unknown subject
*   **(Practical) Psychology**: how to discern and understand the true motives of others
*   **Investment**: how to manage and grow existing assets
*   **Agency**: how to make decisions about what course to pursue, and proactively take action to pursue it
*   **Risk Tolerance**: the ability to embrace uncertainty

In addition to these, the author also emphasizes the importance of self-development, self-reliance, self-education, self-sufficiency, and self-mastery. The author believes these are critical for creating value and navigating reality.
