# Setup

As with any project, we start with a proper setup: we need our LangChain (LangSmith) API key and the API key for our LLM provider. We also enable tracing and point it at the right project so that we can monitor each run correctly.

In [1]:
import os
import getpass 
from dotenv import load_dotenv

load_dotenv(dotenv_path="../.env")


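def _get_config(api_name: str) -> str:
    """Return a setting from the environment, prompting for it if it is missing."""
    if api_name in os.environ:
        print(f"Using {api_name} from environment")
    else:
        print(f"{api_name} is not set in the environment.")
        # Store the value so that libraries reading os.environ can also see it
        os.environ[api_name] = getpass.getpass(f"Please enter a value for {api_name}: ")
    return os.environ[api_name]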



In [2]:
# Get LangSmith API Key 
LANGCHAIN_API_KEY = _get_config("LANGCHAIN_API_KEY")

# Get the Google API key used by the Gemini LLM
GOOGLE_API_KEY = _get_config("GOOGLE_API_KEY")

# Set up tracing 
LANGCHAIN_TRACING_V2 = _get_config("LANGCHAIN_TRACING_V2")

# Set up a new project name
os.environ['LANGCHAIN_PROJECT']="RAG Application Dan Koe"
print(f"Using project: {os.environ['LANGCHAIN_PROJECT']}")

# Get User Agent for requests
USER_AGENT = _get_config("USER_AGENT")

Using LANGCHAIN_API_KEY from environment
Using GOOGLE_API_KEY from environment
Using LANGCHAIN_TRACING_V2 from environment
Using project: RAG Application Dan Koe
Using USER_AGENT from environment
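
Because every later step depends on these settings, a quick check up front can save a confusing failure later. The following is a minimal sketch and not part of the original notebook: `missing_settings` is a hypothetical helper, and the list of required names is an assumption based on the variables used above.

```python
import os
from typing import Iterable, Mapping

# Names used in this notebook; adjust the list if your setup differs (assumption).
REQUIRED_SETTINGS = ["LANGCHAIN_API_KEY", "GOOGLE_API_KEY", "LANGCHAIN_TRACING_V2"]


def missing_settings(names: Iterable[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names that are absent from env or set to a blank string."""
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings(REQUIRED_SETTINGS)
    if missing:
        print(f"Missing settings: {', '.join(missing)}")
    else:
        print("All required settings are present.")
```

Passing the environment in as a parameter keeps the helper easy to try with a plain dictionary, without touching the real `os.environ`.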


Now that the project is set up, we will build a set of functions that take us from raw web pages to actual responses. Our RAG application is based on Dan Koe's website, where he posts his letters. We start by automating the ingestion process with a [Document Loader](https://python.langchain.com/docs/integrations/document_loaders/): `SitemapLoader` discovers all the articles from the sitemap, so we do not have to guess URLs blindly. To trace each component, we wrap it in a function and apply the `@traceable` decorator.

In [3]:
import requests
import nest_asyncio
from langsmith import traceable 
from langchain_community.document_loaders import SitemapLoader
from langchain_core.documents import Document
from bs4 import BeautifulSoup



# Apply nest_asyncio to allow nested event loops
nest_asyncio.apply()

@traceable
def load_documents_from_sitemap(sitemap_url: str, filtered_urls: list):
    """
    A traceable function that discovers the article URLs listed in a sitemap
    and loads the text of each matching post.

    Args:
        sitemap_url (str): The URL of the sitemap to load documents from.
        filtered_urls (list): URL patterns used to keep only the matching post URLs.

    Returns:
        list: A list of loaded Document objects.
    """
    # SitemapLoader automatically finds and loads every page that matches the filter
    loader = SitemapLoader(
        web_path = sitemap_url, 
        filter_urls = filtered_urls
    )

    urls = [doc.metadata['source'] for doc in loader.load()]

    loaded_documents = []

    # Now, each URL needs to be fetched and parsed
    for url in urls:
        try:
            # Fetch the web page for each URL found in the sitemap
            response = requests.get(url, headers={"User-Agent": 'MyRAGBot/1.0'}, timeout=30)
            response.raise_for_status()
            
            # Parse the page HTML with BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')
            main_content = soup.select_one('article .body')

            # We extract and store our clean text 
            if main_content: 
                text = main_content.get_text(separator="\n", strip = True)
                loaded_documents.append(Document(page_content=text, metadata={"source": url}))

        except requests.RequestException as e:
            print(f"Error fetching {url}: {e}")

    # Sanity Check - Verify the number of loaded documents
    print(f"Loaded total of {len(loaded_documents)} documents")

    return loaded_documents


sitemap_url = "https://letters.thedankoe.com/sitemap.xml"
filtered_urls = ["https://letters.thedankoe.com/p"]


documents = load_documents_from_sitemap(sitemap_url, filtered_urls)


Fetching pages: 100%|##########| 43/43 [00:07<00:00,  5.47it/s]


Loaded total of 43 documents
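
To make the discovery step less of a black box, here is a minimal standard-library sketch of what filtering a sitemap involves: read the `<loc>` entries and keep those that match a pattern. It is an illustration under assumptions, not `SitemapLoader`'s actual implementation. The helper `filter_sitemap_urls`, the use of `re.match`, and the sample XML with its `example.com` URLs are all made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sitemap content, for illustration only
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/p/first-letter</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/p/second-letter</loc></url>
</urlset>"""


def filter_sitemap_urls(sitemap_xml: str, patterns: list[str]) -> list[str]:
    """Return the <loc> URLs that match at least one regular expression."""
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]
    # Keep a URL if any pattern matches from the start of the string
    return [url for url in urls if any(re.match(pattern, url) for pattern in patterns)]


post_urls = filter_sitemap_urls(SAMPLE_SITEMAP, [r"https://example\.com/p/"])
print(post_urls)
```

With the sample data, only the two `/p/` URLs survive the filter, which is the same idea as passing `"https://letters.thedankoe.com/p"` to keep the letter pages and skip everything else.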


In [4]:
# Sanity Check 
if documents:  # An empty list is falsy, so this checks that something was loaded
    print("\nHere are the first 20 characters of the first 5 documents:")
    for i in range(min(5, len(documents))):
        print(documents[i].page_content[:20])
else: 
    print("\nStill found 0 documents. There might be another problem.")


Here are the first 20 characters of the first 5 documents:
This letter is long.
This course was prev
Most creators will f
Before we begin, the
Most people aren't b


This confirms that we have fetched our documents: we are working with a total of 43 documents, and as a quick check that the content is correct we printed the first 20 characters of the page content. The next step is chunking, which simply means splitting each document into smaller, more focused pieces of information.

We will rely on a text splitter to turn the loaded documents into chunks that are small enough for precise retrieval, yet large enough to contain a complete and coherent thought.

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document 


@traceable 
def split_documents(documents : list[Document]) -> list[Document]:

    """
    A @traceable decorated function that takes a list of documents and splits them into smaller, overlapping chunks. 

    Args: 
        documents (list[Document]): A list of Document objects to be split.

    Returns: 
        list[Document]: A list of Document objects representing the split chunks.
    """

    # Recursively Splits Long Text into smaller, semantically meaningful chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = 1000, 
        chunk_overlap = 200, 
        add_start_index = True
    )

    # The split_documents method chunks every document in the list
    splits = text_splitter.split_documents(documents)

    # Sanity Check 
    print(f"Splitting Completed. Created a total of {len(splits)} chunks.")

    return splits

all_splits = split_documents(documents)

Splitting Completed. Created a total of 773 chunks.
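
To see what `chunk_size` and `chunk_overlap` mean in practice, here is a deliberately simplified sketch: a fixed-size sliding window over characters. It is not the algorithm that `RecursiveCharacterTextSplitter` uses, since that splitter prefers to break on separators such as paragraphs and lines. `sliding_window_chunks` is a hypothetical helper that only illustrates how consecutive chunks share text.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Split text into fixed-size chunks; return (start_index, chunk) pairs."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for start, chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(start, chunk)
```

Each chunk repeats the last four characters of the previous one, so a thought that straddles a boundary still appears whole in at least one chunk. The recorded start index plays the same role as `add_start_index = True` above, which stores where each chunk begins in its source document.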


Great! The documents have been split into 773 chunks. Next we need to create embeddings (turning text into numbers) so that we can store the chunks in a vector store. This lets us find text that is semantically close to the user's query. It is a two-step process: first, an embedding model converts each chunk into a meaningful numerical representation (a vector); second, a specialized database stores those vectors so that we can run very fast similarity searches against the query vector.

In [6]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import Chroma 
from langchain_core.vectorstores import VectorStoreRetriever

@traceable(
        metadata={'vectordb': 'chroma'}, 
        run_type = "retriever"
        )
def build_retriever(chunks: list[Document]) -> VectorStoreRetriever:
    """
    A @traceable decorated function that embeds a list of documents into a vector store
    and returns a retriever over it.

    Args:
        chunks (list[Document]): A list of Document objects to be added to the vector store.

    Returns:
        VectorStoreRetriever: A retriever that returns the 6 most similar chunks from the Chroma vector store.
    """

    # This is our translator: it turns text into numbers
    embedding_model = GoogleGenerativeAIEmbeddings(model="models/embedding-001")


    # The .from_documents class method does the following:
    #   1. Takes our list of document chunks
    #   2. Uses the provided embedding_model to turn them into vectors
    #   3. Creates a new in-memory Chroma database and stores the documents
    vector_store = Chroma.from_documents(documents=chunks, embedding = embedding_model)

    print("Successfully created vector store.")
    return vector_store.as_retriever(search_kwargs={"k": 6})


retriever = build_retriever(all_splits)




Successfully created vector store.
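
To illustrate what "semantically close" means once text has become vectors, here is a small brute-force sketch using cosine similarity. It is only an illustration: a real vector store such as Chroma relies on its own index and distance metric, and the three-dimensional vectors below are made up, not real embeddings. The helpers `cosine_similarity` and `top_k` are hypothetical names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: dict[str, list[float]], k: int) -> list[str]:
    """Return the k keys whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda key: cosine_similarity(query, store[key]), reverse=True)
    return ranked[:k]


# Made-up 3-dimensional "embeddings"; real models produce far more dimensions
toy_store = {
    "writing online": [0.9, 0.1, 0.0],
    "lifting weights": [0.0, 0.2, 0.9],
    "building an audience": [0.8, 0.3, 0.1],
}
toy_query = [1.0, 0.2, 0.0]  # pretend this is the embedded user question

print(top_k(toy_query, toy_store, k=2))
```

The `k` argument plays the same role as `search_kwargs={"k": 6}` in the retriever: it caps how many of the closest chunks are handed to the next step.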


To recap: we have ingested Dan Koe's letters from his website, split the content of his posts into meaningful chunks, and indexed those chunks by embedding them and storing them in a Chroma database, so that we can efficiently retrieve the passages relevant to a user's query.

In the next stage we put all of these things together: we build the RAG chain logic that connects the vector store to the LLM, and then execute it by asking questions and getting responses.

We start with a component that retrieves the documents that are relevant to the user's question.

In [8]:
from langchain_core.vectorstores import VectorStoreRetriever

@traceable(run_type="retriever")
def retrieve_documents(question: str, retriever: VectorStoreRetriever): 

    """
    A traceable function that takes the user's question and a retriever and returns the most relevant
    document chunks.

    Args: 
        question (str): The user's question.
        retriever: The retriever instance used to fetch relevant documents.

    Returns:
        list: A list of the most relevant document chunks.
    """
    retrieved_docs = retriever.invoke(question)

    return retrieved_docs


Now that the retriever is set up, we need to specify the format of the prompt that will be sent to the model. We define a function that joins the retrieved documents into a single string and then builds the final prompt for the LLM.

In [9]:
@traceable(run_type="prompt")
def format_documents(question: str, documents: list): 

    """
    Traceable function that formats the retrieved documents and the question into a structured prompt
    so that it can be passed on to the LLM.

    Args: 
        question (str): The user's question.
        documents (list): The list of retrieved document chunks.

    Returns:
        str: The final prompt to send to the LLM.
    """
    formatted_docs = "\n".join(doc.page_content for doc in documents)

    # Build the prompt piece by piece so that no stray indentation is sent to the model
    final_prompt = (
        "Answer the question based on the following context.\n"
        "If you don't know the answer disclose to the user that you do not know the answer.\n\n"
        f"Context:\n{formatted_docs}\n\n"
        f"Question:\n{question}\n"
    )

    return final_prompt
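
Since every loaded document carries its URL in `metadata["source"]`, one possible variation is to label each chunk with its source when building the context. The following is a sketch under assumptions, not the notebook's method: `StubDocument` is a stand-in for LangChain's `Document`, `format_context_with_sources` is a hypothetical helper, and the sample texts and URL are made up.

```python
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context_with_sources(documents: list) -> str:
    """Join chunks into one context string, labeling each with its source URL."""
    blocks = []
    for number, doc in enumerate(documents, start=1):
        source = doc.metadata.get("source", "unknown source")
        blocks.append(f"[{number}] Source: {source}\n{doc.page_content}")
    return "\n\n".join(blocks)


# Made-up sample chunks, for illustration only
docs = [
    StubDocument("You are the niche.", {"source": "https://example.com/p/a"}),
    StubDocument("Writing is a starting point."),
]
print(format_context_with_sources(docs))
```

Numbered, labeled blocks make it easier to see in a trace which chunk contributed to an answer, at the cost of a slightly longer prompt.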

Now that the prompt construction logic is in place, we need an LLM to feed the prompt into. We will use our Gemini model together with an output parser so that only the text content of the model's response is extracted.

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.output_parsers import StrOutputParser

model_name = "gemini-2.5-flash-lite"
temperature = 0.0

@traceable(run_type = "llm", 
           metadata={'model_name': model_name, 'model_provider': 'google'})
def call_llm(prompt: str, model: str, temperature: float = 0.0) -> str: 
    """Send the prompt to the given Gemini model and return the response text."""

    # Set up the LLM from the arguments rather than from the module-level globals
    llm = ChatGoogleGenerativeAI(model = model, temperature = temperature)

    # StrOutputParser extracts only the text content from the model's message
    chain = llm | StrOutputParser()

    return chain.invoke(prompt)



Now we bring everything together in our final RAG chain: we retrieve the documents, format them into a prompt, and pass that prompt to the model.

In [11]:
@traceable(run_type="chain")
def rag_pipeline(question : str, retriever : VectorStoreRetriever): 

    """
    Orchestrates our entire RAG application by calling each of the specialist functions:
    the retriever, the prompt formatter that prepares the input for the LLM, and the call to
    the LLM itself.

    Args:
        question (str): The question to ask the LLM.
        retriever (VectorStoreRetriever): The retriever used to fetch relevant document chunks.

    Returns:
        str: The LLM's answer to the question.
    """


    retrieved_docs = retrieve_documents(question, retriever)
    formatted_prompt = format_documents(question,retrieved_docs)
    response = call_llm(formatted_prompt, model_name)

    return response 
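
The retrieve, format, generate flow can also be exercised without API keys by swapping each step for a stub. The following is a minimal sketch under assumptions: `run_pipeline` and the three `stub_*` functions are hypothetical names, and the stubs return canned values instead of calling a vector store or a model.

```python
from typing import Callable


def run_pipeline(
    question: str,
    retrieve: Callable[[str], list[str]],
    build_prompt: Callable[[str, list[str]], str],
    generate: Callable[[str], str],
) -> str:
    """Retrieve context, build the prompt, then generate an answer."""
    chunks = retrieve(question)
    prompt = build_prompt(question, chunks)
    return generate(prompt)


# Stub components so the flow can be tried offline
def stub_retrieve(question: str) -> list[str]:
    return ["chunk one", "chunk two"]


def stub_build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion:\n{question}"


def stub_generate(prompt: str) -> str:
    # Report the prompt size instead of calling a model
    return f"(stub answer based on a {len(prompt)}-character prompt)"


print(run_pipeline("What is RAG?", stub_retrieve, stub_build_prompt, stub_generate))
```

Keeping each step behind a plain function boundary is what makes this swap possible, and it is the same boundary that `@traceable` uses to record one run per step.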



In [12]:
user_question = "What is the key to creating a successful one-person business, according to the author?"
final_answer = rag_pipeline(user_question, retriever)

In [13]:
print("\n--- Final Answer ---")
print(final_answer)


--- Final Answer ---
The key to creating a successful one-person business, according to the author, is to **build a business around your life, not a life around your business**, prioritizing family and well-being. The author also emphasizes that **you are the niche** and that practicing **self-awareness** is the greatest business, marketing, and sales skill. Additionally, the author highlights the accessibility of the internet and writing as a powerful starting point for building a business, allowing individuals to become "one-person media companies."


In [14]:
user_question = "What does the author suggest in terms of skills that would set someone up for life?"
answer = rag_pipeline(user_question, retriever)

print("\n--- Answer ---")
print(answer)



--- Answer ---
The author suggests that to set someone up for life, one should focus on developing "liberating arts" rather than just career-specific skills. These liberating arts are:

*   **Logic:** How to derive truth from known facts.
*   **Statistics:** How to understand the implications of data.
*   **Rhetoric:** How to persuade and spot persuasion tactics.
*   **Research:** How to gather information on an unknown subject.
*   **(Practical) Psychology:** How to discern and understand the true motives of others.
*   **Investment:** How to manage and grow existing assets.
*   **Self-development:** Cultivating a valuable mindset and skillset that can help others expand beyond their limits.
*   **Self-reliance:** How to get what you want by taking responsibility for the outcome of your life.
*   **Self-education:** The ability to gather, make sense of, and utilize information on an unknown subject.
*   **Self-sufficiency:** The ability to sustain one’s ideal lifestyle and acquire th