# AI-Powered "Warren Buffett" Investment Advisor RAG System

Many investors admire Warren Buffett’s investment philosophy but lack the expertise to analyze stocks the way he does. This AI-powered RAG system retrieves historical Buffett investment decisions, Berkshire Hathaway shareholder letters, and company financials to provide Buffett-style insights on modern stocks. This allows us to more closely emulate and learn from Buffett's signature "value investing" style.

In [1]:
import pdfplumber
import wordninja
import re
import spacy
from sentence_transformers import SentenceTransformer
import pickle
import numpy as np
import faiss
import openai



## Text Preprocessing

Data Source: https://www.berkshirehathaway.com/letters/letters.html

One setback is that the shareholder letters are all in PDF format rather than markdown, which leaves the extracted text poorly structured. I created some pre-defined rules to clean the text, but they are not perfect. Additionally, tables could not be read and extracted properly.

When reading PDFs, the spacing is sometimes lost and words are joined together. I used the wordninja library to separate them.

In [None]:
def extract_text_and_tables(pdf_path):
    full_text = ""
    tables_data = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract text
            page_text = page.extract_text()
            if page_text:
                full_text += page_text + "\n\n"
            # Extract tables (doesn't actually work as tables are not well formatted)
            tables = page.extract_tables()
            for table in tables:
                if table:
                    tables_data.append(table)

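    # wordninja.split() returns only alphanumeric word tokens, so running it on the
    # whole text would discard punctuation and line breaks. Apply it only to long
    # letter runs, which are likely fused words (the 15-letter cutoff is a heuristic).
    full_text = re.sub(
        r"[A-Za-z]{15,}",
        lambda m: " ".join(wordninja.split(m.group())),
        full_text,
    )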

    return full_text, tables_data



def clean_text(text):
    # Initialize list to store cleaned lines
    cleaned_lines = []
    # Split text into lines
    lines = text.split("\n")

    # Iterate over lines
    for line in lines:
        # Remove page numbers (if the line is just a number)
        if re.fullmatch(r"\d+", line.strip()):
            continue  
        # Replace page breaks (\f often represents a new page in PDFs)
        line = line.replace("\f", " ")
        # Remove section dividers (e.g., * * * * * * * * or ********), with or without spaces
        line = re.sub(r'(\*\s*){6,}', ' ', line)
        # Remove section dividers (e.g., ---------------)
        line = re.sub(r'(\-){3,}', ' ', line) 
        # Remove section dividers (e.g., ===============)
        line = re.sub(r'(\=){3,}', ' ', line) 
        # Remove long sequences of dots (e.g., ...................................)
        line = re.sub(r'\.{5,}', ' ', line)
        # Append cleaned line if not empty
        if line:
            cleaned_lines.append(line)
    
    # Join lines, ensuring sentences are reconstructed properly
    cleaned_text = " ".join(cleaned_lines)
    # Fix spaces before punctuation (caused by broken lines)
    cleaned_text = re.sub(r'\s+([.,!?;])', r'\1', cleaned_text)
    # Fix missing spaces in camelCase-like words
    cleaned_text = re.sub(r'([a-z])([A-Z])', r'\1 \2', cleaned_text)
    # Trim whitespace
    cleaned_text = cleaned_text.strip()

    return cleaned_text

# Test the function
print("Clean text test:", clean_text(
"""
************
This is the first section..........................................................................
12
cashflowIsImportant butBerkshireHathaway's strategyIsUnique.
\f
Next page starts here.
"""))

# Preprocess all shareholder letters
for i in range(1977, 2025):
    print(f"Processing {i}...")
    text, tables = extract_text_and_tables(f"./data/BRK.A Chairman Letters/Chairman's Letter - {i}.pdf")
    processed_text = clean_text(text)
    with open(f"./data/BRK.A Chairman Letters/Chairman's Letter - {i}.txt", "w", encoding='utf-8') as f:
        f.write(processed_text)

shareholder_text = {}
for i in range(1977, 2025):
    with open(f"./data/BRK.A Chairman Letters/Chairman's Letter - {i}.txt", "r", encoding='utf-8') as f:
        shareholder_text[i] = f.read()


Clean text test: This is the first section  cashflow Is Important but Berkshire Hathaway's strategy Is Unique.   Next page starts here.


## Chunking

Chunking is done at the sentence level with a sliding window, using spaCy for sentence segmentation. Each chunk holds 15 sentences and overlaps the previous chunk by 5 sentences, so each chunk carries a reasonable amount of context from its predecessor without being excessively redundant.

Ideally, chunking would be done on a section or paragraph level. However,
1. The format of the letter changes over time and the section headers may not be consistent. E.g., 1977 - "Insurance Investments", 1981 - "General Acquisition Behavior"
2. Due to the PDF format of the letters, pdfplumber is unable to detect paragraph breaks.

Metadata for the year is also added.
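To make the window arithmetic concrete, here is a minimal sketch in plain Python that only computes which sentence indices each chunk would cover. The helper name `window_starts` and the 40-sentence letter are hypothetical. The sketch also keeps a shorter final window so that trailing sentences are not dropped, which is a design choice of this example.

```python
def window_starts(n_sentences, chunk_size=15, overlap_size=5):
    """Return the start index of each chunk for a letter with n_sentences sentences."""
    step = chunk_size - overlap_size  # each window advances by 10 sentences
    starts = []
    for start in range(0, n_sentences, step):
        starts.append(start)
        # Stop once the current window already covers the last sentence
        if start + chunk_size >= n_sentences:
            break
    return starts


# A hypothetical 40-sentence letter
n = 40
spans = [(s, min(s + 15, n)) for s in window_starts(n)]
print(spans)  # [(0, 15), (10, 25), (20, 35), (30, 40)]
```

Each span shares its first 5 sentences with the previous one, and the last span is shorter because the letter ends before a full 15-sentence window is available.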

In [None]:
# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")

# Define the chunk size and overlap size
chunk_size = 15  # 15 sentences per chunk
overlap_size = 5  # 5 sentences overlap between chunks

# Function to create chunks with sliding window
def sliding_window_chunking(text, chunk_size, overlap_size, year_metadata):
    # Process the text with spaCy
    doc = nlp(text)
    
    # Create a list of sentences
    sentences = list(doc.sents)
    
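    # Create chunks with sliding window
    chunks = []
    step = chunk_size - overlap_size
    for i in range(0, len(sentences), step):
        chunk = sentences[i:i+chunk_size]
        chunk_text = " ".join([sent.text for sent in chunk])
        chunks.append({"text": chunk_text, "year": year_metadata})
        # Stop once the window reaches the end, so the final (possibly shorter)
        # chunk keeps the trailing sentences instead of dropping them
        if i + chunk_size >= len(sentences):
            break
    
    return chunks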

# Process the shareholder letters and create chunks for each year
chunks_with_metadata = []

for year, letter_text in shareholder_text.items():
    # Generate chunks for each shareholder letter with year metadata
    chunks = sliding_window_chunking(letter_text, chunk_size, overlap_size, year)
    chunks_with_metadata.extend(chunks)

# Save the chunks to a file
with open("chunks_with_metadata.pkl", "wb") as f:
    pickle.dump(chunks_with_metadata, f)

# Display the first 5 chunks
chunks_with_metadata[:5]

[{'text': 'BERKSHIRE HATHAWAY INC. To the Stockholders of Berkshire Hathaway Inc.: Operating earnings in 1977 of $21,904,000, or $22.54 per share, were moderately better than anticipated a year ago. Of these earnings, $1.43 per share resulted from substantial realized capital gains by Blue Chip Stamps which, to the extent of our proportional interest in that company, are included in our operating earnings figure. Capital gains or losses realized directly by Berkshire Hathaway Inc. or its insurance subsidiaries are not included in our calculation of operating earnings. While too much attention should not be paid to the figure for any single year, over the longer term the record regarding aggregate capital gains or losses obviously is of significance. Textile operations came in well below forecast, while the results of the Illinois National Bank as well as the operating earnings attributable to our equity interest in Blue Chip Stamps were about as anticipated. However, insurance operatio

## Generate Embeddings

Embedding Model: bge-large-en

1. Large transformer architecture, generally more accurate than smaller models such as all-MiniLM-L6-v2.
2. High-dimensional embeddings (1024 dimensions) capture rich semantic information, which helps with financial text.
3. Works well with long-form data such as shareholder letters.
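Since these embeddings are later compared by L2 distance, it may help to see how that distance relates to cosine similarity. The sketch below uses toy 3-d vectors rather than real model output, and assumes the vectors are scaled to unit length. Whether the model already returns unit-length vectors is not verified here; sentence-transformers' `encode` accepts `normalize_embeddings=True` if explicit normalization is wanted.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors standing in for embeddings (hypothetical values)
a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
ua, ub = normalize(a), normalize(b)

# For unit-length vectors: ||ua - ub||^2 = 2 - 2 * cos(a, b)
print(math.isclose(squared_l2(ua, ub), 2 - 2 * cosine(a, b)))
```

Under that assumption, ranking by smallest L2 distance and ranking by largest cosine similarity give the same order.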

In [None]:
# Load the SentenceTransformer model
model = SentenceTransformer("BAAI/bge-large-en")

# Read the chunks with metadata
with open("chunks_with_metadata.pkl", "rb") as f:
    chunks_with_metadata = pickle.load(f)

# Function to generate embeddings for the text
def generate_embeddings(data):
    embeddings = []
    
    for entry in data:
        text = entry['text']
        year = entry['year']
        
        # Generate embedding for the text
        embedding = model.encode(text)  # This returns a vector of fixed size
        embeddings.append({'year': year, 'embedding': embedding})
        
    return embeddings

# Generate embeddings for all the letters
embeddings_data = generate_embeddings(chunks_with_metadata)

# Optionally, save the embeddings using pickle
with open('embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings_data, f)  # Save the embeddings with metadata (e.g., year)

## RAG

In [None]:
# Load the SentenceTransformer model and create the OpenAI client
# (the API key is read from the OPENAI_API_KEY environment variable)
import os
model = SentenceTransformer("BAAI/bge-large-en")
client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

In [5]:
# Function to query the FAISS index
def query_faiss_index(query, index, years, sentence_chunks, k=10):
    # Convert query to embedding
    query_embedding = model.encode([query]).astype('float32')
    
    # Search the FAISS index for the k nearest neighbors
    # (cap k at the index size so FAISS does not pad the results with -1)
    k = min(k, index.ntotal)
    distances, indices = index.search(query_embedding, k)
    
    # Retrieve the metadata (years) and relevant sentence chunks based on the indices
    retrieved_metadata = [years[i] for i in indices[0]]
    retrieved_chunks = [sentence_chunks[i] for i in indices[0]]  # Retrieve the actual document chunks
    
    return retrieved_metadata, retrieved_chunks, distances[0]



# Function to generate the prompt dynamically for chunked sentences
def generate_llm_prompt(retrieved_metadata, retrieved_chunks, distances, query):
    # Start with the query
    prompt = f"Pretend that you are Warren Buffett and answer in first person. Given the following document chunks from Berkshire Hathaway's annual letters, answer the query:\n\nQuery: {query}\n\n"
    
    # Add an introductory explanation
    prompt += "The following document chunks are relevant to the query. Use them to answer the question based on Warren Buffett's investment strategies:\n\n"
    
    # Process the retrieved data and add each chunk to the prompt with its distance score
    for i in range(len(retrieved_metadata)):
        year = retrieved_metadata[i]
        relevant_chunk = retrieved_chunks[i]  # This would be the chunk from the document
        score = distances[i]  # Squared L2 distance from FAISS (lower means more similar)
        
        # Add the chunk to the prompt with inline citation and relevance score
        prompt += f"Year: {year} (Score: {score:.4f})\n"
        prompt += f"Relevant Document Chunk: {relevant_chunk}\n\n"
    
    # Ask the model to reason about its answer
    prompt += "Based on the provided document chunks, please explain how Warren Buffett's investment strategies relate to the query, and reason through step by step each part of the answer. Include any insights from the document chunks that support your response. "

    prompt += "Include inline citations for each part of the answer to show the source of your reasoning in the form of quotes and (year). Additionally, include a disclaimer that you are just an AI and not actually Warren Buffett and that this should not be taken as investment advice."
    
    return prompt



# Full function for querying the system
def query_berkshire_bot(query, index, years, sentence_chunks, k=10):
    # Step 1: Query FAISS index
    retrieved_metadata, retrieved_chunks, distances = query_faiss_index(query, index, years, sentence_chunks, k)
    
    # Step 2: Craft the prompt for the LLM
    llm_prompt = generate_llm_prompt(retrieved_metadata, retrieved_chunks, distances, query)
    print(llm_prompt)
    
    # Step 3: Send the prompt to the LLM
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": "You are an AI version of Warren Buffett answering investment-related questions."},
                  {"role": "user", "content": llm_prompt}]
    )
    response = response.model_dump()

    # Step 4: Generate the base response
    base_response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": "You are an AI version of Warren Buffett answering investment-related questions."},
                  {"role": "user", "content": query}]
    )
    base_response = base_response.model_dump()

    return response["choices"][0]["message"]["content"], base_response["choices"][0]["message"]["content"]

In testing, I found that the RAG system pulls chunks from the wrong years. For example, a query about 2005 retrieves chunks from 1992. This happens because retrieval ignores the metadata and only looks for semantically similar chunks; the year is added to the prompt as context only after retrieval.

I considered re-ranking. However, there is no guarantee that the top-k chunks include any from the correct years, so re-ranking may have nothing useful to promote.

I decided to implement a pre-retrieval filter so that retrieval strictly follows the years specified in the query.

This involves filtering the embeddings and chunks by year, then building a new FAISS index for each query. This may become inefficient at a larger scale, but in local testing it performs acceptably, taking under 1 second to build a new index.
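The effect of filtering before retrieval can be illustrated with a small, dependency-free sketch. It uses made-up 2-d embeddings and a brute-force search in place of FAISS; `filter_by_year`, `top_k`, and the records are hypothetical names and data for illustration only.

```python
def filter_by_year(records, wanted_years):
    """Keep only records whose 'year' is in wanted_years."""
    return [r for r in records if r["year"] in wanted_years]


def top_k(query_vec, records, k=2):
    """Brute-force nearest neighbours by squared L2 distance (smaller is closer)."""
    def dist(r):
        return sum((q - x) ** 2 for q, x in zip(query_vec, r["embedding"]))
    return sorted(records, key=dist)[:k]


# Toy 2-d embeddings (hypothetical values)
records = [
    {"year": 1992, "embedding": [0.9, 0.1], "text": "chunk from 1992"},
    {"year": 2005, "embedding": [0.2, 0.8], "text": "chunk A from 2005"},
    {"year": 2005, "embedding": [0.6, 0.4], "text": "chunk B from 2005"},
]
query_vec = [1.0, 0.0]

unfiltered = top_k(query_vec, records, k=1)
filtered = top_k(query_vec, filter_by_year(records, {2005}), k=1)
print(unfiltered[0]["year"], filtered[0]["year"])  # 1992 2005
```

Without the filter, the semantically closest chunk comes from 1992. With the filter, the search can only return chunks from 2005, regardless of how close other years are.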

I decided to use an LLM to extract the relevant years because queries can phrase time references in many different ways. E.g., "compare between 2004 and 2014" vs "how has Buffett's style changed between 2004 and 2014".
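Because an LLM's reply may not follow the requested format exactly, it can be worth parsing it defensively. The following is a minimal sketch, assuming the reply is either a list of years or the sentinel string `NO RELEVANT YEARS`. The helper `parse_years` is hypothetical, and the default 1977 to 2024 bounds mirror the range of letters processed above.

```python
import re


def parse_years(llm_reply, first_year=1977, last_year=2024):
    """Extract four-digit years from an LLM reply; return [] when none are usable."""
    if llm_reply.strip().upper() == "NO RELEVANT YEARS":
        return []
    # Tolerates missing spaces, trailing punctuation, or extra words around the years
    found = {int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", llm_reply)}
    # Discard years for which no letter exists
    return sorted(y for y in found if first_year <= y <= last_year)


print(parse_years("2004, 2014"))         # [2004, 2014]
print(parse_years("2004,2014."))         # [2004, 2014]
print(parse_years("NO RELEVANT YEARS"))  # []
```

An empty list can then be treated as "search all years", which also covers replies that mention only years outside the available letters.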

In [6]:
def get_relevant_years(query):
    prompt = f'''
    Given the following investment-related question: {query}, identify all relevant years mentioned or implied in the question.
    Return the years as a comma-separated list in the format 'YYYY, YYYY, YYYY, etc.'. Return only in the specified format with no text. Consider the following when identifying years:
    1. Explicit years mentioned in the query (e.g., '2000', '2024').
    2. Time periods or phrases that imply specific years (e.g., 'early 2000s' should be interpreted as '2000, 2001, 2002, 2003, 2004, 2005').
    3. If the word 'today' is mentioned, assume the current year is 2024 unless otherwise specified.
    4. If no years are explicitly mentioned or implied, return 'NO RELEVANT YEARS'.
    Pay close attention to point 2.
    '''
    
    # Use LLM to extract relevant years
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": "You are an AI that extracts relevant years from investment-related questions."},
                  {"role": "user", "content": prompt}]
    )
    response = response.model_dump()

    return response["choices"][0]["message"]["content"].strip()


def filtered_data(query):
    # Get relevant years from the query
    relevant_years = get_relevant_years(query)

    # Load the embeddings and chunks (saved in the same order, so they align by position)
    with open("embeddings.pkl", "rb") as f:
        loaded_embeddings = pickle.load(f)
    with open("chunks_with_metadata.pkl", "rb") as f:
        chunks_with_metadata = pickle.load(f)

    # Pull four-digit years out of the LLM reply; this tolerates stray spaces or text.
    # "NO RELEVANT YEARS" contains no digits, so it yields an empty set.
    year_set = set(re.findall(r"\b(?:19|20)\d{2}\b", relevant_years))

    # Keep only the entries from the requested years
    keep = [i for i, item in enumerate(loaded_embeddings) if str(item["year"]) in year_set]
    # If no years were requested, or none of them match a letter, use all data
    if not keep:
        keep = list(range(len(loaded_embeddings)))

    embeddings = np.array([loaded_embeddings[i]["embedding"] for i in keep]).astype("float32")
    years = [loaded_embeddings[i]["year"] for i in keep]  # Store metadata separately
    sentence_chunks = [chunks_with_metadata[i]["text"] for i in keep]

    # Create a FAISS index (L2 distance) over the selected embeddings
    dimension = embeddings.shape[1]  # 1024 for BGE
    index = faiss.IndexFlatL2(dimension)
    index.add(embeddings)

    return index, years, sentence_chunks

## Testing the system

The idea is that you can address the system directly in first person as if it's Warren Buffett. However, for the purposes of fair testing against the base response, the test queries are formatted in third person.

One difficulty in evaluation is the lack of labelled data needed to calculate recall@k, precision@k, ROUGE, BLEU, or BERTScore. My plan was to use groundedness to evaluate the retriever and an LLM as a judge to evaluate the generator. However, in testing, including the retrieved chunks in the evaluation prompt made it far too long. I therefore rely on the LLM's general knowledge to evaluate the generated response. I note that this is not ideal, as the judge does not see the context retrieved by the RAG system.
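For reference, if labelled chunks were available, precision@k and recall@k could be computed along the following lines. This is only a sketch: the chunk IDs and relevance labels are invented, and no such labels exist for this dataset.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Compute precision@k and recall@k for one query."""
    top = retrieved_ids[:k]
    hits = sum(1 for chunk_id in top if chunk_id in relevant_ids)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical retrieval result and ground-truth labels for a single query
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}

p, r = precision_recall_at_k(retrieved, relevant, k=5)
print(f"precision@5={p:.2f}, recall@5={r:.2f}")  # precision@5=0.40, recall@5=0.67
```

Two of the five retrieved chunks are relevant (precision 0.40), and two of the three relevant chunks were found (recall 0.67).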

I also included the base response without RAG to check that the generative power is not coming solely from the strength of the base LLM used for generation, but rather improved with the RAG system.

The prompt also instructs the model to include inline citations.

In [9]:
def evaluate_generated_response(query, response, sentence_chunks):
    prompt = f""" 
    Given the following query, and generated response, please evaluate the response based on relevance, accuracy, fluency, and informativeness.

    Query: {query}

    Generated Response:
    {response}

    Please provide a score from 1 to 10 for each criterion and a brief explanation for each score.

    Relevance:
    Accuracy:
    Fluency:
    Informativeness:
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": "You are an expert judge evaluating the performance of a Retrieval-Augmented Generation (RAG) system about Warren Buffett and Berkshire Hathaway."},
                  {"role": "user", "content": prompt}]
    )
    response = response.model_dump()

    return response["choices"][0]["message"]["content"]

### How has Buffett’s investment philosophy evolved over the years?

In [11]:
query = "How has Buffett’s investment philosophy evolved over the years?"
index, years, sentence_chunks = filtered_data(query)
response, base_response = query_berkshire_bot(query, index, years, sentence_chunks)
print("\nQuery:", query)
print("\nRAG Response:", response)
print("\nBase Response:", base_response)
print("\n RAG Response Evaluation:", evaluate_generated_response(query, response, sentence_chunks))

Pretend that you are Warren Buffett and answer in first person. Given the following document chunks from Berkshire Hathaway's annual letters, answer the query:

Query: How has Buffett’s investment philosophy evolved over the years?

The following document chunks are relevant to the query. Use them to answer the question based on Warren Buffett's investment strategies:

Year: 2014 (Score: 0.2923)
Relevant Document Chunk: Almost every element was chosen because Buffett believed that, under him, it would help maximize Berkshire’s achievement. He was not trying to create a one-type-fits-all system for other corporations. Indeed, Berkshire’s subsidiarieswerenotrequiredtousethe Berkshiresystemintheirownoperations. Andsomeflourishedwhileusing differentsystems. Whatwas Buffettaimingatashedesignedthe Berkshiresystem? Well,overtheyears Idiagnosedseveralimportantthemes: (1) He particularly wanted continuous maximization of the rationality, skills, and devotion of the most importantpeopleinthesyst

### How does Warren Buffett explain his decision to invest in companies like Coca-Cola and Apple?

In [12]:
query = "How does Warren Buffett explain his decision to invest in companies like Coca-Cola and Apple?"
index, years, sentence_chunks = filtered_data(query)
response, base_response = query_berkshire_bot(query, index, years, sentence_chunks)
print("\nQuery:", query)
print("\nRAG Response:", response)
print("\nBase Response:", base_response)
print("\n RAG Response Evaluation:", evaluate_generated_response(query, response, sentence_chunks))

Pretend that you are Warren Buffett and answer in first person. Given the following document chunks from Berkshire Hathaway's annual letters, answer the query:

Query: How does Warren Buffett explain his decision to invest in companies like Coca-Cola and Apple?

The following document chunks are relevant to the query. Use them to answer the question based on Warren Buffett's investment strategies:

Year: 1996 (Score: 0.2784)
Relevant Document Chunk: This investor would get a similar result if he followed a policy of purchasing an interest in, say, 20% of the future earnings of a number of outstanding college basketball stars. A handful of these would go on to achieve NBA stardom, and the investor's take from them would soon dominate his royalty stream. To suggest that this investor should sell off portions of his most successful investments simply because they have come to dominate his portfolio is akin to suggesting that the Bulls trade Michael Jordan because he has become so importan

### What insights about Berkshire Hathaway’s long-term strategy can be drawn from comparing Warren Buffett’s 2004 and 2014 letters?

In [13]:
query = "What insights about Berkshire Hathaway’s long-term strategy can be drawn from comparing Warren Buffett’s 2004 and 2014 letters?"
index, years, sentence_chunks = filtered_data(query)
response, base_response = query_berkshire_bot(query, index, years, sentence_chunks)
print("\nQuery:", query)
print("\nRAG Response:", response)
print("\nBase Response:", base_response)
print("\n RAG Response Evaluation:", evaluate_generated_response(query, response, sentence_chunks))

Pretend that you are Warren Buffett and answer in first person. Given the following document chunks from Berkshire Hathaway's annual letters, answer the query:

Query: What insights about Berkshire Hathaway’s long-term strategy can be drawn from comparing Warren Buffett’s 2004 and 2014 letters?

The following document chunks are relevant to the query. Use them to answer the question based on Warren Buffett's investment strategies:

Year: 2014 (Score: 0.2342)
Relevant Document Chunk: A note to readers: Fifty years ago, today’s management took charge at Berkshire. For this Golden Anniversary, Warren Buffett and Charlie Munger each wrote his views of what has happened at Berkshire during the past 50 years and what each expects during the next 50. Neither changed a word of his commentary after reading what the other had written. Warren’s thoughts begin on page 24 and Charlie’s on page 39. Shareholders, particularly new ones,mayfinditusefultoreadthoselettersbeforereadingthereporton2014,whic

### If Warren Buffett were to make an investment in the tech industry today, how might his strategy differ from his approach in the early 2000s?

In [14]:
query = "If Warren Buffett were to make an investment in the tech industry today, how might his strategy differ from his approach in the early 2000s?"
index, years, sentence_chunks = filtered_data(query)
response, base_response = query_berkshire_bot(query, index, years, sentence_chunks)
print("\nQuery:", query)
print("\nRAG Response:", response)
print("\nBase Response:", base_response)
print("\n RAG Response Evaluation:", evaluate_generated_response(query, response, sentence_chunks))

Pretend that you are Warren Buffett and answer in first person. Given the following document chunks from Berkshire Hathaway's annual letters, answer the query:

Query: If Warren Buffett were to make an investment in the tech industry today, how might his strategy differ from his approach in the early 2000s?

The following document chunks are relevant to the query. Use them to answer the question based on Warren Buffett's investment strategies:

Year: 2000 (Score: 0.3693)
Relevant Document Chunk: * A bit of nostalgia: It was exactly 50 years ago that I entered Ben Graham’s class at Columbia. During the decade before, I had enjoyed (cid:190) make that loved (cid:190) analyzing, buying and selling stocks. But my results were no better than average. Beginning in 1951 my performance improved. No, I hadn’t changed my diet or taken up exercise. The only new ingredient was Ben’s ideas. Quite simply, a few hours spent at the feet of the master proved far more valuable to me than had ten years o

### What are some key takeaways from your 2019 letter on international investments?

In [15]:
query = "What are some key takeaways from your 2019 letter on international investments?"
index, years, sentence_chunks = filtered_data(query)
response, base_response = query_berkshire_bot(query, index, years, sentence_chunks)
print("\nQuery:", query)
print("\nRAG Response:", response)
print("\nBase Response:", base_response)
print("\n RAG Response Evaluation:", evaluate_generated_response(query, response, sentence_chunks))

Pretend that you are Warren Buffett and answer in first person. Given the following document chunks from Berkshire Hathaway's annual letters, answer the query:

Query: What are some key takeaways from your 2019 letter on international investments?

The following document chunks are relevant to the query. Use them to answer the question based on Warren Buffett's investment strategies:

Year: 2019 (Score: 0.3681)
Relevant Document Chunk: As we stated in last year’s letter, neither Charlie Munger, my partner in managing Berkshire, nor I agree withthatrule. The adoption of the rule by the accounting profession, in fact, was a monumental shift in its own thinking. Before 2018, GAAP insisted – with an exception for companies whose business was to trade securities – that unrealized gains within a portfolio of stocks were never to be included in earnings and unrealized losses were to be included only if they were deemed “other than temporary.” Now, Berkshire must enshrine in each quarter’s bot