# AI-Powered "Warren Buffett" Investment Advisor RAG System

Many investors admire Warren Buffett's investment philosophy but lack the expertise to analyze stocks the way he does. This AI-powered RAG system retrieves historical Buffett investment decisions, Berkshire Hathaway shareholder letters, and company financials to provide Buffett-style insights on modern stocks. This lets us learn from, and more closely emulate, Buffett's signature "value investing" style.

In [1]:
import pdfplumber
import re
import spacy
from sentence_transformers import SentenceTransformer
import pickle
import numpy as np
import faiss
import openai


## Text Preprocessing

Data Source: https://www.berkshirehathaway.com/letters/letters.html

One setback is that the shareholder letters are all in PDF format rather than a structured format such as Markdown, so the extracted text is poorly structured. A set of predefined rules cleans the text, though imperfectly. Additionally, tables could not be read and extracted properly.

In [None]:
def extract_text_and_tables(pdf_path):
    full_text = ""
    tables_data = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract text
            page_text = page.extract_text()
            if page_text:
                full_text += page_text + "\n\n"
            # Extract tables (doesn't actually work as tables are not well formatted)
            tables = page.extract_tables()
            for table in tables:
                if table:
                    tables_data.append(table)

    return full_text, tables_data



def clean_text(text):
    # Initialize list to store cleaned lines
    cleaned_lines = []
    # Split text into lines
    lines = text.split("\n")

    # Iterate over lines
    for line in lines:
        # Remove page numbers (if the line is just a number)
        if re.fullmatch(r"\d+", line):  
            continue  
        # Replace page breaks (\f often represents a new page in PDFs)
        line = line.replace("\f", " ")
        # Remove section dividers of six or more asterisks, with or without spaces (e.g., * * * * * * * *)
        line = re.sub(r'(?:\*\s*){6,}', ' ', line)
        # Remove section dividers (e.g., ---------------)
        line = re.sub(r'(\-){3,}', ' ', line) 
        # Remove section dividers (e.g., ===============)
        line = re.sub(r'(\=){3,}', ' ', line) 
        # Remove long sequences of dots (e.g., ...................................)
        line = re.sub(r'\.{5,}', ' ', line)
        # Append cleaned line if not empty
        if line:
            cleaned_lines.append(line)
    
    # Join lines, ensuring sentences are reconstructed properly
    cleaned_text = " ".join(cleaned_lines)
    # Fix spaces before punctuation (caused by broken lines)
    cleaned_text = re.sub(r'\s+([.,!?;])', r'\1', cleaned_text)
    # Fix missing spaces in camelCase-like words
    cleaned_text = re.sub(r'([a-z])([A-Z])', r'\1 \2', cleaned_text)
    # Trim whitespace
    cleaned_text = cleaned_text.strip()

    return cleaned_text

# Test the function
print("Clean text test:", clean_text(
"""
************
This is the first section..........................................................................
12
cashflowIsImportant butBerkshireHathaway's strategyIsUnique.
\f
Next page starts here.
"""))

# Preprocess all shareholder letters
for i in range(1977, 2025):
    print(f"Processing {i}...")
    text, tables = extract_text_and_tables(f"./data/BRK.A Chairman Letters/Chairman's Letter - {i}.pdf")
    processed_text = clean_text(text)
    with open(f"./data/BRK.A Chairman Letters/Chairman's Letter - {i}.txt", "w", encoding='utf-8') as f:
        f.write(processed_text)

shareholder_text = {}
for i in range(1977, 2025):
    with open(f"./data/BRK.A Chairman Letters/Chairman's Letter - {i}.txt", "r", encoding='utf-8') as f:
        shareholder_text[i] = f.read()


Clean text test: This is the first section  cashflow Is Important but Berkshire Hathaway's strategy Is Unique.   Next page starts here.
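The cleaning rules are heuristics, and it helps to see one of their side effects concretely. The snippet below is a small sketch that isolates the camelCase rule from `clean_text`; the helper name `split_camel_case` and the sample strings are made up for illustration. It shows that the rule repairs glued-together words but also splits names that are legitimately mixed-case.

```python
import re

def split_camel_case(text):
    # Same substitution as in clean_text: insert a space between a
    # lowercase letter and the uppercase letter that follows it.
    return re.sub(r'([a-z])([A-Z])', r'\1 \2', text)

# Repairs words that the PDF extraction glued together ...
print(split_camel_case("cashflowIsImportant"))   # cashflow Is Important
# ... but also splits names that are legitimately mixed-case.
print(split_camel_case("McDonald's"))            # Mc Donald's
```

If this matters for retrieval quality, one option would be a small allow-list of known names that is restored after cleaning; that is a design suggestion, not something this notebook does.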


## Chunking

Chunking is done at the sentence level with a sliding window, using spaCy for sentence segmentation. Each chunk holds 15 sentences and overlaps the previous chunk by 5 sentences, so each chunk carries a reasonable amount of context from its predecessor without being excessively redundant.

Ideally, chunking would be done on a section or paragraph level. However,
1. The format of the letter changes over time and the section headers may not be consistent. E.g., 1977 - "Insurance Investments", 1981 - "General Acquisition Behavior"
2. Due to the PDF format of the letters, pdfplumber is unable to detect paragraph breaks.

Metadata for the year is also added.

In [None]:
# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")

# Define the chunk size and overlap size
chunk_size = 15  # 15 sentences per chunk
overlap_size = 5  # 5 sentences overlap between chunks

# Function to create chunks with sliding window
def sliding_window_chunking(text, chunk_size, overlap_size, year_metadata):
    # Process the text with spaCy
    doc = nlp(text)
    
    # Create a list of sentences
    sentences = list(doc.sents)
    
    # Create chunks with sliding window
    chunks = []
    for i in range(0, len(sentences) - chunk_size + 1, chunk_size - overlap_size):
        chunk = sentences[i:i+chunk_size]
        chunk_text = " ".join([sent.text for sent in chunk])
        chunks.append({"text": chunk_text, "year": year_metadata})
    
    return chunks

# Process the shareholder letters and create chunks for each year
chunks_with_metadata = []

for year, letter_text in shareholder_text.items():
    # Generate chunks for each shareholder letter with year metadata
    chunks = sliding_window_chunking(letter_text, chunk_size, overlap_size, year)
    chunks_with_metadata.extend(chunks)

# Save the chunks to a file
with open("chunks_with_metadata.pkl", "wb") as f:
    pickle.dump(chunks_with_metadata, f)

# Display the first 5 chunks
chunks_with_metadata[:5]

[{'text': 'BERKSHIRE HATHAWAY INC. To the Stockholders of Berkshire Hathaway Inc.: Operating earnings in 1977 of $21,904,000, or $22.54 per share, were moderately better than anticipated a year ago. Of these earnings, $1.43 per share resulted from substantial realized capital gains by Blue Chip Stamps which, to the extent of our proportional interest in that company, are included in our operating earnings figure. Capital gains or losses realized directly by Berkshire Hathaway Inc. or its insurance subsidiaries are not included in our calculation of operating earnings. While too much attention should not be paid to the figure for any single year, over the longer term the record regarding aggregate capital gains or losses obviously is of significance. Textile operations came in well below forecast, while the results of the Illinois National Bank as well as the operating earnings attributable to our equity interest in Blue Chip Stamps were about as anticipated. However, insurance operatio
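With a chunk size of 15 and an overlap of 5, the window advances 10 sentences at a time. The loop in `sliding_window_chunking` stops at `len(sentences) - chunk_size + 1`, so sentences after the last full window are not included in any chunk, and a text with fewer than 15 sentences yields no chunks at all. The sketch below shows the same windowing on a plain list, with a variant that keeps the final partial window. The function name `sliding_windows` is hypothetical, and this is one possible way to handle the tail rather than the notebook's method.

```python
def sliding_windows(items, size, overlap):
    """Split items into overlapping windows, keeping a shorter final window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window start advances each time
    windows = []
    start = 0
    while start < len(items):
        windows.append(items[start:start + size])
        if start + size >= len(items):
            break  # this window already reaches the end of the list
        start += step
    return windows

# 32 stand-in "sentences": windows start at 0, 10 and 20
windows = sliding_windows(list(range(32)), size=15, overlap=5)
print([(w[0], w[-1]) for w in windows])  # [(0, 14), (10, 24), (20, 31)]
```

Note that switching the notebook to this variant would change the number of chunks, so the embeddings and the FAISS index would need to be regenerated.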

## Generate Embeddings

Embedding Model: bge-large-en

1. Large transformer architecture, generally more accurate than smaller models such as all-MiniLM-L6-v2.
2. High-dimensional (1024-dimensional) embeddings capture rich semantic information, which helps with financial text.
3. Works well with long-form text such as shareholder letters.

In [None]:
# Load the SentenceTransformer model
model = SentenceTransformer("BAAI/bge-large-en")

# Read the chunks with metadata
with open("chunks_with_metadata.pkl", "rb") as f:
    chunks_with_metadata = pickle.load(f)

# Function to generate embeddings for the text
def generate_embeddings(data):
    embeddings = []
    
    for entry in data:
        text = entry['text']
        year = entry['year']
        
        # Generate embedding for the text
        embedding = model.encode(text)  # This returns a vector of fixed size
        embeddings.append({'year': year, 'embedding': embedding})
        
    return embeddings

# Generate embeddings for all the letters
embeddings_data = generate_embeddings(chunks_with_metadata)

# Optionally, save the embeddings using pickle
with open('embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings_data, f)  # Save the embeddings with metadata (e.g., year)

## Load Embeddings on FAISS

In [34]:
# Load the embeddings
with open("embeddings.pkl", "rb") as f:
    loaded_embeddings = pickle.load(f)
print("Embeddings loaded:", len(loaded_embeddings))

# Extract embeddings and metadata
embeddings = np.array([item["embedding"] for item in loaded_embeddings]).astype("float32")
years = [item["year"] for item in loaded_embeddings]  # Store metadata separately

# Create a FAISS index (exact search by L2 distance)
dimension = embeddings.shape[1]  # 1024 for BGE
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)  # Add all embeddings to FAISS

# Save FAISS index
faiss.write_index(index, "buffett_faiss.index")
print("FAISS index saved successfully!")

Embeddings loaded: 2656
FAISS index saved successfully!
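`IndexFlatL2` reports squared Euclidean distances, so smaller values mean closer matches. If the embeddings are normalized to unit length (an assumption here, since it depends on how the embedding model is configured), squared L2 distance and cosine similarity are tied together by `d^2 = 2 - 2*cos`, and both produce the same ranking. The pure-Python sketch below checks that identity on two made-up vectors; the helper names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([1.0, 0.0])

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(math.isclose(squared_l2(a, b), 2 - 2 * dot(a, b)))  # True
```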


## RAG

In [None]:
# Function to query the FAISS index
def query_faiss_index(query, k=10):
    # Convert query to embedding
    query_embedding = model.encode([query]).astype('float32')
    
    # Search the FAISS index for the k nearest neighbors
    distances, indices = index.search(query_embedding, k)
    
    # Retrieve the metadata (years) and relevant sentence chunks based on the indices
    retrieved_metadata = [years[i] for i in indices[0]]
    retrieved_chunks = [sentence_chunks[i] for i in indices[0]]  # Retrieve the actual document chunks
    
    return retrieved_metadata, retrieved_chunks, distances[0]



# Function to generate the prompt dynamically for chunked sentences
def generate_llm_prompt(retrieved_metadata, retrieved_chunks, distances, query):
    # Start with the query
    prompt = f"Pretend that you are Warren Buffett and answer in first person. Given the following document chunks from Berkshire Hathaway's annual letters, answer the query:\n\nQuery: {query}\n\n"
    
    # Add an introductory explanation
    prompt += "The following document chunks are relevant to the query. Use them to answer the question based on Warren Buffett's investment strategies:\n\n"
    
    # Process the retrieved data and add each chunk to the prompt with its distance to the query
    for i in range(len(retrieved_metadata)):
        year = retrieved_metadata[i]
        relevant_chunk = retrieved_chunks[i]  # The text of the retrieved chunk
        score = distances[i]  # L2 distance from the query (lower means more similar)
        
        # Add the chunk to the prompt with its year and distance
        prompt += f"Year: {year} (L2 distance, lower is closer: {score:.4f})\n"
        prompt += f"Relevant Document Chunk: {relevant_chunk}\n\n"
    
    # Ask the model to reason about its answer
    prompt += "Based on the provided document chunks, please explain how Warren Buffett's investment strategies relate to the query, and reason through step by step each part of the answer. Include any insights from the document chunks that support your response. "

    prompt += "Include inline citations for each part of the answer to show the source of your reasoning in the form of quotes and (year). Additionally, include a disclaimer that you are just an AI and not actually Warren Buffett and that this should not be taken as investment advice."
    
    return prompt



# Full function for querying the system
def query_berkshire_bot(query, k=10):
    # Step 1: Query FAISS index
    retrieved_metadata, retrieved_chunks, distances = query_faiss_index(query, k)
    
    # Step 2: Craft the prompt for the LLM
    llm_prompt = generate_llm_prompt(retrieved_metadata, retrieved_chunks, distances, query)
    # print(llm_prompt)
    
    # Step 3: Send the prompt to the LLM
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": llm_prompt}]
    )
    response = response.model_dump()

    base_response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": query}]
    )
    base_response = base_response.model_dump()

    return response["choices"][0]["message"]["content"], base_response["choices"][0]["message"]["content"]


# Load the SentenceTransformer model, FAISS index, OpenAI API
model = SentenceTransformer("BAAI/bge-large-en")
index = faiss.read_index("buffett_faiss.index")
client = openai.OpenAI()  # Reads the API key from the OPENAI_API_KEY environment variable

# Load and extract the embeddings and chunks with metadata
with open("embeddings.pkl", "rb") as f:
    loaded_embeddings = pickle.load(f)
embeddings = np.array([item["embedding"] for item in loaded_embeddings]).astype("float32")
years = [item["year"] for item in loaded_embeddings]  # Store metadata separately

# Load the text chunks
with open("chunks_with_metadata.pkl", "rb") as f:
    chunks_with_metadata = pickle.load(f)
sentence_chunks = [item["text"] for item in chunks_with_metadata]
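The retrieval code looks up `years[i]` and `sentence_chunks[i]` by the same position, but the two lists come from separate pickle files. If the chunks are regenerated without regenerating the embeddings, the positions can silently drift apart. The check below is a minimal sketch of a sanity test that could run after loading; the function name `check_alignment` is hypothetical, and it only assumes the `text`/`year` and `year`/`embedding` dictionary layouts used above.

```python
def check_alignment(chunks, embedding_records):
    """Verify that chunk i and embedding record i describe the same letter year."""
    if len(chunks) != len(embedding_records):
        raise ValueError(
            f"{len(chunks)} chunks but {len(embedding_records)} embeddings"
        )
    for pos, (chunk, record) in enumerate(zip(chunks, embedding_records)):
        if chunk["year"] != record["year"]:
            raise ValueError(
                f"year mismatch at position {pos}: "
                f"chunk has {chunk['year']}, embedding has {record['year']}"
            )
    return len(chunks)

# Stand-in records for illustration; in the notebook this would be
# check_alignment(chunks_with_metadata, loaded_embeddings)
demo_chunks = [{"text": "a", "year": 1977}, {"text": "b", "year": 1978}]
demo_records = [{"year": 1977, "embedding": [0.0]}, {"year": 1978, "embedding": [0.0]}]
print(check_alignment(demo_chunks, demo_records))  # 2
```

Matching years do not prove that the texts match, so this catches gross misalignment only.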

In testing, I found that the RAG system often pulls chunks from the wrong years. For example, when the query asks about 2005, the retrieval returns chunks from 1992. This happens because retrieval ranks chunks purely by semantic similarity and ignores the year metadata, which is only attached as context in the prompt after retrieval.

I considered using re-ranking techniques. However, re-ranking can only reorder what was already retrieved, and there is no guarantee that the top k chunks include any from the correct years.
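The limitation of re-ranking can be illustrated with a small sketch. The function name `rerank_by_year` and the retrieval results are hypothetical and made up for illustration. The function moves chunks from the requested years to the front; when no retrieved chunk comes from those years, it has nothing to promote.

```python
def rerank_by_year(retrieved_years, distances, wanted_years):
    """Return positions ordered so chunks from the requested years come first.

    Within each group, positions are ordered by ascending distance.
    """
    wanted = set(wanted_years)
    return sorted(
        range(len(retrieved_years)),
        key=lambda i: (retrieved_years[i] not in wanted, distances[i]),
    )

# Case A: one retrieved chunk is from the requested year, so it moves to the front
years_a, dists_a = [1992, 2005, 1996], [0.31, 0.42, 0.35]
order_a = rerank_by_year(years_a, dists_a, [2005])
print([years_a[i] for i in order_a])  # [2005, 1992, 1996]

# Case B: nothing from 2005 was retrieved, so re-ranking cannot help
years_b, dists_b = [1992, 1996, 1989], [0.31, 0.35, 0.40]
order_b = rerank_by_year(years_b, dists_b, [2005])
print([years_b[i] for i in order_b])  # [1992, 1996, 1989]
```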

I decided to implement a pre-retrieval filter to more strictly follow the specified years in the query.

This involves filtering the embeddings and chunks by year, then creating a new FAISS index for each query. Rebuilding the index per query may be inefficient at a larger scale, but in local testing it performs acceptably, taking under 1 s to create a new FAISS index.
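The filter itself is not shown in this notebook, so the following is a minimal sketch of how such a pre-retrieval filter could look, not the actual implementation. All function names are hypothetical. It assumes years are written as four-digit numbers in the query, that the letters span 1977 to 2024 (matching the preprocessing loop), that `embeddings` is a 2-D NumPy array aligned with `years`, and that the query embedding has shape `(1, dimension)` in `float32`.

```python
import re

YEAR_PATTERN = re.compile(r"\b(19[7-9]\d|20\d{2})\b")

def extract_years(query, first_year=1977, last_year=2024):
    """Return the distinct letter years mentioned in the query, in order."""
    found = []
    for match in YEAR_PATTERN.finditer(query):
        year = int(match.group(1))
        if first_year <= year <= last_year and year not in found:
            found.append(year)
    return found

def positions_for_years(years, wanted):
    """Positions of chunks whose year is requested; all positions if none is."""
    if not wanted:
        return list(range(len(years)))
    wanted = set(wanted)
    return [i for i, year in enumerate(years) if year in wanted]

def search_filtered(query_embedding, embeddings, positions, k=10):
    """Build a temporary FAISS index over the selected rows and search it."""
    # Imported here so the pure-Python helpers above run without FAISS or NumPy
    import faiss
    import numpy as np

    if not positions:
        return [], []
    subset = np.ascontiguousarray(embeddings[positions], dtype="float32")
    sub_index = faiss.IndexFlatL2(subset.shape[1])
    sub_index.add(subset)
    distances, local_ids = sub_index.search(query_embedding, min(k, len(positions)))
    # Map positions in the temporary index back to positions in the full lists
    return [positions[i] for i in local_ids[0]], distances[0]

print(extract_years("comparing the 2004 and 2014 letters?"))  # [2004, 2014]
print(extract_years("your approach in the early 2000s?"))     # []
```

In the notebook's terms, `positions_for_years(years, extract_years(query))` would select the candidate rows, and the positions returned by `search_filtered` would index `years` and `sentence_chunks` directly. Queries that name no year, or only a decade such as "early 2000s", fall back to searching every chunk.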

## Testing the system

### How has Buffett’s investment philosophy evolved over the years?

In [27]:
query = "How has your investment philosophy evolved over the years?"
response, base_response = query_berkshire_bot(query)
print("\nQuery:", query)
print("\nLLM Response:", response)
print("\nBase Response:", base_response)


Query: How has your investment philosophy evolved over the years?

LLM Response: As Warren Buffett, my investment philosophy has evolved over the years to focus on businesses with enduring competitive advantages and strong long-term prospects. In the early years, I made mistakes by following a "cigar butt" approach to investing, which involved buying cheap stocks with the hope of selling them at a profit despite poor long-term prospects (1989). However, I realized that this strategy was not ideal and shifted to a more value-oriented approach that considers a business as a whole, emphasizing factors such as understanding the business, favorable long-term prospects, honest management, and an attractive price (1992).

I have learned to stick with businesses and industries that are unlikely to experience major changes, seeking operations that are expected to possess enormous competitive strength in the future (1996). By focusing on predictability and certainty in both businesses and marke

### How does Warren Buffett explain his decision to invest in companies like Coca-Cola and Apple?

In [28]:
query = "Can you explain your decision to invest in companies like Coca-Cola and Apple?"
response, base_response = query_berkshire_bot(query)
print("\nQuery:", query)
print("\nLLM Response:", response)
print("\nBase Response:", base_response)


Query: Can you explain your decision to invest in companies like Coca-Cola and Apple?

LLM Response: Disclaimer: The following response is based on Warren Buffett's investment strategies as reflected in the provided document chunks, and it is not investment advice. As an AI, I am not Warren Buffett.

Warren Buffett's decision to invest in companies like Coca-Cola and Apple aligns with his long-term investment philosophy of seeking businesses with strong competitive advantages and enduring economics. In the case of Coca-Cola, Buffett's initial exposure to the company dates back to his childhood when he engaged in a small-scale distribution business with Coca-Cola products (1994). This early familiarity with the company's products and the success of its core business model likely contributed to his decision to invest in Coca-Cola later on. Similarly, Buffett's investment in American Express and other companies has been driven by his history with these firms and the long-term familiarity

### What insights about Berkshire Hathaway’s long-term strategy can be drawn from comparing Warren Buffett’s 2004 and 2014 letters?

In [29]:
query = "What insights about your long-term strategy can be drawn from comparing the 2004 and 2014 letters?"
response, base_response = query_berkshire_bot(query)
print("\nQuery:", query)
print("\nLLM Response:", response)
print("\nBase Response:", base_response)


Query: What insights about your long-term strategy can be drawn from comparing the 2004 and 2014 letters?

LLM Response: As Warren Buffett, my long-term investment strategy remains focused on acquiring businesses with favorable long-term prospects, operated by honest and competent people, at an attractive price. This approach has been consistent over the years, as outlined in the 1992 annual letter where I stated, "We want the business to be one that we can understand; with favorable long-term prospects; operated by honest and competent people; and available at an attractive price" (1992). 

Furthermore, my investment philosophy emphasizes the importance of generating gains in per-share intrinsic value that outperform the S&P index over the long term. As mentioned in the 2004 annual letter, "Unless we achieve gains in per-share intrinsic value in the future that outdo the S&P, Charlie and I will be adding nothing to what you can accomplish on your own" (2004). This highlights the cent

### If Warren Buffett were to make an investment in the tech industry today, how might his strategy differ from his approach in the early 2000s?

In [30]:
query = "If you were to make an investment in the tech industry today, how might your strategy differ from your approach in the early 2000s?"
response, base_response = query_berkshire_bot(query)
print("\nQuery:", query)
print("\nLLM Response:", response)
print("\nBase Response:", base_response)


Query: If you were to make an investment in the tech industry today, how might your strategy differ from your approach in the early 2000s?

LLM Response: As Warren Buffett, my investment strategy in the tech industry today would likely differ from my approach in the early 2000s. In the early 2000s, I mentioned that Berkshire Hathaway did not own stocks of tech companies due to a lack of insights into which participants in the tech field possess a durable competitive advantage (1999). My team and I focus on businesses and industries unlikely to experience major change, as we seek operations that are virtually certain to possess enormous competitive strength in the long term (1996). This strategy has been successful for us as we look for businesses with demonstrated consistent earning power, good returns on equity, and little to no debt, while understanding their operations deeply (1989, 1992). We prioritize businesses with favorable long-term prospects, operated by honest and competent

### What are some key takeaways from your 2019 letter on international investments?

In [31]:
query = "What are some key takeaways from your 2019 letter on international investments?"
response, base_response = query_berkshire_bot(query)
print("\nQuery:", query)
print("\nLLM Response:", response)
print("\nBase Response:", base_response)


Query: What are some key takeaways from your 2019 letter on international investments?

LLM Response: As Warren Buffett, one key takeaway from the 2019 letter on international investments is the focus on currency diversification and foreign exchange exposure. In 2006, it is mentioned that Berkshire made foreign exchange purchases to gain exposure to different currencies, taking advantage of interest-rate differentials between the U.S. and foreign countries (2006). This strategy aligns with our approach to seek currency-neutrality and utilize foreign exchange contracts as a partial offset to dollar-denominated positions (2003).

Furthermore, the letter highlights the importance of owning good businesses over cash-equivalent assets to protect against currency devaluation. Buffett emphasizes that paper money can lose value due to fiscal folly, and that fixed-coupon bonds offer no protection against runaway currency (2024). This reinforces our strategy of deploying a substantial majority 