# AI-Powered "Warren Buffett" Investment Advisor RAG System

Many investors admire Warren Buffett’s investment philosophy but lack the expertise to analyze stocks the way he does. This AI-powered RAG system retrieves historical Buffett investment decisions, Berkshire Hathaway shareholder letters, and company financials to provide Buffett-style insights on modern stocks. This allows us to more closely emulate and learn Buffett's signature "value investment" style.

In [68]:
import pdfplumber
import re
import spacy
from sentence_transformers import SentenceTransformer
import pickle
import numpy as np
import faiss
import openai

## Text Preprocessing

Data Source: https://www.berkshirehathaway.com/letters/letters.html

One drawback is that the shareholder letters are all in PDF format rather than Markdown, so the extracted text is poorly structured. A few hand-written rules clean the text, but they are not perfect. In addition, tables could not be read and extracted properly.

In [None]:
def extract_text_and_tables(pdf_path):
    full_text = ""
    tables_data = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract text
            page_text = page.extract_text()
            if page_text:
                full_text += page_text + "\n\n"
            # Extract tables (doesn't actually work as tables are not well formatted)
            tables = page.extract_tables()
            for table in tables:
                if table:
                    tables_data.append(table)

    return full_text, tables_data



def clean_text(text):
    # Initialize list to store cleaned lines
    cleaned_lines = []
    # Split text into lines
    lines = text.split("\n")

    # Iterate over lines
    for line in lines:
        # Remove page numbers (if the line is just a number)
        if re.fullmatch(r"\d+", line):  
            continue  
        # Replace page breaks (\f often represents a new page in PDFs)
        line = line.replace("\f", " ")
        # Remove section dividers (e.g., ***********)
        line = re.sub(r'\*{5,}', ' ', line)
        # Remove long sequences of dots (e.g., ...................................)
        line = re.sub(r'\.{5,}', ' ', line)
        # Append cleaned line if not empty
        if line:
            cleaned_lines.append(line)
    
    # Join lines, ensuring sentences are reconstructed properly
    cleaned_text = " ".join(cleaned_lines)
    # Fix spaces before punctuation (caused by broken lines)
    cleaned_text = re.sub(r'\s+([.,!?;])', r'\1', cleaned_text)
    # Fix missing spaces in camelCase-like words
    cleaned_text = re.sub(r'([a-z])([A-Z])', r'\1 \2', cleaned_text)
    # Trim whitespace
    cleaned_text = cleaned_text.strip()

    return cleaned_text

# Test the function
print("Clean text test:", clean_text(
"""
************
This is the first section..........................................................................
12
cashflowIsImportant butBerkshireHathaway's strategyIsUnique.
\f
Next page starts here.
"""))

# Preprocess all shareholder letters
for i in range(1977, 2025):
    print(f"Processing {i}...")
    text, tables = extract_text_and_tables(f"./data/BRK.A Chairman Letters/Chairman's Letter - {i}.pdf")
    processed_text = clean_text(text)
    with open(f"./data/BRK.A Chairman Letters/Chairman's Letter - {i}.txt", "w", encoding='utf-8') as f:
        f.write(processed_text)

shareholder_text = {}
for i in range(1977, 2025):
    with open(f"./data/BRK.A Chairman Letters/Chairman's Letter - {i}.txt", "r", encoding='utf-8') as f:
        shareholder_text[i] = f.read()


Clean text test: This is the first section  cashflow Is Important but Berkshire Hathaway's strategy Is Unique.   Next page starts here.
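The test output above still contains runs of two or three spaces where the divider, the dot leader, and the page break were each replaced by a space. A minimal sketch of one extra normalization step, assuming it is acceptable to collapse every whitespace run into a single space (the helper name `collapse_whitespace` is hypothetical and not part of the notebook):

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

sample = (
    "This is the first section  cashflow Is Important but Berkshire "
    "Hathaway's strategy Is Unique.   Next page starts here."
)
print(collapse_whitespace(sample))
```

Applied as the last step of `clean_text`, this would leave single spaces between sentences, which may make spaCy's sentence segmentation slightly more predictable.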


## Chunking

Chunking is done at the sentence level with a sliding window, using spaCy for sentence segmentation. Each chunk holds 15 sentences and overlaps the previous chunk by 5 sentences (a stride of 10 sentences), so each chunk carries a reasonable amount of context from the previous one without being excessively redundant.

Ideally, chunking would be done on a section or paragraph level. However,
1. The format of the letter changes over time and the section headers may not be consistent. E.g., 1977 - "Insurance Investments", 1981 - "General Acquisition Behavior"
2. Due to the PDF format of the letters, pdfplumber is unable to detect paragraph breaks.

Metadata for the year is also added.
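To make the window arithmetic concrete, the sketch below lists which sentence indices land in each chunk for a hypothetical 40-sentence letter, assuming the parameters described above (15 sentences per chunk, 5 overlapping). The helper name `window_spans` is for illustration only.

```python
def window_spans(n_sentences, chunk_size=15, overlap_size=5):
    """Return (start, end) sentence-index pairs for full-size sliding windows."""
    stride = chunk_size - overlap_size  # how far each window advances
    return [
        (start, start + chunk_size)
        for start in range(0, n_sentences - chunk_size + 1, stride)
    ]

spans = window_spans(40)
print(spans)  # [(0, 15), (10, 25), (20, 35)]
```

Each `end` index is exclusive, so consecutive chunks share sentences 10 to 14 and 20 to 24, and the last full window stops at index 35.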

In [None]:
# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")

# Define the chunk size and overlap size
chunk_size = 15  # 15 sentences per chunk
overlap_size = 5  # 5 sentences overlap between chunks

# Function to create chunks with sliding window
def sliding_window_chunking(text, chunk_size, overlap_size, year_metadata):
    # Process the text with spaCy
    doc = nlp(text)
    
    # Create a list of sentences
    sentences = list(doc.sents)
    
    # Create chunks with sliding window
    chunks = []
    for i in range(0, len(sentences) - chunk_size + 1, chunk_size - overlap_size):
        chunk = sentences[i:i+chunk_size]
        chunk_text = " ".join([sent.text for sent in chunk])
        chunks.append({"text": chunk_text, "year": year_metadata})
    
    return chunks

# Process the shareholder letters and create chunks for each year
chunks_with_metadata = []

for year, letter_text in shareholder_text.items():
    # Generate chunks for each shareholder letter with year metadata
    chunks = sliding_window_chunking(letter_text, chunk_size, overlap_size, year)
    chunks_with_metadata.extend(chunks)

# Save the chunks to a file
with open("chunks_with_metadata.pkl", "wb") as f:
    pickle.dump(chunks_with_metadata, f)

# Display the first 5 chunks
chunks_with_metadata[:5]

[{'text': 'BERKSHIRE HATHAWAY INC. To the Stockholders of Berkshire Hathaway Inc.: Operating earnings in 1977 of $21,904,000, or $22.54 per share, were moderately better than anticipated a year ago. Of these earnings, $1.43 per share resulted from substantial realized capital gains by Blue Chip Stamps which, to the extent of our proportional interest in that company, are included in our operating earnings figure. Capital gains or losses realized directly by Berkshire Hathaway Inc. or its insurance subsidiaries are not included in our calculation of operating earnings. While too much attention should not be paid to the figure for any single year, over the longer term the record regarding aggregate capital gains or losses obviously is of significance. Textile operations came in well below forecast, while the results of the Illinois National Bank as well as the operating earnings attributable to our equity interest in Blue Chip Stamps were about as anticipated. However, insurance operatio
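The loop in `sliding_window_chunking` only emits full windows, so trailing sentences that do not fill a complete 15-sentence window are left out, and a text shorter than one window yields no chunks. If that tail matters, one possible variant emits a final shorter chunk. This is a sketch, not the notebook's method: `chunk_sentences_keep_tail` is a hypothetical name and it works on plain strings rather than spaCy spans.

```python
def chunk_sentences_keep_tail(sentences, chunk_size=15, overlap_size=5):
    """Sliding-window chunking that also emits a final, possibly shorter, chunk."""
    if overlap_size >= chunk_size:
        raise ValueError("overlap_size must be smaller than chunk_size")
    stride = chunk_size - overlap_size
    chunks = []
    start = 0
    while start < len(sentences):
        end = min(start + chunk_size, len(sentences))
        chunks.append(" ".join(sentences[start:end]))
        if end == len(sentences):
            break  # the last sentence is covered, so stop here
        start += stride
    return chunks

sentences = [f"S{i}." for i in range(40)]
chunks = chunk_sentences_keep_tail(sentences)
print(len(chunks))
print(chunks[-1])
```

Switching to a variant like this would change the number of chunks, so the embeddings and the FAISS index would need to be regenerated afterwards.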

## Generate Embeddings

Embedding Model: bge-large-en

1. Large transformer architecture, generally more accurate on retrieval tasks than smaller models such as all-MiniLM-L6-v2.
2. High-dimensional (1024-dimensional) embeddings capture rich semantic information, which helps with financial text.
3. Works well with long-form data such as shareholder letters.
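One thing worth checking with this model: bge-large-en accepts at most 512 tokens per input and longer inputs are truncated before encoding, so a 15-sentence chunk made of long sentences may lose its tail. A rough sketch for flagging such chunks follows. `find_long_chunks` and the `count_tokens` callable are hypothetical, and the whitespace-based counter in the usage lines is only a stand-in that undercounts real subword tokens.

```python
def find_long_chunks(chunks, count_tokens, max_tokens=512):
    """Return (index, token_count) for every chunk longer than max_tokens."""
    long_chunks = []
    for i, chunk in enumerate(chunks):
        n_tokens = count_tokens(chunk["text"])
        if n_tokens > max_tokens:
            long_chunks.append((i, n_tokens))
    return long_chunks

# Stand-in counter (assumption): whitespace-separated words. Real tokenizers
# split words into subword pieces, so actual token counts are higher.
def count_words(text):
    return len(text.split())

example_chunks = [
    {"text": "short chunk", "year": 1977},
    {"text": "word " * 600, "year": 1978},
]
print(find_long_chunks(example_chunks, count_words))  # [(1, 600)]
```

With the model loaded, a closer count could come from the model's own tokenizer, for example `lambda t: len(model.tokenizer.encode(t))`, assuming the `tokenizer` attribute exposes a Hugging Face tokenizer for this model.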

In [None]:
# Load the SentenceTransformer model
model = SentenceTransformer("BAAI/bge-large-en")

# Read the chunks with metadata
with open("chunks_with_metadata.pkl", "rb") as f:
    chunks_with_metadata = pickle.load(f)

# Function to generate embeddings for the text
def generate_embeddings(data):
    embeddings = []
    
    for entry in data:
        text = entry['text']
        year = entry['year']
        
        # Generate embedding for the text
        embedding = model.encode(text)  # This returns a vector of fixed size
        embeddings.append({'year': year, 'embedding': embedding})
        
    return embeddings

# Generate embeddings for all the letters
embeddings_data = generate_embeddings(chunks_with_metadata)

# Optionally, save the embeddings using pickle
with open('embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings_data, f)  # Save the embeddings with metadata (e.g., year)
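Encoding one chunk per call works, but `SentenceTransformer.encode` also accepts a list of texts together with a `batch_size`, and `normalize_embeddings=True` returns unit-length vectors. A sketch under those assumptions follows. The function name `embed_chunks` is hypothetical, and the small stand-in model exists only so the snippet runs without downloading weights.

```python
def embed_chunks(model, chunks, batch_size=32):
    """Encode all chunk texts in batches and pair each vector with its year."""
    texts = [chunk["text"] for chunk in chunks]
    vectors = model.encode(texts, batch_size=batch_size, normalize_embeddings=True)
    return [
        {"year": chunk["year"], "embedding": vector}
        for chunk, vector in zip(chunks, vectors)
    ]

class FakeModel:
    """Stand-in that accepts the same encode() keywords.

    It returns a dummy 2-d vector per text and ignores normalization.
    """
    def encode(self, texts, batch_size=32, normalize_embeddings=False):
        return [[float(len(text)), 1.0] for text in texts]

demo = embed_chunks(
    FakeModel(),
    [{"text": "abc", "year": 1977}, {"text": "de", "year": 1978}],
)
print(demo)
```

In the notebook this would be called as `embed_chunks(model, chunks_with_metadata)`. If the resulting vectors differ from the stored ones, the FAISS index has to be rebuilt from the new embeddings.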

## Load Embeddings on FAISS

In [52]:
# Load the embeddings
with open("embeddings.pkl", "rb") as f:
    loaded_embeddings = pickle.load(f)
print("Embeddings loaded:", len(loaded_embeddings))

# Extract embeddings and metadata
embeddings = np.array([item["embedding"] for item in loaded_embeddings]).astype("float32")
years = [item["year"] for item in loaded_embeddings]  # Store metadata separately

# Create a flat (exact search) FAISS index that ranks by L2 distance; lower is closer
dimension = embeddings.shape[1]  # 1024 for BGE
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)  # Add all embeddings to FAISS

# Save FAISS index
faiss.write_index(index, "buffett_faiss.index")
print("FAISS index saved successfully!")

Embeddings loaded: 2656
FAISS index saved successfully!
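`IndexFlatL2` performs exact search and reports squared Euclidean distances, so smaller values mean closer matches. When the embeddings are unit length, squared L2 distance and cosine similarity produce the same ranking, because d^2 = 2 - 2*cos. A small pure-Python check of that identity, using made-up 2-d vectors rather than real embeddings:

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    # For unit-length vectors the dot product equals the cosine similarity
    return sum(x * y for x, y in zip(a, b))

query = normalize([3.0, 4.0])
doc_near = normalize([4.0, 3.0])
doc_far = normalize([-4.0, 3.0])

for name, doc in [("near", doc_near), ("far", doc_far)]:
    d2 = squared_l2(query, doc)
    cos = cosine(query, doc)
    print(name, round(d2, 4), round(cos, 4), math.isclose(d2, 2 - 2 * cos, abs_tol=1e-9))
```

The document with the higher cosine similarity is also the one with the smaller squared distance, so either metric would retrieve it first.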


## RAG

In [None]:
# Load the SentenceTransformer model, FAISS index
model = SentenceTransformer("BAAI/bge-large-en")
index = faiss.read_index("buffett_faiss.index")

# Load and extract the embeddings and chunks with metadata
with open("embeddings.pkl", "rb") as f:
    loaded_embeddings = pickle.load(f)
embeddings = np.array([item["embedding"] for item in loaded_embeddings]).astype("float32")
years = [item["year"] for item in loaded_embeddings]  # Store metadata separately

# Load the text chunks
with open("chunks_with_metadata.pkl", "rb") as f:
    chunks_with_metadata = pickle.load(f)
sentence_chunks = [item["text"] for item in chunks_with_metadata]



# Function to query the FAISS index
def query_faiss_index(query, k=5):
    # Convert query to embedding
    query_embedding = model.encode([query]).astype('float32')
    
    # Search the FAISS index for the k nearest neighbors
    distances, indices = index.search(query_embedding, k)
    
    # Retrieve the metadata (years) and relevant sentence chunks based on the indices
    retrieved_metadata = [years[i] for i in indices[0]]
    retrieved_chunks = [sentence_chunks[i] for i in indices[0]]  # Retrieve the actual document chunks
    
    return retrieved_metadata, retrieved_chunks, distances[0]



# Function to generate the prompt dynamically for chunked sentences
def generate_llm_prompt(retrieved_metadata, retrieved_chunks, distances, query):
    # Start with the query
    prompt = f"Given the following document chunks from Berkshire Hathaway's annual letters, answer the query:\n\nQuery: {query}\n\n"
    
    # Add an introductory explanation
    prompt += "The following document chunks are relevant to the query. Use them to answer the question based on Warren Buffett's investment strategies:\n\n"
    
    # Add each retrieved chunk to the prompt along with its distance to the query
    for i in range(len(retrieved_metadata)):
        year = retrieved_metadata[i]
        relevant_chunk = retrieved_chunks[i]  # The chunk text from the letter
        distance = distances[i]  # Squared L2 distance from IndexFlatL2; lower means more similar
        
        prompt += f"Year: {year} (Distance: {distance:.4f}, lower is closer)\nRelevant Document Chunk: {relevant_chunk}\n\n"
    
    # End the prompt with the question to the LLM
    prompt += "Based on these document chunks, explain Warren Buffett's investment strategies and how he might approach the current financial landscape based on these past letters."
    
    return prompt



# Full function for querying the system
def query_berkshire_bot(query, k=5):
    # Step 1: Query FAISS index
    retrieved_metadata, retrieved_chunks, distances = query_faiss_index(query, k)
    
    # Step 2: Craft the prompt for the LLM
    llm_prompt = generate_llm_prompt(retrieved_metadata, retrieved_chunks, distances, query)
    
    # Step 3: Send the prompt to the LLM
    # With no api_key argument, the client reads the OPENAI_API_KEY environment variable
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": llm_prompt}]
    )

    return response.choices[0].message.content

# Example usage
query = "What are Warren Buffett's views on cryptocurrency?"
response = query_berkshire_bot(query)
print("Query:", query)
print("LLM Response:", response)


Query: What are Warren Buffett's views on cryptocurrency?
LLM Response: Based on the document chunks provided, Warren Buffett's investment strategies emphasize owning businesses that deliver goods and services efficiently and consistently over time. He values productive assets over nonproductive or currency-based assets, believing that owning first-class businesses will be the best long-term investment choice.

Regarding cryptocurrency, based on the document chunks provided, there is no direct mention of Warren Buffett's views on cryptocurrency. However, his focus on tangible assets and productive businesses suggests that he may be cautious or skeptical about investing in cryptocurrency, which is a form of digital currency not backed by physical assets.

Warren Buffett's investment philosophy seems to prioritize long-term value creation and sustainable growth through ownership of quality businesses. He emphasizes the importance of market economics, the rule of law, and equality of oppo
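
The closing instruction in `generate_llm_prompt` always asks for a general explanation of Buffett's strategies, whatever the query is, which may be why the answer above opens with general strategy before turning to cryptocurrency. One possible alternative is a prompt that ends with the user's own question and tells the model to say so when the excerpts do not cover it. This is a sketch only: `build_grounded_prompt` is a hypothetical name, and the year and chunk text in the usage lines are placeholders, not real retrieval output.

```python
def build_grounded_prompt(query, retrieved_years, retrieved_chunks):
    """Build a prompt that ends with the user's own question."""
    parts = [
        "Answer the question using only the excerpts from Berkshire Hathaway's "
        "annual letters below. If the excerpts do not address the question, say so.",
        "",
    ]
    for year, chunk in zip(retrieved_years, retrieved_chunks):
        parts.append(f"[{year}] {chunk}")
        parts.append("")
    parts.append(f"Question: {query}")
    return "\n".join(parts)

prompt = build_grounded_prompt(
    "What are Warren Buffett's views on cryptocurrency?",
    [2011],  # placeholder year
    ["(retrieved chunk text would go here)"],  # placeholder text
)
print(prompt)
```

It could replace `generate_llm_prompt` inside `query_berkshire_bot`, with the retrieved years and chunks passed in unchanged.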