# RAG for 10-Q Reports
Use Pinecone's new Resin project to download, chunk, and upload vector embeddings to Pinecone. Feedback on the "dev" branch of context-engine (AKA Resin, installed below as `canopy-sdk`) is included in the markdown snippets.

**Notebook last tested on 9/21/2023**

## Key Technologies
1. Canopy
2. SEC Edgar API
3. Parquet/Pandas/Pyarrow
4. Beautiful Soup

# Step #1 - Install dependencies

In [None]:
!pip install -U sec-edgar-downloader pandas pyarrow bs4 openpyxl unstructured llama-hub networkx canopy-sdk python-dotenv

# Step #2 - Download 10-Q filings for a list of tickers

This code has nothing to do with context-engine. Downloading SEC filings is a bit of a pain. This notebook pulls down 10-Q filings for the S&P 500 tickers listed in `stock_universe.xlsx`, up to 10 filings per ticker.

In [None]:
from sec_edgar_downloader import Downloader
import pandas as pd
import time
import random

stocks_frame = pd.read_excel('stock_universe.xlsx', sheet_name="TDA_SCREEN")

ticker_list = stocks_frame['ticker'].values.tolist()

# Create the downloader once; the company name and email identify
# the requester to SEC EDGAR.
dl = Downloader(company_name="Pinecone",
                email_address="williamsj@pinecone.io")

for t in ticker_list:
    try:
        # download_details=True also saves primary-document.html for each filing
        dl.get("10-Q", t, limit=10, download_details=True)
        print(f"Downloaded 10Q for {t}")
    except Exception as e:
        print(f"Error downloading {t}")
        print(e)
    # Pause a fresh random 0.1-0.3 s between tickers to stay polite to EDGAR
    time.sleep(random.uniform(0.10, 0.30))
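
The loop above logs a failed ticker and moves on. If transient errors turn out to be common, one option is to retry with exponential backoff. The sketch below is a hypothetical helper (`with_retries` is not part of `sec_edgar_downloader`), and the attempt count and delays are assumed values, not tuned ones.

```python
import random
import time


def with_retries(fetch, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    Hypothetical helper, not part of sec_edgar_downloader. Waits
    base_delay * 2**n seconds plus up to 0.1 s of jitter between tries.
    """
    for n in range(attempts):
        try:
            return fetch()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** n + random.uniform(0, 0.1))


# Possible usage with the downloader from the cell above (commented out
# here because it needs network access):
# with_retries(lambda: dl.get("10-Q", "AAPL", limit=10, download_details=True))
```

The `sleep` parameter exists only so the wait can be replaced in tests; in the notebook the default `time.sleep` is what you would want.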

# Step #3 - Move nested html files into a source directory

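This code has nothing to do with context-engine either. `sec_edgar_downloader` saves each filing in its own directory under `./sec-edgar-filings/<ticker>/10-Q/`, so this step collects every `primary-document.html` into one flat directory, renamed to `<ticker>-<n>.html`.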

In [None]:
import os
import shutil

dest_data_dir = "./10Q-raw"

def find_files(base_path, file_name):
    file_list = []
    for root, dirs, files in os.walk(base_path):
        if file_name in files:
            file_list.append(os.path.join(root, file_name))
    return file_list

def move_file(src, dst):
    if src:
        shutil.move(src, dst)
    else:
        print("File not found.")

# Flatten each ticker's nested filings into dest_data_dir as <ticker>-<n>.html
os.makedirs(dest_data_dir, exist_ok=True)
for t in ticker_list:
    base_path = f"./sec-edgar-filings/{t}/10-Q"
    file_name = "primary-document.html"
    file_paths = find_files(base_path, file_name)
    for cnt, fp in enumerate(file_paths, start=1):
        move_file(fp, f"{dest_data_dir}/{t}-{cnt}.html")

    print(f"Moved {len(file_paths)} {t} files to {dest_data_dir}")

# Uncomment to delete the leftover download tree once the files are moved
#shutil.rmtree(f"./sec-edgar-filings")
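
Because `os.walk` does not guarantee an order, the `<n>` suffix assigned above can differ between runs. As a rough sketch, the move list can be planned first with sorted filing directories, which makes the numbering repeatable and allows a dry run before anything is moved. `plan_moves` is a hypothetical helper, and the filing directory names in the demo are made up.

```python
import tempfile
from pathlib import Path


def plan_moves(download_root, dest_dir, form="10-Q",
               file_name="primary-document.html"):
    """Return (source, destination) path pairs without moving anything.

    Assumes the layout <download_root>/<ticker>/<form>/<filing>/<file_name>.
    """
    moves = []
    for ticker_dir in sorted(Path(download_root).iterdir()):
        form_dir = ticker_dir / form
        if not form_dir.is_dir():
            continue  # stray file, or a ticker with no filings of this form
        # Sort the filings so the <n> suffix is the same on every run
        sources = sorted(form_dir.glob(f"*/{file_name}"))
        for n, src in enumerate(sources, start=1):
            moves.append((src, Path(dest_dir) / f"{ticker_dir.name}-{n}.html"))
    return moves


# Dry run against a throwaway tree with two made-up filing directories
with tempfile.TemporaryDirectory() as tmp:
    for filing in ("0001-23-000002", "0001-23-000001"):
        d = Path(tmp, "AAPL", "10-Q", filing)
        d.mkdir(parents=True)
        (d / "primary-document.html").write_text("<html></html>")
    for src, dst in plan_moves(tmp, "./10Q-raw"):
        print(src.parent.name, "->", dst)
```

The planned pairs can then be passed to `shutil.move` one at a time. Sorting by directory name makes the numbering deterministic, but it is not guaranteed to be chronological.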

# Step #4 - Generate parquet files from HTML source files

1. Use `UnstructuredReader` from llama_hub to parse the HTML into text
1. Extract the report period end date from the 10-Q text
1. Add metadata for canopy and the end-user interface
1. Save the metadata and report text to parquet files

In [None]:
from pathlib import Path
from llama_hub.file.unstructured import UnstructuredReader
import os
import re
from dateutil import parser
import pandas as pd

dest_data_dir = "./10Q-raw"
parquet_data_dir = "./10Q-parquet"
os.makedirs(parquet_data_dir, exist_ok=True)

# Matches dates such as "June 30, 2023" for any year from 2010 through 2029
pattern = re.compile(r'(\w+\s+\d+,\s+20[1-2][0-9])')

def get_file_list(directory):
    """Get a list of files in the given directory"""
    return [f for f in os.listdir(directory) if os.path.isfile(os.path.join(directory, f))]

def get_quarter_report_period(doc_text):
    """Return the first date found in the text as YYYY-MM-DD, or a placeholder."""
    match = pattern.search(doc_text)
    if match is None:
        return "????-??-??"
    try:
        return parser.parse(match.group(1), fuzzy=True).strftime("%Y-%m-%d")
    except (ValueError, OverflowError):
        # The match looked like a date but could not be parsed
        return "????-??-??"

def write_to_parquet(doc_id, text, source, metadata):
    """Write one document as a single-row parquet file named after its id."""
    parquet_path = os.path.join(parquet_data_dir, f"{doc_id}.parquet")
    data = {
        'id': [doc_id],
        'text': [text],
        'source': [source],
        'metadata': [metadata]
    }
    df = pd.DataFrame(data)
    df.to_parquet(path=parquet_path, engine='pyarrow')

loader = UnstructuredReader()
documents = get_file_list(dest_data_dir)
for d in documents:
    # load_data returns a list of documents; only the first is used here
    doc = loader.load_data(file=Path(f"{dest_data_dir}/{d}"))[0]
    doc_text = doc.text
    quarter_period = get_quarter_report_period(doc_text)
    # File names look like <ticker>-<n>.html; split on the last hyphen so
    # tickers that contain a hyphen stay intact
    ticker = d.rsplit("-", 1)[0]
    doc_id = doc.doc_id
    source = f"{ticker} - Form 10-Q for the quarterly period ended {quarter_period}"
    metadata = {'ticker': ticker,
                'quarter_period_end': quarter_period,
                'doc_type': '10-Q',
                'source_api': 'sec_edgar_downloader'}
    write_to_parquet(doc_id, doc_text, source, metadata)
    print(f"id is: {doc_id}, source: {source}, metadata: {metadata} written to parquet")
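
To make the date-extraction step concrete, here is a small standard-library sketch of the same idea. It is a variation, not the notebook's exact method: it uses `datetime.strptime` instead of `dateutil`, and it moves on to the next match when a candidate such as "Note 12, 2023" is not a real date. `DATE_RE`, `first_date`, and the sample text are made up for illustration.

```python
import re
from datetime import datetime

# Same shape as the notebook's pattern: "<word> <day>, <year>" for 2010-2029
DATE_RE = re.compile(r'(\w+\s+\d+,\s+20[1-2][0-9])')


def first_date(text):
    """Return the first parseable date in text as YYYY-MM-DD, else None."""
    for match in DATE_RE.finditer(text):
        # Collapse runs of whitespace so strptime sees "June 30, 2023"
        candidate = " ".join(match.group(1).split())
        try:
            return datetime.strptime(candidate, "%B %d, %Y").strftime("%Y-%m-%d")
        except ValueError:
            continue  # e.g. "Note 12, 2023" does not start with a month name
    return None


sample = ("FORM 10-Q\n"
          "For the quarterly period ended July 1, 2023\n"
          "Commission File Number 001-00000")
print(first_date(sample))  # prints 2023-07-01
```

Like the notebook's function, this takes the first date in the document and treats it as the period end date, which holds only when the cover page states the period before any other date.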
    


# Step #5 - Run context-engine to upsert parquet files


In [None]:
%load_ext dotenv
%dotenv ./.env
!canopy new <<< 'y'

In [None]:
%load_ext dotenv
%dotenv ./.env
!canopy upsert ./10Q-parquet <<< 'y'

# Step #6 - OPTIONAL - Delete raw 10-Q HTML and Parquet files

Self-explanatory. This step is external to context-engine, but it would be nice if we provided:

1. A way to specify an S3/GCS bucket as an upsert source
1. Automatic conversion of text files to parquet format. It will demo well but may not be useful for real-world use. Metadata handling/config seems to be the biggest issue.

It would be nice if the user could specify a metadata mapping like the following:

S3 Bucket Path
```
--bucket YOUR_BUCKET_NAME --key year/sec_filing_type/exchange/ticker
```

Local File Path
```
${DOC_BASE}/year/sec_filing_type/exchange/ticker
```

| year | sec_filing_type | exchange | ticker |
|------|-----------------|----------|--------|
| 2023 | 10-Q            | SPY      | AAPL   | 
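
As a rough illustration of the proposed mapping, the sketch below turns a path into a metadata dictionary by lining up its trailing directory names with the fields of a template. Everything here is hypothetical: `metadata_from_path` and the template syntax do not exist in context-engine, and the sketch assumes each document sits directly inside the `ticker` directory. It works the same way on an S3 key or a local POSIX-style path.

```python
from pathlib import PurePosixPath


def metadata_from_path(path, template="year/sec_filing_type/exchange/ticker"):
    """Map the trailing directory names of path onto the template's fields.

    Hypothetical helper: neither this function nor the template syntax
    exists in context-engine.
    """
    fields = template.split("/")
    # Directory components only, ignoring a leading "/" on absolute paths
    dirs = [p for p in PurePosixPath(path).parent.parts if p != "/"]
    if len(dirs) < len(fields):
        raise ValueError(f"{path!r} has fewer than {len(fields)} directory levels")
    return dict(zip(fields, dirs[-len(fields):]))


print(metadata_from_path("docs/2023/10-Q/SPY/AAPL/report.html"))
# prints {'year': '2023', 'sec_filing_type': '10-Q', 'exchange': 'SPY', 'ticker': 'AAPL'}
```

Matching from the end of the path means the base directory (`${DOC_BASE}` or a bucket prefix) can be any depth without changing the template.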




In [None]:
import shutil

shutil.rmtree("./10Q-parquet")
shutil.rmtree("./10Q-raw")