## SEC Insights

## What is it?
* Chat application
* RAG technique
* Answers questions about SEC 10-K and 10-Q documents
* Production-ready
* Full-stack repo
* Ready for you to fork

## Try it out
[secinsights.ai](https://secinsights.ai)

## Product Features
* QA chat grounded in source-of-truth SEC documents
* PDF viewer
* Token-level streaming of chat responses
* Streaming of reasoning steps (sub-questions)
* Citation of source data
* Use of API-based tools (in addition to semantic search)

![Architecture diagram](images/architecture-v1.png)

## Architecture
* Backend
    * Render.com hosts most of the backend.
        * Similar to AWS but easier to use.
    * FastAPI backend service.
        * Traditional load-balanced API service.
        * Requests are load-balanced across service instances.
        * Auto-scaling.
    * Postgres 15 database.
    * Cron job service.
    * All of the above is defined for us in the `render.yaml` file.

* AWS S3: **you will have to set this up yourself**
    * Private StorageContext bucket: metadata from the LlamaIndex library
    * Public PDF Bucket

* Frontend
    * Next.js
    * Hosted on Vercel
    * Interacts with the FastAPI backend for some of the chat endpoints

* External services:
    * Polygon.io (financial data API for numeric data). A good example of how to integrate tools into your chat agent.
    * SEC's EDGAR API
    * OpenAI service (LLM)
        * The cron service is the main worker that calls the OpenAI Embedding API to get the embeddings for the given SEC documents.
        * The cron service calls the SEC's EDGAR API, gets the PDFs, stores them in the AWS public PDF bucket, computes embeddings for them, and stores the embeddings in the Postgres database (we use the PG Vectorstore integration).
        * The cron job runs on whatever schedule you set in the `render.yaml` file.
    * Sentry.io (production-level monitoring service). It alerts you whenever there is an error in the backend service or an error threshold is exceeded. It also supports performance monitoring, which shows which sections of your code take the most time.

All the setup is open source and is easy to deploy on Vercel and Render.com

## Dev Environment

* It's recommended to use the GitHub Codespaces config included in the `devcontainer.json` file.

* Create a GitHub Codespace.
* `cd frontend`
* `ls`
* This is a basic Vercel app.
* `npm install`
* Load the environment variables defined in the `.env.example` file by sourcing it:
    * `set -a`
    * `source .env.example`
* The backend URL in that file is the local one (`localhost:8000`); it will have to be changed when we use the backend in the cloud.
* `npm run dev`
* This starts the app at `localhost:3000`.
* It comes with live reload, so any edit to a UI file shows up immediately. For example, you can edit the title in `components/landing-page/TitleAndDropdown.tsx`.
* With the app running in one terminal, open a second terminal and `cd backend`.
* This is a FastAPI Python backend app. Most of it is based on the templates FastAPI offers.

## Backend

* In the GitHub repo, go to `backend/` and read the README file.
* `cd backend`
* No need to install pyenv or Docker if you are running from the devcontainer image in GitHub Codespaces.
* `cat .python-version` confirms you have 3.11.3.
* `poetry shell`

The command `poetry shell` is used in the context of Python programming, specifically when managing Python projects with the tool Poetry. Poetry is a tool for dependency management and packaging in Python, allowing developers to declare, manage, and install dependencies of Python projects.

Here's what `poetry shell` does:

1. **Activates the Virtual Environment**: When you run `poetry shell`, it activates the project's virtual environment. A virtual environment is a self-contained directory that contains a Python installation for a particular version of Python, along with a number of additional packages.

2. **Isolation of Project Dependencies**: This isolation ensures that the dependencies of the project do not interfere with the system-wide Python installation or other Python projects. It's a key practice in Python development to avoid dependency conflicts and maintain project consistency.

3. **Interactive Shell**: Once the virtual environment is activated, you are placed into an interactive shell (like bash or Command Prompt) that is configured to use the project's Python interpreter. This means any Python commands you run in this shell will use the project's Python version and have access to its dependencies.

4. **Convenience for Development**: The `poetry shell` command is convenient for development purposes. You can run Python scripts, start a Python interactive session, or use command-line tools that are part of your project's dependencies without needing to manually activate the virtual environment or adjust your system's `PATH`.

5. **Temporary Activation**: The activation of the virtual environment using `poetry shell` is temporary. Once you exit the shell, the environment is deactivated, and your terminal returns to its previous state.

In summary, `poetry shell` is a command to activate the virtual environment associated with your Poetry-managed Python project, providing an isolated and consistent development environment for that project.
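To confirm from inside Python that the shell really is using a virtual environment, a minimal check is sketched below. It relies only on the standard library and assumes the environment was created by `venv` or a recent `virtualenv` (which is what Poetry typically uses); the function name `in_virtualenv` is made up for this example.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a virtual environment.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

Run it once before and once after `poetry shell`; the reported interpreter path should change to one inside the project's environment.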

* The GitHub Codespaces terminal now shows `(llama-app-backend-py3.11)` as the active virtual environment.
* `poetry install`
* Now you have to create the `backend/.env` file:
    * `cp .env.development .env`
    * `set -a`
    * `source .env`

The command `set -a` in a Unix/Linux shell environment is used to change the behavior of the shell with respect to how it handles variables and their visibility (exporting) to child processes. Here's what it does:

1. **Auto-Export Variables**: When you use `set -a`, any variable that you subsequently define or modify in your shell session will be automatically exported. This means that these variables become environment variables and are inherited by any child processes or sub-shells spawned from your shell.

2. **Child Process Inheritance**: Normally, when you create a variable in a shell, it's only known to that particular shell session. Child processes or scripts invoked from that shell don't have access to those variables unless they are explicitly exported using the `export` command. However, with `set -a`, this export is implicit for all variables set after the command.

3. **Use Cases**: This command is particularly useful in scripts where you need to ensure that all defined variables are available to sub-processes without having to explicitly export each one. It's often used in startup scripts or in scripts that configure environment variables for a particular application or service.

4. **Reversing the Effect**: If you want to revert to the normal behavior where variables are not automatically exported, you can use `set +a`. This command will stop the automatic export of variables defined after it.

5. **Scope of Effect**: It's important to note that the effect of `set -a` is limited to the current shell session or script in which it is run. It does not affect other shell sessions or globally change the behavior of the shell.

In summary, `set -a` is a shell command used to automatically export all variables set in the current shell session, making them available to any child processes. This can be useful in scripting scenarios where environment variable propagation is desired.
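The difference is easy to observe with a small experiment. The sketch below assumes a POSIX `sh` is available on `PATH`; `DEMO_VAR` and the helper `run_shell` are made-up names for illustration. It assigns a variable in a parent shell, then asks a child shell whether it can see it, once without and once with `set -a`.

```python
import os
import subprocess

# Child command: a new shell that prints the variable if it inherited it.
CHILD = """sh -c 'echo "${DEMO_VAR:-unset}"'"""


def run_shell(script: str) -> str:
    """Run `script` in a fresh POSIX sh with a minimal environment; return stdout."""
    result = subprocess.run(
        ["sh", "-c", script],
        capture_output=True,
        text=True,
        check=True,
        env={"PATH": os.environ.get("PATH", "/usr/bin:/bin")},
    )
    return result.stdout.strip()


# Plain assignment: the variable stays local to the parent shell.
without_set_a = run_shell(f"DEMO_VAR=hello; {CHILD}")

# After `set -a`, the same assignment is exported automatically.
with_set_a = run_shell(f"set -a; DEMO_VAR=hello; {CHILD}")

print("without set -a:", without_set_a)
print("with set -a:   ", with_set_a)
```

This is why the steps pair `set -a` with `source`: the variables loaded from the file need to reach child processes such as the server.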

* source .env

The command `source .env` in a Unix/Linux shell is used to execute the contents of a file (in this case, a file named `.env`) in the current shell session. Here's what this command specifically does:

1. **Loads Environment Variables**: The `.env` file typically contains environment variables. These are often key-value pairs that are used to configure the behavior of an application or script. By using `source .env`, you are effectively loading these variables into your current shell environment.

2. **Executes in Current Shell**: The `source` command (which can also be represented as a dot `.` in some shells) executes the file in the context of the current shell, rather than starting a new shell to run the script. This means any changes made to the environment, such as setting variables, changing directories, etc., will persist in the current shell after the script completes.

3. **Use in Application Configuration**: This is a common practice in application development, especially in web development, where `.env` files are used to set configuration variables that should not be hard-coded into the application, such as database passwords, API keys, and other sensitive information.

4. **Security Note**: It's important to be cautious with `.env` files, especially regarding sensitive information. These files should not be included in version control (like Git) if they contain sensitive data.

5. **Portability and Convenience**: This approach allows for easy customization of application behavior in different environments (like development, testing, production) by simply changing the contents of the `.env` file, rather than altering the application code.

In summary, `source .env` is used to execute the contents of the `.env` file in the current shell, typically for the purpose of setting environment variables that configure the behavior of an application or script. This allows for a flexible and secure way to manage configuration settings.
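For intuition, the sketch below reads simple `KEY=VALUE` lines into a dictionary. It is a deliberately simplified stand-in and not what the shell does: `source` runs the file as shell code, so real files may use expansions or commands that this parser ignores. The function name `parse_env_text` and the sample values are made up for this example.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment, ignore it
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


# Placeholder values, not the project's real file contents.
SAMPLE = """
# comment lines are skipped
OPENAI_API_KEY=sk-placeholder
SEC_EDGAR_COMPANY_NAME="Example Co"
"""

parsed = parse_env_text(SAMPLE)
print(parsed)
```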

## Start the backend server

* `make migrate` (runs the database migrations)
* `make run` (starts the server locally)
    * This spins up the Postgres 15 DB & Localstack in their own Docker containers.
    * The server does not run in a container; it runs directly on your OS.
    * This allows the use of debugging tools like `pdb`.

## Enter your OpenAI API Key in the .env file

* Open your `.env` file and replace the placeholder value for `OPENAI_API_KEY` with your own OpenAI API key.
* At some point you will want to do the same for the other secret keys in there, like `POLYGON_IO_API_KEY`, `AWS_KEY`, & `AWS_SECRET`.
* To follow the SEC's Internet Security Policy, make sure to also replace the `SEC_EDGAR_COMPANY_NAME` & `SEC_EDGAR_EMAIL` values in the `.env` file with your own values.
* Source the file again with `set -a`, then `source .env`.
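After re-sourcing, a quick sanity check can catch variables that never made it into the environment. This is a minimal sketch: the variable names come from the steps above, while `REQUIRED_VARS` and `missing_vars` are hypothetical helpers. It only detects unset or empty values and cannot tell whether a placeholder was left in place.

```python
import os
from collections.abc import Mapping

# Variable names taken from the steps above.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "POLYGON_IO_API_KEY",
    "AWS_KEY",
    "AWS_SECRET",
    "SEC_EDGAR_COMPANY_NAME",
    "SEC_EDGAR_EMAIL",
)


def missing_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All required variables are set.")
```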

## Populate your local database with some sample SEC filings

* Run `make seed_db_local`
    * If this step fails, you may find it helpful to run `make refresh_db` to wipe your local database and restart with emptied tables.
* Done 🏁! You can run `make run` again and you should see some documents loaded at [http://localhost:8000/api/document](http://localhost:8000/api/document)
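The same check can be scripted instead of done in the browser. The sketch below assumes the endpoint returns a JSON array with one entry per document; that response shape is an assumption, not something the steps above confirm. The helper names are hypothetical, and the server must already be running.

```python
import json
import urllib.request

DOCUMENTS_URL = "http://localhost:8000/api/document"  # URL from the step above


def count_documents(body: str) -> int:
    """Count entries in a response body, assuming it is a JSON array."""
    payload = json.loads(body)
    if not isinstance(payload, list):
        raise ValueError("expected a JSON array of documents")
    return len(payload)


def fetch_document_count(url: str = DOCUMENTS_URL, timeout: float = 5.0) -> int:
    """GET the endpoint and return how many documents it lists."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return count_documents(response.read().decode("utf-8"))


if __name__ == "__main__":
    print("documents loaded:", fetch_document_count())
```

A count of zero after seeding suggests the seed step did not complete.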

## For any issues
* For any issues in setting up the above or during the rest of your development, you can check for solutions in the following places:

    * [backend/troubleshooting.md](https://github.com/run-llama/sec-insights/blob/main/backend/troubleshooting.md)
    * [Open & already closed Github Issues](https://github.com/run-llama/sec-insights/issues?q=is%3Aissue+is%3Aclosed)
    * [The #sec-insights discord channel](https://discord.com/channels/1059199217496772688/1150942525968879636)

## SEC Document Downloader

We have a script to easily download SEC 10-K & 10-Q files! This is a single step of the larger seed script described in the next section. Unless you have some use for running just this step on its own, you probably want to stick to the seed script described in the section below. However, the setup instructions for this script are a prerequisite for running the seed script.

No API keys are needed to use this; it calls the SEC's free-to-use EDGAR API.

The instructions below explain how to use the script to download the SEC filings, convert them to PDFs, and store them in an S3 bucket.

## Setup / Usage Instructions
Pre-requisite setup steps to use the downloader script to load the SEC PDFs directly into an S3 bucket.

These steps assume you've already followed the steps above for setting up your dev workspace!

#### Setup AWS CLI
* Install AWS CLI
    * This step can be skipped if you're running from the devcontainer image in Github Codespaces
    * Steps:
        * curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
        * unzip awscliv2.zip
        * sudo ./aws/install
* Configure AWS CLI
    * This is mainly to set the AWS credentials that will later be used by s3fs
    * Run `aws configure` and enter the access key & secret key for an AWS IAM user that has access to the S3 bucket where you want to store the SEC files.
        * Set the default AWS region to `us-east-1` (what we're primarily using).

#### Setup s3fs
* Install s3fs
    * This step can be skipped if you're running from the devcontainer image in Github Codespaces
    * sudo apt install s3fs
* Setup a s3fs mounted folder
    * Create the mounted folder locally: `mkdir ~/mounted_folder`
    * Mount the bucket: `s3fs llama-app-web-assets-preview ~/mounted_folder`
        * You can replace `llama-app-web-assets-preview` with the name of the S3 bucket you want to upload the files to.

#### Install wkhtmltopdf
* This step can be skipped if you're running from the devcontainer image in Github Codespaces
* Steps:
    * sudo apt-get update
    * sudo apt-get install wkhtmltopdf

#### Get into your Poetry shell with `poetry shell` from the project's root directory.

#### Run the script! 
* python scripts/download_sec_pdf.py -o ~/mounted_folder --file-types="['10-Q','10-K']"
* Take a 🚽 break while it's running, it'll take a while!

#### Go to AWS Console and verify you're seeing the SEC files in the S3 bucket.

## Seed DB Script

There is a collection of scripts for seeding the database with a set of documents. The script in `scripts/seed_db.py` is an attempt at consolidating those disparate scripts into one unified command.

This script will:

* Download a set of SEC 10-K & 10-Q documents to a local temp directory
* Upload those SEC documents to the S3 folder specified by $S3_ASSET_BUCKET_NAME
* Crawl through all the PDF files in the S3 folder and upsert a database row into the Document table based on the path of the file within the bucket
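The upsert in the last step is what makes the script safe to re-run: crawling the same file twice updates its row instead of duplicating it. The sketch below illustrates that idea with the standard library's in-memory SQLite as a stand-in for Postgres. The table, its columns, and the file paths are hypothetical and do not reflect the project's actual schema.

```python
import sqlite3

# Hypothetical table and column names; the real project uses Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document (url TEXT PRIMARY KEY, updated_count INTEGER NOT NULL)"
)


def upsert_document(connection: sqlite3.Connection, url: str) -> None:
    """Insert a row for `url`, or bump its counter if the row already exists."""
    connection.execute(
        """
        INSERT INTO document (url, updated_count) VALUES (?, 0)
        ON CONFLICT(url) DO UPDATE SET updated_count = updated_count + 1
        """,
        (url,),
    )


# Crawling the same paths twice must not create duplicate rows.
paths = ["sec/AMZN/10-K.pdf", "sec/META/10-Q.pdf"]  # made-up example paths
for _ in range(2):
    for path in paths:
        upsert_document(conn, path)

row_count = conn.execute("SELECT COUNT(*) FROM document").fetchone()[0]
update_counts = conn.execute(
    "SELECT updated_count FROM document ORDER BY url"
).fetchall()
print(row_count, update_counts)
```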

#### Use Cases
This is useful for times when:

* You want to set up a local environment with your local Postgres DB to have a set of documents in the documents table
    * When running locally, this will use localstack to store the documents in a local S3 bucket instead of a real one.

* You want to update the documents present in either Prod or Preview DBs
    * In fact, this is the very script that is run by the `llama-app-cron` cron job service that gets set up by the `render.yaml` blueprint when deploying this service to Render.com.

#### Usage
To run the script, make sure you've:

* Activated your Python virtual environment using poetry shell
* Installed all the pre-requisite dependencies for the SEC Document Downloader script.
* Defined all the environment variables from .env.development within your shell environment according to the environment you want to execute the seed script (e.g. local, preview, prod environments)

After that you can run `python scripts/seed_db.py` to start the seed process.

To make things easier, the Makefile has some shorthand commands.

* make seed_db
    * Just runs the seed_db.py script with no CLI args, so just based on what env vars you've set

* make seed_db_preview
    * Same as make seed_db but only loads SEC documents from Amazon & Meta
    * We don't need to load that many company documents for Preview environments.

* make seed_db_local
    * To be used for local database seeding
    * Runs seed_db.py just for `$AMZN & $META` documents
    * Sets up the localstack bucket to actually serve the documents locally as well, so you can load them in your local browser.

* make seed_db_based_on_env
    * Automatically calls one of the above shorthands based on the RENDER & IS_PREVIEW_ENV environment variables