## Gigadataset Database and Ingestion Guide

This guide covers the database, the data flow, how to run ingestion, how embeddings work, and how others (for example, a partner) can connect to and use the database, even when it runs locally in Docker.

---

### What’s in the stack

- Database: PostgreSQL + `pgvector` (Docker; extension initialized by `db/init/01-init-pgvector.sql`)
- ORM: SQLAlchemy (`database/schema.py` models)
- Data ingestion: `ingest.py`, `scheduler.py`, `r2k_ingest.py`, and `ingestion_runner.ipynb`
- Text chunking + embeddings: `TextChunks` table with 384‑d vectors; embeddings via Transformers or Modal
- External clients:
  - Alpaca (equities, options, news)
  - Finnhub (equities candles)
  - FRED (macro)
  - SEC EDGAR (filings metadata + content)

---

## Connect to the database

You can connect from psql, pgAdmin, or any client that supports PostgreSQL.

- Default env (from `.env`):
  - `DB_HOST=localhost`
  - `DB_PORT=5432`
  - `DB_NAME=trading_agent`
  - `DB_USER=postgres`
  - `DB_PASSWORD=postgres`

- psql:
```bash
psql "postgresql://$DB_USER:$DB_PASSWORD@$DB_HOST:$DB_PORT/$DB_NAME"
```

- SQLAlchemy (Python):
```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://postgres:postgres@localhost:5432/trading_agent")
with engine.connect() as conn:
    # SQLAlchemy 2.x requires raw SQL strings to be wrapped in text()
    version = conn.execute(text("select version()")).scalar()
    print(version)
```

- pgAdmin (if running in Docker Compose):
  - Open pgAdmin in your browser (Compose typically maps it to a local port, e.g., 5050).
  - Add a new server → host: `host.docker.internal` or your machine's IP (from inside the pgAdmin container, `localhost` refers to the container itself, not your machine), port: `5432`, user/password from `.env`.
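
To avoid hardcoding credentials, the connection URL can be assembled from the `DB_*` variables. The following is a minimal sketch, not code from the repository: `build_db_url` is a hypothetical helper, and the fallback values are simply the defaults listed above.

```python
import os
from urllib.parse import quote


def build_db_url(env=None):
    """Assemble a PostgreSQL URL from DB_* variables (hypothetical helper)."""
    env = os.environ if env is None else env
    # Percent-encode user and password so characters such as '@' or '/'
    # cannot be mistaken for URL delimiters.
    user = quote(env.get("DB_USER", "postgres"), safe="")
    password = quote(env.get("DB_PASSWORD", "postgres"), safe="")
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")
    name = env.get("DB_NAME", "trading_agent")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"
```

The result can be passed directly to `create_engine(build_db_url())`. Passing a plain dict instead of relying on `os.environ` makes the helper easy to test.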

### If your partner is on another machine

Your partner can connect to your local Docker Postgres using one of these approaches:

- SSH tunnel (recommended):
  - On partner machine:
  ```bash
  ssh -N -L 5433:localhost:5432 your_user@your_host_or_ip
  ```
  - Then connect to `postgresql://postgres:postgres@localhost:5433/trading_agent`.

- Tailscale / VPN:
  - Install Tailscale on both machines. Your partner then uses the tailnet IP of the machine running Postgres as `DB_HOST`.

- Port forwarding on your router (not recommended for production). If you use it, restrict inbound connections to known IPs with a firewall.
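
Whichever approach is used, it helps to confirm that the port is reachable before debugging credentials. Below is a small sketch using only the standard library; `port_is_open` is a hypothetical helper, and port `5433` in the usage note assumes the SSH tunnel command shown above.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

With the SSH tunnel running, `port_is_open("localhost", 5433)` should return `True`. This only shows that something accepts TCP connections on that port; it does not verify that it is Postgres or that the credentials work.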

Security tips:
- Create a limited DB role for partners:
```sql
-- 'strongpass' is a placeholder; choose a real password
create role readonly login password 'strongpass';
grant connect on database trading_agent to readonly;
grant usage on schema public to readonly;
grant select on all tables in schema public to readonly;
-- default privileges apply only to tables later created by the role that runs this statement
alter default privileges in schema public grant select on tables to readonly;
```

---

## Environment variables (core)

- DB: `DB_HOST`, `DB_PORT`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`
- Alpaca: `APCA_API_KEY_ID`, `APCA_API_SECRET_KEY`, optional `APCA_API_BASE_URL`, `APCA_DATA_URL`
- Finnhub: `FINNHUB_API_KEY`
- SEC (polite UA): `SEC_USER_AGENT="Gigadataset Ingestor (you@domain.com)"`
- Embedding:
  - `USE_MODAL_EMBEDDING=1` to enable Modal
  - optional `EMBED_MODEL_384` (default `sentence-transformers/all-MiniLM-L6-v2`)

These variables are loaded from `.env`; the code uses `python-dotenv`.
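
A quick check for unset variables can save a failed ingestion run. This sketch is illustrative only: `REQUIRED_VARS` and `missing_vars` are hypothetical names, and treating every variable in the list as mandatory is an assumption (the ingestion scripts may tolerate some being absent).

```python
import os

# Names come from the list above; which ones are truly mandatory is an assumption.
REQUIRED_VARS = [
    "DB_HOST", "DB_PORT", "DB_NAME", "DB_USER", "DB_PASSWORD",
    "APCA_API_KEY_ID", "APCA_API_SECRET_KEY",
    "FINNHUB_API_KEY", "SEC_USER_AGENT",
]


def missing_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

Calling `missing_vars()` after the `.env` file has been loaded returns an empty list when everything is set, or the names that still need values.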

---

## Database schema (key tables)

From `database/schema.py`:

- `market_data`
  - OHLCV bars for equities/crypto (mainly Alpaca, Finnhub)
  - Columns: `symbol`, `timestamp`, `open/high/low/close`, `volume`, etc.

- `sec_filings`
  - Recent filings metadata (10‑K/10‑Q/8‑K …)
  - Columns: `cik`, `ticker`, `form_type`, `filing_date`, `accession_number` (unique), `primary_document`

- `macro_data`
  - FRED series data (`series_id`, `date`, `value`, …)

- `crypto_data`
  - Binance klines (if used)

- `options_contracts`, `options_bars`