# Database Pipeline

Load embeddings from JSON and store them in PostgreSQL with pgvector support.

**Features:**
- Load embeddings from `chunks_primitive_embedded.json`
- Create the `chunks` table with the pgvector extension enabled
- Insert embeddings into a column indexed with IVFFlat (cosine distance)
- Query similar chunks by vector similarity
- Database statistics and management

## Database Setup

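Before using this module, create the database and enable pgvector inside it. Extensions are installed per database, so `CREATE EXTENSION` must run while connected to `marigold_rag`, not to the database you created it from:

```sql
-- Create the database (run while connected to another database, e.g. postgres)
CREATE DATABASE marigold_rag;

-- Switch to the new database (\c is a psql meta-command), then enable pgvector there
\c marigold_rag
CREATE EXTENSION IF NOT EXISTS vector;
```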

## Implementation

In [6]:
import json
from dataclasses import dataclass

import psycopg2


@dataclass
class DB:
    """Thin wrapper around a psycopg2 connection to the chunks table."""
    host: str = "localhost"
    port: int = 5432
    database: str = "marigold_rag"
    user: str = "postgres"
    password: str = "postgres"

    def __post_init__(self):
        # Connect as soon as the object is created.
        self.connection = psycopg2.connect(
            host=self.host, port=self.port, dbname=self.database,
            user=self.user, password=self.password,
        )
        self.cursor = self.connection.cursor()

    def setup(self):
        """Enable pgvector and create the table and index if they are missing."""
        self.cursor.execute("CREATE EXTENSION IF NOT EXISTS vector")
        self.cursor.execute(
            "CREATE TABLE IF NOT EXISTS chunks ("
            "id SERIAL PRIMARY KEY, component VARCHAR(255), "
            "section_title VARCHAR(500), section_path TEXT, content TEXT, "
            "demo_files TEXT, images TEXT, token_count INT, "
            "embedding vector(768))"
        )
        # IVFFlat index for cosine distance; note it is built while the table is still empty.
        self.cursor.execute(
            "CREATE INDEX IF NOT EXISTS chunks_idx ON chunks "
            "USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)"
        )
        self.connection.commit()

    def insert(self, **chunk):
        """Insert one chunk; list fields are stored as JSON text."""
        # pgvector accepts a text literal of the form "[x1,x2,...]".
        emb = "[" + ",".join(str(x) for x in chunk["embedding"]) + "]"
        self.cursor.execute(
            "INSERT INTO chunks (component, section_title, section_path, content, "
            "embedding, demo_files, images, token_count) "
            "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)",
            (chunk.get("component"), chunk.get("section_title"),
             chunk.get("section_path"), chunk.get("content"), emb,
             json.dumps(chunk.get("demo_files", [])),
             json.dumps(chunk.get("images", [])), chunk.get("token_count", 0)),
        )
        self.connection.commit()  # one commit per row

    def close(self):
        self.cursor.close()
        self.connection.close()
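
The feature list mentions querying similar chunks, but the `DB` class defines no query method. The following is a minimal sketch of one way to do it, not the author's implementation. It assumes the `chunks` table created in `setup()` and uses pgvector's `<=>` cosine-distance operator, which matches the `vector_cosine_ops` index. The names `to_vector_literal` and `find_similar` are hypothetical helpers introduced here for illustration.

```python
from typing import Any, List, Sequence, Tuple


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format a sequence of numbers as a pgvector text literal, e.g. "[0.1,0.2]"."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"


def find_similar(cursor: Any, query_embedding: Sequence[float], k: int = 5) -> List[Tuple]:
    """Return the k rows of `chunks` closest to query_embedding by cosine distance.

    `cursor` is any DB-API cursor connected to the database (for example
    `db.cursor` from the DB class); this helper is an illustrative assumption.
    """
    literal = to_vector_literal(query_embedding)
    cursor.execute(
        "SELECT id, component, section_title, "
        "1 - (embedding <=> %s::vector) AS cosine_similarity "
        "FROM chunks "
        "ORDER BY embedding <=> %s::vector "  # ascending distance = most similar first
        "LIMIT %s",
        (literal, literal, k),
    )
    return cursor.fetchall()
```

With an open connection, a call would look like `find_similar(db.cursor, query_vector, k=5)`, where `query_vector` is a 768-dimensional embedding produced by the same model as the stored chunks.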

## Load and Store Embeddings

Load the embedded chunks from the JSON file so they can be inserted into the database.

In [7]:
import json
import os
from pathlib import Path

# The path is resolved relative to the notebook's working directory.
chunks_file = Path.cwd().parent.parent.parent / 'etl/data/chunks/chunks_primitive_embedded.json'
with open(chunks_file, encoding="utf-8") as f:
    chunks = json.load(f)
print(f"Loaded {len(chunks)} chunks")

Loaded 186 chunks
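
Because the table declares `embedding vector(768)`, a chunk with a missing or differently sized embedding would fail at insert time. A small pre-check can surface such chunks before any database work starts. This is a sketch under the assumption that the JSON file is a list of objects, each with an `embedding` list; `find_invalid_chunks` is a hypothetical helper, not part of the original notebook.

```python
from typing import Any, Dict, List

EXPECTED_DIM = 768  # matches vector(768) in the chunks table


def find_invalid_chunks(chunks: List[Dict[str, Any]], dim: int = EXPECTED_DIM) -> List[int]:
    """Return the indexes of chunks whose embedding is missing or has the wrong length."""
    bad = []
    for i, chunk in enumerate(chunks):
        emb = chunk.get("embedding")
        if not isinstance(emb, list) or len(emb) != dim:
            bad.append(i)
    return bad
```

Calling `find_invalid_chunks(chunks)` right after loading returns an empty list when every chunk is safe to insert.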


## Environment Variables

Read the connection settings from environment variables, falling back to local defaults, then create the schema and store the chunks:

In [9]:
db = DB(
    host=os.getenv("DB_HOST", "localhost"),
    port=int(os.getenv("DB_PORT", "5432")),
    database=os.getenv("DB_NAME", "marigold_rag"),
    user=os.getenv("DB_USER", "postgres"),
    password=os.getenv("DB_PASSWORD", "postgres"),
)

try:
    db.setup()
    for chunk in chunks:
        db.insert(**chunk)
    print(f"Done: {len(chunks)} chunks stored")
finally:
    db.close()  # release the connection even if an insert fails

Done: 186 chunks stored
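
The loop above commits once per chunk, which is fine for 186 rows but adds overhead on larger loads. The following sketch, offered as one possible alternative, builds all rows first and commits once, rolling back if any insert fails. `chunk_to_row` and `insert_many` are hypothetical helpers; they assume the same column layout as `DB.insert` and accept any DB-API connection, such as `db.connection`.

```python
import json
from typing import Any, Dict, Iterable, Tuple

INSERT_SQL = (
    "INSERT INTO chunks (component, section_title, section_path, content, "
    "embedding, demo_files, images, token_count) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)"
)


def chunk_to_row(chunk: Dict[str, Any]) -> Tuple:
    """Convert one chunk dict into a parameter tuple matching INSERT_SQL."""
    emb = "[" + ",".join(str(x) for x in chunk["embedding"]) + "]"
    return (
        chunk.get("component"), chunk.get("section_title"),
        chunk.get("section_path"), chunk.get("content"), emb,
        json.dumps(chunk.get("demo_files", [])),
        json.dumps(chunk.get("images", [])),
        chunk.get("token_count", 0),
    )


def insert_many(connection: Any, chunks: Iterable[Dict[str, Any]]) -> int:
    """Insert all chunks in one transaction and return the number of rows sent."""
    rows = [chunk_to_row(c) for c in chunks]
    cursor = connection.cursor()
    try:
        cursor.executemany(INSERT_SQL, rows)
        connection.commit()  # single commit for the whole batch
    except Exception:
        connection.rollback()  # leave the table unchanged on failure
        raise
    finally:
        cursor.close()
    return len(rows)
```

Note that the `chunks` table has no uniqueness constraint beyond the serial `id`, so rerunning either the loop or this helper on the same file inserts the rows again. For much larger files, `psycopg2.extras.execute_values` may be worth evaluating as a faster bulk path.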
