### Setup

Rename `compose.yaml.default` to `compose.yaml` and edit it if needed. As shipped, it launches the database, the vectorizer worker, and Ollama.

Execute this to start the containers:
```sh
docker compose up -d
```

---

### By the way, if you wreck your install and need to start over:
```sh
docker compose down
docker volume rm pgai_data
docker rm pgai_db_1 pgai_vectorizer-worker_1 pgai_ollama_1
```

### Set up your psql session

```sh
docker compose exec db psql
```

---

Install pgai
```sql
CREATE EXTENSION IF NOT EXISTS ai CASCADE;
```

---

While you're there, create the database table
```sql
CREATE TABLE texts (
    id          BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    filename    TEXT,
    author      TEXT,
    title       TEXT NOT NULL,
    text        TEXT NOT NULL
);
```

---

And, before you quit, create the vectorizer
```sql
SELECT ai.create_vectorizer(
     'texts'::regclass,
     destination => 'texts_embeddings',
     embedding => ai.embedding_ollama('nomic-embed-text', 768),
     chunking => ai.chunking_recursive_character_text_splitter('text')
);
```

*Keep in mind that for other texts, you can customize the chunking and embedding models.*

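To make the chunking step more concrete, here is a simplified pure-Python sketch of the idea behind recursive character splitting: try the coarsest separator first, and fall back to finer ones only for pieces that are still too long. It is an illustration only, not pgai's implementation. The function name `split_recursive` and the separator list are assumptions, and chunk overlap is left out for brevity.

```python
def split_recursive(text, chunk_size=500, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            chunks.extend(split_recursive(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(split_recursive("aaa bbb ccc ddd", chunk_size=7))  # ['aaa bbb', 'ccc ddd']
```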
---

If you've got a spare terminal window, you can tail the vectorizer logs
```sh
docker compose logs -f vectorizer-worker
```

---
Let's get our imports out of the way. In your virtual environment, run the following to install the necessary libraries.
```sh
python -m pip install numpy ollama langchain_text_splitters python-dotenv pandas psycopg2-binary Jinja2
```


In [1]:
import json
import os.path
from os import listdir
from os.path import isfile, join
import re
from numpy.linalg import norm
from ollama import Client
import time
import numpy as np
from langchain_text_splitters import RecursiveCharacterTextSplitter
from dotenv import load_dotenv



In [3]:
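import os

import psycopg2
from dotenv import load_dotenv

# This cell repeats the psql setup steps above; skip it if you already ran them there.
# Read OLLAMA_HOST and DATABASE_CONNECTION_STRING from a .env file, if present.
load_dotenv()
OLLAMA_HOST = os.environ.get("OLLAMA_HOST")
DB_String = os.environ.get("DATABASE_CONNECTION_STRING")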

def connect_db():
    return psycopg2.connect(DB_String)

with connect_db() as connection:
    with connection.cursor() as cursor:
        cursor.execute("""
                       CREATE EXTENSION IF NOT EXISTS ai CASCADE;
                       """)

        cursor.execute("""
                       CREATE TABLE IF NOT EXISTS texts (
                        id          BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
                        filename    TEXT,
                        author      TEXT,
                        title       TEXT NOT NULL,
                        text        TEXT NOT NULL
                        );
                       """)

        cursor.execute("""
                       SELECT ai.create_vectorizer(
                           'texts'::regclass,
                            destination => 'texts_embeddings',
                            embedding => ai.embedding_ollama('nomic-embed-text', 768),
                            chunking => ai.chunking_recursive_character_text_splitter('text', 500, 10, separators => array[E'\n;', ' '])
                        );
                       """)



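If either variable is missing from `.env`, `os.environ.get` quietly returns `None` and the failure only surfaces later, at connection time. A small guard can fail earlier with a clearer message. This is a sketch: `require_env` is a hypothetical helper, not part of python-dotenv, and would be used as `DB_String = require_env("DATABASE_CONNECTION_STRING")` after `load_dotenv()`.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```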
Now load some data into the **texts** table.

In [4]:
from pathlib import Path
import re

class Text:
    def __init__(self, filename):
        path = Path(filename)
        # Files are laid out as <author>/<title>.txt, so the title is the file
        # stem and the author is the parent directory name. Using pathlib
        # instead of backslash regexes keeps this working on any OS.
        self.title = path.stem
        self.author = path.parent.name
        self.filename = f"{self.author}-{self.title}"

        with open(path, encoding="utf-8-sig") as f:
            file_contents = f.read()
        # Collapse runs of whitespace other than line breaks into a single space.
        self.contents = re.sub(r'[^\S\r\n]+', " ", file_contents)

def load_directory(dirname, pattern):
    # Recursively collect every file under dirname that matches the glob pattern.
    return list(Path(dirname).rglob(pattern))

# Intentionally lazy: if a record with this filename is already in the database, we leave it alone for now.
def add_text_record(_text):
    print(_text.filename)
    with connect_db() as connection:
        with connection.cursor() as cursor:
            cursor.execute("SELECT filename FROM texts WHERE filename = %s", [_text.filename])
            if cursor.fetchone() is None:
                cursor.execute(
                    "INSERT INTO texts (filename, author, title, text) "
                    "VALUES (%s, %s, %s, %s)",
                    (_text.filename, _text.author, _text.title, _text.contents),
                )

files = load_directory("texts", "*.txt")
# To limit how many texts are vectorized, slice the list, e.g. files[:5]
for file in files:
    text = Text(str(file))
    # We aren't creating an array of Text objects because it could consume an
    # outrageous amount of memory if there are tons of texts.
    add_text_record(text)



barrie-peterpan
god-bible
god-world192
stoker-dracula
shakespeare-a lover's complaint
shakespeare-all's well that ends well
shakespeare-antony and cleopatra
shakespeare-as you like it
shakespeare-comedy of errors
shakespeare-coriolanus
shakespeare-cymbeline
shakespeare-hamlet
shakespeare-julius caesar
shakespeare-king henry iv, part 1
shakespeare-king henry iv, part 2
shakespeare-king henry v
shakespeare-king henry vi, part 1
shakespeare-king henry vi, part 2
shakespeare-king henry vi, part 3
shakespeare-king henry viii
shakespeare-king john
shakespeare-king lear
shakespeare-king richard ii
shakespeare-king richard iii
shakespeare-love's labour's lost
shakespeare-lucrece
shakespeare-macbeth
shakespeare-measure for measure
shakespeare-merchant of venice
shakespeare-merry wives of windsor
shakespeare-midsummer night's dream
shakespeare-much ado about nothing
shakespeare-othello
shakespeare-pericles, prince of tyre
shakespeare-romeo and juliet
shakespeare-sonnets
shakespeare-taming of the

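The record names printed above come from the file layout: judging by the loader and its output, each file appears to live at `texts/<author>/<title>.txt` (an assumption about the directory structure). The sketch below shows that mapping with `pathlib`, which handles both Windows and POSIX separators. `parse_text_path` is a hypothetical helper used only for illustration.

```python
from pathlib import PurePosixPath, PureWindowsPath


def parse_text_path(path):
    """Return (author, title, record_name) for a path like texts/<author>/<title>.txt."""
    author = path.parent.name   # directory that contains the file
    title = path.stem           # file name without the .txt suffix
    return author, title, f"{author}-{title}"


# The same logic works for both path flavours.
print(parse_text_path(PurePosixPath("texts/stoker/dracula.txt")))
print(parse_text_path(PureWindowsPath(r"texts\stoker\dracula.txt")))
# Both print ('stoker', 'dracula', 'stoker-dracula')
```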
In [5]:
# `<=>` is pgvector's cosine distance operator: smaller values mean closer matches.
def gather_appropriate_chunks_by_author(author, prompt, count):
    with connect_db() as connection:
        with connection.cursor() as cursor:
            cursor.execute("""
                           SELECT 
                           chunk,
                           embedding <=> ai.ollama_embed('nomic-embed-text', %s) as distance
                           FROM texts_embeddings
                           WHERE author = %s
                           ORDER BY distance
                           LIMIT %s;
                           """, (prompt, author, count))
            return cursor.fetchall()    

def gather_appropriate_chunks_by_title(title, prompt, count):
    with connect_db() as connection:
        with connection.cursor() as cursor:
            cursor.execute("""
                           SELECT 
                           chunk,
                           embedding <=> ai.ollama_embed('nomic-embed-text', %s) as distance
                           FROM texts_embeddings
                           WHERE title = %s
                           ORDER BY distance
                           LIMIT %s;
                           """, (prompt, title, count))
            return cursor.fetchall() 

def gather_inappropriate_chunks(prompt, count):
    with connect_db() as connection:
        with connection.cursor() as cursor:
            cursor.execute("""
                           SELECT 
                           chunk,
                           embedding <=> ai.ollama_embed('nomic-embed-text', %s) as distance
                           FROM texts_embeddings
                           ORDER BY distance
                           LIMIT %s;
                           """, (prompt, count))
            return cursor.fetchall() 
            
chunks = gather_appropriate_chunks_by_author('god', 'who is the king?', 5)
for chunk in chunks:
    print(len(chunk[0]))
print("---")

chunks = gather_appropriate_chunks_by_title('bible', 'who is the king?', 5)
for chunk in chunks:
    print(len(chunk[0]))
print("---")

chunks = gather_inappropriate_chunks('who is the king?', 5)
for chunk in chunks:
    print(len(chunk[0]))

499
496
496
497
495
---
499
496
496
497
495
---
499
496
496
499
494

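The queries above rank chunks with pgvector's `<=>` operator, which computes cosine distance (1 minus cosine similarity). As a rough illustration of what the database does for each row, here is a pure-Python sketch. The names `cosine_distance` and `nearest` are assumptions for this example, and a real query works on 768-dimensional embeddings rather than toy vectors.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors: 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, rows, count):
    """Mimic ORDER BY distance LIMIT count over (chunk, embedding) rows."""
    scored = [(chunk, cosine_distance(embedding, query)) for chunk, embedding in rows]
    return sorted(scored, key=lambda row: row[1])[:count]


rows = [("a", [1, 0]), ("b", [0, 1]), ("c", [1, 1])]
for chunk, distance in nearest([1, 0], rows, 2):
    print(chunk, round(distance, 3))
```

Each result is a `(chunk, distance)` pair, the same shape the `gather_*` functions return from `cursor.fetchall()`.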

In [6]:
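client = Client(
    # Address of the author's Ollama server; point this at your own instance
    # (for example, the OLLAMA_HOST value loaded earlier).
    host='http://192.168.137.117:11434',
)
prompt_size = 8000
chunking_size = 500
# How many chunks of chunking_size fit within prompt_size.
chunk_count = prompt_size // chunking_size
embed_model = "nomic-embed-text"
generate_model = "gemma3:12b"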
SYSTEM_PROMPT = """
You are a helpful reading assistant who answers questions
based on snippets of text provided in context. Answer only using the context provided,
being as concise as possible. If the answer isn't in the context, simply say so.
Context:
"""

def generate_response(prompt, most_similar_chunks, model):
    # Pull your models ahead of time (for example with client.pull(model)) rather than here.
    # Each row is (chunk, distance); only the chunk text goes into the context.
    system_prompt = SYSTEM_PROMPT + "\n".join(item[0] for item in most_similar_chunks)
    
    response = client.chat(
        model,
        messages = [
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        stream = True
    )
    return response

def stream_response(stream):
    for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)

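The prompt assembly can be checked without a running Ollama server. The sketch below isolates the two pure pieces of the cell above: building the message list from `(chunk, distance)` rows, and the chunk budget arithmetic. `build_messages` and `chunk_budget` are hypothetical names, and the system prompt is shortened for the example.

```python
SYSTEM_PROMPT = "Answer only using the context provided.\nContext:\n"


def build_messages(prompt, rows):
    """Build chat messages from (chunk, distance) rows; distances are dropped."""
    context = "\n".join(row[0] for row in rows)
    return [
        {"role": "system", "content": SYSTEM_PROMPT + context},
        {"role": "user", "content": prompt},
    ]


def chunk_budget(prompt_size, chunking_size):
    """How many whole chunks fit into the prompt budget."""
    return prompt_size // chunking_size


print(chunk_budget(8000, 500))  # 16
print(build_messages("who is peter?", [("chunk one", 0.1), ("chunk two", 0.2)]))
```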
In [7]:
prompt = "who is peter and what was his role on the island?"

most_similar_chunks = gather_appropriate_chunks_by_title("peterpan", prompt, 7)
stream_response(generate_response(prompt, most_similar_chunks, generate_model))


Peter is the "Great White Father," and he is the one who forbade the boys from looking like him. He leads the island, sets rules, and "thins" out the lost boys when they threaten to grow up.

In [None]:
prompt = "who is peter and what was his role on the island?"

most_similar_chunks = gather_inappropriate_chunks(prompt, 6)
for chunk in most_similar_chunks:
    print(chunk)
    
stream_response(generate_response(prompt, most_similar_chunks, generate_model))

In [None]:
prompt = "who ate Dracula's food??"

most_similar_chunks = gather_inappropriate_chunks(prompt, 7)
for chunk in most_similar_chunks:
    print(chunk)
    
stream_response(generate_response(prompt, most_similar_chunks, generate_model))