### Setup

Rename `compose.yaml.default` to `compose.yaml` and edit it if needed. As shipped, it launches the vectorizer worker, the database, and Ollama.

Execute this to start the containers:
```sh
docker compose up -d
```

---

### By the way, if you wreck your install and need to start over:
```sh
docker compose down
docker volume rm pgai_data
docker rm pgai_db_1 pgai_vectorizer-worker_1 pgai_ollama_1
```

### Set up your psql session

```sh
docker compose exec db psql
```

---

Install pgai
```sql
CREATE EXTENSION IF NOT EXISTS ai CASCADE;
```

---

While you're there, create the database table
```sql
CREATE TABLE texts (
    id          BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    filename    TEXT,
    author      TEXT,
    title       TEXT NOT NULL,
    text        TEXT NOT NULL
);
```

---

And, before you quit, create the vectorizer
```sql
SELECT ai.create_vectorizer(
     'texts'::regclass,
     destination => 'texts_embeddings',
     embedding => ai.embedding_ollama('nomic-embed-text', 768),
     chunking => ai.chunking_recursive_character_text_splitter('text')
);
```

*Keep in mind that for other texts, you can customize the chunking and embedding models.*
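
To make the chunking option more concrete, here is a simplified pure-Python sketch of recursive character splitting. It is not pgai's or LangChain's implementation, and it leaves out chunk overlap; the function name `split_recursive` is hypothetical. The idea is to try each separator in order and fall back to the next one only for pieces that are still too long.

```python
def split_recursive(text, chunk_size, separators):
    """Split text into chunks of at most chunk_size characters (illustrative only)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the finer separators.
            chunks.extend(split_recursive(piece, chunk_size, remaining))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(split_recursive("aa bb cc dd", 5, ["\n", " "]))  # ['aa bb', 'cc dd']
```

Smaller chunks give more precise matches but carry less surrounding context each, so the right size depends on the texts you load.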

---

If you've got a spare terminal window, you can tail the vectorizer logs
```sh
docker compose logs -f vectorizer-worker
```

---
Let's get our imports out of the way. In your virtual environment, run the following to install the necessary libraries.
```sh
python -m pip install numpy ollama langchain_text_splitters python-dotenv pandas psycopg2-binary Jinja2
```


In [None]:
import json
import os.path
from os import listdir
from os.path import isfile, join
import re
from ollama import Client
import time
import numpy as np
from langchain_text_splitters import RecursiveCharacterTextSplitter
from dotenv import load_dotenv



In [28]:
import psycopg2
from dotenv import load_dotenv
import os

load_dotenv()
OLLAMA_HOST=os.environ.get("OLLAMA_HOST")
DB_String=os.environ.get("DATABASE_CONNECTION_STRING")

def connect_db():
    return psycopg2.connect(DB_String)

def create_table(tablename):
    with connect_db() as connection:
        with connection.cursor() as cursor:
            cursor.execute("""
                        CREATE EXTENSION IF NOT EXISTS ai CASCADE;
                        """)

            cursor.execute(f"""
                        CREATE TABLE IF NOT EXISTS {tablename} (
                            id          BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
                            filename    TEXT,
                            author      TEXT,
                            title       TEXT NOT NULL,
                            text        TEXT NOT NULL
                            )
                        """)

            # Create the vectorizer only if its destination view does not exist yet.
            # to_regclass() returns NULL for a missing relation instead of raising.
            cursor.execute("SELECT to_regclass(%s)", (f"{tablename}_embeddings",))
            if cursor.fetchone()[0] is None:
                cursor.execute(f"""
                            SELECT ai.create_vectorizer(
                                '{tablename}'::regclass,
                                    destination => '{tablename}_embeddings',
                                    embedding => ai.embedding_ollama('nomic-embed-text', 768),
                                    chunking => ai.chunking_recursive_character_text_splitter('text', 500, 10, separators => array[E'\n;', ' '])
                                )
                            """)


Now, load some data into the **texts** table

In [30]:
from pathlib import Path
import re
content_directory = "texts"

class Text:
    def __init__(self, filename):
        # Expect a layout like texts/<author>/<title>.txt. pathlib handles both
        # Windows and POSIX path separators.
        path = Path(filename)
        self.title = path.stem
        self.author = path.parent.name
        self.filename = f"{self.author}-{self.title}"

        with open(filename, encoding="utf-8-sig") as f:
            file_contents = f.read()
            self.contents = re.sub(r'[^\S\r\n]+', " ", file_contents)
        # print(f"{self.title} by {self.author}")
        # print(self.contents[:30])

def load_directory(dirname, pattern):
    # Recursively collect every file under dirname that matches the glob pattern.
    return list(Path(dirname).rglob(pattern))

# This method has an intentional lazy flaw in that if a record is already in the database, we just leave it alone for now.
def add_text_record(_text):
    print(_text.filename)
    with connect_db() as connection:
        with connection.cursor() as cursor:
            cursor.execute("SELECT filename from texts where filename = %s", [_text.filename])
            if not cursor.fetchall():
                cursor.execute(
                    "INSERT INTO texts (filename, author, title, text) "
                    "VALUES (%s, %s, %s, %s)",
                    (_text.filename, _text.author, _text.title, _text.contents))

create_table(content_directory)
files = load_directory(content_directory, "*.txt")

# Modify this to limit how many texts to vectorize
for file in files:
    text = Text(str(file))
    # We aren't creating an array of Text objects because it could consume an
    # outrageous amount of memory if there are tons of texts.
    add_text_record(text)



barrie-peterpan
god-bible
god-world192
stoker-dracula
shakespeare-a lover's complaint
shakespeare-all's well that ends well
shakespeare-antony and cleopatra
shakespeare-as you like it
shakespeare-comedy of errors
shakespeare-coriolanus
shakespeare-cymbeline
shakespeare-hamlet
shakespeare-julius caesar
shakespeare-king henry iv, part 1
shakespeare-king henry iv, part 2
shakespeare-king henry v
shakespeare-king henry vi, part 1
shakespeare-king henry vi, part 2
shakespeare-king henry vi, part 3
shakespeare-king henry viii
shakespeare-king john
shakespeare-king lear
shakespeare-king richard ii
shakespeare-king richard iii
shakespeare-love's labour's lost
shakespeare-lucrece
shakespeare-macbeth
shakespeare-measure for measure
shakespeare-merchant of venice
shakespeare-merry wives of windsor
shakespeare-midsummer night's dream
shakespeare-much ado about nothing
shakespeare-othello
shakespeare-pericles, prince of tyre
shakespeare-romeo and juliet
shakespeare-sonnets
shakespeare-taming of the

In [None]:
def gather_appropriate_chunks_by_author_metadata(author, prompt, count):
    with connect_db() as connection:
        with connection.cursor() as cursor:
            cursor.execute("""
                           SELECT 
                           chunk,
                           embedding <=> ai.ollama_embed('nomic-embed-text', %s) as distance
                           FROM texts_embeddings
                           WHERE author = %s
                           ORDER BY distance
                           LIMIT %s;
                           """, (prompt, author, count))
            return cursor.fetchall()    

def gather_appropriate_chunks_by_title_metadata(title, prompt, count):
    with connect_db() as connection:
        with connection.cursor() as cursor:
            cursor.execute("""
                           SELECT 
                           chunk,
                           embedding <=> ai.ollama_embed('nomic-embed-text', %s) as distance
                           FROM texts_embeddings
                           WHERE title = %s
                           ORDER BY distance
                           LIMIT %s;
                           """, (prompt, title, count))
            return cursor.fetchall() 

def gather_inappropriate_chunks(prompt, count):
    with connect_db() as connection:
        with connection.cursor() as cursor:
            cursor.execute("""
                           SELECT 
                           chunk,
                           embedding <=> ai.ollama_embed('nomic-embed-text', %s) as distance
                           FROM texts_embeddings
                           ORDER BY distance
                           LIMIT %s;
                           """, (prompt, count))
            return cursor.fetchall() 
            

499
496
496
497
495
---
499
496
496
497
495
---
499
496
496
499
494
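
The `<=>` operator in the queries above is pgvector's cosine distance operator: 0 means the two vectors point the same way, and larger values mean less similar text. As a minimal sketch of what that number represents, here is the same calculation in plain Python (the function name `cosine_distance` is just for illustration; the database does this work for you).

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the queries use `ORDER BY distance`, the chunks closest in meaning to the prompt come back first.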


In [33]:
# Use the host from .env (OLLAMA_HOST, loaded above) rather than a hard-coded address.
client = Client(
    host=OLLAMA_HOST,
)
prompt_size=8000
chunking_size=500
chunk_count = prompt_size // chunking_size
embed_model = "nomic-embed-text"
generate_model = "gemma3:12b"
SYSTEM_PROMPT = """
You are a helpful reading assistant who answers questions
based on snippets of text provided in context. Answer only using the context provided,
being as concise as possible. If the answer isn't in the context, simply say so.
Context:
"""

def generate_response(prompt, most_similar_chunks, model):
    # client.pull(generate_model) # You really should pre-pull your models
    
    # most_similar_chunks = gather_inappropriate_chunks(prompt)
    # for item in most_similar_chunks:
    #     print(item[0])
    # print("\n\n\n")
    system_prompt = SYSTEM_PROMPT + "\n".join(item[0] for item in most_similar_chunks)
    # print(f"{system_prompt}\n\n")
    
    response = client.chat(
        model,
        messages = [
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        stream = True
    )
    return response

def stream_response(stream):
    for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)

In [35]:
prompt = "who is peter and what was his role on the island?"

# 14 seems to be the sweet spot. Any more and you overrun the context window and bad things happen.
most_similar_chunks = gather_appropriate_chunks_by_title_metadata("peterpan", prompt, 14)
stream_response(generate_response(prompt, most_similar_chunks, generate_model))


Peter is described as "youth, joy," and "a little bird that has broken out of the egg." His role on the island includes:

*   He is a leader, and his band follows his instructions.
*   He is the "Great White Father" who is prostrated before.
*   He "thins them out" amongst the lost boys when they seem to be growing up.
*   He leads adventures and saved Tiger Lily from a dreadful fate.

In [39]:
prompt = "who is peter and what was his role on the island?"

most_similar_chunks = gather_appropriate_chunks_by_title_metadata("peterpan", prompt, 2)
for chunk in most_similar_chunks:
    print(chunk)
print("---\n")
stream_response(generate_response(prompt, most_similar_chunks, generate_model))

('When last we saw him he was stealing across the\nisland with one finger to his lips and his dagger at the ready. He had\nseen the crocodile pass by without noticing anything peculiar about it,\nbut by and by he remembered that it had not been ticking. At first he\nthought this eerie, but soon concluded rightly that the clock had run\ndown.\n\nWithout giving a thought to what might be the feelings of a\nfellow-creature thus abruptly deprived of its closest companion, Peter\nbegan to consider how he could', 0.31855671688404563)
('did not compete. For one thing he despised all mothers except\nWendy, and for another he was the only boy on the island who could\nneither write nor spell; not the smallest word. He was above all that\nsort of thing.\n\nBy the way, the questions were all written in the past tense. What was\nthe colour of Mother’s eyes, and so on. Wendy, you see, had been\nforgetting, too.\n\nAdventures, of course, as we shall see, were of daily occurrence; but\nabout this time

In [40]:
prompt = "who is peter and what was his role on the island?"

most_similar_chunks = gather_appropriate_chunks_by_title_metadata("peterpan", prompt, 17)
stream_response(generate_response(prompt, most_similar_chunks, generate_model))


Okay, let's break down who Peter is and his incredibly complex role in the TV series *Lost*. **Be warned: This explanation contains MAJOR spoilers.**

**Who is Peter?**

Peter is the son of John Locke. He's a character who initially appears quite ordinary, almost clumsy and pathetic. He's known for his poor social skills, his reliance on his father, and a general feeling of helplessness. However, as the series progresses, it's revealed that he is *far* from ordinary. He's incredibly powerful, and his true nature is the biggest mystery of the show.

**His Role on the Island (and Beyond): A Multi-Layered Explanation**

Peter's role is the most convoluted and significant in *Lost*. Here's a breakdown of his functions, progressing from his apparent role on the island to the staggering truth about his identity and purpose:

**1. The Protector/Guardian (Initial Appearance):**

*   **Obsessed with John:** In the early seasons, Peter's primary concern is his father, John Locke. He's intensely 

In [42]:
prompt = "who ate Dracula's food??"

most_similar_chunks = gather_inappropriate_chunks(prompt, 7)
for chunk in most_similar_chunks:
    print(chunk)
print("---\n") 
   
stream_response(generate_response(prompt, most_similar_chunks, generate_model))

("priest buy any soul with his money, he shall eat of it, and he that is born in his house: they shall eat of his meat. \nIf the priest's daughter also be married unto a stranger, she may not eat of an offering of the holy things. \nBut if the priest's daughter be a widow, or divorced, and have no child, and is returned unto her father's house, as in her youth, she shall eat of her father's meat: but there shall be no stranger eat thereof. \nAnd if a man eat of the holy thing unwittingly, then he", 0.34392842957334924)
('and the most\ncunning, as well as the bravest of the sons of the ‘land beyond the\nforest.’ That mighty brain and that iron resolution went with him to his\ngrave, and are even now arrayed against us. The Draculas were, says\nArminius, a great and noble race, though now and again were scions who\nwere held by their coevals to have had dealings with the Evil One. They\nlearned his secrets in the Scholomance, amongst the mountains over Lake\nHermanstadt, where the devil 

In [None]:
prompt = "who ate Dracula's food??"

most_similar_chunks = gather_appropriate_chunks_by_title_metadata("dracula", prompt, 20)
stream_response(generate_response(prompt, most_similar_chunks, generate_model))

This is a fun question tied to the recent "What We Do in the Shadows" TV series! Here's the breakdown of who ate Dracula's food and why it's such a running gag:

**It was Guillermo de la Cruz!**

Here's the story:

*   **Dracula's Pickiness:** Dracula is incredibly particular about his food. He only eats very specific, rare, and ancient delicacies.
*   **Guillermo's Secret Snacking:** Guillermo, Dracula's familiar (who desperately wants to *be* Dracula), would often sneak into the kitchen and eat Dracula's food when he wasn't looking. He did this for a long time, motivated by a mix of hunger and a subconscious desire to connect with his master.
*   **The Revelation:**  The running joke escalated when it was revealed (in a hilarious flashback) that Guillermo had been eating Dracula's food for decades! This was a major betrayal in Dracula's eyes.
*   **The Ongoing Conflict:** Dracula is constantly discovering evidence of Guillermo's food theft, leading to dramatic confrontations and a lo

Let's use a smaller number of chunks so we fit in context.

In [None]:
most_similar_chunks = gather_appropriate_chunks_by_title_metadata("dracula", prompt, 15) 
stream_response(generate_response(prompt, most_similar_chunks, generate_model))

The narrator ate Dracula’s food, which was "robber steak" - bits of bacon, onion, and beef, seasoned with red pepper, and strung on sticks.
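
A rough way to see where the chunk limit comes from is to budget the context window. The sketch below assumes about 4 characters per token, a 2048-token window, and 200 tokens reserved for the system prompt and question. All three numbers are assumptions, not measured values, and the helper `max_chunks_for_context` is hypothetical; the real window depends on your Ollama version and model settings.

```python
def max_chunks_for_context(context_tokens, chunk_chars,
                           reserved_tokens=200, chars_per_token=4.0):
    """Estimate how many chunks of chunk_chars characters fit in the window."""
    tokens_per_chunk = chunk_chars / chars_per_token
    available = context_tokens - reserved_tokens
    if available <= 0:
        return 0
    return int(available // tokens_per_chunk)


# 500-character chunks, as configured in the vectorizer above.
print(max_chunks_for_context(2048, 500))  # 14
```

The estimate is only a ballpark, since token counts vary with the text. If you need more chunks, the Ollama client accepts an `options` argument such as `options={"num_ctx": 8192}` on `client.chat`, at the cost of more memory.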

Let's rerun with fewer of the same chunks, first on the same model and then on a tiny LLM (`gemma3:1b`).

In [55]:
stream_response(generate_response(prompt, most_similar_chunks[:14], "gemma3:12b"))

Jonathan Harker ate Dracula’s food, specifically "robber steak"—bits of bacon, onion, and beef seasoned with red pepper, strung on sticks, and roasted over a fire.

In [57]:
stream_response(generate_response(prompt, most_similar_chunks[:10], "gemma3:1b"))

“It was my own brother, Jonathan Harker, who ate Draculas’s food. He was consumed by a profound and unsettling restlessness, seeking to understand the nature of the vampire. He was compelled to taste it, driven by a morbid curiosity and a sense of desperate need for knowledge. He did not willingly eat it, but he did consume it, driven by a compulsion that he couldn’t explain.”

In [58]:
prompt = "Did Jonathan have anything to drink with the meals that Dracula prepared?"

most_similar_chunks = gather_appropriate_chunks_by_title_metadata("dracula", prompt, 15)
stream_response(generate_response(prompt, most_similar_chunks, "gemma3:12b"))

The text states, "He eat not as others. Even friend Jonathan, who lived with him for weeks, did never see him to eat, never!"