### Setup

Rename `compose.yaml.default` to `compose.yaml` and edit it if needed. As it stands, it launches the vectorizer worker, the database, and Ollama.

Execute this to start the containers:
```sh
docker compose up -d
```

---

### By the way, if you wreck your install and need to start over:
```sh
docker compose down
docker volume rm pgai_data
docker rm pgai_db_1 pgai_vectorizer-worker_1 pgai_ollama_1
```

### Set up your psql session

```sh
docker compose exec db psql
```

---

Install pgai
```sql
CREATE EXTENSION IF NOT EXISTS ai CASCADE;
```

---

While you're there, create the database table
```sql
CREATE TABLE texts (
    id          BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    filename    TEXT,
    author      TEXT,
    title       TEXT NOT NULL,
    text        TEXT NOT NULL
);
```

---

And, before you quit, create the vectorizer
```sql
SELECT ai.create_vectorizer(
     'texts'::regclass,
     destination => 'texts_embeddings',
     embedding => ai.embedding_ollama('nomic-embed-text', 768),
     chunking => ai.chunking_recursive_character_text_splitter('text')
);
```

*Keep in mind that for other texts, you can customize the chunking and embedding models.*
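
To get a feel for what customizing the chunking changes, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is a simplification and not pgai's recursive splitter: it ignores separators entirely, and `chunk_with_overlap` is a hypothetical helper name. The defaults of 500 and 10 mirror the chunk size and overlap that the notebook passes to the splitter later on.

```python
def chunk_with_overlap(text, chunk_size=500, chunk_overlap=10):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap.

    Simplified illustration only: a recursive splitter also tries to break on
    separators, so its chunks are usually a little shorter than chunk_size.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not just the previous overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample = "".join(chr(ord("a") + i % 26) for i in range(1200))
chunks = chunk_with_overlap(sample)
print([len(c) for c in chunks])  # [500, 500, 220]
```

A larger overlap repeats more text between neighbouring chunks, which costs storage and context space but makes it less likely that a sentence is cut off from the words that explain it.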

---

If you've got a spare terminal window, you can tail the vectorizer logs
```sh
docker compose logs -f vectorizer-worker
```

---

Let's get our imports out of the way. In your virtual environment, run the following to install the necessary libraries.
```sh
python -m pip install numpy ollama langchain_text_splitters python-dotenv pandas psycopg2-binary Jinja2
```


In [2]:
import json
import os.path
from os import listdir
from os.path import isfile, join
import re
from ollama import Client
import time
import numpy as np
from langchain_text_splitters import RecursiveCharacterTextSplitter
from dotenv import load_dotenv



In [1]:
import psycopg2
from dotenv import load_dotenv
import os

load_dotenv()
OLLAMA_HOST=os.environ.get("OLLAMA_HOST")
DB_String=os.environ.get("DATABASE_CONNECTION_STRING")

def connect_db():
    return psycopg2.connect(DB_String)

def create_table(tablename):
    with connect_db() as connection:
        with connection.cursor() as cursor:
            cursor.execute("""
                        CREATE EXTENSION IF NOT EXISTS ai CASCADE;
                        """)

            cursor.execute(f"""
                        CREATE TABLE IF NOT EXISTS {tablename} (
                            id          BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
                            filename    TEXT,
                            author      TEXT,
                            title       TEXT NOT NULL,
                            text        TEXT NOT NULL
                            )
                        """)

            # Create the vectorizer only if its destination does not exist yet.
            # to_regclass() returns NULL for a missing relation instead of raising.
            cursor.execute("SELECT to_regclass(%s) IS NULL", (f"{tablename}_embeddings",))
            if cursor.fetchone()[0]:
                cursor.execute(f"""
                            SELECT ai.create_vectorizer(
                                '{tablename}'::regclass,
                                    destination => '{tablename}_embeddings',
                                    embedding => ai.embedding_ollama('nomic-embed-text', 768),
                                    chunking => ai.chunking_recursive_character_text_splitter('text', 500, 10, separators => array[E'\n;', ' '])
                                )
                            """)


Now, load some data into the **texts** table

In [3]:
from pathlib import Path
import re
content_directory = "texts"

class Text:
    def __init__(self, filename):
        # Expect <content_directory>/<author>/<title>.txt; Path handles any OS separator.
        path = Path(filename)
        self.title = path.stem
        self.author = path.parent.name
        self.filename = f"{self.author}-{self.title}"

        with open(filename, encoding="utf-8-sig") as f:
            file_contents = f.read()
            self.contents = re.sub(r'[^\S\r\n]+', " ", file_contents)
        # print(f"{self.title} by {self.author}")
        # print(self.contents[:30])

def load_directory(dirname, filter):
    return list(Path(dirname).rglob(filter))

# This method has an intentional lazy flaw in that if a record is already in the database, we just leave it alone for now.
def add_text_record(_text):
    print(_text.filename)
    with connect_db() as connection:
        with connection.cursor() as cursor:
            cursor.execute("SELECT filename from texts where filename = %s", [_text.filename])
            if(len(cursor.fetchall()) == 0):
                cursor.execute(
                    "INSERT INTO texts (filename, author, title, text) "
                    "VALUES (%s, %s, %s, %s)",
                    (_text.filename, _text.author, _text.title, _text.contents))

create_table(content_directory)
files = load_directory(content_directory, "*.txt")

# Modify this     vv    to limit how many texts to vectorize
for file in files[:4]:
    text = Text(str(file))
    # We aren't creating an array of Text objects because it could consume an
    # outrageous amount of memory if there are tons of texts.
    add_text_record(text)





Let's talk a bit about *Vectorization*

In [None]:
def gather_appropriate_chunks_by_author_metadata(author, prompt, count):
    with connect_db() as connection:
        with connection.cursor() as cursor:
            cursor.execute("""
                           SELECT 
                           chunk,
                           embedding <=> ai.ollama_embed('nomic-embed-text', %s) as distance
                           FROM texts_embeddings
                           WHERE author = %s
                           ORDER BY distance
                           LIMIT %s;
                           """, (prompt, author, count))
            return cursor.fetchall()    

def gather_appropriate_chunks_by_title_metadata(title, prompt, count):
    with connect_db() as connection:
        with connection.cursor() as cursor:
            cursor.execute("""
                           SELECT 
                           chunk,
                           embedding <=> ai.ollama_embed('nomic-embed-text', %s) as distance
                           FROM texts_embeddings
                           WHERE title = %s
                           ORDER BY distance
                           LIMIT %s;
                           """, (prompt, title, count))
            return cursor.fetchall() 

def gather_inappropriate_chunks(prompt, count):
    with connect_db() as connection:
        with connection.cursor() as cursor:
            cursor.execute("""
                           SELECT 
                           chunk,
                           embedding <=> ai.ollama_embed('nomic-embed-text', %s) as distance
                           FROM texts_embeddings
                           ORDER BY distance
                           LIMIT %s;
                           """, (prompt, count))
            return cursor.fetchall() 
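

The `<=>` operator in the queries above is pgvector's cosine distance, so smaller values mean closer matches and `ORDER BY distance` puts the best chunks first. Here is a minimal sketch of the same computation in plain Python, with made-up three-dimensional vectors standing in for the 768-dimensional embeddings (`cosine_distance` and the sample vectors are illustrative, not part of the notebook):

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 for same direction, 1 for orthogonal, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0, 0.0],   # length does not matter, only direction
    "orthogonal": [0.0, 3.0, 0.0],
    "opposite": [-1.0, 0.0, 0.0],
}

# Sort the way ORDER BY distance does: closest first.
ranked = sorted(candidates, key=lambda name: cosine_distance(query, candidates[name]))
print(ranked)
```

The distances printed in the cell outputs below (around 0.3) can be read on this scale: fairly close to the prompt, but far from identical.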
            



In [33]:
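client = Client(
    # Prefer OLLAMA_HOST from .env (loaded in the first cell); fall back to the original address.
    host=OLLAMA_HOST or 'http://192.168.137.117:11434',
)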
prompt_size=8000
chunking_size=500
chunk_count = prompt_size // chunking_size
embed_model = "nomic-embed-text"
generate_model = "gemma3:12b"
SYSTEM_PROMPT = """
You are a helpful reading assistant who answers questions
based on snippets of text provided in context. Answer only using the context provided,
being as concise as possible. If the answer isn't in the context, simply say so.
Context:
"""

def generate_response(prompt, most_similar_chunks, model):
    # client.pull(generate_model) # You really should pre-pull your models
    
    # most_similar_chunks = gather_inappropriate_chunks(prompt)
    # for item in most_similar_chunks:
    #     print(item[0])
    # print("\n\n\n")
    system_prompt = SYSTEM_PROMPT + "\n".join(item[0] for item in most_similar_chunks)
    # print(f"{system_prompt}\n\n")
    
    response = client.chat(
        model,
        messages = [
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        stream = True
    )
    return response

def stream_response(stream):
    for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)
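

`stream_response` prints as it goes, which is nice interactively but makes it hard to compare answers between runs. A small sketch of a companion helper that collects the streamed pieces into one string instead; `collect_response` is a hypothetical name, and the fake stream below only mimics the `['message']['content']` shape that the loop above reads:

```python
def collect_response(stream):
    """Join the streamed pieces into one string instead of printing them."""
    return "".join(chunk["message"]["content"] for chunk in stream)


# Stand-in for what a streaming chat call yields; real chunks carry more fields.
fake_stream = [
    {"message": {"content": "The Count "}},
    {"message": {"content": "gave Jonathan "}},
    {"message": {"content": "roast chicken."}},
]

answer = collect_response(fake_stream)
print(answer)
```

With the real client this would be called as `collect_response(generate_response(prompt, most_similar_chunks, generate_model))`, and the returned strings from two runs can then be compared directly.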

In [60]:
prompt = "who is peter and what was his role on the island?"

most_similar_chunks = gather_appropriate_chunks_by_title_metadata("peterpan", prompt, 14) # 14 seems to be the sweet spot. Any more and you overrun the context buffer and bad things happen.
stream_response(generate_response(prompt, most_similar_chunks, generate_model))


Peter is "youth, joy," and "a little bird that has broken out of the egg." On the island, he is the "Great White Father," who is looked up to and obeyed. He leads a band of lost boys and has the help of Tiger Lily and her braves. He also thins out the boys who are growing up and enforces rules they must follow.

In [61]:
prompt = "who is peter and what was his role on the island?"

most_similar_chunks = gather_appropriate_chunks_by_title_metadata("peterpan", prompt, 2)
for chunk in most_similar_chunks:
    print(chunk)
print("---\n")
stream_response(generate_response(prompt, most_similar_chunks, generate_model))

('When last we saw him he was stealing across the\nisland with one finger to his lips and his dagger at the ready. He had\nseen the crocodile pass by without noticing anything peculiar about it,\nbut by and by he remembered that it had not been ticking. At first he\nthought this eerie, but soon concluded rightly that the clock had run\ndown.\n\nWithout giving a thought to what might be the feelings of a\nfellow-creature thus abruptly deprived of its closest companion, Peter\nbegan to consider how he could', 0.31846014408478407)
('did not compete. For one thing he despised all mothers except\nWendy, and for another he was the only boy on the island who could\nneither write nor spell; not the smallest word. He was above all that\nsort of thing.\n\nBy the way, the questions were all written in the past tense. What was\nthe colour of Mother’s eyes, and so on. Wendy, you see, had been\nforgetting, too.\n\nAdventures, of course, as we shall see, were of daily occurrence; but\nabout this time

In [62]:
prompt = "who is peter and what was his role on the island?"

most_similar_chunks = gather_appropriate_chunks_by_title_metadata("peterpan", prompt, 17)
stream_response(generate_response(prompt, most_similar_chunks, generate_model))


Okay, let's break down who Peter is and his incredibly complex role on the island in the TV series "Lost." **Be warned: this will contain major spoilers!**

**Who is Peter?**

Peter Labek (played by actor Hector Ramirez) is a seemingly ordinary man who lives on the island with his wife, Emily.  Initially, he's presented as a simple, somewhat hapless, and emotionally stunted individual. He's known for his awkwardness, his obsession with hunting, and his peculiar relationship with his wife.

However, Peter is *far* from ordinary. He's actually a crucial element in the island's existence and purpose. Here's the breakdown of his true identity:

*   **He is the Island's "Guardian":** Peter is not a native islander, but he was brought to the island by Jacob, one of the island's protectors, long before the crash of Oceanic Flight 815.
*   **He was created as a "Vessel":** Jacob discovered that Peter was a person with unique capabilities, most importantly an unusual level of physical resilienc
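
The answer above has nothing to do with the book, which fits the warning in cell `In [60]` about overrunning the context with too many chunks. One way to guard against that is to trim the retrieved rows to a size budget before building the prompt. The sketch below is an assumption-laden illustration: `fit_chunks_to_budget` is a hypothetical helper, the budget is counted in characters (real context limits are in tokens), and 7000 is picked only because it reproduces the 14-chunk sweet spot with 500-character chunks.

```python
def fit_chunks_to_budget(rows, budget_chars):
    """Keep the closest (chunk, distance) rows whose combined text fits in budget_chars."""
    kept = []
    used = 0
    for chunk, distance in rows:  # rows arrive ordered by distance, closest first
        if used + len(chunk) > budget_chars:
            break
        kept.append((chunk, distance))
        used += len(chunk)
    return kept


# 17 fake rows shaped like the query results: (chunk text, distance).
rows = [("a" * 500, 0.1 * i) for i in range(17)]
kept = fit_chunks_to_budget(rows, budget_chars=7000)
print(len(kept))  # 14
```

If the model's context window is the real limit, raising it may be an alternative to trimming; Ollama exposes a `num_ctx` option for this, though whether a larger window fits in memory depends on the model and hardware.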

In [63]:
prompt = "who ate Dracula's food??"

most_similar_chunks = gather_inappropriate_chunks(prompt, 7)
for chunk in most_similar_chunks:
    print(chunk)
print("---\n") 
   
stream_response(generate_response(prompt, most_similar_chunks, generate_model))

("priest buy any soul with his money, he shall eat of it, and he that is born in his house: they shall eat of his meat. \nIf the priest's daughter also be married unto a stranger, she may not eat of an offering of the holy things. \nBut if the priest's daughter be a widow, or divorced, and have no child, and is returned unto her father's house, as in her youth, she shall eat of her father's meat: but there shall be no stranger eat thereof. \nAnd if a man eat of the holy thing unwittingly, then he", 0.3438387809356128)
('and the most\ncunning, as well as the bravest of the sons of the ‘land beyond the\nforest.’ That mighty brain and that iron resolution went with him to his\ngrave, and are even now arrayed against us. The Draculas were, says\nArminius, a great and noble race, though now and again were scions who\nwere held by their coevals to have had dealings with the Evil One. They\nlearned his secrets in the Scholomance, amongst the mountains over Lake\nHermanstadt, where the devil c

In [64]:
prompt = "who ate Dracula's food??"

most_similar_chunks = gather_appropriate_chunks_by_title_metadata("dracula", prompt, 17)
stream_response(generate_response(prompt, most_similar_chunks, generate_model))

This is a surprisingly complex question with a few different answers depending on which version of the Dracula story you's referring to! Here's a breakdown:

**1. In Bram Stoker's *Dracula* (the original novel):**

*   **Jonathan Harker:** He's the primary person who eats Dracula's food. He's a solicitor sent to Transylvania to finalize a property transaction, and he's held captive in Dracula's castle. Dracula insists he eats, and he's served elaborate meals. Harker is initially reluctant, finding the food strange and unsettling, but he eventually eats to stay alive.  He describes the food as being mostly meat, and often served in a very unsettling way (e.g., strangely presented and seemingly lifeless).
*   **Dracula's Servants:**  The novel implies that Dracula's three vampire brides and other servants also partake in the food served to Harker.  However, it's never explicitly stated. 
*   **Dracula himself:** He rarely eats, and when he does, it's not described as normal food, but rat

Let's use a smaller number of chunks so we fit in context.

In [66]:
most_similar_chunks = gather_appropriate_chunks_by_title_metadata("dracula", prompt, 15) 
stream_response(generate_response(prompt, most_similar_chunks, generate_model))

The narrator ate Dracula’s food, which was "robber steak" - bits of bacon, onion, and beef, seasoned with red pepper, and strung on sticks.

Even so, two runs in a row can result in different answers.

In [69]:
most_similar_chunks = gather_appropriate_chunks_by_title_metadata("dracula", prompt, 15) 
stream_response(generate_response(prompt, most_similar_chunks, generate_model))

The narrator ate Dracula’s food, which was "robber steak" - bits of bacon, onion, and beef, seasoned with red pepper, and strung on sticks.

Let's use the same model and the same chunks as before, but one fewer, because it matters.

In [71]:
stream_response(generate_response(prompt, most_similar_chunks[:14], "gemma3:12b"))

Jonathan Harker ate Dracula’s food, which was "robber steak"--bits of bacon, onion, and beef, seasoned with red pepper, and strung on sticks.

And now, a smaller model and fewer chunks.

In [72]:
stream_response(generate_response(prompt, most_similar_chunks[:10], "gemma3:1b"))

“It was… it was the Captain,” Arthur said, his voice low, “He was the one who brought it to Lucy.”

Let's try to be specific about what we're asking and use a huge model.

In [75]:
prompt = "Did Jonathan have anything to drink with his succulent meals?"

most_similar_chunks = gather_appropriate_chunks_by_title_metadata("dracula", prompt, 15)
stream_response(generate_response(prompt, most_similar_chunks, "gemma3:27b"))

The text notes that Renfield suddenly stopped speaking at the mention of the word "drink" twice, suggesting it is a sensitive topic, but doesn't explicitly state if Jonathan had anything to drink with his meals.

In [77]:
prompt = "What did jonathan think of the chicken?"

most_similar_chunks = gather_appropriate_chunks_by_title_metadata("dracula", prompt, 15)
stream_response(generate_response(prompt, most_similar_chunks, "gemma3:27b"))

The text does not mention anything about Jonathan's thoughts on chickens.

In [79]:
prompt = "Who gave Jonathan excellent roast chicken and what else was served?"

most_similar_chunks = gather_appropriate_chunks_by_title_metadata("dracula", prompt, 15)
stream_response(generate_response(prompt, most_similar_chunks, "gemma3:27b"))

The Count gave Jonathan excellent roast chicken, along with cheese, salad, and a bottle of old Tokay (of which he had two glasses).

In [82]:
stream_response(generate_response(prompt, most_similar_chunks, "gemma3:12b"))

The Count gave Jonathan excellent roast chicken. It was served with cheese, a salad, and a bottle of old Tokay.

In [93]:
stream_response(generate_response(prompt, most_similar_chunks[:5], "gemma3:1b"))

According to the text, the Count himself gave Jonathan excellent roast chicken and a salad.

In [94]:
stream_response(generate_response(prompt, most_similar_chunks[:6], "gemma3:1b"))

According to the text, the Count himself came forward and took off the cover of a dish, and he fell to on an excellent roast chicken.

In [95]:
stream_response(generate_response(prompt, most_similar_chunks[:7], "gemma3:1b"))

The text states: “By this time I had finished my supper, and by **He** took us to his house, where there were rooms for us all nice and comfortable, and we dined together.”