## Contextual Embedding

> For while man strives he errs.  

&mdash; [Goethe, *Faust, Part 1* (Kline translation).](https://www.gutenberg.org/files/14591/14591-h/14591-h.htm#PROLOGUE_IN_HEAVEN)

It's great fun to chat with a large language model about a book you have both read.  But as the LLM is scaled down in size, the quality of the conversation diminishes.  This project is an experiment to see how a smaller LLM performs at this task when retrieval-augmented generation (RAG) with contextual retrieval techniques is applied.  This Anthropic [blog post](https://www.anthropic.com/news/contextual-retrieval) and [guide](https://github.com/anthropics/anthropic-cookbook/blob/main/skills/contextual-embeddings/guide.ipynb) are used as references, but adapted to run locally on my old computer.

How to get started:
```bash
docker compose up -d
python3.9 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

In [6]:
import json
import warnings
from lxml import etree
import ollama
import psycopg
from tqdm import tqdm

warnings.filterwarnings("ignore", category=DeprecationWarning) 

MODEL='llama3.2:1b'
DB_DSN='host=127.0.0.1 port=5433 dbname=postgres user=postgres password=password'

In [7]:
llm = ollama.Client(host='http://127.0.0.1:11435')

In [18]:
llm.pull(MODEL)

{'status': 'success'}

In [9]:
def llm_generate(prompt, chunks):
    # https://github.com/ollama/ollama-python
    data = '\n'.join([c[0] for c in chunks])
    stream = llm.generate(
        model=MODEL, 
        prompt=f'Using this data: "{data}". Respond to this prompt: {prompt}', 
        stream=True
    )
    for chunk in stream:
        response = chunk['response']
        print(response, end='', flush=True)
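
To see exactly what the model receives, here is a minimal sketch that builds the same prompt string without calling Ollama. `build_prompt` is a hypothetical helper, not part of the notebook; it only mirrors the string formatting in `llm_generate`.

```python
def build_prompt(prompt, chunks):
    """Assemble the prompt string the way llm_generate does (no model call)."""
    # Rows come back from the database as 1-tuples, so c[0] is the chunk text.
    data = '\n'.join(c[0] for c in chunks)
    return f'Using this data: "{data}". Respond to this prompt: {prompt}'


# With no chunks the data section is empty, which is the "No RAG" case.
no_rag = build_prompt('What was the bargain?', [])

# With chunks, the texts are joined by newlines inside the quotes.
with_rag = build_prompt('What was the bargain?', [('chunk one',), ('chunk two',)])
print(with_rag)
```

Note that the "No RAG" run still sends the `Using this data: ""` wrapper, so the model sees an empty data section rather than the bare question.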

### No RAG
**Grade: D**

It is confusing an American Western writer with Goethe.  The details of the deal are what make it interesting, and this answer is too vague.

In [52]:
PROMPT_BARGAIN = "In Goethe's Faust what was the bargain he made with the devil?"

In [55]:
llm_generate(PROMPT_BARGAIN, [])

In Friedrich Schiller's translation of Goethe's Faust, and later in Goethe's original German text, Faust makes a bargain with the Devil (also known as Mephistopheles) to sell his soul for 24 years of earthly pleasure and knowledge.

However, it is worth noting that there are slight variations in different translations and interpretations of Faust's bargain. But in general, the core idea remains the same: Faust agrees to give up his mortality and earthly concerns in exchange for a lifetime of earthly pleasures and knowledge gained through his studies with Mephistopheles.

Faust's bargain is often seen as a symbol of the human desire for knowledge, power, and spiritual growth, while also acknowledging the dangers and consequences of pursuing these things at the expense of one's mortality and inner being.

In [22]:
def read_faust():
    'Read the play and chunk it based on part, act, scene, and character speaking.'

    # https://lxml.de/api.html#iteration
    # https://shallowsky.com/blog/programming/parsing-html-python.html
    
    faust = 'Faust, Goethe & A. S. Kline.html'
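    # Named `parser` rather than `iter` to avoid shadowing the builtin.
    parser = etree.iterparse(faust, html=True, events=('start', 'end'))

    # skip header
    for event, element in parser:
        if event == 'start' and element.tag == 'body':
            break

    unwanted = {
        'line-number',
        'small-font'     # picture description
    }

    acc = ''
    h1 = ''
    h2 = ''
    h3 = ''
    character = ''

    def chunk():
        # Emit the accumulated text with its location, skipping empty chunks.
        if acc.strip():
            yield (h1, h2, h3, character), acc

    # Continue from where the header skip stopped.
    for event, element in parser: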
        classes = set(element.attrib.get('class', '').split(' '))
        text = (element.text or '').strip()
        tail = (element.tail or '').strip()
        if text and not classes & unwanted: 
            if event == 'start' and element.tag == 'h1':
                yield from chunk()
                acc = ''
                h1 = text
                h2 = ''
                h3 = ''
                character = ''
            elif event == 'start' and element.tag == 'h2':
                yield from chunk()
                acc = ''
                h2 = text
                h3 = ''
                character = ''
            elif event == 'start' and element.tag == 'h3':
                yield from chunk()
                acc = ''
                h3 = text
                character = ''
            elif event == 'start' and 'play-char' in classes:
                yield from chunk()
                acc = ''
                character = text
            elif event == 'start':
                acc += text
            elif event == 'end':
                acc += tail
                acc += '\n'
    yield from chunk()
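
The chunking idea above (start a new chunk at every heading or speaker name) can be tried on a toy document. This is a simplified sketch that swaps lxml's `iterparse` for the standard library's `html.parser` so it runs without extra packages. The sample HTML, the `SpeakerChunker` class, and the assumption that speaker names sit in `<p class="play-char">` are all made up for illustration.

```python
from html.parser import HTMLParser


class SpeakerChunker(HTMLParser):
    """Toy chunker: start a new chunk at each <h3> or 'play-char' element."""

    def __init__(self):
        super().__init__()
        self.chunks = []      # finished ((scene, character), text) pairs
        self.scene = ''
        self.character = ''
        self.acc = []         # text collected for the current chunk
        self.target = None    # where the next text node should be stored

    def flush(self):
        text = ' '.join(self.acc).strip()
        if text:
            self.chunks.append(((self.scene, self.character), text))
        self.acc = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'h3':
            self.flush()
            self.character = ''
            self.target = 'scene'
        elif 'play-char' in classes:
            self.flush()
            self.target = 'character'
        else:
            self.target = None

    def handle_endtag(self, tag):
        self.target = None

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self.target == 'scene':
            self.scene = data
        elif self.target == 'character':
            self.character = data
        else:
            self.acc.append(data)


SAMPLE = '''
<h3>Scene X</h3>
<p class="play-char">Speaker A</p>
<p>first line</p>
<p class="play-char">Speaker B</p>
<p>second line</p>
'''

chunker = SpeakerChunker()
chunker.feed(SAMPLE)
chunker.close()
chunker.flush()   # emit the final chunk
sample_chunks = chunker.chunks
print(sample_chunks)
```

Each chunk carries its location alongside the text, which is the same shape `read_faust` yields, only with two levels instead of four.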

In [27]:
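def location_context(chunk_location, acc):
    'Prefix the chunk text with its non-empty location parts (document, part/act, scene, character).'
    return ','.join(x for x in chunk_location if x) + '\n' + acc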
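
A quick illustration of what this produces. The function is repeated here so the snippet runs on its own, and the second example's tuple is a made-up input with empty levels.

```python
def location_context(context, acc):
    # Join only the non-empty location parts, then put the text on the next line.
    return ','.join(x for x in context if x) + '\n' + acc


full = location_context(
    ('Faust: Parts I & II', 'Act V',
     'Scene VI: The Great Outer Court of the Palace', 'Chorus'),
    'It’s past.',
)

# Empty levels (no act, no speaker) are simply dropped from the header line.
partial = location_context(('Faust: Parts I & II', '', 'Some Scene', ''), 'text')
print(full)
print(partial)
```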

In [28]:
# Debugging purposes
with open('faust-chunks.txt', 'w') as f:
    for context, acc in read_faust():
        f.write(location_context(context, acc))
        f.write('================================================\n')
        f.flush()

In [92]:
# Create the vector database table
with psycopg.connect(DB_DSN) as conn:
    with conn.cursor() as cursor:
        cursor.execute("""
            CREATE EXTENSION IF NOT EXISTS vector;
            CREATE TABLE IF NOT EXISTS embed (
                id bigserial,
                technique smallint,
                value vector(2048),
                chunk text
            );""")
        conn.commit()

In [40]:
def insert_embeddings(technique, read_fn, add_context, skip):
    'Insert the play embeddings and chunks into the vector database'
    # https://ollama.com/blog/embedding-models
    # https://github.com/pgvector/pgvector/tree/master?tab=readme-ov-file#storing
    # Count the chunks from the same reader that is iterated below,
    # so the progress bar total matches.
    num_embeddings = sum(1 for _ in read_fn())
    with psycopg.connect(DB_DSN) as conn:
        with conn.cursor() as cursor:
            for i, chunk in tqdm(enumerate(read_fn()), total=num_embeddings):
                if i >= skip:
                    context_chunk = add_context(chunk)
                    resp = llm.embed(MODEL, context_chunk)
                    value = json.dumps(resp['embeddings'][0])
                    cursor.execute('INSERT INTO embed(technique, value, chunk) VALUES (%s, %s, %s);', (technique, value, chunk))
                    conn.commit()
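
The `json.dumps` call works here because a JSON array of floats has the same bracketed, comma-separated shape that pgvector accepts as text input for a `vector` value. A minimal sketch of just that conversion; `to_vector_literal` is a hypothetical helper name, and the three-element embedding is a stand-in for the 2048-dimensional one.

```python
import json


def to_vector_literal(embedding):
    """Serialize a list of floats as a bracketed, comma-separated string."""
    return json.dumps(embedding)


literal = to_vector_literal([0.5, -1.0, 2.0])
print(literal)
```

The string is passed as an ordinary query parameter, so no pgvector-specific Python adapter is needed for this approach.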

In [51]:
RAG = 0

In [158]:
insert_embeddings(RAG, read_faust, location_context, 0)

100%|██████████| 261/261 [00:00<00:00, 1457.71it/s]


In [43]:
def retrieve_chunks(technique, prompt):
    resp = llm.embed(MODEL, prompt)
    query_param = json.dumps(resp['embeddings'][0])
    with psycopg.connect(DB_DSN) as conn:
        with conn.cursor() as cursor:
            cursor.execute('SELECT chunk FROM embed WHERE technique=%s ORDER BY value <-> %s LIMIT 20;', [technique, query_param])
            return cursor.fetchall()
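
In pgvector, `<->` orders by Euclidean (L2) distance, so the query returns the 20 stored chunks whose embeddings are closest to the prompt's embedding. A pure-Python sketch of the same ranking, using made-up two-dimensional vectors; `l2_distance` and `top_k` are hypothetical helpers for illustration only.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, rows, k=20):
    """rows is an iterable of (embedding, chunk); return the k nearest chunks."""
    ranked = sorted(rows, key=lambda row: l2_distance(query, row[0]))
    return [chunk for _, chunk in ranked[:k]]


rows = [
    ([0.0, 1.0], 'far'),
    ([1.0, 0.0], 'near'),
    ([0.9, 0.1], 'close'),
]
nearest = top_k([1.0, 0.0], rows, k=2)
print(nearest)
```

pgvector also offers `<=>` for cosine distance, which may be worth comparing since it ignores vector length; I have not tested that here.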

In [53]:
chunks = retrieve_chunks(RAG, PROMPT_BARGAIN)
chunks

[('Faust: Parts I & II,Act V,Scene VI: The Great Outer Court of the Palace,Chorus\n\n\xa0\n\n\n It’s past.\n\xa0\n\n\n',),
 ('Faust: Parts I & II,Act V,Scene I: Open Country,The Wanderer\n\n\n\n\n\n\xa0\n\n\n Yes! Here are the dusky lindens,\nStanding round, in mighty age.\n\nAnd here am I, returning to them, \n\nAfter so long a pilgrimage!\n\nIt still appears the same old place:\n\nHere’s the hut that sheltered me,\n\nWhen the storm-uplifted wave,\n\nHurled me shore-wards from the sea! \n\nMy hosts are those I would bless,\n\nA brave, a hospitable pair,\n\nWho if I meet them, I confess,\n\nMust already be white haired.\n\nAh! They were pious people! \n\nShall I call, or knock? – Greetings,\n\nIf, as open-hearted, you still\n\nEnjoy good luck, in meetings!\n\n\xa0\n\n\n',),
 ('Faust: Parts I & II,Part II,Scene III: A Spacious Hall with Adjoining Rooms,The Boy Charioteer\n Let’s hear more! Go on: go on,\nFind the riddle’s bright solution.\n\n\xa0\n\n',),
 ('Faust: Parts I & II,Part I,Sc

### RAG
**Grade: C-**

A little better than last time; at least it's not confusing the author.  There are a lot more details, but unfortunately most are wrong.

In [54]:
llm_generate(PROMPT_BARGAIN, chunks)

In Johann Wolfgang von Goethe's epic drama "Faust", the bargain Faust makes with the devil, Mephistopheles, is as follows:

At the end of Act II, Scene III, Faust and Mephistopheles agree to a deal: Faust will serve Mephistopheles for seven years and then return to Earth to live out his mortal life. In exchange, Faust will become more beautiful and successful, while Mephistopheles will ensure that Faust's time on earth is filled with joy and pleasure.

The bargain is sealed when Mephistopheles presents Faust with a ring containing the seven deadly sins, and Faust accepts it. This act symbolizes Faust's surrender of his moral integrity and his willingness to abandon his former life of virtue for material success and earthly pleasures.

Throughout the play, Faust's negotiations with Mephistopheles serve as a metaphor for the human desire for knowledge, power, and pleasure. Faust is drawn into a world of illusions and temptations, but he is ultimately unable to resist the allure of the de

In [34]:
def read_faust_scene_context(debug=False):
    "Faust is really long and exceeds llama3.2's context window, so the context is built per scene instead of per document."
    current_scene = (None, None, None)
    dialog = []

    def chunk():
        dialog_body = '\n'.join(dialog)
        if debug:
            dialog_body = dialog_body[:20] + '...' + dialog_body[-20:]
        scene_body = 'In ' + ','.join(x for x in current_scene if x) + '\n' + dialog_body
        for quote in dialog:
            yield scene_body, quote

    for ((document, part_or_act, scene, character), quote) in read_faust():
        if current_scene != (document, part_or_act, scene):
            yield from chunk()
            dialog = []
        
        if character:
            quote = f'{character} says:\n{quote}'

        dialog += [quote]
        current_scene = (document, part_or_act, scene)
    yield from chunk()
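
The scene grouping can also be sketched with `itertools.groupby`, which may be easier to follow than tracking `current_scene` by hand. This is an alternative formulation on made-up input, not the code used for the results; `group_by_scene` is a hypothetical name, and it assumes chunks of the same scene arrive consecutively, as they do when reading the play in order.

```python
from itertools import groupby


def group_by_scene(chunks):
    """chunks: iterable of ((document, part_or_act, scene, character), quote)."""
    # Consecutive chunks sharing the first three location parts form one scene.
    for scene, items in groupby(chunks, key=lambda c: c[0][:3]):
        quotes = [
            f'{ctx[3]} says:\n{quote}' if ctx[3] else quote
            for ctx, quote in items
        ]
        scene_body = 'In ' + ','.join(x for x in scene if x) + '\n' + '\n'.join(quotes)
        # Every quote is paired with the full text of the scene it belongs to.
        for quote in quotes:
            yield scene_body, quote


sample = [
    (('Doc', 'Part I', 'Scene A', 'X'), 'one'),
    (('Doc', 'Part I', 'Scene A', 'Y'), 'two'),
    (('Doc', 'Part I', 'Scene B', ''), 'three'),
]
pairs = list(group_by_scene(sample))
for scene_body, quote in pairs:
    print(repr(scene_body), '->', repr(quote))
```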

In [30]:
# Debugging purposes
with open('faust-context-chunks.txt', 'w') as f:
    for scene_body, quote in read_faust_scene_context(debug=True):
        f.write(scene_body)
        f.write('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n')
        f.write(quote)
        f.write('================================================\n')
        f.flush()

In [37]:
CONTEXT_PROMPT = '''
<scene> 
{} 
</scene> 
Here is the chunk we want to situate within the whole scene.
<chunk> 
{} 
</chunk> 
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else. 
'''

def scene_context(chunk):
    (scene_body, quote) = chunk
    context = llm.generate(
        model=MODEL, 
        prompt=CONTEXT_PROMPT.format(scene_body, quote)
    )['response']
    return f'{quote}\n\n{context}'
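
To check the prompt formatting without waiting on the model, the generation step can be swapped for a stub. In this sketch `situate` and `fake_generate` are hypothetical names, and the prompt template is shortened; the real call goes through `llm.generate` as shown above.

```python
CONTEXT_PROMPT = '''
<scene>
{}
</scene>
Here is the chunk we want to situate within the whole scene.
<chunk>
{}
</chunk>
'''

seen_prompts = []


def fake_generate(prompt):
    """Stand-in for the LLM call: records the prompt, returns fixed text."""
    seen_prompts.append(prompt)
    return 'PLACEHOLDER CONTEXT'


def situate(chunk, generate):
    scene_body, quote = chunk
    # The two positional {} slots are filled in order: scene first, then chunk.
    context = generate(CONTEXT_PROMPT.format(scene_body, quote))
    # The text that gets embedded is the quote followed by the generated context.
    return f'{quote}\n\n{context}'


embedded_text = situate(('whole scene text', 'one quote'), fake_generate)
print(embedded_text)
```

Passing the generator in as an argument is only a convenience for testing; it does not change what gets embedded.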

In [41]:
CONTEXTUAL_EMBEDDING = 1
insert_embeddings(CONTEXTUAL_EMBEDDING, read_faust_scene_context, scene_context, 408)

100%|██████████| 2082/2082 [59:02:05<00:00, 102.08s/it]   
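
A rough check of the timing in that progress bar. The elapsed time and totals are taken from the output above; the second figure rests on my assumption that the 408 skipped iterations took negligible time, so all of the elapsed time went to the remaining chunks.

```python
def hms_to_seconds(hms):
    """Convert an 'H:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(':'))
    return hours * 3600 + minutes * 60 + seconds


TOTAL_CHUNKS = 2082   # total reported by the progress bar
SKIPPED = 408         # the skip argument passed to insert_embeddings

elapsed = hms_to_seconds('59:02:05')
avg_all = elapsed / TOTAL_CHUNKS
# Assumption: skipped iterations cost almost nothing.
avg_processed = elapsed / (TOTAL_CHUNKS - SKIPPED)
print(f'{avg_all:.2f} s per iteration, {avg_processed:.1f} s per processed chunk')
```

The first figure matches the rate shown in the bar, which suggests the displayed rate is an average over all iterations, skipped ones included.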


In [47]:
context_chunks = retrieve_chunks(CONTEXTUAL_EMBEDDING, PROMPT_BARGAIN)
context_chunks

[('("In Faust: Parts I & II,Part II,Scene IV: A Pleasure Garden in the Morning Sun\n\n(The Emperor, his Court, Noblemen and Ladies: Faust and Mephistopheles dressed fashionably but not ostentatiously, both kneel.)\n\n\nFaust says:\nSire, forgive the fiery conjuring tricks?\n\nThe Emperor says:\n(\nBeckoning to him to rise.)\nMore fun, in that vein, would be my wish. –\nAt once, I saw myself in a glowing sphere,\nIt seemed as if I were divine Pluto, there.\nA rocky depth of mine, and darkness, lay\nGlowing with flame: out of each vent played\nA thousand wild and whirling fires,\nAnd flickered in the vault together, higher,\nLicking upwards to the highest dome,\nThat now seemed there, and now was gone.\nThrough a far space wound with fiery pillars,\nI saw a long line of people approach us,\nCrowding till they formed a circle near,\nAnd paid me homage, as they do forever.\nFrom Court, I knew one face, and then another’s,\nI seemed the Prince of a thousand salamanders.\n\nMephistopheles sa

### RAG with Scene Context Embedding
**Grade: B-**

Improved again, but it is still confusing details, like the 24 years, which comes from Mann's *Doctor Faustus* rather than Goethe.  Less here is outright wrong, but it is still not nearly as good as what larger LLMs produce.

In [48]:
llm_generate(PROMPT_BARGAIN, context_chunks)

In Goethe's Faust, Faust's bargain with the devil is a central theme throughout the play. After making his deal with Mephistopheles on Walpurgis Night, Faust agrees to renounce his humanity and become "enlightened" in exchange for 24 years of immortal life.

Specifically, Faust makes the following bargains:

1. **Immortal Life**: In return for not being able to die until he is 64 years old (since Faust's death occurred before this was a concern), Faust is granted eternal life.
2. **Knowledge and Power**: Faust agrees to surrender his natural human emotions, including love, desire, and compassion, in exchange for all the knowledge and power of humanity.

However, Mephistopheles' ultimate goal is not just to grant Faust immortality but also to corrupt him through his newfound power and knowledge. Faust's bargain comes with a terrible price: he loses his humanity, his sense of self, and ultimately, his soul.

Throughout the play, Goethe explores the complexities of Faust's bargain, highli