# RAG and tatters

A threadbare implementation of retrieval-augmented generation (RAG).

## setup

In [6]:
%pip install html2text


In [7]:
import torch as t

import requests
from bs4 import BeautifulSoup
import html2text

### utils

In [12]:
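def wikipedia_to_markdown(url):
    """Fetch a Wikipedia page and convert its article body to markdown."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    soup = BeautifulSoup(response.content, 'html.parser')
    # 'mw-parser-output' is the div that holds the rendered article body
    content = soup.find('div', {'class': 'mw-parser-output'})
    if content is None:
        raise ValueError(f'no article body found at {url}')
    return html2text.html2text(str(content))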

## document chunking

One option is to ingest full documents; another is to chunk them into smaller pieces. Let's compare a few chunking schemes.

In [17]:
document = wikipedia_to_markdown('https://en.wikipedia.org/wiki/Magician_(fantasy)')

### fixed size

Split the text into fixed-size chunks. For simplicity I'll fix the size in characters, though it would be smarter to split by number of tokens instead.

In [26]:
def chunk_fixed_size(text, size):
    return [text[i:i+size] for i in range(0, len(text), size)]

chunks = chunk_fixed_size(document, 100)
chunks[:4]

['Magicians appearing in fantasy fiction\n\nFor other uses, see [Magician\n(disambiguation)](/wiki/Magici',
 'an_\\(disambiguation\\) "Magician\n\\(disambiguation\\)") and [Magi (disambiguation)](/wiki/Magi_\\(disamb',
 'iguation\\)\n"Magi \\(disambiguation\\)").\n\n"Wizard (fantasy)" redirects here. For other uses, see [Wiza',
 'rd\n(disambiguation)](/wiki/Wizard_\\(disambiguation\\) "Wizard\n\\(disambiguation\\)").\n\n[![](//upload.wi']
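
As a rough sketch of the token-based alternative mentioned above, the same idea can be applied to a list of tokens instead of characters. The helper name `chunk_fixed_tokens` is hypothetical, and whitespace splitting is only a stand-in for a real tokenizer (an assumption made to keep the example dependency-free); note that it also collapses newlines into single spaces.

```python
def chunk_fixed_tokens(text, size):
    """Split text into chunks of at most `size` whitespace-delimited tokens."""
    # Assumption: str.split() stands in for a real model tokenizer.
    tokens = text.split()
    return [' '.join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

# Made-up sample sentence, 10 whitespace tokens long.
sample = 'Magicians appearing in fantasy fiction. For other uses, see Magician.'
token_chunks = chunk_fixed_tokens(sample, 4)
print(token_chunks)
```

With a real tokenizer, the join step would be replaced by that tokenizer's own decoding so the original spacing is preserved.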

In [29]:
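def chunk_fixed_size_overlap(text, size, overlap):
    if not 0 <= overlap < size:  # keep the stride positive
        raise ValueError('overlap must satisfy 0 <= overlap < size')
    # Advance by size - overlap so neighbouring chunks share `overlap` characters.
    return [text[i:i+size] for i in range(0, len(text), size - overlap)]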

chunks = chunk_fixed_size_overlap(document, 100, 10)
chunks[:4]

['Magicians appearing in fantasy fiction\n\nFor other uses, see [Magician\n(disambiguation)](/wiki/Magici',
 'iki/Magician_\\(disambiguation\\) "Magician\n\\(disambiguation\\)") and [Magi (disambiguation)](/wiki/Mag',
 '(/wiki/Magi_\\(disambiguation\\)\n"Magi \\(disambiguation\\)").\n\n"Wizard (fantasy)" redirects here. For o',
 'ere. For other uses, see [Wizard\n(disambiguation)](/wiki/Wizard_\\(disambiguation\\) "Wizard\n\\(disambi']
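
A quick way to convince yourself that the overlap works is to compare the tail of each chunk with the head of the next one. The sketch below does this on a made-up alphabet string; `sliding_chunks` is a hypothetical stand-in that mirrors the sliding-window logic above so the example runs on its own.

```python
def sliding_chunks(text, size, overlap):
    """Fixed-size chunks where neighbours share `overlap` characters."""
    step = size - overlap  # assumed positive, i.e. overlap < size
    return [text[i:i + size] for i in range(0, len(text), step)]

sample = 'abcdefghijklmnopqrstuvwxyz'  # made-up sample text
pieces = sliding_chunks(sample, 10, 3)

# Pair the last 3 characters of each chunk with the first 3 of the next.
pairs = [(a[-3:], b[:3]) for a, b in zip(pieces, pieces[1:])]
print(pieces)
print(pairs)
```

Every pair should contain two identical strings. The final chunk is usually shorter than `size`, since it holds whatever is left at the end of the text.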

### recursive character split

Split on a hierarchy of separators (e.g. `'\n\n'`, `'\n'`, `' '`), preferring the coarsest one that yields a chunk within the desired size. This is meant to preserve more structure than a simple fixed-size split.

In [46]:
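'abc def ghij'.rfind('')  # 12 == len(s): the empty string always matches, at the very end
'abc def ghij'[12]        # so that index is one past the last character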

IndexError: string index out of range
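
To make the role of the empty separator concrete, here is a small check of what `rfind` returns for each separator inside a size-limited window. The sample string and the window size of 6 are made up for illustration.

```python
sample = 'abc def ghij'  # made-up sample text
window = sample[:6]      # 'abc de', the first 6 characters

# Look up each separator from coarsest to finest; -1 means "not found".
found = {sep: window.rfind(sep) for sep in ['\n\n', '\n', ' ', '']}
print(found)
```

The two newline separators are missing, the space is found at index 3, and `''` matches at index 6, the end of the window. So if nothing else matches, cutting at the `''` position degrades to a plain fixed-size split.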

In [48]:
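def chunk_recursive_character_split(text, size):
    if size <= 0: raise ValueError('size must be positive')
    separators = ['\n\n', '\n', ' ', ''] # Note: '' always matches, so it is the fallback
    chunks = []
    while len(text) > size:  # loop instead of recursing, to stay clear of the recursion limit
        for separator in separators:
            # Cut right after the last occurrence of the coarsest separator in the window.
            if (index := text[:size].rfind(separator)) != -1:
                index += len(separator)
                break
        chunks.append(text[:index])
        text = text[index:]
    return chunks + [text]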

chunks = chunk_recursive_character_split(document, 100)
chunks[:4]

['Magicians appearing in fantasy fiction\n\n',
 'For other uses, see [Magician\n(disambiguation)](/wiki/Magician_\\(disambiguation\\) "Magician\n',
 '\\(disambiguation\\)") and [Magi (disambiguation)](/wiki/Magi_\\(disambiguation\\)\n',
 '"Magi \\(disambiguation\\)").\n\n']
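
Since the point of this section is to compare schemes, a small summary of chunk lengths can help: the recursive split produces chunks of varying size, as the output above shows. The helper `chunk_length_stats` and the toy chunk list are hypothetical, a minimal sketch rather than a proper evaluation.

```python
from statistics import mean

def chunk_length_stats(chunks):
    """Summarise chunk lengths (in characters) for a list of strings."""
    lengths = [len(chunk) for chunk in chunks]
    return {
        'count': len(lengths),
        'min': min(lengths),
        'max': max(lengths),
        'mean': mean(lengths),
    }

# Toy chunks standing in for the output of any of the chunkers above.
stats = chunk_length_stats(['aaaa', 'bb', 'cccccc'])
print(stats)
```

Running it on the output of each chunker with the same `size` gives a rough picture of how much of the size budget each scheme actually uses.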