(Super Talkative Understanding Artificial Response Technology)
Stuart: You rang 🛎️ ?
Ask me anything or enter 'q' to exit. Enter 'r' to restart our conversation.
> What is Stuart?
Stuart is a chatbot that assists the customer care team of Open Data Hub in
solving tickets. It uses the Open Data Hub Wiki, past tickets history, and the
readme files of all repositories as inputs to help answer customer inquiries.
>
Changelog of this document
2024-03-27 added note about llama-cpp-python compile options
2024-03-25 first release - Chris Mair chris@1006.org
Stuart uses RAG (retrieval-augmented generation). RAG improves the quality of responses by combining the capabilities of two main components: a retrieval system and a generative model.
The retrieval system searches a database of documents (specifically the Open Data Hub wiki, the past ticket history and the readme files of all related repositories) to find information that is relevant to the user's question. This step is crucial, as it gives the generative model access to knowledge that is not contained in its pre-trained parameters.
The generative model receives a prompt that is constructed from the retrieved information and the user's input. It then generates a coherent, natural text based on that prompt.
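As a toy, self-contained illustration of these two components (not Stuart's actual code), here is a retrieval step over a tiny in-memory "database" followed by building the prompt a generative model would receive; the bag-of-words embedding is just a stand-in for a real embedding model:

# Toy illustration of the two RAG components (not Stuart's actual code):
# retrieval over a tiny in-memory "database", then prompt construction.

import re
import numpy as np

docs = [
    "Stuart is a chatbot for the Open Data Hub customer care team.",
    "PostgreSQL is a relational database.",
]

vocab = sorted({w for d in docs for w in re.findall(r"[a-z0-9]+", d.lower())})

def embed(text: str) -> np.ndarray:
    # toy bag-of-words stand-in for a real sentence embedding model
    counts = np.array([re.findall(r"[a-z0-9]+", text.lower()).count(w)
                       for w in vocab], dtype=float)
    norm = np.linalg.norm(counts)
    return counts / norm if norm else counts

def retrieve(question: str) -> str:
    # nearest-neighbour search; Stuart does this with pgvector in PostgreSQL
    sims = [float(embed(d) @ embed(question)) for d in docs]
    return docs[int(np.argmax(sims))]

question = "What is Stuart?"
prompt = f"Context: {retrieve(question)}\n\nQuestion: {question}\nAnswer:"
print(prompt)  # in Stuart, this prompt is fed to the LLM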
Stuart is a proof-of-concept system built with a few guiding principles:
- the system should run locally (no proprietary APIs)
- it should only rely on Free models
- it should be able to run on modest hardware (no expensive datacenter GPUs)
Currently, there are a few well-known Python packages for building RAG systems, such as LlamaIndex and LangChain. These packages are essentially glue code that abstracts away details about the underlying models and software components. An early prototype of Stuart used LlamaIndex. However, these frameworks are evolving very quickly, are somewhat black-boxy, and the integration between their components as well as the documentation sometimes lag behind that quick progress.
To better understand the underlying technology and to keep things stable and simple, we opted not to rely on any of these frameworks and instead implemented a few functions, such as text chunking and database access, from scratch. The resulting code turned out to be not much longer, but easier to understand, with far fewer dependencies.
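As an illustration, a from-scratch chunker fits in a few lines. This is a sketch of the general idea, not Stuart's actual implementation:

# Sketch of chunking text into overlapping chunks of roughly equal size
# (illustrative only, not Stuart's actual implementation).

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    step = size - overlap
    # each chunk shares `overlap` characters with its neighbour, so that
    # information near a chunk boundary is not lost
    return [text[start:start + size]
            for start in range(0, max(len(text) - overlap, 1), step)]

print(len(chunk_text("x" * 2500)))  # -> 3 overlapping chunks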
Stuart is best run on a *nix OS.
The installation has been tested on macOS 13 with the command line developer tools and on Linux (Debian 12) with the developer tools (the packages build-essential, git and python3-venv must be installed). The developer tools are required, as one Python library does not (yet) offer binary releases and needs to be compiled.
You need about 15 GiB of free space for the model files and Python libraries. Additionally, some space is needed for the PostgreSQL database (less than 200 MiB for the installation at NOI).
Stuart needs a Python 3 environment with venv. The third-party Python libraries are listed in requirements.txt:
llama-cpp-python==0.2.56
psycopg2-binary==2.9.9
sentence-transformers==2.6.0
That being said, installation is as simple as running the following commands as a normal user:
cd ~
git clone https://github.com/noi-techpark/stuart-chatbot
cd stuart-chatbot
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
That's it!
Almost. We also need to install the LLM itself (more below). Download the model file into the home directory (4.8 GiB):
cd ~
curl -LO https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q5_K_M.gguf
Stuart needs a PostgreSQL database server with the pgvector extension.
This can be basically any installation, local or managed.
To install PostgreSQL locally, just follow the steps listed on the official download page. Once the server is up, define a role and a database and activate pgvector:
su - postgres
psql
postgres=# create role rag login password '********';
CREATE ROLE
postgres=# create database ragdb owner rag;
CREATE DATABASE
postgres=# \c ragdb
You are now connected to database "ragdb" as user "postgres".
ragdb=# create extension vector;
CREATE EXTENSION
ragdb=# \q
Once the database is up, load the table definition (here I assume PostgreSQL is running on 127.0.0.1):
cd ~/stuart-chatbot/
psql -h 127.0.0.1 -U rag ragdb < rag/schema.sql
and edit the file with the credentials:
cd ~/stuart-chatbot/
vim rag/secrets_pg.json
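For orientation only, the file could look something like the snippet below; the key names here are an assumption (they mirror typical PostgreSQL connection parameters), so check the file shipped in the repository for the actual format:

{
    "host": "127.0.0.1",
    "port": 5432,
    "user": "rag",
    "password": "********",
    "dbname": "ragdb"
}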
Ready!
Before the chatbot can be used for the first time, we need to scrape the documents:
- the readme markdown files from the relevant NOI Techpark repositories,
- the wiki markdown files from the ODH-Docs wiki,
- the tickets from the ODH Request Tracker installation.
For each category, there is a custom scraper in scrapers/.
Two scrapers are specially crafted for Stuart:
scrape_readme.sh scrapes the readme markdown files from the NOI Techpark repositories on GitHub that are relevant to the Open Data Hub. The links are read from a hand-crafted file (scrape_readme_urls.txt).
scrape_wiki.sh scrapes the wiki markdown files from the ODH-Docs wiki.
Additionally, there is a more generic scraper:
scrape_rt.py scrapes tickets from an installation of Best Practical's Request Tracker. Or, more precisely, it scrapes transactions of type 'Ticket created', 'Correspondence added' or 'Comments added'. Remember to set up the location and credentials of the Request Tracker installation in 'scrape_rt.json'!
The documents are stored in the ~/stuart-chatbot/data_* directories.
Currently, scraping the readmes and the wiki just takes a few seconds, but scraping the tickets takes a few hours. Luckily, scrape_rt.py works incrementally, but it still needs about 20 minutes to check each ticket for new transactions.
The easiest way to run all these scripts is to set up a cronjob that runs cron/cron-scrape.sh, which will take care of everything.
Before the chatbot can be used for the first time, we also need to "RAG" the documents. RAGging means:
- read all the files from the ~/stuart-chatbot/data_* directories
- chunk them into overlapping chunks of roughly equal size
- call a sentence embedding model (see Wikipedia) to encode meaningful semantic information from each chunk into a point in a high-dimensional vector space
- store the file name, chunk and vector into PostgreSQL
That's the job of rag/load.py.
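In essence, this boils down to something like the following sketch. It is illustrative only: the table name ragdata and the tag column are mentioned elsewhere in this document, but the other column names are assumptions; the real code is in rag/load.py.

# Illustrative sketch of the RAGging step (the real code is in rag/load.py;
# column names other than "ragdata" and "tag" are assumptions).

import psycopg2
from sentence_transformers import SentenceTransformer

# bge-m3, pinned to the revision given in the model notes further down
model = SentenceTransformer("BAAI/bge-m3",
                            revision="5a212480c9a75bb651bcb894978ed409e4c47b82")

conn = psycopg2.connect(host="127.0.0.1", user="rag",
                        password="********", dbname="ragdb")

for filename, chunk in [("data_wiki/example.md", "The Christmas markets ...")]:
    # encode the chunk into a point in a high-dimensional vector space;
    # pgvector accepts the textual "[x, y, ...]" input format
    vec = str(model.encode(chunk).tolist())
    with conn.cursor() as cur:
        cur.execute("insert into ragdata (tag, filename, chunk, embedding) "
                    "values (%s, %s, %s, %s)",
                    ("wiki", filename, chunk, vec))
conn.commit()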
When it runs for the first time, it will automatically download the sentence embedding model (2.1 GiB) and put it into ~/.cache.
rag/load.py runs the model using the sentence-transformers library, which is based on PyTorch. The run time very much depends on the capabilities of your hardware; on a system with a single CPU core this might also take a few hours. Luckily, load.py also works incrementally: if all documents have already been loaded into PostgreSQL, it will exit after a few seconds.
Note that load.py never deletes or updates documents in the database; it only adds new ones. For ticket transactions this is fine. However, wiki pages and readmes change, so it is a good idea to delete these from time to time, so they can be RAGged again. This can be done in PostgreSQL with delete from ragdata where tag in ('readme', 'wiki');.
Again, there is a handy script that can be called from crontab: cron/cron-load.sh.
Currently, the chatbot is available as an interactive command line interface only.
Run it with these commands:
cd ~/stuart-chatbot/
source .venv/bin/activate
cd rag/
python query.py
This will get you into an easy-to-use endless loop with the chatbot. Here is a sample session!
Let's break down the components.
- The user asks "Come posso ottenere informazioni sui mercatini di Natale di Bolzano?" ("How can I get information about the Christmas markets in Bolzano?").
- This piece of text is embedded, i.e. transformed into a vector. A query is run to find the closest vectors stored in PostgreSQL, and the top-5 matches are shown (lines starting with {meta} are debug output). The best match is actually the right document: it's a wiki page talking about the Christmas markets (here).
Pause a moment to think about how powerful semantic search is! We use a multi-language embedding model, so the question is close to the wiki document because both refer to the meaning "Christmas markets", regardless of the fact that the texts are completely different. They are not even in the same language (!). A small experiment after this list shows the effect in isolation.
- The code proceeds to build a prompt using the original question and the chunk from the wiki document and feeds it into the LLM.
- The LLM answers with a (presumably) correct text.
- The user asks a follow-up question: "E qual'è il TourismVereinId di Bolzano?" ("And what is the TourismVereinId of Bolzano?").
- Now the LLM's answer plus the new question is again embedded and searched for (leading to the same document being found as best match).
- A new prompt is built using the follow-up question and the same chunk, and fed again to the LLM.
- The LLM answers with the information. "5228229451CA11D18F1400A02427D15E" is indeed correct.
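To see the cross-language effect in isolation, here is a small self-contained experiment (the example sentences are made up for illustration; run it inside Stuart's virtualenv):

# Minimal demonstration of cross-language semantic similarity with bge-m3
# (the example sentences are made up for illustration).

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-m3")

question = "Come posso ottenere informazioni sui mercatini di Natale di Bolzano?"
related = "This page describes the dataset covering the Christmas markets."
unrelated = "How to configure a PostgreSQL connection pool."

q, r, u = model.encode([question, related, unrelated])
print(util.cos_sim(q, r).item())  # high: same meaning, different language
print(util.cos_sim(q, u).item())  # lower: unrelated topic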
It is important to point out that LLMs, as is well known, tend to hallucinate. So any information should be double-checked!
The sentence embedding model is bge-m3 (license: MIT). We pinned the version to 5a212480c9a75bb651bcb894978ed409e4c47b82 (2024-03-21).
The model is quite large (2.1 GiB) for sentence embedding models, but performs very well, can embed a variety of text sizes from short sentences to longer documents (8192 tokens) and has been trained on many languages.
The model is instantiated in rag/librag.py.
The LLM is Mistral-7B-Instruct-v0.2 (license: Apache 2). We use a version in GGUF format with parameters quantized to ~ 5 bits (here) and run the inference using llama-cpp-python.
The model performs very well given its relatively small size of 7 billion parameters (4.8 GiB in the quantized version). Besides English, it also understands German and Italian, but doesn't speak them well.
From the same company, Mistral AI, a second model is available under the Apache 2 license: Mixtral-8x7B-Instruct-v0.1. This model is about 6 times larger, but only twice as slow.
Some quick tests suggest that the quality of Stuart depends very much on the search, and not so much on the LLM. So the extra size of Mixtral might not be worth it.
The model is instantiated in rag/query.py.
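For reference, loading and querying the model with llama-cpp-python boils down to something like this sketch (the parameter values are examples, not Stuart's actual settings):

# Sketch of running the quantized Mistral model with llama-cpp-python
# (parameter values are examples, not Stuart's actual settings).

import os
from llama_cpp import Llama

llm = Llama(model_path=os.path.expanduser(
                "~/mistral-7b-instruct-v0.2.Q5_K_M.gguf"),
            n_ctx=4096)  # context window in tokens

out = llm("[INST] What is the Open Data Hub? [/INST]", max_tokens=256)
print(out["choices"][0]["text"])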
Stuart requires at least 16 GB of RAM.
A single CPU core is enough to run it, but answers take 1-5 minutes with a single core. More cores improve the performance up to a point where LLM inference becomes memory-bandwidth-bound.
When scaling up the core count, for example using a cloud-based VM, check that the additional cores are not starved by insufficient memory bandwidth. This typically starts to happen between 4 and 16 cores, depending on per-core floating-point performance and memory bandwidth.
When the llama-cpp-python package is installed, the underlying inference code (llama-cpp) is compiled for the actual hardware using a number of default settings. One of those settings determines the maximum number of threads to use: the default is half the number of logical cores, so as to match the number of physical cores.
VMs in the cloud normally expose logical cores, but the underlying host might match all logical cores present in the VM to physical cores on the host system. So the default of using a number of threads equal to only half the number of logical cores leaves some performance on the table.
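Note that, independently of how the package was compiled, llama-cpp-python also lets you set the thread count explicitly when the model is loaded, which makes it easy to experiment (a sketch; 4 is just an example value):

# Overriding the thread count at model load time (sketch; 4 is an example).

import os
from llama_cpp import Llama

llm = Llama(model_path=os.path.expanduser(
                "~/mistral-7b-instruct-v0.2.Q5_K_M.gguf"),
            n_threads=4)  # try values up to the number of logical cores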
We've found that for VMs with a small number of logical cores ("vCPUs"), such as 2, performance can be improved by compiling llama-cpp to use the OpenBLAS backend, which spawns as many threads as there are logical cores. It's very easy to change an existing installation of Stuart to make use of this. You need to install additional packages first (on Debian: libopenblas-dev and pkg-config) and force a re-installation of llama-cpp-python:
cd ~/stuart-chatbot/
source .venv/bin/activate
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install --force-reinstall --no-cache-dir llama-cpp-python==0.2.56
For VMs with a larger number of logical cores this is no longer true, as the host can't map that many physical cores into the VM. In that case, performance with the OpenBLAS backend might be worse. You can get back to a default installation with:
cd ~/stuart-chatbot/
source .venv/bin/activate
pip install --force-reinstall --no-cache-dir llama-cpp-python==0.2.56
You can explore even more backends (see llama-cpp-python supported backends), such as cuBLAS (for Nvidia GPUs) or Metal (for Mac GPUs). Typically, response times drop to between a few seconds and about ten seconds when GPUs are used.