### gpt-toolbox 🧰 modules: retrieval

*Utilities for "chatting your data"*

In [1]:
%load_ext autoreload
%autoreload 2

import json
import os
import sys
from tqdm.notebook import tqdm

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..", "src"))) # hack for importing src/

The `retrieval` module exports `Retrievers`. These classes are essentially wrappers over a data source: they extract meaningful chunks from the ingested data before storing it, and then provide useful higher-level search functions over that data. This pattern is meant to accommodate the "chat my data" use case. As we'll show later with `PythonRetriever`, it neatly abstracts the operations of parsing Python (extraction), storing it, and making it queryable.
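To make the pattern concrete, here is a minimal sketch of an extract, store, query wrapper. Everything in it (`Chunk`, `InMemoryStore`, `ToyRetriever`, `split_paragraphs`) is a hypothetical stand-in invented for illustration, not the gpt-toolbox implementation, and the "search" is plain word overlap rather than embeddings:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Chunk:
    document: str
    metadata: Dict[str, object] = field(default_factory=dict)


class InMemoryStore:
    """Hypothetical stand-in for a vector database; ranks by shared words."""

    def __init__(self) -> None:
        self.chunks: List[Chunk] = []

    def add_document(self, document: str, metadata: Dict[str, object]) -> None:
        self.chunks.append(Chunk(document, metadata))

    def query(self, text: str, max_results: int = 3) -> List[Chunk]:
        words = set(text.lower().split())
        # Highest word overlap first; a real store would compare embeddings.
        ranked = sorted(
            self.chunks,
            key=lambda c: len(words & set(c.document.lower().split())),
            reverse=True,
        )
        return ranked[:max_results]


class ToyRetriever:
    """Extract chunks from raw input, store them, and expose search."""

    def __init__(self, db: InMemoryStore, extract: Callable[[str], List[Chunk]]) -> None:
        self.db = db
        self.extract = extract  # injected, so the chunking strategy can be swapped

    def index(self, source: str) -> None:
        for chunk in self.extract(source):
            self.db.add_document(chunk.document, chunk.metadata)

    def query(self, text: str, max_results: int = 3) -> List[Chunk]:
        return self.db.query(text, max_results=max_results)


def split_paragraphs(source: str) -> List[Chunk]:
    """Toy extractor: one chunk per blank-line-separated paragraph."""
    parts = [p.strip() for p in source.split("\n\n") if p.strip()]
    return [Chunk(p, {"position": i}) for i, p in enumerate(parts)]


retriever = ToyRetriever(InMemoryStore(), split_paragraphs)
retriever.index("cats purr loudly\n\ndogs bark at night")
top = retriever.query("why do dogs bark", max_results=1)
print(top[0].document)
```

Passing the extractor in through the constructor keeps the storage and search code unaware of how the chunks were produced.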

To start, we'll look at the basic `DocumentRetriever`. Let's give it an in-memory `Chroma` db (our own lightweight wrapper that implements our interface), index a few documents, and then search them with natural language:

In [2]:
from db import Chroma

from retrieval import DocumentRetriever

db = Chroma('basic-doc')

retriever = DocumentRetriever(db)

retriever.index("example one")
retriever.index("ejemplo dos")
retriever.index("exemple trois")

Using embedded DuckDB with persistence: data will be stored in: .chromadb


Adding documents to store:   0%|          | 0/1 [00:00<?, ?it/s]

Adding documents to store:   0%|          | 0/1 [00:00<?, ?it/s]

Adding documents to store:   0%|          | 0/1 [00:00<?, ?it/s]

Searches that have nothing in common with the exact document text produce the expected results - neat!

In [3]:
print(retriever.query("written in french", max_results=1))
print(retriever.query("written in spanish", max_results=1))
print(retriever.query("first item", max_results=1))

[QueryResult(_id='a85ab21c-4f9f-4b51-87a1-5f113d966268', document='exemple trois', metadata={'created_at': 1683840697}, distance=0.3593006432056427)]
[QueryResult(_id='3d763964-8444-4e0e-8e0e-336a0d99e998', document='ejemplo dos', metadata={'created_at': 1683840696}, distance=0.3905145227909088)]
[QueryResult(_id='c4466fe9-49b4-487d-aa0c-4109fd73568e', document='example one', metadata={'created_at': 1683840696}, distance=0.3457540273666382)]


#### 🐍 Parsing code and getting metadata with `PythonExtractor`

Before looking at the `PythonRetriever` exported by the top-level `retrieval` module, let's look at the more internal `PythonExtractor`. This is a configurable utility class that parses Python code and extracts meaningful chunks along with their associated metadata. The metadata includes where to locate the chunk (the file path and line number), among other fields. The result is hopefully more useful for embedding, which helps with code search and "chatting your data" in general. As a bonus, `PythonRetriever` is able to provide useful higher-level search methods because of the metadata produced by the extractor.

The default configuration of `PythonExtractor` produces a lot of redundant chunks. The chunks extracted include: entire modules, classes, functions (class methods each count as their own function), block comments (for docstrings), function/method calls, and variable assignments. So a docstring in a method, for instance, will get extracted many times. This is by design! However, in your own production code, you can easily customize exactly what you want to extract by using the DI patterns (through constructor params) that are used in `retrieval` (and throughout gpt-toolbox!).
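As a rough illustration of this kind of extraction, the sketch below uses only the standard library `ast` module to pull class and function chunks out of a source string, attaching metadata that loosely mirrors the fields shown in this notebook. It is an assumption about the general approach, not how `PythonExtractor` is actually implemented; `extract_chunks` and `SOURCE` are hypothetical names, and methods are simply labeled `function` here:

```python
import ast
from typing import Dict, List, Tuple

SOURCE = '''
class Greeter:
    """Says hello."""

    def greet(self, name):
        return f"hello {name}"


def main():
    print(Greeter().greet("world"))
'''


def extract_chunks(source: str, file_name: str = "<memory>") -> List[Tuple[str, Dict[str, object]]]:
    """Return (text, metadata) pairs for every class and function definition."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            node_type = "class"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node_type = "function"
        else:
            continue
        # get_source_segment recovers the exact source text spanned by the node
        text = ast.get_source_segment(source, node)
        metadata = {
            "node_type": node_type,
            "node_name": node.name,
            "lineno": node.lineno,
            "file_name": file_name,
        }
        chunks.append((text, metadata))
    return chunks


chunks = extract_chunks(SOURCE)
for text, meta in chunks:
    print(meta["node_type"], meta["node_name"], meta["lineno"])
```

Note the redundancy: the `greet` method is emitted as its own chunk and also appears inside the `Greeter` class chunk, which is the same trade-off described above.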

In [4]:
from retrieval.python import PythonExtractor

extractor = PythonExtractor()

target_dir = os.path.join(os.getcwd(), "..", "src") # extract our own src/

print(f"diving into {os.path.abspath(target_dir)}...")

items = extractor.extract(target_dir)

print('done. total items extracted:', len(items))

print(json.dumps(items[0].metadata, indent=2))

diving into /Users/jmn/Projects/gpt-toolbox/src...
done. total items extracted: 520
{
  "node_type": "module",
  "output_type": "code",
  "node_name": "",
  "lineno": "",
  "loc": 9,
  "lloc": 9,
  "sloc": 8,
  "file_name": "main.py",
  "file_path": "/Users/jmn/Projects/gpt-toolbox/src/main.py",
  "last_modified_time": 1683324229
}


#### 📜 Indexing with `Chroma`

**Warning: The following operations cost money! (~$0.02)**

Now we'll step out of `retrieval` for a second and manually index everything we just extracted into Chroma to see how that looks (`retrievers` do this internally).

Even though we are indexing a lot of redundant text (as mentioned before), it should only cost a few pennies (ada-002 is currently 1/50th the cost of gpt3.5, and there's no completion to worry about).

In [5]:
db = Chroma('python-example-1')

Using embedded DuckDB with persistence: data will be stored in: .chromadb


In [6]:
for item in tqdm(items, desc="Adding documents to store"):
    db.add_document(item.document, item.metadata)

db.client.persist() # only necessary in notebook context

print('total documents in store:', db.collection.count())

Adding documents to store:   0%|          | 0/520 [00:00<?, ?it/s]

total documents in store: 520


#### 🔎 Searching

With the chunks embedded, you can search the code using natural language or with exact references to symbols. You can also filter on metadata for special searches, e.g. calls to a function, or searching only within docstrings.
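The metadata filtering idea can be sketched in plain Python. The records below are hypothetical, shaped like the extractor metadata printed earlier, and `filter_by_metadata` is an illustrative helper rather than part of gpt-toolbox; a real vector store would typically apply such a filter as part of the query itself:

```python
from typing import Dict, List

# Hypothetical records shaped like the extractor metadata shown earlier.
RECORDS: List[Dict[str, object]] = [
    {"node_type": "class", "output_type": "code", "node_name": "ShellRequest", "lineno": 35},
    {"node_type": "function", "output_type": "code", "node_name": "count_tokens", "lineno": 8},
    {"node_type": "comment", "output_type": "comment", "node_name": "", "lineno": 8},
    {"node_type": "method", "output_type": "ast", "node_name": "system_prompt", "lineno": 14},
]


def filter_by_metadata(records: List[Dict[str, object]], **criteria: object) -> List[Dict[str, object]]:
    """Keep records whose metadata matches every key/value pair in criteria."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]


# e.g. search only within comments, or only a specific method
only_comments = filter_by_metadata(RECORDS, node_type="comment")
named_methods = filter_by_metadata(RECORDS, node_type="method", node_name="system_prompt")
print(len(only_comments), len(named_methods))
```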

In [7]:
def print_results_summary(results):
    # using the metadata, we can show where, when, and how the chunk was extracted
    lines = [
        f"{result.metadata['file_path']}:{result.metadata['lineno']} " 
        f"{result.metadata['node_name']} ({result.metadata['node_type']}-{result.metadata['output_type']})"
        #f"last modified:{result.metadata['last_modified_time']} "
        #f"doc_id:{result._id}"
        for result in results
    ]
    print(json.dumps(lines, indent=2))

print_results_summary(db.query("system_prompt", max_results=3))

print_results_summary(db.query('schema/ShellRequest', max_results=3))

print_results_summary(db.query("count tokens", max_results=5))

[
  "/Users/jmn/Projects/gpt-toolbox/src/agents/web/agent.py:28 prompt (method-code)",
  "/Users/jmn/Projects/gpt-toolbox/src/agents/few_shot/agent.py:23 prompt (method-code)",
  "/Users/jmn/Projects/gpt-toolbox/src/agents/few_shot/agent.py:14 system_prompt (method-ast)"
]
[
  "/Users/jmn/Projects/gpt-toolbox/src/plugin/api/schema.py:35 ShellRequest (class-code)",
  "/Users/jmn/Projects/gpt-toolbox/src/plugin/api/schema.py:35 ShellRequest (class-ast)",
  "/Users/jmn/Projects/gpt-toolbox/src/plugin/api/routes/shell.py:7 _shell (function-ast)"
]
[
  "/Users/jmn/Projects/gpt-toolbox/src/llm/count_tokens.py:8 count_tokens (function-code)",
  "/Users/jmn/Projects/gpt-toolbox/src/llm/chat_completion.py:44 chat_completion_token_counts (function-code)",
  "/Users/jmn/Projects/gpt-toolbox/src/llm/count_tokens.py:8 count_tokens (function-ast)",
  "/Users/jmn/Projects/gpt-toolbox/src/llm/chat_session.py:37 token_counts (method-code)",
  "/Users/jmn/Projects/gpt-toolbox/src/llm/count_tokens.py:  (

#### 🧰 Putting it all together with `PythonRetriever`

The "retriever" classes wrap the extractor and database operations we just did manually above. They provide a simple outward interface, `index`, that extracts everything and stores it all in a database. Specific retrievers can then provide their own specialized methods for higher-level searching.

Here is `PythonRetriever`. Because it wraps `PythonExtractor`, it can provide higher-level convenience methods for searching on and around the metadata. Compare these to the same searches above!

In [8]:
from retrieval import PythonRetriever

# create a new database. we could easily re-use the same one as before, but we want to demo index()
db = Chroma('python-example-2')

retriever = PythonRetriever(db)

# re-index the same stuff as before, but now through the PythonRetriever interface
retriever.index(os.path.join(os.getcwd(), "..", "src")) # again, our own src/

# use the high-level search methods:
print_results_summary(retriever.search_for_method("system_prompt"))

print_results_summary(retriever.search_for_class("ShellRequest"))

print_results_summary(retriever.search_comments("schema/ShellRequest"))

print_results_summary(retriever.search_in_file("/Users/jmn/Projects/gpt-toolbox/src/plugin/api/schema.py", "result"))

# or the basic:
print_results_summary(retriever.query("count tokens", max_results=5))


Using embedded DuckDB with persistence: data will be stored in: .chromadb


Adding documents to store:   0%|          | 0/520 [00:00<?, ?it/s]

[
  "/Users/jmn/Projects/gpt-toolbox/src/agents/few_shot/agent.py:14 system_prompt (method-ast)",
  "/Users/jmn/Projects/gpt-toolbox/src/agents/few_shot/agent.py:14 system_prompt (method-code)"
]
[
  "/Users/jmn/Projects/gpt-toolbox/src/plugin/api/schema.py:35 ShellRequest (class-ast)",
  "/Users/jmn/Projects/gpt-toolbox/src/plugin/api/schema.py:35 ShellRequest (class-code)"
]
[
  "/Users/jmn/Projects/gpt-toolbox/src/plugin/api/routes/shell.py:8  (comment-comment)",
  "/Users/jmn/Projects/gpt-toolbox/src/plugin/api/routes/search.py:20  (comment-comment)",
  "/Users/jmn/Projects/gpt-toolbox/src/plugin/api/routes/url.py:13  (comment-comment)"
]
[
  "/Users/jmn/Projects/gpt-toolbox/src/plugin/api/schema.py:38 ShellResult (class-code)",
  "/Users/jmn/Projects/gpt-toolbox/src/plugin/api/schema.py:38 ShellResult (class-ast)",
  "/Users/jmn/Projects/gpt-toolbox/src/plugin/api/schema.py:32 UrlResult (class-ast)"
]
[
  "/Users/jmn/Projects/gpt-toolbox/src/llm/count_tokens.py:8 count_tokens (fun