
# Example: Building code search with Lexy

In this example, we'll create a semantic search engine that allows us to run similarity search using **only the comments and docstrings** in code from a given GitHub repository. Below is an overview of the steps involved.

**Part 1: Ingesting raw code from GitHub**. Right now, this step requires LlamaIndex (but won't in the future).

**Part 2: Extracting comments and docstrings**. We'll write a function to parse comments and docstrings from code files. We'll use Lexy to run our function on our documents, and to write the output to an index for querying.

**Part 3: Running similarity search queries**. We'll use Lexy to run similarity search queries against our newly created index of comments and docstrings.

TODO: Add a diagram here showing an example file, e.g., `main.py`, split into multiple rows of comments and docstrings.
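
Until the diagram is added, the idea of "one file, many rows" can be illustrated in a few lines of Python. The sketch below is only an illustration: it uses the standard-library `ast` and `tokenize` modules rather than the tree-sitter approach from Part 2, and both the sample `main.py` source and the `rows_from_source` helper are made up for this example.

```python
import ast
import io
import tokenize

# Hypothetical contents of a small `main.py` file.
SAMPLE_SOURCE = '''"""Entry point for the app."""

# load settings before anything else
def main():
    """Run the app."""
    return 0  # exit code
'''


def rows_from_source(source: str) -> list[dict]:
    """Split one source file into rows, one per docstring or comment."""
    rows = []
    tree = ast.parse(source)
    # docstrings: module, classes, and functions
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                rows.append({"type": "docstring", "text": doc})
    # comments: `#` tokens found by the tokenizer
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            rows.append({"type": "comment", "text": tok.string})
    return rows


for row in rows_from_source(SAMPLE_SOURCE):
    print(row)
```

Each printed dictionary corresponds to one row that would be embedded and indexed separately.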

## Part 1: Ingesting data

For now, we're going to clone our repo locally and import the files using `llama_index.SimpleDirectoryReader`. This step requires **LlamaIndex**.

TODO: Once the `Collection` model supports custom PKs, include an explanation of how to choose a PK based on how you want your application's update logic to work.

In [None]:
# check if llama-index is installed
! pip freeze | grep -i llama-index

In [None]:
# install llama-index if not installed
! pip install llama-index

In [None]:
import os
import subprocess
from llama_index import SimpleDirectoryReader

In [None]:
# use this for the event loop to work in Jupyter notebooks
# import nest_asyncio
# nest_asyncio.apply()

### Clone a repo locally

In [None]:
# this is the relative path to the directory where we git clone our repos
local_repo_dir = "../tmp"

# create the temporary directory if it doesn't exist
os.makedirs(local_repo_dir, exist_ok=True)


def clone_repo(repo_url, repo_path_prefix=local_repo_dir, n_top_ext=30):
    """ Clone a GitHub repo and print some stats about it. """
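    # take the last URL segment and drop a trailing ".git" (also tolerates a trailing slash)
    repo_name = repo_url.rstrip("/").split("/")[-1].removesuffix(".git")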
    repo_path = os.path.join(repo_path_prefix, repo_name)
    
    # clone the repo if it doesn't exist
    if os.path.exists(repo_path):
        print(f"Repo '{repo_url}' already exists at {repo_path}")
    else:
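        # check=True raises CalledProcessError if the clone fails, so we never report a false success
        subprocess.run(["git", "clone", "--depth", "1", repo_url], cwd=repo_path_prefix, check=True)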
        print(f"Repo '{repo_url}' cloned to {repo_path}")

    # count total files
    n_total_files = subprocess.check_output(["git", "ls-files", "-z"], cwd=repo_path).decode("utf-8").count("\0")
    print(f"Total files in {repo_name}: {n_total_files}")
    
    # get all file extensions and count them
    files = subprocess.check_output(["git", "ls-files"], cwd=repo_path).decode("utf-8").splitlines()
    ext_counts = {}
    for file in files:
        ext = os.path.splitext(file)[1]
        if ext:
            ext_counts[ext] = ext_counts.get(ext, 0) + 1
    
    # sort extensions by count and print top n_top_ext
    sorted_ext_counts = sorted(ext_counts.items(), key=lambda item: item[1], reverse=True)[:n_top_ext]
    print(f"Top {n_top_ext} extensions:")
    for ext, count in sorted_ext_counts:
        print(f"\t{ext}: {count}")
    
    return repo_name, repo_path
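
The two pieces of bookkeeping inside `clone_repo` (deriving the repo name from the URL and tallying file extensions) can be tried without cloning anything. The following is a minimal sketch on a hardcoded list of paths; `repo_name_from_url`, `count_extensions`, and the sample paths are hypothetical names introduced here, and `collections.Counter` stands in for the manual dictionary used above.

```python
import os
from collections import Counter


def repo_name_from_url(repo_url: str) -> str:
    """Return the last path segment of a repo URL, without a trailing `.git`."""
    name = repo_url.rstrip("/").split("/")[-1]
    return name[:-len(".git")] if name.endswith(".git") else name


def count_extensions(paths: list[str], n_top: int = 30) -> list[tuple[str, int]]:
    """Count file extensions, skipping files without one, most common first."""
    exts = (os.path.splitext(p)[1] for p in paths)
    return Counter(e for e in exts if e).most_common(n_top)


sample_paths = ["setup.py", "src/app.py", "src/util.cc", "README", "docs/index.md"]
print(repo_name_from_url("https://github.com/ray-project/ray.git"))
print(count_extensions(sample_paths))
```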

In [None]:
repo_name, repo_path = clone_repo("https://github.com/ray-project/ray.git")

In [None]:
# specify the file extensions we want to include when ingesting the repo
include_extensions = [
    ".py", 
    # ".ipynb",  # errors out in llama_index, skipping for now
    ".java", 
    ".c", ".h", 
    ".cpp", ".cc", ".hpp", 
    ".go", 
    ".rs", 
    ".rb", ".erb",
    ".js", ".jsx", 
    ".ts", ".tsx", 
    ".html", 
    ".css", 
    ".sh",
    # ".md", ".rst",  # llama_index splits these into nodes, skipping for now
    # ".txt",  # no comments or docstrings in text files
    # ".jpg"  # just testing
]

In [None]:
reader = SimpleDirectoryReader(repo_path, 
                               filename_as_id=True,
                               recursive=True,
                               required_exts=include_extensions)
llama_docs = reader.load_data()
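
Since the LlamaIndex dependency is meant to be temporary, it may help to see roughly what this step does. The sketch below walks a directory with the standard library and keeps only files whose extension is in an allow-list. It is an approximation under stated assumptions, not LlamaIndex's implementation: `read_code_files` is a hypothetical helper, and details such as hidden-directory handling and encoding errors are simplified.

```python
import os
import tempfile


def read_code_files(root: str, extensions: list[str]) -> list[dict]:
    """Recursively read files under `root` whose extension is in `extensions`."""
    docs = []
    for dirpath, dirnames, filenames in os.walk(root):
        # skip hidden directories such as `.git` (assumption: we never want them)
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for filename in sorted(filenames):
            if os.path.splitext(filename)[1] not in extensions:
                continue
            path = os.path.join(dirpath, filename)
            # replace undecodable bytes instead of failing on them
            with open(path, encoding="utf-8", errors="replace") as f:
                content = f.read()
            docs.append({
                "content": content,
                "meta": {"file_path": path, "file_name": filename},
            })
    return docs


# tiny demo on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "pkg"))
    with open(os.path.join(tmp, "pkg", "mod.py"), "w", encoding="utf-8") as f:
        f.write("# a comment\n")
    with open(os.path.join(tmp, "notes.txt"), "w", encoding="utf-8") as f:
        f.write("not code\n")
    demo_docs = read_code_files(tmp, [".py", ".ts"])
    print([d["meta"]["file_name"] for d in demo_docs])
```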

### Convert a LlamaIndex doc to a Lexy doc

In [None]:
from lexy_py import Document, LexyClient

In [None]:
def llama_to_lexy(llama_doc, repo_path_prefix=local_repo_dir) -> Document:
    """ Convert a llama index document to a lexy document """
    lexy_doc = Document(content=llama_doc.get_text(), meta=llama_doc.dict().get("metadata", {}))
    # remove the file path prefix
    if "file_path" in lexy_doc.meta:
        lexy_doc.meta["file_path"] = os.path.relpath(lexy_doc.meta.get("file_path"), repo_path_prefix)
    # add file extension
    if "file_name" in lexy_doc.meta:
        _, file_ext = os.path.splitext(lexy_doc.meta["file_name"])
        lexy_doc.meta["file_ext"] = file_ext
    # the repo name is deliberately not added here: that is an application-level concern,
    # separate from converting to a Lexy document
    # TODO: if an image, upload to lexy as image document
    return lexy_doc
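
The metadata clean-up in `llama_to_lexy` is easy to check in isolation. Here is a minimal sketch that applies the same two steps to a plain dictionary instead of a Lexy `Document`; `normalize_meta` and the sample paths are hypothetical and only meant for illustration.

```python
import os


def normalize_meta(meta: dict, repo_path_prefix: str) -> dict:
    """Return a copy of `meta` with a relative file path and a file extension."""
    out = dict(meta)
    if "file_path" in out:
        # make the path relative to the directory that holds the cloned repos
        out["file_path"] = os.path.relpath(out["file_path"], repo_path_prefix)
    if "file_name" in out:
        # keep the leading dot, e.g. ".py"
        out["file_ext"] = os.path.splitext(out["file_name"])[1]
    return out


sample_meta = {
    "file_path": os.path.join("..", "tmp", "ray", "python", "setup.py"),
    "file_name": "setup.py",
}
print(normalize_meta(sample_meta, os.path.join("..", "tmp")))
```

Because the prefix is stripped, `file_path` starts with the repo name (for example `ray/...`), which is the form used in the lookup queries in Part 2.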


In [None]:
llama_doc = llama_docs[0]
llama_doc

In [None]:
lexy_doc = llama_to_lexy(llama_doc)
lexy_doc.dict()

In [None]:
# convert all llama docs to lexy docs
lexy_docs = [llama_to_lexy(doc) for doc in llama_docs]

### Upload docs to Lexy

In [None]:
# instantiate Lexy client
lx = LexyClient()
lx.info()

In [None]:
# create a collection for our new documents
github_repos_collection = lx.create_collection("github_repos", description="Code from select GitHub repositories")
github_repos_collection

In [None]:
len(lexy_docs)

In [None]:
# add the lexy docs to our new collection 
docs_added = lx.add_documents(lexy_docs, collection_id="github_repos")
docs_added[:5]

### Pipeline for ingesting a new repo

Now that we have the code to ingest a new repo, we can wrap it in a function and use it to streamline the ingestion of any new repo we want to add.

In [None]:
def lexy_docs_from_github_repo(repo_url: str, 
                               repo_path_prefix: str = local_repo_dir, 
                               file_extensions: list[str] = include_extensions) -> list[Document]:
    """ Clones a GitHub repo and returns a list of documents ready for upload to Lexy. """
    # clone the repo locally
    name, path = clone_repo(repo_url, repo_path_prefix)
    # read using llama_index
    llama_reader = SimpleDirectoryReader(path, 
                                         filename_as_id=True,
                                         recursive=True,
                                         required_exts=file_extensions)
    llama_repo_docs = llama_reader.load_data()
    # convert to lexy docs
    lexy_repo_docs = [llama_to_lexy(doc) for doc in llama_repo_docs]
    return lexy_repo_docs

In [None]:
# get docs for a new repo
repo_docs = lexy_docs_from_github_repo("https://github.com/mosaicml/composer.git")

In [None]:
len(repo_docs)

In [None]:
# use this to filter out any known bad files
exclude_filenames = [
    # "broken.js",  # this file contains an invalid null byte and is used for testing
]

docs_to_add = [d for d in repo_docs if d.meta.get("file_name") not in exclude_filenames]
len(docs_to_add)
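
Maintaining `exclude_filenames` by hand works for a handful of files. As an alternative, problem files can sometimes be detected from their content. The sketch below flags documents that contain a null byte, the issue mentioned in the comment above. It rests on assumptions: `SimpleDoc` is a stand-in for Lexy's `Document`, and `has_null_byte` is a hypothetical helper, not part of Lexy.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Stand-in for a Lexy `Document` with just the fields used here."""
    content: str
    meta: dict = field(default_factory=dict)


def has_null_byte(doc: SimpleDoc) -> bool:
    """Return True if the document text contains a NUL character."""
    return "\x00" in doc.content


sample_docs = [
    SimpleDoc("// fine\nlet x = 1;", {"file_name": "ok.js"}),
    SimpleDoc("let y = 2;\x00", {"file_name": "broken.js"}),
]
clean_docs = [d for d in sample_docs if not has_null_byte(d)]
print([d.meta["file_name"] for d in clean_docs])
```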

In [None]:
# upload to lexy
docs_added = lx.add_documents(docs_to_add, collection_id="github_repos")
docs_added[:5]

## Part 2: Extracting comments and docstrings

In this part, we'll write a function to parse comments and docstrings from code files. We'll use Lexy to run our function on our documents, and to write the output to an index for querying.

Much of this section is included in the tutorial on [custom transformers](https://getlexy.com/tutorials/custom-transformers/).

Using `tree-sitter-languages`, we arrive at the following code to extract comments and docstrings from code in several languages (C++, Python, TypeScript, and TSX).

In [None]:
import tree_sitter_languages

from lexy.models.document import Document
from lexy.transformers import lexy_transformer
from lexy.transformers.embeddings import text_embeddings


lang_from_ext = {
    'cc': 'cpp',
    'h': 'cpp',
    'py': 'python',
    'ts': 'typescript',
    'tsx': 'tsx',
}

COMMENT_PATTERN_CPP = "(comment) @comment"
COMMENT_PATTERN_PY = """
    (module . (comment)* . (expression_statement (string)) @module_doc_str)

    (class_definition
        body: (block . (expression_statement (string)) @class_doc_str))

    (function_definition
        body: (block . (expression_statement (string)) @function_doc_str))
"""
COMMENT_PATTERN_TS = "(comment) @comment"
COMMENT_PATTERN_TSX = "(comment) @comment"

comment_patterns = {
    'cpp': COMMENT_PATTERN_CPP,
    'python': COMMENT_PATTERN_PY,
    'typescript': COMMENT_PATTERN_TS,
    'tsx': COMMENT_PATTERN_TSX
}


@lexy_transformer(name='code.extract_comments.v1')
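def get_comments(doc: Document) -> list[dict]:
    """ Extract comments and docstrings from a code document, one dict per match. """
    # map the file extension (e.g. ".py") to a tree-sitter language name
    lang = lang_from_ext.get(doc.meta.get('file_ext', '').replace('.', ''))
    comment_pattern = comment_patterns.get(lang, None)

    # skip documents with a missing or unsupported file extension
    if comment_pattern is None:
        return []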

    parser = tree_sitter_languages.get_parser(lang)
    language = tree_sitter_languages.get_language(lang)

    tree = parser.parse(bytes(doc.content, "utf-8"))
    root = tree.root_node

    query = language.query(comment_pattern)
    matches = query.captures(root)
    comments = []
    # each capture is a (node, capture_name) pair
    for m, name in matches:
        comment_text = m.text.decode('utf-8')
        c = {
            'comment_text': comment_text,
            'comment_embedding': text_embeddings(comment_text),
            'comment_meta': {
                'start_point': m.start_point,
                'end_point': m.end_point,
                'type': name
            }
        }
        comments.append(c)
    return comments
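
Note that `get_comments` returns an empty list for any file whose extension is missing from `lang_from_ext`. A quick way to see which of the extensions ingested in Part 1 are affected is to compare the two lists. The sketch below copies the relevant values from this notebook so that it runs on its own; `unsupported_extensions` is a hypothetical helper.

```python
# copied from Part 1 (extensions that were ingested)
ingested_extensions = [
    ".py", ".java", ".c", ".h", ".cpp", ".cc", ".hpp", ".go", ".rs",
    ".rb", ".erb", ".js", ".jsx", ".ts", ".tsx", ".html", ".css", ".sh",
]

# copied from the transformer above (extensions it knows how to parse)
lang_from_ext = {"cc": "cpp", "h": "cpp", "py": "python", "ts": "typescript", "tsx": "tsx"}


def unsupported_extensions(extensions: list[str], mapping: dict[str, str]) -> list[str]:
    """Return the extensions for which no language (and so no pattern) is defined."""
    return [ext for ext in extensions if ext.lstrip(".") not in mapping]


print(unsupported_extensions(ingested_extensions, lang_from_ext))
```

Documents with these extensions still pass through the transformer but produce no rows. Supporting them would mean adding an entry to `lang_from_ext` and a matching query pattern.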

### Test on sample documents

Let's test our function on a few documents to see if it's working as expected. Replace each hardcoded document ID below with the `document_id` of a document in your own collection; the SQL shown in each cell is one way to look it up.

In [None]:
# typescript
"""
    select document_id from documents d 
    where collection_id = 'github_repos' 
    and meta->>'file_path' = 'turbo/packages/turbo-gen/src/commands/raw/index.ts';
"""
ts_doc_id = 'f799cabc-2a14-464f-af1d-a1848ae8bd40'
ts_doc = lx.get_document(ts_doc_id)

In [None]:
ts_doc.dict()

In [None]:
c = get_comments(ts_doc)
print(*[{k: v for k, v in d.items() if k != 'comment_embedding'} for d in c], sep='\n')

In [None]:
# cpp
"""
    select document_id from documents d 
    where collection_id = 'github_repos' 
    and meta->>'file_path' = 'ray/cpp/include/ray/api/metric.h';
"""
cpp_doc_id = '9e9f73de-dbfb-4a2a-ac66-093b348685fb'
cpp_doc = lx.get_document(cpp_doc_id)

In [None]:
cpp_doc.dict()

In [None]:
c = get_comments(cpp_doc)
print(*[{k: v for k, v in d.items() if k != 'comment_embedding'} for d in c], sep='\n')

In [None]:
# python
"""
    select document_id from documents d 
    where collection_id = 'github_repos' 
    and meta->>'file_path' = 'composer/composer/algorithms/alibi/alibi.py';
"""
py_doc_id = '30bbb9e1-2f9d-484a-b0d3-409a65fcfdd5'
py_doc = lx.get_document(py_doc_id)

In [None]:
py_doc.dict()

In [None]:
c = get_comments(py_doc)
print(*[{k: v for k, v in d.items() if k != 'comment_embedding'} for d in c], sep='\n')

### Registering the function with Lexy

It looks like our function is working as expected. To run it against all of our documents, we can follow the instructions in the custom transformers tutorial. We'll use the `lexy_transformer` decorator to register our function with Lexy, and then use the `LexyClient` to run our function on our documents and write the output to an index.

We put the above code into a file called `code.py` and place it inside the `lexy.transformers` directory.

*Additional instructions...*

In [None]:
#TODO: the things you get from Lexy include the following...

## Part 3: Running similarity search queries
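
Conceptually, a similarity search query embeds the query text and ranks the stored comment embeddings by how close they are to it. The following sketch shows that ranking step with cosine similarity on made-up three-dimensional vectors. It only illustrates the idea: the vectors, the `cosine_similarity` and `top_k` helpers, and the sample rows are all hypothetical, and this is not how Lexy's index or query API is implemented.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], rows: list[dict], k: int = 2) -> list[dict]:
    """Return the `k` rows whose embedding is most similar to `query_vec`."""
    scored = sorted(
        rows,
        key=lambda r: cosine_similarity(query_vec, r["comment_embedding"]),
        reverse=True,
    )
    return scored[:k]


# made-up rows in the shape produced by `get_comments` (embeddings shortened to 3 dims)
sample_rows = [
    {"comment_text": "// retry the request with backoff", "comment_embedding": [0.9, 0.1, 0.0]},
    {"comment_text": "// parse CLI arguments", "comment_embedding": [0.0, 0.2, 0.9]},
    {"comment_text": "// exponential backoff helper", "comment_embedding": [0.8, 0.3, 0.1]},
]

query_embedding = [1.0, 0.2, 0.0]  # pretend embedding of "how are retries handled?"
for row in top_k(query_embedding, sample_rows):
    print(row["comment_text"])
```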