# Code Search with Vector Embeddings and Qdrant

In this example, we will use vector embeddings to navigate a codebase and find relevant code snippets. We will search the codebase with natural-language queries, and also search for code that implements similar logic. A [live deployment](https://code-search.qdrant.tech/) with a web interface lets you search the Qdrant codebase.

We need two models to accomplish our goal:
* NLP model - general usage neural encoder for NLP, in this example, we will use [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).
* Code model - specialized embeddings for code-to-code similarity search. We will use [`jinaai/jina-embeddings-v2-base-code`](https://huggingface.co/jinaai/jina-embeddings-v2-base-code), which supports English and 30 widely used programming languages with an 8192-token sequence length.

## Setup

In [None]:
!pip install -qU inflection qdrant-client fastembed

* `inflection` - a string transformation library. It singularizes and pluralizes English words, and converts CamelCase to underscored strings.
* `fastembed` - a CPU-first, lightweight library for generating vector embeddings. GPU support is also available.
* `qdrant-client` - the Python client used to interface with the Qdrant server.

## Data preparation

Chunking the application sources into smaller parts is a non-trivial task. In general, functions, class methods, structs, enums, and all the other language-specific constructs are good candidates for chunks. They are big enough to contain meaningful information, but small enough to be processed by embedding models with a limited context window. We can also use docstrings, comments, and other metadata to enrich the chunks with additional information.

Text-based search is based on function signatures, but code search may return smaller pieces, such as loops. So, if we receive a particular function signature from the NLP model and part of its implementation from the code model, we will merge the results.

## Parsing the codebase

In this example, we will use the [`Qdrant`](https://github.com/qdrant/qdrant) repository.

This codebase is written in Rust. We can use a **Language Server Protocol (LSP)** tool to build a graph of the codebase, and then extract chunks. We will use [`rust-analyzer`](https://rust-analyzer.github.io/) to export the parsed codebase into LSIF (Language Server Index Format), a standard for code intelligence data. Next, we will use the LSIF data to navigate the codebase and extract the chunks.

The same approach can be applied for other languages as well.

We will then export the chunks into JSON documents that contain not only the code itself, but also context describing where the code is located in the project. The Qdrant structures, already parsed into JSON, are available in the [`structures.jsonl`](https://storage.googleapis.com/tutorial-attachments/code-search/structures.jsonl) file. We need to download it and use it as the data source for our code search.

In [None]:
!wget https://storage.googleapis.com/tutorial-attachments/code-search/structures.jsonl

We then load the file and parse the lines into a list of dictionaries.

In [None]:
import json

structures = []
with open('structures.jsonl', 'r') as fp:
    # Each line of the file is one JSON document
    for row in fp:
        entry = json.loads(row)
        structures.append(entry)

In [None]:
structures[0]
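
The later cells read a fixed set of keys from each record, so it can help to check for them early. The following is a minimal sketch, assuming only the keys that this tutorial accesses (`name`, `signature`, `code_type`, `docstring`, and the `context` fields). The `missing_keys` helper and every value in `sample_record` are hypothetical and are not taken from the real file.

```python
import json
from typing import Any, Dict, List

# Keys read by later cells in this tutorial
TOP_LEVEL_KEYS = ("name", "signature", "code_type", "docstring", "context")
CONTEXT_KEYS = ("module", "file_path", "file_name", "struct_name", "snippet")


def missing_keys(record: Dict[str, Any]) -> List[str]:
    """Return the dotted names of expected keys that are absent from a record."""
    missing = [key for key in TOP_LEVEL_KEYS if key not in record]
    context = record.get("context") or {}
    missing += [f"context.{key}" for key in CONTEXT_KEYS if key not in context]
    return missing


# Hypothetical JSONL line; the values are invented for illustration
sample_line = json.dumps({
    "name": "count_points",
    "signature": "fn count_points(&self) -> usize",
    "code_type": "Function",
    "docstring": None,
    "context": {
        "module": "example_module",
        "file_path": "lib/example/src/example.rs",
        "file_name": "example.rs",
        "struct_name": None,
        "snippet": "fn count_points(&self) -> usize { self.points.len() }",
    },
})

sample_record = json.loads(sample_line)
print(missing_keys(sample_record))
```

Running `missing_keys` over every entry in `structures` before embedding would surface malformed records before they cause a `KeyError` further down.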

## Code to natural language conversion

Each programming language has its own syntax, which is not part of natural language. A general-purpose model may therefore not understand the code as is. We need to normalize the data by removing code specifics and including additional context, such as module, class, function, and file name:
1. Extract the signature of the function, method, or other code construct.
2. Divide camel case and snake case names into separate words.
3. Take the docstring, comments, and other important metadata.
4. Build a sentence from the extracted data using a predefined template.
5. Remove the special characters and replace them with spaces.
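
To make steps 2 and 5 concrete, here is a minimal sketch that uses only the standard library. It is an illustration of the idea and not the `inflection` calls used in this tutorial, whose output may differ (for example in capitalization). The `split_identifier` name is hypothetical.

```python
import re


def split_identifier(identifier: str) -> str:
    """Split a CamelCase or snake_case identifier into lowercase words."""
    # Insert a space at lower-to-upper boundaries: "isReady" -> "is Ready"
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", identifier)
    # Split acronym runs from the following word: "HTTPServer" -> "HTTP Server"
    spaced = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", spaced)
    # Treat underscores and any other non-word characters as separators
    words = re.split(r"[\W_]+", spaced)
    return " ".join(word.lower() for word in words if word)


print(split_identifier("PointsCount"))
print(split_identifier("count_points"))
print(split_identifier("fn count(&self) -> usize"))
```

The last call shows how punctuation in a signature is dropped, leaving only word-like tokens for the natural language model.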

We can now define the `textify` function that uses the `inflection` library to carry out our conversions.

In [None]:
import inflection
import re
from typing import Dict, Any


def textify(chunk: Dict[str, Any]) -> str:
    # Get rid of all the camel case / snake case
    # - inflection.underscore changes the camel case to snake case
    # - inflection.humanize converts the snake case to human readable form
    name = inflection.humanize(inflection.underscore(chunk['name']))
    signature = inflection.humanize(inflection.underscore(chunk['signature']))

    # Check if docstring is provided
    docstring = ''
    if chunk['docstring']:
        docstring = f"that does {chunk['docstring']} "

    # Extract the location of that snippet of code
    context = f"module {chunk['context']['module']} file {chunk['context']['file_name']}"
    if chunk['context']['struct_name']:
        struct_name = inflection.humanize(inflection.underscore(chunk['context']['struct_name']))
        context = f"defined in struct {struct_name} {context}"

    # Combine all the bits and pieces together
    text_representation = f"{chunk['code_type']} {name} {docstring} defined as {signature} {context}"

    # Remove any special characters and concatenate the tokens
    tokens = re.split(r"\W", text_representation)
    tokens = filter(lambda x: x, tokens)

    return ' '.join(tokens)

Now we can use `textify` to convert all chunks into text representations.

In [None]:
text_representations = list(map(textify, structures))

In [None]:
text_representations[0]

## Natural language embeddings

In [None]:
from fastembed import TextEmbedding

batch_size = 5

nlp_model = TextEmbedding(
    'sentence-transformers/all-MiniLM-L6-v2',
    threads=0
)

nlp_embeddings = nlp_model.embed(
    text_representations,
    batch_size=batch_size
)

## Code embeddings

In [None]:
code_snippets = [
    structure['context']['snippet'] for structure in structures
]

code_model = TextEmbedding('jinaai/jina-embeddings-v2-base-code')

code_embeddings = code_model.embed(
    code_snippets,
    batch_size=batch_size
)

## Building Qdrant collection

Qdrant supports multiple modes of deployment, including in-memory for prototyping, Docker and Qdrant Cloud.

We will use an in-memory instance in this example.

In [None]:
from qdrant_client import QdrantClient, models

COLLECTION_NAME = 'qdrant-sources'

# use in-memory storage
client = QdrantClient(':memory:')

We will create a collection to store our vectors.

In [None]:
client.create_collection(
    COLLECTION_NAME,
    vectors_config={
        'text': models.VectorParams(
            size=384,
            distance=models.Distance.COSINE
        ),
        'code': models.VectorParams(
            size=768,
            distance=models.Distance.COSINE
        )
    }
)
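
The collection defines two named vectors whose sizes match the output dimensions of the two models (384 for `all-MiniLM-L6-v2`, 768 for `jina-embeddings-v2-base-code`), and both are compared with cosine distance. As a reminder of what that metric measures, here is a minimal standard-library sketch of cosine similarity. It illustrates the formula only and says nothing about how Qdrant computes it internally.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal
```

Because the score depends only on direction, two vectors of different magnitude that point the same way are treated as identical.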

In [None]:
from tqdm import tqdm

points = []
total = len(structures)
print('Number of points to upload: ', total)

for id, (text_embedding, code_embedding, structure) in tqdm(
    enumerate(zip(nlp_embeddings, code_embeddings, structures)),
    total=total
):
    # FastEmbed returns generators, so embeddings are computed lazily as they are consumed
    points.append(
        models.PointStruct(
            id=id,
            vector={
                'text': text_embedding,
                'code': code_embedding
            },
            payload=structure
        )
    )

    # Upload points in batches
    if len(points) >= batch_size:
        client.upload_points(COLLECTION_NAME, points=points, wait=True)
        points = []

# Ensure any remaining points are uploaded
if points:
    client.upload_points(COLLECTION_NAME, points=points)
print(f"Total points in collection: {client.count(COLLECTION_NAME).count}")

The uploaded points are immediately available for search.
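
The accumulate-and-flush pattern in the upload loop can be factored into a small reusable helper. The sketch below is one possible way to do that with the standard library; the `batched` name is hypothetical and it is not part of `qdrant-client` or FastEmbed.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items, consuming `items` lazily."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    # Emit the final, possibly shorter, batch
    if batch:
        yield batch


# Works with generators too, which is how the embeddings arrive
lazy_numbers = (n * n for n in range(7))
for number_batch in batched(lazy_numbers, 3):
    print(number_batch)
```

Since the helper never materializes the whole input, only one batch of embeddings needs to be held in memory at a time.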

## Querying the codebase

We will use one of the models to search the collection via Qdrant's Query API. Start with text embeddings.

In [None]:
query = "How do I count points in a collection?"

hits = client.query_points(
    COLLECTION_NAME,
    query=next(nlp_model.query_embed(query)).tolist(),
    using='text',
    limit=3
).points

In [None]:
hits

Next we try the code embeddings.

In [None]:
hits = client.query_points(
    COLLECTION_NAME,
    query=next(code_model.query_embed(query)).tolist(),
    using='code',
    limit=3
).points

hits

Code and text embeddings can capture different aspects of the codebase. We can use both models to query the collection and then combine the results to get the most relevant code snippets.

In [None]:
from qdrant_client import models

hits = client.query_points(
    collection_name=COLLECTION_NAME,
    prefetch=[
        models.Prefetch(
            query=next(nlp_model.query_embed(query)).tolist(),
            using='text',
            limit=5
        ),
        models.Prefetch(
            query=next(code_model.query_embed(query)).tolist(),
            using='code',
            limit=5
        )
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF)
).points
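
`models.Fusion.RRF` requests reciprocal rank fusion, which combines ranked lists by rank position instead of by raw score. The sketch below shows the general idea in plain Python. The constant `k=60` is an assumption taken from the commonly cited formulation, the point IDs are invented, and Qdrant's internal implementation may differ in its details.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]],
    k: int = 60,  # assumed smoothing constant, not taken from Qdrant
) -> List[Tuple[Hashable, float]]:
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to an item's score."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical point IDs returned by the two prefetch queries
text_ranking = [11, 42, 7]
code_ranking = [42, 99, 11]
for point_id, fused_score in reciprocal_rank_fusion([text_ranking, code_ranking]):
    print(point_id, round(fused_score, 4))
```

Items that appear near the top of both lists rise above items that rank highly in only one, which is why fusion works even when the two models produce scores on different scales.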

In [None]:
for hit in hits:
    print(
        "| ",
        hit.payload["context"]["module"],
        " | ",
        hit.payload["context"]["file_path"],
        " | ",
        hit.score,
        " | `",
        hit.payload["signature"],
        "` |",
    )

This is how we can fuse the results from different models. In a real-world scenario, we may also rerank and deduplicate the results, and apply additional post-processing.
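
As one example of such post-processing, the sketch below deduplicates hits that point at the same code construct. It assumes each hit has been flattened into a dictionary with `file_path`, `signature`, and `score` keys. That shape, the `deduplicate_hits` name, and the sample values are all hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple


def deduplicate_hits(flat_hits: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the best-scoring hit per (file_path, signature) pair, best first."""
    best: Dict[Tuple[str, str], Dict[str, Any]] = {}
    for hit in flat_hits:
        key = (hit["file_path"], hit["signature"])
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)


# Invented hits: the first two refer to the same function
raw_hits = [
    {"file_path": "lib/a.rs", "signature": "fn count(&self)", "score": 0.61},
    {"file_path": "lib/a.rs", "signature": "fn count(&self)", "score": 0.74},
    {"file_path": "lib/b.rs", "signature": "fn total(&self)", "score": 0.58},
]
for unique_hit in deduplicate_hits(raw_hits):
    print(unique_hit["file_path"], unique_hit["signature"], unique_hit["score"])
```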

## Grouping the results

We can improve the search results by grouping them by payload properties. Without grouping, a query against the code embeddings can return several results from the same module, such as `map_index`. Grouping by `context.module` with `group_size=1` returns at most one result per module.

In [None]:
results = client.query_points_groups(
    COLLECTION_NAME,
    query=next(code_model.query_embed(query)).tolist(),
    using='code',
    group_by='context.module',
    limit=5,
    group_size=1
)

In [None]:
for group in results.groups:
    for hit in group.hits:
        print(
            "| ",
            hit.payload["context"]["module"],
            " | ",
            hit.payload["context"]["file_name"],
            " | ",
            hit.score,
            " | `",
            hit.payload["signature"],
            "` |",
        )
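
To illustrate what `limit` and `group_size` control, here is a conceptual sketch of grouping in plain Python. It is not Qdrant's implementation. It assumes hits flattened into dictionaries with a `module` and a `score` key, and the `group_hits` name and sample data are hypothetical.

```python
from typing import Any, Dict, List


def group_hits(
    scored_hits: List[Dict[str, Any]],
    group_key: str,
    limit: int,
    group_size: int,
) -> Dict[str, List[Dict[str, Any]]]:
    """Keep up to `group_size` best hits for each of the `limit` best groups."""
    groups: Dict[str, List[Dict[str, Any]]] = {}
    # Walk hits from best to worst so groups are created in order of their top hit
    for hit in sorted(scored_hits, key=lambda h: h["score"], reverse=True):
        key = hit[group_key]
        if key not in groups:
            if len(groups) == limit:
                continue  # already have enough groups
            groups[key] = []
        if len(groups[key]) < group_size:
            groups[key].append(hit)
    return groups


# Invented hits: two of them come from the same module
sample_hits = [
    {"module": "map_index", "score": 0.90},
    {"module": "map_index", "score": 0.85},
    {"module": "other_module", "score": 0.80},
    {"module": "third_module", "score": 0.70},
]
for module, module_hits in group_hits(sample_hits, "module", limit=2, group_size=1).items():
    print(module, [h["score"] for h in module_hits])
```

With `group_size=1`, the second `map_index` hit is dropped, which frees a slot for a result from a different module.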