## Code Search with Vector Embeddings and Qdrant

*Authored by: [Qdrant Team](https://github.com/qdrant/qdrant)*

In this notebook, we demonstrate how to use vector embeddings to navigate a codebase and find relevant code snippets. We'll search the codebase with natural-language queries, and also look up code that implements similar logic.

You can check out the [live deployment](https://code-search.qdrant.tech/) of this approach which exposes the Qdrant codebase for search with a web interface.

### The approach

We use two models:

- A general-purpose neural encoder for Natural Language Processing (NLP), in our case `all-MiniLM-L6-v2` from the sentence-transformers library. We'll call this the NLP model.
- Specialized embeddings for code-to-code similarity search. We use the `jina-embeddings-v2-base-code` model. We'll call this the code model.

To prepare our code for `all-MiniLM-L6-v2`, we preprocess the code to text that more closely resembles natural language. The Jina embeddings model supports a variety of standard programming languages, so there is no need to preprocess the snippets. We can use the code as is.

### Data preparation

Chunking the application sources into smaller parts is a non-trivial task. In general, functions, class methods, structs, enums, and all the other language-specific constructs are good candidates for chunks. They are big enough to contain some meaningful information, but small enough to be processed by embedding models with a limited context window. You can also use docstrings, comments, and other metadata to enrich the chunks with additional information.
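As a rough illustration of construct-level chunking, the sketch below splits a Python source string into one chunk per function or class using the standard-library `ast` module. This is not how the Qdrant chunks were produced (those came from rust-analyzer). The `chunk_python_source` helper and its output fields are hypothetical and only loosely mirror the JSON structure used in this notebook.

```python
import ast
from typing import Any, Dict, List


def chunk_python_source(source: str, file_name: str) -> List[Dict[str, Any]]:
    """Split Python source into one chunk per function or class (hypothetical helper)."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(
                {
                    "name": node.name,
                    "code_type": type(node).__name__,
                    "docstring": ast.get_docstring(node) or "",
                    "line_from": node.lineno,
                    "line_to": node.end_lineno,
                    "file_name": file_name,
                    "snippet": ast.get_source_segment(source, node),
                }
            )
    # ast.walk does not visit nodes in source order, so sort by position
    return sorted(chunks, key=lambda c: c["line_from"])


example_source = '''
class Counter:
    """Counts points."""

    def count(self, points):
        """Return the number of points."""
        return len(points)


def total(values):
    return sum(values)
'''

example_chunks = chunk_python_source(example_source, "example.py")
for chunk in example_chunks:
    print(chunk["code_type"], chunk["name"], chunk["line_from"], chunk["line_to"])
```

Note that the method `count` becomes its own chunk in addition to being part of the `Counter` class chunk, so the same lines can end up in more than one chunk.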

<div style="text-align:center"><img src="https://huggingface.co/datasets/Anush008/cookbook-images/resolve/main/data-chunking.png" /></div>

NLP-based search is based on function signatures, but code search may return smaller pieces, such as loops. So, if we receive a particular function signature from the NLP model and part of its implementation from the code model, we merge the results.

### Parsing the Codebase

We'll use the [Qdrant codebase](https://github.com/qdrant/qdrant) for this demo.
While this codebase uses Rust, you can use this approach with any other language. You can use a [Language Server Protocol (LSP)](https://microsoft.github.io/language-server-protocol/) tool to build a graph of the codebase, and then extract chunks. We did our work with [rust-analyzer](https://rust-analyzer.github.io/). We exported the parsed codebase into the [LSIF](https://microsoft.github.io/language-server-protocol/specifications/lsif/0.4.0/specification/) format, a standard for code intelligence data. Next, we used the LSIF data to navigate the codebase and extract the chunks.
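To give a feel for LSIF data, the sketch below reads a few hand-written lines in the LSIF shape (one JSON object per line, each a vertex or an edge) and lists the ranges that each document contains. The sample lines are invented for illustration, and a real rust-analyzer dump contains many more vertex and edge kinds. This is not the extraction script used for this notebook.

```python
import json
from collections import defaultdict

# Hand-written sample in the LSIF shape: one JSON object per line.
sample_lsif = """\
{"id": 1, "type": "vertex", "label": "document", "uri": "file:///src/lib.rs", "languageId": "rust"}
{"id": 2, "type": "vertex", "label": "range", "start": {"line": 3, "character": 7}, "end": {"line": 3, "character": 12}}
{"id": 3, "type": "vertex", "label": "range", "start": {"line": 9, "character": 3}, "end": {"line": 9, "character": 8}}
{"id": 4, "type": "edge", "label": "contains", "outV": 1, "inVs": [2, 3]}
"""

vertices = {}
contains = defaultdict(list)
for line in sample_lsif.splitlines():
    element = json.loads(line)
    if element["type"] == "vertex":
        vertices[element["id"]] = element
    elif element["type"] == "edge" and element["label"] == "contains":
        # In this sample, a "contains" edge links a document to its ranges
        contains[element["outV"]].extend(element["inVs"])

# Map each document URI to the start lines of the ranges it contains
ranges_by_document = {
    vertices[doc_id]["uri"]: [vertices[r]["start"]["line"] for r in range_ids]
    for doc_id, range_ids in contains.items()
}
print(ranges_by_document)
```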

> For other languages, you can use the same approach. There are [plenty of implementations](https://microsoft.github.io/language-server-protocol/implementors/servers/) available.

We then exported the chunks into JSON documents with not only the code itself, but also context with the location of the code in the project.

You can examine the Qdrant structures, parsed in JSON, in the [structures.jsonl file](https://storage.googleapis.com/tutorial-attachments/code-search/structures.jsonl) in our Google Cloud Storage bucket. Download it and use it as a source of data for our code search.

In [None]:
!wget https://storage.googleapis.com/tutorial-attachments/code-search/structures.jsonl

Next, load the file and parse the lines into a list of dictionaries:

In [None]:
import json

structures = []
with open("structures.jsonl", "r") as fp:
    for row in fp:
        entry = json.loads(row)
        structures.append(entry)

Let's see what one entry looks like.

In [None]:
structures[0]

{'name': 'InvertedIndexRam',
 'signature': '# [doc = " Inverted flatten index from dimension id to posting list"] # [derive (Debug , Clone , PartialEq)] pub struct InvertedIndexRam { # [doc = " Posting lists for each dimension flattened (dimension id -> posting list)"] # [doc = " Gaps are filled with empty posting lists"] pub postings : Vec < PostingList > , # [doc = " Number of unique indexed vectors"] # [doc = " pre-computed on build and upsert to avoid having to traverse the posting lists."] pub vector_count : usize , }',
 'code_type': 'Struct',
 'docstring': '= " Inverted flatten index from dimension id to posting list"',
 'line': 15,
 'line_from': 13,
 'line_to': 22,
 'context': {'module': 'inverted_index',
  'file_path': 'lib/sparse/src/index/inverted_index/inverted_index_ram.rs',
  'file_name': 'inverted_index_ram.rs',
  'struct_name': None,
  'snippet': '/// Inverted flatten index from dimension id to posting list\n#[derive(Debug, Clone, PartialEq)]\npub struct InvertedIndexRam
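Before embedding, it may be worth checking how the chunks are distributed across construct types. The sketch below uses a small hand-written `sample_structures` list that mimics the field names of the entry above. In the notebook you would presumably pass the full `structures` list instead.

```python
from collections import Counter

# Hypothetical stand-in for `structures`, keeping only the fields used here
sample_structures = [
    {"name": "InvertedIndexRam", "code_type": "Struct", "context": {"module": "inverted_index"}},
    {"name": "count", "code_type": "Function", "context": {"module": "toc"}},
    {"name": "upsert_points", "code_type": "Function", "context": {"module": "collection_manager"}},
]

# Count how many chunks there are of each construct type
type_counts = Counter(entry["code_type"] for entry in sample_structures)
for code_type, count in type_counts.most_common():
    print(f"{code_type}: {count}")
```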

### Code to natural language conversion

Each programming language has its own syntax, which is not part of natural language, so a general-purpose model probably does not understand the code as is. We can, however, normalize the data by removing code specifics and including additional context, such as module, class, function, and file name. We took the following steps:

1. Extract the signature of the function, method, or other code construct.
2. Divide camel case and snake case names into separate words.
3. Take the docstring, comments, and other important metadata.
4. Build a sentence from the extracted data using a predefined template.
5. Remove the special characters and replace them with spaces.

As input, we expect dictionaries with the same structure as the entry shown above. We define a `textify` function to do the conversion, and use the `inflection` library to convert between naming conventions.
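Steps 2 and 5 can be illustrated without any third-party dependency. The following standard-library sketch only approximates what `inflection` does, and `split_identifier` is a hypothetical helper that the notebook itself does not use.

```python
import re


def split_identifier(name: str) -> str:
    """Split a camelCase or snake_case identifier into lowercase words."""
    # Insert a space at lower-to-upper boundaries: "IndexRam" -> "Index Ram"
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Replace underscores and any other non-alphanumeric characters with spaces
    spaced = re.sub(r"[^A-Za-z0-9]+", " ", spaced)
    return " ".join(spaced.lower().split())


print(split_identifier("InvertedIndexRam"))         # inverted index ram
print(split_identifier("hnsw_discover_precision"))  # hnsw discover precision
```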

In [None]:
%pip install inflection

We can now define the `textify` function:

In [None]:
import inflection
import re

from typing import Dict, Any

def textify(chunk: Dict[str, Any]) -> str:
    # Get rid of all the camel case / snake case
    # - inflection.underscore changes the camel case to snake case
    # - inflection.humanize converts the snake case to human readable form
    name = inflection.humanize(inflection.underscore(chunk["name"]))
    signature = inflection.humanize(inflection.underscore(chunk["signature"]))

    # Check if docstring is provided
    docstring = ""
    if chunk["docstring"]:
        docstring = f"that does {chunk['docstring']} "

    # Extract the location of that snippet of code
    context = (
        f"module {chunk['context']['module']} "
        f"file {chunk['context']['file_name']}"
    )
    if chunk["context"]["struct_name"]:
        struct_name = inflection.humanize(
            inflection.underscore(chunk["context"]["struct_name"])
        )
        context = f"defined in struct {struct_name} {context}"

    # Combine all the bits and pieces together
    text_representation = (
        f"{chunk['code_type']} {name} "
        f"{docstring}"
        f"defined as {signature} "
        f"{context}"
    )

    # Remove any special characters and concatenate the tokens
    tokens = re.split(r"\W", text_representation)
    tokens = filter(lambda x: x, tokens)
    return " ".join(tokens)

Now we can use `textify` to convert all chunks into text representations:

In [None]:
text_representations = list(map(textify, structures))

Let's see what one of our representations looks like:

In [None]:
text_representations[1000]

'Function Hnsw discover precision that does Checks discovery search precision when using hnsw index this is different from the tests in defined as Fn hnsw discover precision module integration file hnsw_discover_test rs'

Let's begin generating embeddings for our data. We'll use [FastEmbed](https://github.com/qdrant/fastembed), a CPU-first, lightweight library for generating vector embeddings.

In [None]:
%pip install https://github.com/qdrant/fastembed/archive/jina-embeddings-v2-base-code.zip

### Natural language embeddings

We can encode text representations through the [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model.



In [None]:
from fastembed import TextEmbedding

nlp_model = TextEmbedding("sentence-transformers/all-MiniLM-L6-v2")
nlp_embeddings = list(nlp_model.embed(text_representations, parallel=0))


### Code Embeddings

We'll be using [jinaai/jina-embeddings-v2-base-code](https://huggingface.co/jinaai/jina-embeddings-v2-base-code) for the task. It supports English and 30 widely used programming languages with an 8192-token sequence length.

In [None]:
code_snippets = [
    structure["context"]["snippet"] for structure in structures
]

code_model = TextEmbedding("jinaai/jina-embeddings-v2-base-code")

code_embeddings = list(code_model.embed(code_snippets, parallel=0))

### Building Qdrant collection

We use the `qdrant-client` library to interact with the Qdrant server. Let’s install that client:

In [None]:
%pip install qdrant-client

Qdrant supports multiple modes of deployment, including in-memory for prototyping, Docker, and Qdrant Cloud. You can refer to the [installation instructions](https://qdrant.tech/documentation/guides/installation/) for more information.

We'll continue the tutorial using an in-memory instance.

> NOTE: In-memory mode is only suitable for quick prototyping and tests. It is a Python-based implementation of the Qdrant server methods.

Let's create a collection to store our vectors.

In [None]:
from qdrant_client import QdrantClient, models

client = QdrantClient(location=":memory:")  # Use in-memory storage
# client = QdrantClient(location="localhost:6333")  # For Qdrant server

client.create_collection(
    "qdrant-sources",
    vectors_config={
        "text": models.VectorParams(
            size=nlp_embeddings[0].shape[0],
            distance=models.Distance.COSINE,
        ),
        "code": models.VectorParams(
            size=code_embeddings[0].shape[0],
            distance=models.Distance.COSINE,
        ),
    }
)
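Both named vectors use cosine distance, which compares the direction of two vectors and ignores their magnitude. For intuition, a plain-Python version of the score is sketched below. The vectors are toy values, not real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different magnitude: similarity close to 1
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors: similarity close to 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(same_direction, 6), round(orthogonal, 6))
```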

Our newly created collection is ready to accept the data. Let’s upload the embeddings:

In [None]:
import uuid

points = [
    models.PointStruct(
        id=uuid.uuid4().hex,
        vector={
            "text": text_embedding,
            "code": code_embedding,
        },
        payload=structure,
    )
    for text_embedding, code_embedding, structure in zip(nlp_embeddings, code_embeddings, structures)
]

client.upload_points("qdrant-sources", points=points, batch_size=64)
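Random `uuid4` IDs mean that re-running the upload creates new points instead of overwriting existing ones. If that matters, one option (our assumption, not something this notebook does) is to derive a stable ID from each chunk's location with `uuid.uuid5`. The `stable_point_id` helper below is hypothetical and assumes the `file_path`, `line_from`, and `line_to` fields shown earlier.

```python
import uuid
from typing import Any, Dict


def stable_point_id(chunk: Dict[str, Any]) -> str:
    """Derive a deterministic UUID from the chunk's file path and line range."""
    location = (
        f"{chunk['context']['file_path']}:"
        f"{chunk['line_from']}-{chunk['line_to']}"
    )
    return str(uuid.uuid5(uuid.NAMESPACE_URL, location))


example_chunk = {
    "line_from": 13,
    "line_to": 22,
    "context": {
        "file_path": "lib/sparse/src/index/inverted_index/inverted_index_ram.rs"
    },
}
# The same chunk always maps to the same ID
print(stable_point_id(example_chunk))
```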

The uploaded points are immediately available for search. Next, query the collection to find relevant code snippets.

### Querying the codebase

We use one of the models to search the collection via Qdrant's new [Query API](https://qdrant.tech/blog/qdrant-1.10.x/). We'll start with the text embeddings and run the following query: “How do I count points in a collection?”

In [None]:
query = "How do I count points in a collection?"

hits = client.query_points(
    "qdrant-sources",
    query=next(nlp_model.query_embed(query)).tolist(),
    using="text",
    limit=5,
).points


Now, review the results. The following table lists the module, file name,
and score of each hit. Each row also shows the signature as inline code,
linked to its location in the source file.

| module             | file_name           | score      | signature                                                                                                                                                                                                                                                                                 |
|--------------------|---------------------|------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| toc                | point_ops.rs        | 0.59448624 | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `pub async fn count`](https://github.com/qdrant/qdrant/blob/7aa164bd2dda1c0fc9bf3a0da42e656c95c2e52a/lib/storage/src/content_manager/toc/point_ops.rs#L120)                          |
| operations         | types.rs            | 0.5493385  | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `pub struct CountRequestInternal`](https://github.com/qdrant/qdrant/blob/7aa164bd2dda1c0fc9bf3a0da42e656c95c2e52a/lib/collection/src/operations/types.rs#L831)                       |
| collection_manager | segments_updater.rs | 0.5121002  | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `pub(crate) fn upsert_points<'a, T>`](https://github.com/qdrant/qdrant/blob/7aa164bd2dda1c0fc9bf3a0da42e656c95c2e52a/lib/collection/src/collection_manager/segments_updater.rs#L339) |
| collection         | point_ops.rs        | 0.5063539  | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `pub async fn count`](https://github.com/qdrant/qdrant/blob/7aa164bd2dda1c0fc9bf3a0da42e656c95c2e52a/lib/collection/src/collection/point_ops.rs#L213)                                |
| map_index          | mod.rs              | 0.49973983 | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `fn get_points_with_value_count<Q>`](https://github.com/qdrant/qdrant/blob/7aa164bd2dda1c0fc9bf3a0da42e656c95c2e52a/lib/segment/src/index/field_index/map_index/mod.rs#L88)          |
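The notebook does not show how the table above was produced. One plausible way to print the same columns from `hits` is sketched below. To keep the example runnable on its own, it uses `SimpleNamespace` objects as stand-ins for the scored points returned by the client, assuming each has a `score` attribute and a `payload` dictionary shaped like the entries in `structures`. The signatures in the sample are shortened.

```python
from types import SimpleNamespace

# Stand-ins for the points returned by `client.query_points(...).points`
sample_hits = [
    SimpleNamespace(
        score=0.59448624,
        payload={
            "signature": "pub async fn count",
            "context": {"module": "toc", "file_name": "point_ops.rs"},
        },
    ),
    SimpleNamespace(
        score=0.5493385,
        payload={
            "signature": "pub struct CountRequestInternal",
            "context": {"module": "operations", "file_name": "types.rs"},
        },
    ),
]


def format_hit(hit) -> str:
    """Render one hit as a Markdown table row."""
    context = hit.payload["context"]
    return (
        f"| {context['module']} | {context['file_name']} "
        f"| {hit.score:.4f} | `{hit.payload['signature']}` |"
    )


rows = [format_hit(hit) for hit in sample_hits]
print("| module | file_name | score | signature |")
print("|---|---|---|---|")
print("\n".join(rows))
```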

It seems we were able to find some relevant code structures. Let's try the same with the code embeddings:

In [None]:
hits = client.query_points(
    "qdrant-sources",
    query=next(code_model.query_embed(query)).tolist(),
    using="code",
    limit=5,
).points

Output:

| module        | file_name                  | score      | signature                                                                                                                                                                                                                                                                   |
|---------------|----------------------------|------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| field_index   | geo_index.rs               | 0.73278356 | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `fn count_indexed_points`](https://github.com/qdrant/qdrant/blob/7aa164bd2dda1c0fc9bf3a0da42e656c95c2e52a/lib/segment/src/index/field_index/geo_index.rs#L612)         |
| numeric_index | mod.rs                     | 0.7254976  | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `fn count_indexed_points`](https://github.com/qdrant/qdrant/blob/3fbe1cae6cb7f51a0c5bb4b45cfe6749ac76ed59/lib/segment/src/index/field_index/numeric_index/mod.rs#L322) |
| map_index     | mod.rs                     | 0.7124739  | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `fn count_indexed_points`](https://github.com/qdrant/qdrant/blob/3fbe1cae6cb7f51a0c5bb4b45cfe6749ac76ed59/lib/segment/src/index/field_index/map_index/mod.rs#L315)     |
| map_index     | mod.rs                     | 0.7124739  | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `fn count_indexed_points`](https://github.com/qdrant/qdrant/blob/3fbe1cae6cb7f51a0c5bb4b45cfe6749ac76ed59/lib/segment/src/index/field_index/map_index/mod.rs#L429)     |
| fixtures      | payload_context_fixture.rs | 0.706204   | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `fn total_point_count`](https://github.com/qdrant/qdrant/blob/3fbe1cae6cb7f51a0c5bb4b45cfe6749ac76ed59/lib/segment/src/fixtures/payload_context_fixture.rs#L122)       |

Although the scores retrieved by different models are not comparable, we can
see that the results are different. Code and text embeddings capture
different aspects of the codebase. We can query the collection with both models
and combine the results to get the most relevant code snippets from a single batch request.

In [None]:
from qdrant_client import models

hits = client.query_points(
    collection_name="qdrant-sources",
    prefetch=[
        models.Prefetch(
            query=next(nlp_model.query_embed(query)).tolist(),
            using="text",
            limit=5,
        ),
        models.Prefetch(
            query=next(code_model.query_embed(query)).tolist(),
            using="code",
            limit=5,
        ),
    ],
).points

Output:

| module             | file_name                  | score      | signature                                                                                                                                                                                                                                                                                 |
|--------------------|----------------------------|------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| toc                | point_ops.rs               | 0.59448624 | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `pub async fn count`](https://github.com/qdrant/qdrant/blob/7aa164bd2dda1c0fc9bf3a0da42e656c95c2e52a/lib/storage/src/content_manager/toc/point_ops.rs#L120)                          |
| operations         | types.rs                   | 0.5493385  | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `pub struct CountRequestInternal`](https://github.com/qdrant/qdrant/blob/7aa164bd2dda1c0fc9bf3a0da42e656c95c2e52a/lib/collection/src/operations/types.rs#L831)                       |
| collection_manager | segments_updater.rs        | 0.5121002  | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `pub(crate) fn upsert_points<'a, T>`](https://github.com/qdrant/qdrant/blob/7aa164bd2dda1c0fc9bf3a0da42e656c95c2e52a/lib/collection/src/collection_manager/segments_updater.rs#L339) |
| collection         | point_ops.rs               | 0.5063539  | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `pub async fn count`](https://github.com/qdrant/qdrant/blob/7aa164bd2dda1c0fc9bf3a0da42e656c95c2e52a/lib/collection/src/collection/point_ops.rs#L213)                                |
| map_index          | mod.rs                     | 0.49973983 | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `fn get_points_with_value_count<Q>`](https://github.com/qdrant/qdrant/blob/7aa164bd2dda1c0fc9bf3a0da42e656c95c2e52a/lib/segment/src/index/field_index/map_index/mod.rs#L88)          |
| field_index        | geo_index.rs               | 0.73278356 | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `fn count_indexed_points`](https://github.com/qdrant/qdrant/blob/7aa164bd2dda1c0fc9bf3a0da42e656c95c2e52a/lib/segment/src/index/field_index/geo_index.rs#L612)                       |
| numeric_index      | mod.rs                     | 0.7254976  | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `fn count_indexed_points`](https://github.com/qdrant/qdrant/blob/3fbe1cae6cb7f51a0c5bb4b45cfe6749ac76ed59/lib/segment/src/index/field_index/numeric_index/mod.rs#L322)               |
| map_index          | mod.rs                     | 0.7124739  | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `fn count_indexed_points`](https://github.com/qdrant/qdrant/blob/3fbe1cae6cb7f51a0c5bb4b45cfe6749ac76ed59/lib/segment/src/index/field_index/map_index/mod.rs#L315)                   |
| map_index          | mod.rs                     | 0.7124739  | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `fn count_indexed_points`](https://github.com/qdrant/qdrant/blob/3fbe1cae6cb7f51a0c5bb4b45cfe6749ac76ed59/lib/segment/src/index/field_index/map_index/mod.rs#L429)                   |
| fixtures           | payload_context_fixture.rs | 0.706204   | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `fn total_point_count`](https://github.com/qdrant/qdrant/blob/3fbe1cae6cb7f51a0c5bb4b45cfe6749ac76ed59/lib/segment/src/fixtures/payload_context_fixture.rs#L122)                     |

This is one example of how you can use different models and combine the results.
In a real-world scenario, you might run some reranking and deduplication, as
well as additional processing of the results.
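As one possible take on the reranking and deduplication step, the sketch below merges two ranked lists with reciprocal rank fusion (RRF), which relies on ranks only and therefore sidesteps the incomparable scores. It is a plain-Python illustration with made-up IDs. The `rrf_merge` helper and the constant `k=60` (a commonly used default) are our assumptions, not part of the notebook.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def rrf_merge(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Tuple[Hashable, float]]:
    """Fuse ranked ID lists; an ID found in several lists is kept once."""
    fused: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, point_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) to the ID's fused score
            fused[point_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up point IDs, ordered best-first as each model would return them
text_ranking = [
    "toc::count",
    "types::CountRequestInternal",
    "geo_index::count_indexed_points",
]
code_ranking = [
    "geo_index::count_indexed_points",
    "numeric_index::count_indexed_points",
]

merged = rrf_merge([text_ranking, code_ranking])
for point_id, fused_score in merged:
    print(f"{fused_score:.5f}  {point_id}")
```

Here the ID that appears in both lists ends up first, and no ID is repeated. Recent versions of Qdrant's Query API can also perform this fusion on the server, for example by passing `query=models.FusionQuery(fusion=models.Fusion.RRF)` together with `prefetch`; check the documentation for the version you run.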

### Grouping the results

You can improve the search results by grouping them by payload properties.
In our case, we can group the results by module. With code embeddings,
we see multiple results from the `map_index` module. Let's group the
results and allow a single result per module:

> NOTE: The query API doesn't support grouping yet. We'll use the older search API.

In [None]:
results = client.search_groups(
    "qdrant-sources",
    query_vector=(
        "code", next(code_model.query_embed(query)).tolist(),
    ),
    group_by="context.module",
    limit=5,
    group_size=1,
)

| module        | file_name                  | score      | signature                                                                                                                                                                                                                                                                   |
|---------------|----------------------------|------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| field_index   | geo_index.rs               | 0.73278356 | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `fn count_indexed_points`](https://github.com/qdrant/qdrant/blob/7aa164bd2dda1c0fc9bf3a0da42e656c95c2e52a/lib/segment/src/index/field_index/geo_index.rs#L612)         |
| numeric_index | mod.rs                     | 0.7254976  | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `fn count_indexed_points`](https://github.com/qdrant/qdrant/blob/3fbe1cae6cb7f51a0c5bb4b45cfe6749ac76ed59/lib/segment/src/index/field_index/numeric_index/mod.rs#L322) |
| map_index     | mod.rs                     | 0.7124739  | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `fn count_indexed_points`](https://github.com/qdrant/qdrant/blob/3fbe1cae6cb7f51a0c5bb4b45cfe6749ac76ed59/lib/segment/src/index/field_index/map_index/mod.rs#L315)     |
| fixtures      | payload_context_fixture.rs | 0.706204   | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `fn total_point_count`](https://github.com/qdrant/qdrant/blob/3fbe1cae6cb7f51a0c5bb4b45cfe6749ac76ed59/lib/segment/src/fixtures/payload_context_fixture.rs#L122)       |
| hnsw_index    | graph_links.rs             | 0.6998417  | [<img src="/documentation/tutorials/code-search/github-mark.png" width="16" style="display: inline"> `fn num_points `](https://github.com/qdrant/qdrant/blob/3fbe1cae6cb7f51a0c5bb4b45cfe6749ac76ed59/lib/segment/src/index/hnsw_index/graph_links.rs#L477)                 |

That concludes our tutorial. Thanks for taking the time to get here.