# Code search using embeddings
## Introduction
- Instead of sending the whole codebase, we can use embeddings to send only the relevant pieces to the LLM.
- This notebook shows how embeddings can be used to implement semantic code search.
- We implement a simple parser that extracts functions from Python files so that they can be embedded, indexed, and queried.

Inspired by: <https://github.com/openai/openai-cookbook/blob/main/examples/Code_search_using_embeddings.ipynb>

## Installation

In [1]:
%pip install -q pandas openai numpy

Note: you may need to restart the kernel to use updated packages.


## Parsing code files

We first set up some simple parsing functions that allow us to extract important information from our codebase.

In [2]:
import pandas as pd
from pathlib import Path

# We look for function definitions in Python files.
DEF_PREFIXES = ['def ', 'async def ']
NEWLINE = '\n'

def get_function_name(code):
    """
    Extract function name from a line beginning with 'def' or 'async def'.
    """
    for prefix in DEF_PREFIXES:
        if code.startswith(prefix):
            return code[len(prefix): code.index('(')]


def get_until_no_space(all_lines, i):
    """
    Get all lines until a line outside the function definition is found.
    """
    ret = [all_lines[i]]
    for j in range(i + 1, len(all_lines)):
        if len(all_lines[j]) == 0 or all_lines[j][0] in [' ', '\t', ')']:
            ret.append(all_lines[j])
        else:
            break
    return NEWLINE.join(ret)


def get_functions(filepath):
    """
    Get all functions in a Python file.
    """
    with open(filepath, 'r') as file:
        all_lines = file.read().replace('\r', NEWLINE).split(NEWLINE)
        for i, l in enumerate(all_lines):
            for prefix in DEF_PREFIXES:
                if l.startswith(prefix):
                    code = get_until_no_space(all_lines, i)
                    function_name = get_function_name(code)
                    yield {
                        'code': code,
                        'function_name': function_name,
                        'filepath': filepath,
                    }
                    break


def extract_functions_from_repo(code_root):
    """
    Extract all .py functions from the repository.
    """
    code_files = list(code_root.glob('**/*.py'))

    num_files = len(code_files)
    print(f'Total number of .py files: {num_files}')

    if num_files == 0:
        print('Verify that the repository exists and code_root is set correctly.')
        return None

    all_funcs = [
        func
        for code_file in code_files
        for func in get_functions(str(code_file))
    ]

    num_funcs = len(all_funcs)
    print(f'Total number of functions extracted: {num_funcs}')

    return all_funcs
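
The line-based parser above only matches definitions that start in the first column, so methods inside classes and nested functions are skipped. As a possible alternative, here is a minimal sketch that uses Python's standard `ast` module instead. `get_functions_ast` is a hypothetical helper, not part of the notebook, and it takes the source text directly rather than a file path.

```python
import ast


def get_functions_ast(source, filepath="<string>"):
    """
    Yield one record per function found in `source`, including
    methods and nested functions (hypothetical helper).
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            yield {
                # Exact source text of the definition, without decorators
                'code': ast.get_source_segment(source, node),
                'function_name': node.name,
                'filepath': filepath,
            }


# Made-up sample source with one top-level function and one async method
sample = (
    "def a():\n"
    "    return 1\n"
    "\n"
    "class C:\n"
    "    async def b(self):\n"
    "        pass\n"
)
found = list(get_functions_ast(sample))
```

Unlike the parser above, this approach needs each file to be syntactically valid Python, since `ast.parse` raises `SyntaxError` otherwise.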

## Read our code repository

We'll first load the `repo` folder and extract the needed information using the functions we defined above.

In [3]:
code_root = Path("./repo")
#code_root = root_dir / 'repo'
all_funcs = extract_functions_from_repo(code_root)

Total number of .py files: 3
Total number of functions extracted: 3


## Calculate code embeddings

Now that we have our content, we can pass the data to the `text-embedding-3-small` embedding model and get back our vector embeddings.

In [4]:
from openai import OpenAI
client = OpenAI()

def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input = [text], model=model).data[0].embedding

embedding_model = 'text-embedding-3-small'

# we turn the list of all functions into a dataframe
df = pd.DataFrame(all_funcs)
df

| | code | function_name | filepath |
|---|---|---|---|
| 0 | `def main():\n hello("Hello, World!")\n\n` | main | repo/main.py |
| 1 | `def hello(text):\n """\n This function p...` | hello | repo/lib/hello.py |
| 2 | `def hello_patrick():\n # Print the message\...` | hello_patrick | repo/lib/hello.py |
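
`get_embedding` above sends one request per function. The embeddings endpoint also accepts a list of inputs, so a batched variant is possible. The sketch below assumes `client` is an `OpenAI()` instance like the one created above; `get_embeddings_batch` is a hypothetical name, and a large repository would likely need to be split into several batches to stay within request limits.

```python
def get_embeddings_batch(client, texts, model="text-embedding-3-small"):
    """
    Embed several texts with a single request (hypothetical helper).
    Returns one embedding per input, in the same order as `texts`.
    """
    # Same newline handling as get_embedding above
    cleaned = [text.replace("\n", " ") for text in texts]
    response = client.embeddings.create(input=cleaned, model=model)
    # Each returned item carries the index of the input it belongs to
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

With the dataframe from this notebook, the call might look like `df['code_embedding'] = get_embeddings_batch(client, df['code'].tolist())`.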


## Calculate embeddings

In [5]:
# We calculate the embeddings for each function
# and save them to the dataframe
df['code_embedding'] = df['code'].apply(lambda x: get_embedding(x, model=embedding_model))

# we make each file path relative to the code root
df['filepath'] = df['filepath'].map(lambda x: Path(x).relative_to(code_root))

# Optionally, save the dataframe to a CSV file
#df.to_csv("data/code_search_openai-python.csv", index=False)
df.head()

| | code | function_name | filepath | code_embedding |
|---|---|---|---|---|
| 0 | `def main():\n hello("Hello, World!")\n\n` | main | main.py | [0.03210125491023064, -0.01357596181333065, 0.... |
| 1 | `def hello(text):\n """\n This function p...` | hello | lib/hello.py | [0.024970512837171555, -0.0030333898030221462,... |
| 2 | `def hello_patrick():\n # Print the message\...` | hello_patrick | lib/hello.py | [0.028183333575725555, -0.01952880620956421, 0... |
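
If the commented-out `to_csv` call is enabled, keep in mind that CSV stores each embedding list as a string, so the column has to be parsed again after loading. Below is a minimal sketch of one way to round-trip it, assuming the embeddings are plain lists of Python floats. `save_embeddings_csv` and `load_embeddings_csv` are hypothetical helpers, the vectors are made up, and the in-memory buffer stands in for a real file path.

```python
import io
import json

import pandas as pd


def save_embeddings_csv(df, target):
    """Write df to CSV with the embedding column serialized as JSON text."""
    out = df.copy()
    out['code_embedding'] = out['code_embedding'].apply(json.dumps)
    out.to_csv(target, index=False)


def load_embeddings_csv(source):
    """Read the CSV back and turn the JSON text into lists of floats."""
    df = pd.read_csv(source)
    df['code_embedding'] = df['code_embedding'].apply(json.loads)
    return df


# Toy data for illustration only
toy = pd.DataFrame({
    'function_name': ['main', 'hello'],
    'code_embedding': [[0.1, -0.2, 0.3], [0.0, 0.5, -0.5]],
})
buffer = io.StringIO()
save_embeddings_csv(toy, buffer)
buffer.seek(0)
restored = load_embeddings_csv(buffer)
```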


## Similarity score using cosine

In [6]:
import numpy as np
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
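
To get a feel for the scores, here is a small worked example with hand-made 2-D vectors (not real embeddings). The function is repeated so that the snippet runs on its own.

```python
import numpy as np


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Same direction, different length: the score ignores vector length
same_direction = cosine_similarity([1, 0], [2, 0])   # 1.0
# Perpendicular vectors: no similarity
orthogonal = cosine_similarity([1, 0], [0, 1])       # 0.0
# Opposite directions: lowest possible score
opposite = cosine_similarity([1, 0], [-1, 0])        # -1.0
```

Scores always fall between -1 and 1, and a higher score means the two vectors point in more similar directions.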

## Search code function

We define a `search_functions` function that takes our dataframe of embeddings, a query string, and some other configuration options. Searching works as follows:

1. We first embed our query string (`code_query`) with `text-embedding-3-small`. The reasoning here is that a query string like 'a function that reverses a string' and a function like 'def reverse(string): return string[::-1]' will be very similar when embedded.
2. We then calculate the cosine similarity between our query embedding and every function embedding in the dataframe. This gives a similarity score for each function, where a higher score means a closer match.
3. We finally sort the functions by similarity, highest first, and return the number of results requested in the function parameters.
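
Steps 2 and 3 can be sketched without calling the API by using hand-made vectors in place of real embeddings. `rank_by_similarity` and the toy vectors are hypothetical and only illustrate the scoring and sorting.

```python
import numpy as np


def rank_by_similarity(query_vec, doc_vecs, n=3):
    """
    Return up to n (row_index, score) pairs, best match first
    (hypothetical helper).
    """
    docs = np.asarray(doc_vecs, dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # Step 2: cosine similarity between the query and every row
    scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    # Step 3: sort by score, highest first, and keep the top n
    order = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-D "embeddings"; the query points mostly along the first axis
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranking = rank_by_similarity([1.0, 0.1], doc_vecs, n=2)
```

Here `ranking` lists row 0 first and row 2 second, because those two vectors point in the direction closest to the query.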

In [7]:
def search_functions(df, code_query, n=3, pprint=True, n_lines=7):

    # calculate the embedding for our query with the same model used for the code
    embedding = get_embedding(code_query, model=embedding_model)

    # now calculate the cosine similarity
    df['similarities'] = df.code_embedding.apply(lambda x: cosine_similarity(x, embedding))

    # sort the dataframe by similarity
    res = df.sort_values('similarities', ascending=False).head(n)

    # print the top results unless pprint is disabled
    if pprint:
        for _, row in res.iterrows():
            print(f"{row.filepath}:{row.function_name}  score={round(row.similarities, 3)}")
            print("\n".join(row.code.split("\n")[:n_lines]))
            print('-' * 70)

    return res

## Example code queries

In [8]:
res = search_functions(df, 'hello world greeting', n=3)

main.py:main  score=0.637
def main():
    hello("Hello, World!")


----------------------------------------------------------------------
lib/hello.py:hello  score=0.539
def hello(text):
    """
    This function prints 'Hello, World!' to the console.
    """
    # Print the message
    if text == "patrick":
        hello_patrick()
----------------------------------------------------------------------
lib/hello.py:hello_patrick  score=0.471
def hello_patrick():
    # Print the message
    print("Oh no ! Not Patrick!")

----------------------------------------------------------------------


In [9]:
res = search_functions(df, 'printing hello_patrick that says oh no', n=3, n_lines=20)

lib/hello.py:hello_patrick  score=0.753
def hello_patrick():
    # Print the message
    print("Oh no ! Not Patrick!")

----------------------------------------------------------------------
lib/hello.py:hello  score=0.484
def hello(text):
    """
    This function prints 'Hello, World!' to the console.
    """
    # Print the message
    if text == "patrick":
        hello_patrick()
    else:
        print(text)

----------------------------------------------------------------------
main.py:main  score=0.383
def main():
    hello("Hello, World!")


----------------------------------------------------------------------


## Extra
- <https://blog.voyageai.com/2024/01/23/voyage-code-2-elevate-your-code-retrieval/>