# Code Indexing

This notebook indexes source code in a vector store (here, Redis).

## 1. Set up the OpenAI API

First, we set up the OpenAI API client and the embeddings model.

>`max_retries` is raised to 50 so that embedding requests still succeed when the OpenAI API throttles us to a few calls per minute.

In [4]:
import os
import openai

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.redis import Redis

API_KEY = os.getenv('AZURE_OPENAI_API_KEY') 
RESOURCE_ENDPOINT = os.getenv('AZURE_OPENAI_ENDPOINT')

openai.api_type = 'azure'
openai.api_key = API_KEY if API_KEY else ''
openai.api_base = RESOURCE_ENDPOINT if RESOURCE_ENDPOINT else 'https://<your-api-name>.openai.azure.com'
openai.api_version = '2023-03-15-preview'

os.environ["OPENAI_API_TYPE"] = openai.api_type
os.environ["OPENAI_API_BASE"] = openai.api_base
os.environ["OPENAI_API_KEY"] = openai.api_key
os.environ["OPENAI_API_VERSION"] = openai.api_version

embeddings_model_name = 'text-embedding-ada-002'
embeddings = OpenAIEmbeddings(model=embeddings_model_name, chunk_size=1, max_retries=50)

# The GPT model name is used for testing purposes only
gpt_model_name = '<Add your gpt model name here>'

## 3. Index the code

We index the code in the vector store.

### 3.1. Set up the indexing

We set up the vector store where the code will be loaded, the code folder to be indexed, and the filtering and chunking parameters.

In [6]:
# Name of the index that contains the embeddings
index_name = "<insert index name>"
# URL of the Redis Stack
index_url = "redis://<login>:<password>@<host>:6379"

# Root directory where the code is located
code_root_directory = 'outputs/<add your path here>'

# Keywords that identify folders to ignore (matched as substrings of the path)
ignore_paths = [
    '.git',
    '.vscode',
    'target',
    'node_modules',
    'build'
]

# Names of the files to ignore (exact match)
ignore_files = [
    '.gitignore'
]

# File extensions to ignore
ignore_extensions = [
    '.class'
]

# The size of the chunks
chunk_size = 2000

# The overlap between chunks
chunk_overlap = 0

### 3.2. Common functions

The functions below check whether a path, a file name, or a file extension must be ignored.

In [4]:
def is_path_ignored(path):
    """Checks if the path contains one of the ignored keywords.

    Parameters
    ----------
    path : str
        The path to be checked.

    Returns
    -------
    bool
        True if any entry of the ignore list occurs in the path, False otherwise.
    """
    for ignore_path in ignore_paths:
        if ignore_path in path:
            return True
    return False

def is_file_ignored(filename):
    """Checks if the file is in the ignore list or not.

    Parameters
    ----------
    filename : str
        The name of the file to be checked.

    Returns
    -------
    bool
        True if the file name is in the ignore list, False otherwise.
    """
    for ignore_file in ignore_files:
        if filename == ignore_file:
            return True
    return False

def is_extension_ignored(filename):
    """Checks if the file extension is in the ignore list or not.

    Parameters
    ----------
    filename : str
        The name of the file to be checked.

    Returns
    -------
    bool
        True if the file extension is in the ignore list, False otherwise.
    """
    for ignore_extension in ignore_extensions:
        if filename.endswith(ignore_extension):
            return True
    return False
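One consequence of the substring test in `is_path_ignored` is that a keyword such as `build` also matches a folder named `rebuild-tools`, and `.git` also matches `.github`. If that is not wanted, a stricter variant could compare whole path components instead. The sketch below is one possible way to do it; `is_path_ignored_strict` is a hypothetical name, not part of the notebook.

```python
from pathlib import PurePath

ignore_paths = ['.git', '.vscode', 'target', 'node_modules', 'build']


def is_path_ignored_strict(path, ignored=ignore_paths):
    """Return True only if a whole path component equals an ignored name."""
    return any(part in ignored for part in PurePath(path).parts)


print(is_path_ignored_strict('project/build/classes'))   # True
print(is_path_ignored_strict('project/rebuild-tools'))   # False
```

The original substring check is more permissive, which may be exactly what is wanted for quick filtering; the strict form only helps when folder names overlap with the keywords.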

### 3.3. Get the documents

Use a text loader to get the documents from the code folder.

In [5]:
docs = []

for dirpath, dirnames, filenames in os.walk(code_root_directory):
    if is_path_ignored(dirpath):
        continue
    for filename in filenames:
        if is_file_ignored(filename) or is_extension_ignored(filename):
            continue
        file_path = os.path.join(dirpath, filename)
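        try:
            loader = TextLoader(file_path, encoding='utf-8')
            # Load each file whole; the chunking is done in section 3.4
            documents = loader.load()
            for document in documents:
                document.metadata = {
                    "source": f"{dirpath}/{filename}",
                    "fileName": filename
                }
            docs.extend(documents)
        except Exception as e:
            # Skip files that cannot be read as UTF-8 text, but report them
            print(f"Skipping {file_path}: {e}")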
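The loop above skips the files of an ignored folder, but `os.walk` still descends into it, which can be slow for large trees such as `node_modules`. A possible refinement, sketched below under the assumption that ignored folders are matched by exact name, prunes `dirnames` in place so the walk never enters them. The generator name `iter_code_files` is hypothetical.

```python
import os
import tempfile

ignore_paths = ['.git', 'node_modules']


def iter_code_files(root, ignored_dirs=ignore_paths):
    """Yield file paths under root, skipping ignored directories entirely."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place tells os.walk not to descend into them
        dirnames[:] = [d for d in dirnames if d not in ignored_dirs]
        for filename in filenames:
            yield os.path.join(dirpath, filename)


# Small demonstration on a throwaway directory tree
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'src'))
    os.makedirs(os.path.join(root, 'node_modules', 'pkg'))
    open(os.path.join(root, 'src', 'main.py'), 'w').close()
    open(os.path.join(root, 'node_modules', 'pkg', 'index.js'), 'w').close()
    found = sorted(os.path.relpath(p, root) for p in iter_code_files(root))
    print(found)  # only the file under src is listed
```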

### 3.4. Split the documents into chunks

We split the documents into chunks of at most 2000 characters (`chunk_size`), with no overlap between chunks (`chunk_overlap`), using the recursive character text splitter.

In [6]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
texts = text_splitter.split_documents(docs)
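To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately naive sketch that cuts a string into fixed-size character windows. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which tries to break at separators such as paragraph and line boundaries before falling back to smaller units; the function `split_fixed` is hypothetical and only illustrates the two parameters.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be >= 0 and smaller than chunk_size')
    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


print(split_fixed('abcdefghij', chunk_size=4))
# ['abcd', 'efgh', 'ij']
print(split_fixed('abcdefghij', chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With `chunk_overlap = 0`, as configured in this notebook, every character lands in exactly one chunk; a non-zero overlap repeats some context at chunk boundaries at the cost of more embeddings.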

### 3.5. Index the chunks

We index the chunks in the vector store.

In [7]:
redis = Redis.from_documents(texts, embeddings, redis_url=index_url, index_name=index_name)

## 4. Test the indexing

Test the index by asking the chat model a question about the purpose of the project.

In [7]:
from langchain.chat_models import AzureChatOpenAI
from langchain.chains import ConversationalRetrievalChain

retriever = redis.as_retriever()
retriever.k = 20

chat_model = AzureChatOpenAI(deployment_name=gpt_model_name, temperature=0.0, model_kwargs={ 'top_p': 1.0 })
qa = ConversationalRetrievalChain.from_llm(chat_model, retriever=retriever, return_source_documents=True)

question = 'What is the purpose of the project?'
response = qa({'question': question, 'chat_history': []})
print(f"-> **Question**: {question} \n")
print(f"**Answer**: {response['answer']} \n")
sources = response['source_documents']
# Print the first six sources
for source in sources[:6]:
    print(f"**Source**: {source.metadata} \n")
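Because a file is split into several chunks, the same file can show up more than once among the returned sources. A small helper can collapse them to distinct file paths. The sketch below works on plain metadata dictionaries shaped like the ones built in section 3.3; `unique_sources` is a hypothetical name.

```python
def unique_sources(metadatas, limit=6):
    """Return up to limit distinct 'source' values, in first-seen order."""
    seen = []
    for metadata in metadatas:
        source = metadata.get('source')
        if source is not None and source not in seen:
            seen.append(source)
        if len(seen) == limit:
            break
    return seen


# Stand-in metadata, shaped like the dictionaries attached to each chunk
example = [
    {'source': 'app/Main.java', 'fileName': 'Main.java'},
    {'source': 'app/Main.java', 'fileName': 'Main.java'},
    {'source': 'README.md', 'fileName': 'README.md'},
]
print(unique_sources(example))  # ['app/Main.java', 'README.md']
```

In the notebook it could be called as `unique_sources(doc.metadata for doc in response['source_documents'])`.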