# Codebase Understanding Using LLMs
Explore how LangChain, Deep Lake, and GPT-4 can transform our understanding of complex codebases, such as Twitter's open-sourced recommendation algorithm.

---

## Introduction
This lesson shows how LangChain, Deep Lake, and GPT-4 can help us understand complex codebases such as Twitter's open-sourced recommendation algorithm. The approach lets us put questions directly to the source code, which speeds up code comprehension.

You'll learn how to index a codebase, store embeddings and code in Deep Lake, set up a Conversational Retriever Chain, and ask questions of the codebase in natural language.

## Workflow

This guide walks through understanding source code with LangChain in three steps:
1. Index a codebase: clone the repository, parse the code, divide it into chunks, and embed the chunks with OpenAI or HuggingFace embeddings.
2. Establish a Conversational Retriever Chain: load the dataset, set up the retriever, and connect it to a language model such as GPT-4 for question answering.
3. Query the codebase in natural language and retrieve answers. The guide ends with a demonstration that asks several questions about the indexed codebase.

## Setup

In [1]:
import openai
import os
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())
openai.api_type = os.environ.get("OPENAI_API_TYPE")
openai.api_base = os.environ.get("OPENAI_API_BASE")
openai.api_version = os.environ.get("OPENAI_API_VERSION")
openai.api_key = os.environ.get("OPENAI_API_KEY")
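Because `os.environ.get` returns `None` for a variable that is not set, a missing entry in the `.env` file only shows up later as an authentication or routing error. A minimal sketch of an early check follows; the helper name `missing_env_vars` is hypothetical, and the variable names are simply the ones read in the cell above.

```python
import os

# Variable names copied from the setup cell; adjust if your configuration differs.
REQUIRED_VARS = [
    "OPENAI_API_TYPE",
    "OPENAI_API_BASE",
    "OPENAI_API_VERSION",
    "OPENAI_API_KEY",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv` makes a misconfigured environment visible before any API call is made.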

## Building the system

### 1. Indexing the Twitter Algorithm Code Base (Optional)

You can skip this part and use an already indexed dataset, as this example does. To index the codebase yourself, clone the repository, parse the code, break it into chunks, and compute embeddings with OpenAI or HuggingFace. Start by cloning the repository:

In [None]:
!git clone https://github.com/twitter/the-algorithm ../../temp/the-algorithm

Next, load all files inside the repository.

In [3]:
import os
from langchain.document_loaders import TextLoader

root_dir = "../../temp/the-algorithm"
docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        try:
            loader = TextLoader(os.path.join(dirpath, file), encoding="utf-8")
            docs.extend(loader.load_and_split())
        except Exception:
            # Skip files that cannot be read as UTF-8 text (for example, binaries).
            pass
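The loop above silently drops every file that fails to load, so it is hard to tell afterwards how much of the repository was indexed. The sketch below is one way to make that visible, using only the standard library instead of `TextLoader`. The function names and the extension list are assumptions chosen for illustration, not part of LangChain or of the original notebook.

```python
import os


def collect_source_files(root_dir, extensions=(".scala", ".java", ".py")):
    """Walk root_dir and return sorted paths whose extension is in `extensions`."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if os.path.splitext(name)[1] in extensions:
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def read_text_files(paths):
    """Read each path as UTF-8; return (texts by path, list of skipped paths)."""
    texts, skipped = {}, []
    for path in paths:
        try:
            with open(path, encoding="utf-8") as f:
                texts[path] = f.read()
        except (UnicodeDecodeError, OSError):
            # Record the failure instead of discarding it silently.
            skipped.append(path)
    return texts, skipped


paths = collect_source_files("../../temp/the-algorithm")
texts_by_path, skipped = read_text_files(paths)
print(f"read {len(texts_by_path)} files, skipped {len(skipped)}")
```

Restricting the walk to known source extensions also keeps binaries and build artifacts out of the index, at the cost of missing file types that are not listed.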

Subsequently, divide the loaded files into chunks:

In [None]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
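To build intuition for what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-window chunker in plain Python. It is only an illustration: `CharacterTextSplitter` splits on a separator and then merges the pieces, so its chunk boundaries will generally differ from the ones produced here. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=0):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=0))
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With no overlap each character lands in exactly one chunk. A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when a function or comment straddles a boundary but increases the number of embeddings to compute.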

Perform the indexing process. Computing the embeddings and uploading them to Activeloop can take a long time. Afterward, you can publish the dataset publicly:

In [None]:
from langchain.vectorstores import DeepLake
from langchain.embeddings import HuggingFaceEmbeddings

my_activeloop_org_id = os.environ.get("ACTIVELOOP_ORG_ID")
my_activeloop_dataset_name = "twitter-algorithm"

dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=HuggingFaceEmbeddings())
db.add_documents(texts)

If the dataset has already been created, you can load it later without recomputing the embeddings, as shown in the next section.

### 2. Conversational Retriever Chain

First, load the dataset, establish the retriever, and create the Conversational Chain:

In [6]:
from langchain.vectorstores import DeepLake
from langchain.embeddings import HuggingFaceEmbeddings

my_activeloop_org_id = os.environ.get("ACTIVELOOP_ORG_ID")
# my_activeloop_org_id = "davitbun" # contains the already embedded dataset
my_activeloop_dataset_name = "twitter-algorithm"

dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

db = DeepLake(
    dataset_path=dataset_path,
    read_only=True,
    embedding_function=HuggingFaceEmbeddings(),
)

Deep Lake Dataset in hub://iamrk04/twitter-algorithm already exists, loading from the storage


Next, set up the retriever:

In [7]:
retriever = db.as_retriever()
retriever.search_kwargs["distance_metric"] = "cos"
retriever.search_kwargs["fetch_k"] = 100
retriever.search_kwargs["maximal_marginal_relevance"] = True
retriever.search_kwargs["k"] = 10
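The settings above fetch 100 candidates (`fetch_k`), keep 10 of them (`k`), and enable maximal marginal relevance (MMR). MMR is usually described as a greedy selection that prefers results relevant to the query but not redundant with results already chosen. The pure-Python sketch below illustrates that idea on toy vectors. It is not Deep Lake's or LangChain's implementation, and `mmr_select` and `lambda_mult` are names chosen here for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k, lambda_mult=0.5):
    """Return indices of up to k candidates chosen greedily by MMR.

    lambda_mult=1.0 ranks by relevance only; lower values penalize
    similarity to candidates that were already selected.
    """
    relevance = [cosine_similarity(query, c) for c in candidates]
    remaining = list(range(len(candidates)))
    selected = []

    def score(i):
        redundancy = max(
            (cosine_similarity(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Two near-duplicate candidates (0 and 1) and one distinct candidate (2).
toy_candidates = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]
print(mmr_select([1.0, 1.0], toy_candidates, k=2, lambda_mult=0.5))
print(mmr_select([1.0, 1.0], toy_candidates, k=2, lambda_mult=1.0))
```

For a codebase this matters because many chunks are near-duplicates (generated code, repeated boilerplate), and pure similarity ranking can fill all ten slots with variations of the same snippet.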

You can also define custom filtering functions using Deep Lake filters:

In [8]:
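def filter_fn(x):
    # Named filter_fn to avoid shadowing Python's built-in filter().
    # Drop chunks whose text mentions com.google.
    if "com.google" in x["text"].data()["value"]:
        return False
    metadata = x["metadata"].data()["value"]
    # Substring match on the source path.
    return "scala" in metadata["source"] or "py" in metadata["source"]


# Uncomment the following line to apply custom filtering
# retriever.search_kwargs['filter'] = filter_fn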
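The filter above tests for substrings anywhere in the source path, so `"py"` also matches paths such as `copy/notes.md`. If the intent is to keep only Scala and Python source files, matching on the file extension is stricter. The helper below is a hypothetical sketch that works on plain path strings; it could be called from inside the filter with `metadata["source"]` as its argument.

```python
import os

# Extensions to keep; an assumption based on the substrings used in the filter above.
ALLOWED_EXTENSIONS = {".scala", ".py"}


def has_allowed_extension(source_path, allowed=ALLOWED_EXTENSIONS):
    """Return True if the path's file extension is one of the allowed ones."""
    return os.path.splitext(source_path)[1].lower() in allowed


for path in ["src/scala/Ranker.scala", "tools/train.py", "copy/notes.md"]:
    print(path, "py" in path or "scala" in path, has_allowed_extension(path))
```

The substring test is looser by design and also keeps build files that live under a `scala/` directory, so which variant is appropriate depends on what you want the retriever to see.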

Connect to GPT-4 for question answering. This example uses an Azure OpenAI deployment named `gpt4`:

In [9]:
from langchain.chat_models import AzureChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = AzureChatOpenAI(deployment_name="gpt4", temperature=0.1)
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)
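The chain is called with a dictionary holding the current `question` and the `chat_history` so far, stored as a list of `(question, answer)` tuples. Passing the history is what allows a follow-up question to refer back to an earlier one. The sketch below shows only that bookkeeping, with a hypothetical stand-in `fake_qa` in place of the real chain so that it runs without any API access.

```python
def fake_qa(inputs):
    """Hypothetical stand-in for the chain: reports how many prior turns it saw."""
    turns = len(inputs["chat_history"])
    return {"answer": f"answer to {inputs['question']!r} given {turns} earlier turn(s)"}


def ask_all(qa_fn, questions):
    """Ask each question in order, feeding the accumulated history back in."""
    chat_history = []
    for question in questions:
        result = qa_fn({"question": question, "chat_history": chat_history})
        chat_history.append((question, result["answer"]))
    return chat_history


history = ask_all(fake_qa, ["What does favCountParams do?", "Is it Likes + Bookmarks?"])
for question, answer in history:
    print(f"{question} -> {answer}")
```

Resetting `chat_history` to an empty list between unrelated questions keeps earlier turns from influencing how a new question is interpreted.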

### 3. Ask Questions to the Codebase in Natural Language

Define all the juicy questions you want to be answered:

In [12]:
questions = [
    "What does favCountParams do?",
    "is it Likes + Bookmarks, or not clear from the code?",
    "What are the major negative modifiers that lower your linear ranking parameters?",
    # "How do you get assigned to SimClusters?",
    # "What is needed to migrate from one SimClusters to another SimClusters?",
    # "How much do I get boosted within my cluster?",
    "How does Heavy ranker work. what are it’s main inputs?",
    # "How can one influence Heavy ranker?",
    # "why threads and long tweets do so well on the platform?",
    # "Are thread and long tweet creators building a following that reacts to only threads?",
    "Do you need to follow different strategies to get most followers vs to get most likes and bookmarks per tweet?",
    # "Content meta data and how it impacts virality (e.g. ALT in images).",
    "What are some unexpected fingerprints for spam factors?",
    # "Is there any difference between company verified checkmarks and blue verified individual checkmarks?",
]
chat_history = []

for question in questions:
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What does favCountParams do? 

**Answer**: I couldn't find any information about "favCountParams" in the provided context. Please provide more information or context about "favCountParams" to help me understand its purpose and functionality. 

-> **Question**: is it Likes + Bookmarks, or not clear from the code? 

**Answer**: It is not clear from the code provided whether favCountParams is the sum of Likes and Bookmarks. 

-> **Question**: What are the major negative modifiers that lower your linear ranking parameters? 

**Answer**: Based on the provided context, there are a few negative modifiers that can lower linear ranking parameters:

1. Fatigue Factor: In the `ImpressionBasedFatigueRanker` class, the fatigue factor is used to penalize candidates based on their recent impressions. The higher the fatigue factor, the more the candidate's rank is penalized due to recent impressions.

2. AuthorSensitiveScoreWeightInReranking: This parameter is used to adjust the weigh