# Code Understanding


## Use case

--- 

We can use LLMs for interpreting and analyzing a code base from a specific repository.

Some use cases include:

- Q&A over the code base to understand how it works
- Using LLMs for suggesting refactors or improvements
- Using LLMs for documenting the code

## Overview

--- 

The pipeline for building a solution for QA over a code base is:

1. `Load the code base:` Load the source files into documents.
2. `Split the code base:` Split the documents into chunks.
3. `Store the code base:` Code snippets are embedded with an embedding model (`OpenAIEmbeddings` in this walkthrough) and stored in a vector database.
4. `Construct the retriever:` The `ConversationalRetrievalChain` searches the vector store to identify the code snippets most relevant to a given query.
5. `Ask questions about the code:` Define a list of questions about the code base, then use the `ConversationalRetrievalChain` to answer them. The LLM (`gpt-3.5-turbo` in this walkthrough, with `gpt-4` as an alternative) generates answers from the retrieved code snippets and the conversation history.


## Prerequisites

--- 

For this use case, we will build a **code understanding solution for the actual LangChain repository.**

First, get required packages and set environment variables:

In [29]:
# !python3 -m pip install --upgrade langchain chromadb openai

import os

os.environ['OPENAI_API_KEY'] = "<your-openai-api-key>"  # placeholder; never commit a real key
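
Hard-coding an API key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key has already been exported in the shell that starts the notebook; `require_env` is a hypothetical helper, not part of LangChain or the OpenAI client:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it in your shell before starting the notebook"
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` in the first cell fails early with a readable error instead of surfacing later as an authentication failure.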

## Step 1: Load the code base

---

We will load all Python project files using the `langchain.document_loaders.TextLoader`.

The following script walks the LangChain repository and loads every `.py` file as one or more **documents**:

In [30]:
from langchain.document_loaders import TextLoader

root_dir = "../../../.."

docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        # Load every Python file, skipping anything inside a virtual environment.
        if file.endswith(".py") and "/.venv/" not in dirpath:
            try:
                loader = TextLoader(os.path.join(dirpath, file), encoding="utf-8")
                docs.extend(loader.load_and_split())
            except Exception:
                # Skip files that cannot be read or decoded as UTF-8.
                pass
print(len(docs))

2392
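
The loop above silently drops any file that fails to load, and its `"/.venv/"` substring check depends on the path separator. A stdlib-only sketch of the same traversal that prunes skipped directories and reports failures; `collect_python_files` and `read_sources` are hypothetical names, and this is an illustration rather than a replacement for `TextLoader`:

```python
import os
from typing import Dict, List, Tuple


def collect_python_files(
    root_dir: str, skip_dirs: Tuple[str, ...] = (".venv", ".git")
) -> List[str]:
    """Return sorted paths of all .py files under root_dir, pruning skipped directories."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            if name.endswith(".py"):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def read_sources(paths: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Read each file as UTF-8; return (path -> text, paths that could not be read)."""
    sources: Dict[str, str] = {}
    failed: List[str] = []
    for path in paths:
        try:
            with open(path, encoding="utf-8") as handle:
                sources[path] = handle.read()
        except (OSError, UnicodeDecodeError):
            failed.append(path)  # record the failure instead of hiding it
    return sources, failed
```

Returning the list of failed paths makes it possible to check how much of the repository was actually indexed.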


### Go deeper

- You can check **code specific** document loaders [here](docs/integrations/document_loaders/source_code)

- In this case we are analyzing a code base within this same repo, but you can analyze any repo using:

In [31]:
# !git clone https://github.com/twitter/the-algorithm # replace with any repository of your choice

## Step 2: Split the code base

---

Split each `Document` into chunks for embedding and vector storage.

In [32]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)

Created a chunk of size 1009, which is longer than the specified 1000
Created a chunk of size 1144, which is longer than the specified 1000
Created a chunk of size 1509, which is longer than the specified 1000
Created a chunk of size 1003, which is longer than the specified 1000
Created a chunk of size 1025, which is longer than the specified 1000
Created a chunk of size 1197, which is longer than the specified 1000
Created a chunk of size 1230, which is longer than the specified 1000
Created a chunk of size 1320, which is longer than the specified 1000
Created a chunk of size 1047, which is longer than the specified 1000
Created a chunk of size 2456, which is longer than the specified 1000
Created a chunk of size 1021, which is longer than the specified 1000
Created a chunk of size 1532, which is longer than the specified 1000
Created a chunk of size 1535, which is longer than the specified 1000
Created a chunk of size 1366, which is longer than the specified 1000
...
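
The warnings above appear because `CharacterTextSplitter` only cuts at its separator (a blank line, `"\n\n"`, by default) and then merges the pieces up to `chunk_size`; a single piece longer than the limit is emitted whole. A simplified sketch of that behaviour in plain Python, as an illustration and not LangChain's actual implementation (`split_on_separator` is a hypothetical name):

```python
from typing import List


def split_on_separator(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A single piece longer than chunk_size is kept whole, because this strategy
    only ever cuts at the separator.
    """
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        current = piece  # may itself exceed chunk_size
    if current:
        chunks.append(current)
    return chunks


# Four blocks separated by blank lines; the third is longer than the limit.
text = "\n\n".join(["a" * 40, "b" * 40, "c" * 150, "d" * 40])
for chunk in split_on_separator(text, chunk_size=100):
    flag = " (over limit)" if len(chunk) > 100 else ""
    print(f"{len(chunk)}{flag}")
```

The first two blocks merge into one chunk, while the 150-character block stays oversized because it contains no blank line to cut at. Long functions or classes without blank lines produce the same effect in real code.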

### Go deeper

You can split documents using [language-specific logic](docs/integrations/document_loaders/source_code#splitting).
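
One way to see what language-specific splitting buys for Python code is to cut at top-level definitions instead of at character counts. A minimal sketch using the standard-library `ast` module; it illustrates the idea only and is not LangChain's language-aware splitter (`split_python_source` is a hypothetical name, and `end_lineno` requires Python 3.8 or later):

```python
import ast
from typing import List, Tuple


def split_python_source(source: str) -> List[Tuple[str, str]]:
    """Return (name, code) pairs for each top-level function or class in source."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based and points at the def/class line.
            start = node.lineno - 1
            if node.decorator_list:
                # Start at the first decorator so it stays with its definition.
                start = min(d.lineno for d in node.decorator_list) - 1
            # end_lineno is 1-based and inclusive, so it works as a slice end.
            code = "\n".join(lines[start:node.end_lineno])
            chunks.append((node.name, code))
    return chunks


example = '''import os

def load(path):
    with open(path) as handle:
        return handle.read()

class Chain:
    def run(self):
        return "ok"
'''

for name, code in split_python_source(example):
    print(name, len(code.splitlines()))
```

Each chunk is a complete function or class, so a retrieved snippet never starts or ends in the middle of a definition.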

## Step 3: Store the Documents

---

We need to store the documents so that their content can be searched semantically. The most common approach is to embed the contents of each document and store both the embedding and the document in a vector store, where the embedding is used to index the document.

This can take several minutes:

In [33]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents=texts, embedding=OpenAIEmbeddings())
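
To make the embed-then-index idea concrete, here is a toy in-memory vector store in plain Python. Word-count vectors stand in for a real embedding model and a linear scan stands in for the vector index; `embed`, `cosine`, and `ToyVectorStore` are hypothetical names, and none of this reflects how `Chroma` or `OpenAIEmbeddings` work internally:

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: a sparse vector of lowercase word counts."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (embedding, document) pairs and ranks them against a query."""

    def __init__(self) -> None:
        self._entries: List[Tuple[Counter, str]] = []

    def add(self, document: str) -> None:
        self._entries.append((embed(document), document))

    def search(self, query: str, k: int = 2) -> List[str]:
        query_vector = embed(query)
        ranked = sorted(
            self._entries,
            key=lambda entry: cosine(query_vector, entry[0]),
            reverse=True,
        )
        return [document for _, document in ranked[:k]]


store = ToyVectorStore()
store.add("def load_documents(path): read files from disk")
store.add("class Retriever: search the vector store for similar documents")
store.add("def format_prompt(template, values): fill a prompt template")

print(store.search("how do I fill a prompt template?", k=1))
```

A real embedding model captures meaning beyond shared words, but the storage pattern is the same: the vector is the search key and the original text is the payload.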

### Go deeper

- Browse the > 40 vectorstores integrations [here](https://integrations.langchain.com/).
- See further documentation on vectorstores [here](/docs/modules/data_connection/vectorstores/).
- Browse the > 30 text embedding integrations [here](https://integrations.langchain.com/).
- See further documentation on embedding models [here](/docs/modules/data_connection/text_embedding/).

In this case we keep the vector store in memory using `Chroma`, but you can also store the documents in another vector store such as [Deep Lake](docs/integrations/vectorstores/deeplake) and retrieve them from there.

## Step 4: Retrieve the data

---

To retrieve code snippets, we first construct a retriever on top of the Chroma vector store:

In [38]:
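retriever = vectorstore.as_retriever(
    search_type="mmr",  # maximal marginal relevance: relevant but diverse results
    search_kwargs={
        "k": 6,  # number of snippets returned
        "fetch_k": 20,  # number of candidates fetched before MMR re-ranking
    },
)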
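
Maximal marginal relevance (MMR) first fetches `fetch_k` candidates by similarity and then picks `k` of them one at a time, trading relevance to the query against similarity to the results already chosen. A minimal sketch over plain Python lists, assuming cosine similarity and a trade-off weight here called `lambda_mult`; it illustrates the selection rule and is not LangChain's implementation (`mmr_select` is a hypothetical name):

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return the indices of k candidates chosen by maximal marginal relevance."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    def score(index: int) -> float:
        relevance = cosine(query, candidates[index])
        # Redundancy is the highest similarity to anything already selected.
        redundancy = max(
            (cosine(candidates[index], candidates[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.5]
candidates = [
    [1.0, 0.0],   # 0: relevant
    [1.0, 0.05],  # 1: relevant, near duplicate of 0
    [0.0, 1.0],   # 2: less relevant, but different from 0 and 1
]
print(mmr_select(query, candidates, k=2))                   # [1, 2]
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # [1, 0]
```

With `lambda_mult=1.0` the rule reduces to plain similarity ranking and returns the two near duplicates; with `0.5` the second pick is the candidate that adds new information.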

Then we build a `ConversationalRetrievalChain` based on the model and retriever:

In [56]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = ChatOpenAI(model_name="gpt-3.5-turbo")  # switch to 'gpt-4'
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)
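
Conceptually, a conversational retrieval chain runs three steps per turn: it rewrites a follow-up question into a standalone question using the chat history, retrieves snippets for that standalone question, and asks the model to answer from those snippets. A schematic sketch with stand-in callables; `condense`, `retrieve`, and `answer` are hypothetical placeholders for the LLM and retriever calls, not LangChain APIs:

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (question, answer) pairs, oldest first


def conversational_turn(
    question: str,
    chat_history: History,
    condense: Callable[[str, History], str],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, List[str]], str],
) -> str:
    """Run one question through condense -> retrieve -> answer."""
    # With no history there is nothing to resolve, so the question is used as is.
    standalone = condense(question, chat_history) if chat_history else question
    snippets = retrieve(standalone)
    return answer(standalone, snippets)


# Stand-ins so the flow can be run without a model or a vector store.
def fake_condense(question: str, history: History) -> str:
    last_question, _ = history[-1]
    return f"{question} (follow-up to: {last_question})"


def fake_retrieve(query: str) -> List[str]:
    return [f"snippet for '{query}'"]


def fake_answer(query: str, snippets: List[str]) -> str:
    return f"answered '{query}' using {len(snippets)} snippet(s)"


history = [("What is the Chain class?", "A base class for chains.")]
print(
    conversational_turn(
        "What derives from it?", history, fake_condense, fake_retrieve, fake_answer
    )
)
```

The condense step is what lets a follow-up such as "What derives from it?" retrieve useful snippets, since the pronoun alone carries no searchable meaning.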


### Go deeper

- Browse the > 55 LLM and chat model integrations [here](https://integrations.langchain.com/).
- See further documentation on LLMs and chat models [here](/docs/modules/model_io/models/).
- Use local LLMs: The popularity of [PrivateGPT](https://github.com/imartinez/privateGPT) and [GPT4All](https://github.com/nomic-ai/gpt4all) underscores the importance of running LLMs locally.



## Step 5: Run QA over code base

---

Now that we have the `ConversationalRetrievalChain` we can use the LLM to run QA over the code base:

In [50]:
question = "What one improvement do you propose in code in relation to the class hierarchy for the Chain class?"
result = qa({"question": question, "chat_history": []})
print(result['answer'])

One improvement that could be made to the code in relation to the class hierarchy for the Chain class is to add more specific subclasses that inherit from the Chain class. This can help to organize and categorize the different types of chains that are used in the codebase. For example, instead of having generic subclasses like LLMChain, MapReduceChain, and RouterChain, more specific subclasses could be created such as TextGenerationChain, DataProcessingChain, and RoutingChain. This would make the codebase more modular and easier to understand and maintain.


In [51]:
questions = [
    "What is the class hierarchy?",
    # "What classes are derived from the Chain class?",
    # "What classes and functions in the ./langchain/utilities/ folder are not covered by unit tests?",
    "What one improvement do you propose in code in relation to the class hierarchy for the Chain class?",
]
chat_history = []

for question in questions:
    result = qa({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What is the class hierarchy? 

**Answer**: The class hierarchy in the provided context is as follows:

- BaseModel
  - ConstitutionalPrinciple
  - Dog
  
- AgentAction
- AgentFinish
- BaseChatMessageHistory
- BaseMemory
- BaseMessage
  - AIMessage
  - ChatMessage
  - FunctionMessage
  - HumanMessage
  - SystemMessage
- BaseDocumentTransformer
- Document
- BaseLLMOutputParser
- BaseOutputParser
- BasePromptTemplate
- BaseRetriever
- ChatGeneration
- ChatResult
- Generation
- LLMResult
- RunInfo
- PromptValue
- BasePromptTemplate
- format_document
- Callbacks
- BaseTracer
  - Run
- FakeListChatModel
- FakeListLLM
- CommaSeparatedListOutputParser
- ChatPromptTemplate
- HumanMessagePromptTemplate
- SystemMessagePromptTemplate
- StuffDocumentsChain
- LLMChain
- ConversationalRetrievalChain
- PromptTemplate
- OpenAI
- RouterRunnable
- Runnable
- RunnableConfig
- RunnableLambda
- RunnableMap
- RunnablePassthrough
- RunnableSequence

Note: This list may not include all the cla