
## Background:
You are provided with a boilerplate Python notebook that interacts with a pull request (PR) reviewer system using a Large Language Model (LLM). This system is designed to review code by identifying potential issues in the changes made in a PR. The initial setup involves parsing code from the repository, storing code snippets in a vector database (using ChromaDB), and querying this database to provide contextually relevant code examples for LLM-based code review. The reviews are done file-by-file, with prompts structured to guide the LLM in generating constructive feedback in a specific JSON format.

## Task Overview:
Your challenge is to conceptualize and implement enhancements in three key areas of the existing PR reviewer system:
1. **Splitting files into chunks**
2. **Retriever query generation system**
3. **LLM review prompt structure**


Each area is described in more detail below:

1. **Splitting Files into Chunks**: You have to develop a sophisticated method for dividing repository code into meaningfully sized chunks or propose an entirely new approach for managing code snippets. The goal is to enhance logical or functional coherence within the chunks without losing efficiency.

2. **Retriever Query Generator System**: You have to refine the system's approach to generating queries for retrieving relevant documents from a vector database. Aim to improve the precision and relevance of document retrieval, for example by exploring advanced query generation strategies or incorporating richer contextual information.

3. **LLM Review Prompt Revamping**: Currently, the system reviews code on a per-file basis. You are encouraged to explore alternative methodologies, such as reviewing entire PRs in a holistic manner or employing a sophisticated LLM agent system for more effective and context-aware reviews.

**You may choose to work on just one or any number of the above points.**

## Environment variables

In [None]:
import os
from dotenv import load_dotenv

load_dotenv(".env")


assert os.getenv("GITHUB_TOKEN") is not None and os.getenv("GITHUB_TOKEN") != "a_key_goes_here"
assert os.getenv("OPENAI_API_KEY") is not None and os.getenv("OPENAI_API_KEY") != "a_key_goes_here"

## Introduction to Pre-made Tools

In order to streamline the process of interacting with and analyzing repository files for our PR reviewer system, we've developed a suite of pre-made tools. These utilities are designed to handle various aspects of file manipulation, diff extraction, and repository management seamlessly. Below is an overview of each tool and its primary purpose.

### LocalFile

The `LocalFile` class acts as a refined interface for file interaction, offering a simplified way to access a file's contents along with pertinent metadata. It's an enhanced representation that abstracts away the complexities of raw file handling, making file operations more intuitive and less error-prone.

### DiffRepresentation

Dealing with diffs can be intricate, given their crucial role in identifying changes between file versions. The `DiffRepresentation` class simplifies this task by providing methods to extract and analyze diff metadata effectively. It specifically aids in parsing edit locations within a file, such as the line numbers of changed sections and the position of the [diff hunk header](https://stackoverflow.com/q/28111035), which marks the start of a set of differences in file content. This specialized tool ensures that diffs are handled accurately, facilitating a deeper analysis of code changes.
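
To make the hunk header concrete, here is a minimal sketch of parsing one with a regular expression. It is independent of `DiffRepresentation`, whose internals are not shown in this notebook; the names `HunkHeader` and `parse_hunk_header` are hypothetical and used only for illustration.

```python
import re
from typing import NamedTuple, Optional

# Unified diff hunk header: "@@ -old_start[,old_count] +new_start[,new_count] @@ optional section"
HUNK_HEADER_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


class HunkHeader(NamedTuple):
    old_start: int
    old_count: int
    new_start: int
    new_count: int


def parse_hunk_header(line: str) -> Optional[HunkHeader]:
    """Return the line ranges encoded in a hunk header, or None for any other line."""
    match = HUNK_HEADER_RE.match(line)
    if match is None:
        return None
    old_start, old_count, new_start, new_count = match.groups()
    # An omitted count means the hunk covers exactly one line.
    return HunkHeader(
        int(old_start), int(old_count or 1), int(new_start), int(new_count or 1)
    )
```

For example, `parse_hunk_header("@@ -12,7 +12,9 @@ def foo():")` reports that the hunk replaces 7 lines starting at line 12 of the old file with 9 lines starting at line 12 of the new file.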

### GithubUtils

The `GithubUtils` class encapsulates functionality for fetching information and files from a specific GitHub repository and its associated PRs. It leverages the PyGithub library to abstract away the intricacies of API communication, offering a more user-friendly interface for retrieving the data needed for review analysis.

### LocalRepository

To further mitigate the limitations posed by GitHub's rate limiting and to enhance the efficiency of repository analysis, the `LocalRepository` class provides a mechanism to mirror a GitHub repository locally.


In [None]:
import os
import github
from korbit_tools.github_service import GithubUtils

rest_github = github.Github(os.getenv("GITHUB_TOKEN"))

REPOSITORY = "langchain-ai/langchain"
PR_NUMBER = 13999

ALLOWED_EXTENSIONS = [".py"]

repo = rest_github.get_repo(REPOSITORY)
pr = repo.get_pull(PR_NUMBER)

# Here is how we get content for a Pull Request
for content_file, pr_diff in GithubUtils.get_pull_request_content_file_iter(repo, pr, allowed_extensions=ALLOWED_EXTENSIONS):
    print(pr_diff.diff[:100])
    print(content_file.contents[:100])
    break

## Vector store
### Download repository locally

In [None]:
from korbit_tools.github_service import download_repository
from korbit_tools.repository_search import LocalRepository


repo_path = download_repository(repo, pr.base.sha, "./repositories")
local_repository = LocalRepository(repo_path)

### Load local repository

Here we convert the repository's files into LangChain `Document` objects so that we can store them in ChromaDB.

[https://python.langchain.com/docs/integrations/document_loaders/source_code#splitting](https://python.langchain.com/docs/integrations/document_loaders/source_code#splitting)

In [None]:
repo_path

In [None]:
from langchain_community.document_loaders import PythonLoader
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader(repo_path, glob="**/*.py", loader_cls=PythonLoader)

documents = loader.load()
len(documents)

### Split files into chunks

We need to split each file into smaller chunks; otherwise the retrieved context grows too large and the relevant information gets lost in it.


In [None]:
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

py_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=2000, chunk_overlap=100
)
texts = py_splitter.split_documents(documents)
len(texts)
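
As one possible direction for more coherent chunks, the sketch below splits a Python module along its top-level function and class boundaries using the standard `ast` module, instead of by character count. It is only an illustration, not a replacement for the splitter above: the function name `split_python_source` and the chunk dictionary layout are assumptions, and module-level statements such as imports and constants are deliberately left out.

```python
import ast


def split_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class in the module."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above node.lineno, so start the chunk at the first one.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append(
                {
                    "name": node.name,
                    "start_line": start,
                    "end_line": node.end_lineno,
                    "text": "\n".join(lines[start - 1 : node.end_lineno]),
                }
            )
    return chunks
```

Each chunk keeps its name and line range, which could later be stored as document metadata to help match retrieved snippets against the lines touched by a diff. Very large classes would still need a secondary split, for instance per method.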

The following cell will take several minutes to run.

In [None]:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# The vector store will be persisted in the current directory as .chromadb folder
db = Chroma.from_documents(texts, OpenAIEmbeddings(disallowed_special=()), persist_directory=".chromadb")

In [None]:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# This cell lets you reload the vector store created by the previous cell
db = Chroma(persist_directory=".chromadb", embedding_function=OpenAIEmbeddings(disallowed_special=()))

### Retriever
Here is the configuration of the vector database retriever. Feel free to change it to whatever you think will work best, for example by querying with a subset of the diff or with the entire content of the file under review.
You can use the `repository_search` function to query the whole vector database, which contains embeddings of code snippets from the repository the PR targets.

In [None]:
retriever = db.as_retriever(
    search_type="mmr", # https://api.python.langchain.com/en/latest/retrievers/langchain_community.retrievers.zep.SearchType.html
    search_kwargs={"k": 4},
)

import textwrap

INDENTATION = "  "

def repository_search(query: str) -> str:
    relevant_docs = retriever.get_relevant_documents(query)
    output = ""
    for doc in relevant_docs:
        # Pull the values out first: nesting double quotes inside an f-string fails before Python 3.12
        source = doc.metadata["source"]
        language = doc.metadata.get("language", "py")
        snippet = textwrap.indent(doc.page_content, INDENTATION)
        output += f"- {source}\n{INDENTATION}```{language}\n{snippet}\n{INDENTATION}```\n\n"
    return output

## Prompt
We are providing a boilerplate prompt for reviewing one file at a time, but feel free to customize it to your liking. Modify the steps, inputs, outputs, and overall prompt as you see fit to make it your own.

By default, the expected output for this prompt is in JSON format as shown below. However, you are welcome to alter this structure if you believe there is a more effective way to provide feedback:

```json
[
    {
        "description": "A brief description of the issue.",
        "code_snippet": "The relevant code snippet causing the issue.",
        "category": "The category of the issue.",
        "severity": "Severity level of the detected issue on a scale from 1 to 10."
    }
]
```
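
Because the model's output is free text, it can help to check parsed issues against this structure before using them. The following is a minimal sketch under stated assumptions: `validate_issues` and `REQUIRED_KEYS` are hypothetical names, the input is assumed to be the JSON string already extracted from the model's reply, and malformed entries are dropped rather than repaired.

```python
import json

REQUIRED_KEYS = {"description", "code_snippet", "category", "severity"}


def validate_issues(raw_json: str) -> list[dict]:
    """Parse a JSON list of issues and keep only the well-formed entries."""
    issues = json.loads(raw_json)
    if not isinstance(issues, list):
        raise ValueError("expected a JSON list of issues")
    valid = []
    for issue in issues:
        # Every issue must be an object carrying all four expected attributes.
        if not isinstance(issue, dict) or not REQUIRED_KEYS.issubset(issue):
            continue
        try:
            severity = int(issue["severity"])
        except (TypeError, ValueError):
            continue
        # Severity is defined on a scale from 1 to 10.
        if 1 <= severity <= 10:
            valid.append({**issue, "severity": severity})
    return valid
```

Normalising `severity` to an integer makes it straightforward to sort or filter the issues later, for example to surface only those above a chosen threshold.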

In [None]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

ISSUE_DETECTOR_SYSTEM = """\
You are a senior software engineer tasked with mentoring a team of developers to improve their pull request. I am one of the developers, and I need your review on my pull request changes.

The review comment must be in JSON format and contain a list of the following objects:
1. description: Constructive critical feedback that is relevant to the pull request.
2. code_snippet: The code snippet the feedback refers to.
3. category: The type of issue detected: bug, good practices, or other.
4. severity: The urgency of the detected issue. Scale from 1 to 10.

Based on the above context, you must output a JSON list of issues (review comments) that you have identified. For each issue, use the following attributes:
```json
[{{
  "description": "An issue description you found in the diff",
  "code_snippet": "def my_function(x):\\n    return x**2\\n\\nmy_function(123)",
  "category": "naming",
  "severity": 1
}}]
```

You can also explain why you raised these issues.

If you didn't find any issues, output an empty list, but you must output something.
"""


ISSUE_DETECTOR_AGENT_HUMAN = """
- title: {pr_title}
- description: {pr_description}
- repository folder tree:

```json
{folder_tree}
```

- file path: {file_path}
- file content:

```
{content_file}
```

- diff representation:

```diff
{pr_diff}
```
"""


prompt_reviewer = ChatPromptTemplate.from_messages(
    [
        ("system", ISSUE_DETECTOR_SYSTEM),
        ("human", ISSUE_DETECTOR_AGENT_HUMAN),
        MessagesPlaceholder(variable_name="assistant_context"),
    ]
)

In [None]:
from langchain_core.messages import AIMessage


ISSUE_DETECTOR_CONTEXT_ASSISTANT = """\
In order to find issues in the pull request diff, I need to find the relevant code snippets in the repository. Here is the vector database query I made to review the pull request diff:
```
{query}
```

Result of the query:
```
{context}
```
"""

def compute_assistant_context_message(content_file, pr_diff) -> list[AIMessage]:
    """
    This is a very basic version of the RAG system.
    We expect you to improve it by creating a good query system
    that returns the most relevant context from the vector store.
    """
    query = pr_diff.diff
    output = repository_search(query)
    return [AIMessage(content=ISSUE_DETECTOR_CONTEXT_ASSISTANT.format(query=query, context=output))]
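
Using the raw diff as the query, as above, mixes diff markers and unchanged context into the embedding. One simple alternative, sketched here as an illustration only, is to query with the identifiers that appear on added or removed lines. The function name `build_query_from_diff` and the `max_terms` parameter are hypothetical, and the heuristic assumes a unified diff of Python code.

```python
import keyword
import re

IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_query_from_diff(diff: str, max_terms: int = 20) -> str:
    """Build a compact query from the identifiers on changed lines of a unified diff."""
    terms = []
    seen = set()
    for line in diff.splitlines():
        # Skip file headers ("--- a/x.py", "+++ b/x.py"), hunk headers and context lines.
        # Limitation: a changed line whose content itself starts with "--" or "++" is skipped too.
        if line.startswith(("+++", "---")) or not line.startswith(("+", "-")):
            continue
        for name in IDENTIFIER_RE.findall(line[1:]):
            # Drop Python keywords, very short names and duplicates.
            if keyword.iskeyword(name) or len(name) < 3 or name in seen:
                continue
            seen.add(name)
            terms.append(name)
    return " ".join(terms[:max_terms])
```

The resulting string could be passed to `repository_search` in place of `pr_diff.diff`. Whether it retrieves better context than the raw diff would need to be checked on real PRs.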

## PR review

After setting up our system to analyze pull requests using the Large Language Model (LLM), we now enter the critical phase of putting our setup to work. In essence, our goal is to systematically review each file associated with a given pull request. This is achieved by leveraging our vector database to unearth relevant code snippets and differences, which then become the foundation for our LLM-based code review.


In [None]:
from langchain.chains import LLMChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4-1106-preview")
chain = LLMChain(llm=llm, prompt=prompt_reviewer)

In [None]:
import json
import os

from korbit_tools.llm_utils import count_token_string
from korbit_tools.string_search import extract_json_from_text

# NOTE: We skip files that are 60k tokens or longer. The model cannot process them alongside the snippets retrieved by the similarity search.
CONTENT_FILE_TOKEN_LIMIT = 60000

pr_issues = {}
# Get all files that have been changed in the Pull Request
for content_file, pr_diff in GithubUtils.get_pull_request_content_file_iter(
    repo, pr, allowed_extensions=ALLOWED_EXTENSIONS
):

    token_count = count_token_string(content_file.contents)
    if token_count >= CONTENT_FILE_TOKEN_LIMIT:
        continue

    # Set up the LLM inputs to match the variables in the prompt above.
    # Assumption: {folder_tree} is filled with the repository's top-level entries;
    # swap in a richer tree if one is available.
    llm_inputs = {
        "pr_title": pr.title,
        "pr_description": pr.body or "",
        "folder_tree": json.dumps(sorted(os.listdir(repo_path)), indent=2),
    }
    llm_inputs["content_file"] = content_file.contents
    llm_inputs["file_path"] = content_file.path
    llm_inputs["pr_diff"] = pr_diff.diff

    # assistant_context variable is linked to message in the prompt template
    llm_inputs["assistant_context"] = compute_assistant_context_message(
        content_file, pr_diff
    )

    output = chain.invoke(llm_inputs)
    
    pr_issues[content_file.path] = output

    print(content_file.path)
    print("\n\n")
    print(output.get("text"))
    print("\n\n")
    break

# pr_issues maps each file path to the chain output, so iterate over the values
for output in pr_issues.values():
    print(extract_json_from_text(output.get("text")))
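
The loop above reviews one file per LLM call. As a possible first step towards a more holistic, PR-level review, the sketch below greedily groups file diffs into batches that fit a token budget, so that related files can be reviewed in one prompt. The function name `pack_diffs` is hypothetical, and the default token counter (about four characters per token) is a rough assumption; `count_token_string` could be passed in its place.

```python
from typing import Callable


def pack_diffs(
    file_diffs: dict[str, str],
    token_budget: int,
    count_tokens: Callable[[str], int] = lambda text: len(text) // 4,
) -> list[list[str]]:
    """Greedily group file paths so that each group's diffs fit within token_budget."""
    batches = []
    current = []
    used = 0
    for path, diff in file_diffs.items():
        cost = count_tokens(diff)
        if cost > token_budget:
            # Too large to review in a single call, so skip it as the loop above does.
            continue
        if current and used + cost > token_budget:
            # The current batch is full: close it and start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(path)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Each batch could then be rendered into a single prompt that lists every file's path and diff. The prompt template would need a matching change, since the current one expects exactly one `file_path` and one `pr_diff`.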

In [None]:
local_repository.count_languages_extensions()

### Cleanup downloaded repositories

In [None]:

# The `*` glob skips dotfiles, so .gitkeep is kept (rm has no --exclude option)
!rm -rf ./repositories/*