<a href="https://colab.research.google.com/github/parky-sood/codebase-genie/blob/main/RAG_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG Process Overview

![Screenshot 2024-11-25 at 7 12 58 PM](https://github.com/user-attachments/assets/0bd67cf0-43d5-46d2-879c-a752cae4c8e3)

# Library Dependencies

In [None]:
! pip install pygithub langchain langchain-community openai tiktoken pinecone-client langchain_pinecone sentence-transformers


In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from langchain_pinecone import PineconeVectorStore
from langchain.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from google.colab import userdata
from pinecone import Pinecone
import os
import tempfile
from github import Github, Repository
from git import Repo
from openai import OpenAI
from pathlib import Path
from langchain.schema import Document
import ast




# Clone Repo Locally

In [None]:
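def clone_repo(repo_url):
  """Clone repo_url into /content/<repo name> and return the local path."""
  # Tolerate a trailing slash or ".git" suffix when deriving the folder name
  repo_name = repo_url.rstrip("/").split("/")[-1].removesuffix(".git")
  repo_path = f"/content/{repo_name}"
  # Re-running the cell should not fail on an already-cloned repo
  if not os.path.exists(repo_path):
    Repo.clone_from(repo_url, repo_path)
  return repo_path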

In [None]:
repo_url = "https://github.com/CoderAgent/SecureAgent"
path = clone_repo(repo_url)

In [None]:
SUPPORTED_EXTENSIONS = {'.py', '.js', '.tsx', '.jsx', '.ipynb', '.java',
                         '.cpp', '.ts', '.go', '.rs', '.vue', '.swift', '.c', '.h', '.md'}


IGNORED_DIRS = {'node_modules', 'venv', 'env', 'dist', 'build', '.git',
                '__pycache__', '.next', '.vscode', 'vendor'}
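
As a rough illustration of how these two sets are meant to be used, the sketch below checks a single path against an extension allow-list and a directory deny-list. The helper name `should_index` and the sample values are hypothetical; it compares whole directory names, which is one possible matching rule and not necessarily the one applied elsewhere in this notebook.

```python
import os

def should_index(path, supported_extensions, ignored_dirs):
    """Return True if path has a supported extension and no ignored parent directory."""
    directory, filename = os.path.split(path)
    # Compare whole directory names so "env" does not also match "environment"
    parts = directory.replace("\\", "/").split("/")
    if any(part in ignored_dirs for part in parts):
        return False
    return os.path.splitext(filename)[1] in supported_extensions

# Hypothetical sample values, kept separate from the notebook's own sets
sample_extensions = {".py", ".ts", ".md"}
sample_ignored = {"node_modules", "env", ".git"}

print(should_index("src/environment/app.py", sample_extensions, sample_ignored))
print(should_index("node_modules/pkg/index.ts", sample_extensions, sample_ignored))
print(should_index("docs/logo.png", sample_extensions, sample_ignored))
```

Passing the sets in as arguments keeps the helper independent of the global constants, so it can be tried with other filters without touching them.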

# Get file content using relative path from repo root

In [None]:
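def get_function_py(content):
  """Return a header and the full source text for every function in a Python file."""
  tree = ast.parse(content)

  functions = []

  # ast.walk also visits nested functions and class methods
  for node in ast.walk(tree):
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
      # Simplified header: plain positional parameter names only
      header = f"def {node.name}("
      header += ", ".join(arg.arg for arg in node.args.args)
      header += "):"

      # Exact source text of the function, including its signature
      body = ast.get_source_segment(content, node)

      functions.append({"header": header, "body": body})

  return functions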
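
The header built above lists only plain positional parameter names. As a sketch of one alternative, assuming Python 3.9 or newer, `ast.unparse` can render the complete parameter list, including defaults, `*args`, and `**kwargs`. The helper name `full_header` and the sample source are hypothetical.

```python
import ast

def full_header(node):
    """Render a function header with its complete parameter list."""
    prefix = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
    # ast.unparse (Python 3.9+) turns the arguments node back into source text
    return f"{prefix} {node.name}({ast.unparse(node.args)}):"

# Hypothetical sample source, used only to exercise the helper
sample_source = '''
def greet(name, times=2, *args, loud=False, **kwargs):
    return name * times

async def fetch(url):
    return url
'''

tree = ast.parse(sample_source)
headers = [
    full_header(node)
    for node in ast.walk(tree)
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
]
print(headers)
```

A fuller header gives the `function` metadata field more detail to distinguish overloaded or similarly named helpers, at the cost of longer strings.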


In [None]:
def get_file_content(file_path, repo_path, file_extension):
  """
  Read one file and return its content as a dict, or as a list of dicts
  (one per function) for Python files.
  """
  try:
    with open(file_path, "r", encoding="utf-8", errors="replace") as f:
      content = f.read()

    rel_path = os.path.relpath(file_path, repo_path)

    # Only Python files are split into functions
    functions = get_function_py(content) if file_extension == ".py" else []

    # Other files, and Python files without any function, are kept whole
    if not functions:
      return {"name": rel_path, "function": None, "content": content}

    return [
        {"name": rel_path, "function": item["header"], "content": item["body"]}
        for item in functions
    ]

  except Exception as e:
    print(f"Error reading file {file_path}: {e}")
    return None

def get_main_files_content(repo_path: str):
  """
  Get content of supported code files from local repository.

  Args:
    repo_path: Path to local repo

  Returns:
    List of dictionaries containing file names and contents
  """

  files_content = []

  try:
    for root, _, files in os.walk(repo_path):
      # Skip directories whose path contains an ignored folder name.
      # Compare whole path components so 'env' does not match 'environment'.

      if any(part in IGNORED_DIRS for part in root.split(os.sep)):
        continue

      # Process each file in current directory
      for file in files:
        file_path = os.path.join(root, file)

        file_extension = os.path.splitext(file)[1]

        if file_extension in SUPPORTED_EXTENSIONS:
          file_content = get_file_content(file_path, repo_path, file_extension)

          if file_content:
            if isinstance(file_content, dict):
              files_content.append(file_content)

            else:
              for content in file_content:
                files_content.append(content)

  except Exception as e:
    print(f"Error reading repository: {str(e)}")

  return files_content


In [None]:
file_content = get_main_files_content(path)

# Embeddings

In [None]:
def get_huggingface_embeddings(text, model_name="sentence-transformers/all-mpnet-base-v2"):
    model = SentenceTransformer(model_name)
    return model.encode(text)

In [None]:
text = "I am a software developer"

embeddings = get_huggingface_embeddings(text)
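
Embeddings like the one above are compared by vector similarity at query time. The sketch below computes cosine similarity in plain Python on short hand-written vectors. The values are made up for illustration (real `all-mpnet-base-v2` embeddings have 768 dimensions), and it assumes the Pinecone index is configured with the cosine metric.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors standing in for real embeddings
query_vec = [1.0, 0.0, 1.0]
similar_vec = [2.0, 0.0, 2.0]     # same direction, different length
unrelated_vec = [0.0, 1.0, 0.0]   # orthogonal to the query

print(cosine(query_vec, similar_vec))
print(cosine(query_vec, unrelated_vec))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.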

# Using Pinecone
**1. Create an account on [Pinecone.io](https://app.pinecone.io/)**

**2. Create a new index called "codebase-rag" and set the dimensions to 768. Leave the rest of the settings as they are.**

![Screenshot 2024-11-24 at 10 58 50 PM](https://github.com/user-attachments/assets/f5fda046-4087-432a-a8c2-86e061005238)



**3. Create an API Key for Pinecone**

![Screenshot 2024-11-24 at 10 44 37 PM](https://github.com/user-attachments/assets/e7feacc6-2bd1-472a-82e5-659f65624a88)


**4. Store your Pinecone API Key within Google Colab's secrets section, and then enable access to it (see the blue checkmark)**

![Screenshot 2024-11-24 at 10 45 25 PM](https://github.com/user-attachments/assets/eaf73083-0b5f-4d17-9e0c-eab84f91b0bc)



In [None]:
# Set the PINECONE_API_KEY as an environment variable
pinecone_api_key = userdata.get("PINECONE_API_KEY")
os.environ['PINECONE_API_KEY'] = pinecone_api_key

# Initialize Pinecone
pc = Pinecone(api_key=pinecone_api_key)

# Connect to your Pinecone index
pinecone_index = pc.Index("codebase-rag")

In [None]:
vectorstore = PineconeVectorStore(index_name="codebase-rag", embedding=HuggingFaceEmbeddings())



In [None]:
documents = []

for file in file_content:
    if file['function']:
        doc = Document(
            page_content=f"{file['name']}\n{file['content']}",
            metadata={"source": file['name'], "function": file['function']}
        )
    else:
        doc = Document(
            page_content=f"{file['name']}\n{file['content']}",
            metadata={"source": file['name']}
        )

    documents.append(doc)


vectorstore = PineconeVectorStore.from_documents(
    documents=documents,
    embedding=HuggingFaceEmbeddings(),
    index_name="codebase-rag",
    namespace=repo_url
)



# Perform RAG

1. Get your Groq API Key [here](https://console.groq.com/keys)

2. Paste your Groq API Key into your Google Colab secrets, and make sure to enable permissions for it

![Screenshot 2024-11-25 at 12 00 16 AM](https://github.com/user-attachments/assets/e5525d29-bca6-4dbd-892b-cc770a6b281d)


In [None]:
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=userdata.get("GROQ_API_KEY")
)

In [None]:
query = "How are python files parsed?"

In [None]:
raw_query_embedding = get_huggingface_embeddings(query)

In [None]:
# Feel free to change the "top_k" parameter to be a higher or lower number
top_matches = pinecone_index.query(vector=raw_query_embedding.tolist(), top_k=5, include_metadata=True, namespace=repo_url)

In [None]:
contexts = [item['metadata']['text'] for item in top_matches['matches']]
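
The `text` key read above is where LangChain's `PineconeVectorStore` stores each document's `page_content` by default. A more defensive variant, sketched here on a hand-written response made of plain dictionaries, skips matches without that key and drops low-scoring ones. The helper name `extract_contexts`, the `min_score` cutoff, and the sample data are hypothetical; real Pinecone response objects may need item or attribute access instead of `.get`.

```python
def extract_contexts(matches, min_score=0.0):
    """Collect the stored text of each match, skipping weak or text-less matches."""
    contexts = []
    for match in matches:
        text = match.get("metadata", {}).get("text")
        if text and match.get("score", 0.0) >= min_score:
            contexts.append(text)
    return contexts

# Hand-written stand-in for top_matches['matches'] (plain dicts, not Pinecone objects)
sample_matches = [
    {"id": "a", "score": 0.82, "metadata": {"text": "src/app.py\ndef main(): ..."}},
    {"id": "b", "score": 0.15, "metadata": {"text": "README.md\n# Title"}},
    {"id": "c", "score": 0.77, "metadata": {"source": "src/util.py"}},
]

print(extract_contexts(sample_matches, min_score=0.5))
```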

In [None]:
augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[ : 10]) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query
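
The prompt assembly above can be wrapped in a small helper so the delimiter and the context limit are easy to adjust. This is a sketch that reproduces the same string layout; the function name `build_augmented_query` and its `max_contexts` parameter are hypothetical.

```python
def build_augmented_query(contexts, query, max_contexts=10):
    """Wrap retrieved contexts in <CONTEXT> tags and append the question."""
    separator = "\n\n-------\n\n"
    context_block = separator.join(contexts[:max_contexts])
    return (
        "<CONTEXT>\n"
        + context_block
        + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n"
        + query
    )

# Made-up contexts standing in for retrieved documents
prompt = build_augmented_query(
    ["file_a.py\ndef a(): pass", "file_b.py\ndef b(): pass"],
    "How are python files parsed?",
)
print(prompt)
```

Keeping the limit as a parameter makes it simple to trade context size against the model's token budget.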

In [None]:
system_prompt = f"""You are a Senior Software Engineer, specializing in TypeScript, Python, Java, C++, Go, Rust, C, and Swift.

Answer any questions I have about the codebase, based on the code provided. Always consider all of the context provided when forming a response.
"""

llm_response = client.chat.completions.create(
    model="llama-3.1-70b-versatile",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": augmented_query}
    ]
)

response = llm_response.choices[0].message.content

# Putting it all together

In [None]:
def perform_rag(query):
    raw_query_embedding = get_huggingface_embeddings(query)

    top_matches = pinecone_index.query(vector=raw_query_embedding.tolist(), top_k=5, include_metadata=True, namespace=repo_url)

    # Get the list of retrieved texts
    contexts = [item['metadata']['text'] for item in top_matches['matches']]

    augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[ : 10]) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query

    # Modify the prompt below as needed to improve the response quality
    system_prompt = f"""You are a Senior Software Engineer, specializing in TypeScript, Python, Java, C++, Go, Rust, C, and Swift.

    Answer any questions I have about the codebase, based on the code provided. Always consider all of the context provided when forming a response.
    """

    llm_response = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": augmented_query}
        ]
    )

    return llm_response.choices[0].message.content

In [None]:
response = perform_rag("How is the javascript parser used?")

print(response)

The JavascriptParser is used in two main places in the provided codebase:

1. **diffContextPerHunk**: In this function, which is part of `src/context/review.ts`, `JavascriptParser` is used to determine the `enclosingContext` for each hunk in the diff. The `findEnclosingContext` method of `JavascriptParser` is called to identify the enclosing function for a given range of lines. This enclosing function is then used to construct the context string for the hunk.
2. **getParserForExtension**: In this function, which is part of `src/constants.ts`, `JavascriptParser` is one of the parsers registered for specific file extensions (javascript, typescript, jsx, and tsx). This function returns the parser instance based on the file extension of a given file. If the file has a .js or .ts extension, `JavascriptParser` will be used.

Additionally, the `JavascriptParser` is used by the `functionContextPatchStrategy` function, which in turn uses `diffContextPerHunk` to generate a patch strategy for rev