![Img](https://app.theheadstarter.com/static/hs-logo-opengraph.png)

# Headstarter Codebase RAG Project

![Screenshot 2024-11-25 at 7 12 58 PM](https://github.com/user-attachments/assets/48dd9de1-b4d2-4318-8f52-85ec209d8ebc)

# Install Necessary Libraries

In [None]:
! pip install pygithub langchain langchain-community openai tiktoken pinecone-client langchain_pinecone sentence-transformers

In [3]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from langchain_pinecone import PineconeVectorStore
from langchain.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from google.colab import userdata
from pinecone import Pinecone
import os
import tempfile
from github import Github, Repository
from git import Repo
from openai import OpenAI
from pathlib import Path
from langchain.schema import Document

# Clone a GitHub Repo locally

In [5]:
github_repo = "https://github.com/CoderAgent/SecureAgent"
github_repo.split("/")[-1]

'SecureAgent'

In [6]:
def clone_repository(repo_url):
    """Clones a GitHub repository into Colab's /content directory.

    Args:
        repo_url: The URL of the GitHub repository.

    Returns:
        The path to the cloned repository.
    """

    repo_name = repo_url.split("/")[-1]
    repo_path = f"/content/{repo_name}"
    Repo.clone_from(repo_url, str(repo_path))
    return repo_path

In [7]:
path = clone_repository(github_repo)

# Define which types of files to parse and which files / folders to ignore

In [11]:
SUPPORTED_EXTENSIONS = {'.py', '.js', '.tsx', '.jsx', '.ipynb', '.java',
                         '.cpp', '.ts', '.go', '.rs', '.vue', '.swift', '.c', '.h'}

IGNORED_DIRS = {'node_modules', 'venv', 'env', 'dist', 'build', '.git',
                '__pycache__', '.next', '.vscode', 'vendor'}

In [12]:
def get_file_content(file_path, repo_path):
    """
    Get content of a single file.

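    Args:
        file_path (str): Path to the file
        repo_path (str): Path to the repository root, used to compute the relative name

    Returns:
        Optional[Dict[str, str]]: Dictionary with the file's relative path ("name")
        and its content, or None if the file could not be read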
    """
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            content = f.read()

        rel_path = os.path.relpath(file_path, repo_path)

        return {
            "name": rel_path,
            "content": content
        }

    except Exception as e:
        print(f"Error processing file {file_path}: {str(e)}")
        return None



def get_main_files_content(repo_path: str):
    """
    Get content of supported code files from the local repository.

    Args:
        repo_path: Path to the local repository

    Returns:
        List of dictionaries containing file names and contents
    """

    files_content = []

    try:

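        for root, dirs, files in os.walk(repo_path):
            # Prune ignored directories in place so os.walk never descends into them.
            # Matching whole directory names avoids skipping e.g. "environment" because it contains "env".
            dirs[:] = [d for d in dirs if d not in IGNORED_DIRS]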

            for file in files:
                file_path = os.path.join(root, file)
                if os.path.splitext(file)[1] in SUPPORTED_EXTENSIONS:
                    file_content = get_file_content(file_path, repo_path)

                    if file_content:
                        files_content.append(file_content)

    except Exception as e:
        print(e)

    return files_content


In [13]:
files_content = get_main_files_content(path)

In [14]:
files_content

[{'name': 'src/review-agent.ts',
  'content': 'import { Octokit } from "@octokit/rest";\nimport { WebhookEventMap } from "@octokit/webhooks-definitions/schema";\nimport { ChatCompletionMessageParam } from "groq-sdk/resources/chat/completions";\nimport * as xml2js from "xml2js";\nimport type {\n  BranchDetails,\n  BuilderResponse,\n  Builders,\n  CodeSuggestion,\n  PRFile,\n  PRSuggestion,\n} from "./constants";\nimport { PRSuggestionImpl } from "./data/PRSuggestionImpl";\nimport { generateChatCompletion } from "./llms/chat";\nimport {\n  PR_SUGGESTION_TEMPLATE,\n  buildPatchPrompt,\n  constructPrompt,\n  getReviewPrompt,\n  getTokenLength,\n  getXMLReviewPrompt,\n  isConversationWithinLimit,\n} from "./prompts";\nimport {\n  INLINE_FIX_FUNCTION,\n  getInlineFixPrompt,\n} from "./prompts/inline-prompt";\nimport { getGitFile } from "./reviews";\n\nexport const reviewDiff = async (messages: ChatCompletionMessageParam[]) => {\n  const message = await generateChatCompletion({\n    messages,

# Embeddings

In [16]:
def get_huggingface_embeddings(text, model_name="sentence-transformers/all-mpnet-base-v2"):
    model = SentenceTransformer(model_name)
    return model.encode(text)

In [17]:
text = "I like coding"

embeddings = get_huggingface_embeddings(text)


In [18]:
embeddings

array([-8.99216253e-03,  7.10570663e-02, -4.78132442e-02,  2.93952459e-03,
        1.87183674e-02,  6.25578165e-02, -7.40241110e-02, -1.79869905e-02,
       -2.33031120e-02,  2.52969563e-02,  1.87584143e-02,  3.26438509e-02,
        2.42187586e-02,  1.06377333e-01,  1.96239222e-02,  4.58661374e-03,
       -2.44689696e-02, -2.76926588e-02, -3.92993074e-03, -1.75862256e-02,
        7.31046824e-03,  3.00471876e-02, -3.48030925e-02, -1.51053648e-02,
        4.95747151e-03, -1.16299754e-02,  2.20015217e-02, -6.58035874e-02,
       -2.77131405e-02,  1.00825489e-01,  3.28424969e-04, -9.48083028e-03,
       -2.19271556e-02, -4.70386026e-03,  1.50509834e-06,  1.31359445e-02,
       -1.65855084e-02,  1.93589311e-02, -2.62506492e-02,  1.61775868e-04,
       -1.85249578e-02, -1.79329403e-02, -2.03854330e-02,  2.96985302e-02,
       -5.77551797e-02,  1.27474237e-02,  8.77798349e-02, -6.04864694e-02,
        9.92063433e-03, -2.96341110e-04,  1.08572084e-03, -1.75143741e-02,
       -3.35006043e-02, -
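Vector search ranks stored embeddings by how close they are to the query embedding. The notebook imports `cosine_similarity` from scikit-learn but never calls it, so here is a minimal pure-Python sketch of the same measure. It assumes the Pinecone index uses the cosine metric, which the notebook does not state; the function name `cosine_similarity_1d` and the toy vectors are illustrative only.

```python
import math


def cosine_similarity_1d(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real 768-dimensional embeddings.
v_query = [1.0, 0.0, 1.0]
v_close = [2.0, 0.0, 2.0]  # same direction as v_query, different length
v_far = [0.0, 1.0, 0.0]    # orthogonal to v_query

print(cosine_similarity_1d(v_query, v_close))
print(cosine_similarity_1d(v_query, v_far))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0, which is why the measure suits comparing text embeddings.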

# Setting up Pinecone
**1. Create an account on [Pinecone.io](https://app.pinecone.io/)**

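**2. Create a new index called "codebase-rag" and set its dimension to 768, which matches the size of the vectors produced by the `all-mpnet-base-v2` embedding model used above. Leave the rest of the settings as they are.**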

![Screenshot 2024-11-24 at 10 58 50 PM](https://github.com/user-attachments/assets/f5fda046-4087-432a-a8c2-86e061005238)



**3. Create an API Key for Pinecone**

![Screenshot 2024-11-24 at 10 44 37 PM](https://github.com/user-attachments/assets/e7feacc6-2bd1-472a-82e5-659f65624a88)


**4. Store your Pinecone API Key within Google Colab's secrets section, and then enable access to it (see the blue checkmark)**

![Screenshot 2024-11-24 at 10 45 25 PM](https://github.com/user-attachments/assets/eaf73083-0b5f-4d17-9e0c-eab84f91b0bc)



In [19]:
# Set the PINECONE_API_KEY as an environment variable
pinecone_api_key = userdata.get("PINECONE_API_KEY")
os.environ['PINECONE_API_KEY'] = pinecone_api_key

# Initialize Pinecone
pc = Pinecone(api_key=pinecone_api_key)

# Connect to your Pinecone index
pinecone_index = pc.Index("codebase-rag")

In [20]:

vectorstore = PineconeVectorStore(index_name="codebase-rag", embedding=HuggingFaceEmbeddings())



In [25]:
# Insert the codebase embeddings into Pinecone

documents = []

for file in files_content:
    doc = Document(
        page_content=f"{file['name']}\n\n{file['content']}",
        metadata={"source": file['name']}
    )

    documents.append(doc)


vectorstore = PineconeVectorStore.from_documents(
    documents=documents,
    embedding=HuggingFaceEmbeddings(),
    index_name="codebase-rag",
    namespace="https://github.com/CoderAgent/SecureAgent"
)



In [None]:
documents
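The loop above wraps each file in a single `Document`, so every file becomes one vector no matter how long it is. If retrieval on large files turns out to be too coarse, one option is to split each file into overlapping line-based chunks before building the documents. The sketch below is an assumption-laden illustration, not part of the original pipeline: `chunk_file`, `max_lines`, `overlap`, and the `start_line` field are hypothetical names, and it operates on dictionaries shaped like the entries of `files_content`.

```python
def chunk_file(file, max_lines=60, overlap=10):
    """Split one {'name', 'content'} dict into overlapping line-based chunks."""
    if max_lines <= 0 or not 0 <= overlap < max_lines:
        raise ValueError("need max_lines > 0 and 0 <= overlap < max_lines")
    lines = file["content"].splitlines()
    step = max_lines - overlap
    chunks = []
    for start in range(0, len(lines), step):
        window = lines[start:start + max_lines]
        chunks.append({
            "name": file["name"],
            "start_line": start + 1,  # 1-based line number of the chunk's first line
            "content": "\n".join(window),
        })
        if start + max_lines >= len(lines):
            break  # this window already reaches the end of the file
    return chunks


# A made-up 10-line file, split into windows of 4 lines that overlap by 1.
sample = {"name": "demo.ts", "content": "\n".join(f"line {i}" for i in range(1, 11))}
for chunk in chunk_file(sample, max_lines=4, overlap=1):
    print(chunk["start_line"], repr(chunk["content"]))
```

Each chunk could then be wrapped in its own `Document`, with `start_line` stored in the metadata next to `source`. The overlap keeps a function that straddles a chunk boundary at least partly visible in both neighbours.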

# Perform RAG

1. Get your OpenRouter API Key [here](https://openrouter.ai/settings/keys)

2. Paste your OpenRouter Key into your Google Colab secrets, and make sure to enable permissions for it

![Image](https://github.com/user-attachments/assets/bd64c5aa-952e-4a1e-9ac0-01d8fe93aaa1)


In [27]:
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=userdata.get("OPENROUTER_API_KEY")
)

In [41]:
query = "How is the javascript parser used?"

In [42]:
raw_query_embedding = get_huggingface_embeddings(query)

In [43]:
raw_query_embedding

array([ 5.71991056e-02, -3.48081402e-02, -3.27215381e-02,  5.29010706e-02,
       -3.88054885e-02,  2.21032687e-02,  1.60939023e-02, -1.00735845e-02,
        3.04608978e-02, -6.25874698e-02,  2.71743126e-02,  3.67565900e-02,
        5.69119379e-02,  5.45353852e-02,  4.02487554e-02, -4.73744869e-02,
        3.59909353e-03,  6.65522413e-03,  1.47536881e-02,  3.57538909e-02,
        1.81228817e-02,  1.24748312e-02, -2.07926147e-02,  6.99328035e-02,
       -1.78905670e-02, -1.98268220e-02, -8.77421536e-03, -4.04773466e-03,
       -4.82026860e-02, -1.55368429e-02, -6.26485273e-02, -6.66376017e-03,
        1.43067809e-02, -4.92968373e-02,  1.30648743e-06, -2.02891300e-03,
       -4.47639450e-02,  2.07317900e-02, -2.80544767e-03,  1.37847150e-02,
        4.11503110e-03,  6.87663862e-03, -2.91272271e-02, -6.68385159e-03,
        2.94112675e-02, -4.13797610e-02,  3.90248373e-02, -5.73173314e-02,
        3.29415165e-02,  1.95520534e-03, -7.05702056e-04, -2.74958815e-02,
        8.47590249e-03,  

In [44]:
top_matches = pinecone_index.query(vector=raw_query_embedding.tolist(),
                                   top_k=5,
                                   include_metadata=True,
                                   namespace="https://github.com/CoderAgent/SecureAgent")

In [45]:
top_matches

{'matches': [{'id': 'feb91682-3c09-4855-b8f0-d4a9bad93b4f',
              'metadata': {'source': 'src/context/language/javascript-parser.ts',
                           'text': 'src/context/language/javascript-parser.ts\n'
                                   '\n'
                                   'import { AbstractParser, EnclosingContext '
                                   '} from "../../constants";\n'
                                   'import * as parser from "@babel/parser";\n'
                                   'import traverse, { NodePath, Node } from '
                                   '"@babel/traverse";\n'
                                   '\n'
                                   'const processNode = (\n'
                                   '  path: NodePath<Node>,\n'
                                   '  lineStart: number,\n'
                                   '  lineEnd: number,\n'
                                   '  largestSize: number,\n'
                               

In [46]:
context = [item['metadata']['text'] for item in top_matches['matches']]

In [47]:
context

['src/context/language/javascript-parser.ts\n\nimport { AbstractParser, EnclosingContext } from "../../constants";\nimport * as parser from "@babel/parser";\nimport traverse, { NodePath, Node } from "@babel/traverse";\n\nconst processNode = (\n  path: NodePath<Node>,\n  lineStart: number,\n  lineEnd: number,\n  largestSize: number,\n  largestEnclosingContext: Node | null\n) => {\n  const { start, end } = path.node.loc;\n  if (start.line <= lineStart && lineEnd <= end.line) {\n    const size = end.line - start.line;\n    if (size > largestSize) {\n      largestSize = size;\n      largestEnclosingContext = path.node;\n    }\n  }\n  return { largestSize, largestEnclosingContext };\n};\n\nexport class JavascriptParser implements AbstractParser {\n  findEnclosingContext(\n    file: string,\n    lineStart: number,\n    lineEnd: number\n  ): EnclosingContext {\n    const ast = parser.parse(file, {\n      sourceType: "module",\n      plugins: ["jsx", "typescript"], // To allow JSX and TypeScri
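The list comprehension above keeps every one of the `top_k` matches, however weak. Pinecone also returns a similarity `score` for each match (not visible in the truncated output shown earlier), so a possible refinement is to drop low-scoring matches before building the prompt. The sketch below works on plain dictionaries shaped like the query response; `select_contexts` and the `min_score` threshold of 0.3 are hypothetical and would need tuning, and adapting it to the real response objects is left as an assumption.

```python
def select_contexts(matches, min_score=0.3):
    """Return the 'text' metadata of matches scoring at least min_score."""
    selected = []
    for match in matches:
        metadata = match.get("metadata") or {}
        text = metadata.get("text")
        if text is not None and match.get("score", 0.0) >= min_score:
            selected.append(text)
    return selected


# Made-up matches that mimic the structure of top_matches['matches'].
sample_matches = [
    {"id": "a", "score": 0.62, "metadata": {"source": "src/a.ts", "text": "code A"}},
    {"id": "b", "score": 0.18, "metadata": {"source": "src/b.ts", "text": "code B"}},
    {"id": "c", "score": 0.41, "metadata": {"source": "src/c.ts"}},  # no text stored
]

print(select_contexts(sample_matches))
```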

In [48]:
augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(context) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query

In [49]:
print(augmented_query)

<CONTEXT>
src/context/language/javascript-parser.ts

import { AbstractParser, EnclosingContext } from "../../constants";
import * as parser from "@babel/parser";
import traverse, { NodePath, Node } from "@babel/traverse";

const processNode = (
  path: NodePath<Node>,
  lineStart: number,
  lineEnd: number,
  largestSize: number,
  largestEnclosingContext: Node | null
) => {
  const { start, end } = path.node.loc;
  if (start.line <= lineStart && lineEnd <= end.line) {
    const size = end.line - start.line;
    if (size > largestSize) {
      largestSize = size;
      largestEnclosingContext = path.node;
    }
  }
  return { largestSize, largestEnclosingContext };
};

export class JavascriptParser implements AbstractParser {
  findEnclosingContext(
    file: string,
    lineStart: number,
    lineEnd: number
  ): EnclosingContext {
    const ast = parser.parse(file, {
      sourceType: "module",
      plugins: ["jsx", "typescript"], // To allow JSX and TypeScript
    });
    let large
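The `augmented_query` built above concatenates every retrieved file in full, so the prompt can grow very large for big source files. A minimal sketch of a size-capped variant follows. It reuses the same delimiters as the cell above; the helper name `build_augmented_query` and the `max_chars` budget are assumptions introduced for illustration, and a character count is only a rough stand-in for the model's token limit.

```python
SEPARATOR = "\n\n-------\n\n"


def build_augmented_query(contexts, query, max_chars=12000):
    """Join contexts in ranked order until the character budget is used up."""
    kept = []
    used = 0
    for text in contexts:
        if used + len(text) > max_chars:
            break  # stop at the first context that would exceed the budget
        kept.append(text)
        used += len(text)
    return (
        "<CONTEXT>\n"
        + SEPARATOR.join(kept)
        + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n"
        + query
    )


# With a budget of 5 characters only the first (highest-ranked) context fits.
print(build_augmented_query(["aaa", "bbbb"], "What does this do?", max_chars=5))
```

Because the matches arrive sorted by similarity, cutting from the end drops the least relevant contexts first.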

In [50]:
system_prompt = """You are a Senior Software Engineer, who is an expert in TypeScript.

Answer the question I have about the codebase based on the context provided. Always consider all of the context provided
to answer my question.
"""

llm_response = client.chat.completions.create(
    model="deepseek/deepseek-r1:free",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": augmented_query}
    ]
)

response = llm_response.choices[0].message.content


In [51]:
print(response)

The JavaScript parser (JavascriptParser) is used to analyze code structure and provide contextual information around code changes in TypeScript/JavaScript files during code review. Here's how it works:

1. **Registration**: It's registered in `constants.ts` for JS/TS file extensions:
```typescript
["ts", "tsx", "js", "jsx"] -> JavascriptParser
```

2. **Code Analysis**:
- AST Parsing: Uses Babel parser with TypeScript/JSX support
- Context Detection: Implements `findEnclosingContext` to locate the largest enclosing code structure (functions, interfaces) around changed lines

3. **Review Process Integration**:
- `smarterContextPatchStrategy` uses the parser to:
  - Understand code structure around diff hunks
  - Provide contextual code blocks for AI analysis
  - Generate more meaningful review comments that understand scope

4. **Key Features**:
- Tracks both function declarations and TypeScript interfaces
- Finds the largest enclosing context using line ranges from diffs
- Enables prec

In [58]:
def perform_rag(query, model="deepseek/deepseek-r1:free"):
    raw_query_embedding = get_huggingface_embeddings(query)

    top_matches = pinecone_index.query(vector=raw_query_embedding.tolist(), top_k=5, include_metadata=True, namespace="https://github.com/CoderAgent/SecureAgent")

    # Get the list of retrieved texts
    contexts = [item['metadata']['text'] for item in top_matches['matches']]

    augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[:10]) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query

    # Modify the prompt below as needed to improve the response quality
    system_prompt = """You are a Senior Software Engineer, specializing in TypeScript.

    Answer any questions I have about the codebase, based on the code provided. Always consider all of the context provided when forming a response.
    """

    llm_response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": augmented_query}
        ]
    )

    return llm_response.choices[0].message.content

In [55]:
response = perform_rag("What does this github repo do?")

print(response)

The provided TypeScript code snippets show a GitHub app designed to automate code reviews for GitHub repositories using a language model (LLM) via the Groq API. Here's exactly what it does:

### 1. Listening to GitHub Events

- It listens for GitHub's webhook events specifically for:
  - `pull_request.opened`: Triggered when a pull request is newly opened.

### 2. Fetching and Analyzing Code Changes

- Once a pull request (PR) is opened, the app:
  - Retrieves the list of changed files from the PR (using GitHub API).
  - For each file, it formats the code diff patch to clearly indicate what was added or modified (highlighting new lines added).

### 3. Utilizing AI (LLM) to Review Code

- The code diffs obtained from the PR are passed to an LLM via Groq AI API. This is performed using detailed, carefully designed prompts aimed at generating code improvement suggestions.
- Prompts instruct the LLM specifically to:
  - Focus on important improvement areas (code quality, security, readabil

In [57]:
response = perform_rag("What does this github repo do?", "anthropic/claude-3.7-sonnet:thinking")

print(response)

# GitHub PR Review Bot

This repository is a GitHub App that automatically reviews pull requests using AI. Here's what it does:

## Core Functionality

1. **Automated PR Reviews**: The app listens for `pull_request.opened` webhook events and automatically generates code reviews.

2. **AI-Powered Analysis**: It uses Groq's LLM API (as indicated by the required `GROQ_API_KEY` environment variable) to analyze code changes and generate intelligent feedback.

3. **Comprehensive Review Output**:
   - **General PR Comments**: Overall feedback on the pull request
   - **Inline Code Suggestions**: Specific code improvements that can be directly applied within GitHub's interface

4. **Smart Diff Analysis**: The AI focuses specifically on reviewing new code additions in the PR (lines starting with '+'), avoiding critique of code that was already in the repository.

## Technical Implementation

The app is built with TypeScript and:
- Integrates with GitHub using Octokit
- Sets up a Node.js server 