![Img](https://app.theheadstarter.com/static/hs-logo-opengraph.png)

# Headstarter Codebase RAG Project

![Screenshot 2024-11-25 at 7 12 58 PM](https://github.com/user-attachments/assets/0bd67cf0-43d5-46d2-879c-a752cae4c8e3)

# Install Necessary Libraries

In [1]:
! pip install pygithub langchain langchain-community openai tiktoken pinecone-client langchain_pinecone sentence-transformers



In [2]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from langchain_pinecone import PineconeVectorStore
from langchain.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from google.colab import userdata
from pinecone import Pinecone
import os
import tempfile
from github import Github, Repository
from git import Repo
from openai import OpenAI
from pathlib import Path
from langchain.schema import Document


# Clone a GitHub Repo locally

In [3]:
def clone_repo(repo_url):
  """Clone repo_url into /content and return the local path of the clone."""
  repo_name = repo_url.split("/")[-1]
  repo_path = f"/content/{repo_name}"
  Repo.clone_from(repo_url, repo_path)
  return repo_path
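
`clone_repo` uses the last path segment of the URL as the folder name. That works for the URL used here, but a trailing slash or a `.git` suffix would change the result. A minimal sketch of a more tolerant helper follows; `repo_name_from_url` is a hypothetical name, not part of the notebook:

```python
def repo_name_from_url(repo_url: str) -> str:
    """Return the repository name from a GitHub-style URL (hypothetical helper)."""
    # Drop any trailing slash so the last segment is never empty.
    name = repo_url.rstrip("/").split("/")[-1]
    # Clone URLs often end in ".git"; strip it so the folder name stays clean.
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return name


print(repo_name_from_url("https://github.com/jballo/Airmi"))
print(repo_name_from_url("https://github.com/jballo/Airmi.git/"))
```

Both calls print `Airmi`, so the clone would land in the same folder whichever form of the URL is pasted in.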

In [None]:
clone_repo("https://github.com/jballo/Airmi")

In [4]:
path = "/content/Airmi"

In [9]:
SUPPORTED_EXTENSIONS = [".py", ".js", ".tsx", ".ts", ".java", ".cpp"]

IGNORED_DIRS = ["node_modules", ".git", "dist", "__pycache__", ".next", ".vscode", "env", "venv"]

In [10]:
def get_file_content(file_path, repo_path):
  try:

    with open(file_path, "r", encoding="utf-8") as f:
      content = f.read()

      rel_path = os.path.relpath(file_path, repo_path)

      return {
          "name": rel_path,
          "content": content
      }
  except Exception as e:
    print(f"Error reading file {file_path}: {e}")
    return None

def get_main_files_content(repo_path: str):
   """
   Get content of supported code files from the local repository.


   Args:
       repo_path: Path to the local repository


   Returns:
       List of dictionaries containing file names and contents
   """
   files_content = []


   try:
       for root, _, files in os.walk(repo_path):
           # Skip if any component of the current path is an ignored directory
           # (comparing whole components avoids matching "env" inside "environment")
           if any(part in IGNORED_DIRS for part in Path(root).parts):
               continue


           # Process each file in current directory
           for file in files:
               file_path = os.path.join(root, file)
               if os.path.splitext(file)[1] in SUPPORTED_EXTENSIONS:
                   file_content = get_file_content(file_path, repo_path)
                   if file_content:
                       files_content.append(file_content)


   except Exception as e:
       print(f"Error reading repository: {str(e)}")


   return files_content
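
Directory filtering is easy to get subtly wrong: a substring test on the path would also skip a directory such as `src/environment`, because it contains `env`. The sketch below matches on whole path components instead. `is_ignored` and `IGNORED` are illustrative names, and it assumes POSIX-style paths like the ones Colab uses:

```python
from pathlib import PurePosixPath

# Same directory names as the notebook's ignore list, held in a set for fast lookup.
IGNORED = {"node_modules", ".git", "dist", "__pycache__", ".next", ".vscode", "env", "venv"}


def is_ignored(path: str, ignored=IGNORED) -> bool:
    """True if any whole component of `path` is an ignored directory name."""
    return any(part in ignored for part in PurePosixPath(path).parts)


print(is_ignored("/content/Airmi/node_modules/react"))  # True
print(is_ignored("/content/Airmi/src/environment"))     # False
```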

In [11]:
file_content = get_main_files_content(path)

In [12]:
file_content

[{'name': 'tailwind.config.ts',
  'content': 'import type { Config } from "tailwindcss";\n\nconst config: Config = {\n  content: [\n    "./src/pages/**/*.{js,ts,jsx,tsx,mdx}",\n    "./src/components/**/*.{js,ts,jsx,tsx,mdx}",\n    "./src/app/**/*.{js,ts,jsx,tsx,mdx}",\n  ],\n  theme: {\n    extend: {\n      colors: {\n        background: "var(--background)",\n        foreground: "var(--foreground)",\n      },\n    },\n  },\n  plugins: [],\n};\nexport default config;\n'},
 {'name': 'next.config.ts',
  'content': 'import type { NextConfig } from "next";\n\nconst nextConfig: NextConfig = {\n  /* config options here */\n};\n\nexport default nextConfig;\n'},
 {'name': 'src/app/layout.tsx',
  'content': 'import type { Metadata } from "next";\nimport "./globals.css";\n\nexport const metadata: Metadata = {\n  title: "Create Next App",\n  description: "Generated by create next app",\n};\n\nexport default function RootLayout({\n  children,\n}: Readonly<{\n  children: React.ReactNode;\n}>) {\n  r

# Embeddings

In [13]:
def get_huggingface_embeddings(text, model_name="sentence-transformers/all-mpnet-base-v2"):
    model = SentenceTransformer(model_name)
    return model.encode(text)

In [14]:
text = "I am a programmer"

embeddings = get_huggingface_embeddings(text)


In [15]:
embeddings

array([ 1.81737654e-02, -3.02657508e-03, -4.77465875e-02,  1.86379403e-02,
        3.14537995e-02,  1.87255293e-02, -1.52534274e-02, -6.77293688e-02,
       -1.26903653e-02,  1.28427576e-02,  5.80701306e-02,  4.00234833e-02,
        3.27073298e-02,  7.12998435e-02,  5.56373373e-02,  1.68628506e-02,
        6.97603747e-02, -5.02619930e-02,  6.13140827e-03, -1.46559235e-02,
       -4.51957993e-03,  4.82934639e-02, -2.53051296e-02, -1.97862904e-03,
       -4.36902530e-02, -2.41507161e-02,  1.29505759e-02, -3.78611824e-03,
       -2.05718316e-02,  1.09819308e-01,  3.07672890e-03, -2.80443169e-02,
       -1.55807249e-02, -1.24789868e-02,  1.75239131e-06, -2.93756695e-03,
       -1.43048428e-02,  4.88386713e-02, -6.21114224e-02,  2.95061413e-02,
       -1.40470508e-02,  2.20708270e-02,  1.13067888e-02,  4.70893271e-02,
        7.58305984e-03, -8.30314530e-05,  6.67821169e-02, -1.21320095e-02,
        4.39386303e-03,  2.47453637e-02,  1.02529004e-02, -6.54432410e-03,
       -5.53147821e-03, -

# Setting up Pinecone
**1. Create an account on [Pinecone.io](https://app.pinecone.io/)**

**2. Create a new index called "codebase-rag" and set the dimensions to 768. Leave the rest of the settings as they are.**

![Screenshot 2024-11-24 at 10 58 50 PM](https://github.com/user-attachments/assets/f5fda046-4087-432a-a8c2-86e061005238)



**3. Create an API Key for Pinecone**

![Screenshot 2024-11-24 at 10 44 37 PM](https://github.com/user-attachments/assets/e7feacc6-2bd1-472a-82e5-659f65624a88)


**4. Store your Pinecone API Key within Google Colab's secrets section, and then enable access to it (see the blue checkmark)**

![Screenshot 2024-11-24 at 10 45 25 PM](https://github.com/user-attachments/assets/eaf73083-0b5f-4d17-9e0c-eab84f91b0bc)
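
The index dimension in step 2 has to match the length of the vectors the embedding model produces; 768 is the vector length of `all-mpnet-base-v2`. A small sanity check before inserting vectors could look like the sketch below. `check_dimension` and `INDEX_DIMENSION` are illustrative names, and the vector is a stand-in list rather than a real embedding:

```python
INDEX_DIMENSION = 768  # dimension chosen when creating the "codebase-rag" index


def check_dimension(vector, expected=INDEX_DIMENSION):
    """Raise ValueError if the vector length differs from the index dimension."""
    if len(vector) != expected:
        raise ValueError(
            f"Embedding has {len(vector)} dimensions, but the index expects {expected}"
        )
    return True


# Stand-in for a real embedding; in the notebook this would be `embeddings`.
fake_embedding = [0.0] * 768
print(check_dimension(fake_embedding))
```

Switching to a different embedding model later would mean creating a new index with that model's dimension.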



In [16]:
# Set the PINECONE_API_KEY as an environment variable
pinecone_api_key = userdata.get("PINECONE_API_KEY")
os.environ['PINECONE_API_KEY'] = pinecone_api_key

# Initialize Pinecone
pc = Pinecone(api_key=pinecone_api_key)

# Connect to your Pinecone index
pinecone_index = pc.Index("codebase-rag")

In [17]:
vectorstore = PineconeVectorStore(index_name="codebase-rag", embedding=HuggingFaceEmbeddings())



In [19]:
# Insert the codebase embeddings into Pinecone

documents = []

for file in file_content:
  doc = Document(
      page_content=f"{file['name']}\n{file['content']}",
      metadata={"source": file['name']}
  )

  documents.append(doc)

vectorstore = PineconeVectorStore.from_documents(
    documents=documents,
    embedding=HuggingFaceEmbeddings(),
    index_name="codebase-rag",
    namespace="https://github.com/jballo/Airmi"
)









# Perform RAG

1. Get your Groq API Key [here](https://console.groq.com/keys)

2. Paste your Groq API Key into your Google Colab secrets, and make sure to enable permissions for it

![Screenshot 2024-11-25 at 12 00 16 AM](https://github.com/user-attachments/assets/e5525d29-bca6-4dbd-892b-cc770a6b281d)


In [20]:
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=userdata.get("GROQ_API_KEY")
)

In [41]:
query = "How do I run this app?"

In [44]:
query_embedding = get_huggingface_embeddings(query)

In [45]:
query_embedding

array([-1.31608481e-02, -2.53478996e-02,  2.79921815e-02,  1.61318742e-02,
       -2.71736868e-02,  6.10447191e-02,  6.53133960e-04,  1.37305306e-02,
        3.28075737e-02, -6.97701946e-02, -3.97196002e-02,  5.03428839e-02,
        4.21380103e-02,  2.11055521e-02,  4.72497977e-02, -4.02478129e-02,
        4.08329107e-02, -1.06820406e-03,  5.58333807e-02,  3.05293947e-02,
        3.84199768e-02,  1.44366436e-02, -5.95829226e-02,  1.17453458e-02,
       -2.20744777e-02, -4.28975932e-02, -1.26740774e-02,  5.42177372e-02,
        1.29888696e-03, -3.28473523e-02, -5.75144924e-02, -3.79336532e-04,
        1.16412723e-02,  3.67633328e-02,  1.19419508e-06, -5.49665615e-02,
        1.06474981e-02,  6.29877159e-03,  2.46591810e-02,  8.21967330e-03,
        8.28878731e-02,  7.00120255e-02,  1.15234954e-02, -7.34425755e-03,
        3.29019055e-02, -1.88495386e-02, -8.41212925e-03,  5.25287120e-03,
        7.18077049e-02,  3.75441499e-02, -3.01440386e-03, -4.16908711e-02,
        7.68121183e-02,  

In [46]:
top_matches = pinecone_index.query(vector=query_embedding.tolist(), top_k=5, include_metadata=True, namespace="https://github.com/jballo/Airmi")
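
Each match returned by the query carries a `score`. Assuming the index was created with the cosine metric (the notebook does not state which metric was chosen), that score is the cosine similarity between the query vector and the stored vector. A plain-Python sketch of the calculation, using toy vectors rather than real embeddings, with `cosine_sim` as an illustrative name:

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_sim([1.0, 0.0], [1.0, 0.0]))  # vectors pointing the same way
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Under that assumption, values close to 1 indicate near-identical direction and values close to 0 indicate little similarity.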

In [47]:
top_matches

{'matches': [{'id': '6602da59-85ee-4736-bfef-e443fd3c70ad',
              'metadata': {'source': 'src/app/page.tsx',
                           'text': 'src/app/page.tsx\n'
                                   'import Image from "next/image";\n'
                                   '\n'
                                   'export default function Home() {\n'
                                   '  return <div>Hello World</div>;\n'
                                   '}\n'},
              'score': 0.126395434,
              'values': []},
             {'id': 'b75701d4-002c-4a91-aed8-134f9f681d28',
              'metadata': {'source': 'tailwind.config.ts',
                           'text': 'tailwind.config.ts\n'
                                   'import type { Config } from '
                                   '"tailwindcss";\n'
                                   '\n'
                                   'const config: Config = {\n'
                                   '  content: [\n'
           

In [48]:
contexts = [item['metadata']['text'] for item in top_matches['matches']]

In [49]:
contexts

['src/app/page.tsx\nimport Image from "next/image";\n\nexport default function Home() {\n  return <div>Hello World</div>;\n}\n',
 'tailwind.config.ts\nimport type { Config } from "tailwindcss";\n\nconst config: Config = {\n  content: [\n    "./src/pages/**/*.{js,ts,jsx,tsx,mdx}",\n    "./src/components/**/*.{js,ts,jsx,tsx,mdx}",\n    "./src/app/**/*.{js,ts,jsx,tsx,mdx}",\n  ],\n  theme: {\n    extend: {\n      colors: {\n        background: "var(--background)",\n        foreground: "var(--foreground)",\n      },\n    },\n  },\n  plugins: [],\n};\nexport default config;\n',
 'src/app/layout.tsx\nimport type { Metadata } from "next";\nimport "./globals.css";\n\nexport const metadata: Metadata = {\n  title: "Create Next App",\n  description: "Generated by create next app",\n};\n\nexport default function RootLayout({\n  children,\n}: Readonly<{\n  children: React.ReactNode;\n}>) {\n  return (\n    <html lang="en">\n      <body>{children}</body>\n    </html>\n  );\n}\n',
 'next.config.ts\ni

In [50]:
augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[ : 10]) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query
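
The prompt assembly can be wrapped in a small function so the same format is reused for every question. This is a sketch that mirrors the string built in the notebook; `build_augmented_query`, `SEPARATOR`, and `max_contexts` are illustrative names:

```python
SEPARATOR = "\n\n-------\n\n"


def build_augmented_query(contexts, query, max_contexts=10):
    """Wrap retrieved snippets in <CONTEXT> tags and append the question."""
    # Keep at most `max_contexts` snippets, separated by a dashed rule.
    joined = SEPARATOR.join(contexts[:max_contexts])
    return "<CONTEXT>\n" + joined + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query


demo = build_augmented_query(
    ["a.ts\nconst a = 1;", "b.ts\nconst b = 2;"],
    "What does a.ts do?",
)
print(demo)
```

Because the Pinecone query asks for `top_k=5`, the `[:10]` slice never trims anything here; it only acts as an upper bound if `top_k` is raised later.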

In [51]:
print(augmented_query)

<CONTEXT>
src/app/page.tsx
import Image from "next/image";

export default function Home() {
  return <div>Hello World</div>;
}


-------

tailwind.config.ts
import type { Config } from "tailwindcss";

const config: Config = {
  content: [
    "./src/pages/**/*.{js,ts,jsx,tsx,mdx}",
    "./src/components/**/*.{js,ts,jsx,tsx,mdx}",
    "./src/app/**/*.{js,ts,jsx,tsx,mdx}",
  ],
  theme: {
    extend: {
      colors: {
        background: "var(--background)",
        foreground: "var(--foreground)",
      },
    },
  },
  plugins: [],
};
export default config;


-------

src/app/layout.tsx
import type { Metadata } from "next";
import "./globals.css";

export const metadata: Metadata = {
  title: "Create Next App",
  description: "Generated by create next app",
};

export default function RootLayout({
  children,
}: Readonly<{
  children: React.ReactNode;
}>) {
  return (
    <html lang="en">
      <body>{children}</body>
    </html>
  );
}


-------

next.config.ts
import type { NextConf

In [54]:
system_prompt = """You are a Senior Software Engineer, specializing in TypeScript.

Answer any questions I have about the codebase, based on all the context provided.
Always consider all of the context provided when forming a response.

Let's think step by step. Verify step by step."""

# Stays None if the request fails, so later cells do not raise a NameError
response = None

try:
  llm_response = client.chat.completions.create(
      model="llama-3.1-8b-instant",
      messages=[
          {"role": "system", "content": system_prompt},
          {"role": "user", "content": augmented_query}
      ]
  )

  response = llm_response.choices[0].message.content
except Exception as e:
  print(f"Error generating response from model llama-3.1-8b-instant: {str(e)}")

In [55]:
print(response)

Based on the context you've provided, it appears you're working with a Next.js project. 

To run this app, you'll need to navigate to the root of your project in your terminal and run the following command:

```bash
npm run dev
```

or if you're using `yarn`:

```bash
yarn dev
```

This command will start the development server, and you should be able to access your app at `http://localhost:3000` in your web browser.

Note: Ensure that you've installed all the required dependencies by running:

```bash
npm install
```

or

```bash
yarn install
```

Make sure you've `node` and `npm` or `yarn` installed on your machine. 

Also, before running the app, install the necessary dependencies for the project:

```bash
npm install tailwindcss images
```

or 

```bash
yarn add tailwindcss images
```

The `tailwindcss` and `images` packages are required for image processing and Tailwind CSS functionality.


In [58]:
def perform_rag(query):
   raw_query_embedding = get_huggingface_embeddings(query)


   top_matches = pinecone_index.query(vector=raw_query_embedding.tolist(), top_k=5, include_metadata=True, namespace="https://github.com/jballo/Airmi")


   # Get the list of retrieved texts
   contexts = [item['metadata']['text'] for item in top_matches['matches']]


   augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts[ : 10]) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query


   # Modify the prompt below as needed to improve the response quality
   system_prompt = """You are a Senior Software Engineer, specializing in TypeScript.


   Answer any questions I have about the codebase, based on the code provided. Always consider all of the context provided when forming a response.
   """


   llm_response = client.chat.completions.create(
       model="llama-3.1-8b-instant",
       messages=[
           {"role": "system", "content": system_prompt},
           {"role": "user", "content": augmented_query}
       ]
   )


   return llm_response.choices[0].message.content

In [59]:
response = perform_rag("How can I improve the quality of the starting app?")

print(response)

Improving the quality of the starting app can involve multiple aspects, some of which I'll address here:

1. **Type Safety**: The existing code is written in TypeScript, which is great. However, we can improve type safety by adding more specific types, particularly in the React components. For instance, in `RootLayout.tsx`, we can define the type of the `children` prop as `React.ReactNode[]` or `React.ReactNode | null`.

2. **Code Style and Naming Conventions**: To maintain consistency and adherence to standard coding practices, consider enforcing a code style convention, such as Prettier, and ensuring the team follows a naming convention. For example, use camelCase for variable and function names, and PascalCase for type and interface names.

```typescript
// Instead of import Image from 'next/image'
import { Image } from 'next/image';

// Instead of export default function Home() {
export const Home = () => {
  // code here...
}
```

3. **Context API for Shared Data**: As the applica