# RAG System for Code Repository Queries

This notebook demonstrates a **Retrieval-Augmented Generation (RAG)** system that answers natural-language questions about your GitHub repositories.

**How it works:**
1. **Indexing**: Embed repository metadata, READMEs, and commit messages using `nomic-embed-text`
2. **Retrieval**: Embed the question, rank the indexed chunks by cosine similarity to it, and keep the top matches
3. **Generation**: Use an LLM to generate answers based on retrieved context

**Components:**
- Embedding model: `nomic-embed-text` (via Ollama)
- Vector store: Simple NumPy-based cosine similarity
- LLM: `llama3.1:8b` (via Ollama)
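
The retrieval step can be sketched with plain NumPy. This is a minimal illustration and not the actual `CodeRAG` vector store: the function name `top_k_cosine` and the toy 3-dimensional vectors are assumptions made for the example.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=3):
    """Return (row index, cosine similarity) pairs for the k most similar rows."""
    # Normalize so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q
    # Sort descending by similarity and keep the first k indices.
    order = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in order]

# Toy vectors (hypothetical); real embeddings would come from the embedding model.
docs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [1.0, 1.0, 0.0]])
query = np.array([1.0, 0.1, 0.0])
hits = top_k_cosine(query, docs, k=2)
```

Normalizing both sides once keeps the ranking a single matrix-vector product, which is adequate for an index of this size.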

In [1]:
import sys
sys.path.insert(0, "..")

import pandas as pd
from pathlib import Path
from src.rag.code_rag import CodeRAG

DATA = Path("..") / "data"
OUT = Path("..") / "outputs"

---
## Step 1: Load Data

In [2]:
repos = pd.read_csv(DATA / "repos.csv")
readmes = pd.read_csv(DATA / "readmes.csv")
languages = pd.read_csv(DATA / "languages.csv")
commits = pd.read_csv(DATA / "commits.csv")

print(f"Repositories: {len(repos)}")
print(f"READMEs: {len(readmes)}")
print(f"Commits: {len(commits)}")
print(f"\nRepos: {repos['repo_name'].tolist()}")

Repositories: 7
READMEs: 5
Commits: 23

Repos: ['Competitive-Programming', 'leetcode-solutions', 'nfl-bdb-2024', 'polars', 'questions', 'roadtrip-planner', 'roadtripplanner']


---
## Step 2: Initialize RAG System & Index Repositories

In [3]:
# Initialize the RAG system
rag = CodeRAG(
    embedding_model="nomic-embed-text",
    llm_model="llama3.1:8b"
)

# Index all repositories
rag.index_repositories(repos, readmes, languages, commits)

Indexing repositories...
  Embedding: overview for Competitive-Programming...
  Embedding: commits for Competitive-Programming...
  Embedding: overview for leetcode-solutions...
  Embedding: readme for leetcode-solutions...
  Embedding: commits for leetcode-solutions...
  Embedding: overview for nfl-bdb-2024...
  Embedding: readme for nfl-bdb-2024...
  Embedding: overview for polars...
  Embedding: readme for polars...
  Embedding: commits for polars...
  Embedding: overview for questions...
  Embedding: overview for roadtrip-planner...
  Embedding: readme for roadtrip-planner...
  Embedding: commits for roadtrip-planner...
  Embedding: overview for roadtripplanner...
  Embedding: readme for roadtripplanner...
  Embedding: commits for roadtripplanner...
Indexed 17 chunks from 7 repositories.


In [4]:
# Save the index for future use
rag.save(OUT / "rag_index")

Saved RAG index to ../outputs/rag_index


---
## Step 3: Query the RAG System

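With the index built, we can ask natural-language questions about the repositories. Each result contains the answer, the retrieved sources with their similarity scores, and the query latency.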
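
For orientation, the generation step in a RAG pipeline usually packs the retrieved chunks into a single prompt. The sketch below is an assumption about how that could look, not the prompt `CodeRAG` actually uses: `build_prompt`, the instruction wording, and the `text` key are hypothetical, while `repo` and `type` mirror the keys shown in the `sources` output.

```python
def build_prompt(question, chunks):
    """Assemble an LLM prompt from a question and retrieved chunks (hypothetical format)."""
    # Label each chunk with its origin so the model can tell sources apart.
    context = "\n\n".join(
        f"[{c['repo']}/{c['type']}]\n{c['text']}" for c in chunks
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Placeholder chunk; the text is illustrative, not taken from the index.
example_chunks = [{"repo": "polars", "type": "readme", "text": "example chunk text"}]
prompt = build_prompt("What is the polars project about?", example_chunks)
```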

In [5]:
# Query 1: General question about projects
result = rag.query("What programming languages do I use most?")

print("QUESTION:", result["question"])
print("\nANSWER:")
print(result["answer"])
print(f"\nSources: {result['sources']}")
print(f"Latency: {result['latency_s']}s")

QUESTION: What programming languages do I use most?

ANSWER:
The context does not provide sufficient information about the programming languages used in the repository "Competitive-Programming". The "Primary Language" is listed as "nan", which is likely an abbreviation for "not available" or "unknown", and there are no other language-related details provided. Therefore, I cannot determine what programming languages you use most based on this context.

Sources: [{'repo': 'Competitive-Programming', 'type': 'overview', 'similarity': 0.563}, {'repo': 'polars', 'type': 'commits', 'similarity': 0.53}, {'repo': 'leetcode-solutions', 'type': 'readme', 'similarity': 0.528}]
Latency: 9.98s
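
The answer above treats `nan` as an abbreviation, which suggests that a missing pandas value was rendered as text in the overview chunk. One possible fix, sketched here under assumptions, is to replace missing values with an explicit phrase before building chunk text. The column name `primary_language`, the helper `describe_language`, and the two demo rows are hypothetical and do not come from `repos.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the repository table; names and values are made up.
repos_demo = pd.DataFrame({
    "repo_name": ["example-a", "example-b"],
    "primary_language": [float("nan"), "Python"],
})

def describe_language(value) -> str:
    """Render a language cell as text that an LLM will not misread."""
    return "not detected" if pd.isna(value) else str(value)

overview_lines = [
    f"Repository: {row.repo_name}\n"
    f"Primary Language: {describe_language(row.primary_language)}"
    for row in repos_demo.itertuples(index=False)
]
```

Cleaning the text at indexing time also keeps the string `nan` out of the embedding itself.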


In [6]:
# Query 2: Specific project question
result = rag.query("What is the polars project about?")

print("QUESTION:", result["question"])
print("\nANSWER:")
print(result["answer"])
print(f"\nSources: {result['sources']}")
print(f"Latency: {result['latency_s']}s")

QUESTION: What is the polars project about?

ANSWER:
The polars project is an analytical query engine written for DataFrames, designed to be fast, easy to use, and expressive. It supports various programming languages including Python, Rust, Node.js, R, and SQL, and uses Apache Arrow Columnar Format. The project aims to provide a powerful expression API with features like lazy or eager execution, streaming, query optimization, multi-threading, SIMD, and more.

Sources: [{'repo': 'polars', 'type': 'overview', 'similarity': 0.684}, {'repo': 'polars', 'type': 'readme', 'similarity': 0.669}, {'repo': 'polars', 'type': 'commits', 'similarity': 0.598}]
Latency: 12.12s


In [7]:
# Query 3: Activity patterns
result = rag.query("What kind of commits have I been making recently?")

print("QUESTION:", result["question"])
print("\nANSWER:")
print(result["answer"])
print(f"\nSources: {result['sources']}")
print(f"Latency: {result['latency_s']}s")

QUESTION: What kind of commits have I been making recently?

ANSWER:
Based on the provided context, it appears that you've been creating new problems for Competitive-Programming (e.g., "Create 1772E", "Create 1738C"), adding new solutions for leetcode-solutions (e.g., "add new stuff", "new"), and making initial commits for roadtripplanner.

Sources: [{'repo': 'Competitive-Programming', 'type': 'commits', 'similarity': 0.742}, {'repo': 'leetcode-solutions', 'type': 'commits', 'similarity': 0.691}, {'repo': 'roadtripplanner', 'type': 'commits', 'similarity': 0.676}]
Latency: 4.81s


In [8]:
# Query 4: Skills question
result = rag.query("Based on my projects, what are my technical skills?")

print("QUESTION:", result["question"])
print("\nANSWER:")
print(result["answer"])
print(f"\nSources: {result['sources']}")
print(f"Latency: {result['latency_s']}s")

QUESTION: Based on my projects, what are my technical skills?

ANSWER:
Based on the provided context, it appears that you have experience with Polars, which is a Rust library for parallel computing and data manipulation. The recent commits suggest that you are familiar with:

1. Git (based on the commit messages)
2. Rust programming language
3. Polars library specifically (with commits related to its functionality)

However, there is no information about your technical skills in the context provided. The README for roadtrip-planner and the repository Competitive-Programming do not contain any relevant information about your technical skills.

To answer your question more accurately, I would need additional context or information about your projects and skills.

Sources: [{'repo': 'polars', 'type': 'commits', 'similarity': 0.535}, {'repo': 'Competitive-Programming', 'type': 'overview', 'similarity': 0.519}, {'repo': 'roadtrip-planner', 'type': 'readme', 'similarity': 0.518}]
Latency: 9.

In [9]:
# Query 5: Recommendation question
result = rag.query("Which of my projects would be most impressive to show an employer?")

print("QUESTION:", result["question"])
print("\nANSWER:")
print(result["answer"])
print(f"\nSources: {result['sources']}")
print(f"Latency: {result['latency_s']}s")

QUESTION: Which of my projects would be most impressive to show an employer?

ANSWER:
Based on the provided context, I would recommend showing the "roadtrip-planner" project to an employer as it appears to be a more substantial and complex project compared to the other two repositories.

The roadtrip-planner repository has multiple commits with significant changes, including setting up a React frontend with search functionality, adding a FastAPI backend, and implementing a vibe-based place search pipeline. It also includes several Python scripts for collecting places from Google Places API and generating semantic embeddings using sentence-transformers.

In contrast, the "Competitive-Programming" repository appears to be a collection of solutions to competitive programming problems, which may not demonstrate as much technical expertise or project management skills as the roadtrip-planner project. The leetcode-solutions repository is simply a README file with no additional information ab

---
## Interactive Query Function

Use this function to ask your own questions:

In [10]:
def ask(question: str) -> None:
    """Ask a question about your repositories and print the answer with its sources."""
    result = rag.query(question)
    # Label each source as "repo/type", e.g. "polars/readme".
    sources = [f"{s['repo']}/{s['type']}" for s in result["sources"]]
    print(f"Q: {result['question']}")
    print(f"\nA: {result['answer']}")
    print(f"\n[Sources: {sources} | {result['latency_s']}s]")

# Example usage:
ask("Do I have any experience with data processing?")

Q: Do I have any experience with data processing?

A: Based on the provided context, there is no direct indication that you have experience with data processing. However, the README for the nfl-bdb-2024 repository does mention "curated datasets" and "reproducible figures", which suggests some level of data processing and analysis has been done in this specific project.

It's also worth noting that the repository includes a notebook called `01_analysis.ipynb` under the `notebooks/` directory, which might imply that some form of data analysis is being performed. However, without more information about your personal experience or involvement with this project, it's difficult to say for certain whether you have experience with data processing.

If I had to provide a tentative answer based on the context provided, I would say "it appears that someone involved in this project has some level of experience with data processing, but it's unclear if that person is you."

[Sources: ['nfl-bdb-2024

---
## RAG System Evaluation

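The cell below runs five test queries and records the answer length, top-ranked source, top similarity score, and latency for each. These are proxy metrics: they describe retrieval confidence and speed, not whether the answers are correct.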

In [11]:
# Evaluate with a set of test queries
test_queries = [
    "What is the largest repository?",
    "Which repos are forks?",
    "What languages do I know?",
    "Describe my Python projects",
    "What kind of developer am I?"
]

eval_results = []
for q in test_queries:
    result = rag.query(q)
    eval_results.append({
        "question": q,
        "answer_length": len(result["answer"]),
        "top_source": result["sources"][0]["repo"] if result["sources"] else "N/A",
        "top_similarity": result["sources"][0]["similarity"] if result["sources"] else 0,
        "latency_s": result["latency_s"]
    })
    print(f"Processed: {q[:40]}...")

eval_df = pd.DataFrame(eval_results)
print("\n" + "="*60)
print("RAG EVALUATION RESULTS")
print("="*60)
print(eval_df.to_string(index=False))
print(f"\nAverage latency: {eval_df['latency_s'].mean():.2f}s")
print(f"Average similarity: {eval_df['top_similarity'].mean():.3f}")

Processed: What is the largest repository?...
Processed: Which repos are forks?...
Processed: What languages do I know?...
Processed: Describe my Python projects...
Processed: What kind of developer am I?...

RAG EVALUATION RESULTS
                       question  answer_length              top_source  top_similarity  latency_s
What is the largest repository?            440            nfl-bdb-2024           0.551       8.42
         Which repos are forks?            364        roadtrip-planner           0.513       6.32
      What languages do I know?            249               questions           0.543       4.06
    Describe my Python projects            431 Competitive-Programming           0.556       5.40
   What kind of developer am I?            492                  polars           0.568       6.79

Average latency: 6.20s
Average similarity: 0.546
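
Similarity and latency do not show whether the right repository was retrieved. A common complement is a hit rate over hand-labeled cases, sketched below. The helpers `hit_at_k` and `hit_rate` and the toy `fake_index` are assumptions for illustration; the labels would have to be written by hand for real queries.

```python
def hit_at_k(retrieved_repos, expected_repo, k=3):
    """True if the expected repo appears among the first k retrieved repos."""
    return expected_repo in retrieved_repos[:k]

def hit_rate(cases, retrieve, k=3):
    """Fraction of (question, expected_repo) cases whose expected repo is retrieved."""
    hits = sum(hit_at_k(retrieve(question), expected, k) for question, expected in cases)
    return hits / len(cases)

# Toy retriever (hypothetical): maps a question to a ranked list of repo names.
fake_index = {
    "q1": ["repo-a", "repo-b", "repo-c"],
    "q2": ["repo-b", "repo-c", "repo-d"],
}
cases = [("q1", "repo-a"), ("q2", "repo-a")]
rate = hit_rate(cases, fake_index.get, k=3)
```

With the notebook's objects, the retriever could be `lambda q: [s["repo"] for s in rag.query(q)["sources"]]`, although that also runs the LLM for every case.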


---
## Summary

This RAG system demonstrates:

1. **Semantic Search**: Using embeddings to find relevant content based on meaning, not just keywords
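2. **Context-Aware Generation**: The LLM answers from retrieved repository data and says so when that context is insufficient
3. **Source Attribution**: Each result lists the repositories and chunk types (overview, readme, commits) that were retrieved, with their similarity scores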

**Potential Improvements:**
- Add more granular chunking (function-level, file-level)
- Include actual source code content
- Implement hybrid search (semantic + keyword)
- Add reranking for better retrieval accuracy
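
As a sketch of the first improvement, Python source files could be split into one chunk per top-level function or class using the standard `ast` module. This is an assumption about one possible approach: the helper `top_level_chunks`, the chunk keys, and the sample source are hypothetical and not part of `CodeRAG`.

```python
import ast

def top_level_chunks(source: str, path: str):
    """Split Python source into one chunk per top-level function or class."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue  # skip imports, assignments, and other statements
        chunks.append({
            "path": path,
            "name": node.name,
            "type": kind,
            # Exact source text of the definition (decorators are not included).
            "text": ast.get_source_segment(source, node),
        })
    return chunks

sample = '''import math

def area(r):
    return math.pi * r ** 2

class Circle:
    def __init__(self, r):
        self.r = r
'''
chunks = top_level_chunks(sample, "geometry.py")
```

Each chunk could then be embedded on its own, so a question about one function retrieves that function rather than a whole repository summary.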

In [12]:
print("="*60)
print("RAG SYSTEM STATISTICS")
print("="*60)
print(f"Total indexed chunks: {len(rag.vector_store.documents)}")
print("Embedding model: nomic-embed-text")  # plain strings: no placeholders to format
print("LLM model: llama3.1:8b")
print(f"Vector dimensions: {rag.vector_store.embeddings.shape[1]}")
print(f"\nChunks by type:")
for doc_type in set(d['type'] for d in rag.vector_store.documents):
    count = sum(1 for d in rag.vector_store.documents if d['type'] == doc_type)
    print(f"  {doc_type}: {count}")

RAG SYSTEM STATISTICS
Total indexed chunks: 17
Embedding model: nomic-embed-text
LLM model: llama3.1:8b
Vector dimensions: 768

Chunks by type:
  readme: 5
  overview: 7
  commits: 5
