PotatoGPT is a small utility designed to pull code and documents from a GitHub repository and turn them into structured text data ready for embedding. It includes:
- automatic repo cloning
- flexible file-type filters
- chunking logic for long files
- optional local vector embedding
- detailed setup instructions for Windows users
This guide walks you through the full setup and shows exactly how to run the loader.
pip install openai chromadb gitpython sqlite-utils
These packages handle cloning, scanning, embedding, and storing text.
This section covers:
- directory structure
- how to create the script
- how to run it
- what output to expect
By the end you'll have:
- A project folder ready for use
- A working Python script
- Local clones of your GitHub files
- A list of loaded documents printed on screen
For clarity, this guide uses:
C:\Projects\PotatoGPT
You can choose any location, but keep paths simple while following the tutorial.
Your project should look like this:
C:\Projects\PotatoGPT
├── loader.py
└── local_repo (created automatically)
Do not create local_repo yourself; the script handles that the first time you run it.
cd C:\Projects\PotatoGPT
pip install gitpython
Only gitpython is required for this step.
More dependencies are added later when embeddings are introduced.
Create a new file named loader.py and insert the following:
from git import Repo
import os
import fnmatch
# ===================================
# 1. GITHUB REPO CONFIGURATION
# ===================================
# Replace with YOUR repository URL
repo_url = "https://github.com/YOUR_USERNAME/YOUR_REPO.git"
# Local folder to clone repo into
repo_path = "./local_repo"
# ===================================
# 2. CLONE IF NOT EXISTS
# ===================================
if not os.path.exists(repo_path):
    print("Cloning GitHub repo...")
    Repo.clone_from(repo_url, repo_path)
else:
    print("Repository already exists, skipping clone.")
# ===================================
# 3. SELECT FILE PATTERNS
# ===================================
file_patterns = [
    "*.md",
    "*.txt",
    "*.cs",
    "*.cpp",
    "*.py",
    "*.json",
    "*.sql"
]
# ===================================
# 4. SCAN FILES AND LOAD CONTENT
# ===================================
docs = []
for root, _, files in os.walk(repo_path):
    for pattern in file_patterns:
        for filename in fnmatch.filter(files, pattern):
            full_path = os.path.join(root, filename)
            try:
                with open(full_path, "r", encoding="utf-8", errors="ignore") as f:
                    docs.append({
                        "path": full_path,
                        "content": f.read()
                    })
            except Exception as e:
                print("Error reading:", full_path, e)
# ===================================
# 5. RESULT
# ===================================
print("="*50)
print("DOCUMENT LOADER COMPLETE")
print("Total documents loaded:", len(docs))
print("Example entry:\n", docs[0] if docs else "No documents found")
print("="*50)
Update the repo URL:
repo_url = "https://github.com/YOUR_USERNAME/YOUR_REPO.git"
Start with a public repository. Handling private repos comes later.
cd C:\Projects\PotatoGPT
python loader.py
On the first run you should see output similar to:
Cloning GitHub repo...
==================================================
DOCUMENT LOADER COMPLETE
Total documents loaded: 53
Example entry:
{'path': './local_repo/README.md', 'content': '# Project ...'}
==================================================
On later runs the clone is skipped:
Repository already exists, skipping clone.
==================================================
DOCUMENT LOADER COMPLETE
Total documents loaded: 53
Example entry:
{'path': './local_repo/src/App.cs', 'content': 'public class ...'}
==================================================
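To see which file types the scan actually picked up, the loaded entries can be grouped by extension. The following is a minimal sketch that assumes docs has the same shape as in loader.py (a list of dicts with "path" and "content" keys). The helper name count_by_extension and the sample entries are hypothetical and not part of the original script.

```python
import os
from collections import Counter

def count_by_extension(docs):
    """Return a Counter mapping lowercase file extension -> number of documents."""
    return Counter(os.path.splitext(d["path"])[1].lower() for d in docs)

# Hypothetical sample data standing in for the list built by loader.py
sample_docs = [
    {"path": "./local_repo/README.md", "content": "# Project"},
    {"path": "./local_repo/src/App.cs", "content": "public class App {}"},
    {"path": "./local_repo/src/Util.cs", "content": "public class Util {}"},
]

# Print extensions from most to least common
for ext, count in count_by_extension(sample_docs).most_common():
    print(f"{ext}: {count}")
```

In loader.py itself, the same call could be made on docs right after the scan loop; an extension that is missing from the summary usually means its pattern is absent from file_patterns.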
At this stage:
- cloning works
- the repo exists locally
- file scanning is confirmed
Your folder should look like:
C:\Projects\PotatoGPT
├── loader.py
└── local_repo
    ├── README.md
    └── src
        └── ...
fatal: 'git' is not recognized as an internal or external command
Install Git, then restart the terminal so the updated PATH is picked up.
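A quick way to confirm that Git is reachable before running the loader is to look it up on PATH from Python. This is a sketch using only the standard library; the function name git_available is hypothetical and not part of loader.py.

```python
import shutil

def git_available(executable="git"):
    """Return True if the named executable can be found on PATH."""
    return shutil.which(executable) is not None

if git_available():
    print("Git found at:", shutil.which("git"))
else:
    print("Git not found on PATH; install it and reopen the terminal.")
```

If this reports that Git is missing even after installation, the terminal was most likely opened before the installer updated PATH.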
ModuleNotFoundError: No module named 'git'
Install:
pip install gitpython
Encoding errors when reading files are already handled via:
encoding="utf-8", errors="ignore"
To skip folders such as bin, obj, and .git, define ignore_dirs before the scan loop and prune dirs inside it:
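ignore_dirs = ["bin", "obj", ".git"]
for root, dirs, files in os.walk(repo_path):
    # Editing dirs in place stops os.walk from descending into ignored folders
    dirs[:] = [d for d in dirs if d not in ignore_dirs]
    # the pattern-matching loop from step 4 continues here
To load additional file types, add more patterns to file_patterns: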
"*.xaml",
"*.h",
"*.hpp",
"*.config",
"*.yaml",
"*.xml",
Example embedding function using text-embedding-3-small:
from openai import OpenAI
# OpenAI() reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()
def embed_text(txt: str):
    """Return the embedding vector for a single string."""
    res = client.embeddings.create(
        model="text-embedding-3-small",
        input=txt
    )
    return res.data[0].embedding
This step is optional and is used later in PotatoGPT to build vector databases or retrieval pipelines.
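Embedding models accept a limited amount of input per request, so long files are usually split into chunks before they are embedded. The feature list mentions chunking logic, but this guide does not show it, so the following is only a sketch under assumptions: chunk_text, the character-based sizes, and the overlap value are hypothetical choices, not PotatoGPT's actual implementation.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into character chunks of at most chunk_size.

    Each chunk after the first starts `overlap` characters before the
    previous chunk ended, so neighbouring chunks share some context.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

# Hypothetical usage on a short string
sample = "abcdefghij" * 3  # 30 characters
pieces = chunk_text(sample, chunk_size=12, overlap=4)
print(len(pieces), [len(p) for p in pieces])
```

Each chunk could then be passed to an embedding function one at a time. Character counts are used here only because they need no extra dependency; a token-based splitter would track model limits more closely.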