🥔 PotatoGPT — GitHub Repo Document Loader & Embedding Pipeline

PotatoGPT is a small utility designed to pull code and documents from a GitHub repository and turn them into structured text data ready for embedding. It includes:

automatic repo cloning
flexible file-type filters
chunking logic for long files
optional local vector embedding
detailed setup instructions for Windows users

This guide walks you through the full setup and shows exactly how to run the loader.

1. Installing Required Packages

pip install openai chromadb gitpython sqlite-utils

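These packages cover the pipeline's external dependencies: gitpython clones the repository, openai generates embeddings, and chromadb and sqlite-utils store the results. File scanning itself needs only the Python standard library.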
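
To confirm the install worked, one option is to check that each package can be located before running anything else. The sketch below uses only the standard library; the helper name missing_modules is hypothetical, and note that two import names differ from their pip names.

```python
import importlib.util

# Import names differ from pip names for two of the packages:
# gitpython is imported as "git", sqlite-utils as "sqlite_utils".
REQUIRED_MODULES = ["openai", "chromadb", "git", "sqlite_utils"]

def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", missing if missing else "none")
```

Anything listed as missing can be installed with pip under its package name.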

2. Exporting GitHub Files as Documents

This section covers:

directory structure
how to create the script
how to run it
what output to expect

By the end you’ll have:

A project folder ready for use
A working Python script
Local clones of your GitHub files
A list of loaded documents printed on screen

3. Project Folder

For clarity, this guide uses:

C:\Projects\PotatoGPT

You can choose any location, but keep paths simple while following the tutorial.

4. File Structure

Your project should look like this:

C:\Projects\PotatoGPT
    ├── loader.py
    └── local_repo   (created automatically)

Do not create local_repo yourself — the script handles that the first time you run it.

5. Installing GitPython (Step 2 Requirement)

cd C:\Projects\PotatoGPT
pip install gitpython

Only gitpython is required for this step. More dependencies are added later when embeddings are introduced.

6. Creating `loader.py`

Create a new file named loader.py and insert the following:

from git import Repo
import os
import fnmatch

# ===================================
# 1. GITHUB REPO CONFIGURATION
# ===================================

# Replace with YOUR repository URL
repo_url = "https://github.com/YOUR_USERNAME/YOUR_REPO.git"

# Local folder to clone repo into
repo_path = "./local_repo"

# ===================================
# 2. CLONE IF NOT EXISTS
# ===================================
if not os.path.exists(repo_path):
    print("Cloning GitHub repo...")
    Repo.clone_from(repo_url, repo_path)
else:
    print("Repository already exists, skipping clone.")

# ===================================
# 3. SELECT FILE PATTERNS
# ===================================

file_patterns = [
    "*.md",
    "*.txt",
    "*.cs",
    "*.cpp",
    "*.py",
    "*.json",
    "*.sql"
]

# ===================================
# 4. SCAN FILES AND LOAD CONTENT
# ===================================

docs = []

for root, _, files in os.walk(repo_path):
    for pattern in file_patterns:
        for filename in fnmatch.filter(files, pattern):
            full_path = os.path.join(root, filename)

            try:
                with open(full_path, "r", encoding="utf-8", errors="ignore") as f:
                    docs.append({
                        "path": full_path,
                        "content": f.read()
                    })
            except Exception as e:
                print("Error reading:", full_path, e)

# ===================================
# 5. RESULT
# ===================================

print("="*50)
print("DOCUMENT LOADER COMPLETE")
print("Total documents loaded:", len(docs))
print("Example entry:\n", docs[0] if docs else "No documents found")
print("="*50)
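
If you later want to reuse the scanning step from other scripts, it can be wrapped in a function. This is a minimal sketch, not part of loader.py: the name load_documents is hypothetical, and it checks each file against all patterns at once so a file can never be loaded twice if two patterns overlap. The demo runs on a temporary folder rather than a real clone.

```python
import fnmatch
import os
import tempfile

def load_documents(repo_path, patterns):
    """Walk repo_path and return a list of {"path", "content"} dicts
    for every file whose name matches at least one pattern."""
    docs = []
    for root, _, files in os.walk(repo_path):
        for filename in sorted(files):
            # any() stops at the first matching pattern, so each file is loaded once
            if not any(fnmatch.fnmatch(filename, p) for p in patterns):
                continue
            full_path = os.path.join(root, filename)
            try:
                with open(full_path, "r", encoding="utf-8", errors="ignore") as f:
                    docs.append({"path": full_path, "content": f.read()})
            except OSError as e:
                print("Error reading:", full_path, e)
    return docs

# Self-contained demo on a throwaway folder instead of a real clone
with tempfile.TemporaryDirectory() as demo_repo:
    with open(os.path.join(demo_repo, "README.md"), "w", encoding="utf-8") as f:
        f.write("# Demo")
    with open(os.path.join(demo_repo, "image.png"), "wb") as f:
        f.write(b"\x89PNG")
    demo_docs = load_documents(demo_repo, ["*.md", "*.txt"])

print(len(demo_docs), demo_docs[0]["content"])
```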

7. Important Configuration Change

Before the first run, replace the placeholder repo_url in loader.py with your own repository's clone URL:

repo_url = "https://github.com/YOUR_USERNAME/YOUR_REPO.git"

Start with a public repository. Handling private repos comes later.
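
A small guard can catch the case where the placeholder was never edited. This is an optional sketch; the function name is_placeholder_url is hypothetical and not part of loader.py.

```python
def is_placeholder_url(url):
    """Return True if the URL still contains the tutorial's placeholder text."""
    return "YOUR_USERNAME" in url or "YOUR_REPO" in url

repo_url = "https://github.com/YOUR_USERNAME/YOUR_REPO.git"

if is_placeholder_url(repo_url):
    print("Edit repo_url in loader.py before running the clone step.")
```

Placed above the clone step, a check like this fails fast with a readable message instead of a git error about a repository that does not exist.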

8. Running the Script

cd C:\Projects\PotatoGPT
python loader.py

9. Expected Output

First run:

Cloning GitHub repo...
==================================================
DOCUMENT LOADER COMPLETE
Total documents loaded: 53
Example entry:
 {'path': './local_repo/README.md', 'content': '# Project ...'}
==================================================

Second run:

Repository already exists, skipping clone.
==================================================
DOCUMENT LOADER COMPLETE
Total documents loaded: 53
Example entry:
 {'path': './local_repo/README.md', 'content': '# Project ...'}
==================================================

At this stage:

cloning works
the repo exists locally
file scanning is confirmed
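
To go one step beyond the total count, you can break the loaded documents down by file type. The sketch below assumes the docs list has the shape loader.py builds (dicts with "path" and "content"); the helper name count_by_extension and the sample entries are hypothetical.

```python
import os
from collections import Counter

def count_by_extension(docs):
    """Count loaded documents per lowercase file extension."""
    return Counter(os.path.splitext(doc["path"])[1].lower() for doc in docs)

# Hypothetical sample in the same shape that loader.py builds
sample_docs = [
    {"path": "./local_repo/README.md", "content": "# Project"},
    {"path": "./local_repo/src/App.cs", "content": "public class App {}"},
    {"path": "./local_repo/src/Util.cs", "content": "public class Util {}"},
]

for ext, count in count_by_extension(sample_docs).most_common():
    print(ext, count)
```

In loader.py you would pass the real docs list instead of sample_docs.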

10. Verifying the Output Files

Your folder should look like:

C:\Projects\PotatoGPT
    ├── loader.py
    ├── local_repo
    │     ├── README.md
    │     ├── src
    │     │    ├── ...

11. Common Fixes

Git not found

'git' is not recognized as an internal or external command, operable program or batch file.

Install Git:

https://git-scm.com/download/win

Restart the terminal afterwards.
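
After restarting, you can confirm from Python that git is reachable. This sketch relies on shutil.which from the standard library; the wrapper name git_on_path is hypothetical.

```python
import shutil

def git_on_path():
    """Return the full path of the git executable, or None if it is not on PATH."""
    return shutil.which("git")

git_path = git_on_path()
if git_path:
    print("git found at:", git_path)
else:
    print("git is not on PATH; install Git and reopen the terminal")
```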

Missing dependency

ModuleNotFoundError: No module named 'git'

Install:

pip install gitpython

Encoding errors

Already handled by the open() call in loader.py, which silently drops any bytes that are not valid UTF-8:

encoding="utf-8", errors="ignore"
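
The effect of errors="ignore" is easiest to see on a small byte string. The sketch below decodes a Latin-1 encoded word as UTF-8 and compares "ignore" with "replace", an alternative handler that keeps a visible marker where data was lost.

```python
raw = b"caf\xe9"  # "café" encoded as Latin-1; \xe9 is not valid UTF-8 here

ignored = raw.decode("utf-8", errors="ignore")    # invalid byte is dropped
replaced = raw.decode("utf-8", errors="replace")  # invalid byte becomes U+FFFD

print(repr(ignored))
print(repr(replaced))
```

Dropping bytes keeps the loader running on mixed-encoding repositories, at the cost of losing those characters without any warning.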

12. Optional: Ignore Entire Folders

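Replace the os.walk line in section 4 of loader.py with the version below and keep the rest of the loop body unchanged:

ignore_dirs = ["bin", "obj", ".git"]

for root, dirs, files in os.walk(repo_path):
    # Edit dirs in place so os.walk skips the ignored folders entirely
    dirs[:] = [d for d in dirs if d not in ignore_dirs]
    # ... the existing pattern-matching loop continues here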
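
The slice assignment is what makes the pruning work: os.walk (in its default top-down mode) consults the same list object when deciding which folders to enter next. The sketch below demonstrates this on a temporary folder; the generator name walk_pruned is hypothetical.

```python
import os
import tempfile

ignore_dirs = {"bin", "obj", ".git"}

def walk_pruned(repo_path, ignored):
    """Yield (root, files) pairs, never descending into ignored folder names."""
    for root, dirs, files in os.walk(repo_path):
        # Slice assignment edits the list os.walk is iterating over;
        # rebinding with `dirs = [...]` would have no effect on the walk.
        dirs[:] = [d for d in dirs if d not in ignored]
        yield root, files

with tempfile.TemporaryDirectory() as demo_repo:
    for folder in ("src", "bin"):
        os.makedirs(os.path.join(demo_repo, folder))
        with open(os.path.join(demo_repo, folder, "a.txt"), "w") as f:
            f.write("x")
    visited = sorted(
        os.path.relpath(root, demo_repo) for root, _ in walk_pruned(demo_repo, ignore_dirs)
    )

print(visited)
```

Only the repository root and src are visited; bin is never entered.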

13. Optional: Add More File Types

Append any of these to the file_patterns list in loader.py:

"*.xaml",
"*.h",
"*.hpp",
"*.config",
"*.yaml",
"*.xml",

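As the list grows, matching on the file extension can be easier to maintain than a list of glob patterns. fnmatch is case-sensitive on non-Windows systems, so a file named README.MD would be skipped there. The sketch below is a hypothetical alternative, not what loader.py does: it lowercases the extension and checks it against a set.

```python
import os

# Hypothetical set-based alternative to the glob list; extensions are lowercase
allowed_extensions = {".md", ".txt", ".cs", ".xaml", ".h", ".hpp", ".yaml", ".xml"}

def has_allowed_extension(filename, allowed):
    """Case-insensitive check of the final extension against a set."""
    return os.path.splitext(filename)[1].lower() in allowed

names = ["README.MD", "App.xaml", "logo.png", "Makefile"]
print([n for n in names if has_allowed_extension(n, allowed_extensions)])
```
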
14. Step 3 — Adding Embeddings (OpenAI Free Tier Compatible)

Example embedding function using text-embedding-3-small:

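from openai import OpenAI

# OpenAI() reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

def embed_text(txt: str):
    """Return the embedding vector (a list of floats) for one string."""
    res = client.embeddings.create(
        model="text-embedding-3-small",
        input=txt
    )
    return res.data[0].embedding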

This step is optional and is used later in PotatoGPT to build vector databases or retrieval pipelines.
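
Embedding models accept a limited amount of input per request, so long files are normally split into chunks before embed_text is called on each piece. This guide does not show PotatoGPT's own chunking logic; the sketch below is one simple approach under stated assumptions: it splits by character count rather than tokens, and the name chunk_text and the default sizes are hypothetical.

```python
def chunk_text(text, max_chars=2000, overlap=200):
    """Split text into windows of at most max_chars characters,
    each sharing `overlap` characters with the previous window."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the current window already reached the end
    return chunks

sample = "abcdefghij" * 5  # 50 characters
pieces = chunk_text(sample, max_chars=20, overlap=5)
print(len(pieces), [len(p) for p in pieces])
```

The overlap keeps a little shared context between neighbouring chunks, so a sentence cut at a boundary still appears whole in one of them.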
