potatoscript/PotatoGPT


🥔 PotatoGPT: GitHub Repo Document Loader & Embedding Pipeline

PotatoGPT is a small utility designed to pull code and documents from a GitHub repository and turn them into structured text data ready for embedding. It includes:

  • automatic repo cloning
  • flexible file-type filters
  • chunking logic for long files
  • optional local vector embedding
  • detailed setup instructions for Windows users

This guide walks you through the full setup and shows exactly how to run the loader.


1. Installing Required Packages

pip install openai chromadb gitpython sqlite-utils

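These packages cover cloning (gitpython), embedding (openai), and storing text and vectors (chromadb, sqlite-utils). File scanning itself uses only Python's standard library.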
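
To confirm the install worked before going further, a quick check along these lines can help. It is a minimal sketch: missing_modules is a hypothetical helper, and it assumes the usual import names for these packages (git for gitpython, sqlite_utils for sqlite-utils).

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that Python cannot locate."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    # Import names, not pip package names (assumed mapping, see note above)
    required = ["openai", "chromadb", "git", "sqlite_utils"]
    missing = missing_modules(required)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

If anything is reported missing, rerun the pip command above in the same Python environment you use to run the loader.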


2. Exporting GitHub Files as Documents

This section covers:

  • directory structure
  • how to create the script
  • how to run it
  • what output to expect

By the end you'll have:

  1. A project folder ready for use
  2. A working Python script
  3. Local clones of your GitHub files
  4. A list of loaded documents printed on screen

3. Project Folder

For clarity, this guide uses:

C:\Projects\PotatoGPT

You can choose any location, but keep paths simple while following the tutorial.


4. File Structure

Your project should look like this:

C:\Projects\PotatoGPT
    ├── loader.py
    └── local_repo   (created automatically)

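Do not create local_repo yourself. The script clones into it on the first run, and because it skips cloning whenever the folder already exists, an empty local_repo created by hand would never be populated.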


5. Installing GitPython (Step 2 Requirement)

cd C:\Projects\PotatoGPT
pip install gitpython

Only gitpython is required for this step. If you already ran the install command in section 1, it is installed and you can skip this. The other packages are used later, when embeddings are introduced.


6. Creating loader.py

Create a new file named loader.py and insert the following:

from git import Repo
import os
import fnmatch

# ===================================
# 1. GITHUB REPO CONFIGURATION
# ===================================

# Replace with YOUR repository URL
repo_url = "https://github.com/YOUR_USERNAME/YOUR_REPO.git"

# Local folder to clone repo into
repo_path = "./local_repo"

# ===================================
# 2. CLONE IF NOT EXISTS
# ===================================
if not os.path.exists(repo_path):
    print("Cloning GitHub repo...")
    Repo.clone_from(repo_url, repo_path)
else:
    print("Repository already exists, skipping clone.")

# ===================================
# 3. SELECT FILE PATTERNS
# ===================================

file_patterns = [
    "*.md",
    "*.txt",
    "*.cs",
    "*.cpp",
    "*.py",
    "*.json",
    "*.sql"
]

# ===================================
# 4. SCAN FILES AND LOAD CONTENT
# ===================================

docs = []

for root, _, files in os.walk(repo_path):
    for pattern in file_patterns:
        for filename in fnmatch.filter(files, pattern):
            full_path = os.path.join(root, filename)

            try:
                with open(full_path, "r", encoding="utf-8", errors="ignore") as f:
                    docs.append({
                        "path": full_path,
                        "content": f.read()
                    })
            except Exception as e:
                print("Error reading:", full_path, e)

# ===================================
# 5. RESULT
# ===================================

print("="*50)
print("DOCUMENT LOADER COMPLETE")
print("Total documents loaded:", len(docs))
print("Example entry:\n", docs[0] if docs else "No documents found")
print("="*50)

7. Important Configuration Change

Update the repo URL:

repo_url = "https://github.com/YOUR_USERNAME/YOUR_REPO.git"

Start with a public repository. Handling private repos comes later.


8. Running the Script

cd C:\Projects\PotatoGPT
python loader.py

9. Expected Output

First run:

Cloning GitHub repo...
==================================================
DOCUMENT LOADER COMPLETE
Total documents loaded: 53
Example entry:
 {'path': './local_repo/README.md', 'content': '# Project ...'}
==================================================

Second run:

Repository already exists, skipping clone.
==================================================
DOCUMENT LOADER COMPLETE
Total documents loaded: 53
Example entry:
 {'path': './local_repo/README.md', 'content': '# Project ...'}
==================================================

At this stage:

  • cloning works
  • the repo exists locally
  • file scanning is confirmed
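
One way to confirm that scanning picked up the file types you expect is to count the loaded documents by extension. The sketch below assumes the docs list built by loader.py (dicts with "path" and "content" keys). count_by_extension is a hypothetical helper, not part of the original script, and the sample data is made up for illustration.

```python
import os
from collections import Counter


def count_by_extension(docs):
    """Count loaded documents per lowercase file extension."""
    return Counter(os.path.splitext(d["path"])[1].lower() for d in docs)


if __name__ == "__main__":
    # Stand-in for the `docs` list produced by loader.py (illustrative data)
    docs = [
        {"path": "./local_repo/README.md", "content": "# Project"},
        {"path": "./local_repo/src/App.cs", "content": "public class App {}"},
        {"path": "./local_repo/docs/NOTES.MD", "content": "notes"},
    ]
    for ext, count in count_by_extension(docs).most_common():
        print(ext, count)
```

In loader.py itself, the same function could be called on docs right after the scan loop.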

10. Verifying the Output Files

Your folder should look like:

C:\Projects\PotatoGPT
    ├── loader.py
    ├── local_repo
    │     ├── README.md
    │     ├── src
    │     │    ├── ...

11. Common Fixes

Git not found

fatal: 'git' is not recognized as an internal or external command

Install Git:

https://git-scm.com/download/win

Restart the terminal afterwards.
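
To check whether Git is visible from Python after restarting the terminal, something like the following should work. It is a small sketch using only the standard library; tool_on_path is a hypothetical helper name.

```python
import shutil


def tool_on_path(name):
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    if tool_on_path("git"):
        print("git found at:", shutil.which("git"))
    else:
        print("git not found on PATH; install it and reopen the terminal")
```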


Missing dependency

ModuleNotFoundError: No module named 'git'

Install:

pip install gitpython

Encoding errors

The loader already avoids these by opening every file with:

encoding="utf-8", errors="ignore"
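
Note that errors="ignore" does not repair badly encoded files: bytes that are not valid UTF-8 are dropped silently. The short illustration below shows the effect on a Latin-1 encoded string and compares it with errors="replace", which marks the loss instead. The decode_lossy helper is hypothetical and exists only for this demonstration.

```python
def decode_lossy(raw, errors):
    """Decode bytes as UTF-8 using the given error handler."""
    return raw.decode("utf-8", errors=errors)


raw = "café".encode("latin-1")  # the final byte 0xE9 is not valid UTF-8 here

ignored = decode_lossy(raw, "ignore")    # the bad byte is dropped
replaced = decode_lossy(raw, "replace")  # the bad byte becomes U+FFFD

print(ascii(ignored))   # 'caf'
print(ascii(replaced))  # 'caf\ufffd'
```

If silent loss matters for your documents, switching the loader to errors="replace" makes damaged spots visible in the loaded text.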

12. Optional: Ignore Entire Folders

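Define ignore_dirs before the scan loop, then change the os.walk line in loader.py so it names dirs (instead of _) and prunes it in place:

ignore_dirs = ["bin", "obj", ".git"]

for root, dirs, files in os.walk(repo_path):
    # Prune in place so os.walk never descends into ignored folders
    dirs[:] = [d for d in dirs if d not in ignore_dirs]
    # ... the existing pattern loop from loader.py continues here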
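
The in-place assignment matters: os.walk consults the same dirs list to decide where to descend next, so rebinding the name (dirs = [...]) would have no effect. The self-contained sketch below demonstrates the pruning on a temporary folder. walk_files is a hypothetical helper and the folder layout is made up for the demo.

```python
import os
import tempfile


def walk_files(top, ignore_dirs):
    """Yield file paths under `top`, never descending into `ignore_dirs`."""
    ignored = set(ignore_dirs)
    for root, dirs, files in os.walk(top):
        # Slice assignment mutates the list that os.walk itself reads
        dirs[:] = [d for d in dirs if d not in ignored]
        for name in files:
            yield os.path.join(root, name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as top:
        # Build a tiny fake repo: src is kept, bin and .git are ignored
        for sub in ("src", "bin", ".git"):
            os.makedirs(os.path.join(top, sub))
            with open(os.path.join(top, sub, "file.txt"), "w", encoding="utf-8") as f:
                f.write("x")
        for path in walk_files(top, ["bin", "obj", ".git"]):
            print(os.path.relpath(path, top))  # only the file under src
```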

13. Optional: Add More File Types

Append more patterns to the file_patterns list:

"*.xaml",
"*.h",
"*.hpp",
"*.config",
"*.yaml",
"*.xml",

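Patterns are shell-style globs matched against the file name only, so each extension needs its own entry: "*.yaml" does not cover ".yml", and "*.xml" does not cover ".xaml". The sketch below shows this with fnmatch. matches_any is a hypothetical helper and the file names are invented. Checking each file once against all patterns also guarantees a file is never loaded twice if two patterns overlap.

```python
import fnmatch


def matches_any(filename, patterns):
    """Return True if `filename` matches at least one glob pattern."""
    return any(fnmatch.fnmatch(filename, p) for p in patterns)


file_patterns = ["*.md", "*.yaml", "*.xml"]
names = ["README.md", "config.yaml", "ci.yml", "App.xaml", "layout.xml"]

selected = [n for n in names if matches_any(n, file_patterns)]
print(selected)  # ['README.md', 'config.yaml', 'layout.xml']
```
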
14. Step 3: Adding Embeddings (OpenAI Free Tier Compatible)

Example embedding function using text-embedding-3-small:

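from openai import OpenAI

# OpenAI() reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

def embed_text(txt: str) -> list[float]:
    """Return the embedding vector for one piece of text."""
    res = client.embeddings.create(
        model="text-embedding-3-small",
        input=txt
    )
    return res.data[0].embedding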

This step is optional and is used later in PotatoGPT to build vector databases or retrieval pipelines.
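
The feature list at the top mentions chunking for long files, but this guide does not show it. Because embedding models accept only a limited input length, long documents usually need to be split before calling embed_text. The following is a minimal sketch under stated assumptions, not PotatoGPT's actual chunking logic: chunk_text, max_chars, and overlap are hypothetical names, and it splits by character count rather than by tokens.

```python
def chunk_text(text, max_chars=1000, overlap=100):
    """Split text into chunks of at most max_chars, with overlapping edges."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")

    chunks = []
    step = max_chars - overlap  # how far the window advances each time
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    print(chunk_text("abcdefghij", max_chars=4, overlap=1))
    # ['abcd', 'defg', 'ghij']
```

Each chunk could then be passed to embed_text separately, keeping the source path alongside it so results can be traced back to a file.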
