
Running your own Git server (Gitea) in Docker alongside ArangoDB and the ingestion pipeline gives you a fully isolated, reproducible environment for your AI coder project.

Here's a step-by-step guide to implement and deploy the *whole system*, including a Git server, using Docker Compose:

## System Architecture Overview

```
+-------------------+       +-------------------+       +-------------------+
|                   |       |                   |       |                   |
|   Git Server      | <---> |  Ingestion        | ----> |   ArangoDB        |
|   (e.g., Gitea)   |       |  Pipeline         |       |   Knowledge Graph |
|   (Container 1)   |       |  (Container 2)    |       |   (Container 3)   |
|                   |       |                   |       |                   |
+-------------------+       +-------------------+       +-------------------+
      ^                                                         ^
      |                                                         |
      | (Host Volume for persistent Git data)                   | (Host Volume for persistent DB data)
      |                                                         |
      +---------------------------------------------------------+
                          Your Host Machine
```

  * **Git Server (Gitea):** Manages your Git repositories. Your developers (or you) will push code here.
  * **Ingestion Pipeline:** A Python application that clones/pulls from Gitea, parses the code, extracts relationships, generates embeddings, and pushes this structured data into ArangoDB.
  * **ArangoDB:** Stores your multi-model knowledge graph (documents, graph, vectors, full-text search).
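
To make the data flow concrete, here is a minimal sketch of how one commit could map onto graph records: two vertex documents and one edge linking them. The `to_graph_records` helper and its return layout are illustrative assumptions, not part of the pipeline code; the collection names (`Commits`, `Developers`, `AuthoredBy`) follow the ones used later in this guide.

```python
import hashlib


def to_graph_records(commit_hash, author_name, author_email):
    """Map one commit to two vertex documents and one edge (hypothetical helper)."""
    # Hashing the email gives a stable key made only of hex characters.
    developer_key = hashlib.sha256(author_email.encode()).hexdigest()
    developer = {"_key": developer_key, "name": author_name, "email": author_email}
    commit = {"_key": commit_hash, "hash": commit_hash}
    # ArangoDB edges reference vertices by "<collection>/<key>" document IDs.
    edge = {
        "_from": f"Commits/{commit_hash}",
        "_to": f"Developers/{developer_key}",
    }
    return {"Commits": commit, "Developers": developer, "AuthoredBy": edge}


records = to_graph_records("a1b2c3", "Ada", "ada@example.com")
print(records["AuthoredBy"]["_from"])  # Commits/a1b2c3
```

Deriving keys from stable inputs (commit hash, email hash) means re-running the ingestion updates existing documents instead of creating new ones.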

## File Structure

```
/your-ai-coder-system/
├── docker-compose.yml
├── .env
├── gitea_data/             # Persistent volume for Gitea data
│   ├── gitea/
│   └── git/
├── arangodb_data/          # Persistent volume for ArangoDB data
└── ingestion_pipeline/
    ├── Dockerfile
    ├── requirements.txt
    └── app/
        ├── main.py
        ├── git_parser.py
        └── arangodb_manager.py
```

## Step 1: Define Services in `docker-compose.yml`

This file will orchestrate all three containers and their networking.

```yaml
# your-ai-coder-system/docker-compose.yml
version: '3.8'

networks:
  ai_coder_net:
    driver: bridge

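volumes:
  # Named volumes managed by Docker. To keep the data in the ./gitea_data and
  # ./arangodb_data directories shown in the file structure instead, bind-mount
  # those paths in the services below (e.g. ./gitea_data:/data).
  gitea_data:
  arangodb_data: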

services:
  # 1. Gitea - Your Git Server
  gitea:
    image: gitea/gitea:1.21.11 # Using a stable version
    container_name: gitea-server
    environment:
      - GITEA__DATABASE__TYPE=sqlite3 # Simple for local setup
      - GITEA__DATABASE__PATH=/data/gitea/gitea.db
      - GITEA__APP_NAME=Your AI Coder Git
      - GITEA__SERVER__DOMAIN=localhost
      - GITEA__SERVER__HTTP_PORT=3000
      - GITEA__SERVER__ROOT_URL=http://localhost:3000/
      - GITEA__SERVER__SSH_PORT=2222 # If you plan to use SSH for Git
      - GITEA__SSH_GENERATED_KEY_TYPE=ed25519
      - GITEA__SECURITY__INSTALL_LOCK=true # Prevents re-installation via web UI
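      # Note: Gitea does not create an admin account from environment variables.
      # After the first start, create one with `gitea admin user create` inside the container.
      - GITEA__ADMIN_USER=${GITEA_ADMIN_USER} # From .env (kept for reference)
      - GITEA__ADMIN_PASSWORD=${GITEA_ADMIN_PASSWORD} # From .env (kept for reference)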
    volumes:
      - gitea_data:/data # Persistent volume for Gitea's internal data and Git repos
    ports:
      - "3000:3000" # Web UI
      - "2222:22" # SSH for Git (optional); sshd inside the container listens on 22 by default
    networks:
      - ai_coder_net
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/"]
      interval: 10s
      timeout: 5s
      retries: 5

  # 2. ArangoDB - Your Knowledge Graph Database
  arangodb:
    image: arangodb/arangodb:latest
    container_name: arangodb-instance
    environment:
      ARANGO_RANDOM_ROOT_PASSWORD: "1" # Generates a random root password
      # ARANGO_ROOT_PASSWORD: "${ARANGO_ROOT_PASSWORD}" # Uncomment if you want a fixed password from .env
    volumes:
      - arangodb_data:/var/lib/arangodb3 # Persistent data volume
    ports:
      - "8529:8529" # ArangoDB Web UI and API
    networks:
      - ai_coder_net
    restart: always
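    healthcheck:
      # /_admin/server/availability is documented as a public endpoint, so it should answer without
      # credentials; authenticated endpoints return 401 and make `curl -f` fail. Assumes curl exists in the image.
      test: ["CMD", "curl", "-f", "http://localhost:8529/_admin/server/availability"]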
      interval: 10s
      timeout: 5s
      retries: 5

  # 3. Ingestion Pipeline - Processes Git data into ArangoDB
  ingestion_pipeline:
    build: ./ingestion_pipeline # Build from Dockerfile in this directory
    container_name: git-arangodb-ingest
    depends_on:
      gitea:
        condition: service_healthy # Ensure Gitea is ready
      arangodb:
        condition: service_healthy # Ensure ArangoDB is ready
    environment:
      # ArangoDB Connection
      ARANGO_HOST: arangodb-instance # Container name; the service name `arangodb` also resolves on this network
      ARANGO_PORT: 8529
      ARANGO_USER: root
      # main.py reads ARANGO_PASSWORD and exits if it is unset.
      # With ARANGO_RANDOM_ROOT_PASSWORD, copy the generated password from the ArangoDB logs
      # (`docker logs arangodb-instance`) and set it here. With a fixed password, set
      # ARANGO_ROOT_PASSWORD in .env and uncomment the line below.
      # In a real system, use Docker Secrets or a dedicated secret management solution.
      # ARANGO_PASSWORD: "${ARANGO_ROOT_PASSWORD}" # If using fixed password

      # Gitea Connection (for cloning repos)
      GITEA_HOST: gitea-server # Container name; the service name `gitea` also resolves on this network
      GITEA_HTTP_PORT: 3000
      GITEA_ADMIN_USER: ${GITEA_ADMIN_USER}
      GITEA_ADMIN_PASSWORD: ${GITEA_ADMIN_PASSWORD}
      # This example assumes you'll clone via HTTP with admin credentials.
      # For production, consider using a dedicated Git user token or SSH keys.

      # Git Repo to Ingest (example, you'll update this)
      GIT_REPO_ORG: "ai-coder-org" # Example organization on Gitea
      GIT_REPO_NAME: "sample-codebase" # Example repository name on Gitea
      # The full URL will be constructed in the Python script
    volumes:
      - ./repos:/app/repos # Mount a volume for cloning Git repos
    networks:
      - ai_coder_net
    # command: python /app/main.py # Uncomment if you want to explicitly define the entrypoint
    # entrypoint: ["python", "/app/main.py"] # Alternative to CMD in Dockerfile
```
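
The `healthcheck` and `depends_on` settings above amount to "poll an HTTP endpoint a few times before starting the dependent service". A minimal Python sketch of that idea is shown here; the `http_ok` and `wait_until` names are hypothetical helpers, and the defaults simply mirror the `interval: 10s` and `retries: 5` values from the Compose file.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5):
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until(check, retries=5, interval=10, sleep=time.sleep):
    """Call check() up to `retries` times, sleeping `interval` seconds between attempts."""
    for attempt in range(retries):
        if check():
            return True
        if attempt < retries - 1:
            sleep(interval)
    return False


# Hypothetical usage from the host, once the stack is starting:
# ready = wait_until(lambda: http_ok("http://localhost:3000/"))
```

Something like this could also run at the top of `main.py` as a second line of defense if the pipeline is ever started outside Compose.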

## Step 2: Create `.env` File

This file stores the Gitea admin credentials (and, optionally, a fixed ArangoDB password) that Docker Compose substitutes into `docker-compose.yml`.

```
# your-ai-coder-system/.env
GITEA_ADMIN_USER=aicoder_admin
GITEA_ADMIN_PASSWORD=SuperSecureGiteaPassw0rd!

# If you want a fixed ArangoDB password instead of random
# ARANGO_ROOT_PASSWORD=MySecureArangoDBPassw0rd!
```
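
Because a missing `.env` entry only shows up at runtime as an empty variable, it can help to check the required names up front. The sketch below is one way to do that; `missing_env` and `REQUIRED_VARS` are hypothetical names, and the list covers only the two variables defined in the `.env` file above.

```python
import os

REQUIRED_VARS = ["GITEA_ADMIN_USER", "GITEA_ADMIN_PASSWORD"]


def missing_env(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


example_env = {"GITEA_ADMIN_USER": "aicoder_admin", "GITEA_ADMIN_PASSWORD": ""}
print(missing_env(REQUIRED_VARS, example_env))  # ['GITEA_ADMIN_PASSWORD']
```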

## Step 3: Prepare Ingestion Pipeline Code

The Python code for your ingestion pipeline goes into the `ingestion_pipeline/` directory.

### `ingestion_pipeline/Dockerfile`

```dockerfile
# ingestion_pipeline/Dockerfile
FROM python:3.10-slim-bookworm

# Install Git and any other system dependencies needed for parsing/embeddings
RUN apt-get update && \
    apt-get install -y git libmagic1 && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy requirements file first to leverage Docker cache.
# Paths are relative to the build context (./ingestion_pipeline in docker-compose.yml).
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code
COPY app/ .

# Define a default command to run when the container starts
CMD ["python", "main.py"]
```

### `ingestion_pipeline/requirements.txt`

```
# ingestion_pipeline/requirements.txt
python-arango
GitPython
tree_sitter<0.22 # tree_sitter_languages targets the pre-0.22 API (e.g. Parser.set_language)
tree_sitter_languages
sentence-transformers # For embeddings
requests
```

### `ingestion_pipeline/app/arangodb_manager.py`

In [None]:
# ingestion_pipeline/app/arangodb_manager.py
from arango import ArangoClient
import os

class ArangoDBManager:
    def __init__(self, arango_password=None): # Added arango_password parameter
        self.client = ArangoClient(hosts=f"http://{os.getenv('ARANGO_HOST')}:{os.getenv('ARANGO_PORT')}")
        self.db = None
        # Use the password passed during initialization or from environment variable
        self.arango_password = arango_password if arango_password else os.getenv('ARANGO_PASSWORD')
        self.connect()

    def connect(self):
        try:
            self.db = self.client.db(
                "_system", # Or your specific database name
                username=os.getenv('ARANGO_USER'),
                password=self.arango_password, # Use the stored password
                verify=True # Send a test request so wrong credentials fail here, not on first use
            )
            print("Connected to ArangoDB successfully!")
        except Exception as e:
            print(f"Error connecting to ArangoDB: {e}")
            raise

    # Schema setup and write/query helpers.

    def create_collections_and_graphs(self):
        # Vertex collections
        for col_name in ["Commits", "Developers", "Files", "Classes", "Methods", "Features", "Bugs"]:
            if not self.db.has_collection(col_name):
                self.db.create_collection(col_name)
                print(f"Created collection: {col_name}")

        # Edge collections (for relationships)
        for edge_name in ["AuthoredBy", "Modifies", "Calls", "Imports", "DependsOn", "Fixes", "Implements"]:
            if not self.db.has_collection(edge_name):
                self.db.create_collection(edge_name, edge=True)
                print(f"Created edge collection: {edge_name}")

        # Create a named graph (optional, but good for logical grouping)
        if not self.db.has_graph("SoftwareKnowledgeGraph"):
            graph = self.db.create_graph("SoftwareKnowledgeGraph")
            graph.create_edge_definition(
                edge_collection="AuthoredBy",
                from_vertex_collections=["Commits"],
                to_vertex_collections=["Developers"]
            )
            graph.create_edge_definition(
                edge_collection="Modifies",
                from_vertex_collections=["Commits"],
                to_vertex_collections=["Files", "Classes", "Methods"]
            )
            graph.create_edge_definition(
                edge_collection="Calls",
                from_vertex_collections=["Methods"],
                to_vertex_collections=["Methods", "Classes"] # Methods call methods/classes
            )
            # ... add other edge definitions
            print("Created graph: SoftwareKnowledgeGraph")

    def insert_document(self, collection_name, data, key=None):
        try:
            collection = self.db.collection(collection_name)
            if key:
                data['_key'] = key
            doc = collection.insert(data, overwrite=True) # overwrite on re-ingestion
            return doc['_id']
        except Exception as e:
            print(f"Error inserting document into {collection_name}: {e}")
            return None

    def insert_edge(self, edge_collection_name, from_vertex_id, to_vertex_id, properties=None):
        try:
            edge_collection = self.db.collection(edge_collection_name)
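            # Derive a deterministic _key so overwrite=True replaces the edge on re-ingestion
            # instead of adding a duplicate (document IDs contain '/', which is not allowed in keys).
            edge_key = f"{from_vertex_id}--{to_vertex_id}".replace("/", ":")
            edge_data = {"_key": edge_key, "_from": from_vertex_id, "_to": to_vertex_id}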
            if properties:
                edge_data.update(properties)
            edge = edge_collection.insert(edge_data, overwrite=True)
            return edge['_id']
        except Exception as e:
            print(f"Error inserting edge into {edge_collection_name}: {e}")
            return None

    def execute_aql(self, query, bind_vars=None):
        try:
            cursor = self.db.aql.execute(query, bind_vars=bind_vars)
            return list(cursor)
        except Exception as e:
            print(f"Error executing AQL query: {e}")
            raise
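
Once data is ingested, `execute_aql` is the entry point for querying the graph. As a sketch of what such a query might look like, the hypothetical helper below builds an AQL traversal that lists the commits touching a given file. It assumes `Modifies` edges point from `Commits` to `Files` and that commits carry `hash`, `message`, and `timestamp` attributes, as in the ingestion code in this guide.

```python
def commits_for_file_query(path, limit=10):
    """Build an AQL query and its bind variables (hypothetical helper)."""
    query = """
    FOR f IN Files
        FILTER f.path == @path
        FOR c IN 1..1 INBOUND f Modifies
            SORT c.timestamp DESC
            LIMIT @limit
            RETURN {hash: c.hash, message: c.message, timestamp: c.timestamp}
    """
    # Bind variables keep user-supplied values out of the query string itself.
    return query, {"path": path, "limit": limit}


query, bind_vars = commits_for_file_query("src/app.py", limit=5)
print(bind_vars)  # {'path': 'src/app.py', 'limit': 5}
# With a live database: manager.execute_aql(query, bind_vars=bind_vars)
```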

### `ingestion_pipeline/app/git_parser.py`

In [None]:
# ingestion_pipeline/app/git_parser.py
import git
import os
from tree_sitter import Language, Parser
from tree_sitter_languages import get_language, get_parser
from sentence_transformers import SentenceTransformer
import hashlib # Import hashlib for better key generation
import urllib.parse # Import urllib.parse for URL encoding

# Load embedding model (consider a specific code-embedding model if needed)
try:
    embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
except Exception:
    print("Could not load SentenceTransformer model. Embeddings will be skipped.")
    embedding_model = None


class GitRepoParser:
    def __init__(self, repo_url, local_path, git_username=None, git_password=None):
        self.repo_url = repo_url
        self.local_path = local_path
        self.git_username = git_username
        self.git_password = git_password
        self.repo = self._clone_or_pull()
        self.parsers = {} # For Tree-sitter parsers

    def _clone_or_pull(self):
        # Construct authenticated URL with URL encoding for credentials
        auth_repo_url = self.repo_url
        if self.git_username and self.git_password:
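            # Assumes HTTP/HTTPS cloning
            protocol, rest = self.repo_url.split('://', 1)
            # quote(..., safe='') percent-encodes every reserved character ('@', ':', '/') and
            # encodes spaces as %20; quote_plus would emit '+', which is not decoded in URL userinfo.
            encoded_username = urllib.parse.quote(self.git_username, safe='')
            encoded_password = urllib.parse.quote(self.git_password, safe='')
            auth_repo_url = f"{protocol}://{encoded_username}:{encoded_password}@{rest}"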

        if os.path.exists(self.local_path):
            print(f"Pulling latest from {self.repo_url} into {self.local_path}...")
            repo = git.Repo(self.local_path)
            origin = repo.remotes.origin
            try:
                # Point origin at the authenticated URL, then pull. Passing url= to pull()
                # would be forwarded to `git pull` as an unsupported --url option.
                origin.set_url(auth_repo_url)
                origin.pull()
            except git.exc.GitCommandError as e:
                print(f"Error pulling from origin: {e}")
                print("Attempting to re-clone...")
                # If pull fails, try re-cloning (might be due to auth or repo state)
                if os.path.exists(self.local_path):
                    try:
                        import shutil
                        shutil.rmtree(self.local_path)
                    except OSError as e:
                        print(f"Error removing directory {self.local_path}: {e}")
                        raise # Re-raise if directory removal fails
                return git.Repo.clone_from(auth_repo_url, self.local_path)
            return repo
        else:
            print(f"Cloning {self.repo_url} into {self.local_path}...")
            # Use authenticated URL for cloning
            return git.Repo.clone_from(auth_repo_url, self.local_path)

    def _get_tree_sitter_parser(self, language_name):
        if language_name not in self.parsers:
            try:
                lang = get_language(language_name)
                parser = Parser()
                parser.set_language(lang)
                self.parsers[language_name] = parser
            except Exception as e:
                print(f"Warning: Could not load Tree-sitter parser for {language_name}: {e}")
                self.parsers[language_name] = None
        return self.parsers[language_name]

    def _parse_code_structure(self, file_content, file_extension):
        # Basic mapping from extension to Tree-sitter language
        lang_map = {
            '.py': 'python',
            '.js': 'javascript',
            '.java': 'java',
            '.go': 'go',
            '.ts': 'typescript',
            '.c': 'c',
            '.cpp': 'cpp',
            '.h': 'c',
            '.hpp': 'cpp',
            '.rb': 'ruby',
            '.php': 'php',
            '.sh': 'bash',
            '.rs': 'rust',
            '.swift': 'swift',
            '.kt': 'kotlin',
            '.scala': 'scala',
            '.r': 'r',
            '.m': 'matlab', # Or objective-c
            '.json': 'json',
            '.yaml': 'yaml',
            '.yml': 'yaml',
            '.xml': 'xml',
            '.html': 'html',
            '.css': 'css',
            '.sql': 'sql',
            '.pl': 'perl',
            '.lua': 'lua',
            '.groovy': 'groovy',
            # Removed potentially unavailable parsers: 'assembly', 'batch', 'powershell'
        }
        language_name = lang_map.get(file_extension.lower())
        if not language_name:
            print(f"No Tree-sitter mapping for extension: {file_extension}")
            return [] # No parser for this language

        parser = self._get_tree_sitter_parser(language_name)
        if not parser:
            return []

        try:
            tree = parser.parse(bytes(file_content, "utf8"))
        except Exception as e:
            print(f"Error parsing file with tree-sitter ({language_name}): {e}")
            return []

        methods_classes = []

        def traverse(node):
            try:
                if node.type in ['function_definition', 'method_definition', 'class_definition', 'struct_definition', 'enum_definition']:
                     # Attempt to find a common pattern for name (often an 'identifier' or similar)
                     name_node = None
                     for child in node.children:
                          if child.type in ['identifier', 'type_identifier', 'name', 'declaration_identifier']:
                              name_node = child
                              break
                          # Handle Python specific decorated_definition
                          if language_name == 'python' and child.type == 'decorated_definition':
                              for sub_child in child.children:
                                  if sub_child.type in ['function_definition', 'class_definition']:
                                       for identifier_child in sub_child.children:
                                            if identifier_child.type == 'identifier':
                                                 name_node = identifier_child
                                                 break
                                       if name_node:
                                            break
                              if name_node:
                                 break

                     name = name_node.text.decode('utf8') if name_node and name_node.text else 'unknown'

                     # Basic check to avoid parsing issues with incomplete nodes
                     if node.start_byte >= node.end_byte:
                          return

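                     methods_classes.append({
                         "type": node.type,
                         "name": name,
                         "start_line": node.start_point[0] + 1,
                         "end_line": node.end_point[0] + 1,
                         # Tree-sitter offsets are byte offsets, so slice the UTF-8 bytes, not the str.
                         "code_snippet": file_content.encode("utf8")[node.start_byte:node.end_byte].decode("utf8", errors="ignore"),
                     })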

                for child in node.children:
                    traverse(child)
            except Exception as e:
                # Catch exceptions during traversal to avoid stopping the whole process
                print(f"Warning: Error during Tree-sitter traversal in {language_name} for node type {node.type}: {e}")


        traverse(tree.root_node)
        return methods_classes


    def _generate_embedding(self, text):
        if embedding_model:
            try:
                # Add a check for text length or size if memory is a concern
                if len(text) > 10000: # Example threshold
                    print(f"Warning: Skipping embedding for large text snippet ({len(text)} characters).")
                    return None
                return embedding_model.encode(text).tolist()
            except Exception as e:
                print(f"Error generating embedding: {e}")
                return None
        return None # Return None if no model is loaded

    def get_repo_data(self):
        commits_data = []
        try:
            # Fetch more commits if needed, or implement incremental logic
            for commit in self.repo.iter_commits('main', max_count=100): # Increased limit for testing
                commit_info = {
                    "hash": commit.hexsha,
                    "author_name": commit.author.name,
                    "author_email": commit.author.email,
                    "committer_name": commit.committer.name,
                    "committer_email": commit.committer.email,
                    "timestamp": commit.authored_datetime.isoformat(),
                    "message": commit.message,
                    "lines_added": 0,
                    "lines_deleted": 0,
                    "files_changed": []
                }

                files_modified = []
                methods_classes_modified = []

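                if commit.parents:
                    # Diff parent -> commit so added lines appear as additions, not deletions.
                    diff_index = commit.parents[0].diff(commit, create_patch=True)
                else:
                    # Root commit: compare against the empty tree rather than the working tree.
                    diff_index = commit.diff(git.NULL_TREE, create_patch=True)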

                for diff in diff_index:
                    file_path = diff.b_path if diff.b_path else diff.a_path
                    if file_path: # Ensure file_path is not None
                         commit_info["files_changed"].append(file_path)

                    try:
                        # Count added/deleted lines from the patch body. Pure additions contain no
                        # '-' lines and pure deletions no '+' lines, so no change-type check is needed.
                        patch_text = diff.diff.decode(errors='ignore')
                        commit_info["lines_added"] += patch_text.count('\n+')
                        commit_info["lines_deleted"] += patch_text.count('\n-')
                    except Exception as e:
                        print(f"Warning: Could not decode diff for {file_path} in commit {commit.hexsha}: {e}")

                    # Parse the file only if it still exists after the change. The deleted_file flag
                    # is used because change_type may be unset when create_patch=True.
                    if diff.b_path and not diff.deleted_file:
                        try:
                            blob = commit.tree / diff.b_path
                            file_content = blob.data_stream.read().decode('utf-8', errors='ignore')
                            _, file_extension = os.path.splitext(file_path)

                            parsed_elements = self._parse_code_structure(file_content, file_extension)
                            for el in parsed_elements:
                                el['file_path'] = file_path
                                el['embedding'] = self._generate_embedding(el['code_snippet'])
                                methods_classes_modified.append(el)

                            files_modified.append({
                                "path": file_path,
                                "content": file_content,
                                "content_hash": hashlib.sha256(file_content.encode('utf-8', errors='ignore')).hexdigest(), # Add content hash
                                "file_embedding": self._generate_embedding(file_content),
                                "parsed_elements": parsed_elements
                            })
                        except Exception as e:
                            print(f"Warning: Could not read or parse file {file_path} in commit {commit.hexsha}: {e}")
                    elif diff.deleted_file and diff.a_path:
                         # Handle deleted files - you might want to mark them as deleted in ArangoDB
                         print(f"File deleted: {diff.a_path} in commit {commit.hexsha}")
                         # TODO: Implement logic to mark file as deleted in ArangoDB

                commit_info["files_detailed"] = files_modified
                commit_info["parsed_elements_modified"] = methods_classes_modified
                commits_data.append(commit_info)
        except Exception as e:
             print(f"Error fetching repository data: {e}")
             # Continue with any commits fetched so far, or raise the exception
             # raise # Uncomment to stop on error

        return commits_data
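
The credential handling in `_clone_or_pull` can be isolated into a small, testable function. The sketch below is one possible version, not the parser's own code: `build_auth_url` is a hypothetical helper that uses `urllib.parse.quote` with `safe=''`, so characters such as `@`, `:`, `/`, and spaces in a password are percent-encoded and cannot be mistaken for URL delimiters.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def build_auth_url(repo_url, username, password):
    """Embed percent-encoded credentials into an HTTP(S) Git URL (hypothetical helper)."""
    parts = urlsplit(repo_url)
    # Drop any credentials already present in the URL before adding new ones.
    host = parts.netloc.rpartition("@")[2]
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    return urlunsplit(parts._replace(netloc=f"{userinfo}@{host}"))


url = build_auth_url(
    "http://gitea-server:3000/ai-coder-org/sample-codebase.git",
    "aicoder_admin",
    "p@ss word",
)
print(url)
# http://aicoder_admin:p%40ss%20word@gitea-server:3000/ai-coder-org/sample-codebase.git
```

Keep in mind that a URL built this way ends up in the clone's `.git/config`, which is one reason the Compose file suggests a dedicated token or SSH keys for production.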

### `ingestion_pipeline/app/main.py`

In [None]:
# ingestion_pipeline/app/main.py
import os
import time
from arangodb_manager import ArangoDBManager
from git_parser import GitRepoParser
import hashlib # Import hashlib for file key generation
import re # Import re for key validation

def run_ingestion():
    # --- 1. Get ArangoDB Password ---
    # Read the password that docker-compose.yml passes in as ARANGO_PASSWORD.
    arangodb_password = os.getenv('ARANGO_PASSWORD')

    if not arangodb_password:
        print("\n--- ATTENTION: ArangoDB password not set. ---")
        print("Set ARANGO_ROOT_PASSWORD in your .env file and pass it to the ingestion_pipeline service as ARANGO_PASSWORD in docker-compose.yml.")
        print("Ingestion cannot proceed without the ArangoDB password.")
        return # Exit: no password available

    try:
        arangodb_manager = ArangoDBManager(arango_password=arangodb_password) # Keyword must match ArangoDBManager.__init__
        arangodb_manager.create_collections_and_graphs()
    except Exception as e:
        print(f"Failed to connect to ArangoDB: {e}")
        print("Ensure ArangoDB is running and the password is correct.")
        return # Exit if DB connection fails


    # --- 2. Prepare Git Repo URL with Gitea credentials ---
    gitea_host = os.getenv('GITEA_HOST')
    gitea_http_port = os.getenv('GITEA_HTTP_PORT')
    gitea_admin_user = os.getenv('GITEA_ADMIN_USER')
    gitea_admin_password = os.getenv('GITEA_ADMIN_PASSWORD')
    git_repo_org = os.getenv('GIT_REPO_ORG')
    git_repo_name = os.getenv('GIT_REPO_NAME')

    # Construct the authenticated Git URL
    # This assumes HTTP/HTTPS cloning. For SSH, it's more complex (SSH agent forwarding/keys).
    # Ensure all necessary environment variables are set
    if not all([gitea_host, gitea_http_port, gitea_admin_user, gitea_admin_password, git_repo_org, git_repo_name]):
         print("Missing one or more Gitea environment variables. Cannot proceed with Git cloning.")
         return

    # The GitRepoParser will handle URL encoding of credentials
    repo_url = f"http://{gitea_host}:{gitea_http_port}/{git_repo_org}/{git_repo_name}.git"
    local_repo_path = f"/app/repos/{git_repo_name}" # Path inside the Docker container

    try:
        parser = GitRepoParser(repo_url, local_repo_path, gitea_admin_user, gitea_admin_password)
        all_commits_data = parser.get_repo_data()
    except Exception as e:
        print(f"Failed to clone or parse Git repository: {e}")
        return # Exit if Git operations fail


    print(f"Processing {len(all_commits_data)} commits...")

    # Helper function to validate ArangoDB keys
    def is_valid_arangodb_key(key):
        # ArangoDB keys must be strings and can only contain the characters a-z, A-Z, 0-9,
        # as well as the dash (-), underscore (_), colon (:), dot (.), and at sign (@).
        # Keys cannot start with a dash or dot.
        if not isinstance(key, str) or not key:
            return False
        if key[0] in ['-', '.']:
            return False
        return re.fullmatch(r'[a-zA-Z0-9\-_:@.]+', key) is not None


    for commit_data in all_commits_data:
        try:
            # 1. Ingest Developer
            # Using email hash for a more robust key and validating it
            developer_key = hashlib.sha256(commit_data["author_email"].encode()).hexdigest()
            if not is_valid_arangodb_key(developer_key):
                print(f"Warning: Generated invalid ArangoDB key for developer email {commit_data['author_email']}. Skipping.")
                developer_id = None
            else:
                developer_id = arangodb_manager.insert_document(
                    "Developers",
                    {"name": commit_data["author_name"], "email": commit_data["author_email"]},
                    key=developer_key
                )

            # 2. Ingest Commit
            # Using commit hash as key and validating it
            commit_key = commit_data["hash"]
            if not is_valid_arangodb_key(commit_key):
                 print(f"Warning: Generated invalid ArangoDB key for commit hash {commit_data['hash']}. Skipping.")
                 commit_id = None
            else:
                commit_id = arangodb_manager.insert_document(
                    "Commits",
                    {
                        "hash": commit_data["hash"],
                        "message": commit_data["message"],
                        "timestamp": commit_data["timestamp"],
                        "lines_added": commit_data["lines_added"],
                        "lines_deleted": commit_data["lines_deleted"],
                        "author_email": commit_data["author_email"],
                    },
                    key=commit_key
                )

            # 3. Create AuthoredBy relationship
            if commit_id and developer_id:
                arangodb_manager.insert_edge("AuthoredBy", commit_id, developer_id)

            # 4. Ingest Files, Classes, Methods, and Modifies relationships
            for file_data in commit_data["files_detailed"]:
                # Using path hash for a more robust key and validating it
                file_key = hashlib.sha256(file_data["path"].encode()).hexdigest()
                if not is_valid_arangodb_key(file_key):
                    print(f"Warning: Generated invalid ArangoDB key for file path {file_data['path']}. Skipping file.")
                    file_id = None
                else:
                    file_id = arangodb_manager.insert_document(
                        "Files",
                        {
                            "path": file_data["path"],
                            "content_hash": file_data["content_hash"], # Use the hash generated in parser
                            "last_commit_hash": commit_data["hash"],
                            "embedding": file_data["file_embedding"]
                        },
                        key=file_key
                    )
                    if commit_id and file_id:
                        arangodb_manager.insert_edge("Modifies", commit_id, file_id)

                    for element in file_data["parsed_elements"]:
                        # Using a composite key including file hash, element type, name, and start line and validating it
                        element_composite_key_str = f"{file_key}_{element['type']}_{element['name']}_{element['start_line']}"
                        element_key = hashlib.sha256(element_composite_key_str.encode()).hexdigest()

                        if not is_valid_arangodb_key(element_key):
                             print(f"Warning: Generated invalid ArangoDB key for element in file {file_data['path']}. Skipping element.")
                             continue # Skip this element

                        if element["type"] in ["class_definition", "struct_definition", "enum_definition"]:
                            class_id = arangodb_manager.insert_document(
                                "Classes", # Assuming Classes collection can hold various code structure types
                                {
                                    "name": element["name"],
                                    "type": element["type"], # Store the specific type
                                    "file_path": element["file_path"],
                                    "start_line": element["start_line"],
                                    "end_line": element["end_line"],
                                    "code_snippet": element["code_snippet"],
                                    "embedding": element["embedding"]
                                },
                                key=element_key
                            )
                            if commit_id and class_id:
                                arangodb_manager.insert_edge("Modifies", commit_id, class_id)
                        elif element["type"] in ["function_definition", "method_definition"]:
                            method_id = arangodb_manager.insert_document(
                                "Methods",
                                {
                                    "name": element["name"],
                                    "type": element["type"], # Store the specific type
                                    "file_path": element["file_path"],
                                    "start_line": element["start_line"],
                                    "end_line": element["end_line"],
                                    "code_snippet": element["code_snippet"],
                                    "embedding": element["embedding"]
                                },
                                key=element_key
                            )
                            if commit_id and method_id:
                                arangodb_manager.insert_edge("Modifies", commit_id, method_id)

                # TODO: Implement parsing for 'Calls', 'Imports', 'DependsOn' relationships
                # This often requires more sophisticated AST analysis across files/modules.

                # TODO: Integrate with Jira/other external systems for 'Fixes', 'Implements' relationships
                # This would involve querying Jira API using commit messages (e.g., regex for JIRA-XXX)
                # and then ingesting 'Features' and 'Bugs' nodes and 'Fixes'/'Implements' edges.
        except Exception as e:
            print(f"Error processing commit {commit_data.get('hash', 'N/A')}: {e}")
            # Continue to next commit, or break if error is critical
            # break # Uncomment to stop on first commit error


    print("Ingestion complete!")

if __name__ == "__main__":
    run_ingestion()

## Step 4: Deploy and Run the System

1.  **Create Directories:**

    ```bash
    mkdir your-ai-coder-system
    cd your-ai-coder-system
    mkdir gitea_data arangodb_data repos ingestion_pipeline ingestion_pipeline/app
    ```

2.  **Place Files:**

      * `docker-compose.yml` into `your-ai-coder-system/`
      * `.env` into `your-ai-coder-system/`
      * `Dockerfile` into `your-ai-coder-system/ingestion_pipeline/`
      * `requirements.txt` into `your-ai-coder-system/ingestion_pipeline/`
      * `main.py`, `git_parser.py`, `arangodb_manager.py` into `your-ai-coder-system/ingestion_pipeline/app/`

3.  **Build and Start Services:**
    Navigate to the `your-ai-coder-system/` directory in your terminal and run:

    ```bash
    docker compose up -d --build
    ```

      * `--build` ensures your `ingestion_pipeline` image is built.
      * `-d` runs containers in detached mode.

4.  **Initial Gitea Setup (Manual - First Time Only):**

      * Open your browser to `http://localhost:3000`.
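      * Because `GITEA__SECURITY__INSTALL_LOCK=true` is set, Gitea should skip the initial configuration page and show the home or login page directly.
      * **Crucially:** Make sure the admin account actually exists. Gitea may not create a user from the `GITEA_ADMIN_USER`/`GITEA_ADMIN_PASSWORD` values on its own. If the login in the next step fails, create the admin once from the CLI, e.g. `docker exec -u git gitea-server gitea admin user create --admin --username <user> --password <password> --email <email>`.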
      * Log in with the `GITEA_ADMIN_USER` and `GITEA_ADMIN_PASSWORD` from your `.env` file.
      * **Create an Organization:** Create a new organization, e.g., `ai-coder-org` (matching `GIT_REPO_ORG` in `docker-compose.yml`).
      * **Create a Repository:** Inside that organization, create a new repository, e.g., `sample-codebase` (matching `GIT_REPO_NAME`). Initialize it with a README.
      * **Push Sample Code:** Clone the new repository to your *host machine* (e.g., `git clone http://localhost:3000/ai-coder-org/sample-codebase.git`). Add some sample Python, JavaScript, or Java files to it, then commit and push them back to Gitea. This gives the ingestion pipeline data to process.

5.  **Run the Ingestion Pipeline:**
    Once Gitea has some code, you can trigger the ingestion.

    ```bash
    docker compose up ingestion_pipeline
    ```

      * This will restart the `ingestion_pipeline` container. It will connect to Gitea, clone the `sample-codebase` repository, process its commits, and ingest the data into ArangoDB.
      * Check logs: `docker logs git-arangodb-ingest` to see the progress and any errors.

6.  **Verify Data in ArangoDB:**

      * Open your browser to `http://localhost:8529`.
      * Log in to ArangoDB as `root` with the `ARANGO_ROOT_PASSWORD` from your `.env` file (if you configured a random root password instead, get it from `docker logs arangodb-instance`).
      * Navigate to the "Collections" and "Graphs" sections to see the `Commits`, `Developers`, `Files`, `Classes`, and `Methods` collections and the `SoftwareKnowledgeGraph`. You should see your ingested data!
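
As a scripted alternative to clicking through the web UI, the check in step 6 could be done from Python. This is a minimal sketch, assuming the `python-arango` client is installed on the host and that the collections live in the database named by a hypothetical `ARANGO_DB` variable (defaulting to `_system`). `collection_counts` and `connect_from_env` are illustrative helper names, not part of the pipeline above.

```python
import os
from typing import Dict, Iterable

# Collection names taken from the guide; adjust if your schema differs.
COLLECTIONS = ("Commits", "Developers", "Files", "Classes", "Methods")


def collection_counts(db, names: Iterable[str] = COLLECTIONS) -> Dict[str, int]:
    """Return the document count per collection (0 if the collection is missing)."""
    counts = {}
    for name in names:
        if db.has_collection(name):
            counts[name] = db.collection(name).count()
        else:
            counts[name] = 0
    return counts


def connect_from_env():
    """Open a database handle using environment variables (names are assumptions)."""
    from arango import ArangoClient  # provided by the python-arango package

    host = os.environ.get("ARANGO_HOST", "localhost")
    port = os.environ.get("ARANGO_PORT", "8529")
    client = ArangoClient(hosts=f"http://{host}:{port}")
    return client.db(
        os.environ.get("ARANGO_DB", "_system"),  # hypothetical variable
        username=os.environ.get("ARANGO_USER", "root"),
        password=os.environ.get("ARANGO_ROOT_PASSWORD", ""),
    )
```

With the stack running, `collection_counts(connect_from_env())` should report non-zero counts for `Commits` and `Files` once ingestion has succeeded.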

### Important Considerations and Next Steps:

  * **ArangoDB Password Management:** For production, avoid `ARANGO_RANDOM_ROOT_PASSWORD=1`. Instead, set `ARANGO_ROOT_PASSWORD` in your `.env` and pass it to both `arangodb` and `ingestion_pipeline` services in `docker-compose.yml`. Even better, use Docker Secrets.
  * **Gitea Authentication:** Using the Gitea admin password directly in the `GIT_REPO_URL` is convenient for a local demo but insecure for production. For production, create a dedicated Git user with a personal access token for the ingestion pipeline, and use that token in the URL.
  * **Incremental Sync:** The `git_parser.py` currently fetches a limited number of recent commits. You need to implement the logic for incremental updates:
    1.  Store the hash of the *last successfully processed commit* in ArangoDB.
    2.  On subsequent runs, retrieve this hash.
    3.  Call `parser.get_repo_data(since_commit_hash=last_hash)` to only process new commits.
    4.  Update the `last_processed_commit_hash` in ArangoDB after a successful run.
    5.  Handle file renames and deletions (Git diffs will show these; you'll need to update/delete corresponding nodes in ArangoDB).
  * **AST Parsing Depth:** The `_parse_code_structure` in `git_parser.py` is a very basic example. Real-world code parsing for `Calls` and `Imports` relationships requires much more sophisticated AST traversal and symbol resolution. Libraries like `tree-sitter` are powerful but require careful implementation for deep code analysis.
  * **External Integrations (Jira/Features/Bugs):**
      * You'll need to add code to `main.py` to connect to your Jira (or other issue tracker) API.
      * Parse commit messages for issue IDs (e.g., regex `[A-Z]+-\d+`).
      * Query the Jira API for details about those issues.
      * Ingest `(:Feature)` and `(:Bug)` nodes and `(:Fixes)` / `(:Implements)` edges, linking them to `(:Commit)` nodes.
  * **Scheduling:** Once the incremental sync is implemented, decide how to run it:
      * A `cron` job on your host that runs `docker compose up ingestion_pipeline` periodically.
      * A Git webhook from Gitea that triggers this Docker Compose command (requires a small web server to receive the webhook).
  * **Resource Management:** Monitor CPU, memory, and disk usage, especially as your codebase and knowledge graph grow. Adjust Docker resource limits if necessary.
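
The incremental-sync steps listed above can be sketched as a checkpointed loop. This is a minimal sketch under stated assumptions: the commit history is available as a list of hashes ordered oldest first, and `process` and `save_checkpoint` are hypothetical callbacks standing in for the ingestion logic and for writing the last processed hash to ArangoDB. It does not handle file renames or deletions.

```python
from typing import Callable, Iterable, List, Optional


def select_new_commits(history: Iterable[str], last_processed: Optional[str]) -> List[str]:
    """Return the commits that come after `last_processed` (history is oldest first)."""
    commits = list(history)
    if last_processed is None:
        return commits  # first run: process everything
    try:
        index = commits.index(last_processed)
    except ValueError:
        # Checkpoint not found (e.g. history was rewritten): reprocess everything.
        return commits
    return commits[index + 1:]


def run_incremental(
    history: Iterable[str],
    last_processed: Optional[str],
    process: Callable[[str], None],
    save_checkpoint: Callable[[str], None],
) -> List[str]:
    """Process only new commits, advancing the checkpoint after each success."""
    pending = select_new_commits(history, last_processed)
    for commit_hash in pending:
        process(commit_hash)          # if this raises, the checkpoint is not advanced
        save_checkpoint(commit_hash)  # remember the last successfully processed commit
    return pending
```

Saving the checkpoint after every commit, rather than once at the end, means a failed run resumes from the last success instead of starting over.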

This detailed setup provides a robust foundation for your AI coder's knowledge graph.
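
For the issue-tracker integration described under External Integrations, a first step is pulling issue keys out of commit messages. The sketch below uses the `[A-Z]+-\d+` pattern from that list, with word boundaries added. `extract_issue_keys` is a hypothetical helper name, and querying the Jira API with the resulting keys is left out.

```python
import re
from typing import List

# Pattern from the guide, wrapped in word boundaries to avoid partial matches.
ISSUE_KEY_PATTERN = re.compile(r"\b[A-Z]+-\d+\b")


def extract_issue_keys(commit_message: str) -> List[str]:
    """Return the unique issue keys in a commit message, in order of first appearance."""
    matches = ISSUE_KEY_PATTERN.findall(commit_message)
    return list(dict.fromkeys(matches))  # dict preserves insertion order


keys = extract_issue_keys("Fix crash JIRA-123 and ABC-7; see JIRA-123")
print(keys)
```

The pattern also matches strings such as `UTF-8`, so a real pipeline would probably filter the results against the project keys that actually exist in the tracker.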

In [None]:
# your-ai-coder-system/docker-compose.yml
version: '3.8'

networks:
  ai_coder_net:
    driver: bridge

volumes:
  gitea_data:
  arangodb_data:

services:
  # 1. Gitea - Your Git Server
  gitea:
    image: gitea/gitea:1.21.11 # Using a stable version
    container_name: gitea-server
    environment:
      - GITEA__DATABASE__TYPE=sqlite3 # Simple for local setup
      - GITEA__DATABASE__PATH=/data/gitea/gitea.db
      - GITEA__APP_NAME=Your AI Coder Git
      - GITEA__SERVER__DOMAIN=localhost
      - GITEA__SERVER__HTTP_PORT=3000
      - GITEA__SERVER__ROOT_URL=http://localhost:3000/
      - GITEA__SERVER__SSH_PORT=2222 # If you plan to use SSH for Git
      - GITEA__SSH_GENERATED_KEY_TYPE=ed25519
      - GITEA__SECURITY__INSTALL_LOCK=true # Prevents re-installation via web UI
      - GITEA__ADMIN_USER=${GITEA_ADMIN_USER} # From .env
      - GITEA__ADMIN_PASSWORD=${GITEA_ADMIN_PASSWORD} # From .env
    volumes:
      - gitea_data:/data # Persistent volume for Gitea's internal data and Git repos
    ports:
      - "3000:3000" # Web UI
      - "2222:2222" # SSH for Git (optional, if you use SSH cloning)
    networks:
      - ai_coder_net
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/"]
      interval: 10s
      timeout: 5s
      retries: 5

  # 2. ArangoDB - Your Knowledge Graph Database
  arangodb:
    image: arangodb/arangodb:latest
    container_name: arangodb-instance
    # Use a fixed password from .env for easier integration with the ingestion pipeline
    environment:
      ARANGO_ROOT_PASSWORD: "${ARANGO_ROOT_PASSWORD}"
    volumes:
      - arangodb_data:/var/lib/arangodb3 # Persistent data volume
    ports:
      - "8529:8529" # ArangoDB Web UI and API
    networks:
      - ai_coder_net
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8529/_admin/server/version"]
      interval: 10s
      timeout: 5s
      retries: 5

  # 3. Ingestion Pipeline - Processes Git data into ArangoDB
  ingestion_pipeline:
    build: ./ingestion_pipeline # Build from Dockerfile in this directory
    container_name: git-arangodb-ingest
    depends_on:
      arangodb:
        condition: service_healthy # Ensure ArangoDB is ready
    environment:
      # ArangoDB Connection
      ARANGO_HOST: arangodb-instance # Container name; resolvable on the Compose network (the service name `arangodb` works too)
      ARANGO_PORT: 8529
      ARANGO_USER: root
      ARANGO_PASSWORD: "${ARANGO_ROOT_PASSWORD}" # Use the fixed password from .env

      # Gitea Connection (for cloning repos)
      GITEA_HOST: gitea-server # Container name; resolvable on the Compose network (the service name `gitea` works too)
      GITEA_HTTP_PORT: 3000
      GITEA_ADMIN_USER: ${GITEA_ADMIN_USER}
      GITEA_ADMIN_PASSWORD: ${GITEA_ADMIN_PASSWORD}
      # This example assumes you'll clone via HTTP with admin credentials.
      # For production, consider using a dedicated Git user token or SSH keys.

      # Git Repo to Ingest (example, you'll update this)
      GIT_REPO_ORG: "ai-coder-org" # Example organization on Gitea
      GIT_REPO_NAME: "sample-codebase" # Example repository name on Gitea
      # The full URL will be constructed in the Python script
    volumes:
      - ./repos:/app/repos # Mount a volume for cloning Git repos
    networks:
      - ai_coder_net
    # command: python /app/main.py # Uncomment if you want to explicitly define the entrypoint
    # entrypoint: ["python", "/app/main.py"] # Alternative to CMD in Dockerfile
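
The comment above says the full clone URL is constructed in the Python script. A minimal sketch of that step, assuming HTTP cloning with the credentials and variable names from the `ingestion_pipeline` environment block, might look like this. `build_clone_url` and `clone_url_from_env` are hypothetical helper names, and the actual `main.py` may build the URL differently.

```python
import os
from typing import Mapping, Optional
from urllib.parse import quote


def build_clone_url(
    host: str,
    port: int,
    org: str,
    repo: str,
    user: Optional[str] = None,
    password: Optional[str] = None,
) -> str:
    """Build an HTTP clone URL, percent-encoding credentials so '@' or '/' cannot break it."""
    auth = ""
    if user and password:
        auth = f"{quote(user, safe='')}:{quote(password, safe='')}@"
    return f"http://{auth}{host}:{port}/{org}/{repo}.git"


def clone_url_from_env(env: Mapping[str, str] = os.environ) -> str:
    """Read the variables defined for the ingestion_pipeline service."""
    return build_clone_url(
        host=env.get("GITEA_HOST", "gitea-server"),
        port=int(env.get("GITEA_HTTP_PORT", "3000")),
        org=env["GIT_REPO_ORG"],
        repo=env["GIT_REPO_NAME"],
        user=env.get("GITEA_ADMIN_USER"),
        password=env.get("GITEA_ADMIN_PASSWORD"),
    )
```

Percent-encoding matters here because a password containing `@`, `:`, or `/` would otherwise be parsed as part of the host or path.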

In [None]:
# ingestion_pipeline/Dockerfile
FROM python:3.10-slim-bookworm

# Install Git and any other system dependencies needed for parsing/embeddings
RUN apt-get update && \
    apt-get install -y git libmagic1 && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

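# Copy requirements file first to leverage Docker cache.
# Paths are relative to the build context (./ingestion_pipeline in docker-compose.yml).
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code (main.py, git_parser.py, arangodb_manager.py) into /app
COPY app/ .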

# Define a default command to run when the container starts
CMD ["python", "main.py"]