fix(git-integration): target commit from an outdated branch (CM-717) #3480
Pull Request Overview
Implements default branch tracking and adaptive cloning strategy for git repositories to improve correctness when the default branch changes and optimize incremental processing.
- Adds branch column and model field to persist the tracked default branch.
- Introduces branch-change detection and clone strategy selection (full vs minimal/batched).
- Propagates clone mode via CloneBatchInfo instead of passing a separate parameter.
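As a rough illustration of the last point, the clone mode can travel inside the batch object rather than as an extra argument. This is a minimal sketch only: `clone_with_batches` is the flag named in this PR, while the other fields and the `describe_clone_mode` helper are hypothetical and may not match the real `CloneBatchInfo` model.

```python
from dataclasses import dataclass


@dataclass
class CloneBatchInfo:
    repo_path: str            # hypothetical field, for illustration only
    is_final_batch: bool      # hypothetical field, for illustration only
    clone_with_batches: bool  # flag added by this PR


def describe_clone_mode(batch: CloneBatchInfo) -> str:
    """Read the clone mode from the batch instead of a separate parameter."""
    return "batched" if batch.clone_with_batches else "full"
```

With this shape, any consumer that already receives the batch info can branch on the clone mode without a signature change.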
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| repository_worker.py | Updates processing to remove explicit clone mode flag and persist detected default branch after final batch. |
| services/utils.py | Adds remote default branch discovery helper and local default branch retrieval. |
| commit_service.py | Refactors commit processing to read clone mode from batch info. |
| clone_service.py | Adds strategy logic, branch change detection, and refactors clone operations. |
| models/repository.py | Adds branch field to Repository model. |
| models/clone_batch.py | Adds clone_with_batches flag to batch info model. |
| database/crud.py | Extends update_last_processed_commit to (optionally) store branch (currently always overwrites). |
| migrations (add/remove branch) | Schema migration to add/drop branch column with documentation comment. |
```python
async def _perform_full_clone(self, repo_path: str, remote: str):
    """Perform full repository clone"""
    self.logger.info(f"Performing full clone for repo {remote}...")
    await run_shell_command(["git", "clone", remote, repo_path], cwd=repo_path)
```
The full clone command passes `repo_path` as the destination while also setting `cwd` to that same directory, which `mkdtemp` has already created. `git clone` refuses to write into an existing destination unless it is empty, so the command fails whenever that directory is not empty. Either use a parent temp directory and let git create the target folder, or keep the temp directory as the working directory and clone into `.` so the repository contents populate the existing directory. Example fix: create `temp_dir = tempfile.mkdtemp(...)`, run `run_shell_command(["git", "clone", remote, "."], cwd=temp_dir)`, and treat `temp_dir` as `repo_path`.
```diff
- await run_shell_command(["git", "clone", remote, repo_path], cwd=repo_path)
+ await run_shell_command(["git", "clone", remote, "."], cwd=repo_path)
```
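The suggested fix could look roughly like the sketch below. It assumes a runner with the same shape as the project's `run_shell_command(cmd, cwd=...)`; the `Runner` alias and the standalone `perform_full_clone` function are hypothetical names used only so the example is self-contained.

```python
import tempfile
from typing import Awaitable, Callable, Sequence

# Hypothetical stand-in for the project's run_shell_command(cmd, cwd=...) helper.
Runner = Callable[[Sequence[str], str], Awaitable[object]]


async def perform_full_clone(remote: str, run: Runner) -> str:
    """Clone `remote` into a fresh temporary directory and return its path."""
    # mkdtemp creates the directory, so it already exists (and is empty) here.
    temp_dir = tempfile.mkdtemp(prefix="git-clone-")
    # The destination "." is resolved against cwd, so git fills the existing
    # empty directory instead of trying to create a new one.
    await run(["git", "clone", remote, "."], temp_dir)
    return temp_dir
```

Injecting the runner keeps the function easy to exercise without network access; in the real service the existing shell helper would be passed in or called directly.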
```diff
  sql_query = """
  UPDATE git.repositories
  SET "lastProcessedCommit" = $1,
+     "branch" = $2,
      "updatedAt" = NOW()
- WHERE id = $2
+ WHERE id = $3
  """
- result = await execute(sql_query, (commit_hash, repo_id))
+ result = await execute(sql_query, (commit_hash, branch, repo_id))
```
Passing no branch (`None`) overwrites the existing `branch` value with NULL. This contradicts the "optionally" wording and loses a previously stored branch when only the commit is updated. Either build the query without the `"branch"` assignment when `branch` is `None` (two SQL variants), or use `SET "branch" = COALESCE($2, "branch")` and always pass `branch`. The `COALESCE` form is only suitable if you never need to intentionally clear the column.
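A minimal sketch of the two-variant approach, assuming the same positional-parameter style (`$1`, `$2`, ...) as the snippet under review. The `build_update_query` helper is a hypothetical name; the real `update_last_processed_commit` may be structured differently.

```python
from typing import Optional, Tuple


def build_update_query(
    commit_hash: str, repo_id: str, branch: Optional[str] = None
) -> Tuple[str, tuple]:
    """Return (sql, params), touching "branch" only when a value is given."""
    if branch is None:
        # Leave the stored branch untouched when the caller has none to set.
        sql = """
            UPDATE git.repositories
            SET "lastProcessedCommit" = $1,
                "updatedAt" = NOW()
            WHERE id = $2
        """
        return sql, (commit_hash, repo_id)

    sql = """
        UPDATE git.repositories
        SET "lastProcessedCommit" = $1,
            "branch" = $2,
            "updatedAt" = NOW()
        WHERE id = $3
    """
    return sql, (commit_hash, branch, repo_id)
```

The caller would then unpack the result, for example `sql, params = build_update_query(commit_hash, repo_id, branch)`, and pass both to `execute`. Compared with `COALESCE`, this keeps the option of adding an explicit "clear branch" path later.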
Changes proposed ✍️
This pull request introduces significant improvements to the repository cloning and processing workflow, primarily by tracking and responding to changes in the default branch of git repositories. The changes add a `branch` column to the database, update the models and migration scripts, and enhance the cloning logic to detect and handle default branch changes, ensuring accurate and efficient processing.

Database schema and model updates:
- Added a `branch` column to the `git.repositories` table to track the default branch for each repository, with corresponding migration scripts to add and remove the column.
- Updated the `Repository` model to include the new `branch` field, and updated relevant SQL queries and return values to handle the branch information.

Cloning logic improvements:
- Added methods to detect default branch changes (`has_default_branch_changed`) and to determine the appropriate cloning strategy (`determine_clone_strategy`). Full clones are now triggered when the branch changes or for new repositories, while incremental (batched) clones are used otherwise.
- Added `get_remote_default_branch` to determine the default branch of a remote repository without cloning it.

Clone and commit processing updates:
- Propagated the `clone_with_batches` flag through the batch info and commit processing logic.

These changes together make the system more robust against changes in repository configuration and improve the efficiency of repository processing.
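The strategy described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR. It assumes the remote default branch is discovered with `git ls-remote --symref <remote> HEAD`, whose first output line has the form `ref: refs/heads/<branch>` followed by a tab and `HEAD`. The function names mirror those in the description, but the signatures, the `parse_symref_head` helper, and the `is_new_repo` parameter are hypothetical.

```python
from typing import Optional


def parse_symref_head(ls_remote_output: str) -> Optional[str]:
    """Extract the default branch from `git ls-remote --symref <remote> HEAD` output."""
    prefix = "refs/heads/"
    for line in ls_remote_output.splitlines():
        # Expected form: "ref: refs/heads/<branch>\tHEAD"
        if line.startswith("ref: ") and line.endswith("\tHEAD"):
            ref = line[len("ref: "):].split("\t", 1)[0]
            if ref.startswith(prefix):
                return ref[len(prefix):]
    return None


def has_default_branch_changed(stored: Optional[str], remote: Optional[str]) -> bool:
    """True only when both branches are known and they differ."""
    return stored is not None and remote is not None and stored != remote


def determine_clone_strategy(
    is_new_repo: bool, stored_branch: Optional[str], remote_branch: Optional[str]
) -> bool:
    """Return the clone_with_batches flag: False means full clone, True means batched."""
    if is_new_repo or has_default_branch_changed(stored_branch, remote_branch):
        return False
    return True
```

Keeping the parsing and the decision as pure functions means the network call is the only part that needs a real remote, which makes the branch-change logic straightforward to unit test.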
Checklist ✅
`Feature`, `Improvement`, or `Bug`.