fix(git-integration): insertions deletions memory consumption [CM-724]#3505

Merged: mbani01 merged 4 commits into main from fix/insertions_deletions_memory_consumption on Oct 13, 2025

@mbani01 mbani01 commented Oct 13, 2025

This pull request refactors the commit processing pipeline in commit_service.py to streamline how commit metadata and numstat (insertions/deletions) data are extracted and handled. The main change is the unification of commit and numstat extraction into a single git log command and the corresponding update to downstream processing logic. Several memory optimizations and minor bug fixes are also included.

Commit and Numstat Extraction Refactor:

  • Replaced the separate commit and numstat splitters with a unified COMMIT_START_SPLITTER and NUMSTAT_SPLITTER, and updated the git log formatting so that a single command emits both metadata and numstat. (commit_service.py)
  • Changed the commit processing logic to split each commit's text at the new numstat splitter, parsing insertions and deletions per commit instead of looking them up in a pre-parsed map. (process_commits_chunk, _construct_commit_dict, _parse_numstats)
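
As a rough illustration of the single-command approach described above, the sketch below parses combined metadata and numstat output. The splitter values, the format string, and the helper names (`parse_numstats`, `parse_log_output`) are assumptions made for this example and are not taken from commit_service.py.

```python
# Hypothetical splitter values; the real constants live in commit_service.py.
COMMIT_START_SPLITTER = "--CM-COMMIT-START--"
NUMSTAT_SPLITTER = "--CM-NUMSTAT--"

# Assumed shape of the single git log invocation (built here, not executed).
GIT_LOG_ARGS = [
    "git", "log", "--numstat",
    f"--pretty=format:{COMMIT_START_SPLITTER}%H%n%an%n%s%n{NUMSTAT_SPLITTER}",
]


def parse_numstats(numstats_text: str) -> tuple[int, int]:
    """Sum insertions and deletions from numstat lines (added<TAB>removed<TAB>path)."""
    insertions = deletions = 0
    for line in numstats_text.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # blank or malformed line
        added, removed = parts[0], parts[1]
        # git reports "-" for binary files, so only count numeric fields.
        if added.isdigit():
            insertions += int(added)
        if removed.isdigit():
            deletions += int(removed)
    return insertions, deletions


def parse_log_output(raw: str) -> list[dict]:
    """Split raw output into commits, then split each commit at the numstat marker."""
    commits = []
    for chunk in raw.split(COMMIT_START_SPLITTER):
        if not chunk.strip():
            continue
        meta, _, numstats_text = chunk.partition(NUMSTAT_SPLITTER)
        lines = meta.strip().splitlines()
        commit_hash, author, subject = (lines + ["", "", ""])[:3]
        insertions, deletions = parse_numstats(numstats_text)
        commits.append({
            "hash": commit_hash,
            "author": author,
            "subject": subject,
            "insertions": insertions,
            "deletions": deletions,
        })
    return commits


sample = (
    f"{COMMIT_START_SPLITTER}abc123\nAda\nFix parser\n{NUMSTAT_SPLITTER}\n"
    "3\t1\tsrc/a.py\n-\t-\timg/logo.png\n"
    f"{COMMIT_START_SPLITTER}def456\nLin\nAdd docs\n{NUMSTAT_SPLITTER}\n"
    "10\t0\tREADME.md\n"
)
parsed = parse_log_output(sample)
```

Because each commit carries its own numstat text, no repository-wide map of insertions and deletions has to be held in memory while commits are processed.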

Memory and Performance Improvements:

  • Added explicit deletion of large variables (e.g., commit, commit_lines, numstats_text, chunk_activities_db, chunk_activities_queue) after use to reduce memory usage during batch processing.

Interface and Argument Changes:

  • Updated function signatures to reflect the new single-source commit/numstat extraction, removing now-unnecessary arguments and simplifying method calls.

Bug Fixes and Minor Improvements:

  • Fixed encoding fallback order in _safe_decode to prefer iso-8859-1 before cp1252, which is more robust for legacy content. (utils.py)
  • Added error replacement to Kafka message encoding to prevent crashes on invalid characters. (queue_service.py)
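
The two fixes above could look roughly like the following. This is a minimal sketch: the function names `safe_decode` and `encode_for_queue` and the exact encoding list are assumptions for illustration, not the code in utils.py or queue_service.py.

```python
def safe_decode(data: bytes, encodings=("utf-8", "iso-8859-1", "cp1252")) -> str:
    """Try each encoding in order and return the first successful decode."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Safeguard in case the encoding list is changed to one that can fail entirely.
    return data.decode("utf-8", errors="replace")


def encode_for_queue(message: str) -> bytes:
    """Encode a message as UTF-8, replacing unencodable characters with '?'."""
    return message.encode("utf-8", errors="replace")


legacy_text = safe_decode(b"caf\xe9")        # not valid UTF-8, falls through to iso-8859-1
utf8_text = safe_decode("héllo".encode("utf-8"))
payload = encode_for_queue("a\udc80b")       # lone surrogate cannot be encoded as UTF-8
```

One detail worth keeping in mind: Python's iso-8859-1 codec maps every byte value, so in this sketch any entry listed after it is never reached.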

@mbani01 mbani01 self-assigned this Oct 13, 2025
@github-actions

⚠️ Jira Issue Key Missing

Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability.

Example:

  • feat: add user authentication (CM-123)
  • feat: add user authentication (IN-123)

Projects:

  • CM: Community Data Platform
  • IN: Insights

Please add a Jira issue key to your PR title.

@mbani01 mbani01 changed the title fix(git-integration): insertions deletions memory consumption fix(git-integration): insertions deletions memory consumption [CM-724] Oct 13, 2025

@cursor cursor Bot left a comment

if CommitService.should_skip_commit(full_commit_text, edge_commit_hash):
    continue

commit_text, numstats_text = full_commit_text.split(CommitService.NUMSTAT_SPLITTER)

Bug: Commit Splitting Fails on Incorrect NUMSTAT Splitter Usage

The split() operation on full_commit_text expects exactly two parts: commit metadata and numstat lines, separated by NUMSTAT_SPLITTER. If the splitter appears zero or multiple times (e.g., within a commit message), unpacking the result will raise a ValueError, causing commit processing to fail.
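
One way to avoid the unpacking failure described above is to split with `str.rpartition`, which always returns exactly three parts. The sketch below is an illustration only: the splitter value and the helper name `split_commit_text` are hypothetical, and it assumes the real splitter is the last occurrence in the text because numstat output follows the commit message.

```python
NUMSTAT_SPLITTER = "--CM-NUMSTAT--"  # hypothetical value for illustration


def split_commit_text(full_commit_text: str) -> tuple[str, str]:
    """Return (commit_text, numstats_text) without raising on 0 or 2+ splitters."""
    commit_text, separator, numstats_text = full_commit_text.rpartition(NUMSTAT_SPLITTER)
    if not separator:
        # Splitter missing: treat the whole text as metadata with no numstat.
        return full_commit_text, ""
    return commit_text, numstats_text


normal = split_commit_text(f"meta{NUMSTAT_SPLITTER}1\t2\tfile.py")
missing = split_commit_text("meta only")
repeated = split_commit_text(
    f"msg {NUMSTAT_SPLITTER} in body{NUMSTAT_SPLITTER}1\t2\tfile.py"
)
```

Splitting at the last occurrence keeps a splitter that appears inside a commit message as part of the metadata instead of failing.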

finally:
    del commit
    del commit_lines
    del numstats_text

Bug: Unconditional Deletion in finally Block

The finally block in process_commits_chunk attempts to del variables like commit, commit_lines, and numstats_text unconditionally. This can cause a NameError if an exception occurs before these variables are defined within an iteration, or if commit was already deleted in a previous loop iteration.
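
A possible way to sidestep the NameError is to bind the names before the try block and release them by rebinding to None rather than using `del`. The loop below is a simplified, hypothetical stand-in for process_commits_chunk (the name `process_chunk_sketch` and the parsing details are assumptions), meant only to show the cleanup pattern.

```python
def process_chunk_sketch(commit_texts: list[str]) -> list[dict]:
    """Process commits one by one, releasing per-commit data after each iteration."""
    results = []
    for full_commit_text in commit_texts:
        # Bind every name up front so the finally block can never hit a NameError.
        commit = commit_lines = numstats_text = None
        try:
            commit_lines = full_commit_text.splitlines()
            if not commit_lines:
                raise ValueError("empty commit text")
            numstats_text = "\n".join(commit_lines[1:])
            commit = {"hash": commit_lines[0], "line_count": len(commit_lines)}
            results.append(commit)
        except ValueError:
            continue  # skip malformed commits; finally still runs
        finally:
            # Rebinding to None drops the references without requiring the names to exist.
            commit = commit_lines = numstats_text = None
    return results


processed = process_chunk_sketch(["abc\nline", "", "def"])
```

Rebinding to None releases the large objects in the same way `del` does, while remaining safe when an exception is raised before a variable has been assigned.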

@mbani01 mbani01 merged commit dfdea8c into main Oct 13, 2025
16 checks passed
@mbani01 mbani01 deleted the fix/insertions_deletions_memory_consumption branch October 13, 2025 14:23