feat: Add diff hunk parsing with line-by-line change extraction #7

Merged
nick-galluzzo merged 8 commits into main from feature/parse-hunks on Jul 30, 2025
@nick-galluzzo

Enhances git diff parsing to extract actual code changes at the line level, providing structured data optimized for AI processing. This foundational improvement enables more intelligent commit message generation by giving AI models the actual changed content rather than just metadata.

Key Changes

Enhanced Data Models

  • Added DiffHunk and HunkLine classes for structured diff representation
  • Enhanced FileDiff with hunks field containing actual change content
  • Added utility methods for accessing added/removed content
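A minimal sketch of how these models might fit together, using plain dataclasses. The class names DiffHunk, HunkLine, and FileDiff and the hunks field come from this PR; every other field, property, and method name below is an assumption made for illustration and may not match the merged code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HunkLine:
    # line_type follows unified diff markers: "+", "-", or " " (context).
    line_type: str
    content: str

    @property
    def is_added(self) -> bool:
        return self.line_type == "+"

    @property
    def is_removed(self) -> bool:
        return self.line_type == "-"

    @property
    def is_context(self) -> bool:
        return self.line_type == " "


@dataclass
class DiffHunk:
    # header is the raw "@@ -a,b +c,d @@ section" line (assumed field name).
    header: str
    lines: List[HunkLine] = field(default_factory=list)

    @property
    def added_lines(self) -> List[HunkLine]:
        return [line for line in self.lines if line.is_added]

    @property
    def removed_lines(self) -> List[HunkLine]:
        return [line for line in self.lines if line.is_removed]

    @property
    def context_lines(self) -> List[HunkLine]:
        return [line for line in self.lines if line.is_context]


@dataclass
class FileDiff:
    path: str
    hunks: List[DiffHunk] = field(default_factory=list)

    # Hypothetical utility methods for reading changed content per file.
    def added_content(self) -> List[str]:
        return [l.content for h in self.hunks for l in h.added_lines]

    def removed_content(self) -> List[str]:
        return [l.content for h in self.hunks for l in h.removed_lines]
```

Keeping the line classification on HunkLine means the hunk-level and file-level accessors are simple filters over the same data.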

Improved Diff Parser

  • Modified GitDiffParser to extract line-by-line changes using unidiff
  • Preserved existing functionality while adding detailed hunk parsing
  • Added support for hunk headers and section context
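For reference, line-by-line extraction with unidiff can be sketched roughly as follows. This is not the GitDiffParser code from this PR: parse_hunks, SAMPLE_DIFF, and the dictionary layout are hypothetical, and only the unidiff calls (PatchSet, hunk.source_start, hunk.target_start, hunk.section_header, line.line_type, line.value) are taken from the library.

```python
from typing import Any, Dict, List

# Hypothetical sample input: the kind of text `git diff --cached` prints.
SAMPLE_DIFF = '''\
diff --git a/app.py b/app.py
index 83db48f..bf3a1c2 100644
--- a/app.py
+++ b/app.py
@@ -2,3 +2,3 @@ def greet():
     name = "world"
-    print("hello")
+    print("hello, " + name)
     return None
'''


def parse_hunks(diff_text: str) -> List[Dict[str, Any]]:
    """Return one dict per changed file, each holding its parsed hunks."""
    # Imported inside the function so this sketch only needs unidiff
    # (a third-party package) when it is actually called.
    from unidiff import PatchSet

    files = []
    for patched_file in PatchSet(diff_text):
        hunks = []
        for hunk in patched_file:
            hunks.append({
                "source_start": hunk.source_start,
                "target_start": hunk.target_start,
                # Text after the closing "@@", e.g. the enclosing function.
                "section": hunk.section_header.strip(),
                # line_type is "+", "-", or " "; value keeps its newline.
                "lines": [
                    (line.line_type, line.value.rstrip("\n"))
                    for line in hunk
                ],
            })
        files.append({"path": patched_file.path, "hunks": hunks})
    return files
```

Calling parse_hunks(SAMPLE_DIFF) yields a single entry for app.py with one hunk whose lines are tagged as context, removed, added, and context, in that order.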

AI-Optimized Output

  • Added get_ai_context() method that outputs minimal git diff format
  • Research-backed approach that keeps the essential changes and omits git plumbing metadata
  • Format chosen to match the diff syntax LLMs are most likely to have seen in training data
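To illustrate what "minimal git diff format" could look like, here is a small sketch that renders file markers, hunk headers, and changed lines only. It assumes that "plumbing metadata" means lines such as `diff --git` and `index`; the function render_minimal_diff and its input shape are hypothetical and are not the get_ai_context() implementation.

```python
from typing import List, Tuple

# A hunk is (header, [(marker, text), ...]) where marker is "+", "-", or " ".
Hunk = Tuple[str, List[Tuple[str, str]]]


def render_minimal_diff(path: str, hunks: List[Hunk]) -> str:
    """Render a unified-diff style string without the plumbing lines."""
    out = [f"--- a/{path}", f"+++ b/{path}"]
    for header, lines in hunks:
        out.append(header)
        out.extend(f"{marker}{text}" for marker, text in lines)
    return "\n".join(out)


example = render_minimal_diff(
    "app.py",
    [(
        "@@ -2,3 +2,3 @@ def greet():",
        [
            (" ", '    name = "world"'),
            ("-", '    print("hello")'),
            ("+", '    print("hello, " + name)'),
            (" ", "    return None"),
        ],
    )],
)
print(example)
```

The result keeps the familiar `---`, `+++`, and `@@` structure, so the model sees the same shape it would see in ordinary git diff output, minus the blob hashes and mode lines.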

Technical Details

  • Uses unidiff library for robust diff parsing instead of custom regex
  • Maintains backward compatibility with existing CommitAnalysis.to_ai_context()
  • Clean separation between structured data (for analysis) and AI format (for generation)

Technical Decisions

  • The get_ai_context() method returns minimal git diff format (matching git diff output) based on research showing LLMs perform best with familiar training data formats
  • Chose the unidiff library over custom parsing for reliability and maintainability. unidiff lets us break each hunk apart and categorize the individual components of the git diff data more cleanly.
  • MVP approach prioritizes proven formats over custom optimizations

Testing

  • Manual testing with various staged changes (add/modify/delete/rename)
  • Verified output matches standard git diff format exactly

- Serialize HunkLine and add HunkLine model
- Serialize DiffHunk and add DiffHunk model
- Add HunkLine and DiffHunk models for detailed diff parsing
- Implement properties for accessing added, removed, and context lines
- Add `get_ai_context` method for an AI-friendly diff representation
- Connect hunks to FileDiff model for complete diff structure
- Reorganize model definitions for better code organization
@nick-galluzzo nick-galluzzo merged commit bd64c2b into main Jul 30, 2025
3 checks passed
@nick-galluzzo nick-galluzzo deleted the feature/parse-hunks branch July 30, 2025 03:29