-
Notifications
You must be signed in to change notification settings - Fork 45
Add Splitter classes #51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds new splitter classes to the graphgen project and refactors data types. The main purpose is to introduce text splitting functionality while consolidating data type definitions.
- Adds new text splitter classes for character-based, recursive character, and markdown splitting
- Consolidates data types (
ChunkandQAPair) into a newbases/datatypes.pymodule - Updates imports throughout the codebase to use the new datatype locations
Reviewed Changes
Copilot reviewed 20 out of 23 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
graphgen/bases/datatypes.py |
New consolidated datatype definitions for Chunk and QAPair |
graphgen/bases/base_splitter.py |
New abstract base class for text splitters |
graphgen/models/splitter/*.py |
New splitter implementations for different text splitting strategies |
tests/integration_tests/models/splitter/*.py |
Integration tests for the new splitter classes |
| Various model files | Import updates to use consolidated datatypes |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| index = text.find(chunk, max(0, offset)) | ||
| metadata["start_index"] = index | ||
| previous_chunk_len = len(chunk) | ||
| new_chunk = Chunk(content=chunk, metadata=metadata) |
Copilot
AI
Sep 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating Chunk instances without providing the required id field will cause runtime errors. The Chunk dataclass requires all three fields (id, content, metadata) but only content and metadata are being provided.
tests/integration_tests/models/splitter/test_markdown_splitter.py
Outdated
Show resolved
Hide resolved
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 27 out of 30 changed files in this pull request and generated 3 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| return [ | ||
| re.sub(r"\n{2,}", "\n", chunk.strip()) | ||
| for chunk in final_chunks | ||
| if chunk.strip() != "" | ||
| ] |
Copilot
AI
Sep 24, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This transformation logic should be configurable or documented. The hardcoded regex replacement of multiple newlines with single newlines may not be appropriate for all Chinese text processing scenarios.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
No description provided.