Conversation

@ChenZiHong-Gavin
Collaborator

No description provided.

Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds new splitter classes to the graphgen project and refactors data types. The main purpose is to introduce text splitting functionality while consolidating data type definitions.

  • Adds new text splitter classes for character-based, recursive character, and markdown splitting
  • Consolidates data types (Chunk and QAPair) into a new bases/datatypes.py module
  • Updates imports throughout the codebase to use the new datatype locations

Reviewed Changes

Copilot reviewed 20 out of 23 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| graphgen/bases/datatypes.py | New consolidated datatype definitions for Chunk and QAPair |
| graphgen/bases/base_splitter.py | New abstract base class for text splitters |
| graphgen/models/splitter/*.py | New splitter implementations for different text splitting strategies |
| tests/integration_tests/models/splitter/*.py | Integration tests for the new splitter classes |
| Various model files | Import updates to use consolidated datatypes |

index = text.find(chunk, max(0, offset))
metadata["start_index"] = index
previous_chunk_len = len(chunk)
new_chunk = Chunk(content=chunk, metadata=metadata)
Copilot AI Sep 24, 2025

Creating Chunk instances without the required id field will fail at runtime: a dataclass constructor raises TypeError when a required field is missing. The Chunk dataclass requires all three fields (id, content, metadata), but only content and metadata are provided here.
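
A minimal sketch of the problem and one possible fix, assuming Chunk is a plain dataclass with three required fields as the comment describes. The `Chunk` class below is a hypothetical stand-in rather than the definition in graphgen/bases/datatypes.py, and `make_chunk_id` and `build_chunk` are illustrative helpers that do not come from the PR.

```python
import hashlib
from dataclasses import dataclass


# Hypothetical stand-in for the Chunk dataclass described in the comment;
# the real definition in graphgen/bases/datatypes.py may differ.
@dataclass
class Chunk:
    id: str
    content: str
    metadata: dict


def make_chunk_id(content: str) -> str:
    """Illustrative helper (an assumption): derive a stable id from the chunk text."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return "chunk-" + digest[:16]


def build_chunk(content: str, metadata: dict) -> Chunk:
    """Construct a Chunk with all three required fields supplied."""
    return Chunk(id=make_chunk_id(content), content=content, metadata=metadata)


def missing_id_raises() -> bool:
    """Return True if omitting the id field raises TypeError."""
    try:
        Chunk(content="hello", metadata={})  # id is missing on purpose
    except TypeError:
        return True
    return False
```

Deriving the id from the content is only one option; a counter or a UUID would work as well, depending on whether ids need to be stable across runs.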

ChenZiHong-Gavin and others added 4 commits September 24, 2025 16:17
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 27 out of 30 changed files in this pull request and generated 3 comments.


Comment on lines +145 to +149
return [
    re.sub(r"\n{2,}", "\n", chunk.strip())
    for chunk in final_chunks
    if chunk.strip() != ""
]
Copilot AI Sep 24, 2025

This transformation logic should be configurable or at least documented. The hardcoded regex collapses every run of two or more newlines into a single newline, which removes paragraph breaks inside a chunk and may not be appropriate for all Chinese text processing scenarios.
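
One way to make the behavior configurable is sketched below, assuming the cleanup can be moved into a standalone function. The name `clean_chunks` and the `collapse_newlines` flag are hypothetical and are not part of the PR; the default keeps the behavior of the reviewed lines.

```python
import re
from typing import List


def clean_chunks(final_chunks: List[str], collapse_newlines: bool = True) -> List[str]:
    """Strip chunks, drop empty ones, and optionally collapse blank lines.

    With collapse_newlines=True this matches the reviewed list comprehension:
    every run of two or more newlines becomes a single newline.
    """
    cleaned = []
    for chunk in final_chunks:
        stripped = chunk.strip()
        if stripped == "":
            continue  # skip whitespace-only chunks
        if collapse_newlines:
            stripped = re.sub(r"\n{2,}", "\n", stripped)
        cleaned.append(stripped)
    return cleaned
```

Callers that need paragraph breaks preserved could pass `collapse_newlines=False`, while existing callers would see no change.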

ChenZiHong-Gavin and others added 2 commits September 24, 2025 17:49
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@ChenZiHong-Gavin ChenZiHong-Gavin merged commit da90335 into main Sep 24, 2025
2 checks passed
@ChenZiHong-Gavin ChenZiHong-Gavin deleted the splitter branch September 24, 2025 09:52

2 participants