Add Splitter classes #51

ChenZiHong-Gavin · 2025-09-24T08:15:40Z

No description provided.

Copilot

Pull Request Overview

This PR adds new splitter classes to the graphgen project and refactors data types. The main purpose is to introduce text splitting functionality while consolidating data type definitions.

- Adds new text splitter classes for character-based, recursive character, and markdown splitting
- Consolidates data types (`Chunk` and `QAPair`) into a new `bases/datatypes.py` module
- Updates imports throughout the codebase to use the new datatype locations
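
For readers unfamiliar with the recursive character strategy named above, here is a minimal sketch of the general idea. It is not the code added in this PR: the function name `recursive_split`, its parameters, and the separator list are assumptions for illustration only. The text is split on the coarsest separator first, and any piece that is still too long is split again on the next separator.

```python
from typing import List


def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Hypothetical sketch: split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        # Short enough already; drop empty pieces.
        return [text] if text else []
    if not separators:
        # No separator left to try: fall back to fixed-width slicing.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces: List[str] = []
    for part in text.split(sep):
        # Recurse with the finer-grained separators for oversized parts.
        pieces.extend(recursive_split(part, chunk_size, rest))
    return pieces


paragraph_chunks = recursive_split("aaa\n\nbbb ccc", 4, ["\n\n", " "])
sliced_chunks = recursive_split("abcdefghij", 4, [])
```

This sketch never merges small neighbouring pieces back together and has no overlap handling, both of which a production splitter would normally add.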

Reviewed Changes

Copilot reviewed 20 out of 23 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `graphgen/bases/datatypes.py` | New consolidated datatype definitions for `Chunk` and `QAPair` |
| `graphgen/bases/base_splitter.py` | New abstract base class for text splitters |
| `graphgen/models/splitter/*.py` | New splitter implementations for different text splitting strategies |
| `tests/integration_tests/models/splitter/*.py` | Integration tests for the new splitter classes |
| Various model files | Import updates to use consolidated datatypes |

graphgen/bases/datatypes.py

Copilot · 2025-09-24T08:16:50Z

graphgen/bases/base_splitter.py

+                    index = text.find(chunk, max(0, offset))
+                    metadata["start_index"] = index
+                    previous_chunk_len = len(chunk)
+                new_chunk = Chunk(content=chunk, metadata=metadata)


Creating `Chunk` instances without the required `id` field will raise a `TypeError` at runtime. The `Chunk` dataclass requires all three fields (`id`, `content`, `metadata`), but only `content` and `metadata` are provided here.
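
To make the failure concrete, here is a minimal sketch. The `Chunk` definition below is an assumption reconstructed from the comment above (three required fields), not the actual contents of `graphgen/bases/datatypes.py`, and `make_chunk` is a hypothetical helper showing one possible fix: deriving the id from a hash of the content.

```python
import hashlib
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Chunk:
    # Assumed shape: three required fields, as described in the review comment.
    id: str
    content: str
    metadata: Dict[str, Any]


def make_chunk(content: str, metadata: Dict[str, Any]) -> Chunk:
    # Hypothetical helper: build a stable id from the chunk text.
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return Chunk(id="chunk-" + digest, content=content, metadata=metadata)


missing_id_error: Optional[str] = None
try:
    # Mirrors the call in the diff: no id is passed.
    Chunk(content="hello", metadata={})  # type: ignore[call-arg]
except TypeError as exc:
    missing_id_error = str(exc)

chunk = make_chunk("hello", {"start_index": 0})
```

A content hash gives identical ids to identical chunks, which may or may not be desirable; a counter or UUID would be an equally valid choice.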

tests/integration_tests/models/splitter/test_markdown_splitter.py

graphgen/models/splitter/recursive_character_splitter.py

Copilot

Pull Request Overview

Copilot reviewed 27 out of 30 changed files in this pull request and generated 3 comments.

Copilot · 2025-09-24T09:43:04Z

graphgen/models/splitter/recursive_character_splitter.py

+        return [
+            re.sub(r"\n{2,}", "\n", chunk.strip())
+            for chunk in final_chunks
+            if chunk.strip() != ""
+        ]


This transformation should be configurable, or at least documented. The hardcoded regex collapses every run of two or more newlines into a single newline, which may not be appropriate for all Chinese text processing scenarios.
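
One way to act on this suggestion is sketched below. The function name `clean_chunks` and the `collapse_newlines` flag are assumptions for illustration, not part of the PR. The cleanup is moved into a helper whose newline collapsing can be switched off.

```python
import re
from typing import Iterable, List


def clean_chunks(chunks: Iterable[str], collapse_newlines: bool = True) -> List[str]:
    """Strip each chunk, drop empty ones, and optionally collapse blank lines."""
    cleaned: List[str] = []
    for chunk in chunks:
        stripped = chunk.strip()
        if stripped == "":
            continue  # same filter as the original list comprehension
        if collapse_newlines:
            # Same regex as in the diff: runs of 2+ newlines become one.
            stripped = re.sub(r"\n{2,}", "\n", stripped)
        cleaned.append(stripped)
    return cleaned


collapsed = clean_chunks(["a\n\n\nb ", "   ", "c"])
preserved = clean_chunks(["a\n\nb"], collapse_newlines=False)
```

With the default left at `True`, existing behaviour is unchanged, and callers that need paragraph breaks preserved can opt out.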

graphgen/models/splitter/markdown_splitter.py

graphgen/graphgen.py

ChenZiHong-Gavin added 3 commits September 24, 2025 11:20

fix: update __init__.py in models

fad24d0

refactor: add datatypes

a430183

feat: add splitter classes

0bffcfc

ChenZiHong-Gavin requested a review from Copilot September 24, 2025 08:15

Copilot AI reviewed Sep 24, 2025

ChenZiHong-Gavin and others added 4 commits September 24, 2025 16:17

Update graphgen/bases/datatypes.py

b02307f

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update tests/integration_tests/models/splitter/test_markdown_splitter.py

86e9082

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update graphgen/models/splitter/recursive_character_splitter.py

797781d

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

feat(webui): update webui with splitter config

6a6cb34

ChenZiHong-Gavin requested a review from Copilot September 24, 2025 09:41

Copilot AI reviewed Sep 24, 2025

ChenZiHong-Gavin and others added 2 commits September 24, 2025 17:49

Update graphgen/models/splitter/markdown_splitter.py

d439262

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update graphgen/graphgen.py

fdaef0e

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

ChenZiHong-Gavin merged commit da90335 into main Sep 24, 2025
2 checks passed

ChenZiHong-Gavin deleted the splitter branch September 24, 2025 09:52
