Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large documents cause stack overflow in TextChunker.SplitPlainTextParagraphs() #1633

Closed
bdobabraham opened this issue Jun 21, 2023 · 0 comments · Fixed by #1709
Closed

Large documents cause stack overflow in TextChunker.SplitPlainTextParagraphs() #1633

bdobabraham opened this issue Jun 21, 2023 · 0 comments · Fixed by #1709

Comments

@bdobabraham
Copy link

Describe the bug
When calling TextChunker.SplitPlainTextLines() followed by TextChunker.SplitPlainTextParagraphs() with a large document (7k+ lines output from SplitPlainTextLines() call.) The internal chunking logic causes a stack overflow exception in the final line of TextChunker.BuildParagraph().

To Reproduce
Steps to reproduce the behavior:

  1. Setup CoPilot demo
  2. Find a large document with lots of text
  3. Upload document to CoPilot app
  4. Crash will happen during upload process

Expected behavior
Document should upload, chunk, and process without issue.

Screenshots
image

Desktop (please complete the following information):

  • OS: Windows
  • IDE: Visual Studio 17.6.3
  • NuGet Package Version: 0.16.230615.1-preview

Additional context
Document stats:
Characters: 650k+
Words: 115k+
Output from TextChunker.SplitPlainTextLines(): 7,214 lines

github-merge-queue bot pushed a commit that referenced this issue Jun 27, 2023
### Motivation and Context

TextChunker.BuildParagraph is unnecessarily recursive and can lead to
stack overflows on large inputs.

### Description

Fixes #1633

Also happens to make it more efficient with a few tweaks.

### Contribution Checklist
<!-- Before submitting this PR, please make sure: -->
- [x] The code builds clean without any errors or warnings
- [x] The PR follows SK Contribution Guidelines
(https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
- [x] The code follows the .NET coding conventions
(https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions)
verified with `dotnet format`
- [x] All unit tests pass, and I have added new tests where possible
- [x] I didn't break anyone 😄

---------

Co-authored-by: Dmytro Struk <13853051+dmytrostruk@users.noreply.github.com>
shawncal pushed a commit to shawncal/semantic-kernel that referenced this issue Jul 6, 2023
### Motivation and Context

TextChunker.BuildParagraph is unnecessarily recursive and can lead to
stack overflows on large inputs.

### Description

Fixes microsoft#1633

Also happens to make it more efficient with a few tweaks.

### Contribution Checklist
<!-- Before submitting this PR, please make sure: -->
- [x] The code builds clean without any errors or warnings
- [x] The PR follows SK Contribution Guidelines
(https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
- [x] The code follows the .NET coding conventions
(https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions)
verified with `dotnet format`
- [x] All unit tests pass, and I have added new tests where possible
- [x] I didn't break anyone 😄

---------

Co-authored-by: Dmytro Struk <13853051+dmytrostruk@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants