Large documents cause stack overflow in TextChunker.SplitPlainTextParagraphs() #1633
github-merge-queue bot pushed a commit that referenced this issue on Jun 27, 2023:

### Motivation and Context

TextChunker.BuildParagraph is unnecessarily recursive and can lead to stack overflows on large inputs.

### Description

Fixes #1633

Also happens to make it more efficient with a few tweaks.

### Contribution Checklist

<!-- Before submitting this PR, please make sure: -->

- [x] The code builds clean without any errors or warnings
- [x] The PR follows SK Contribution Guidelines (https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
- [x] The code follows the .NET coding conventions (https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions) verified with `dotnet format`
- [x] All unit tests pass, and I have added new tests where possible
- [x] I didn't break anyone 😄

---------

Co-authored-by: Dmytro Struk <13853051+dmytrostruk@users.noreply.github.com>
shawncal pushed a commit to shawncal/semantic-kernel that referenced this issue on Jul 6, 2023
Describe the bug
When calling `TextChunker.SplitPlainTextLines()` followed by `TextChunker.SplitPlainTextParagraphs()` with a large document (7k+ lines output from the `SplitPlainTextLines()` call), the internal chunking logic causes a stack overflow exception in the final line of `TextChunker.BuildParagraph()`.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Document should upload, chunk, and process without issue.
Screenshots
![image](https://private-user-images.githubusercontent.com/135271843/247610269-66398e13-be9f-455c-ad9c-8462062f2259.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjA0NzQzMTYsIm5iZiI6MTcyMDQ3NDAxNiwicGF0aCI6Ii8xMzUyNzE4NDMvMjQ3NjEwMjY5LTY2Mzk4ZTEzLWJlOWYtNDU1Yy1hZDljLTg0NjIwNjJmMjI1OS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzA4JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcwOFQyMTI2NTZaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1iZTY5MzIzYjM4MGIxZGFhZDljN2UyMWQ0ZmU4NmExZWVkM2FjZGU0MDg5MDBiYTlkZmQyMTQ1YThmYzJjYTkwJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.cCOOMzO2YFFepQRI76jOrDRqOHuspE9HcrAx3zWnzYQ)
Desktop (please complete the following information):
Additional context
Document stats:
Characters: 650k+
Words: 115k+
Output from `TextChunker.SplitPlainTextLines()`: 7,214 lines
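
The failure mode can be illustrated outside of Semantic Kernel. The sketch below is a Python analogue, not the library's C# code: `group_lines_recursive` and `group_lines_iterative` are hypothetical helpers, paragraph size is measured in characters rather than tokens, and every input line is assumed to be non-empty. It shows why a grouping routine that recurses once per input line fails on roughly 7,000 lines, while an equivalent loop does not. Python raises `RecursionError` where .NET would terminate with a stack overflow.

```python
def group_lines_recursive(lines, max_chars, i=0, current="", out=None):
    """Hypothetical recursive grouping: one stack frame per input line."""
    out = [] if out is None else out
    if i == len(lines):
        if current:
            out.append(current)
        return out
    line = lines[i]
    candidate = line if not current else current + "\n" + line
    if current and len(candidate) > max_chars:
        # The current paragraph is full; close it and start a new one.
        out.append(current)
        candidate = line
    # Recursion depth grows with len(lines), not with paragraph size.
    return group_lines_recursive(lines, max_chars, i + 1, candidate, out)


def group_lines_iterative(lines, max_chars):
    """Same grouping rule as above, written as a loop (constant stack depth)."""
    out = []
    current = ""
    for line in lines:
        candidate = line if not current else current + "\n" + line
        if current and len(candidate) > max_chars:
            out.append(current)
            candidate = line
        current = candidate
    if current:
        out.append(current)
    return out


if __name__ == "__main__":
    # Synthetic stand-in for the 7,214 lines reported in this issue.
    lines = [f"line {i}" for i in range(7214)]
    print(len(group_lines_iterative(lines, 40)), "paragraphs from the loop version")
    try:
        group_lines_recursive(lines, 40)
    except RecursionError:
        print("recursive version exceeded the interpreter's recursion limit")
```

Both functions return the same paragraphs on small inputs. Only the recursive one depends on the number of lines fitting within the call stack, which matches the linked fix's description of the recursion as unnecessary.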