Chunking code makes odd assumption #23
@Aemon-Algiz yeah, the chunking for different types of documents can be improved significantly... Some thoughts:
If anyone has any other thoughts, I'd be happy to hear them.
I modified this locally to use an NLP library to detect sentences. I think it's a bit better at dealing with different media types. No PR because my local version is a kludgy mess atm, but here's my replacement:
And a test:
(edit: forgot the import)
@Aemon-Algiz @lonelycode good idea using an NLP library. Does this library support most languages? I ended up going with github.com/neurosnap/sentences – check out my fix here: 2bff175
Based on the source, the chunking code seems to assume that a document will contain at least 20 sentences: anything with fewer than 20 sentences does not appear to produce any embeddings at all. That's probably not the desired behavior. It would be better to chunk based on token count rather than sentence count.