fix: remove blank space between words when use pdf-loader #3292

Closed
ppxu wanted to merge 2 commits

Conversation

ppxu
Contributor

@ppxu ppxu commented Nov 16, 2023

Extends #3218

@jacoblee93
Collaborator

jacoblee93 commented Nov 16, 2023

I had a PDF that seemed to need this; see the test case. I generated it from Google Docs, so I assume it's pretty typical.

There may be some better way to parse the loader output?

@jacoblee93 jacoblee93 added the hold On hold label Nov 16, 2023
@jacoblee93
Collaborator

We could make it a config option?

@ppxu
Contributor Author

ppxu commented Nov 16, 2023

I see. I always use pdf-loader to load Chinese PDFs, so for me this problem is very typical.
I added a skipBlank option to pdf-loader for cases where it is needed.

@jacoblee93
Collaborator

Thanks for being persistent, will merge #3306 instead to allow potentially for other types of separators.
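
For readers following the thread, the idea of a configurable separator can be sketched roughly as below. This is only an illustration under stated assumptions, not the loader's actual code: the `TextItem` shape and the `joinTextItems` helper are hypothetical names made up for this example. It shows why joining parsed items with a space suits English-style text, while an empty separator suits Chinese text.

```typescript
// Hypothetical shape of a parsed PDF text item (assumption for this sketch;
// real parsers expose more fields than just the string).
interface TextItem {
  str: string;
}

// Hypothetical helper, not the library's implementation: joins the parsed
// items using a caller-supplied separator, defaulting to a single space.
function joinTextItems(items: TextItem[], separator: string = " "): string {
  return items.map((item) => item.str).join(separator);
}

const englishItems: TextItem[] = [{ str: "Hello" }, { str: "world" }];
const chineseItems: TextItem[] = [{ str: "你好" }, { str: "世界" }];

// A space separator keeps English words apart.
const joinedEnglish = joinTextItems(englishItems);

// An empty separator avoids inserting blanks between Chinese words.
const joinedChinese = joinTextItems(chineseItems, "");
```

Taking a separator string rather than a boolean flag such as skipBlank leaves room for other separators (for example a newline) without adding further options.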

@jacoblee93 jacoblee93 closed this Nov 16, 2023
Labels
auto:bug (Related to a bug, vulnerability, unexpected error with an existing feature)
hold (On hold)

2 participants