🐛 doc_type='pdf' no longer works #75

Closed
matthewmcintire opened this issue Aug 20, 2020 · 1 comment · Fixed by #77

@matthewmcintire

Describe the bug
After the latest update, pdf mode no longer works. Newlines always seem to be treated as sentence boundaries.
To Reproduce
Steps to reproduce the behavior:
Input text - "This is a sentence\ncut off in the middle because pdf."

Expected behavior
Expected output - "This is a sentence\ncut off in the middle because pdf."

@nipunsadvilkar
Owner

@matthewmcintire Hey, it's recommended to use doc_type="pdf" together with clean=True, since the cleaner trims those intermediate newlines. Note that you can then no longer use the char_span functionality, because the original text gets modified.
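For readers who want to try the recommended combination, here is a minimal sketch. It assumes the `pysbd` package is installed and that `pysbd.Segmenter` accepts the `language`, `clean`, and `doc_type` keyword arguments; the helper name `segment_pdf_text` is made up for illustration and is not part of the library.

```python
def segment_pdf_text(text, language="en"):
    """Segment text extracted from a PDF using the recommended settings.

    Hypothetical helper for illustration; not part of pySBD itself.
    """
    # pysbd is a third-party package, imported lazily so this module
    # can still be loaded where it is not installed.
    import pysbd

    # clean=True lets the cleaner trim intermediate newlines.
    # char_span is left at its default (off) because the cleaned text
    # no longer lines up with the original input.
    segmenter = pysbd.Segmenter(language=language, clean=True, doc_type="pdf")
    return segmenter.segment(text)
```

Calling `segment_pdf_text("This is a sentence\ncut off in the middle because pdf.")` returns a list of sentence strings built from the cleaned text rather than the original input.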

Thanks for pointing this out.
I will update the tests to raise an exception and force users to follow the usage described above.
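The kind of guard described here could look roughly like the sketch below. This is an assumption about the shape of such a check, not the code that was merged: the function name `validate_segmenter_options` and the error messages are hypothetical.

```python
def validate_segmenter_options(clean=False, doc_type=None, char_span=False):
    """Reject option combinations that cannot work together.

    Hypothetical validation sketch; names and messages are illustrative
    and are not taken from pySBD's source.
    """
    # Character spans refer to the original text, which cleaning modifies.
    if clean and char_span:
        raise ValueError("char_span cannot be used together with clean=True")
    # PDF mode relies on the cleaner to remove intermediate newlines.
    if doc_type == "pdf" and not clean:
        raise ValueError("doc_type='pdf' requires clean=True and char_span=False")
```

With a check like this, `validate_segmenter_options(doc_type="pdf")` fails immediately with a `ValueError`, whereas `validate_segmenter_options(clean=True, doc_type="pdf")` passes silently.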
