A PR to fix some bugs when dealing with PDF files: headers and footers, missing information when splitting chapters, etc. #193

Open
wants to merge 1 commit into main

Conversation

Wall-ee commented Apr 23, 2023

1. Add a function to remove headers and footers.
2. Fix a bug where the chapter under the final key of the chapter list was never processed.
3. Update the replace method to strip useless UTF-8 characters that appear in some papers.
4. Fix a bug in how text is merged when a chapter spans more than one page.
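
As a rough illustration of items 1 and 2, the sketch below works on plain per-page strings rather than on the project's actual parser. The function names `strip_repeated_margins` and `split_by_titles`, and the heuristic of treating a first or last line that recurs on most pages as a header or footer, are assumptions made for illustration; they are not the code in this PR.

```python
from collections import Counter


def strip_repeated_margins(pages, min_share=0.6):
    """Drop first/last lines that recur on at least `min_share` of pages.

    Hypothetical heuristic: running headers and footers repeat verbatim,
    body text does not. Footers that change per page (page numbers) are
    not caught by this simple check.
    """
    split_pages = [p.splitlines() for p in pages]
    firsts = Counter(lines[0].strip() for lines in split_pages if lines)
    lasts = Counter(lines[-1].strip() for lines in split_pages if lines)
    threshold = max(2, int(len(pages) * min_share))
    headers = {t for t, n in firsts.items() if n >= threshold}
    footers = {t for t, n in lasts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        if lines and lines[0].strip() in headers:
            lines = lines[1:]
        if lines and lines[-1].strip() in footers:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned


def split_by_titles(text, titles):
    """Map each title found in `text` to the text up to the next title.

    The last title runs to the end of the text, so the final chapter
    is not dropped.
    """
    positions = sorted(
        (text.find(t), t) for t in titles if text.find(t) != -1
    )
    chapters = {}
    for i, (start, title) in enumerate(positions):
        # For the final title there is no "next" position: use end of text.
        end = positions[i + 1][0] if i + 1 < len(positions) else len(text)
        chapters[title] = text[start + len(title):end].strip()
    return chapters
```

The point of `split_by_titles` is the explicit end-of-text bound for the last entry; a loop that only slices between consecutive titles silently loses the final chapter.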

Submitting a PR to fix some tricky problems in PDF processing. These problems matter more for papers that are more theoretical.

kaixindelele (Owner) commented

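Hello, and thank you for your effort and for the pull request. On PDF parsing: the code in the current open-source version is the most general-purpose version we have arrived at through testing, and extracting too many features leads to scrambled output. So we will not merge your pull request for now. One feature we would like to add is this: if the abs and intro sections are not extracted, feed in the text of the first two pages directly, which should still give a fairly good summary. Would you have the interest and time to work on that? Also, when you open a pull request, please include a reasonable number of test examples, screenshots, and a written description.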
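
The fallback proposed in this comment could look roughly like the following. This is only a sketch: `build_summary_input`, the section keys "abstract" and "introduction", and the shapes of `sections` and `pages` are assumptions, not names from the repository. It also assumes the fallback applies only when neither section was extracted.

```python
def build_summary_input(sections, pages, fallback_pages=2):
    """Return text to summarize: abstract + intro if found, else first pages.

    `sections` maps section titles to text; `pages` is a list of per-page
    text. Both shapes are assumed for illustration.
    """
    wanted = ("abstract", "introduction")
    found = [
        text
        for title, text in sections.items()
        if title.strip().lower() in wanted and text.strip()
    ]
    if found:
        return "\n\n".join(found)
    # Neither section was extracted: fall back to the raw leading pages.
    return "\n\n".join(pages[:fallback_pages])
```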

Wall-ee (Author) commented Apr 30, 2023

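That is a good idea, and I can work on it. That said, some of the cases I raised are general problems, so I will find a sample PDF. The papers I deal with are mostly biomedical, and their layouts tend to be unusual. pymupdf's default ordering is sometimes wrong, so when a chapter spans pages the text gets stitched together incorrectly.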
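
One way to picture the cross-page problem: if blocks are concatenated in the order the extractor emits them, a chapter that continues onto the next page can be joined out of order. The sketch below reorders blocks within each page by vertical and then horizontal position before joining. It assumes each block is a tuple whose first five fields are `(x0, y0, x1, y1, text)`, which is the shape PyMuPDF returns from `page.get_text("blocks")`. The helper name `join_blocks_in_reading_order` is hypothetical, and plain top-to-bottom sorting is an assumption that will not suit multi-column layouts.

```python
def join_blocks_in_reading_order(pages_blocks):
    """Concatenate block text page by page, top-to-bottom then left-to-right.

    `pages_blocks` is a list (one entry per page) of block tuples whose
    first five fields are (x0, y0, x1, y1, text).
    """
    parts = []
    for blocks in pages_blocks:
        # Sort by rounded top edge first, then by left edge.
        ordered = sorted(blocks, key=lambda b: (round(b[1]), b[0]))
        parts.extend(b[4].strip() for b in ordered if b[4].strip())
    return "\n".join(parts)
```

With PyMuPDF, the input could be built as `[page.get_text("blocks") for page in doc]`; whether this ordering fixes the biomedical layouts mentioned here would need checking against a sample PDF.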
