PDF-summarizer

A program that extracts text from a PDF and then summarizes it. It can be used, for example, to summarize papers.
The summarizer divides the given text into 9 parts, summarizes each part, and then combines the results.
Text extraction works on PDFs of several hundred pages; summarization is best suited to shorter documents such as papers.
Summarization takes a while to process.
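
The split, summarize, and combine flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: `split_into_parts`, `summarize_text`, and `first_sentence` are hypothetical names, the regex sentence splitter is an assumption (the project lists spaCy, which may be what it uses for this step), and `first_sentence` is only a stand-in for the real BART summarizer.

```python
import re
from typing import Callable, List

NUM_PARTS = 9  # the README states that the text is divided into 9 parts


def split_into_parts(text: str, num_parts: int = NUM_PARTS) -> List[str]:
    """Split text into at most num_parts chunks made of whole sentences."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    num_parts = min(num_parts, len(sentences))
    base, extra = divmod(len(sentences), num_parts)
    parts, start = [], 0
    for i in range(num_parts):
        # The first `extra` parts take one additional sentence each.
        end = start + base + (1 if i < extra else 0)
        parts.append(" ".join(sentences[start:end]))
        start = end
    return parts


def summarize_text(
    text: str,
    summarize_part: Callable[[str], str],
    num_parts: int = NUM_PARTS,
) -> str:
    """Summarize each part separately, then join the partial summaries."""
    return "\n".join(summarize_part(p) for p in split_into_parts(text, num_parts))


def first_sentence(part: str) -> str:
    """Stand-in summarizer (hypothetical): keep only the first sentence."""
    return re.split(r"(?<=[.!?])\s+", part, maxsplit=1)[0]
```

Calling `summarize_text(text, first_sentence)` returns one line per part. Any function that maps a string to a shorter string can be passed in place of `first_sentence`.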

Dependencies (tested on an M1 Mac, CPU version)

conda install pytorch torchvision torchaudio -c pytorch

pip install pdfminer.six

pip install spacy

python -m spacy download en_core_web_sm

Installing fairseq on macOS

git clone https://github.com/pytorch/fairseq

cd fairseq

CFLAGS="-stdlib=libc++" pip install --editable ./

If the installation above does not work on a Mac, try installing from the conda base environment.

Installing fairseq with conda

conda install -c conda-forge fairseq

Installing fairseq from source

git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

Key Features

1. Extract text from a PDF and save it to a txt file under a different name.

2. Extract text from a PDF, summarize it, and save the summary to a txt file under a different name.
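
As an illustration of the first feature, the sketch below extracts text with `extract_text` from `pdfminer.high_level` (part of pdfminer.six) and writes it to a txt file whose name differs from the PDF's. The helper names `output_path_for` and `extract_to_txt` and the `_text` suffix are assumptions made for this example; the script in this repository may name its output files differently.

```python
from pathlib import Path


def output_path_for(pdf_path: str, suffix: str = "_text") -> Path:
    """Return a .txt path next to the PDF, with a suffix added to its name."""
    pdf = Path(pdf_path)
    return pdf.with_name(pdf.stem + suffix + ".txt")


def extract_to_txt(pdf_path: str) -> Path:
    """Extract all text from a PDF and write it to a separate txt file."""
    # Imported here so that output_path_for works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    out_path = output_path_for(pdf_path)
    out_path.write_text(extract_text(pdf_path), encoding="utf-8")
    return out_path
```

For example, `extract_to_txt("paper.pdf")` would write `paper_text.txt` in the same folder. The second feature would follow the same pattern, with the extracted text passed through a summarizer before it is written.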

Run Code

python V_072.py

Limitation

Large PDF files produce strange results when the text summarization function is used.

To Do

GPU version

Web app

Add the T5X algorithm (https://github.com/google-research/t5x)

Credit

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension : https://github.com/facebookresearch/fairseq/blob/main/examples/bart/README.md

pdfminer.six : https://github.com/pdfminer/pdfminer.six

spaCy : https://github.com/explosion/spaCy
