leeseomin/PDF-summarizer


PDF-summarizer

A program that extracts text from a PDF and then summarizes it. It can be used for summarizing papers, for example.
Summarization divides the extracted text into 9 parts, summarizes each part, and then combines the partial summaries.
Text extraction can handle PDFs of several hundred pages.
Summarization is best suited to shorter PDF documents such as papers.
Summarization takes a fairly long time to process.
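
The divide, summarize, and combine process described above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the function names `split_into_parts` and `summarize_in_parts` are hypothetical, the regex-based sentence split is an assumption (a spaCy pipeline could do this step instead), and `summarize` is a placeholder for whatever model is used, presumably BART via fairseq given the credits.

```python
import re
from typing import Callable, List


def split_into_parts(text: str, n_parts: int = 9) -> List[str]:
    """Split text into at most n_parts chunks along sentence boundaries."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    # (assumption; the real program may split differently).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    # Never create more parts than there are sentences.
    n_parts = min(n_parts, len(sentences))
    base, extra = divmod(len(sentences), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` parts receive one additional sentence.
        end = start + base + (1 if i < extra else 0)
        parts.append(" ".join(sentences[start:end]))
        start = end
    return parts


def summarize_in_parts(
    text: str, summarize: Callable[[str], str], n_parts: int = 9
) -> str:
    """Summarize each part separately and join the partial summaries."""
    return "\n".join(summarize(part) for part in split_into_parts(text, n_parts))
```

Passing the summarizer in as a function keeps the splitting logic independent of the model, so it can be tried out with a trivial stand-in before a slow model is loaded.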






Dependencies (tested on an M1 Mac, CPU version)

conda install pytorch torchvision torchaudio -c pytorch

pip install pdfminer.six

pip install spacy

python -m spacy download en_core_web_sm

Installing fairseq on macOS

git clone https://github.com/pytorch/fairseq

cd fairseq

CFLAGS="-stdlib=libc++" pip install --editable ./

If the installation above does not work on macOS, try installing fairseq from the conda base environment.

Installing fairseq with conda

conda install -c conda-forge fairseq

Installing fairseq from source

git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
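
After installing, a quick check like the one below can confirm that the packages listed above are importable. This is only a sketch: the helper `missing_modules` and the `REQUIRED` list are hypothetical, and the import names are assumed from the install commands (pdfminer.six is imported as `pdfminer`, and the spaCy model as `en_core_web_sm`).

```python
from importlib.util import find_spec
from typing import Iterable, List

# Top-level import names assumed from the install commands above.
REQUIRED = ["torch", "pdfminer", "spacy", "en_core_web_sm", "fairseq"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found in the current environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```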



Key Features

1. Extract text from a PDF and save it to a txt file under a different name.

2. Extract text from a PDF, summarize it, and save the summary to a txt file under a different name.
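
The first feature can be approximated with pdfminer.six as shown below. This is a sketch under stated assumptions rather than the code in V_072.py: the helper names and the `_extracted` suffix are hypothetical, chosen only to illustrate saving under a different name; `extract_text` is the high-level pdfminer.six function.

```python
from pathlib import Path


def output_path_for(pdf_path: str, suffix: str = "_extracted") -> Path:
    """Derive a txt path next to the PDF, e.g. paper.pdf -> paper_extracted.txt."""
    pdf = Path(pdf_path)
    return pdf.with_name(pdf.stem + suffix + ".txt")


def extract_pdf_to_txt(pdf_path: str) -> Path:
    """Extract all text from the PDF and write it to the derived txt file."""
    # Imported here so output_path_for works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    out_path = output_path_for(pdf_path)
    out_path.write_text(extract_text(pdf_path), encoding="utf-8")
    return out_path
```

For the second feature, the extracted text would be passed through the summarizer before writing, with a different suffix such as `_summary` (again a hypothetical name).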

Run Code

python V_072.py





Limitation

Large PDF files produce strange results when the text summarization function is used.

To Do

GPU version

Web app

Add the T5X algorithm (https://github.com/google-research/t5x)

Credit

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension : https://github.com/facebookresearch/fairseq/blob/main/examples/bart/README.md

pdfminer.six : https://github.com/pdfminer/pdfminer.six

spaCy : https://github.com/explosion/spaCy
