jaehyeongAN/DocumentExtracter

🔖 DocumentExtracter

Functions

  • Extract text from .pdf, .docx, .hwp, and .txt files (note that the extraction is not perfect)
  • REST API

Install

pip install -r requirements.txt

To use the Tika backend, install a JDK first. (Tika usually gives better results.)

# Install OpenJDK
## 1. On macOS
brew install cask
brew install --cask adoptopenjdk8 # or adoptopenjdk11

## 2. On Linux (Debian/Ubuntu)
apt-get -y install --no-install-recommends default-jdk-headless
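
Since the Tika backend depends on Java, it can be handy to pick the PDF backend at runtime. The following is a minimal sketch, not part of this repository: `choose_pdf_backend` is a hypothetical helper that returns `'tika'` when a `java` executable is found on the PATH and `'pdfminer'` otherwise. A `java` entry on the PATH does not guarantee a working runtime, so treat the result as a hint.

```python
import shutil
from typing import Callable, Optional


def choose_pdf_backend(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return 'tika' if a `java` executable is on PATH, else 'pdfminer'.

    Hypothetical helper (not provided by DocEx). The `which` parameter
    exists only so the lookup can be replaced in tests.
    """
    return "tika" if which("java") else "pdfminer"


if __name__ == "__main__":
    print(choose_pdf_backend())
```

The returned string can then be passed as the `backend` argument of `pdf.get_pdf_text`.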

Usages

1. Use text extraction only

from DocEx import pdf, hwp, docx, txt

# Extract text from PDF
extracted_text = pdf.get_pdf_text(file_save_path, backend='tika') # backend options: ['tika', 'pdfminer']

# Extract text from DOCX
extracted_text = docx.get_docx_text(file_save_path)

# Extract text from HWP
extracted_text = hwp.get_hwp_text(file_save_path)

# Extract text from TXT
extracted_text = txt.get_txt_text(file_save_path)
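
When the input type is not known in advance, the four calls above can be wrapped in a small dispatcher keyed on the file extension. This is only a sketch: `extract_text` and `docex_extractors` are hypothetical names, not part of DocEx, and the only DocEx calls used are the four shown above.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_text(path: str, extractors: Dict[str, Callable[[str], str]]) -> str:
    """Pick an extractor by file extension (case-insensitive) and run it."""
    suffix = Path(path).suffix.lower()
    try:
        extractor = extractors[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None
    return extractor(path)


def docex_extractors(pdf_backend: str = "tika") -> Dict[str, Callable[[str], str]]:
    """Build the extension -> function table from the DocEx modules.

    DocEx is imported lazily so that this module can be loaded (and
    `extract_text` tested) without the package installed.
    """
    from DocEx import pdf, hwp, docx, txt

    return {
        ".pdf": lambda p: pdf.get_pdf_text(p, backend=pdf_backend),
        ".docx": docx.get_docx_text,
        ".hwp": hwp.get_hwp_text,
        ".txt": txt.get_txt_text,
    }
```

A call would then look like `extract_text(file_save_path, docex_extractors(pdf_backend='pdfminer'))`.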

2. Run as a REST API (with FastAPI)

# Run locally
uvicorn main:app

# Listen on all network interfaces and keep running in the background
nohup uvicorn main:app --host 0.0.0.0 &

Endpoints

  • /pdf-extract
  • /docx-extract
  • /hwp-extract
  • /txt-extract
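
The request format of these endpoints is not documented here, so the client below is a sketch under stated assumptions: it assumes each endpoint accepts a `POST` with a multipart upload in a field named `file`, and that the server runs on uvicorn's default address `http://127.0.0.1:8000`. Check `main.py` (or the `/docs` page that FastAPI serves by default) for the actual parameter names. `endpoint_url` and `upload` are hypothetical helpers, and `requests` is a third-party package.

```python
from pathlib import Path

# Endpoint paths as listed above, keyed by file extension.
ENDPOINTS = {
    ".pdf": "/pdf-extract",
    ".docx": "/docx-extract",
    ".hwp": "/hwp-extract",
    ".txt": "/txt-extract",
}


def endpoint_url(file_path: str, base_url: str = "http://127.0.0.1:8000") -> str:
    """Return the full URL of the endpoint matching the file's extension."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in ENDPOINTS:
        raise ValueError(f"No endpoint for file type: {suffix!r}")
    return base_url.rstrip("/") + ENDPOINTS[suffix]


def upload(file_path: str, base_url: str = "http://127.0.0.1:8000") -> str:
    """Send a file to the matching endpoint and return the raw response body."""
    import requests  # third-party: pip install requests

    with open(file_path, "rb") as fh:
        # ASSUMPTION: POST method and multipart field name "file".
        response = requests.post(
            endpoint_url(file_path, base_url), files={"file": fh}, timeout=60
        )
    response.raise_for_status()
    return response.text
```

The raw body is returned as text because the response schema is not specified above.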

References


checklist

  • docx2python - Extract docx headers, footers, text, footnotes, endnotes, properties, and images to a Python object.
