🔖 DocumentExtracter

Functions

Extract text from .pdf, .docx, .hwp, and .txt files (the extraction is not always perfect).
REST API

Install

pip install -r requirements.txt

If you want to use `tika`, please install a JDK. (`tika` usually works better.)

The `tika` library requires Java 7+ to be installed on your system.
For details, refer to https://github.com/chrismattmann/tika-python

# Install OpenJDK
## 1. On macOS
brew install cask
brew install --cask adoptopenjdk8 # or adoptopenjdk11

## 2. On Linux
apt-get -y install --no-install-recommends default-jdk-headless
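
Because the `tika` backend depends on a Java runtime, a script may want to fall back to `pdfminer` when Java is missing. The following is a minimal sketch of that idea, not part of this repository: `choose_pdf_backend` is a hypothetical helper, and looking up `java` on `PATH` is only a rough signal (a launcher stub can exist without a working JDK).

```python
import shutil
from typing import Callable, Optional


def choose_pdf_backend(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return 'tika' if a `java` executable is found on PATH, else 'pdfminer'.

    Hypothetical helper (not part of DocEx). `which` is injectable so the
    lookup can be replaced in tests.
    """
    java_path = which("java")
    return "tika" if java_path else "pdfminer"


if __name__ == "__main__":
    # The result depends on the machine this runs on.
    print(choose_pdf_backend())
```

The returned string can be passed as the `backend` argument of `pdf.get_pdf_text`.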

Usage

1. Use text extraction only

from DocEx import pdf, hwp, docx, txt

# Extract text from PDF
extracted_text = pdf.get_pdf_text(file_save_path, backend='tika') # backend options: ['tika', 'pdfminer']

# Extract text from DOCX
extracted_text = docx.get_docx_text(file_save_path)

# Extract text from HWP
extracted_text = hwp.get_hwp_text(file_save_path)

# Extract text from TXT
extracted_text = txt.get_txt_text(file_save_path)
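
When the file type is only known at runtime, the four calls above can be wrapped in a small dispatcher keyed by file extension. This is a sketch under assumptions: `extract_by_extension` is a hypothetical helper, not something DocEx provides, and the extractors are passed in as a mapping so the example runs without DocEx installed.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_by_extension(path: str, extractors: Dict[str, Callable[[str], str]]) -> str:
    """Call the extractor registered for the file's extension.

    `extractors` maps a lowercase suffix such as '.pdf' to a function that
    takes a file path and returns the extracted text.
    """
    suffix = Path(path).suffix.lower()
    if suffix not in extractors:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    return extractors[suffix](path)


if __name__ == "__main__":
    # Stand-in extractor for demonstration; with DocEx installed the mapping
    # could instead hold pdf.get_pdf_text, docx.get_docx_text,
    # hwp.get_hwp_text, and txt.get_txt_text.
    demo = {".txt": lambda p: f"(text of {p})"}
    print(extract_by_extension("notes.TXT", demo))
```

Note that `pdf.get_pdf_text` also accepts a `backend` argument, so a real mapping might wrap it in a `lambda` or `functools.partial` to fix the backend.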

2. Run as a REST API (with FastAPI)

# Run locally
uvicorn main:app

# To accept connections from external IPs and keep running in the background
nohup uvicorn main:app --host 0.0.0.0 &

Endpoints

/pdf-extract
/docx-extract
/hwp-extract
/txt-extract
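
A client can pick the endpoint that matches a file's extension. The sketch below only builds the URL. It assumes the server was started with `uvicorn main:app` on uvicorn's default host and port (`127.0.0.1:8000`), and `endpoint_url` is a hypothetical helper. The HTTP method and the upload field name are not stated in this README, so the request itself is left out.

```python
from pathlib import Path

# Endpoint paths as listed in this README, keyed by file extension.
ENDPOINTS = {
    ".pdf": "/pdf-extract",
    ".docx": "/docx-extract",
    ".hwp": "/hwp-extract",
    ".txt": "/txt-extract",
}


def endpoint_url(file_path: str, base_url: str = "http://127.0.0.1:8000") -> str:
    """Return the full URL of the extraction endpoint for `file_path`.

    `base_url` defaults to uvicorn's default bind address (an assumption
    about how the server was started).
    """
    suffix = Path(file_path).suffix.lower()
    if suffix not in ENDPOINTS:
        raise ValueError(f"No endpoint for file type: {suffix!r}")
    return base_url.rstrip("/") + ENDPOINTS[suffix]


if __name__ == "__main__":
    print(endpoint_url("report.pdf"))
```

FastAPI serves interactive documentation at `/docs` by default, which should show the exact request format each endpoint expects.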

References

checklist

docx2python - Extract docx headers, footers, text, footnotes, endnotes, properties, and images to a Python object.
