2 changes: 1 addition & 1 deletion .github/workflows/release.yaml
@@ -41,7 +41,7 @@ jobs:
- name: Push to dockerhub
if: ${{ success() }}
run: |
docker build -f docker/Dockerfile -t dedocproject/dedoc:$GITHUB_REF_NAME .
docker build -f Dockerfile -t dedocproject/dedoc:$GITHUB_REF_NAME .
docker login -u ${{ secrets.DOCKERHUB_USERNAME }} -p ${{ secrets.DOCKERHUB_PASSWORD }}
docker tag dedocproject/dedoc:$GITHUB_REF_NAME dedocproject/dedoc:latest
docker push dedocproject/dedoc:$GITHUB_REF_NAME
56 changes: 39 additions & 17 deletions README.md
@@ -4,10 +4,17 @@

![Dedoc](https://github.com/ispras/dedoc/raw/master/dedoc_logo.png)

Dedoc is an open universal system for converting documents to a unified output format. It extracts a document’s logical structure and content, its tables, text formatting and metadata. The document’s content is represented as a tree storing headings and lists of any level. Dedoc can be integrated in a document contents and structure analysis system as a separate module.
Dedoc is an open universal system for converting documents to a unified output format.
It extracts a document’s logical structure and content, its tables, text formatting and metadata.
The document’s content is represented as a tree storing headings and lists of any level.
Dedoc can be integrated in a document contents and structure analysis system as a separate module.

Relevant documentation of the dedoc is available [here](https://dedoc.readthedocs.io).

## Features and advantages
Dedoc is implemented in Python and works with semi-structured data formats (DOC/DOCX, ODT, XLS/XLSX, CSV, TXT, JSON) and none-structured data formats like images (PNG, JPG etc.), archives (ZIP, RAR etc.), PDF and HTML formats. Document structure extraction is fully automatic regardless of input data type. Metadata and text formatting is also extracted automatically.
Dedoc is implemented in Python and works with semi-structured data formats (DOC/DOCX, ODT, XLS/XLSX, CSV, TXT, JSON) and none-structured data formats like images (PNG, JPG etc.), archives (ZIP, RAR etc.), PDF and HTML formats.
Document structure extraction is fully automatic regardless of input data type.
Metadata and text formatting are also extracted automatically.

In 2022, the system won a grant to support the development of promising AI projects from the [Innovation Assistance Foundation (Фонд содействия инновациям)](https://fasie.ru/).

@@ -16,36 +23,35 @@ In 2022, the system won a grant to support the development of promising AI proje
* Support for extracting document structure out of nested documents having different formats.
* Extracting various text formatting features (indentation, font type, size, style etc.).
* Working with documents of various origin (statements of work, legal documents, technical reports, scientific papers) allowing flexible tuning for new domains.
* Working with PDF documents containinng a text layer:
* Support to automatically determine the correctness of the text layer in PDF documents;
* Extract containing and formatting from PDF-documents with a text layer using the developed interpreter of the virtual stack machine for printing graphics according to the format specification.
Extracting table data from DOC/DOCX, PDF, HTML, CSV and image formats:
* Working with PDF documents containing a textual layer:
* Support to automatically determine the correctness of the textual layer in PDF documents;
* Extract containing and formatting from PDF-documents with a textual layer using the developed interpreter of the virtual stack machine for printing graphics according to the format specification.
* Extracting table data from DOC/DOCX, PDF, HTML, CSV and image formats:
* Recognizing a physical structure and a cell text for complex multipage tables having explicit borders with the help of contour analysis.
* Working with scanned documents (image formats and PDF without text layer):
* Using Tesseract, an actively developed OCR engine from Google, together with image preprocessing methods.
* Utilizing modern machine learning approaches for detecting a document orientation, detecting single/multicolumn document page, detecting bold text and extracting hierarchical structure based on the classification of features extracted from document images.


This project may be useful as a first step of automatic document analysis pipeline (e.g. before the NLP part)
This project may be useful as a first step of automatic document analysis pipeline (e.g. before the NLP part).

This project has REST Api and you can run it in Docker container
To read full Dedoc documentation run the project and go to localhost:1231.
This project has REST Api and you can run it in Docker container.
Also, dedoc can be installed as a library via `pip`.
To read full Dedoc documentation go [here](https://dedoc.readthedocs.io).


## Run the project
How to build and run the project

Ensure you have Git and Docker installed
### Install and run dedoc using docker

Clone the project
```bash
git clone https://github.com/ispras/dedoc.git

cd dedoc/
cd dedoc
```

Ensure you have Docker installed.
Start 'Dedoc' on the port 1231:
Start `dedoc` on the port `1231`:
```bash
docker-compose up --build
```
@@ -55,6 +61,22 @@ Start Dedoc with tests:
test="true" docker-compose up --build
```

Now you can go to the localhost:1231 and look at the docs and examples.
Now you can go to the `localhost:1231` and look at the docs and examples.
You can change the port and host in the config file `dedoc/config.py`.

### Install dedoc using pip

The dedoc library can be installed via `pip`.
To satisfy all the library requirements, you should have `torch~=1.11.0` and `torchvision~=0.12.0` installed.
You can install versions of these libraries that suit your environment and then install dedoc with `pip`:

```bash
pip install dedoc
```

Or you can install dedoc with torch and torchvision included:
```bash
pip install "dedoc[torch]"
```

You can change the port and host in the config file 'dedoc/config.py'
Go [here](https://dedoc.readthedocs.io/en/latest/getting_started/installation.html) to get more details about dedoc installation.
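
Before running a plain `pip install dedoc`, it can help to check whether the installed `torch` and `torchvision` fall within the `~=1.11.0` and `~=0.12.0` ranges mentioned above. The following is a minimal sketch using only the standard library. The helper names (`release_tuple`, `is_compatible`, `check`) are hypothetical and not part of dedoc, and the comparison only approximates the `~=` operator: pre-release suffixes such as `rc1` are ignored.

```python
from importlib import metadata


def release_tuple(version: str) -> tuple:
    """Numeric release part of a version string, e.g. '1.11.0+cu113' -> (1, 11, 0)."""
    public = version.split("+", 1)[0]  # drop a local label such as '+cu113'
    parts = []
    for piece in public.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as '0rc1'; this sketch ignores them
        parts.append(int(piece))
    return tuple(parts)


def is_compatible(installed: str, base: str) -> bool:
    """Approximate '~=base': same leading components, last component may be higher."""
    have, want = release_tuple(installed), release_tuple(base)
    have = have + (0,) * (len(want) - len(have))  # pad '1.11' to (1, 11, 0)
    prefix = want[:-1]
    return have[:len(prefix)] == prefix and have >= want


def check(package: str, base: str) -> str:
    """Describe whether the installed distribution satisfies '~=base'."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed"
    status = "ok" if is_compatible(installed, base) else f"does not satisfy ~={base}"
    return f"{package} {installed}: {status}"


if __name__ == "__main__":
    print(check("torch", "1.11.0"))
    print(check("torchvision", "0.12.0"))
```

If either package is reported as missing or out of range, the `dedoc[torch]` extra shown above is probably the simpler route.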
9 changes: 9 additions & 0 deletions docs/source/index.rst
@@ -50,6 +50,8 @@ Reading documents using dedoc

Dedoc allows to get the common intermediate representation for the documents of various formats.
The resulting output of any reader is a class :class:`~dedoc.data_structures.UnstructuredDocument`.
See :ref:`readers' annotations <readers_annotations>` and :ref:`readers' line types <readers_line_types>`
to get more details about information that can be extracted by each available reader.

.. _table_formats:

@@ -220,6 +222,13 @@ For a document of unknown or unsupported domain there is an option to use defaul
dedoc_api_usage/return_format


.. toctree::
:maxdepth: 1
:caption: Readers output

readers_output/annotations
readers_output/line_types

.. toctree::
:maxdepth: 1
:caption: Structure types
159 changes: 159 additions & 0 deletions docs/source/readers_output/annotations.rst
@@ -0,0 +1,159 @@
.. _readers_annotations:

Text annotations
================

The readers listed below can return a non-empty list of annotations for document lines:

* `+` means that the reader can return the annotation.
* `-` means that the reader doesn't return the annotation, either because the task is too complex or because the format doesn't provide the information.

.. _table_annotations:

.. list-table:: Annotations returned by each reader
:widths: 20 10 10 10 10 10 10
:class: tight-table

* - **Annotation**
- :class:`~dedoc.readers.DocxReader`
- :class:`~dedoc.readers.HtmlReader`, :class:`~dedoc.readers.MhtmlReader`, :class:`~dedoc.readers.EmailReader`
- :class:`~dedoc.readers.RawTextReader`
- :class:`~dedoc.readers.PdfImageReader`
- :class:`~dedoc.readers.PdfTabbyReader`
- :class:`~dedoc.readers.PdfTxtlayerReader`

* - :class:`~dedoc.data_structures.AttachAnnotation`
- `+`
- `-`
- `-`
- `-`
- `-`
- `+`

* - :class:`~dedoc.data_structures.TableAnnotation`
- `+`
- `-`
- `-`
- `+`
- `+`
- `+`

* - :class:`~dedoc.data_structures.LinkedTextAnnotation`
- `+`
- `+`
- `-`
- `-`
- `+`
- `+`

* - :class:`~dedoc.data_structures.BBoxAnnotation`
- `-`
- `-`
- `-`
- `+`
- `+`
- `+`

* - :class:`~dedoc.data_structures.AlignmentAnnotation`
- `+`
- `+`
- `-`
- `-`
- `-`
- `-`

* - :class:`~dedoc.data_structures.IndentationAnnotation`
- `+`
- `-`
- `+`
- `+`
- `+`
- `+`

* - :class:`~dedoc.data_structures.SpacingAnnotation`
- `+`
- `-`
- `+`
- `+`
- `+`
- `+`

* - :class:`~dedoc.data_structures.BoldAnnotation`
- `+`
- `+`
- `-`
- `+`
- `+`
- `+`

* - :class:`~dedoc.data_structures.ItalicAnnotation`
- `+`
- `+`
- `-`
- `-`
- `+`
- `+`

* - :class:`~dedoc.data_structures.UnderlinedAnnotation`
- `+`
- `+`
- `-`
- `-`
- `-`
- `-`

* - :class:`~dedoc.data_structures.StrikeAnnotation`
- `+`
- `+`
- `-`
- `-`
- `-`
- `-`

* - :class:`~dedoc.data_structures.SubscriptAnnotation`
- `+`
- `+`
- `-`
- `-`
- `-`
- `-`

* - :class:`~dedoc.data_structures.SuperscriptAnnotation`
- `+`
- `+`
- `-`
- `-`
- `-`
- `-`

* - :class:`~dedoc.data_structures.ColorAnnotation`
- `-`
- `-`
- `-`
- `+`
- `-`
- `+`

* - :class:`~dedoc.data_structures.SizeAnnotation`
- `+`
- `+`
- `-`
- `+`
- `+`
- `+`

* - :class:`~dedoc.data_structures.StyleAnnotation`
- `+`
- `+`
- `-`
- `-`
- `+`
- `+`

* - :class:`~dedoc.data_structures.ConfidenceAnnotation`
- `-`
- `-`
- `-`
- `+`
- `-`
- `-`
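
The support matrix above can also be queried programmatically, for example to pick a reader that returns a particular annotation. The sketch below transcribes the table into a plain dictionary. The names `READERS`, `SUPPORT`, `readers_for` and `annotations_for` are illustrative and not part of the dedoc API, and the `HtmlReader` column also stands for `MhtmlReader` and `EmailReader`.

```python
# Column order of the table above; "HtmlReader" also covers MhtmlReader and EmailReader.
READERS = (
    "DocxReader", "HtmlReader", "RawTextReader",
    "PdfImageReader", "PdfTabbyReader", "PdfTxtlayerReader",
)

# One character per reader column: "+" means the annotation can be returned, "-" means it is not.
SUPPORT = {
    "AttachAnnotation":        "+----+",
    "TableAnnotation":         "+--+++",
    "LinkedTextAnnotation":    "++--++",
    "BBoxAnnotation":          "---+++",
    "AlignmentAnnotation":     "++----",
    "IndentationAnnotation":   "+-++++",
    "SpacingAnnotation":       "+-++++",
    "BoldAnnotation":          "++-+++",
    "ItalicAnnotation":        "++--++",
    "UnderlinedAnnotation":    "++----",
    "StrikeAnnotation":        "++----",
    "SubscriptAnnotation":     "++----",
    "SuperscriptAnnotation":   "++----",
    "ColorAnnotation":         "---+-+",
    "SizeAnnotation":          "++-+++",
    "StyleAnnotation":         "++--++",
    "ConfidenceAnnotation":    "---+--",
}


def readers_for(annotation: str) -> list:
    """Readers that can return the given annotation, in table column order."""
    return [reader for reader, flag in zip(READERS, SUPPORT[annotation]) if flag == "+"]


def annotations_for(reader: str) -> list:
    """Annotations that the given reader can return, in table row order."""
    column = READERS.index(reader)
    return [name for name, flags in SUPPORT.items() if flags[column] == "+"]


if __name__ == "__main__":
    print(readers_for("ColorAnnotation"))
    print(annotations_for("RawTextReader"))
```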
65 changes: 65 additions & 0 deletions docs/source/readers_output/line_types.rst
@@ -0,0 +1,65 @@
.. _readers_line_types:

Types of textual lines
======================

Each reader returns :class:`~dedoc.data_structures.UnstructuredDocument` with textual lines.
Readers don't fill the `hierarchy_level` metadata field (structure extractors do this), but they can fill `hierarchy_level_tag` with information about line types.
The readers listed below can return a non-empty `hierarchy_level_tag` in the metadata of document lines:

* `+` means that the reader can return lines of this type.
* `-` means that the reader doesn't return lines of this type, either because the task is too complex or because the format doesn't provide the information.

.. _table_line_types:

.. list-table:: Line types returned by each reader
:widths: 20 20 20 20 20
:class: tight-table

* - **Reader**
- **header**
- **list_item**
- **raw_text, unknown**
- **key**

* - :class:`~dedoc.readers.DocxReader`
- `+`
- `+`
- `+`
- `-`

* - :class:`~dedoc.readers.HtmlReader`, :class:`~dedoc.readers.MhtmlReader`, :class:`~dedoc.readers.EmailReader`
- `+`
- `+`
- `+`
- `-`

* - :class:`~dedoc.readers.RawTextReader`
- `-`
- `+`
- `+`
- `-`

* - :class:`~dedoc.readers.JsonReader`
- `-`
- `+`
- `+`
- `+`

* - :class:`~dedoc.readers.PdfImageReader`
- `-`
- `+`
- `+`
- `-`

* - :class:`~dedoc.readers.PdfTabbyReader`
- `+`
- `+`
- `+`
- `-`

* - :class:`~dedoc.readers.PdfTxtlayerReader`
- `-`
- `+`
- `+`
- `-`
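
As a rough illustration of how the table above might be used, the sketch below stores the line types per reader as sets and looks up which readers can emit a given type. `LINE_TYPES` and `readers_emitting` are hypothetical names, not dedoc API. The `raw_text` entry stands for the combined "raw_text, unknown" column, and `HtmlReader` also covers `MhtmlReader` and `EmailReader`.

```python
# Line types each reader can put into `hierarchy_level_tag`, transcribed from the table above.
LINE_TYPES = {
    "DocxReader":        {"header", "list_item", "raw_text"},
    "HtmlReader":        {"header", "list_item", "raw_text"},
    "RawTextReader":     {"list_item", "raw_text"},
    "JsonReader":        {"list_item", "raw_text", "key"},
    "PdfImageReader":    {"list_item", "raw_text"},
    "PdfTabbyReader":    {"header", "list_item", "raw_text"},
    "PdfTxtlayerReader": {"list_item", "raw_text"},
}


def readers_emitting(line_type: str) -> list:
    """Alphabetically sorted readers that can return lines of the given type."""
    return sorted(reader for reader, types in LINE_TYPES.items() if line_type in types)


if __name__ == "__main__":
    print(readers_emitting("header"))
    print(readers_emitting("key"))
```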