diff --git a/.github/workflows/release.yaml b/.github/workflows/release.yaml
index 2594ee25..3aaf7639 100644
--- a/.github/workflows/release.yaml
+++ b/.github/workflows/release.yaml
@@ -41,7 +41,7 @@ jobs:
       - name: Push to dockerhub
         if: ${{ success() }}
         run: |
-          docker build -f docker/Dockerfile -t dedocproject/dedoc:$GITHUB_REF_NAME .
+          docker build -f Dockerfile -t dedocproject/dedoc:$GITHUB_REF_NAME .
           docker login -u ${{ secrets.DOCKERHUB_USERNAME }} -p ${{ secrets.DOCKERHUB_PASSWORD }}
           docker tag dedocproject/dedoc:$GITHUB_REF_NAME dedocproject/dedoc:latest
           docker push dedocproject/dedoc:$GITHUB_REF_NAME
diff --git a/README.md b/README.md
index f821b75d..105502e3 100644
--- a/README.md
+++ b/README.md
@@ -4,10 +4,17 @@

 ![Dedoc](https://github.com/ispras/dedoc/raw/master/dedoc_logo.png)

-Dedoc is an open universal system for converting documents to a unified output format. It extracts a document’s logical structure and content, its tables, text formatting and metadata. The document’s content is represented as a tree storing headings and lists of any level. Dedoc can be integrated in a document contents and structure analysis system as a separate module.
+Dedoc is an open universal system for converting documents to a unified output format.
+It extracts a document’s logical structure and content, its tables, text formatting and metadata.
+The document’s content is represented as a tree storing headings and lists of any level.
+Dedoc can be integrated in a document contents and structure analysis system as a separate module.
+
+Relevant documentation of the dedoc is available [here](https://dedoc.readthedocs.io).

 ## Features and advantages
-Dedoc is implemented in Python and works with semi-structured data formats (DOC/DOCX, ODT, XLS/XLSX, CSV, TXT, JSON) and none-structured data formats like images (PNG, JPG etc.), archives (ZIP, RAR etc.), PDF and HTML formats. Document structure extraction is fully automatic regardless of input data type. Metadata and text formatting is also extracted automatically.
+Dedoc is implemented in Python and works with semi-structured data formats (DOC/DOCX, ODT, XLS/XLSX, CSV, TXT, JSON) and none-structured data formats like images (PNG, JPG etc.), archives (ZIP, RAR etc.), PDF and HTML formats.
+Document structure extraction is fully automatic regardless of input data type.
+Metadata and text formatting are also extracted automatically.

 In 2022, the system won a grant to support the development of promising AI projects from the [Innovation Assistance Foundation (Фонд содействия инновациям)](https://fasie.ru/).

@@ -16,36 +23,35 @@ In 2022, the system won a grant to support the development of promising AI proje
 * Support for extracting document structure out of nested documents having different formats.
 * Extracting various text formatting features (indentation, font type, size, style etc.).
 * Working with documents of various origin (statements of work, legal documents, technical reports, scientific papers) allowing flexible tuning for new domains.
-* Working with PDF documents containinng a text layer:
-  * Support to automatically determine the correctness of the text layer in PDF documents;
-  * Extract containing and formatting from PDF-documents with a text layer using the developed interpreter of the virtual stack machine for printing graphics according to the format specification.
-Extracting table data from DOC/DOCX, PDF, HTML, CSV and image formats:
+* Working with PDF documents containing a textual layer:
+  * Support to automatically determine the correctness of the textual layer in PDF documents;
+  * Extract containing and formatting from PDF-documents with a textual layer using the developed interpreter of the virtual stack machine for printing graphics according to the format specification.
+* Extracting table data from DOC/DOCX, PDF, HTML, CSV and image formats:
   * Recognizing a physical structure and a cell text for complex multipage tables having explicit borders with the help of contour analysis.
 * Working with scanned documents (image formats and PDF without text layer):
   * Using Tesseract, an actively developed OCR engine from Google, together with image preprocessing methods.
   * Utilizing modern machine learning approaches for detecting a document orientation, detecting single/multicolumn document page, detecting bold text and extracting hierarchical structure based on the classification of features extracted from document images.

-This project may be useful as a first step of automatic document analysis pipeline (e.g. before the NLP part)
+This project may be useful as a first step of automatic document analysis pipeline (e.g. before the NLP part).

-This project has REST Api and you can run it in Docker container
-To read full Dedoc documentation run the project and go to localhost:1231.
+This project has REST Api and you can run it in Docker container.
+Also, dedoc can be installed as a library via `pip`.
+To read full Dedoc documentation go [here](https://dedoc.readthedocs.io).

 ## Run the project
-How to build and run the project

-Ensure you have Git and Docker installed
-
+### Install and run dedoc using docker
+
 Clone the project
 ```bash
 git clone https://github.com/ispras/dedoc.git
-
-cd dedoc/
+cd dedoc
 ```

 Ensure you have Docker installed.

-Start 'Dedoc' on the port 1231:
+Start `dedoc` on the port `1231`:
 ```bash
 docker-compose up --build
 ```
@@ -55,6 +61,22 @@ Start Dedoc with tests:
 test="true" docker-compose up --build
 ```

-Now you can go to the localhost:1231 and look at the docs and examples.
+Now you can go to the `localhost:1231` and look at the docs and examples.
+You can change the port and host in the config file `dedoc/config.py`.
+
+### Install dedoc using pip
+
+One may install the dedoc library via `pip`.
+To fulfil all the library requirements, you should have `torch~=1.11.0` and `torchvision~=0.12.0` installed.
+You can install suitable for you versions of these libraries and install dedoc using `pip` command:
+
+```bash
+pip install dedoc
+```
+
+Or you can install dedoc with torch and torchvision included:
+```bash
+pip install "dedoc[torch]"
+```

-You can change the port and host in the config file 'dedoc/config.py'
+Go [here](https://dedoc.readthedocs.io/en/latest/getting_started/installation.html) to get more details about dedoc installation.
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 8978f173..1cf30272 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -50,6 +50,8 @@ Reading documents using dedoc

 Dedoc allows to get the common intermediate representation for the documents of various formats.
 The resulting output of any reader is a class :class:`~dedoc.data_structures.UnstructuredDocument`.
+See :ref:`readers' annotations <readers_annotations>` and :ref:`readers' line types <readers_line_types>`
+to get more details about information that can be extracted by each available reader.

 .. _table_formats:

@@ -220,6 +222,13 @@ For a document of unknown or unsupported domain there is an option to use defaul

     dedoc_api_usage/return_format

+.. toctree::
+    :maxdepth: 1
+    :caption: Readers output
+
+    readers_output/annotations
+    readers_output/line_types
+
 .. toctree::
     :maxdepth: 1
     :caption: Structure types
diff --git a/docs/source/readers_output/annotations.rst b/docs/source/readers_output/annotations.rst
new file mode 100644
index 00000000..e177ed53
--- /dev/null
+++ b/docs/source/readers_output/annotations.rst
@@ -0,0 +1,159 @@
+.. _readers_annotations:
+
+Text annotations
+================
+
+Below the readers are enlisted that can return non-empty list of annotations for document lines:
+
+* `+` means that the reader can return the annotation.
+* `-` means that the reader doesn't return the annotation due to complexity of the task or lack of information provided by the format.
+
+.. _table_annotations:
+
+.. list-table:: Annotations returned by each reader
+    :widths: 20 10 10 10 10 10 10
+    :class: tight-table
+
+    * - **Annotation**
+      - :class:`~dedoc.readers.DocxReader`
+      - :class:`~dedoc.readers.HtmlReader`, :class:`~dedoc.readers.MhtmlReader`, :class:`~dedoc.readers.EmailReader`
+      - :class:`~dedoc.readers.RawTextReader`
+      - :class:`~dedoc.readers.PdfImageReader`
+      - :class:`~dedoc.readers.PdfTabbyReader`
+      - :class:`~dedoc.readers.PdfTxtlayerReader`
+
+    * - :class:`~dedoc.data_structures.AttachAnnotation`
+      - `+`
+      - `-`
+      - `-`
+      - `-`
+      - `-`
+      - `+`
+
+    * - :class:`~dedoc.data_structures.TableAnnotation`
+      - `+`
+      - `-`
+      - `-`
+      - `+`
+      - `+`
+      - `+`
+
+    * - :class:`~dedoc.data_structures.LinkedTextAnnotation`
+      - `+`
+      - `+`
+      - `-`
+      - `-`
+      - `+`
+      - `+`
+
+    * - :class:`~dedoc.data_structures.BBoxAnnotation`
+      - `-`
+      - `-`
+      - `-`
+      - `+`
+      - `+`
+      - `+`
+
+    * - :class:`~dedoc.data_structures.AlignmentAnnotation`
+      - `+`
+      - `+`
+      - `-`
+      - `-`
+      - `-`
+      - `-`
+
+    * - :class:`~dedoc.data_structures.IndentationAnnotation`
+      - `+`
+      - `-`
+      - `+`
+      - `+`
+      - `+`
+      - `+`
+
+    * - :class:`~dedoc.data_structures.SpacingAnnotation`
+      - `+`
+      - `-`
+      - `+`
+      - `+`
+      - `+`
+      - `+`
+
+    * - :class:`~dedoc.data_structures.BoldAnnotation`
+      - `+`
+      - `+`
+      - `-`
+      - `+`
+      - `+`
+      - `+`
+
+    * - :class:`~dedoc.data_structures.ItalicAnnotation`
+      - `+`
+      - `+`
+      - `-`
+      - `-`
+      - `+`
+      - `+`
+
+    * - :class:`~dedoc.data_structures.UnderlinedAnnotation`
+      - `+`
+      - `+`
+      - `-`
+      - `-`
+      - `-`
+      - `-`
+
+    * - :class:`~dedoc.data_structures.StrikeAnnotation`
+      - `+`
+      - `+`
+      - `-`
+      - `-`
+      - `-`
+      - `-`
+
+    * - :class:`~dedoc.data_structures.SubscriptAnnotation`
+      - `+`
+      - `+`
+      - `-`
+      - `-`
+      - `-`
+      - `-`
+
+    * - :class:`~dedoc.data_structures.SuperscriptAnnotation`
+      - `+`
+      - `+`
+      - `-`
+      - `-`
+      - `-`
- `-` + + * - :class:`~dedoc.data_structures.ColorAnnotation` + - `-` + - `-` + - `-` + - `+` + - `-` + - `+` + + * - :class:`~dedoc.data_structures.SizeAnnotation` + - `+` + - `+` + - `-` + - `+` + - `+` + - `+` + + * - :class:`~dedoc.data_structures.StyleAnnotation` + - `+` + - `+` + - `-` + - `-` + - `+` + - `+` + + * - :class:`~dedoc.data_structures.ConfidenceAnnotation` + - `-` + - `-` + - `-` + - `+` + - `-` + - `-` diff --git a/docs/source/readers_output/line_types.rst b/docs/source/readers_output/line_types.rst new file mode 100644 index 00000000..826a4807 --- /dev/null +++ b/docs/source/readers_output/line_types.rst @@ -0,0 +1,65 @@ +.. _readers_line_types: + +Types of textual lines +====================== + +Each reader returns :class:`~dedoc.data_structures.UnstructuredDocument` with textual lines. +Readers don't fill `hierarchy_level` metadata field (structure extractors do this), but they can fill `hierarchy_level_tag` with information about line types. +Below the readers are enlisted that can return non-empty `hierarchy_level_tag` in document lines metadata: + +* `+` means that the reader can return lines of this type. +* `-` means that the reader doesn't return lines of this type due to complexity of the task or lack of information provided by the format. + +.. _table_line_types: + +.. 
list-table:: Line types returned by each reader + :widths: 20 20 20 20 20 + :class: tight-table + + * - **Reader** + - **header** + - **list_item** + - **raw_text, unknown** + - **key** + + * - :class:`~dedoc.readers.DocxReader` + - `+` + - `+` + - `+` + - `-` + + * - :class:`~dedoc.readers.HtmlReader`, :class:`~dedoc.readers.MhtmlReader`, :class:`~dedoc.readers.EmailReader` + - `+` + - `+` + - `+` + - `-` + + * - :class:`~dedoc.readers.RawTextReader` + - `-` + - `+` + - `+` + - `-` + + * - :class:`~dedoc.readers.JsonReader` + - `-` + - `+` + - `+` + - `+` + + * - :class:`~dedoc.readers.PdfImageReader` + - `-` + - `+` + - `+` + - `-` + + * - :class:`~dedoc.readers.PdfTabbyReader` + - `+` + - `+` + - `+` + - `-` + + * - :class:`~dedoc.readers.PdfTxtlayerReader` + - `-` + - `+` + - `+` + - `-`