users_guide_eng

IlyaKozlov edited this page Dec 8, 2020 · 3 revisions

Dedoc

Dedoc reads documents in different formats (doc, docx, odt, csv and others) and extracts:

  • Metadata
  • Text and its formatting metadata (bold, italic, font size)
  • Logical structure (optional).

Dedoc is designed to be used through its REST API, but it can also be used from Python.

Installation

You can run dedoc inside a Docker container or directly on your system.

Run in Docker

  1. Ensure that Docker is installed.
  2. Clone the project:
git clone https://gitlab.at.ispras.ru/Ilya/dedoc_project.git

cd dedoc_project/
  3. Build the container and run it:
docker build . -t dedoc_container
docker run -p 1231:1231 --rm dedoc_container:latest python3.5 /dedoc/main.py

To check that dedoc is running, open localhost:1231 and make sure the online documentation loads. Once dedoc is running, you can read about the output format in that online documentation.
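
The same check can be scripted. The sketch below is a minimal example, assuming dedoc listens on localhost:1231 as in the docker run command above. The function name is_dedoc_running and the timeout value are illustrative and not part of dedoc. It uses only the Python standard library.

```python
import urllib.request


def is_dedoc_running(base_url="http://localhost:1231", timeout=5.0):
    """Return True if an HTTP server answers at base_url without an error status.

    Hypothetical helper for illustration; it is not part of dedoc.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # Refused connections, timeouts and HTTP error statuses (URLError,
        # HTTPError) are all raised as OSError subclasses.
        return False


if __name__ == "__main__":
    print("dedoc is running" if is_dedoc_running() else "dedoc is not reachable")
```

A True result only shows that something answers on that port. Opening the page in a browser remains the way to confirm that it is the dedoc documentation.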

Run without Docker

Dedoc was tested on Ubuntu 18 with Python 3.5. To install dedoc directly on your system, follow the steps described in the Dockerfile.

Verify installation

After launching dedoc, make sure that it works correctly:

  • go to http://localhost:1231
  • click Supported Formats (bottom of the page)
  • click any "result in html"
  • the parsed example document should open as an HTML page

Parse your own document

Example using Python and the requests library:

import json
import os

import requests

# Specify the directory and the file name.
directory_path = "..."
file_name = "..."

file_path = os.path.join(directory_path, file_name)

with open(file_path, "rb") as file:
    files = {"file": (file_name, file)}
    # Put additional parameters into the data dict; see the online docs for the available parameters.
    data = {}
    # Send the request and get the response.
    response = requests.post("http://localhost:1231/upload", files=files, data=data)

# Check that everything is OK.
if response.status_code != 200:
    raise Exception("Failed to parse file, status code {}".format(response.status_code))
# Parse the result from JSON.
result = json.loads(response.content.decode())
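
Once the response is parsed, it can be handy to keep a copy on disk and look at its top-level structure before reading the output-format documentation. The sketch below assumes that result is the dictionary produced by the request above. The helper name save_result and the output file name are hypothetical, and the dictionary in the usage lines is a stand-in, not real dedoc output.

```python
import json


def save_result(result, output_path):
    """Write the parsed response to output_path as indented UTF-8 JSON.

    Hypothetical helper for illustration; it is not part of dedoc.
    Returns the sorted top-level keys, assuming the response is a JSON object.
    """
    with open(output_path, "w", encoding="utf-8") as out:
        # ensure_ascii=False keeps non-ASCII document text readable in the file.
        json.dump(result, out, ensure_ascii=False, indent=2)
    return sorted(result)


if __name__ == "__main__":
    # Stand-in data; replace it with the `result` obtained from the request above.
    placeholder = {"example_key": "example value"}
    print(save_result(placeholder, "dedoc_result.json"))
```

The returned key list is only a quick orientation aid. The meaning of each field is described in the online documentation served by dedoc.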