CVAT does not work when annotating PDFs #915

philippschw · 2019-12-08T10:48:24Z

Hi,

I am trying to annotate PDF documents instead of images with CVAT and have run into a number of problems that I cannot resolve on my own. I am using the develop branch, because the CVAT Docker image does not build successfully on the master branch.

I can upload a single PDF document (with many pages), but not several PDF documents at once. The error message states that only one PDF is allowed, but it would be helpful to understand the rationale for this:

ValueError: Only one video, archive, pdf or many image, directory can be used simultaneously, but 0 image(s), 0 video(s), 0 archive(s), 2 pdf(s), 0 directory(s) found.

The conversion from PDF to image with pdf2image does not work, because poppler is missing from the Dockerfile. I fixed this by adding the following to the Dockerfile:

# Install poppler for working with pdfs
RUN apt-get update && apt-get install -y poppler-utils
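
To confirm that the rebuilt image actually ships the poppler binaries, a quick check from Python can help. This is a minimal sketch, assuming pdf2image relies on the `pdftoppm` and `pdfinfo` executables from poppler-utils being on PATH. The helper name `missing_poppler_tools` is hypothetical and not part of CVAT or pdf2image.

```python
import shutil


def missing_poppler_tools(tools=("pdftoppm", "pdfinfo")):
    """Return the names of the given executables that are not on PATH."""
    # shutil.which returns None when an executable cannot be found.
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Missing poppler tools:", ", ".join(missing))
    else:
        print("poppler tools found")
```

Running this inside the container before and after the Dockerfile change should show whether the install step took effect.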

After annotating a few items, I attempted to dump the annotations. The dump fails no matter which format I choose; the error message is below. Note that dumping annotations for PNG images works perfectly, so this seems to be a problem specific to PDFs.

2019-12-07 23:45:12,475 DEBG 'rqworker_default_1' stderr output:
23:45:12 default: cvat.apps.engine.annotation.dump_task_data('5', <SimpleLazyObject: <User: admin>>, '/home/django/data/5/5_IDP.admin.2019_12_07_23_45_12.zip', <AnnotationDumper: AnnotationDumper object (YOLO ZIP 1.0)>, 'http', 'localhost:8080') (admin@/api/v1/tasks/5/annotations/YOLO ZIP 1.0/5_IDP)

2019-12-07 23:45:12,574 DEBG 'rqworker_default_1' stderr output:
23:45:12 cvat.apps.engine.utils.InterpreterError: ValueError at line 308: '.upload' is not in list
Traceback (most recent call last):
  File "/home/django/cvat/apps/engine/utils.py", line 45, in execute_python_code
    exec(source_code, global_vars, local_vars)
  File "<string>", line 1, in <module>
  File "<string>", line 104, in dump
  File "/home/django/cvat/apps/annotation/annotation.py", line 325, in group_by_frame
    _get_frame(annotations, shape).labeled_shapes.append(self._export_labeled_shape(shape))
  File "/home/django/cvat/apps/annotation/annotation.py", line 308, in _get_frame
    rpath = os.path.sep.join(rpath[rpath.index(".upload")+1:])
ValueError: '.upload' is not in list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/rq/worker.py", line 812, in perform_job
    rv = job.perform()
  File "/usr/local/lib/python3.5/dist-packages/rq/job.py", line 588, in perform
    self._result = self._execute()
  File "/usr/local/lib/python3.5/dist-packages/rq/job.py", line 594, in _execute
    return self.func(*self.args, **self.kwargs)
  File "/home/django/cvat/apps/engine/annotation.py", line 135, in dump_task_data
    annotation.dump(filename, dumper, scheme, host)
  File "/home/django/cvat/apps/engine/annotation.py", line 740, in dump
    execute_python_code("{}(file_object, annotations)".format(dumper.handler), global_vars)
  File "/home/django/cvat/apps/engine/utils.py", line 60, in execute_python_code
    raise InterpreterError("{} at line {}: {}".format(error_class, line_number, details))
cvat.apps.engine.utils.InterpreterError: ValueError at line 308: '.upload' is not in list
Traceback (most recent call last):
  File "/home/django/cvat/apps/engine/utils.py", line 45, in execute_python_code
    exec(source_code, global_vars, local_vars)
  File "<string>", line 1, in <module>
  File "<string>", line 104, in dump
  File "/home/django/cvat/apps/annotation/annotation.py", line 325, in group_by_frame
    _get_frame(annotations, shape).labeled_shapes.append(self._export_labeled_shape(shape))
  File "/home/django/cvat/apps/annotation/annotation.py", line 308, in _get_frame
    rpath = os.path.sep.join(rpath[rpath.index(".upload")+1:])
ValueError: '.upload' is not in list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/rq/worker.py", line 812, in perform_job
    rv = job.perform()
  File "/usr/local/lib/python3.5/dist-packages/rq/job.py", line 588, in perform
    self._result = self._execute()
  File "/usr/local/lib/python3.5/dist-packages/rq/job.py", line 594, in _execute
    return self.func(*self.args, **self.kwargs)
  File "/home/django/cvat/apps/engine/annotation.py", line 135, in dump_task_data
    annotation.dump(filename, dumper, scheme, host)
  File "/home/django/cvat/apps/engine/annotation.py", line 740, in dump
    execute_python_code("{}(file_object, annotations)".format(dumper.handler), global_vars)
  File "/home/django/cvat/apps/engine/utils.py", line 60, in execute_python_code
    raise InterpreterError("{} at line {}: {}".format(error_class, line_number, details))
cvat.apps.engine.utils.InterpreterError: ValueError at line 308: '.upload' is not in list

2019-12-07 23:45:15,528 DEBG 'runserver' stderr output:
[Sat Dec 07 23:45:15.528224 2019] [wsgi:error] [pid 151:tid 139962191009536] [remote 172.19.0.1:33606] [2019-12-07 23:45:15,528] ERROR django.request: Internal Server Error: /api/v1/tasks/5/annotations/5_IDP
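
The failing line looks up the `.upload` path component with `list.index`, which raises `ValueError` when that component is absent, as appears to be the case for frames extracted from a PDF. The sketch below shows one defensive way to compute the relative path. It is not the actual CVAT fix: the function name `relative_frame_path`, the fallback to the file's base name, and the assumption that `rpath` comes from splitting the frame path on `os.path.sep` are all illustrative.

```python
import os


def relative_frame_path(frame_path, marker=".upload"):
    """Return the part of frame_path after `marker`, or the base name."""
    parts = frame_path.split(os.path.sep)
    if marker in parts:
        # Same slicing as the line in the traceback, but only when the
        # marker is present, so list.index cannot raise ValueError.
        return os.path.sep.join(parts[parts.index(marker) + 1:])
    # Assumed fallback: frames without the marker keep only their file name.
    return os.path.basename(frame_path)


uploaded = os.path.join("data", "5", ".upload", "scans", "page.png")
extracted = os.path.join("data", "5", "data", "0", "0", "3.jpg")
uploaded_rel = relative_frame_path(uploaded)
extracted_rel = relative_frame_path(extracted)
```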


nmanovic · 2019-12-08T14:29:50Z

@philippschw , thanks for the report. It looks like a bug.

benhoff · 2019-12-08T20:54:10Z

I can't speak to the dumper errors.

As for the rationale behind only being able to load a single PDF: I submitted this feature while working a job for a client. All the client needed was the ability to upload a single PDF per task, and I had many, many other responsibilities :)

The upload code can easily be extended to account for your use case.

You would need to wrap lines 92 - 97 in a for loop. Line 92 is linked below:

https://github.com/opencv/cvat/blob/1ec89b5f6a445aaa86854356cb73deb7e070d346/cvat/apps/engine/media_extractors.py#L92

I think the DirectoryExtractor has a somewhat relevant example. The only difference is that the name in file_ = convert_from_path(self._source_path) is a little misleading: I believe file_ is actually a list of page images, each of which needs to be handled.

The relevant section of DirectoryExtractor code is linked below.

https://github.com/opencv/cvat/blob/1ec89b5f6a445aaa86854356cb73deb7e070d346/cvat/apps/engine/media_extractors.py#L129

Below is a take on my comments from above.

# Note: The following code would replace the existing code starting at:
# https://github.com/opencv/cvat/blob/1ec89b5f6a445aaa86854356cb73deb7e070d346/cvat/apps/engine/media_extractors.py#L91

self._dimensions = []
count = 0
for source in source_path:
    for root, _, files in os.walk(source):
        paths = [os.path.join(root, f) for f in files]
        paths = filter(lambda x: get_mime(x) == 'pdf', paths)
        for path in paths:
            pages = convert_from_path(path)
            for page in pages:
                # Note: There's probably a better way to assign a name than using `count`
                output = os.path.join(self._temp_directory, str(count) + '.jpg')
                count += 1
                self._dimensions.append(page.size)
                page.save(output, 'JPEG')

self._length = len(os.listdir(self._temp_directory))

# Note: you would need to redefine the below method for `PDFExtractor`
def _get_imagepath(self, k):
    img_path = os.path.join(self._temp_directory, str(k) + '.jpg')
    return img_path

# Note: You would need to change `unique` to be `False` in the following line:
# https://github.com/opencv/cvat/blob/1ec89b5f6a445aaa86854356cb73deb7e070d346/cvat/apps/engine/media_extractors.py#L276

But I don't have any way to test the above code currently.
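
One way to exercise the page-saving loop without poppler or real PDFs is to pass in stand-in page objects. The following is a sketch under that assumption: `save_pages` and `FakePage` are hypothetical names, and the only thing taken from the code above is that each page exposes `size` and `save(path, format)`, as the PIL images returned by `convert_from_path` do.

```python
import os
import tempfile


def save_pages(pages, out_dir, start=0):
    """Save pages as <index>.jpg in out_dir and return their sizes."""
    dimensions = []
    for offset, page in enumerate(pages):
        # Sequential names keep frames from different PDFs from colliding.
        output = os.path.join(out_dir, "{}.jpg".format(start + offset))
        page.save(output, "JPEG")
        dimensions.append(page.size)
    return dimensions


class FakePage:
    """Hypothetical stand-in for a PIL image: has .size and .save()."""

    def __init__(self, size):
        self.size = size

    def save(self, path, image_format):
        # Write a placeholder file instead of real JPEG data.
        with open(path, "wb") as handle:
            handle.write(b"placeholder")


with tempfile.TemporaryDirectory() as tmp_dir:
    # Two "PDFs": the second one continues numbering where the first stopped.
    first = save_pages([FakePage((100, 200)), FakePage((100, 200))], tmp_dir)
    second = save_pages([FakePage((300, 400))], tmp_dir, start=len(first))
    saved_files = sorted(os.listdir(tmp_dir))
```

With pdf2image installed, the same function could be called as `save_pages(convert_from_path(path), out_dir, start=count)` for each PDF.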

philippschw · 2019-12-09T17:03:51Z

Thanks for your detailed response.

Unfortunately, the code fails silently. Although the UI reports that the task has been created, the task does not appear in the overview, ready for annotation.

The following log output shows that no frames have been created for the task.

2019-12-09 16:49:23,218 DEBG 'rqworker_default_1' stderr output:
16:49:23 default: cvat.apps.engine.task._create_thread(1, {'server_files': [], 'remote_files': [], 'client_files': ['DP_Telekom_Lexware_Unbekannt.pdf', 'DATEV.PDF']}) (/api/v1/tasks/1)

2019-12-09 16:49:23,230 DEBG 'rqworker_default_1' stderr output:
[2019-12-09 16:49:23,230] INFO cvat.server: create task #1

2019-12-09 16:49:23,245 DEBG 'rqworker_default_1' stderr output:
[2019-12-09 16:49:23,245] INFO cvat.server: Founded frames 0 for task #1

What is more, no .jpg file is saved in the data folder when I upload PDFs (project ID 1), whereas uploading images (project ID 2) works as expected:

django@2e82356a9f21:~/data$ ls 1/data/
django@2e82356a9f21:~/data$ ls 2/data/0/0/
0.jpg  1.jpg
django@2e82356a9f21:~/data$

I am using your code with only minimal adaptations, in cvat/cvat/apps/engine/media_extractors.py:

        self._dimensions = []
        count = 0
        for source in source_path:
            for root, _, files in os.walk(source):
                paths = [os.path.join(root, f) for f in files]
                paths = filter(lambda x: get_mime(x) == 'pdf', paths)
                for path in paths:
                    pages = convert_from_path(path)
                    for page in pages:
                        # Note: There's probably a better way to assign a name than using `count`
                        output = os.path.join(self._temp_directory, str(count) + '.jpg')
                        count += 1
                        self._dimensions.append(page.size)
                        page.save(output, 'JPEG')

        self._length = len(os.listdir(self._temp_directory))

    def _get_imagepath(self, k):
        img_path = os.path.join(self._temp_directory, str(k) + '.jpg')
        return img_path

Complete Code:
https://github.com/philippschw/cvat

benhoff · 2019-12-09T18:19:27Z

I'm sorry about that!

I think you need to change the constructor as well. In the class's __init__ method, I grab only the first element of source_path via source_path[0]. See here:

https://github.com/philippschw/cvat/blob/9f39b55fb1e80f4906ebfb67d00f79769e428083/cvat/apps/engine/media_extractors.py#L83

I think the constructor needs to look like the one in ImageListExtractor. Instead of grabbing index 0, it applies sorted to the list. See here:

https://github.com/philippschw/cvat/blob/9f39b55fb1e80f4906ebfb67d00f79769e428083/cvat/apps/engine/media_extractors.py#L42
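
A minimal sketch of that constructor change, assuming the extractor receives a list of uploaded PDF paths. `PdfSourceList` is a hypothetical stand-in, not the real CVAT class, and the empty-list check is an added assumption.

```python
class PdfSourceList:
    """Hypothetical stand-in showing only the constructor logic."""

    def __init__(self, source_path):
        if not source_path:
            raise ValueError("at least one PDF path is required")
        # Keep every path in a deterministic order instead of
        # taking only source_path[0].
        self._source_path = sorted(source_path)


sources = PdfSourceList(["b.pdf", "a.pdf"])
```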

Have you thought about submitting a pull request to CVAT? You could start the title of the pull request with [WIP] to show that you're still iterating on it.

benhoff · 2019-12-10T13:08:31Z

I think the for loop needs some tweaking. I added too many steps.

        for source in source_path:
            pages = convert_from_path(source)
            for page in pages:
                output = os.path.join(self._temp_directory, str(count) + '.jpg')
                count += 1
                self._dimensions.append(page.size)
                page.save(output, 'JPEG')

I think the above code should replace the loop starting at this line:

https://github.com/philippschw/cvat/blob/9f39b55fb1e80f4906ebfb67d00f79769e428083/cvat/apps/engine/media_extractors.py#L93
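
The earlier version most likely produced zero frames because `os.walk` yields nothing when it is given a path to a file rather than a directory, so the inner loops never ran. This assumes each entry in `source_path` is a file path, as the `client_files` list in the log suggests. The snippet below demonstrates the behavior with a throwaway file; the name `example.pdf` is made up for the demonstration.

```python
import os
import tempfile


def walk_entries(path):
    """Collect everything os.walk yields for the given path."""
    return list(os.walk(path))


with tempfile.TemporaryDirectory() as tmp_dir:
    pdf_path = os.path.join(tmp_dir, "example.pdf")
    with open(pdf_path, "wb") as handle:
        handle.write(b"%PDF-1.4\n")

    # Walking a file path yields no (root, dirs, files) tuples at all.
    file_walk = walk_entries(pdf_path)
    # Walking the containing directory yields one tuple listing the file.
    dir_walk = walk_entries(tmp_dir)
```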

nmanovic · 2021-11-28T17:28:45Z

I will close the issue as outdated. I was able to create a task with a book in pdf format.

nmanovic added the bug label Dec 8, 2019

nmanovic added this to the 1.0.0 - Release milestone Dec 8, 2019

This was referenced Jan 21, 2020:

fix multiple pdfs bug #1083 (Closed)

Add feature to upload and annotate multiple pdfs #1088 (Closed)

nmanovic modified the milestones: 1.0.0-release, 1.1.0-release May 23, 2020

nmanovic removed this from the 1.1.0-release milestone Nov 28, 2021

nmanovic closed this as completed Nov 28, 2021
