CVAT does not work when annotating PDFs #915

Closed
philippschw opened this issue Dec 8, 2019 · 6 comments
Labels
bug Something isn't working

Comments

@philippschw

philippschw commented Dec 8, 2019

Hi,

I am trying to annotate PDF documents instead of images with CVAT and have run into several problems that I cannot resolve on my own. I am using the develop branch because the CVAT Docker image does not build successfully on the master branch.

  1. I can upload only a single PDF document (with many pages), not several PDF documents. The error message says that only one PDF may be uploaded, but it would be helpful to understand the rationale for this:
    ValueError: Only one video, archive, pdf or many image, directory can be used simultaneously, but 0 image(s), 0 video(s), 0 archive(s), 2 pdf(s), 0 directory(s) found.

  2. The conversion from pdf to image with pdf2image is not working, because poppler is missing from the Dockerfile. I fixed it by adding it to the Dockerfile:

# Install poppler for working with pdfs
RUN apt-get update && apt-get install -y poppler-utils
  3. After annotating a few items, I tried to dump the annotations, and it fails regardless of the format I choose. The error message is below. Note that dumping annotations for PNG images works perfectly, so the problem seems to be specific to PDFs.
2019-12-07 23:45:12,475 DEBG 'rqworker_default_1' stderr output:
23:45:12 default: cvat.apps.engine.annotation.dump_task_data('5', <SimpleLazyObject: <User: admin>>, '/home/django/data/5/5_IDP.admin.2019_12_07_23_45_12.zip', <AnnotationDumper: AnnotationDumper object (YOLO ZIP 1.0)>, 'http', 'localhost:8080') (admin@/api/v1/tasks/5/annotations/YOLO ZIP 1.0/5_IDP)

2019-12-07 23:45:12,574 DEBG 'rqworker_default_1' stderr output:
23:45:12 cvat.apps.engine.utils.InterpreterError: ValueError at line 308: '.upload' is not in list
Traceback (most recent call last):
  File "/home/django/cvat/apps/engine/utils.py", line 45, in execute_python_code
    exec(source_code, global_vars, local_vars)
  File "<string>", line 1, in <module>
  File "<string>", line 104, in dump
  File "/home/django/cvat/apps/annotation/annotation.py", line 325, in group_by_frame
    _get_frame(annotations, shape).labeled_shapes.append(self._export_labeled_shape(shape))
  File "/home/django/cvat/apps/annotation/annotation.py", line 308, in _get_frame
    rpath = os.path.sep.join(rpath[rpath.index(".upload")+1:])
ValueError: '.upload' is not in list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/rq/worker.py", line 812, in perform_job
    rv = job.perform()
  File "/usr/local/lib/python3.5/dist-packages/rq/job.py", line 588, in perform
    self._result = self._execute()
  File "/usr/local/lib/python3.5/dist-packages/rq/job.py", line 594, in _execute
    return self.func(*self.args, **self.kwargs)
  File "/home/django/cvat/apps/engine/annotation.py", line 135, in dump_task_data
    annotation.dump(filename, dumper, scheme, host)
  File "/home/django/cvat/apps/engine/annotation.py", line 740, in dump
    execute_python_code("{}(file_object, annotations)".format(dumper.handler), global_vars)
  File "/home/django/cvat/apps/engine/utils.py", line 60, in execute_python_code
    raise InterpreterError("{} at line {}: {}".format(error_class, line_number, details))
cvat.apps.engine.utils.InterpreterError: ValueError at line 308: '.upload' is not in list
Traceback (most recent call last):
  File "/home/django/cvat/apps/engine/utils.py", line 45, in execute_python_code
    exec(source_code, global_vars, local_vars)
  File "<string>", line 1, in <module>
  File "<string>", line 104, in dump
  File "/home/django/cvat/apps/annotation/annotation.py", line 325, in group_by_frame
    _get_frame(annotations, shape).labeled_shapes.append(self._export_labeled_shape(shape))
  File "/home/django/cvat/apps/annotation/annotation.py", line 308, in _get_frame
    rpath = os.path.sep.join(rpath[rpath.index(".upload")+1:])
ValueError: '.upload' is not in list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/rq/worker.py", line 812, in perform_job
    rv = job.perform()
  File "/usr/local/lib/python3.5/dist-packages/rq/job.py", line 588, in perform
    self._result = self._execute()
  File "/usr/local/lib/python3.5/dist-packages/rq/job.py", line 594, in _execute
    return self.func(*self.args, **self.kwargs)
  File "/home/django/cvat/apps/engine/annotation.py", line 135, in dump_task_data
    annotation.dump(filename, dumper, scheme, host)
  File "/home/django/cvat/apps/engine/annotation.py", line 740, in dump
    execute_python_code("{}(file_object, annotations)".format(dumper.handler), global_vars)
  File "/home/django/cvat/apps/engine/utils.py", line 60, in execute_python_code
    raise InterpreterError("{} at line {}: {}".format(error_class, line_number, details))
cvat.apps.engine.utils.InterpreterError: ValueError at line 308: '.upload' is not in list

2019-12-07 23:45:15,528 DEBG 'runserver' stderr output:
[Sat Dec 07 23:45:15.528224 2019] [wsgi:error] [pid 151:tid 139962191009536] [remote 172.19.0.1:33606] [2019-12-07 23:45:15,528] ERROR django.request: Internal Server Error: /api/v1/tasks/5/annotations/5_IDP
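
For anyone following along, the failing line can be reproduced outside CVAT. The sketch below assumes (this is not confirmed in the thread) that frames extracted from a PDF are stored under a path with no `.upload` component, for example a temporary directory, so `list.index` raises the ValueError seen in the log. The helper `relative_to_upload`, its base-name fallback, and both example paths are hypothetical; this is not CVAT's actual code or fix.

```python
import os


def relative_to_upload(path, marker=".upload"):
    """Return the part of `path` after the `marker` component.

    Hypothetical fallback: if the marker is absent, return the base name
    instead of raising ValueError as a bare list.index() call would.
    """
    parts = path.split(os.path.sep)
    if marker in parts:
        return os.path.sep.join(parts[parts.index(marker) + 1:])
    return os.path.basename(path)


# Illustrative paths only; the real CVAT layout may differ.
image_path = os.path.sep.join(["", "data", "2", ".upload", "a", "0.png"])
pdf_page_path = os.path.sep.join(["", "tmp", "tmpabc123", "0.jpg"])

# The unguarded lookup from the traceback fails for the PDF page path.
try:
    pdf_page_path.split(os.path.sep).index(".upload")
    lookup_failed = False
except ValueError:
    lookup_failed = True

print(lookup_failed)
print(relative_to_upload(image_path))
print(relative_to_upload(pdf_page_path))
```

Whether a base-name fallback is the right behaviour for dumped annotations is a separate design question; the point is only that the lookup needs a guard when the frame path does not come from the upload directory.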
@nmanovic nmanovic added the bug Something isn't working label Dec 8, 2019
@nmanovic nmanovic added this to the 1.0.0 - Release milestone Dec 8, 2019
@nmanovic
Contributor

nmanovic commented Dec 8, 2019

@philippschw, thanks for the report. It looks like a bug.

@benhoff
Contributor

benhoff commented Dec 8, 2019

I can't speak to the dumper errors.

As for the rationale behind only being able to load a single PDF: I submitted this feature while working on a job for a client. All the client needed was the ability to upload a single PDF per task. And I had many, many other responsibilities :) .

The upload code can easily be extended to account for your use case.

You would need to wrap lines 92 - 97 in a for loop. Line 92 is linked below:

https://github.com/opencv/cvat/blob/1ec89b5f6a445aaa86854356cb73deb7e070d346/cvat/apps/engine/media_extractors.py#L92

I think the DirectoryExtractor has a somewhat relevant example. The only difference is that the name in file_ = convert_from_path(self._source_path) is a little misleading. I believe file_ is a list of page images, one per page of the PDF, that each need to be handled.

The relevant section of DirectoryExtractor code is linked below.

https://github.com/opencv/cvat/blob/1ec89b5f6a445aaa86854356cb73deb7e070d346/cvat/apps/engine/media_extractors.py#L129

Below is a take on my comments from above.

# Note: The following code would replace the existing code starting at:
# https://github.com/opencv/cvat/blob/1ec89b5f6a445aaa86854356cb73deb7e070d346/cvat/apps/engine/media_extractors.py#L91

self._dimensions = []
count = 0
for source in source_path:
    for root, _, files in os.walk(source):
        paths = [os.path.join(root, f) for f in files]
        paths = filter(lambda x: get_mime(x) == 'pdf', paths)
        for path in paths:
            pages = convert_from_path(path)
            for page in pages:
                # Note: There's probably a better way to assign a name than using `count`
                output = os.path.join(self._temp_directory, str(count) + '.jpg')
                count += 1
                self._dimensions.append(page.size)
                page.save(output, 'JPEG')

self._length = len(os.listdir(self._temp_directory))

# Note: you would need to redefine the below method for `PDFExtractor`
def _get_imagepath(self, k):
    img_path = os.path.join(self._temp_directory, str(k) + '.jpg')
    return img_path

# Note: You would need to change `unique` to be `False` in the following line:
# https://github.com/opencv/cvat/blob/1ec89b5f6a445aaa86854356cb73deb7e070d346/cvat/apps/engine/media_extractors.py#L276

But I don't have any way to test the above code currently.

@philippschw
Author

philippschw commented Dec 9, 2019

Thanks for your detailed response.

Unfortunately, the code fails silently. Although CVAT reports that the task has been created, the task does not appear in the overview and cannot be opened for annotation.

The following log output shows that no frames were created for the task.

2019-12-09 16:49:23,218 DEBG 'rqworker_default_1' stderr output:
16:49:23 default: cvat.apps.engine.task._create_thread(1, {'server_files': [], 'remote_files': [], 'client_files': ['DP_Telekom_Lexware_Unbekannt.pdf', 'DATEV.PDF']}) (/api/v1/tasks/1)

2019-12-09 16:49:23,230 DEBG 'rqworker_default_1' stderr output:
[2019-12-09 16:49:23,230] INFO cvat.server: create task #1

2019-12-09 16:49:23,245 DEBG 'rqworker_default_1' stderr output:
[2019-12-09 16:49:23,245] INFO cvat.server: Founded frames 0 for task #1

What is more, no .jpg file is saved in the data folder when I upload PDFs (project ID 1), but when I upload images (project ID 2) it works as expected:

django@2e82356a9f21:~/data$ ls 1/data/
django@2e82356a9f21:~/data$ ls 2/data/0/0/
0.jpg  1.jpg
django@2e82356a9f21:~/data$

I used your code with only minimal changes:
cvat/cvat/apps/engine/media_extractors.py

        self._dimensions = []
        count = 0
        for source in source_path:
            for root, _, files in os.walk(source):
                paths = [os.path.join(root, f) for f in files]
                paths = filter(lambda x: get_mime(x) == 'pdf', paths)
                for path in paths:
                    pages = convert_from_path(path)
                    for page in pages:
                        # Note: There's probably a better way to assign a name than using `count`
                        output = os.path.join(self._temp_directory, str(count) + '.jpg')
                        count += 1
                        self._dimensions.append(page.size)
                        page.save(output, 'JPEG')

        self._length = len(os.listdir(self._temp_directory))

    def _get_imagepath(self, k):
        img_path = os.path.join(self._temp_directory, str(k) + '.jpg')
        return img_path

Complete Code:
https://github.com/philippschw/cvat
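
One possible explanation for the silent failure (an assumption on my side, not something confirmed in this thread): if the entries being iterated are paths to the uploaded PDF files rather than directories, `os.walk` yields nothing for them and raises no error, so the inner loop never runs and zero frames are written. The sketch below shows this behaviour with a placeholder file; `walk_files` is a hypothetical helper that only mimics the nested loop above.

```python
import os
import tempfile


def walk_files(sources):
    """Collect every file reached by calling os.walk on each entry."""
    found = []
    for source in sources:
        for root, _, files in os.walk(source):
            found.extend(os.path.join(root, name) for name in files)
    return found


with tempfile.TemporaryDirectory() as tmp:
    pdf_path = os.path.join(tmp, "example.pdf")
    with open(pdf_path, "wb") as handle:
        handle.write(b"placeholder")  # contents are irrelevant here

    from_files = walk_files([pdf_path])  # entry is a regular file
    from_dirs = walk_files([tmp])        # entry is the containing directory

print(from_files)      # []
print(len(from_dirs))  # 1
```

Because `os.walk` ignores errors by default (its `onerror` argument is `None`), passing it a file path produces an empty iteration rather than an exception, which would match the "Founded frames 0" log line.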

@benhoff
Contributor

benhoff commented Dec 9, 2019

I'm sorry about that!

I think you need to change the constructor as well. In the class's __init__ method, I grab only the first element of source_path via source_path[0]. See here:

https://github.com/philippschw/cvat/blob/9f39b55fb1e80f4906ebfb67d00f79769e428083/cvat/apps/engine/media_extractors.py#L83

I think the constructor needs to look like the one in ImageListExtractor. Rather than grabbing index 0, it applies sorted to the whole list. See here:

https://github.com/philippschw/cvat/blob/9f39b55fb1e80f4906ebfb67d00f79769e428083/cvat/apps/engine/media_extractors.py#L42

Have you thought about submitting a pull request to CVAT? You could start the title of the pull request with [WIP] to show that you're still iterating on it.

@benhoff
Contributor

benhoff commented Dec 10, 2019

I think the for loop needs some tweaking. I added too many steps.

        for source in source_path:
            pages = convert_from_path(source)
            for page in pages:
                output = os.path.join(self._temp_directory, str(count) + '.jpg')
                count += 1
                self._dimensions.append(page.size)
                page.save(output, 'JPEG')

I think the above code should replace the loop starting at this line:

https://github.com/philippschw/cvat/blob/9f39b55fb1e80f4906ebfb67d00f79769e428083/cvat/apps/engine/media_extractors.py#L93
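
The simplified loop can also be sketched as a standalone function, which makes it testable outside CVAT. This is only a sketch under assumptions: `save_pdf_pages` and its `convert` parameter are hypothetical names, sorting the input paths is my own choice (borrowed from the ImageListExtractor idea mentioned earlier), and the default converter is `pdf2image.convert_from_path`, which needs poppler installed.

```python
import os


def save_pdf_pages(pdf_paths, out_dir, convert=None):
    """Save every page of every PDF into out_dir as 0.jpg, 1.jpg, ...

    Returns the page sizes in the order the pages were saved.
    `convert` maps a PDF path to a list of PIL-like page images; when it
    is None, pdf2image.convert_from_path is used (requires poppler).
    """
    if convert is None:
        from pdf2image import convert_from_path as convert

    dimensions = []
    count = 0  # running frame index shared across all PDFs
    for pdf_path in sorted(pdf_paths):
        for page in convert(pdf_path):
            output = os.path.join(out_dir, "{}.jpg".format(count))
            page.save(output, "JPEG")
            dimensions.append(page.size)
            count += 1
    return dimensions
```

Passing a stub for `convert` lets the numbering logic be checked without poppler or real PDFs. A call such as `save_pdf_pages(['a.pdf', 'b.pdf'], '/some/output/dir')` (paths here are placeholders) would number the pages of `a.pdf` first and continue the count through `b.pdf`.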

@nmanovic nmanovic modified the milestones: 1.0.0-release, 1.1.0-release May 23, 2020
@nmanovic nmanovic removed this from the 1.1.0-release milestone Nov 28, 2021
@nmanovic
Contributor

I will close the issue as outdated. I was able to create a task with a book in PDF format.
