Add PDF extractor #557

benhoff · 2019-07-09T14:16:29Z

No description provided.

benhoff · 2019-07-09T16:51:45Z

@azhavoro what is _dest_path for?
https://github.com/opencv/cvat/blob/5104cc08c13d325e172d8523b7d2b7347c61c87d/cvat/apps/engine/media_extractors.py#L109

nmanovic

Thanks for the contribution as usual! Could you please add information about the feature to CHANGELOG.md?

nmanovic · 2019-07-09T20:21:34Z

cvat/apps/engine/media_extractors.py

+        from pdf2image import convert_from_path
+        self._temp_directory = tempfile.mkdtemp(prefix='cvat-')
+        super().__init__(
+            source_path=source_path[0],


Why source_path[0]?

I was following the implementation of VideoExtractor. See: https://github.com/opencv/cvat/blob/5104cc08c13d325e172d8523b7d2b7347c61c87d/cvat/apps/engine/media_extractors.py#L111

I'm not sure why source_path is a list here, but I believe that, given the implementation, it will always be a list with a single item. I didn't dig into the overall architecture for custom extractors.

@benhoff, you specified in the description of the PDF extractor that multiple PDF documents can be uploaded. For the video extractor the unique flag is True.

An extractor's constructor always receives a list as the source_path argument. I don't see any problem here, but the extractor is responsible for correctly handling the source list it is passed. Could you please adjust the description to match the extractor's behaviour? I mean the case where you try to create a task with several PDF files but only one is used.

'pdf': { ... 'unique': **False** },

Maybe it would be better to change the behaviour and pass the constructor either a list or a single item, according to its description. I'll think about that.

I changed unique to True. In my case, PDFs can have multiple pages and I want to be able to flip through them, like a video. Let me know if you think the implementation should be different.
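For reference, a minimal sketch of how an extractor could make the "one file per task" rule explicit instead of silently taking source_path[0]. The helper name single_source is hypothetical and is not part of CVAT; it only assumes, as stated above, that the constructor receives a list.

```python
# Hypothetical helper, not part of CVAT: fail loudly when a "unique"
# extractor is handed more than one source file.
def single_source(source_path):
    """Return the only entry of source_path, or raise if there is not exactly one."""
    if isinstance(source_path, str):
        # Tolerate a bare string so callers may pass either form.
        return source_path
    paths = list(source_path)
    if len(paths) != 1:
        raise ValueError(
            "expected exactly one source file, got {}".format(len(paths))
        )
    return paths[0]
```

With this in place the constructor could call single_source(source_path) where it currently indexes [0], so a task created with several PDFs would be rejected rather than partially imported.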

nmanovic · 2019-07-09T20:26:58Z

cvat/apps/engine/media_extractors.py

@@ -72,6 +72,48 @@ def save_image(self, k, dest_path):
        image.close()
        return width, height

+class PDFExtractor(MediaExtractor):


I would look at the ArchiveExtractor implementation and inherit the class from DirectoryExtractor. Let's implement the _extract method here ... What do you think?

I was using VideoExtractor as a basis here because, in my case, PDFs can have multiple pages. Is there a better way to handle multi-page PDFs?
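A rough sketch of the extraction step under discussion, assuming the pages are written to a directory as numbered images so that a directory-style extractor can serve them. The function name extract_pdf_pages and its convert parameter are hypothetical; only convert_from_path comes from pdf2image, and it returns one PIL image per page.

```python
import os
import tempfile


def extract_pdf_pages(pdf_path, dest_dir=None, convert=None):
    """Write each page of pdf_path as a numbered PNG and return the file paths."""
    if convert is None:
        # Imported lazily, as in the patch, so pdf2image is only required
        # when a PDF is actually processed.
        from pdf2image import convert_from_path as convert
    if dest_dir is None:
        dest_dir = tempfile.mkdtemp(prefix='cvat-')
    paths = []
    for index, page in enumerate(convert(pdf_path)):
        # Zero-padded names keep the pages in order when the directory is sorted.
        path = os.path.join(dest_dir, '{:06d}.png'.format(index))
        page.save(path)
        paths.append(path)
    return paths
```

Because multi-page PDFs become an ordered set of image files, either base class could then present them as consecutive frames; the convert parameter is there only so the function can be exercised without poppler installed.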

benhoff · 2019-07-09T23:45:30Z

> Thanks for the contribution as usual! Could you please add information about the feature to CHANGELOG.md?

Added!

nmanovic · 2019-07-10T06:34:45Z

@azhavoro , please review the patch and leave your comments.

azhavoro

> @azhavoro what is _dest_path for?
> https://github.com/opencv/cvat/blob/5104cc08c13d325e172d8523b7d2b7347c61c87d/cvat/apps/engine/media_extractors.py#L109

I'll fix it as soon as possible.

azhavoro · 2019-07-10T16:39:48Z

cvat/apps/engine/media_extractors.py

+        return self._get_imagepath(k)
+
+    def __len__(self):
+        return len(os.listdir(self._temp_directory))


Let's calculate the length once in the __init__ method instead of listing the directory on every __len__ call.
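A small sketch of that suggestion, using a hypothetical PageDirectory class as a stand-in for the extractor: the directory is listed once in __init__ and __len__ returns the stored value.

```python
import os


class PageDirectory:
    """Hypothetical stand-in for the extractor: counts pages once, up front."""

    def __init__(self, directory):
        self._directory = directory
        # Sorted so that index k maps to the same file on every call.
        self._files = sorted(os.listdir(directory))
        self._length = len(self._files)

    def __getitem__(self, k):
        return os.path.join(self._directory, self._files[k])

    def __len__(self):
        # No filesystem access here; the value was fixed in __init__.
        return self._length
```

Besides avoiding repeated os.listdir calls, this keeps the reported length stable if other files appear in the temporary directory later.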

azhavoro · 2019-07-10T16:40:11Z

cvat/apps/engine/media_extractors.py

+        from pdf2image import convert_from_path
+        self._temp_directory = tempfile.mkdtemp(prefix='cvat-')
+        super().__init__(
+            source_path=source_path[0],


An extractor's constructor always receives a list as the source_path argument. I don't see any problem here, but the extractor is responsible for correctly handling the source list it is passed. Could you please adjust the description to match the extractor's behaviour? I mean the case where you try to create a task with several PDF files but only one is used.

'pdf': { ... 'unique': **False** },

Maybe it would be better to change the behaviour and pass the constructor either a list or a single item, according to its description. I'll think about that.

benhoff changed the title ~~[WIP] add PDF extractor~~ Add PDF extractor Jul 9, 2019

benhoff mentioned this pull request Jul 9, 2019

Load large TIFF images #531

Open

nmanovic reviewed Jul 9, 2019

nmanovic requested a review from azhavoro July 10, 2019 06:34

azhavoro reviewed Jul 10, 2019

added in pdf extractor

e0bf866

nmanovic approved these changes Jul 11, 2019

nmanovic merged commit ccbbf33 into cvat-ai:develop Jul 11, 2019
