Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PDF extractor #557

Merged
merged 1 commit into from
Jul 11, 2019
Merged

Add PDF extractor #557

merged 1 commit into from
Jul 11, 2019

Conversation

benhoff
Copy link
Contributor

@benhoff benhoff commented Jul 9, 2019

No description provided.

@benhoff
Copy link
Contributor Author

benhoff commented Jul 9, 2019

@benhoff benhoff changed the title [WIP] add PDF extractor Add PDF extractor Jul 9, 2019
@benhoff benhoff mentioned this pull request Jul 9, 2019
Copy link
Contributor

@nmanovic nmanovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution as usual! Could you please add information about the feature to CHANGELOG.md?

from pdf2image import convert_from_path
self._temp_directory = tempfile.mkdtemp(prefix='cvat-')
super().__init__(
source_path=source_path[0],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why source_path[0]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@benhoff benhoff Jul 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why the source_path is a list here, but I believe due to the implementation, this will get a list with a single item in it. I didn't dive into the overall architecture for custom extractors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benhoff , you specified in description of pdf extractor that multiple pdf documents can be uploaded. For video extractor unique flag is True.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extractor's constructor always receive a list as source_path argument. I don't see any problem here, but the extractor is responsible to correctly handle passed source list. Could you please adjust description according to extractor behaviour? I mean case if you try to create task with several pdf files but only one will be used.

'pdf': {
  ...
  'unique': **False**
},

Maybe it will be better to change behaviour and pass to the constructor a list or single item according its description. I'll think about that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed unique to True. For my case, PDF's can have multiple pages and I want to be able to flip through them, like a video. Let me know if you think the implementation should be different.

@@ -72,6 +72,48 @@ def save_image(self, k, dest_path):
image.close()
return width, height

class PDFExtractor(MediaExtractor):
Copy link
Contributor

@nmanovic nmanovic Jul 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would look at ArchiveExtractor implementation and inherit the class from DirectoryExtractor. Let's implement here _extract method ... What do you think?

Copy link
Contributor Author

@benhoff benhoff Jul 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was using VideoExtractor as a basis here because in my case, PDF's could have multiple pages. Is there a better way to handle multiple page PDF's?

@benhoff
Copy link
Contributor Author

benhoff commented Jul 9, 2019

Thanks for the contribution as usual! Could you please add information about the feature to CHANGELOG.md?

Added!

@nmanovic
Copy link
Contributor

@azhavoro , please review the patch and leave your comments.

@nmanovic nmanovic requested a review from azhavoro July 10, 2019 06:34
Copy link
Contributor

@azhavoro azhavoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return self._get_imagepath(k)

def __len__(self):
return len(os.listdir(self._temp_directory))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's calculate the length in the __init__ method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

from pdf2image import convert_from_path
self._temp_directory = tempfile.mkdtemp(prefix='cvat-')
super().__init__(
source_path=source_path[0],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extractor's constructor always receive a list as source_path argument. I don't see any problem here, but the extractor is responsible to correctly handle passed source list. Could you please adjust description according to extractor behaviour? I mean case if you try to create task with several pdf files but only one will be used.

'pdf': {
  ...
  'unique': **False**
},

Maybe it will be better to change behaviour and pass to the constructor a list or single item according its description. I'll think about that.

@nmanovic nmanovic merged commit ccbbf33 into cvat-ai:develop Jul 11, 2019
MultifokalHirn added a commit to signatrix/cvat that referenced this pull request Jul 22, 2019
* develop: (112 commits)
  fixed attribute processing in auto_annotation (cvat-ai#577)
  CVAT.js API Tests (cvat-ai#578)
  Fixed exception in attribute annotation mode (cvat-ai#571)
  CVAT.js API methods were implemented (cvat-ai#572)
  Dashboard components basic styles (cvat-ai#574)
  Handle invalid json labelmap file case correctly during create/update DL model stage. (cvat-ai#573)
  Upgrade Numpy to avoid Arbitrary Code Execution. Upgrade Django to avoid MitM (cvat-ai#575)
  Run functional tests for REST API during a build (cvat-ai#506)
  CVAT.js other implemented API methods and bug fixes (cvat-ai#569)
  CVAT.js implemented API methods and bug fixes (cvat-ai#564)
  added in handeling for openvino 2019 (cvat-ai#545)
  added in command line auto annotation runner (cvat-ai#563)
  Fixed PDF extractor syntax error (cvat-ai#565)
  Update README.md
  added in pdf extractor (cvat-ai#557)
  Basic dashboard components (cvat-ai#562)
  Saving of annotations on the server (cvat-ai#561)
  Code was devided by files (cvat-ai#558)
  CVAT.js: Save and delete for shapes/tracks/tags (cvat-ai#555)
  Fixed '=' to '==' for numpy in requirments (cvat-ai#556)
  ...

# Conflicts:
#	.gitignore
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants