Description
The Tika image detection component extracts images from PDF files, writes them to disk, and reports those paths in the JSON output object. We need to develop a way to perform detections on these images, for example, face detection and OCR.
In general, OpenMPF is designed so that a job is associated with a single piece of media. Currently, we cannot create a pipeline that takes a PDF as input for the first stage, and then processes multiple images in subsequent stages.
Separate out each piece of extracted media into its own “media” element.
We should add a “parentMediaId” field to the JSON output object. Consider this example Tika Image Detection component output:
"media": [
  { "mediaId": 51,
    "path": "<path>/XYZ.pdf",
    "parentMediaId": -1,
    "mediaMetadata": {}
    ...
  },
  { "mediaId": 52,
    "path": "<path>/page-1/image0.jpg",
    "parentMediaId": 51,
    "mediaMetadata": { "PAGE_NUM": "1", "CAPTION": "Here are some dogs" }
    ...
  },
  { "mediaId": 53,
    "path": "<path>/page-1/image1.jpg",
    "parentMediaId": 51,
    "mediaMetadata": { "PAGE_NUM": "1", "CAPTION": "Cats go meow" }
    ...
  },
  { "mediaId": 54,
    "path": "<path>/page-2/image0.jpg",
    "parentMediaId": 51,
    "mediaMetadata": { "PAGE_NUM": "2", "CAPTION": "Zebras are not horses" }
    ...
  },
  { "mediaId": 55,
    "path": "<path>/page-1/crop-1/subimage1.jpg",
    "parentMediaId": 53,
    "mediaMetadata": {}
    ...
  }
]
Representing each extracted image as a separate track has the advantage that a component can add track-level properties unique to that image. (I’m not saying we should do this now, but if an image has a caption in the PDF, then the Tika image detection component could add a CAPTION track-level property in addition to the PAGE_NUM property.)
A lot of this boils down to traceability – tracing a face detection back to the original PDF media, and beyond that, to the page of that PDF – and making sure the JSON output is not too redundant or verbose.
The “parentMediaId” field enables us to support extractions from extractions if we have multiple pipeline stages that perform image extraction. In the above example, media 55 is extracted from media 53, which is extracted from media 51.
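To illustrate how a consumer might follow these links, here is a minimal Python sketch that walks “parentMediaId” from a derivative media element back to the original PDF. The `media` list is a trimmed-down copy of the example above, and the `ancestry` helper is hypothetical, not part of OpenMPF. It assumes, as in the example, that -1 marks media with no parent.

```python
# Trimmed-down media entries mirroring the example output above.
# Assumption: a parentMediaId of -1 marks the original (root) media.
media = [
    {"mediaId": 51, "path": "<path>/XYZ.pdf", "parentMediaId": -1},
    {"mediaId": 52, "path": "<path>/page-1/image0.jpg", "parentMediaId": 51},
    {"mediaId": 53, "path": "<path>/page-1/image1.jpg", "parentMediaId": 51},
    {"mediaId": 54, "path": "<path>/page-2/image0.jpg", "parentMediaId": 51},
    {"mediaId": 55, "path": "<path>/page-1/crop-1/subimage1.jpg", "parentMediaId": 53},
]


def ancestry(media_list, media_id):
    """Return [media_id, parent, grandparent, ...] ending at the root media."""
    by_id = {m["mediaId"]: m for m in media_list}
    chain = []
    seen = set()
    current = media_id
    while current != -1:
        if current in seen:
            # Guard against malformed output where parents form a loop.
            raise ValueError(f"cycle detected at mediaId {current}")
        seen.add(current)
        chain.append(current)
        current = by_id[current]["parentMediaId"]
    return chain


print(ancestry(media, 55))  # [55, 53, 51]
```

The last element of the returned chain is always the original media, so a face detection reported on media 55 can be attributed to XYZ.pdf without storing the PDF path on every derivative element.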
For clarity:
- For each derivative media track, the WFM is not going to copy track-level properties over to the “mediaMetadata” field of each piece of extracted “media”.
  - If a consumer wants to know which page an extracted image is associated with, they can use the “parentMediaId” to look up the page information in the tracks reported for the parent media.
- The Tika Image Detection component will need to be updated to generate one track per piece of extracted media (see #803, “Update Tika Image Detection to generate one track per piece of extracted media”).
  - Each track will need to have a “PAGES” track-level property that lists which page(s) of the PDF the extracted media appears on. (This replaces the current “PAGE_NUM” property.)
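The page lookup described in the first point could be sketched as follows. This is only an illustration under assumptions: the dictionary layout (`tracks`, `trackProperties`), the `pages_for` helper, and the string format of the “PAGES” value are hypothetical, and the sketch assumes each parent track records the extracted file in a SAVED_MEDIA property, as in the generic example in this issue.

```python
# Hypothetical, simplified job output. The field names "tracks" and
# "trackProperties" are assumptions made for this sketch.
job_media = [
    {
        "mediaId": 0,
        "path": "<path>/XYZ.pdf",
        "parentMediaId": -1,
        "tracks": [
            {"trackProperties": {"SAVED_MEDIA": "<path>/page-1/image0.jpg", "PAGES": "1"}},
            {"trackProperties": {"SAVED_MEDIA": "<path>/page-2/image0.jpg", "PAGES": "2"}},
        ],
    },
    {"mediaId": 1, "path": "<path>/page-1/image0.jpg", "parentMediaId": 0, "tracks": []},
    {"mediaId": 3, "path": "<path>/page-2/image0.jpg", "parentMediaId": 0, "tracks": []},
]


def pages_for(media_list, media_id):
    """Find the PAGES value on the parent track that produced this media."""
    by_id = {m["mediaId"]: m for m in media_list}
    child = by_id[media_id]
    parent = by_id.get(child.get("parentMediaId", -1))
    if parent is None:
        return None  # original media: there is no parent track to consult
    for track in parent.get("tracks", []):
        props = track.get("trackProperties", {})
        # Match the parent's track to the child by the saved file path.
        if props.get("SAVED_MEDIA") == child["path"]:
            return props.get("PAGES")
    return None


print(pages_for(job_media, 3))
```

Matching on the saved path keeps page information in one place (the parent's track) instead of duplicating it in each child's “mediaMetadata”.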
For reference, consider this more generic example output generated by running a face detection component on extracted images:
mediaPath: <path>/XYZ.pdf
mediaId: 0
IMAGE:
  + Image A track
    + SAVED_MEDIA: <path>/page-1/image0.jpg
  + Image B track
    + SAVED_MEDIA: <path>/page-1/image1.jpg
  + Image C track
    + SAVED_MEDIA: <path>/page-2/image0.jpg

mediaPath: <path>/page-1/image0.jpg
mediaId: 1
parentMediaId: 0
IMAGE:
  + Image A track
    + SAVED_MEDIA: <path>/page-1/image0/subimage0.jpg
  + Image B track
    + SAVED_MEDIA: <path>/page-1/image0/subimage1.jpg

mediaPath: <path>/page-1/image1.jpg
mediaId: 2
parentMediaId: 0
IMAGE:
  + Image A track
    + SAVED_MEDIA: <path>/page-1/image1/subimage0.jpg

mediaPath: <path>/page-2/image0.jpg
mediaId: 3
parentMediaId: 0
IMAGE:
  + Image A track
    + SAVED_MEDIA: <path>/page-2/image0/subimage0.jpg

mediaPath: <path>/page-1/image0/subimage0.jpg
mediaId: 4
parentMediaId: 1
FACE:
  + Face A track
  + Face B track

mediaPath: <path>/page-1/image0/subimage1.jpg
mediaId: 5
parentMediaId: 1
FACE:
  + Face A track

mediaPath: <path>/page-1/image1/subimage0.jpg
mediaId: 6
parentMediaId: 2
FACE:
  + Face A track

mediaPath: <path>/page-2/image0/subimage0.jpg
mediaId: 7
parentMediaId: 3
FACE:
  + Face A track
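Reading the listing above as (mediaId, parentMediaId) pairs, a consumer could rebuild the extraction tree and list everything derived from a given media element. The Python sketch below is illustrative only: `build_children` and `descendants` are hypothetical helpers, and the root is given a parent of `None` because the listing shows no “parentMediaId” for media 0.

```python
from collections import defaultdict

# (mediaId, parentMediaId) pairs taken from the listing above.
# Assumption: the root PDF has no parent, represented here as None.
links = [(0, None), (1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 2), (7, 3)]


def build_children(pairs):
    """Map each parent mediaId to the list of media extracted from it."""
    children = defaultdict(list)
    for media_id, parent_id in pairs:
        if parent_id is not None:
            children[parent_id].append(media_id)
    return children


def descendants(children, media_id):
    """Return all media derived from media_id, depth-first."""
    found = []
    for child in children.get(media_id, []):
        found.append(child)
        found.extend(descendants(children, child))
    return found


tree = build_children(links)
print(descendants(tree, 0))  # [1, 4, 5, 2, 6, 3, 7]
print(descendants(tree, 1))  # [4, 5]
```

With this index, all FACE tracks that ultimately come from page-1/image0.jpg are those reported on media 4 and 5.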