define key researcher use-cases for story image extraction and storage #708

rahulbot · 2020-05-20T18:01:18Z

To make some technical decisions I think we need to more concretely design the primary use cases we have in mind so far. Here's my stab at a list, and the underlying tech feature it might rely on:

1. review a summary of visual language across a small corpus (i.e. top stories) - maybe use image tree map
2. review a summary of visual language across a large corpus (i.e. a timespan) - some high-level view of clusters, like Leon's mosaic does
3. trace the appearance of an image over time in a topic - search by image similarity
4. search for stories using images similar to one the researcher identifies - search by image similarity

This is the thinking that led me to #658.

hroberts · 2020-05-20T18:14:15Z

these all look great to me. is there some way we can produce each of these on a one-off basis to evaluate before building them into the platform? we have arguably already done use case 1. alternatively, we could make a bet that this is the set of products we want and build the minimal platform to deliver them. a key difference I see is that the first two only require us to collect and process a small subset of the images, whereas the last two require us to process all images in a topic and also build an indexing system to be able to find them. maybe start with the first two and build from there?

-hal


cindyloo · 2020-05-20T18:38:52Z

note: we also have the potential to analyze by facial detection and identification.

I think we've proved the desire and feasibility for use case 1. At minimum, surfacing and storing the image and its URL for use cases 1 and 2 would make for a flexible initial implementation.

the ability to search by image similarity would be an incredible capability, as there is little out there that does this, but it is not trivial to implement.
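To make "search by image similarity" a bit more concrete, here is a minimal sketch of one possible approach. It is an assumption on my part, not something this thread has settled on: an average-hash fingerprint per image, compared by Hamming distance. The names `average_hash`, `hamming`, and `most_similar` are hypothetical, and the input is assumed to already be a small grayscale grid (in practice each image would first be downscaled, for example to 8x8).

```python
from typing import Dict, List, Tuple

# Assumption: an image is already reduced to a small grayscale grid (rows of 0-255 values).
Gray = List[List[int]]


def average_hash(pixels: Gray) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def most_similar(query: Gray, index: Dict[str, int], max_distance: int = 4) -> List[Tuple[str, int]]:
    """Return (image_url, distance) pairs within max_distance, closest first.

    `index` maps a hypothetical image URL to its precomputed hash.
    """
    query_hash = average_hash(query)
    hits = [(url, hamming(query_hash, h)) for url, h in index.items()]
    return sorted((hit for hit in hits if hit[1] <= max_distance), key=lambda hit: hit[1])
```

A linear scan like this would only be workable for a small corpus. Covering every image in a topic is where the indexing system mentioned earlier in the thread would come in.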

rahulbot · 2020-05-20T20:22:25Z

Glad this list feels like a good start. I think use case 2 has been fairly well validated as useful too (see @cindyloo's repo MediaCloud-Image-Tests).

I think you're right that this argues for extracting and surfacing the URL of the top image as a way to get started with 1 & 2. It would also let us try out some out-of-band approaches to 3 and 4 more quickly (with the top image at least). We kind of discussed this in #593, but also more recently.

To be concrete: I'm proposing we take a first step towards image support by adding a pipeline stage to every story in a topic that extracts and stores the top image URL (via Newspaper3k because we have validated that). This should be returned in topic-story-list results so it can be used easily. I can split this off to a new issue to discuss details if folks generally agree.

The key point this is pushing me towards is that separating URLs from images can help us implement a first stage faster and give us a non-critical-path playground to more easily try out solutions for some of these features.
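As a rough illustration of the pipeline stage proposed above, here is a minimal sketch that attaches a top image URL to each story. Several things here are assumptions: the story fields `stories_id`, `url`, and `top_image_url` are hypothetical, as is the `fetch_html` callable. To keep the sketch dependency-free it reads the page's `og:image` meta tag with the standard library instead of calling Newspaper3k; the `extract` parameter is where a Newspaper3k-based extractor would be plugged in.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List, Optional


class _OgImageParser(HTMLParser):
    """Records the first <meta property="og:image" content="..."> value seen."""

    def __init__(self) -> None:
        super().__init__()
        self.url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.url is not None:
            return
        attr = dict(attrs)
        if attr.get("property") == "og:image" and attr.get("content"):
            self.url = attr["content"]


def og_image_url(html: str) -> Optional[str]:
    """Stand-in extractor: return the og:image URL, or None if the page has none."""
    parser = _OgImageParser()
    parser.feed(html)
    return parser.url


def add_top_images(
    stories: Iterable[Dict],
    fetch_html: Callable[[str], Optional[str]],
    extract: Callable[[str], Optional[str]] = og_image_url,
) -> List[Dict]:
    """Return copies of the stories with a (hypothetical) top_image_url field added.

    Only the URL is stored; the image itself is never downloaded here.
    """
    results = []
    for story in stories:
        html = fetch_html(story["url"])
        results.append({**story, "top_image_url": extract(html) if html else None})
    return results
```

Because only the URL is stored, anything that needs the image bytes (mosaics, similarity experiments) can fetch them out-of-band without touching the main pipeline.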

rahulbot mentioned this issue May 20, 2020

design top images mosaic, tree map or other display for topic top stories mediacloud/web-tools#1814

Closed
