define key researcher use-cases for story image extraction and storage #708
Comments
These all look great to me. Is there some way we can produce each of these on a one-off basis to evaluate before building them into the platform? We have arguably already done #1.
Alternatively, we could make a bet that this is the set of products we want and build the minimal platform to deliver them.
A key difference I see is that the first two only require us to collect and process a small subset of the images, whereas the last two require us to process all images in a topic and also build an indexing system to be able to find them. Maybe start with the first two and build from there?
…-hal
On Wed, May 20, 2020 at 1:01 PM rahulbot ***@***.***> wrote:
To make some technical decisions I think we need to more concretely design the primary use cases we have in mind so far. Here's my stab at a list, and the underlying tech feature it might rely on:
- review a summary of visual language across a small corpus (i.e. top stories) - maybe use image tree map
- review a summary of visual language across a large corpus (i.e. a timespan) - some high-level view of clusters, like Leon's mosaic does
- trace the appearance of an image over time in a topic - search by image similarity
- search for stories using images similar to one the researcher identifies - search by image similarity
Note: we also have the potential to analyze by facial detection and identification. I think we've proved the desire and feasibility for use case #1. Minimally surfacing/storing the image and URL, at least for use cases 1 and 2, would make for a flexible initial implementation. The ability to search by image similarity would be an incredible capability, since there is little out there that does such things, but there is no trivial implementation.
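As one hedged illustration of what a tractable first cut at image similarity could look like (not a settled design), perceptual hashing would at least cover near-duplicate detection, which is what the "trace an image over time" use case needs. The `imagehash` library, filenames, and distance threshold below are my own placeholders:

```python
# Illustrative sketch only: near-duplicate detection via perceptual hashing
# with the imagehash library. Paths and threshold are made-up placeholders.
from PIL import Image
import imagehash

def are_near_duplicates(path_a, path_b, max_distance=8):
    """Return True if two images hash to within max_distance bits of each other."""
    hash_a = imagehash.phash(Image.open(path_a))
    hash_b = imagehash.phash(Image.open(path_b))
    # imagehash overloads subtraction to return the Hamming distance
    return (hash_a - hash_b) <= max_distance

print(are_near_duplicates("story_123_top.jpg", "story_456_top.jpg"))
```

This would not handle loose visual similarity (different crops of the same scene, for example), but it would let us index and match recurring images cheaply before committing to anything heavier.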
Glad this list feels like a good start. I think #2 has been fairly validated as useful too (see @cindyloo's MediaCloud-Image-Tests repo). I think you're right that this argues for extracting and surfacing the URL of the top image as a way to get started with 1 & 2. It would also let us try out some out-of-band approaches to 3 and 4 more quickly (with the top image at least). We kind of discussed this in #593, but also more recently.
To be concrete: I'm proposing we take a first step towards image support by adding a pipeline stage to every story in a topic that extracts and stores the top image URL (via Newspaper3k, because we have validated that). This should be returned in topic-story-list results so it can be used easily. I can split this off to a new issue to discuss details if folks generally agree.
The key point this is pushing me towards is that separating URLs from images can help us implement a first stage faster and give us a non-critical-path playground to more easily try out solutions for some of these features.
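For reference, a minimal sketch of what that extraction step could look like with Newspaper3k; the function name and return convention here are placeholders, not the actual pipeline-stage interface:

```python
# Rough sketch of top-image extraction with Newspaper3k; the function name
# and None-on-missing convention are assumptions, not the real pipeline API.
from newspaper import Article

def extract_top_image_url(story_url):
    """Fetch a story and return Newspaper3k's best guess at its top image URL."""
    article = Article(story_url)
    article.download()
    article.parse()
    # Newspaper3k sets top_image to an empty string when it can't find one
    return article.top_image or None

print(extract_top_image_url("https://example.com/some-story"))
```

Storing only the resulting URL (rather than the image bytes) keeps the pipeline stage lightweight and defers the harder storage and indexing questions to later issues.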
To make some technical decisions I think we need to more concretely design the primary use cases we have in mind so far. Here's my stab at a list, and the underlying tech feature it might rely on:
This is the thinking that led me to #658.