plan topic-level image support #658

rahulbot · 2020-01-28T14:12:38Z

Building on #602 and #593, we now want to figure out the requirements for allowing image-based analysis within topics. I reviewed all the old conversations I could find and collected these notes worth keeping in mind:

- people like the transfer-learning-based clustering approaches
- the same images are very often used, but cropped differently in different stories
- the social-sharing image (og:image) is often different from all the others
- the idea of a "top image" is conceptually helpful for analysis
- the ResNet50 image-similarity approach uses small (224px square) images
- people like the ability to see full-size images

With regard to the back-end pipeline, I'd translate those notes into requirements and next steps like this:

- extract all the story-related image URLs from each story in a topic
  - should we make this optional at the topic level? perhaps to save cost?
  - decide whether to use Newspaper3k or roll-our-own
  - make sure we only extract them once for each story
  - decide whether deduplication is worth solving or not
- within a story, mark the "top image" and "social sharing image(s)"
  - create a DB table structure that allows for this
- store full size images and 224px size images by default
  - @pypt suggests an S3 store for this, re-using a solution we use for other things
  - do some tests to estimate ongoing cost and growth rate
- specify API endpoints for retrieval of said images
  - my first thought is to just add an images property to any topic story list results (that'd let us render image tree maps quickly)
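
To make the first two requirements more concrete, here is a minimal sketch of what the "roll-our-own" option could look like, using only the Python standard library. Everything named here is an assumption for discussion, not an existing part of the codebase: the `story_images` table, its `is_top` and `is_social` columns, and the helper names are all hypothetical. The "top image" rule (first inline `<img>` in the story) is only a placeholder heuristic, and SQLite stands in for whatever database we'd actually use.

```python
import sqlite3
from html.parser import HTMLParser


class ImageURLParser(HTMLParser):
    """Collect og:image meta tags and <img> sources from one story's HTML."""

    def __init__(self):
        super().__init__()
        self.found = []  # (url, kind) tuples in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:image":
            if attrs.get("content"):
                self.found.append((attrs["content"], "social"))
        elif tag == "img" and attrs.get("src"):
            self.found.append((attrs["src"], "inline"))


def extract_images(html):
    """Return {url: {"is_top": 0/1, "is_social": 0/1}} for one story."""
    parser = ImageURLParser()
    parser.feed(html)
    parser.close()
    records = {}
    top_assigned = False
    for url, kind in parser.found:
        # setdefault de-duplicates exact URL repeats within a story
        record = records.setdefault(url, {"is_top": 0, "is_social": 0})
        if kind == "social":
            record["is_social"] = 1
        elif not top_assigned:
            # placeholder heuristic (an assumption): first inline image is "top"
            record["is_top"] = 1
            top_assigned = True
    return records


def create_schema(conn):
    # hypothetical table; one row per (story, image URL)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS story_images (
            stories_id INTEGER NOT NULL,
            url        TEXT    NOT NULL,
            is_top     INTEGER NOT NULL DEFAULT 0,
            is_social  INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (stories_id, url)
        )
        """
    )


def store_story_images(conn, stories_id, html):
    """Extract and store images once per story; return number of rows added."""
    already_done = conn.execute(
        "SELECT 1 FROM story_images WHERE stories_id = ? LIMIT 1", (stories_id,)
    ).fetchone()
    if already_done:
        return 0  # skip stories we have already processed
    records = extract_images(html)
    with conn:  # commit all rows for the story in one transaction
        conn.executemany(
            "INSERT INTO story_images (stories_id, url, is_top, is_social) "
            "VALUES (?, ?, ?, ?)",
            [(stories_id, url, r["is_top"], r["is_social"]) for url, r in records.items()],
        )
    return len(records)
```

Two caveats with this sketch: a story with no images would be re-parsed on every run, since the "already extracted" check looks for existing rows, and it only de-duplicates exact URL matches, so the differently-cropped copies mentioned above would still count as separate images.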

A separate task is to design an approach for automatically training an image-embeddings model (for each snapshot), based on the ResNet50 transfer-learning approach we learned from Leon. I think that still needs investigation and research, particularly on which similarity algorithm to use and on what to present to users to support their research. Sometimes they say they want his "mosaics", but other times it seems they want clusters.
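
As a starting point for the similarity question, here is a minimal sketch of one candidate: cosine similarity over precomputed embedding vectors, with a greedy threshold-based grouping. This is not a decision or Leon's method; the function names, the `0.9` threshold, and the grouping strategy are assumptions for illustration, and toy 2-dimensional vectors stand in for real ResNet50 embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def group_similar(embeddings, threshold=0.9):
    """Greedy single-pass grouping.

    Each image joins the first group whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new group.
    """
    groups = []
    for image_id, vector in embeddings.items():
        for group in groups:
            representative = embeddings[group[0]]
            if cosine_similarity(vector, representative) >= threshold:
                group.append(image_id)
                break
        else:
            groups.append([image_id])
    return groups


# toy vectors standing in for real image embeddings
example = {"a": [1.0, 0.0], "b": [0.99, 0.1], "c": [0.0, 1.0]}
print(group_similar(example))
```

The greedy pass depends on input order and is only meant to make the trade-off discussable; a proper clustering algorithm may well fit the "clusters" request better.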

What did I miss? Thoughts on these requirements?

rahulbot · 2020-02-18T13:45:00Z

More notes on the related project board: https://github.com/berkmancenter/mediacloud/projects/3

rahulbot added the enhancement label Jan 28, 2020

rahulbot assigned pypt, rahulbot and hroberts Jan 28, 2020

rahulbot mentioned this issue May 20, 2020

define key researcher use-cases for story image extraction and storage #708

Open

rahulbot removed their assignment Jul 30, 2020
