How long does it take for content published by news organisations to be available in Google search?
This broadens Ophan's Google Search Index Checker to check for content published by many news organisations, not just the Guardian. We're trying to work out if the intermittent multi-hour delays we've seen for some Guardian articles to be available in Google Search are typical for other news organisations too, or if there's actually something particular to the Guardian that needs to be fixed.
It's an 'observatory' in the same way that the EFF SSL Observatory is - creating and collating observations of distant sites and processes that are visible to us but beyond our control.
- Fetch the Sitemap XML for a news site
- Hit the Google Custom Search Site Restricted JSON API to check if the content listed is available in Google search. API Consumption & Cost 💰💰💰 for this can be monitored in the Google Cloud console.
- Store whether each article is available (or not) in an AWS DynamoDB table.
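The first two steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's own code: the function names, the sample sitemap, and the rule that an article counts as "available" when its exact URL appears as a `link` in the search results are all assumptions made for the example. The endpoint shown is the Custom Search Site Restricted JSON API URL as commonly documented, queried with `key`, `cx` and `q` parameters. The sketch makes no network calls, so it costs nothing to run.

```python
import urllib.parse
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Assumed endpoint for the Custom Search Site Restricted JSON API.
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1/siterestrict"


def article_urls(sitemap_xml: str) -> List[str]:
    """Extract every <loc> URL from a sitemap <urlset> document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def search_request_url(article_url: str, api_key: str, engine_id: str) -> str:
    """Build the request URL asking whether article_url is indexed.

    api_key and engine_id are placeholders for your own credentials.
    """
    query = urllib.parse.urlencode(
        {"key": api_key, "cx": engine_id, "q": article_url}
    )
    return f"{SEARCH_ENDPOINT}?{query}"


def is_indexed(article_url: str, search_response: dict) -> bool:
    """Assumed rule: indexed if any result item links to the exact URL."""
    items = search_response.get("items", [])
    return any(item.get("link") == article_url for item in items)


# Hypothetical sitemap used only for illustration.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://news.example.com/2024/first-story</loc></url>
  <url><loc>https://news.example.com/2024/second-story</loc></url>
</urlset>
"""

if __name__ == "__main__":
    for url in article_urls(SAMPLE_SITEMAP):
        print(search_request_url(url, "YOUR_API_KEY", "YOUR_ENGINE_ID"))
```

Keeping URL extraction, request building and the availability check as separate pure functions means each can be exercised without spending API quota; only the HTTP call and the DynamoDB write need real credentials.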
The prerequisites mostly match those for running Ophan locally: specifically Java 11 & sbt, but also, importantly, the requirement to have ophan AWS credentials from Janus.
Execute this on the command line:
$ sbt run