See a live version here: https://media.herve.info/
This project is my attempt at a quasi-automated system that figures out who was invited to various French radio/TV political shows, and when, with as little human intervention as possible.
The goal is to replicate the basic functionality of this website: https://www.politiquemedia.com/
In more detail, this application runs on a server 24/7 and regularly polls an inventory of sources with vastly different layouts (XML podcast feeds, HTML pages, ...) in order to extract relevant information. This information is stored in a structured SQLite database and presented on a website.
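The application itself is written in Elixir, but the core loop boils down to something like the following Python sketch. The inventory entry, table layout, and polling interval below are made up for illustration; they are not the project's actual schema.

```python
import sqlite3, time, urllib.request

# Hypothetical inventory entry; the real inventory lists many shows,
# each with its own layout (XML podcast feed, HTML page, ...).
INVENTORY = [
    {"show": "some_show", "url": "https://example.com/podcast.xml"},
]

def poll(db_path="snapshots.db", interval=3600):
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS snapshots
                  (show TEXT, fetched_at INTEGER, body BLOB)""")
    while True:
        for source in INVENTORY:
            # Take a raw snapshot of whatever the source currently publishes.
            with urllib.request.urlopen(source["url"]) as resp:
                body = resp.read()
            db.execute("INSERT INTO snapshots VALUES (?, ?, ?)",
                       (source["show"], int(time.time()), body))
        db.commit()
        time.sleep(interval)
```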
It is currently capable of:
- making snapshots of the data published for each show
- retrieving each show's metadata (its name, a banner, ...) by HTML parsing (see the sketch after this list)
- making regular snapshots of the data published for each occurrence of a show
- retrieving the metadata for each occurrence of a show (the title, a summary, ...)
- extracting a list of guests
- presenting all this information on a website (with a page per show / day / person)
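As an illustration of the metadata-retrieval step mentioned above, here is a minimal sketch (again in Python rather than Elixir, and assuming the show's page exposes standard Open Graph tags, which is not guaranteed for every source):

```python
import urllib.request
from bs4 import BeautifulSoup  # third-party HTML parser

def show_metadata(page_url):
    """Fetch a show's page and read its name/banner from Open Graph tags."""
    with urllib.request.urlopen(page_url) as resp:
        soup = BeautifulSoup(resp.read(), "html.parser")
    title = soup.find("meta", property="og:title")
    banner = soup.find("meta", property="og:image")
    return {
        "name": title["content"] if title else None,
        "banner": banner["content"] if banner else None,
    }
```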
It is a trivial task for a human being to tell which guests were invited to a given show, just by reading the show's webpage or by listening to/watching a few seconds of it.
The whole challenge of this project is to automate as much of this process as possible, but the result remains quite fuzzy and uncertain.
Firstly, because some sources do not publish any relevant information in text form about each show (see e.g. this show, which displays the exact same headline/summary every week).
Secondly, because when they do, it is done in natural language (e.g. "Notre invité aujourd'hui est NOM_DE_L_INVITE", "Ce matin nous accueillons NOM_DE_L_INVITE", "NOM_DE_L_INVITE : 'Ce qu'a dit X est inacceptable'", roughly "Our guest today is GUEST_NAME", "This morning we welcome GUEST_NAME", "GUEST_NAME: 'What X said is unacceptable'"); extracting the guests therefore requires analysis that is quite hard to automate (currently done with NLP provided by the open-source library spaCy).
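Reduced to its simplest possible form, the gist of that NLP step could look like the snippet below. The real pipeline is more involved; the model name and entity label here are just spaCy's usual French defaults, not necessarily what this project uses.

```python
import spacy

# Assumes the French model has been installed beforehand:
#   python -m spacy download fr_core_news_md
nlp = spacy.load("fr_core_news_md")

def candidate_guests(description):
    """Return the person names spaCy's NER finds in a show's description."""
    doc = nlp(description)
    return [ent.text for ent in doc.ents if ent.label_ == "PER"]

# With a made-up description, this should return something like ["Jeanne Dupont"]:
candidate_guests("Ce matin nous accueillons Jeanne Dupont pour parler de la réforme.")
```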
Thirdly, because guests' names are never presented in a consistent way; a given person may appear as "Firstname Lastname", "F. Lastname", "M./Mme Lastname", or sometimes only "Lastname". Wikidata's API has been used to try to ensure consistency.
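The reconciliation idea is roughly this: query Wikidata's public search API with whatever form of the name was extracted, and take the top match's stable id and canonical label. The sketch below deliberately ignores disambiguation (homonyms, non-politicians, ...), which is the hard part in practice.

```python
import json
import urllib.parse
import urllib.request

def canonical_name(raw_name):
    """Ask Wikidata's search API for the best match of a (possibly partial) name."""
    params = urllib.parse.urlencode({
        "action": "wbsearchentities",
        "search": raw_name,
        "language": "fr",
        "type": "item",
        "format": "json",
    })
    url = "https://www.wikidata.org/w/api.php?" + params
    with urllib.request.urlopen(url) as resp:
        matches = json.load(resp).get("search", [])
    if not matches:
        return None
    # Return both the stable Wikidata id (useful as a primary key) and the label.
    return matches[0]["id"], matches[0]["label"]
```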
Please note that, even though the goal is to minimize human intervention, this project still requires some maintenance in order to keep working reliably over the long term, mainly:
- validating the data guessed by the NLP pipeline (should be done on a daily basis)
- updating the inventory (e.g. when a show's podcast URL changes)
- updating the parsing process (e.g. when the layout of a show's main page changes)
A lack of such maintenance explains why, in the live version, some information is wrong or has stopped updating, even though the application itself has been running continuously for ~1.5 years.
- Run a remote Elixir shell on the host machine:
sudo su media_watch -c "/home/media_watch/otp/bin/media_watch remote"
- Generate an admin_key:
MediaWatch.Auth.generate_admin_key()
- Log in to the site using the URL "/admin?token=xxxx"