Media Cloud: Studying Entities
==============================

At this point you're ready to query Media Cloud for data. You can use boolean query syntax; [read our query guide](https://mediacloud.org/support/query-guide) for details about the exact syntax (it runs a [Solr search](https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html#the-standard-query-parser) under the hood). **This notebook demonstrates how to quickly evaluate the entities being mentioned**.

Looking at the entities mentioned in media stories can be very helpful for understanding whose narrative is dominating the framing of an issue. Every English-language story in our system is run through our [Cliff-Clavin tool](https://cliff.mediacloud.org), which extracts the **people**, **places** and **organizations** mentioned in the text (via [Stanford's CoreNLP](https://stanfordnlp.github.io/CoreNLP/)). What the tool adds on top of CoreNLP is a layer of heuristics that disambiguates the geographic mentions (more details in [our NewsKDD paper from 2014](https://dspace.mit.edu/handle/1721.1/123451)). Sadly, this only works for English.

Entities are stored as *tags* on stories. Tags are collected into tag sets to support aggregation and comparison. There are 3 tag sets related to entities, all captured as constants in `mediacloud.tags`:
* People - tag set 2389 (`mediacloud.tags.TAG_SET_CLIFF_PEOPLE`)
* Organizations - tag set 2388 (`mediacloud.tags.TAG_SET_CLIFF_ORGS`)
* Places (countries and states/provinces) - tag set 1011 (`mediacloud.tags.TAG_SET_CLIFF_PLACES`)

There is one key API method to support studying entity use, exposed via our Python API:

* `storyTagCount`: Returns lists of the most used tags within a tag set in the stories matching your query ([documentation](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/api_2_0_spec.md#apiv2stories_publictag_count))


In [None]:
# Grab your API key from the environment variable and create a client for talking to Media Cloud
import os, mediacloud.api
from dotenv import load_dotenv
from IPython.display import JSON
load_dotenv()  # load config from .env file
mc = mediacloud.api.MediaCloud(os.getenv('MC_API_KEY'))
mediacloud.__version__

In [None]:
import datetime
us_query = '"climate change" and tags_id_media:34412234'
# note: despite the variable name, this range covers January 2019 only
date_range_2019 = mc.dates_as_query_clause(datetime.date(2019,1,1), datetime.date(2019,1,31)) # inclusive

## Who is Being Mentioned?

Looking at "newsmakers" is a classic approach to understanding whose message is getting out about an issue.

In [None]:
# let's see who is being mentioned most in stories about "climate change" in US national sources
import mediacloud.tags
results = mc.storyTagCount(us_query, date_range_2019, tag_sets_id=mediacloud.tags.TAG_SET_CLIFF_PEOPLE)
results[:3]

You'll note that the top results probably include two versions of the same person's name. People are *not* being disambiguated, mostly because we haven't yet found a free, off-the-shelf solution that works well for this, and we don't have a research project that has required us to solve this problem at scale.
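
Because names are not disambiguated, it can help to surface likely variants for manual review. The cell below is a minimal sketch of one naive heuristic (grouping names by their last word), not something Media Cloud does for you. It assumes each result is a dict whose `tag` property holds the person's name; the helper `candidate_name_variants` and the sample records are hypothetical. Treat the output as a list of candidates to check by hand, since different people who share a surname will also be grouped together.

```python
from collections import defaultdict

def candidate_name_variants(results):
    """Group person names by their last word and keep groups with 2+ names.

    Assumes each result is a dict with a 'tag' key holding the name.
    This only suggests candidates for review; it does not merge anything.
    """
    groups = defaultdict(list)
    for r in results:
        tokens = r['tag'].split()
        if tokens:
            groups[tokens[-1].lower()].append(r['tag'])
    return {key: names for key, names in groups.items() if len(names) > 1}

# hypothetical sample records, shaped like entries from a tag count result
sample_people = [
    {'tag': 'Donald Trump', 'count': 90},
    {'tag': 'Trump', 'count': 40},
    {'tag': 'Greta Thunberg', 'count': 30},
]
variants = candidate_name_variants(sample_people)
variants
```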

Of course, these results may simply reflect who is mentioned most in news reporting overall. You might want to normalize them against results across all media. For instance, is Person X mentioned more often than you'd expect based on news reporting in general?
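
The normalization described above could be sketched as follows. This is an illustration under stated assumptions, not an official recipe: it assumes each result is a dict with `tag` and `count` properties, and that you have separately obtained the total number of stories matching the topic query and a baseline query (for example via a story count call). The helper names and all numbers below are made up for the example.

```python
def mention_rates(results, total_stories):
    """Map each tag to the fraction of stories that carry it."""
    return {r['tag']: r['count'] / total_stories for r in results}

def overrepresentation(topic_results, topic_total, baseline_results, baseline_total):
    """Ratio of a tag's rate in the topic to its rate in the baseline.

    A ratio above 1 means the tag appears more often in the topic than
    you'd expect from the baseline; tags missing from the baseline are skipped.
    """
    topic = mention_rates(topic_results, topic_total)
    baseline = mention_rates(baseline_results, baseline_total)
    return {
        tag: rate / baseline[tag]
        for tag, rate in topic.items()
        if baseline.get(tag, 0) > 0
    }

# made-up numbers purely for illustration
topic_results = [{'tag': 'Person X', 'count': 50}, {'tag': 'Person Y', 'count': 10}]
baseline_results = [{'tag': 'Person X', 'count': 1000}, {'tag': 'Person Y', 'count': 2000}]
ratios = overrepresentation(topic_results, 500, baseline_results, 100000)
ratios
```

In this made-up data, Person X shows up in a much larger share of topic stories than baseline stories, while Person Y is mentioned at about the same rate in both.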

## What Organizations are Being Mentioned?

In [None]:
# let's see which organizations are being mentioned most in stories about "climate change" in US national sources
import mediacloud.tags
results = mc.storyTagCount(us_query, date_range_2019, tag_sets_id=mediacloud.tags.TAG_SET_CLIFF_ORGS)
results[:3]

## What Places are Being Talked About?

We take the raw results of places mentioned, disambiguate them to specific locations based on heuristics, and then tag stories with the countries and states/provinces that are most mentioned in each story. So you can't search for all stories mentioning a small town, and probably shouldn't anyway because false positives will be frequent (e.g. which "London"?). You can search for stories about a country or a state/province. You can browse and download lists of the tags for every country and subdivision in our system [on our tag support page](https://mediacloud.org/support/list-of-tags).

In [None]:
# let's see which places are being mentioned most in stories about "climate change" in US national sources
import mediacloud.tags
results = mc.storyTagCount(us_query, date_range_2019, tag_sets_id=mediacloud.tags.TAG_SET_CLIFF_PLACES)
results[:3]

There are a few things worth noting in these results:

* The tag text (the `tag` property) includes the geonames id of the place. We use [Geonames.org](https://www.geonames.org) under the hood as our index of unique places. You can use this ID as a unique key for the place.
* The tag `description` captures some other useful geographic information about the place, like the 2-letter ISO abbreviation.
* There is a mix of countries and subdivisions in this tag set, so if you want just countries you'll have to filter for those. You can [download a list of all the country-level tags](https://mediacloud.org/s/imynpclbsboalbltpq9w0uvligjg5h) on our support page.
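
The country filtering mentioned above could look something like this. It is a minimal sketch that assumes the tag text has the form `geonames_<id>` and that you have built a set of country-level geonames ids yourself (for instance from the downloadable country list). The helpers `geonames_id` and `only_countries`, and the sample records, are hypothetical.

```python
def geonames_id(tag_text):
    """Extract the numeric geonames id from tag text like 'geonames_6252001'.

    Returns None if the text does not match the assumed format.
    """
    prefix = 'geonames_'
    if not tag_text.startswith(prefix):
        return None
    suffix = tag_text[len(prefix):]
    return int(suffix) if suffix.isdigit() else None

def only_countries(results, country_geonames_ids):
    """Keep only results whose geonames id is in the given set of country ids."""
    return [r for r in results if geonames_id(r['tag']) in country_geonames_ids]

# hypothetical sample records, shaped like entries from a place tag count
sample_places = [
    {'tag': 'geonames_6252001', 'label': 'United States', 'count': 120},
    {'tag': 'geonames_5332921', 'label': 'California', 'count': 45},
]
country_ids = {6252001}  # assumed to come from the country-level tag list
countries = only_countries(sample_places, country_ids)
[r['label'] for r in countries]
```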