Media Cloud: Studying Language
==============================

At this point you're ready to query Media Cloud for data. Queries use boolean syntax; [read our query guide](https://mediacloud.org/support/query-guide) for more details about the exact syntax (Media Cloud runs a [Solr search](https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html#the-standard-query-parser) under the hood). **This notebook demonstrates how to quickly evaluate the language used by media covering an issue**.

Looking at the attention paid to an issue is helpful, but understanding the particular framing requires deeper investigation into the language used when discussing that issue.

There are two API methods to support studying language, exposed via our Python API:

* `wordCount`: Returns a list of the top words used in a sample of stories matching your query ([documentation](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/api_2_0_spec.md#apiv2wclist))
* `storyWordMatrix`: Returns a sparse matrix of term use in each document ([documentation](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/api_2_0_spec.md#apiv2stories_publicword_matrix))

In [None]:
# Grab your API key from the environment variable and create a client for talking to Media Cloud
import os, mediacloud.api
from dotenv import load_dotenv
from IPython.display import JSON
load_dotenv()  # load config from .env file
mc = mediacloud.api.MediaCloud(os.getenv('MC_API_KEY'))  # read the key from the environment, not a literal string
mediacloud.__version__

## Simple Word Counts
Let's start by looking at how a single media source talks about an issue. This builds on the queries from the "Attention" notebook.

In [None]:
import datetime
my_query = '"climate change" AND media_id:2'  # boolean operators are conventionally uppercase in Solr syntax
date_range_2019 = mc.dates_as_query_clause(datetime.date(2019,1,1), datetime.date(2019,1,31)) # inclusive

In [None]:
results = mc.wordCount(my_query, date_range_2019)
results[:10]

A couple of things to notice here:
* Stemming: Words are stemmed by Solr before being counted. The term returned is the most used version of the stem in the sample.
* Sampling: We sample results to improve speed. The default sample size is 1,000 stories, but you can go up to 10,000 by specifying a `sample_size` in your call. Because each call draws a new sample, results can change between calls. In our experience, terms don't move more than a few positions up or down the ranking, even with a sample size of just 1,000.
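
If you want to check that stability claim on your own query, one option is to call `wordCount` twice and compare where each term lands. The helper below is a minimal sketch, not part of the Media Cloud client: `rank_shifts` is a hypothetical name, the two lists are made-up data standing in for two API responses, and it assumes each item is a dictionary with a `term` key.

```python
def rank_shifts(sample_a, sample_b, top_n=10):
    """Compare term positions between two word-count samples.

    Returns {term: shift} for the top_n terms of sample_a. A positive shift
    means the term sits further down the list in sample_b; None means the
    term does not appear in sample_b at all.
    """
    ranks_b = {item['term']: rank for rank, item in enumerate(sample_b)}
    shifts = {}
    for rank_a, item in enumerate(sample_a[:top_n]):
        rank_b = ranks_b.get(item['term'])
        shifts[item['term']] = None if rank_b is None else rank_b - rank_a
    return shifts

# Hypothetical data standing in for two separate wordCount() calls
first_run = [
    {'term': 'climate', 'stem': 'climat', 'count': 950},
    {'term': 'change', 'stem': 'chang', 'count': 900},
    {'term': 'emissions', 'stem': 'emiss', 'count': 310},
    {'term': 'energy', 'stem': 'energi', 'count': 280},
]
second_run = [
    {'term': 'climate', 'stem': 'climat', 'count': 940},
    {'term': 'change', 'stem': 'chang', 'count': 905},
    {'term': 'energy', 'stem': 'energi', 'count': 300},
    {'term': 'emissions', 'stem': 'emiss', 'count': 295},
]

shifts = rank_shifts(first_run, second_run)
```

With real data you would pass the two lists returned by `mc.wordCount(...)` in place of `first_run` and `second_run`.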

## Fetching a Term/Document Usage Matrix

If you are doing more specific analysis as part of an NLP pipeline, you can also fetch a sparse term/document usage matrix. This can feed steps such as TF-IDF weighting. Fetching a word matrix like this is *much, much faster* than downloading word counts for individual stories one at a time (people were trying to do that and struggling with the time required, which is what led to this API endpoint).

In [None]:
# check how many stories were about this issue
jan_2019 = mc.dates_as_query_clause(datetime.date(2019,1,1), datetime.date(2019,1,31))
story_count = mc.storyCount(my_query, jan_2019)['count']
story_count

In [None]:
# by default this matrix checks 1000 stories, but you can do up to 100,000 via the `rows` parameter
# NOTE: this can take a few minutes to return
doc_term_matrix = mc.storyWordMatrix(my_query, date_range_2019, rows=story_count)

In [None]:
JSON(doc_term_matrix)

These results are split into two items:

* `doc_term_matrix['word_list']`: an ordered `list` of the top words found, each item including the stem and the most common version of that stem
* `doc_term_matrix['word_matrix']`: a `dictionary` keyed by `stories_id`, with the value being a `dictionary` from word index to frequency of use
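
To make that structure concrete, here is a small sketch that turns one story's row into readable term counts. The `toy_matrix` data and the `counts_by_term` helper are hypothetical; the sketch assumes that both the story ids and the word indexes arrive as strings (as JSON object keys do) and that each `word_list` item is a `[stem, term]` pair.

```python
# Hypothetical miniature of the response structure described above
toy_matrix = {
    'word_list': [['climat', 'climate'], ['chang', 'change'], ['energi', 'energy']],
    'word_matrix': {
        '1001': {'0': 4, '1': 3},
        '1002': {'0': 1, '2': 5},
    },
}

def counts_by_term(matrix, stories_id):
    """Map one story's sparse row from word indexes to readable terms."""
    word_list = matrix['word_list']
    row = matrix['word_matrix'][str(stories_id)]
    # each key is a string index into word_list; item [1] is the common term
    return {word_list[int(index)][1]: count for index, count in row.items()}

story_counts = counts_by_term(toy_matrix, 1002)
```

Words that a story never uses are simply absent from its row, which is what makes the matrix sparse.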

In [None]:
# see how many documents a particular term is used in
term_index = 20
docs_including_term = [stories_id 
                       for stories_id, term_lookup in doc_term_matrix['word_matrix'].items()
                       if str(term_index) in term_lookup.keys()]
'{} docs include the term "{}"'.format(len(docs_including_term), doc_term_matrix['word_list'][term_index][1])
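
Counting the documents that contain a term is also the first step of TF-IDF weighting. The sketch below shows one common formulation (term frequency normalized by story length, multiplied by the log of inverse document frequency), computed directly from the matrix structure. It is an illustration under assumptions rather than a Media Cloud feature: `tf_idf` and `toy_matrix` are hypothetical names, and the data is made up.

```python
import math

# Hypothetical miniature of a storyWordMatrix response
toy_matrix = {
    'word_list': [['climat', 'climate'], ['chang', 'change'], ['energi', 'energy']],
    'word_matrix': {
        '1001': {'0': 4, '1': 3},
        '1002': {'0': 1, '2': 5},
    },
}

def tf_idf(matrix):
    """Return {stories_id: {word_index: tf-idf score}} for a word matrix."""
    word_matrix = matrix['word_matrix']
    n_docs = len(word_matrix)

    # document frequency: how many stories use each word index
    doc_freq = {}
    for row in word_matrix.values():
        for index in row:
            doc_freq[index] = doc_freq.get(index, 0) + 1

    scores = {}
    for stories_id, row in word_matrix.items():
        total_words = sum(row.values())
        scores[stories_id] = {
            index: (count / total_words) * math.log(n_docs / doc_freq[index])
            for index, count in row.items()
        }
    return scores

scores = tf_idf(toy_matrix)
```

A word that appears in every story gets a score of zero, so the terms that remain are the ones that distinguish a story from the rest of the sample.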

## Categorizing Stories by Theme


To support some of our projects we've built a system that labels stories with the "themes" they are about. To build these models, we took a transfer learning approach: starting with the [Google News word2vec models](https://code.google.com/archive/p/word2vec/) and then adapting them to produce labels based on the [New York Times annotated corpus](https://catalog.ldc.upenn.edu/ldc2008t19). We score each story against the 600 most common descriptors from the NYT corpus. Any descriptor that scores above 0.2 probability is listed as a theme for the story. For more technical details about the development of this classifier, please see [Yasmin Rubinovitz's MS thesis (section 4.3)](https://dspace.mit.edu/handle/1721.1/112544). You can [browse and download a list of these themes on our website's support page](https://mediacloud.org/support/theme-list).

All English-language stories are tagged with the themes that our model thinks they are about (any label with a score > 0.2). We use this to compare coverage at a high level between studies, but don't trust it as an authoritative statement of what each story is about. These labels don't adapt to current events, so we don't expect them to age well for more recent events.
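
The thresholding rule is simple enough to sketch. The snippet below is only an illustration of the "score > 0.2" cutoff: `themes_for_story` is a hypothetical helper, the descriptor names and scores are invented, and the real tagging happens inside Media Cloud, not in your notebook.

```python
THEME_THRESHOLD = 0.2  # cutoff described above

def themes_for_story(descriptor_scores, threshold=THEME_THRESHOLD):
    """Return descriptors scoring strictly above the threshold, best first."""
    kept = [name for name, score in descriptor_scores.items() if score > threshold]
    return sorted(kept, key=lambda name: descriptor_scores[name], reverse=True)

# Invented scores for a single story
example_scores = {
    'weather': 0.35,
    'global warming': 0.81,
    'politics and government': 0.18,
    'energy and power': 0.2,
}

themes = themes_for_story(example_scores)
```

Note that a descriptor scoring exactly 0.2 is dropped in this sketch, because the comparison is strict.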

In [None]:
# check the top themes in coverage of "climate change"
import mediacloud.tags
results = mc.storyTagCount(my_query, date_range_2019, tag_sets_id=mediacloud.tags.TAG_SET_NYT_THEMES)
JSON(results)

The "entities" notebook has more background on the above `storyTagCount` API call.