Media Cloud: Studying Language
==============================

At this point you're ready to query Media Cloud for data. You can use Boolean query syntax; [read our query guide](https://mediacloud.org/support/query-guide) for more details about the exact syntax (Media Cloud runs a [Solr search](https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html#the-standard-query-parser) under the hood). **This notebook demonstrates how to quickly evaluate the language used by media covering an issue**.

Looking at the attention paid to an issue is helpful, but understanding how that issue is framed requires a deeper investigation into the language used when discussing it.

There are two API methods to support studying language, exposed via our Python API:

* `wordCount`: Returns a list of the top words used in a sample of stories matching your query ([documentation](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/api_2_0_spec.md#apiv2wclist))
* `storyWordMatrix`: Returns a sparse matrix of term use in each document ([documentation](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/api_2_0_spec.md#apiv2stories_publicword_matrix))

In [None]:
# Grab your API key from the environment variable and create a client for talking to Media Cloud
import os, mediacloud.api
from dotenv import load_dotenv
from IPython.display import JSON
load_dotenv()  # load config from .env file
mc = mediacloud.api.MediaCloud(os.getenv('MC_API_KEY'))
mediacloud.__version__

## Simple Word Counts
Let's start by looking at how a single media source talks about an issue. This builds on the queries from the "Attention" notebook.

In [None]:
import datetime
my_query = '"climate change" AND media_id:2'  # Boolean operators must be uppercase in Solr queries
start_date = datetime.date(2019,1,1)
end_date = datetime.date(2020,1,1)
date_range_2019 = mc.publish_date_query(start_date, end_date) # default is start inclusive, end exclusive

In [None]:
results = mc.wordCount(my_query, date_range_2019)
results[:10]

A couple of things to notice here:
* Stemming: Words are stemmed by Solr before being counted. The term returned is the most commonly used version of the stem in the sample.
* Sampling: We sample results to improve speed. The default sample size is 1,000 stories, but you can go up to 10,000 by specifying a `sample_size` in your call. Because of sampling, results can change between calls. In our experience, terms don't shift more than a few ranks up or down, even with a sample size of just 1,000.
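
If you want to check how much sampling moves terms around for your own query, you can compare the rankings from two calls. The sketch below is only an illustration: `rank_shifts` is a hypothetical helper (not part of the Media Cloud client), and the two lists are hand-made stand-ins that assume each result item is a dictionary with `term`, `stem`, and `count` keys.

```python
def rank_shifts(sample_a, sample_b, top_n=10):
    """Map each of the top_n terms in sample_a to how far its rank moved in sample_b."""
    ranks_b = {item['term']: rank for rank, item in enumerate(sample_b)}
    shifts = {}
    for rank_a, item in enumerate(sample_a[:top_n]):
        term = item['term']
        if term in ranks_b:  # terms missing from sample_b are skipped
            shifts[term] = ranks_b[term] - rank_a
    return shifts

# Hypothetical, hand-made data standing in for two separate wordCount calls
first_call = [
    {'term': 'climate', 'stem': 'climat', 'count': 950},
    {'term': 'change', 'stem': 'chang', 'count': 900},
    {'term': 'emissions', 'stem': 'emiss', 'count': 310},
    {'term': 'carbon', 'stem': 'carbon', 'count': 290},
]
second_call = [
    {'term': 'climate', 'stem': 'climat', 'count': 960},
    {'term': 'change', 'stem': 'chang', 'count': 905},
    {'term': 'carbon', 'stem': 'carbon', 'count': 300},
    {'term': 'emissions', 'stem': 'emiss', 'count': 295},
]

shifts = rank_shifts(first_call, second_call)
max_shift = max(abs(shift) for shift in shifts.values())
print(shifts, max_shift)
```

In the notebook you would replace the two hand-made lists with the results of two `mc.wordCount(my_query, date_range_2019)` calls.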

## Categorizing Stories by Theme


In [None]:
# TODO

## Fetching a Term/Document Usage Matrix

If you are doing more specific analysis as part of an NLP pipeline, you can also fetch a sparse term/document usage matrix. This can feed steps such as TF-IDF weighting. We don't use this very much, but it is there in case you need it.

In [None]:
# check how many stories were about this issue
story_count = mc.storyCount(my_query, date_range_2019)['count']
story_count

In [None]:
# by default this matrix checks 1000 stories, but you can do up to 100,000 via the `rows` parameter
# NOTE: this can take a few minutes to return
doc_term_matrix = mc.storyWordMatrix(my_query, date_range_2019, rows=min(story_count, 100000))  # stay within the 100,000 row limit

In [None]:
JSON(doc_term_matrix)

These results are split into two items:

* `doc_term_matrix['word_list']`: an ordered `list` of the top words found, each item including the stem and the most common version of that stem
* `doc_term_matrix['word_matrix']`: a `dictionary` keyed by `stories_id`, with the value being a `dictionary` from word index to frequency of use
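
To make the two-part structure concrete, here is a minimal sketch that looks up the most frequent terms in one story. It runs on a small hand-made dictionary shaped like the description above, not on real API output. `top_terms_for_story` is a hypothetical helper, and the sketch assumes that both the story IDs and the word indexes arrive as string keys (as JSON object keys do) and that each `word_list` item is a `[stem, term]` pair.

```python
def top_terms_for_story(matrix, stories_id, n=3):
    """Return the n most frequent (term, count) pairs for a single story."""
    term_counts = matrix['word_matrix'][str(stories_id)]
    ranked = sorted(term_counts.items(), key=lambda pair: pair[1], reverse=True)
    # translate each word index back into the readable term from word_list
    return [(matrix['word_list'][int(index)][1], count) for index, count in ranked[:n]]

# Hypothetical, hand-made data in the same shape as the matrix results
toy_matrix = {
    'word_list': [['climat', 'climate'], ['chang', 'change'], ['carbon', 'carbon']],
    'word_matrix': {
        '101': {'0': 3, '1': 1},
        '102': {'0': 2, '2': 5},
    },
}

top_terms = top_terms_for_story(toy_matrix, 102, n=2)
print(top_terms)
```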

In [None]:
# see how many documents a particular term is used in
term_index = 20
docs_including_term = [stories_id 
                       for stories_id, term_lookup in doc_term_matrix['word_matrix'].items()
                       if str(term_index) in term_lookup]  # word indexes are string keys
'{} docs include the term "{}"'.format(len(docs_including_term), doc_term_matrix['word_list'][term_index][1])
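
Since the matrix is often a starting point for TF-IDF, here is a rough sketch of one common TF-IDF variant (term frequency normalized by story length, multiplied by the natural log of inverse document frequency) computed directly on a `word_matrix`-shaped dictionary. The `tf_idf` function and the toy data are hypothetical illustrations; other weighting schemes are equally valid, and a library such as scikit-learn may suit larger matrices better.

```python
import math

def tf_idf(word_matrix):
    """Compute TF-IDF weights from a {stories_id: {word_index: count}} mapping."""
    n_docs = len(word_matrix)
    # document frequency: how many stories each word index appears in
    doc_freq = {}
    for term_counts in word_matrix.values():
        for word_index in term_counts:
            doc_freq[word_index] = doc_freq.get(word_index, 0) + 1
    weights = {}
    for stories_id, term_counts in word_matrix.items():
        total = sum(term_counts.values())  # total counted words in this story
        weights[stories_id] = {
            word_index: (count / total) * math.log(n_docs / doc_freq[word_index])
            for word_index, count in term_counts.items()
        }
    return weights

# Hypothetical, hand-made data shaped like doc_term_matrix['word_matrix']
toy_word_matrix = {
    '101': {'0': 3, '1': 1},
    '102': {'0': 2, '2': 2},
}

weights = tf_idf(toy_word_matrix)
print(weights)
```

A word that appears in every story (index `'0'` here) gets a weight of zero, which is the point of the weighting: it highlights terms that distinguish one story from the rest. In the notebook you would pass `doc_term_matrix['word_matrix']` instead of the toy data.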