Media Cloud: Studying Language
==============================

At this point you're ready to query Media Cloud for data. You can use boolean query syntax - [read our query guide](https://www.mediacloud.org/documentation/query-guide) for more details. **This notebook demonstrates how to quickly evaluate the language used by media covering an issue**.

Looking at the attention paid to an issue is helpful, but understanding the particular framing requires deeper investigation into the language used when discussing that issue.

There are two API methods to support studying language, exposed via our Python API:

* `words`: Returns top terms from a sample of stories matching your query
* `languages`: Returns lists of the top languages used in stories matching your query

In [None]:
# Set up your API key and import needed things
import os, mediacloud.api
from importlib.metadata import version
from dotenv import load_dotenv
import datetime as dt
from IPython.display import JSON
import bokeh.io
bokeh.io.reset_output()
bokeh.io.output_notebook()
MC_API_KEY = 'MY_API_KEY'  # placeholder: replace with your own Media Cloud API key
search_api = mediacloud.api.SearchApi(MC_API_KEY)
f'Using Media Cloud python client v{version("mediacloud")}'

## Simple Word Counts
Let's start by looking at how a single media source talks about an issue. This builds on the queries from the "Attention" notebook.

In [None]:
# find the top words in stories that include the phrase "climate change" in the Washington Post (media id #2)
my_query = '"climate change"' # note the double quotes used to indicate use of the whole phrase
start_date = dt.date(2023, 11, 1)
end_date = dt.date(2023, 12, 1)
sources = [2]
results = search_api.words(my_query, start_date, end_date, source_ids=sources)
JSON(results)

A couple of things to notice here:
* Stemming: Words are stemmed by Elasticsearch before being counted. The term returned for each stem is its most frequently used form in the sample.
* Sampling: (coming soon...)
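
To make the stemming note concrete, here is a minimal sketch of counting words by stem and labeling each stem with its most common form. It is an illustration only: `toy_stem` is a hypothetical, very crude suffix stripper and not the stemmer Elasticsearch uses, and the `term` and `count` keys are chosen for this example rather than taken from the API response.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def top_terms(words):
    """Count words by stem, labeling each stem with its most common form."""
    forms_by_stem = defaultdict(Counter)
    for word in words:
        forms_by_stem[toy_stem(word)][word] += 1
    rows = []
    for forms in forms_by_stem.values():
        label, _ = forms.most_common(1)[0]  # most frequent surface form
        rows.append({"term": label, "count": sum(forms.values())})
    return sorted(rows, key=lambda row: row["count"], reverse=True)


# Made-up sample words, not real Media Cloud output
sample = ["emissions", "emission", "emissions",
          "warming", "warm", "warming", "warming", "policy"]
terms = top_terms(sample)
```

In this toy sample, "warm" and "warming" share one stem and are reported under "warming" because that form occurs most often.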

## Languages Used

We try to detect the language of every story in our system by using the [py3langid module](https://pypi.org/project/py3langid/). You can use this to see which languages are most used in stories, and to filter for stories in specific languages.

In [None]:
# See top languages used in articles
INDIA_NATIONAL = 34412118
results = search_api.languages('*', start_date, end_date, collection_ids=[INDIA_NATIONAL])
JSON(results)

In [None]:
# Retrieve latest stories in Hindi
page, _ = search_api.story_list('* AND language:hi', start_date, end_date, collection_ids=[INDIA_NATIONAL])
page[:3]
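
The second value returned by `story_list` (discarded as `_` above) can be used to request further pages. The sketch below shows one way to collect several pages. `fetch_all` and `fake_fetch` are hypothetical helpers written for this example, and it assumes the second return value is a token that is `None` once there are no more pages.

```python
def fetch_all(fetch_page, max_pages=5):
    """Collect stories page by page until no token comes back or max_pages is hit."""
    stories = []
    token = None
    for _ in range(max_pages):
        page, token = fetch_page(token)
        stories.extend(page)
        if token is None:  # assumption: None means there are no more pages
            break
    return stories


def fake_fetch(token):
    """Stand-in for an API call so the sketch runs without a network connection."""
    pages = {None: ([{"id": 1}, {"id": 2}], "t1"), "t1": ([{"id": 3}], None)}
    return pages[token]


all_stories = fetch_all(fake_fetch)
```

With the real client, `fetch_page` could be a small lambda that wraps `search_api.story_list` and passes the token along, assuming the client accepts it as a `pagination_token` keyword argument. Check the client documentation before relying on that name. The `max_pages` cap keeps a broad query from pulling an unexpectedly large number of stories.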