Media Cloud: Measuring Attention
================================

At this point you should be ready to query Media Cloud for data. You can use boolean query syntax - [read our query guide](https://mediacloud.org/support/query-guide) for more details about the exact syntax (it runs an [ElasticSearch search](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html) under the hood). **This notebook demonstrates how to quickly measure attention paid to an issue by the media**.

Studying media attention is critical for understanding how much readers are exposed to an issue, and has a long tradition. Media Cloud supports investigating attention within individual sources we track, or within collections of sources. We have wide global coverage with both national-level and regional/state-level collections for most countries. You can [browse our geographic collections](https://search.mediacloud.org/collections/news/geographic) to see more.

Our Python API exposes four methods that are particularly helpful for studying attention:

* `story_count`: return the total number of stories in our database matching your query
* `story_sample`: return a sample of stories matching your query (up to 1000)
* `story_count_over_time`: return the total number of stories in our database matching your query _by day_
* `story_list`: page through the actual stories that match your query

In [6]:
# Set up your API key and import needed things
import os, mediacloud.api
import pandas as pd
from importlib.metadata import version
from dotenv import load_dotenv
import datetime as dt
from IPython.display import JSON
import bokeh.io
bokeh.io.reset_output()
bokeh.io.output_notebook()
MC_API_KEY = 'MY_KEY'
search_api = mediacloud.api.SearchApi(MC_API_KEY)
f'Using Media Cloud python client v{version("mediacloud")}'

'Using Media Cloud python client v4.4.0'

## Attention from a Single Media Source
You can start by looking at attention from a single media source to a topic you are interested in. We have almost a million media sources in our system, but only about 100,000 of them are ones that we regularly collect stories from, via RSS feeds or more recently from their sitemaps. You can get the internal id number for any source by searching for it in our [Directory](https://search.mediacloud.org/directory) and noting the ID number just under the large title on that page.

In [7]:
# check how many stories include the phrase "climate change" in the Washington Post (media id #2)
my_query = '"climate change"' # note the double quotes used to indicate use of the whole phrase
start_date = dt.date(2023, 11, 1)
end_date = dt.date(2023, 12,1)
sources = [2]
search_api.story_count(my_query, start_date, end_date, source_ids=sources)

{'relevant': 196, 'total': 5779}
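The query above is a single quoted phrase. To widen a search you can join several phrases with boolean operators. The helper below is a minimal sketch and not part of the Media Cloud client: `any_of_phrases` is a hypothetical name, and it assumes the uppercase `OR` operator and double-quoted phrases described in the query guide.

```python
def any_of_phrases(phrases):
    """Build a query string matching stories that contain ANY of the given phrases."""
    # drop empty entries and strip stray whitespace
    cleaned = [p.strip() for p in phrases if p and p.strip()]
    if not cleaned:
        raise ValueError("need at least one non-empty phrase")
    # remove embedded double quotes, then quote each phrase so it matches as a whole
    quoted = ['"{}"'.format(p.replace('"', '')) for p in cleaned]
    return " OR ".join(quoted)


broader_query = any_of_phrases(["climate change", "global warming"])
print(broader_query)  # "climate change" OR "global warming"
```

The resulting string can be passed as the first argument to `story_count` in place of `my_query`.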

In [8]:
# you can see this count by day as well
results = search_api.story_count_over_time(my_query, start_date, end_date, source_ids=sources)
results[0:3] # just show the first 3 days

[{'date': datetime.date(2023, 11, 1),
  'total_count': 211,
  'count': 7,
  'ratio': 0.03317535545023697},
 {'date': datetime.date(2023, 11, 2),
  'total_count': 214,
  'count': 8,
  'ratio': 0.037383177570093455},
 {'date': datetime.date(2023, 11, 3),
  'total_count': 164,
  'count': 6,
  'ratio': 0.036585365853658534}]
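Each daily entry carries the matching `count`, the day's `total_count`, and their `ratio`. As a small sketch of rolling those rows up, the snippet below computes an overall ratio and the peak day. `summarize_daily` is a hypothetical helper, and the sample rows are just the three days printed above rather than the full month.

```python
import datetime


def summarize_daily(rows):
    """Aggregate daily rows into overall counts and find the day with the highest ratio."""
    if not rows:
        return None
    matched = sum(r['count'] for r in rows)
    total = sum(r['total_count'] for r in rows)
    peak = max(rows, key=lambda r: r['ratio'])
    return {
        'matched': matched,
        'total': total,
        # divide the summed counts; averaging daily ratios would over-weight slow news days
        'ratio': matched / total if total else 0.0,
        'peak_date': peak['date'],
    }


sample_rows = [
    {'date': datetime.date(2023, 11, 1), 'total_count': 211, 'count': 7, 'ratio': 0.03317535545023697},
    {'date': datetime.date(2023, 11, 2), 'total_count': 214, 'count': 8, 'ratio': 0.037383177570093455},
    {'date': datetime.date(2023, 11, 3), 'total_count': 164, 'count': 6, 'ratio': 0.036585365853658534},
]
summary = summarize_daily(sample_rows)
print(summary['matched'], summary['total'], summary['peak_date'])
```

Passing the full `results` list instead of `sample_rows` would summarize the whole date range.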

In [9]:
# and you can chart attention over time with some simple notebook work (using Bokeh here)
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
df = pd.DataFrame.from_dict(results)
df['date']= pd.to_datetime(df['date'])
source = ColumnDataSource(df)
p = figure(x_axis_type="datetime", width=900, height=250)
p.line(x='date', y='count', line_width=2, source=source)  # you could use `ratio` instead of `count` to see normalized attention
show(p)

### Normalizing within a Source

Looking at absolute attention at the story level is intriguing, but you probably want to normalize this in some way to support comparisons between sources. To do this, we typically compare attention to the total number of stories we have from a source within that same timespan. That's why the `story_count` endpoint returns two numbers:
 * `relevant` is the number that matched your search with all the conditions
 * `total` is the number that matched your search without the query terms
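As a minimal sketch of that normalization, the hypothetical helper `attention_ratio` below divides `relevant` by `total` and guards against a source that published nothing in the timespan. The sample counts are the Washington Post result printed earlier in this notebook.

```python
def attention_ratio(counts):
    """Return relevant / total, or None when there are no stories to normalize against."""
    total = counts['total']
    if total == 0:
        return None
    return counts['relevant'] / total


wapo_counts = {'relevant': 196, 'total': 5779}  # story_count output for source id 2 shown above
wapo_ratio = attention_ratio(wapo_counts)
print('{:.2%}'.format(wapo_ratio))  # 3.39%
```

Returning `None` instead of raising is a design choice for this sketch; it makes it easy to skip inactive sources when looping over many of them.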

In [10]:
# pass source_ids so the counts cover only the Washington Post, not every source in the database
results = search_api.story_count(my_query, start_date, end_date, source_ids=sources)
source_ratio = results['relevant'] / results['total']
'{:.2%} of the Washington Post stories were about "climate change"'.format(source_ratio)

'3.39% of the Washington Post stories were about "climate change"'

## Research Within a Country - using collections

[We have wide global coverage](https://sources.mediacloud.org/#/collections/country-and-state), with sources published in a country grouped into collections. For many of these countries we also have collections of media sources published in the various states and provinces. Let's compare the source-level attention to country-level attention.

In [11]:
# check in our collection of country-level US National media sources
my_query = '"climate change"'
US_NATIONAL_COLLECTION = 34412234
results = search_api.story_count(my_query, start_date, end_date, collection_ids=[US_NATIONAL_COLLECTION])
us_country_ratio = results['relevant'] / results['total']
'{:.2%} of stories from national-level US media sources mentioned "climate change"'.format(us_country_ratio)

'2.09% of stories from national-level US media sources mentioned "climate change"'

In [12]:
# now we can compare this to the source-level coverage
coverage_ratio = source_ratio / us_country_ratio
'"climate change" received {:.2} times more coverage in WashPo than you might expect based on other US national papers'.format(coverage_ratio)

'"climate change" received 1.6 times more coverage in WashPo than you might expect based on other US national papers'

In [14]:
# or compare to another country (India in this case)
INDIA_NATIONAL = 34412118
results = search_api.story_count('"climate change"', start_date, end_date, collection_ids=[INDIA_NATIONAL])
india_country_ratio = results['relevant'] / results['total']
'{:.2%} of stories from national-level Indian media sources in 2023 mentioned "climate change"'.format(india_country_ratio)

'0.61% of stories from national-level Indian media sources in 2023 mentioned "climate change"'

In [15]:
coverage_ratio =  1 / (india_country_ratio / us_country_ratio)
'at the national level "climate change" is covered {:.2} times less in India than the US'.format(coverage_ratio)

'at the national level "climate change" is covered 3.4 times less in India than the US'
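Comparisons like the ones above read more naturally when the larger ratio is always divided by the smaller one. The sketch below does that with a hypothetical helper, `relative_attention`; the two input values are the rounded percentages printed above (0.61% for India, 2.09% for the US), not the unrounded API results.

```python
def relative_attention(ratio_a, ratio_b):
    """Describe attention in A relative to B as a (factor, direction) pair."""
    if ratio_a <= 0 or ratio_b <= 0:
        raise ValueError("both ratios must be positive to compare them")
    if ratio_a >= ratio_b:
        return ratio_a / ratio_b, 'more'
    return ratio_b / ratio_a, 'less'


# rounded ratios taken from the printed outputs above
factor, direction = relative_attention(0.0061, 0.0209)
print('{:.2} times {} coverage in India than in the US'.format(factor, direction))
```

With these inputs the factor comes out at roughly 3.4 in the "less" direction, matching the result above.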

## Listing Stories

Story counts are fine, but often what you really want is the stories themselves. Note that **we cannot provide story content** due to copyright restrictions. However, you can get a list of all the URLs and then fetch them yourself. We can also return word counts down to the story level (see the "language" notebook for more info on that).

In [17]:
# show a sample of stories matching the query
stories, _ = search_api.story_list(my_query, start_date, end_date)
df = pd.DataFrame(stories)
df.head()

| | id | indexed_date | language | media_name | media_url | publish_date | title | url |
|---|---|---|---|---|---|---|---|---|
| 0 | 8bf6e792e4f682fca5b9c04e0daa3a5222ca707103b5ad... | 2025-05-27 22:50:20.386867+00:00 | en | pew.org | pew.org | 2023-11-09 | Plastics Disclosure and Reporting | https://www.pew.org/en/research-and-analysis/a... |
| 1 | 11940646c2d514bf5f18508db12fc9d5e7b0def3ada455... | 2025-05-27 05:22:24.344245+00:00 | en | vonradio.com | vonradio.com | 2023-11-29 | Semi Airborne Drone Geophysical Survey current... | https://vonradio.com/semi-airborne-drone-geoph... |
| 2 | a3eaf4575710e514004ae9ba5a6d27cd53daa5d779f146... | 2025-04-21 21:51:09.506533+00:00 | en | danielmiessler.com | danielmiessler.com | 2023-11-15 | UL NO. 408: OpenAI Coup Theory, SEC vs. SolarW... | https://danielmiessler.com/blog/ul-408 |
| 3 | 609552556d0b0fbf8fff3aaaf346bea204c4983779f497... | 2025-03-29 06:17:27.800166+00:00 | en | skyscrapercity.com | skyscrapercity.com | 2023-11-20 | Puri \| Shree Jagannath International Airport | https://www.skyscrapercity.com/threads/puri-sh... |
| 4 | a5592341aeea278407d850d0a9896d392f526f8b73f345... | 2025-03-28 00:17:40.430025+00:00 | en | nation.lk | nation.lk | 2023-11-07 | Australian PM Calls General Election For May 3 | https://www.nation.lk/online/australian-pm-cal... |


In [18]:
# or use the listing feature to grab the most recent stories about this issue
stories, _ = search_api.story_list(my_query, start_date, end_date)
stories[:3] # list first 3

[{'id': '8bf6e792e4f682fca5b9c04e0daa3a5222ca707103b5adcd1c9133cfa4209a74',
  'indexed_date': datetime.datetime(2025, 5, 27, 22, 50, 20, 386867, tzinfo=datetime.timezone.utc),
  'language': 'en',
  'media_name': 'pew.org',
  'media_url': 'pew.org',
  'publish_date': datetime.date(2023, 11, 9),
  'title': 'Plastics Disclosure and Reporting',
  'url': 'https://www.pew.org/en/research-and-analysis/articles/2023/11/09/plastics-disclosure-and-reporting'},
 {'id': '11940646c2d514bf5f18508db12fc9d5e7b0def3ada4550e2893892781ee906c',
  'indexed_date': datetime.datetime(2025, 5, 27, 5, 22, 24, 344245, tzinfo=datetime.timezone.utc),
  'language': 'en',
  'media_name': 'vonradio.com',
  'media_url': 'vonradio.com',
  'publish_date': datetime.date(2023, 11, 29),
  'title': 'Semi Airborne Drone Geophysical Survey currently taking place in St. Kitts',
  'url': 'https://vonradio.com/semi-airborne-drone-geophysical-survey-currently-taking-place-in-st-kitts/?utm_source=rss&utm_medium=rss&utm_campaign=se

If you want to list ALL the stories matching, you need to page through the results. This is accomplished via the `pagination_token` parameter. This code snippet pages through all the stories in a query.

In [19]:
# let's fetch all the stories matching our query on one day
all_stories = []
more_stories = True
pagination_token = None
while more_stories:
    page, pagination_token = search_api.story_list(my_query, dt.date(2023,11,29), dt.date(2023,11,30),
                                                   collection_ids=[US_NATIONAL_COLLECTION],
                                                   pagination_token=pagination_token)
    all_stories += page
    more_stories = pagination_token is not None
len(all_stories)

634

As you may have noted, this can take a while for long time periods. If you look closely you'll notice that it can't be easily parallelized, because each call needs the pagination token returned by the previous one. A workaround is to divide your query up by time and query in parallel, for example with one series of requests per day. This can speed up the response. Also **just contact us directly if you are trying to do larger data dumps, or run up against your API quota**.
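The day-by-day workaround can be sketched roughly as follows. This is an illustration under stated assumptions, not a tested recipe: `daily_windows`, `fetch_window`, and `fetch_in_parallel` are hypothetical helpers, the code assumes the client object can safely be shared across threads, and it de-duplicates by `id` because this notebook does not state whether the end date is inclusive. Only `story_list` and its `pagination_token` parameter come from the examples above.

```python
import datetime as dt
from concurrent.futures import ThreadPoolExecutor


def daily_windows(start_date, end_date):
    """Split a date range into (day, next_day) pairs, like the one-day query above."""
    windows = []
    day = start_date
    while day < end_date:
        windows.append((day, day + dt.timedelta(days=1)))
        day += dt.timedelta(days=1)
    return windows


def fetch_window(search_api, query, window, **filters):
    """Page through every story in one window; pages within a window stay sequential."""
    start, end = window
    stories, token = [], None
    while True:
        page, token = search_api.story_list(query, start, end,
                                            pagination_token=token, **filters)
        stories += page
        if token is None:
            return stories


def fetch_in_parallel(search_api, query, start_date, end_date, max_workers=4, **filters):
    """Fetch each day in its own thread, then merge the results."""
    windows = daily_windows(start_date, end_date)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_day = list(pool.map(
            lambda w: fetch_window(search_api, query, w, **filters), windows))
    # assumption: adjacent windows may overlap on the boundary day, so keep one copy per id
    unique = {}
    for stories in per_day:
        for story in stories:
            unique[story['id']] = story
    return list(unique.values())
```

A call might look like `fetch_in_parallel(search_api, my_query, start_date, end_date, collection_ids=[US_NATIONAL_COLLECTION])`. Keep `max_workers` small, since every thread draws on the same API quota.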

### Writing a CSV of Story Data

What you probably want is a CSV of all this story data. Here's a quick example of dumping that data to a CSV (like our Search tool does).

In [20]:
import csv
fieldnames = ['id', 'publish_date', 'title', 'url', 'language', 'media_name', 'media_url', 'indexed_date']
with open('story-list.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
    writer.writeheader()
    for s in all_stories:
        writer.writerow(s)

In [21]:
# and let's make sure it worked by loading it up as a pandas DataFrame
df = pd.read_csv('story-list.csv')
df.head()

| | id | publish_date | title | url | language | media_name | media_url | indexed_date |
|---|---|---|---|---|---|---|---|---|
| 0 | 628be4448de8fc2b7ffe03d08ff5347d5f74b21c89c7dd... | 2023-11-29 | Blinken in Brussels, Biden in Colorado, COP28 ... | https://www.npr.org/2023/11/29/1198909456/up-f... | en | npr.org | npr.org | 2024-05-30 20:28:23.675812+00:00 |
| 1 | 64be6e087a79eb8fff5f252185d33eb3904d2e70a9cd69... | 2023-11-29 | Native forest logging ban in Tasmania could sa... | https://www.theguardian.com/australia-news/202... | en | theguardian.com | theguardian.com | 2024-02-15 17:20:59.241395+00:00 |
| 2 | 6df7f40e5686633c1dd4b8677be1e0917bd709f60ebb06... | 2023-11-29 | Sustainable Aviation Fuel Market worth $16.8 b... | https://www.benzinga.com/pressreleases/23/11/n... | en | benzinga.com | benzinga.com | 2024-02-15 17:02:56.664008+00:00 |
| 3 | 02c61b2c7a9ad452080a8e337f436d04c8bf10ea6f2722... | 2023-11-29 | 'Squad' divisions arise after Israel vote, BLM... | https://www.foxnews.com/us/squad-divisions-ari... | en | foxnews.com | foxnews.com | 2024-02-15 17:02:43.285905+00:00 |
| 4 | 0af0ae37a84a1cdc8d2d2acae4b282fee12ff3c8424b85... | 2023-11-29 | Here's How Much Regulation Costs the Average A... | https://www.dailysignal.com/2023/11/29/hidden-... | en | dailysignal.com | dailysignal.com | 2024-02-15 17:00:44.612355+00:00 |


## Top Media Sources

Attention within a collection is useful to know, and to compare across collections. We also offer the ability to see the media that had the most stories matching your search.

In [22]:
# List media producing the most stories matching the search
INDIA_NATIONAL = 34412118
results = search_api.sources('"climate change"', start_date, end_date, collection_ids=[INDIA_NATIONAL])
JSON(results)

<IPython.core.display.JSON object>