Media Cloud: Sources and Collections
====================================

At this point you should be ready to query Media Cloud for data. **This notebook demonstrates how to browse and download information about the media sources and collections within the Media Cloud Directory**. It explores some of the API methods under the hood of our [Directory](https://search.mediacloud.org/directory), which is used to browse sources and collections in our system.

Every open web news story is connected to a `media source`. A media source is basically a domain (with a handful of exceptions). Sources are grouped together into `collections`; one source can be part of many collections, and one collection can contain many sources. Our primary collections are [geography-based](https://search.mediacloud.org/collections/news/geographic) (at the national and province/state level).

We regularly scrape RSS feeds from a small set of our sources (around 60k as of late 2023). We are slowly rolling out the ability to ingest stories from news sites via their sitemap files (the hard part is determining which URLs are news story pages and which are not). We don't advise searching against our entire database without specifying collections or sources, because it is skewed towards the topics of investigations we and collaborating researchers have done. If you have a collection you want created, just email us and ask.

Our Python API exposes a few methods that are particularly helpful for looking at sources, their associated metadata, and collections: 

* `collection_list`: search for collections by name
* `source_list`: search for sources by name or by parent collection
* `source`: get information about a specific source, by id
* `feed_list`: list feeds within a source

## Setup

In [1]:
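# Set up your API key and import needed things
import os
import mediacloud.api
from importlib.metadata import version
from dotenv import load_dotenv
from IPython.display import JSON

load_dotenv()  # read variables from a local .env file, if one exists
# assumes the key is stored in an MC_API_KEY environment variable; otherwise paste it in below
MC_API_KEY = os.getenv('MC_API_KEY', 'MY_API_KEY')
directory_api = mediacloud.api.DirectoryApi(MC_API_KEY)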
f'Using Media Cloud python client v{version("mediacloud")}'

'Using Media Cloud python client v4.4.0'

## Searching for Media Sources

You can search for specific media, or media matching a set of criteria.

In [2]:
# try to find a media source based on its URL
matching_sources = directory_api.source_list(name='hindustantimes.com')
JSON(matching_sources['results'][0])

<IPython.core.display.JSON object>

In [3]:
# or if you have an id already, you can just load the source info
source = directory_api.source(21511)
JSON(source) # all about the Boston Globe source

<IPython.core.display.JSON object>

## Media Source Feeds
Media Sources are created manually, or automatically generated by our system when a story is ingested from a domain we have not seen before. For the limited number of sources that we ingest from daily, we have manually and automatically created RSS feeds (see our [`feed_seeker` package](https://github.com/mediacloud/feed_seeker)).
```
media source
  ↳ feed
```

In [None]:
# learn about the first result from above, which is our canonical one for the Hindustan Times
matching_sources = directory_api.source_list(name='hindustantimes.com')
hindustan_times = matching_sources['results'][0]
JSON(hindustan_times)
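
Taking `results[0]` relies on the canonical source being returned first. A slightly more defensive sketch is shown below. It assumes each result is a dict whose `name` field holds the domain; the helper `pick_exact_source` and the sample records (including their ids) are hypothetical and not part of the client.

```python
from typing import Optional


def pick_exact_source(results: list, domain: str) -> Optional[dict]:
    """Return the first result whose 'name' equals the domain, else None.

    Assumes each result is a dict with a 'name' key holding the domain;
    this helper is illustrative and not part of the mediacloud client.
    """
    wanted = domain.strip().lower()
    for source in results:
        if str(source.get('name', '')).strip().lower() == wanted:
            return source
    return None


# Hypothetical sample shaped like matching_sources['results']
sample_results = [
    {'id': 1, 'name': 'epaper.hindustantimes.com'},
    {'id': 2, 'name': 'hindustantimes.com'},
]
exact = pick_exact_source(sample_results, 'hindustantimes.com')
```

With a live client you would pass `matching_sources['results']` instead of the sample list, and fall back to the first result when the helper returns `None`.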

In [None]:
# list up to the first 100 feeds associated with this media source
hindustan_times_feeds = directory_api.feed_list(hindustan_times['id'])
JSON(hindustan_times_feeds)

## Collections
Media Sources are grouped together into collections. We have tons of collections, but the geographic ones are the most useful place to start investigating things. We also have topical ones, such as collections of US media sources grouped by partisanship. Drop us a line if you have a collection of media sources to suggest that you think would be broadly useful.
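
Because a source can belong to several collections, it is sometimes handy to flip the relationship and ask which collections a given source sits in. The snippet below is a minimal sketch of that inversion using plain dictionaries. The collection names and domains are made-up placeholders, not real Directory entries.

```python
from collections import defaultdict

# Hypothetical membership data: collection name -> source domains
collection_sources = {
    'Example - National': ['example-a.com', 'example-b.com'],
    'Example - Partisanship': ['example-b.com', 'example-c.com'],
}

# Invert the mapping: source domain -> list of collections containing it
source_collections = defaultdict(list)
for collection_name, domains in collection_sources.items():
    for domain in domains:
        source_collections[domain].append(collection_name)

# Domains that appear in more than one collection
shared = sorted(d for d, names in source_collections.items() if len(names) > 1)
```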

In [None]:
# search for a collection by name
nigerian_collections = directory_api.collection_list(name='nigeria')
[c['name'] for c in nigerian_collections['results']]

In [None]:
# page through a list of all the sources in the "Nigeria - National" collection
NIGERIA_NATIONAL = 38376341
sources = []
limit = 100
offset = 0
while True:
    response = directory_api.source_list(collection_id=NIGERIA_NATIONAL, limit=limit, offset=offset)
    sources += response['results']
    if response['next'] is None:
        break
    offset += limit
f"Found {len(sources)} media sources in the Nigeria - National collection"
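
The paging loop above can be wrapped in a small reusable generator. This is a sketch rather than part of the client: `iter_results` is a hypothetical helper, and `fake_source_list` is a stand-in that mimics the `results` / `next` response shape used above so the example runs without an API key.

```python
from typing import Callable, Iterator


def iter_results(fetch_page: Callable[..., dict], limit: int = 100, **kwargs) -> Iterator[dict]:
    """Yield every item from a paged endpoint, one page at a time.

    fetch_page is any callable that accepts limit/offset keyword arguments and
    returns a dict with 'results' and 'next' keys, as source_list does above.
    """
    offset = 0
    while True:
        response = fetch_page(limit=limit, offset=offset, **kwargs)
        yield from response['results']
        if response.get('next') is None:
            break
        offset += limit


# Stand-in for directory_api.source_list so the sketch runs offline
def fake_source_list(collection_id=None, limit=100, offset=0):
    fake_db = [{'id': i, 'name': f'example-{i}.com'} for i in range(250)]
    page = fake_db[offset:offset + limit]
    has_more = offset + limit < len(fake_db)
    return {'results': page, 'next': 'more' if has_more else None}


all_sources = list(iter_results(fake_source_list, limit=100, collection_id=38376341))
```

With a real client, `list(iter_results(directory_api.source_list, collection_id=NIGERIA_NATIONAL))` should collect the same sources as the loop above.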