Media Cloud: Using Topics
=========================

When you really want to dig into an issue and investigate it more deeply, Media Cloud allows you to make a "topic" with Topic Mapper. A topic is a corpus of open web stories that you can analyze in extended ways. This is useful in three specific ways:

* **Discover More Content** - seed it with the open-web stories we know about, then spider from those to discover more; pull in content from other platforms such as Reddit, Google search results, etc.
* **Measure Influence** - look at networks of linking between stories and sources; analyze social media posting patterns
* **Slice and Dice Content** - analyze content by week or month; compare subtopics based on keyword matches, countries of focus, and more

The [API for topics](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/topics_api_2_0_spec.md) is different from the standard API. A few key concepts are described in [the short section on "Snapshots, Timespans, and Foci"](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/topics_api_2_0_spec.md#snapshots-timespans-and-foci), which you should read:
* topics have different versions, so that we can go back to previous versions of a corpus to support reproducible research (these are called `snapshots` in the API)
* the time-based and subtopic-based slicing and dicing is captured in the idea of `timespans`
* if you do not specify a `snapshot` or `timespan`, the API uses the overall timespan of the latest snapshot (i.e. your entire corpus)


In [None]:
# Grab your API key from the environment variable and create a client for talking to Media Cloud
import os, mediacloud.api
from dotenv import load_dotenv
from IPython.display import JSON
load_dotenv()  # load config from .env file
mc = mediacloud.api.MediaCloud(os.getenv('MC_API_KEY'))
mediacloud.__version__

## Topic Metadata

Each story in a topic is part of a tree that lets you filter the corpus:
```
snapshot
  ↳ focus
    ↳ timespan
      ↳ stories
```
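
One way to picture this hierarchy is as nested data. The sketch below uses hand-written, made-up ids (nothing here comes from a real topic or from the API itself); it only illustrates how choosing a snapshot, then an optional focus, then a timespan narrows down the set of stories.

```python
# Hypothetical data that mirrors the snapshot > focus > timespan > stories tree.
# All ids are invented for illustration.
tree = {
    "snapshots_id": 1,
    "foci": [
        {
            "foci_id": None,  # None stands for "no focus", i.e. the whole snapshot
            "timespans": [
                {"timespans_id": 10, "period": "overall", "stories": [101, 102, 103]},
                {"timespans_id": 11, "period": "weekly", "stories": [101]},
            ],
        },
        {
            "foci_id": 5,
            "timespans": [
                {"timespans_id": 20, "period": "overall", "stories": [102]},
            ],
        },
    ],
}


def stories_in(tree, foci_id=None, period="overall"):
    """Return the stories of the first timespan matching the focus and period."""
    for focus in tree["foci"]:
        if focus["foci_id"] != foci_id:
            continue
        for timespan in focus["timespans"]:
            if timespan["period"] == period:
                return timespan["stories"]
    return []  # unknown focus or period


all_stories = stories_in(tree)               # no focus, overall timespan
focus_stories = stories_in(tree, foci_id=5)  # overall timespan within focus 5
```

With no arguments the helper mirrors the default described earlier: no focus and the overall timespan.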

### Snapshots

The API can tell you about the versions (aka `snapshots`) and `timespans` within your topic:

* `topicSnapshotList`: lists all the versions within your topic (see [the low-level docs](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/topics_api_2_0_spec.md#snapshotslist))
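
Code that takes the first entry of the returned list relies on the list being ordered newest-first. A slightly more defensive sketch picks the newest snapshot by date instead. It assumes each snapshot dict carries a `snapshot_date` string in a sortable `YYYY-MM-DD HH:MM:SS` form; that field name and format are assumptions here, and the sample data is made up, so check both against a real response before relying on this.

```python
def newest_snapshot(snapshots):
    """Return the snapshot dict with the greatest snapshot_date, or None if the list is empty.

    Assumes `snapshot_date` is a string that sorts chronologically (an assumption).
    """
    if not snapshots:
        return None
    return max(snapshots, key=lambda s: s["snapshot_date"])


# Made-up sample data, deliberately out of order
sample_snapshots = [
    {"snapshots_id": 1, "snapshot_date": "2020-04-01 09:00:00"},
    {"snapshots_id": 3, "snapshot_date": "2020-06-15 12:30:00"},
    {"snapshots_id": 2, "snapshot_date": "2020-05-10 18:45:00"},
]
newest_id = newest_snapshot(sample_snapshots)["snapshots_id"]
```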

In [None]:
SOURDOUGH_TOPIC = 4138
# list all snapshots of a topic
snapshots = mc.topicSnapshotList(SOURDOUGH_TOPIC)
latest_snapshot_id = snapshots[0]['snapshots_id'] # grab the id of the latest snapshot
latest_snapshot_id

### Subtopics

Subtopics are captured in `foci` that are part of a `focal_set`. Many of these are generated automatically based on how your topic is set up. For instance, if you add a `platform` to discover links shared on Reddit, you will get a subtopic that includes just those links automatically. You can list all the focal sets for a `snapshot`:

* `topicFocalSetList`: lists all the `focal_sets` for a `snapshot`, including a child array listing all the `foci` that it contains (see [the docs for more details](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/topics_api_2_0_spec.md#focal_set_definitionslist))
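
Picking a focus by its position in the list is fragile if a topic has several focal sets. A minimal sketch of looking a focus up by name follows. It assumes each focus dict has a `name` key next to `foci_id`; the `name` key and the sample data below are assumptions for illustration, not taken from a real response.

```python
def find_focus_id(focal_sets, focus_name):
    """Return the foci_id of the first focus whose name matches (case-insensitive), or None."""
    wanted = focus_name.lower()
    for focal_set in focal_sets:
        for focus in focal_set.get("foci", []):
            # `name` is an assumed field; treat a missing or empty name as no match
            if (focus.get("name") or "").lower() == wanted:
                return focus["foci_id"]
    return None


# Made-up focal sets shaped like a list of sets, each with a child `foci` array
sample_focal_sets = [
    {"name": "Platform", "foci": [{"foci_id": 11, "name": "Reddit"}]},
    {"name": "Keywords", "foci": [{"foci_id": 12, "name": "Starter"},
                                  {"foci_id": 13, "name": "Flour"}]},
]
reddit_id = find_focus_id(sample_focal_sets, "reddit")
missing_id = find_focus_id(sample_focal_sets, "rye")
```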

In [None]:
# within a snapshot you can get a list of all the subtopics
focal_sets = mc.topicFocalSetList(SOURDOUGH_TOPIC, snapshots_id=latest_snapshot_id)
# this assumes the first focus in the first focal set is the Reddit subtopic
reddit_foci_id = focal_sets[0]['foci'][0]['foci_id']
focal_sets[0]

### Timespans

`Timespans` are the lowest level of the tree. They are automatically generated items that list stories within a `snapshot` and optional `focus`. `Timespans` can cover one of three periods of time:

* `overall` - the entirety of the corpus (based on the topic start and end dates)
* `monthly` - one month of the corpus
* `weekly` - one week of the corpus
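
Because a topic usually has one overall timespan but many monthly and weekly ones, it can help to group a timespan list by its `period`. This is a small sketch on made-up data; the `period` and `timespans_id` keys follow the names used in this notebook, and the sample values are invented.

```python
from collections import defaultdict


def group_by_period(timespans):
    """Group timespan dicts into {period: [timespans...]}, preserving input order."""
    grouped = defaultdict(list)
    for timespan in timespans:
        grouped[timespan["period"]].append(timespan)
    return dict(grouped)


# Made-up timespans for illustration
sample_timespans = [
    {"timespans_id": 1, "period": "overall"},
    {"timespans_id": 2, "period": "monthly"},
    {"timespans_id": 3, "period": "weekly"},
    {"timespans_id": 4, "period": "weekly"},
]
by_period = group_by_period(sample_timespans)
weekly_ids = [t["timespans_id"] for t in by_period["weekly"]]
```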

If you don't specify a timespan, the overall one is used. The API lets you list all the timespans within a `snapshot` and optional `focus`:
* `topicTimespanList`: lists all the `timespans` for a `snapshot` (see [the docs for more details](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/topics_api_2_0_spec.md#timespanslist))

In [None]:
# let's see just the overall timespan for our topic
timespans = mc.topicTimespanList(SOURDOUGH_TOPIC)
overall_timespans = [t for t in timespans if t['period'] == 'overall']
overall_timespans

In [None]:
# the previous is identical to listing the timespans for the latest snapshot
timespans = mc.topicTimespanList(SOURDOUGH_TOPIC, snapshots_id=latest_snapshot_id)
overall_timespans = [t for t in timespans if t['period'] == 'overall']
overall_timespans

In [None]:
# but the overall timespan within the subtopic of stories shared on reddit is different
reddit_overall_timespan = mc.topicTimespanList(SOURDOUGH_TOPIC, snapshots_id=latest_snapshot_id, foci_id=reddit_foci_id)[0]
reddit_overall_timespan

See the other notebooks for examples of how to use these `snapshots_id`, `foci_id` and `timespans_id` to gather and analyze content within a topic.
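
As a rough sketch of how these ids travel together, the hypothetical helper below collects whichever ids you have into keyword arguments and leaves out the ones you did not set, so the defaults described above still apply. The helper name `topic_filters` is made up for this example, and the commented-out call assumes the client offers a topic story listing method that accepts these keywords.

```python
def topic_filters(snapshots_id=None, foci_id=None, timespans_id=None):
    """Build keyword arguments for a topic call, dropping any id that was not provided."""
    candidates = {
        "snapshots_id": snapshots_id,
        "foci_id": foci_id,
        "timespans_id": timespans_id,
    }
    return {key: value for key, value in candidates.items() if value is not None}


# Made-up ids for illustration
filters = topic_filters(snapshots_id=3, foci_id=11)

# Assumed usage, not executed here:
# stories = mc.topicStoryList(SOURDOUGH_TOPIC, **filters)
```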