Media Cloud: Topics: Measuring Attention
========================================

At this point you have a topic created in Media Cloud: a corpus of open-news web content related to an issue you want to investigate, discovered on multiple platforms across the internet.

Our API exposes two key endpoints for analyzing attention within a topic:
* `topicStoryCount`: return the total number of stories in the topic matching your query, or return that count as a time series (read [the low level documentation](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/topics_api_2_0_spec.md#storiescount) for more details about the parameters it supports)
* `topicStoryList`: page through the actual stories that match your query in the topic (read the [low level docs](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/topics_api_2_0_spec.md#storieslist))

## Set Up a Connection and Some Constants

In [None]:
# Grab your API key from the environment variable and create a client for talking to Media Cloud
import os, mediacloud.api
from dotenv import load_dotenv
from IPython.display import JSON
load_dotenv()  # load config from .env file
mc = mediacloud.api.MediaCloud(os.getenv('MC_API_KEY'))
mediacloud.__version__

In [None]:
# we'll use this topic for the explanation
SOURDOUGH_TOPIC = 4138
# find the latest snapshot
snapshots = mc.topicSnapshotList(SOURDOUGH_TOPIC)
latest_snapshot_id = snapshots[0]['snapshots_id'] # grab the id of the latest snapshot
# pull out the automatically-generated monthly timespans, and the overall one
timespans = mc.topicTimespanList(SOURDOUGH_TOPIC)
overall_timespan = [t for t in timespans if t['period'] == 'overall'][0]
monthly_timespans = [t for t in timespans if t['period'] == 'monthly']
# grab a subtopic to work with as well (this takes the first focus in the first focal set, which is the Reddit one in this topic)
focal_sets = mc.topicFocalSetList(SOURDOUGH_TOPIC)
reddit_foci_id = focal_sets[0]['foci'][0]['foci_id']
# and some timespans in the reddit subtopic
reddit_timespans = mc.topicTimespanList(SOURDOUGH_TOPIC, foci_id=reddit_foci_id)
reddit_overall_timespan = [t for t in reddit_timespans if t['period'] == 'overall'][0]
reddit_monthly_timespans = [t for t in reddit_timespans if t['period'] == 'monthly']

## Measuring Total Attention

Let's start by looking at the total corpus size, and seeing how many of those stories were discovered via someone sharing them on Reddit.

In [None]:
total_stories = mc.topicStoryCount(SOURDOUGH_TOPIC, timespans_id=overall_timespan['timespans_id'])
total_stories

In [None]:
stories_from_reddit = mc.topicStoryCount(SOURDOUGH_TOPIC, timespans_id=reddit_overall_timespan['timespans_id'])
stories_from_reddit

In [None]:
ratio = stories_from_reddit['count'] / total_stories['count']
'{:.2%} of our corpus was discovered from someone sharing it on Reddit'.format(ratio)
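
If a subtopic or timespan turns out to be empty, the division above raises a `ZeroDivisionError`. Here is a minimal sketch of a guard, assuming (as in the cells above) that each `topicStoryCount` result is a dict with a `count` key. `share_of_total` is a hypothetical helper, not part of the `mediacloud` client, and the numbers are made up.

```python
def share_of_total(part_result, total_result):
    """Return part/total as a float, or None when the total is zero or missing."""
    part = part_result.get('count', 0)
    total = total_result.get('count', 0)
    if not total:
        return None  # nothing to divide by, so no meaningful share
    return part / total


# made-up numbers standing in for real API responses
example_share = share_of_total({'count': 25}, {'count': 200})
print('{:.2%}'.format(example_share))
```

With real data you would call `share_of_total(stories_from_reddit, total_stories)` and check for `None` before formatting.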

## Measuring Attention over Time

You can measure attention over time by using the `split` parameter with your call to `topicStoryCount`.

In [None]:
# count the stories in the overall timespan, split into monthly buckets
results = mc.topicStoryCount(SOURDOUGH_TOPIC, timespans_id=overall_timespan['timespans_id'], split=True, split_period='month')
JSON(results)

You'll probably notice that many of these counts fall on dates outside of our topic's start/end dates. Date guessing on the open web is hard (we [wrote our own `date_guesser` package](https://github.com/mitmedialab/date_guesser) to do it). **We get around 10% of our dates wrong**, sometimes in stupid ways. In addition, some content isn't dateable at all, Wikipedia pages for instance. So you probably want to filter results like these by their dates.
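
One way to do that filtering is sketched below. It assumes the split results contain a list of items that each have a `date` string starting with `YYYY-MM-DD` and a `count`; check that against the shape you actually see in the `JSON(results)` output. The helper name, the sample data, and the date range are all made up for illustration.

```python
from datetime import date, datetime


def counts_within(counts, start, end):
    """Keep only the entries whose date falls within [start, end], inclusive."""
    kept = []
    for item in counts:
        # parse only the leading YYYY-MM-DD part; any time-of-day suffix is ignored
        day = datetime.strptime(item['date'][:10], '%Y-%m-%d').date()
        if start <= day <= end:
            kept.append(item)
    return kept


# made-up sample in the assumed shape
sample_counts = [
    {'date': '1999-12-01 00:00:00', 'count': 3},    # implausibly early, likely a bad date guess
    {'date': '2020-03-01 00:00:00', 'count': 120},
    {'date': '2020-04-01 00:00:00', 'count': 340},
]
in_range = counts_within(sample_counts, date(2020, 1, 1), date(2020, 12, 31))
print(in_range)
```

With real data you would pass in the list of monthly counts from `results` and your topic's own start and end dates.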

## Listing Stories

Counts only go so far; you probably want to know *which* stories are in your topic. `topicStoryList` helps with that.

In [None]:
# find out which stories were most shared on Reddit
# Note the use of "post_count" here to sort by total number of posts (within the corpus)
top_stories_from_reddit = mc.topicStoryList(SOURDOUGH_TOPIC, timespans_id=reddit_overall_timespan['timespans_id'], sort='post_count')
JSON(top_stories_from_reddit)

That call returns a single page of results. To list all the stories, pass the `next` value from each response's `link_ids` back in as `link_id`, as described in our [docs on paging through topic API endpoint results](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/topics_api_2_0_spec.md#paging).

In [None]:
def all_topic_matching_stories(mc_client, topics_id, snapshots_id=None, foci_id=None, timespans_id=None, q=None):
    """
    Return all the stories matching a query within your Media Cloud topic. Page through the results automatically.
    :param mc_client: a `mediacloud.api.MediaCloud` object instantiated with your API key already
    :param topics_id: the id of the topic you are using
    :param snapshots_id: the snapshot ("version") you want to search within
    :param foci_id: the focus ("subtopic") you want to search within
    :param timespans_id: the timespan you want to search within
    :param q: a boolean query to filter stories even further
    :return: a list of media cloud story items within the topic that match
    """
    link_id = None
    more_stories = True
    stories = []
    while more_stories:
        page = mc_client.topicStoryList(topics_id,
                                        snapshots_id=snapshots_id, foci_id=foci_id, timespans_id=timespans_id,
                                        q=q, link_id=link_id, limit=500)
        stories += page['stories']
        print("  got one page with {} stories".format(len(page['stories'])))
        if 'next' in page['link_ids']:
            link_id = page['link_ids']['next']
        else:
            more_stories = False
    return stories

In [None]:
# fetch *all* the stories we discovered on Reddit
all_stories_from_reddit = all_topic_matching_stories(mc, SOURDOUGH_TOPIC, timespans_id=reddit_overall_timespan['timespans_id'])
JSON(all_stories_from_reddit)
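
For large topics, holding every page in one list may use more memory than you want. The sketch below is a generator variant of the same paging loop that yields stories one at a time. It is an illustration, not part of the `mediacloud` client: `iter_topic_stories` and `FakeClient` are hypothetical names, and the stub only mimics the response shape used above (`stories` and `link_ids`) so the example runs without an API key.

```python
def iter_topic_stories(mc_client, topics_id, page_size=500, **filters):
    """Yield matching stories one at a time, fetching each page only when needed."""
    link_id = None
    while True:
        page = mc_client.topicStoryList(topics_id, link_id=link_id, limit=page_size, **filters)
        yield from page['stories']
        # stop when the response carries no link to a following page
        link_id = page.get('link_ids', {}).get('next')
        if link_id is None:
            break


class FakeClient:
    """Hypothetical stand-in for the real client, serving two canned pages."""
    PAGES = {
        None: {'stories': [{'stories_id': 1}, {'stories_id': 2}], 'link_ids': {'next': 'page2'}},
        'page2': {'stories': [{'stories_id': 3}], 'link_ids': {}},
    }

    def topicStoryList(self, topics_id, link_id=None, limit=500, **kwargs):
        return self.PAGES[link_id]


story_ids = [s['stories_id'] for s in iter_topic_stories(FakeClient(), 4138)]
print(story_ids)
```

Against the real API, the equivalent call would be `iter_topic_stories(mc, SOURDOUGH_TOPIC, timespans_id=reddit_overall_timespan['timespans_id'])`, which you can loop over directly instead of building the full list first.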

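You probably also want to dump `all_stories_from_reddit` to a CSV. Passing `extrasaction='ignore'` makes the writer skip any story fields not listed in `fieldnames`: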

In [None]:
# now write the CSV
import csv
fieldnames = ['stories_id', 'publish_date', 'title', 'url', 'language', 'ap_syndicated', 'facebook_share_count', 'media_id', 'media_name', 'media_url']
with open('topic-story-list.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
    writer.writeheader()
    for s in all_stories_from_reddit:
        writer.writerow(s)