Media Cloud: Topics: Measuring Influence
========================================

At this point you have a topic created in Media Cloud - a corpus of open-news web content related to an issue you want to investigate, discovered on multiple platforms across the internet. The topic includes influence metrics beyond the normal data collected by Media Cloud.

* Linking Metrics
* Social Sharing Metrics

Our API exposes a few key endpoints for analyzing attention within a topic:
* `topicStoryList`:  page through the actual stories that match your query in the topic (read the [low level docs](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/topics_api_2_0_spec.md#storieslist))

## Setup a Connection and Some Constants

In [None]:
# Grab your API key from the environment variable and create a client for talking to Media Cloud
import os, mediacloud.api
from dotenv import load_dotenv
from IPython.display import JSON
load_dotenv()  # load config from .env file
mc = mediacloud.api.MediaCloud(os.getenv('MC_API_KEY'))
mediacloud.__version__

In [None]:
# we'll use this topic for the explanation
SOURDOUGH_TOPIC = 4138
# find the latest snapshot
snapshots = mc.topicSnapshotList(SOURDOUGH_TOPIC)
latest_snapshot_id = snapshots[0]['snapshots_id'] # grab the id of the latest snapshot
# pull out the automatically-generated monthly timespans, and the overall one
timespans = mc.topicTimespanList(SOURDOUGH_TOPIC)
overall_timespan = [t for t in timespans if t['period'] == 'overall'][0]
monthly_timespans = [t for t in timespans if t['period'] == 'monthly']
# grab a subtopic to work with as well
focal_sets = mc.topicFocalSetList(SOURDOUGH_TOPIC)
reddit_foci_id = focal_sets[0]['foci'][0]['foci_id']
# and some timespans in the reddit subtopic
reddit_timespans = mc.topicTimespanList(SOURDOUGH_TOPIC, foci_id=reddit_foci_id)
reddit_overall_timespan = [t for t in reddit_timespans if t['period'] == 'overall'][0]
reddit_monthly_timespans = [t for t in reddit_timespans if t['period'] == 'monthly']

## Linking Metrics

We extract the links in every story, giving us a network map of who links to whom at the story and media level.
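
To make the roll-up from story links to media links concrete, here is a minimal sketch on made-up data. The `story_links` pairs, the `story_media` lookup, and the helper `media_link_edges` are all hypothetical and are not part of the Media Cloud API; the sketch only illustrates one way story-level links could be collapsed into a media-level network, not how Media Cloud itself computes its metrics.

```python
from collections import Counter

# Hypothetical story-level links: (source story id, target story id)
story_links = [(1, 2), (1, 3), (4, 2), (5, 2), (5, 3)]
# Hypothetical lookup from story id to the media source that published it
story_media = {1: "blog-a", 2: "news-b", 3: "news-c", 4: "blog-a", 5: "blog-d"}

def media_link_edges(links, media_of):
    """Collapse story-to-story links into weighted media-to-media edges."""
    edges = Counter()
    for source_story, target_story in links:
        source_media = media_of[source_story]
        target_media = media_of[target_story]
        if source_media != target_media:  # skip links within a single media source
            edges[(source_media, target_media)] += 1
    return edges

media_edges = media_link_edges(story_links, story_media)
# count how many distinct media sources link to each media source
media_inlinks = Counter(target for (_, target) in media_edges)
```

Here `media_edges` holds one weighted edge per pair of media sources, and `media_inlinks` counts distinct linking sources, which is the kind of number the `inlink` sort below ranks by.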

### Media Inlinks and Outlinks

We find link behaviour at the media source level to be far more useful than at the story level for aggregate network analysis.

In [None]:
# what were the top media sources linked to in the entire corpus?
top_media_by_media_inlink = mc.topicMediaList(SOURDOUGH_TOPIC, sort='inlink')
[m['name'] for m in top_media_by_media_inlink['media'][:10]]

In [None]:
# and what about in stories just from reddit?
top_media_by_media_inlink = mc.topicMediaList(SOURDOUGH_TOPIC, timespans_id=reddit_overall_timespan['timespans_id'], sort='inlink')
[m['url'] for m in top_media_by_media_inlink['media'][:10]]

### Story Inlinks and Outlinks

In [None]:
# and what were the stories that had the most *media* linking to them?
top_stories_by_media_inlink = mc.topicStoryList(SOURDOUGH_TOPIC, sort='inlink')
[s['url'] for s in top_stories_by_media_inlink['stories'][:10]]

You'll note that there is some garbage in here - for instance the Instagram login page. That always happens in topics, and you just have to weed it out.
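
One way to do that weeding is a simple URL filter applied to the results. The sketch below is an assumption about what "garbage" looks like, not a Media Cloud feature: the `NOISE_PATH_HINTS` patterns, the `is_probable_noise` helper, and the sample URLs are all hypothetical, and you would tune the patterns to the noise in your own topic.

```python
from urllib.parse import urlparse

# Hypothetical path prefixes that usually indicate a non-story page
NOISE_PATH_HINTS = ("/login", "/accounts/login", "/signup")

def is_probable_noise(url):
    """Return True when the URL path starts with one of the noise prefixes."""
    path = urlparse(url).path.lower()
    return any(path.startswith(hint) for hint in NOISE_PATH_HINTS)

# Made-up sample URLs standing in for the 'url' values of returned stories
sample_urls = [
    "https://www.instagram.com/accounts/login/",
    "https://example.com/2020/04/sourdough-starter-guide",
]
clean_urls = [u for u in sample_urls if not is_probable_noise(u)]
```

A prefix match on the path is deliberately conservative: it drops obvious login and signup pages while leaving real stories alone, at the cost of missing noise that follows other URL patterns.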

### Shortcut: Automatically Generated Link Maps

With all this information, we automatically generate link maps in .svg and .gexf format for every `timespan`. You can visit your topic's "Export" tab to view and download those.

## Social Sharing Metrics

We've recently added "beta" support for ingesting open web news stories that are shared on social media sites - links posted to reddit, top links from Google News, etc. (more coming soon!). You add these as "platforms" while creating your topic. Each social media platform gets an automatically generated subtopic that includes sharing metrics at the story and media level.

* `post_count`: The number of *times* a link to a story, or media source, was posted in matching content we found on that platform
* `author_count`: The number of *unique authors* who posted a link to a story, or media source, in matching content we found on that platform
* `channel_count`: The number of *unique "channels"* in which a link to a story, or media source, was posted in matching content we found on that platform. Channels mean different things for each platform - for reddit they are the subreddit; Google News doesn't have a channel; when we add YouTube it will be the video's channel.
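
As an illustration of how the three counts differ, here is a small sketch that derives them from a made-up list of posts. The `posts` records, their field names, and the `sharing_counts` helper are hypothetical; Media Cloud computes these metrics for you, and this is only meant to show what each one measures.

```python
# Hypothetical posts that matched the topic on one platform
posts = [
    {"author": "alice", "channel": "r/Breadit", "url": "https://example.com/starter"},
    {"author": "alice", "channel": "r/Sourdough", "url": "https://example.com/starter"},
    {"author": "bob", "channel": "r/Breadit", "url": "https://example.com/starter"},
    {"author": "carol", "channel": "r/Baking", "url": "https://example.com/other"},
]

def sharing_counts(posts, url):
    """Count posts, unique authors, and unique channels that shared one URL."""
    matching = [p for p in posts if p["url"] == url]
    return {
        "post_count": len(matching),
        "author_count": len({p["author"] for p in matching}),
        "channel_count": len({p["channel"] for p in matching}),
    }

counts = sharing_counts(posts, "https://example.com/starter")
```

In this sample the same author posts the link twice, so `post_count` is higher than `author_count`; comparing the two is a quick check on whether sharing is broad or driven by a few accounts.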

We also add in the overall Facebook share count for the story. This is across all of Facebook, so it isn't an indication of *relevant* social sharing (i.e. it could be a link shared by someone not actually discussing your issue). For this reason we tend to use this less often than we used to.

In [None]:
# Grab the most-posted story of all the stories discovered on Reddit
top_posted_on_reddit = mc.topicStoryList(SOURDOUGH_TOPIC, timespans_id=reddit_overall_timespan['timespans_id'], sort='post_count')
top_story_on_reddit = top_posted_on_reddit['stories'][0]
top_story_on_reddit['url']

In [None]:
# check if this was all from one account or not
print("posted {} times".format(top_story_on_reddit['post_count']))
print("by {} unique users".format(top_story_on_reddit['author_count']))
print("in {} subreddits".format(top_story_on_reddit['channel_count']))
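
If you want a single number for that check, one rough option is the average number of posts per unique author. This is a sketch under assumptions: `posts_per_author` is a hypothetical helper, the `example_story` values are made up, and the ratio is our own heuristic, not a metric returned by the API.

```python
def posts_per_author(story):
    """Average posts per unique author, or None when no authors were counted."""
    authors = story.get("author_count") or 0
    if authors == 0:
        return None  # avoid dividing by zero when the count is missing or 0
    return story["post_count"] / authors

# Made-up values shaped like the sharing fields on a story record
example_story = {"post_count": 12, "author_count": 3, "channel_count": 2}
ratio = posts_per_author(example_story)
```

A ratio near 1 suggests many different people shared the link once each, while a high ratio suggests a handful of accounts posting it repeatedly.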