Media Cloud: Topics: Measuring Language
=======================================

At this point you have a topic created in Media Cloud: a corpus of open-news web content related to an issue you want to investigate, discovered on multiple platforms across the internet. The topic lets you analyze language in a variety of ways.

Our API exposes a few key endpoints for analyzing language within a topic:
* `topicWordCount`: list the top words in stories that match your query in the topic (read the [low level docs](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/topics_api_2_0_spec.md#wclist))
* `topicSnapshotWord2VecModel`: return a trained word2vec model file as `application/octet-stream` ([low level docs](https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/topics_api_2_0_spec.md#snapshotssnapshots_idword2vec_modelmodels_id-get))

## Set Up a Connection and Some Constants

In [None]:
# Grab your API key from the environment variable and create a client for talking to Media Cloud
import os, mediacloud.api
from dotenv import load_dotenv
from IPython.display import JSON
load_dotenv()  # load config from .env file
mc = mediacloud.api.MediaCloud(os.getenv('MC_API_KEY'))
mediacloud.__version__

In [None]:
# we'll use this topic for the explanation
SOURDOUGH_TOPIC = 4138
# find the latest snapshot
snapshots = mc.topicSnapshotList(SOURDOUGH_TOPIC)
latest_snapshot_id = snapshots[0]['snapshots_id'] # grab the id of the latest snapshot
# pull out the automatically-generated monthly timespans, and the overall one
timespans = mc.topicTimespanList(SOURDOUGH_TOPIC)
overall_timespan = [t for t in timespans if t['period'] == 'overall'][0]
monthly_timespans = [t for t in timespans if t['period'] == 'monthly']
# grab a subtopic to work with as well
focal_sets = mc.topicFocalSetList(SOURDOUGH_TOPIC)
reddit_foci_id = focal_sets[0]['foci'][0]['foci_id']
# and some timespans in the reddit subtopic
reddit_timespans = mc.topicTimespanList(SOURDOUGH_TOPIC, foci_id=reddit_foci_id)
reddit_overall_timespan = [t for t in reddit_timespans if t['period'] == 'overall'][0]
reddit_monthly_timespans = [t for t in reddit_timespans if t['period'] == 'monthly']
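The `[...][0]` pattern above raises an `IndexError` if a topic has no timespan for the requested period. As a minimal sketch, the filtering can be wrapped in small helpers that return `None` instead. The helper names and the sample data here are hypothetical, not part of the Media Cloud client, and the sample uses only the `timespans_id` and `period` keys referenced above.

```python
def timespans_for_period(timespans, period):
    """Return all timespans whose 'period' field matches the given value."""
    return [t for t in timespans if t['period'] == period]


def first_timespan_for_period(timespans, period):
    """Return the first matching timespan, or None if there isn't one."""
    matches = timespans_for_period(timespans, period)
    return matches[0] if matches else None


# made-up sample data standing in for the result of mc.topicTimespanList(...)
sample_timespans = [
    {'timespans_id': 1, 'period': 'overall'},
    {'timespans_id': 2, 'period': 'monthly'},
    {'timespans_id': 3, 'period': 'monthly'},
]
overall = first_timespan_for_period(sample_timespans, 'overall')
monthly = timespans_for_period(sample_timespans, 'monthly')
weekly = first_timespan_for_period(sample_timespans, 'weekly')
print(overall, len(monthly), weekly)
```

With real data you would pass in the list returned by `mc.topicTimespanList` and check for `None` before using the result.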

## Top Words

As with any regular query, you can run simple word counts within your topic to see what language is used.

In [None]:
# analyze top words by month
import dateparser
monthly_top_words = []
for t in monthly_timespans:
    month_top_words = mc.topicWordCount(SOURDOUGH_TOPIC, timespans_id=t['timespans_id'])
    monthly_top_words.append({
        'timespans_id': t['timespans_id'],
        'start_date': dateparser.parse(t['start_date']),
        'end_date': dateparser.parse(t['end_date']),
        'top_words': month_top_words
    })

In [None]:
for m in monthly_top_words:
    # use a separate loop variable (w) so it doesn't shadow the month (m)
    print('{}: {}'.format(m['start_date'], ', '.join([w['term'] for w in m['top_words'][:10]])))
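One way to see how language shifts over time is to compare each month's top terms with those of the previous month. The sketch below assumes data shaped like `monthly_top_words` above, reading only the `term` key of each word entry. `new_terms_by_month` is a hypothetical helper and the sample months are made up for illustration.

```python
def new_terms_by_month(monthly_top_words, top_n=10):
    """For each month, list the top terms that were absent from the previous month's top terms."""
    results = []
    previous_terms = set()
    for month in monthly_top_words:
        current_terms = [w['term'] for w in month['top_words'][:top_n]]
        # keep the original ranking order while filtering out repeated terms
        fresh = [t for t in current_terms if t not in previous_terms]
        results.append({'start_date': month['start_date'], 'new_terms': fresh})
        previous_terms = set(current_terms)
    return results


# made-up sample data in the same shape as monthly_top_words
sample_months = [
    {'start_date': '2020-03-01', 'top_words': [{'term': 'bread'}, {'term': 'flour'}]},
    {'start_date': '2020-04-01', 'top_words': [{'term': 'bread'}, {'term': 'starter'}]},
]
new_terms = new_terms_by_month(sample_months)
for entry in new_terms:
    print('{}: {}'.format(entry['start_date'], ', '.join(entry['new_terms'])))
```

The first month always reports all of its terms as new, since there is nothing earlier to compare against.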

## Word2Vec

Note that for every `snapshot` we train a word2vec model on the corpus. You can download and use that model to support more complicated computational language analysis.

In [None]:
# list the model trained on the entire corpus
snapshots = mc.topicSnapshotList(SOURDOUGH_TOPIC)
latest_snapshot = snapshots[0]
model_info = latest_snapshot['word2vec_models'][0]
model_info

In [None]:
# save the model locally
raw_model = mc.topicSnapshotWord2VecModel(SOURDOUGH_TOPIC, latest_snapshot_id, model_info['models_id'])
path_to_model = "topic-word2vec-{}.model".format(model_info['models_id'])
model_byte_array = bytes(raw_model)
# a context manager closes the file even if the write fails
with open(path_to_model, 'wb') as cache_file:
    cache_file.write(model_byte_array)

In [None]:
# load the model
import gensim
model = gensim.models.keyedvectors.KeyedVectors.load_word2vec_format(path_to_model, binary=True)

In [None]:
# grab the full embedding for a word
model['yeast']

In [None]:
# find similar words in the space
model.most_similar('yeast')
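Similarity between word vectors is typically measured with cosine similarity, which is the measure gensim's `most_similar` ranks by. The sketch below illustrates the idea in plain Python with made-up two-dimensional vectors. Real embeddings from the model have many more dimensions, and `rank_similar` is a hypothetical helper, not a gensim function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# made-up toy vectors, for illustration only
toy_vectors = {
    'yeast': [1.0, 0.0],
    'starter': [1.0, 1.0],
    'laptop': [0.0, 1.0],
}
ranked = rank_similar('yeast', toy_vectors)
print(ranked)
```

Vectors that point in a similar direction score close to 1, and orthogonal vectors score 0, regardless of their lengths.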