# Mining Social Data
# Lesson 3: Using Social Media APIs

This notebook assumes you have already watched Shilad's [presentation on Social Media APIs](https://www.youtube.com/watch?v=Yyo6kq3Ao4U). The presentation provides a high-level overview of methods for incorporating social data into your applications and analyses.  This notebook supplements the presentation by walking through five detailed case studies that connect to Social APIs in Python.

After you complete this lesson, you will be able to:

* Connect to Social APIs using Python - both in "native" Python and using "wrapper modules."
* Create Python code to connect to the Wikipedia, Facebook, Twitter, GitHub and DataSift Social APIs.
* Recognize and complete typical authorization scenarios for Social APIs.
* Understand the capabilities and tradeoffs of social data aggregation services such as Gnip and DataSift.
* Be familiar with offline and online patterns of incorporating data into your applications.

# 0. Preparation and setup

**Time estimate:** 30 minutes. (If you run into problems, please post to the forum!).

**Further optional readings:** [PyPI: The Python Package Index](https://pypi.python.org/pypi)

In order to complete this lesson, you'll need to create two accounts and install six Python modules.

* [Sign up or register on Twitter](http://twitter.com) if you don't have an account.
* [Register on DataSift](https://datasift.com/auth/register) and sign up for the free trial (this can take a day or so).

You will also need six Python modules that make it easier to connect to social APIs. The first five should install smoothly:

In [9]:
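import subprocess
import sys

def install_module(package_name, import_name=None):
    # The name you import can differ from the name you install
    # (for example, 'facebook-sdk' is imported as 'facebook').
    try:
        __import__(import_name or package_name)
        print('module ' + package_name + ' already installed')
    except ImportError:
        print('installing module ' + package_name)
        # pip.main() is not available in newer pip releases, so run pip as a subprocess.
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', package_name])

install_module('wikipedia')
install_module('twitter')
install_module('oauth2')
install_module('facebook-sdk', 'facebook')
install_module('praw')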

module wikipedia already installed
module twitter already installed
module oauth2 already installed
installing module facebook-sdk
module praw already installed


The sixth module, `datasift`, produced a warning, but still worked for me. If you have issues, please let me know.

In [2]:
install_module('datasift')

installing module datasift
You are using pip version 6.0.8, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting datasift
  Downloading datasift-2.11.0.tar.gz
Collecting requests<3.0.0,>=2.8.0 (from datasift)
  Downloading requests-2.18.4-py2.py3-none-any.whl (88kB)
Collecting autobahn<0.10.0,>=0.9.4 (from datasift)
  Downloading autobahn-0.9.6.tar.gz (137kB)
    package init file 'twisted\plugins\__init__.py' not found (or not a regular file)
Collecting twisted<16.0.0,>=14.0.0 (from datasift)
  Downloading Twisted-15.5.0-cp27-none-win_amd64.whl (3.1MB)
Collecting pyopenssl<0.16.0,>=0.15.1 (from datasift)
  Downloading pyOpenSSL-0.15.1-py2.py3-none-any.whl (102kB)
Collecting service-identity>=14.0.0 (from datasift)
  Downloading service_identity-17.0.0-py2.py3-none-any.whl
Collecting requests-futures>=0.9.5 (from datasift)
  Downloading requests-futures-0.9.7.tar.gz
Collecting ndg-httpsclient>=0.4.0 (from datasift)
  

# 1. Plain HTTP + JSON (The simplest possible example)

**Time estimate:** 30 minutes (excluding questions #1 and #2).

**Further optional reading:**

* [Python urllib module](https://docs.python.org/2/library/urllib.html)
* [Python json module](https://docs.python.org/2/library/json.html)
* [GitHub API documentation](https://api.github.com)
* [Wikipedia's JSON article](http://en.wikipedia.org/wiki/JSON)

We'll start with a simple example and work towards more complex ones. Let's see the recent activity on GitHub. You can actually view this by opening the url https://api.github.com/events in your browser. Go take a look!

You can instruct Python to perform this same request as follows.

In [1]:
import urllib
import json
import pprint

# open an http connection to the url and return a file for it
url = urllib.urlopen('https://api.github.com/events')

# read the http response into a string.
response = url.read()
print('my raw response is a %s: %s' % (type(response), repr(response[:400])))

my raw response is a <type 'str'>: '[{"id":"6849151979","type":"PushEvent","actor":{"id":21224191,"login":"FatFreeBeefCake","display_login":"FatFreeBeefCake","gravatar_id":"","url":"https://api.github.com/users/FatFreeBeefCake","avatar_url":"https://avatars.githubusercontent.com/u/21224191?"},"repo":{"id":101216063,"name":"FatFreeBeefCake/2670UVU","url":"https://api.github.com/repos/FatFreeBeefCake/2670UVU"},"payload":{"push_id":212'


Notice the data type of the response above is a string. This string looks a lot like a Python literal, but it is not a data structure - just a string. The string is encoded using a specification called [JSON](http://en.wikipedia.org/wiki/JSON) that uses JavaScript literal syntax (very close to Python).

In the past, APIs used a variety of encoding schemes (often based on XML). These days, almost every social API supports JSON.

You could write code to *parse* the JSON and turn it from a string into a native Python data structure. However, it's better to use Python's robust [json module](https://docs.python.org/2/library/json.html). The `json.loads` function takes a string encoded using the JSON specification and returns a native Python data structure. We'll pretty print the result so you can see what it looks like.

In [2]:
# converts the response string into a native python data structure
data = json.loads(response)

# returns a human-readable string representing the data structure.
data_str = pprint.pformat(data)
print('parsed response is a %s: %s...' % (type(data), data_str[:1000]))

parsed response is a <type 'list'>: [{u'actor': {u'avatar_url': u'https://avatars.githubusercontent.com/u/21224191?',
             u'display_login': u'FatFreeBeefCake',
             u'gravatar_id': u'',
             u'id': 21224191,
             u'login': u'FatFreeBeefCake',
             u'url': u'https://api.github.com/users/FatFreeBeefCake'},
  u'created_at': u'2017-11-13T21:02:55Z',
  u'id': u'6849151979',
  u'payload': {u'before': u'ca6004526ee95236ba2a818b8fc877f44e0ee3af',
               u'commits': [{u'author': {u'email': u'stheadman1@hotmail.com',
                                         u'name': u'Unknown'},
                             u'distinct': True,
                             u'message': u'BG music',
                             u'sha': u'ee92dc5dbf8846e20cbb66b435500ea0ea88f27e',
                             u'url': u'https://api.github.com/repos/FatFreeBeefCake/2670UVU/commits/ee92dc5dbf8846e20cbb66b435500ea0ea88f27e'}],
               u'distinct_size': 1,
          

If you're curious about the format of the GitHub API response (typically called the **payload**), you can take a look at the [GitHub API documentation about events](https://developer.github.com/v3/activity/events/). Most public social APIs have excellent documentation.
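To see the string-to-data-structure conversion in isolation, without a network call, here is a minimal sketch. The JSON string is hand-written for illustration (the login `octocat` and the field values are made up, not a real API response).

```python
import json

# A hand-written JSON string, shaped like a tiny piece of a GitHub event.
raw = '{"type": "PushEvent", "public": true, "actor": {"login": "octocat"}, "payload": null}'

# json.loads turns the string into native Python objects.
event = json.loads(raw)

# JSON objects become dicts; true/false become True/False; null becomes None.
print(type(event).__name__)
print(event['actor']['login'])
print(event['public'])
print(event['payload'])

# json.dumps goes the other way, from a data structure back to a string.
round_trip = json.loads(json.dumps(event))
print(round_trip == event)
```

The same mapping applies to real API responses: once parsed, you index into them with ordinary dictionary and list syntax.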

As a shortcut, you can fetch and parse the response in one step using the `json.load` function, which directly parses the file-like object returned by `urllib.urlopen`:

In [3]:
url = urllib.urlopen('https://api.github.com/events')
data = json.load(url)
print(len(data))   # number of events

30


Of course, the decoded JSON is a native Python data structure, so you can interact with it directly, just as you would with any other Python data structure:

In [4]:
for record in data[:10]:
    print(record['created_at'], record['type'], record['actor']['login'], record['repo']['url'])

(u'2017-11-13T21:03:03Z', u'WatchEvent', u'anand32138', u'https://api.github.com/repos/rstacruz/cheatsheets')
(u'2017-11-13T21:03:03Z', u'WatchEvent', u'sakopov', u'https://api.github.com/repos/sakopov/Dapper.AmbientContext')
(u'2017-11-13T21:03:04Z', u'PushEvent', u'rlewis2892', u'https://api.github.com/repos/deepdivedylan/angular5-example')
(u'2017-11-13T21:03:03Z', u'PushEvent', u'melonmj', u'https://api.github.com/repos/melonmj/parse_server_bolsa1a')
(u'2017-11-13T21:03:03Z', u'PushEvent', u'tiewei', u'https://api.github.com/repos/tiewei/netplugin')
(u'2017-11-13T21:03:03Z', u'IssueCommentEvent', u'arq5x', u'https://api.github.com/repos/arq5x/bedtools2')
(u'2017-11-13T21:03:03Z', u'PushEvent', u'tmtmtmtm', u'https://api.github.com/repos/everypolitician-scrapers/macedonia-sobranie')
(u'2017-11-13T21:03:04Z', u'PushEvent', u'samyk', u'https://api.github.com/repos/samyk/myo-osc')
(u'2017-11-13T21:03:03Z', u'IssueCommentEvent', u'mitar', u'https://api.github.com/repos/tozd/docker-nginx
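
Because the parsed events are plain dictionaries, standard library tools work on them directly. The sketch below tallies events by type with `collections.Counter`. The `sample_events` list is a made-up stand-in for the parsed `data` list, so the example runs offline.

```python
from collections import Counter


def count_event_types(events):
    """Return a Counter mapping each event type to how often it occurs."""
    return Counter(event['type'] for event in events)


# Stand-in for the parsed `data` list; only the 'type' key is needed here.
sample_events = [
    {'type': 'WatchEvent'},
    {'type': 'PushEvent'},
    {'type': 'PushEvent'},
    {'type': 'IssueCommentEvent'},
    {'type': 'PushEvent'},
]

counts = count_event_types(sample_events)
for event_type, n in counts.most_common():
    print('%s: %d' % (event_type, n))
```

With a live response you would call `count_event_types(data)` instead.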

# 2. Using wrapper modules to access public APIs

**Time estimate:** 30 minutes, excluding assignment questions.

**Supplemental readings:**

* [Wikipedia API](http://www.mediawiki.org/wiki/API:Main_page) 
* [Goldsmith's Python Wikipedia](https://github.com/goldsmith/Wikipedia)

The "native python" approach works well for the GitHub API because it is simple. However, this approach can become quite complicated for other APIs. For example, consider an API call to fetch the text of a Wikipedia page. Since the Wikipedia API is public, you can also see the API call and response through your browser by opening http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvlimit=1&format=json&titles=Python_%28programming_language%29

The native Python code associated with this API call follows.

In [5]:
import urllib
import json

url = urllib.urlopen('http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvlimit=1&format=json&titles=Python_%28programming_language%29')
wp_data = json.load(url) 
python_data = wp_data['query']['pages'].values()[0]
print('json keys', python_data.keys())
print('title', python_data['title'])
print('num revisions', len(python_data['revisions']))
python_rev = python_data['revisions'][0]
print('rev keys', type(python_rev), python_rev.keys())
wikitext = python_rev['*']
print('wikitext', repr(wikitext[:200]))



('json keys', [u'ns', u'pageid', u'revisions', u'title'])
('title', u'Python (programming language)')
('num revisions', 1)
('rev keys', <type 'dict'>, [u'*', u'contentmodel', u'contentformat'])
('wikitext', "u'{{Infobox programming language\\n|name                   = Python\\n|logo                   = Python logo and wordmark.svg\\n|logo size              = 260px\\n|paradigm               = [[multi-paradigm progra'")
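
The long URL in the cell above can also be assembled from a list of parameters rather than typed by hand, which handles the percent-encoding of characters such as parentheses for you. This is a small sketch; the helper name `wikipedia_revision_url` is my own, and the import fallback is there only so that it runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2


def wikipedia_revision_url(title):
    """Build the API URL that fetches the latest revision of one page."""
    params = [
        ('action', 'query'),
        ('prop', 'revisions'),
        ('rvprop', 'content'),
        ('rvlimit', 1),
        ('format', 'json'),
        ('titles', title),
    ]
    # urlencode percent-encodes each value and joins the pairs with '&'.
    return 'http://en.wikipedia.org/w/api.php?' + urlencode(params)


api_url = wikipedia_revision_url('Python_(programming_language)')
print(api_url)
```

The result is the same URL used earlier, with `(` and `)` encoded as `%28` and `%29`.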


That's a lot of work to get one Wikipedia page! As it turns out, other parts of the [Wikipedia API](http://www.mediawiki.org/wiki/API:Main_page) are even more convoluted.

Luckily, other Python programmers have created Python modules that hide this complexity. A Google search for "Python Wikipedia API Library" directs you to a [list of Wikipedia Python client wrappers](http://www.mediawiki.org/wiki/API:Client_code#Python).

When looking for a wrapper module, you'll often find many choices. One method for choosing a good client wrapper I find effective is to head to the module's GitHub repo (virtually all wrapper modules are hosted on GitHub) and choose the one that a) can be installed using pip, b) seems simple to use and c) is popular, as measured by number of GitHub stars in the upper right. 
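
If you want to compare popularity programmatically, the GitHub API from the previous section can help. The sketch below is one possible approach: `fetch_star_count` reads the `stargazers_count` field from GitHub's repository endpoint, and `rank_by_stars` sorts candidates. The repository names and star counts in `example_counts` are made up so the example runs offline, and unauthenticated requests to GitHub are rate limited.

```python
import json

try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib import urlopen           # Python 2


def fetch_star_count(full_name):
    """Fetch the star count for a repo such as 'goldsmith/Wikipedia'."""
    response = urlopen('https://api.github.com/repos/' + full_name)
    return json.loads(response.read().decode('utf-8'))['stargazers_count']


def rank_by_stars(star_counts):
    """Sort (repo, stars) pairs from most to least starred."""
    return sorted(star_counts.items(), key=lambda pair: pair[1], reverse=True)


# Made-up numbers for two hypothetical repos, so the sketch runs offline.
example_counts = {'someuser/wiki-wrapper': 12, 'otheruser/pywiki': 340}
for repo, stars in rank_by_stars(example_counts):
    print('%s: %d stars' % (repo, stars))
```

Stars only measure popularity; you still need to check the other two criteria by reading the project's README.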

For example, for the Wikipedia API, [Goldsmith's Python Wikipedia](https://github.com/goldsmith/Wikipedia) wrapper module fits the first two criteria and has been starred by 757 users. You would typically install the module from the command line using pip:

    $ pip install wikipedia
    
However, I already asked you to install it earlier by running the `install_module` function I wrote. You can verify that it is installed:

In [19]:
install_module('wikipedia')

module wikipedia already installed


After you have installed the wikipedia module, you can call `wikipedia.page()`, which:

1. Constructs an HTTP request to the Wikipedia API, as we did above.
2. Parses the HTTP response, which is JSON, into a native Python data structure.
3. Returns a Wikipedia ["page object"](https://wikipedia.readthedocs.org/en/latest/code.html#module-wikipedia) with user-friendly attributes such as a list of sections and plain text content.

Notice how simple the code that follows looks compared to the earlier "native Python" code. In general, you'll find it much easier to use these wrapper modules - particularly for APIs which require authentication.

In [6]:
import wikipedia
python_page = wikipedia.page('Python_(programming_language)')
print(python_page, type(python_page))
print('sections', python_page.sections)
print('content', python_page.content[:500])

(<WikipediaPage 'Python (programming language)'>, <class 'wikipedia.wikipedia.WikipediaPage'>)
('sections', [])
('content', u'Python is a widely used high-level programming language for general-purpose programming, created by Guido van Rossum and first released in 1991. An interpreted language, Python has a design philosophy that emphasizes code readability (notably using whitespace indentation to delimit code blocks rather than curly brackets or keywords), and a syntax that allows programmers to express concepts in fewer lines of code than might be used in languages such as C++ or Java. The language provides construct')


# 3. Using wrapper libraries to access APIs that require authentication

**Time estimate:** 90 minutes, excluding assignments.

**Supplemental resources:**

* [Using the Twitter API with Python tutorial](http://darkmattersheep.net/2013/09/using-twitter-api-with-python/).
* Ryan Boyd's overview of OAuth 2.0: https://www.youtube.com/watch?v=YLHyeSuBspI. It's long, but very clear.
* [Twitter Developer Site](https://dev.twitter.com/).
* [Twitter application-only authentication](https://dev.twitter.com/docs/auth/application-only-auth).

GitHub, Reddit and Wikipedia's support for public API access is rare. Most APIs require you to authenticate yourself (prove who you are), typically using a scheme called OAuth 2.0. Even so, every API handles the details differently.

In general, you'll follow three steps to retrieve data using OAuth 2.0.

**Step A: Register for the service and retrieve your access tokens.**

1. You will create a user account with the service (Facebook, Twitter, Foursquare, etc).
2. You will request "secret" access tokens for your account. Most major social APIs provide a web interface for doing so. For many services, you'll receive several different secret tokens.

**Step B: Install a Python API wrapper module.**

1. You will install a Python API wrapper module associated with the service.
2. You often have several choices for these. Since almost all of them are hosted on GitHub, the number of stars a project has is a useful proxy for its quality.

**Step C: Develop your Python program.**

1. You will provide your access tokens to the Python library to connect to the service.
2. Very often, the wrapper module will closely follow the social media API, so you'll need to reference both the Python wrapper's documentation and the social media API documentation.

We'll walk through this procedure for both Facebook and Twitter. 

### 3.1. Accessing Twitter Data

**Part A: Create your app and retrieve its access tokens.**

1. Create or login to your [Twitter account](http://twitter.com).
2. Login to the [Twitter Developer Site](https://dev.twitter.com/) with the same account.
3. Visit the [Manage your apps](https://apps.twitter.com/) page.
4. Click "Create new app".
5. Create a name, description and website. The name must be unique. The website can be a placeholder for now. You can leave the callback blank.
6. Select "I agree" and click "Create application."
7. Click the "API Keys" tab and click "Create my access token."
8. Using information on this page, create a dictionary called `twitter_config` as described below:

In [10]:
# Replace the contents of this dictionary with your access information.
#
twitter_config = {
        # These two values appear as "API key" and "API secret" under the "Application Settings" section
        'API_KEY' :  '',           
        'API_SECRET' :  '',
        
        # These two values appear as "Access token" and "Access token secret" under the "Your access token" section
        'ACCESS_TOKEN' :  '',
        'ACCESS_SECRET' :  '',
      }
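
Pasting secrets into a notebook makes them easy to leak when you share it. As an alternative, here is a sketch that builds the same dictionary from environment variables. The function name `load_twitter_config` and the `TWITTER_` variable prefix are my own conventions, not something Twitter or the `twitter` module prescribes.

```python
import os

REQUIRED_KEYS = ('API_KEY', 'API_SECRET', 'ACCESS_TOKEN', 'ACCESS_SECRET')


def load_twitter_config(environ=None, prefix='TWITTER_'):
    """Build the config dictionary from variables such as TWITTER_API_KEY."""
    environ = os.environ if environ is None else environ
    missing = [prefix + key for key in REQUIRED_KEYS
               if not environ.get(prefix + key)]
    if missing:
        raise KeyError('missing environment variables: ' + ', '.join(missing))
    return dict((key, environ[prefix + key]) for key in REQUIRED_KEYS)


# Demonstration with a fake environment; real values would come from os.environ.
fake_env = {
    'TWITTER_API_KEY': 'key',
    'TWITTER_API_SECRET': 'secret',
    'TWITTER_ACCESS_TOKEN': 'token',
    'TWITTER_ACCESS_SECRET': 'token-secret',
}
demo_config = load_twitter_config(fake_env)
print(sorted(demo_config))
```

With the variables set in your shell, `twitter_config = load_twitter_config()` yields a dictionary with the same four keys as the one above.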

**Part B: Install the Twitter library**

In [11]:
install_module('twitter')
install_module('oauth2')

module twitter already installed
module oauth2 already installed


**Part C: Python hacking!**

First, you must authenticate your Twitter client:

In [12]:
from twitter import OAuth

oauth = OAuth(
            twitter_config['ACCESS_TOKEN'],
            twitter_config['ACCESS_SECRET'],
            twitter_config['API_KEY'],
            twitter_config['API_SECRET'],
        )

Once you authenticate the Twitter client, you can use the module to call API methods. You begin by creating a Twitter object:

In [13]:
from twitter import Twitter
t = Twitter(auth=oauth)

The twitter library essentially mirrors the Twitter API. For example, consider the [search/tweets API call](https://dev.twitter.com/docs/api/1.1/get/search/tweets). We call it as `t.search.tweets(...)` and pass it named parameters that correspond to the API documentation. For example, this API call requires a `q` parameter containing the search query.

In [14]:
from twitter import OAuth

oauth = OAuth(
            twitter_config['ACCESS_TOKEN'],
            twitter_config['ACCESS_SECRET'],
            twitter_config['API_KEY'],
            twitter_config['API_SECRET'],
        )

import pprint
from twitter import Twitter
t = Twitter(auth=oauth)

# The format of the parameters is detailed at https://dev.twitter.com/docs/api/1.1/get/search/tweets
# Get the first result of a query for 'trump' in Spanish
tweets = t.search.tweets(q='trump', lang='es', count=1)

# The structure of a response is detailed at the same webpage.
trump_tweet_str = pprint.pformat(tweets)
print(trump_tweet_str)

first  = tweets['statuses'][0]
user = first['user']['name']
text = first['text']
tstamp = first['created_at']
print("!!!!! user,text,tstamp",user, text, tstamp)

# Get the first tweet from the area around Crimea
# lat,long,radius: 100 miles around the center of Crimea
crimea = '45.3,34.4,100mi'
tweets = t.search.tweets(q='is', geocode=crimea)

first  = tweets['statuses'][0]
user = first['user']['name']
text = first['text']
tstamp = first['created_at']
print("!!!!!crimea user,text,tstamp",user, text, tstamp)

{u'search_metadata': {u'completed_in': 0.034,
                      u'count': 1,
                      u'max_id': 930179563639975936L,
                      u'max_id_str': u'930179563639975936',
                      u'next_results': u'?max_id=930179563639975935&q=trump&lang=es&count=1&include_entities=1',
                      u'query': u'trump',
                      u'refresh_url': u'?since_id=930179563639975936&q=trump&lang=es&include_entities=1',
                      u'since_id': 0,
                      u'since_id_str': u'0'},
 u'statuses': [{u'contributors': None,
                u'coordinates': None,
                u'created_at': u'Mon Nov 13 21:04:24 +0000 2017',
                u'entities': {u'hashtags': [],
                              u'media': [{u'display_url': u'pic.twitter.com/qKrt6lbYIK',
                                          u'expanded_url': u'https://twitter.com/elpoliticonews/status/930177452956987392/photo/1',
                                          u'id': 
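
The `search_metadata` part of the response includes a `next_results` query string that describes the next page of results. A sketch of one way to turn that string into keyword arguments for a follow-up call is shown below. The helper name `next_page_kwargs` is my own, and the sample string is copied from the output above.

```python
try:
    from urllib.parse import parse_qsl   # Python 3
except ImportError:
    from urlparse import parse_qsl       # Python 2


def next_page_kwargs(search_metadata):
    """Convert search_metadata['next_results'] into a dict of parameters.

    Returns None when there is no further page.
    """
    next_results = search_metadata.get('next_results')
    if not next_results:
        return None
    # Drop the leading '?' and split 'a=1&b=2' into key/value pairs.
    return dict(parse_qsl(next_results.lstrip('?')))


metadata = {
    'next_results': '?max_id=930179563639975935&q=trump&lang=es&count=1&include_entities=1',
}
kwargs = next_page_kwargs(metadata)
print(sorted(kwargs.items()))
```

With an authenticated client, a call such as `t.search.tweets(**kwargs)` should then fetch the next page, assuming the wrapper passes keyword arguments through as request parameters.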

You can also stream a realtime sample of tweets associated with a particular query. For example:

In [15]:
from twitter import TwitterStream
 
ts = TwitterStream(auth = oauth)
openstream = ts.statuses.filter(track='obama')
for (i, item) in enumerate(openstream):
    print item['user']['screen_name'], item['created_at'], item['text']
    if i > 10:
        break

TJEstes1210 Mon Nov 13 21:04:55 +0000 2017 RT @AnnaApp91838450: https://t.co/nEB7TVpN8s
Where's ALL THE Republicans on this info ? #PERSIST JUSTICE FOR AMERICA LAW AND ORDER #MAGA… 
Ashley_9345 Mon Nov 13 21:04:55 +0000 2017 RT @keithboykin: If Obama had spent 5 years lying about George Bush’s birth certificate, had 5 kids from 3 women, and his third wife… 
crabbydick Mon Nov 13 21:04:56 +0000 2017 RT @DailyCaller: Trump’s New Labor Prosecutor Could Undo Obama-Era Union Wins In A Big Way https://t.co/5Jsjsan03w https://t.co/Wmc9Lul62W
xfranman Mon Nov 13 21:04:56 +0000 2017 @WhiteHouse Really pulling out the stops for our President. When Obama went to these things they couldn't manage a… https://t.co/eG001hbLGd
angel_leXO Mon Nov 13 21:04:56 +0000 2017 RT @keithboykin: If Obama had spent 5 years lying about George Bush’s birth certificate, had 5 kids from 3 women, and his third wife… 
JohnABarclayIV Mon Nov 13 21:04:56 +0000 2017 RT @The_Trump_Train: President Trump underestimated Bara
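
A live stream can interleave tweets with other messages, such as deletion or limit notices, and those lack the `user` and `text` keys used in the loop above. The sketch below filters a stream down to tweet-like items and stops after a fixed number. The helper names are my own, and `fake_stream` is a hand-made stand-in for the iterator returned by `ts.statuses.filter(...)`.

```python
from itertools import islice


def is_tweet(item):
    """Return True for stream items that look like tweets."""
    return isinstance(item, dict) and 'text' in item and 'user' in item


def first_tweets(stream, limit):
    """Collect up to `limit` tweets from a stream, skipping other messages."""
    return list(islice((item for item in stream if is_tweet(item)), limit))


# Hand-made stand-in for a live stream: two tweets mixed with two notices.
fake_stream = iter([
    {'user': {'screen_name': 'alice'}, 'text': 'first'},
    {'delete': {'status': {'id': 1}}},
    {'limit': {'track': 5}},
    {'user': {'screen_name': 'bob'}, 'text': 'second'},
])

for tweet in first_tweets(fake_stream, 10):
    print('%s: %s' % (tweet['user']['screen_name'], tweet['text']))
```

Note that on a real stream `first_tweets` blocks until `limit` tweets have arrived or the stream ends.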

### 3.2. Accessing Facebook Data
We perform a similar process for Facebook. I presume you already have a Facebook account. 

1. Log into your Facebook account.
2. Go to the Facebook graph explorer: https://developers.facebook.com/tools/explorer/
3. Request a new access token, and record it for use below.

In [None]:
install_module('facebook-sdk')

Next, we'll use the [Python Facebook SDK](https://github.com/pythonforfacebook/facebook-sdk) module to print out information about you.

In [48]:
import facebook
#{
#  "id": "",
#  "name": "Paul Olsztyn"
#}

# replace with access token from https://developers.facebook.com/tools/explorer/
ACCESS_TOKEN = ''
graph = facebook.GraphAPI(ACCESS_TOKEN)
print(graph.get_object("me"))

GraphAPIError: Invalid OAuth access token.

# 4. Accessing data through Social Media Syndication Services

**Time estimate:** 90 minutes (excluding assignment questions).

Interacting with social media APIs directly or through a Python wrapper module provides a cost-effective and flexible way to download data. However, it comes with limitations. It takes time to learn and write code for each API. In addition, some data is not available through APIs. For example, Twitter's standard search API only returns tweets from roughly the past week.

Social Media Syndication services overcome these limitations in exchange for a significant licensing fee. Two syndication services dominate the market: [Gnip](http://gnip.com/) and [DataSift](http://datasift.com). Since Gnip pricing starts at $500, we will experiment with DataSift using a two-week trial license.

**Step A: Sign up for a DataSift trial.**

Head to https://datasift.com/auth/register and sign up for the trial registration. It may take a day to approve your trial registration.

Once you've activated your account, visit your [DataSift dashboard](http://datasift.com/dashboard). You'll notice that you have $10 of trial credits to spend. As long as you're careful with your feeds, this should be plenty for our purposes.

Spend some time browsing the available [DataSift data sources](https://datasift.com/source). Since we want to conserve money, we'll focus on the relatively inexpensive Tumblr datasource (your $10 of free credits will easily cover all class activity). Enable the [Tumblr datasource](https://app.datasift.com/source/53/tumblr), which contains all Tumblr interactions.

**Step B: Install the datasift module**

Next, you should install the [official DataSift Python wrapper module](https://github.com/datasift/datasift-python).

I had to run the installation command below twice. The command failed to install HTTPS support on my computer, so you'll see that I disable HTTPS below by passing `ssl=False`.

In [4]:
install_module('datasift')

module datasift already installed


**Step C: Write Python code against DataSift**

After you've finished installing the module, you can experiment with your new data feed. 

Note that **THIS CODE WILL NOT WORK IN IPYTHON NOTEBOOK**. The multiprocessing design of datasift is incompatible with notebook's design. Instead, you should run the following code directly in IPython (from the command line) or as a script.

Note that the CSDL query that follows looks for all Tumblr posts with "python" in the body or tags. When I ran this query, I found a post about every ten seconds or so.

```
tumblr.body contains "python" OR tumblr.tags contains "python"
```



    from datasift import Client
    
    # Fill in account credentials from https://datasift.com/settings
    client = Client( "shilad",  "XXXX ", ssl=False)
    
    @client.on_delete
    def on_delete(interaction):
        pass
    
    @client.on_open
    def on_open():
        print( 'Streaming ready, can start subscribing')
        csdl = 'tumblr.body contains "python" OR tumblr.tags contains "python"'
        stream = client.compile(csdl)['hash']
    
        @client.subscribe(stream)
        def subscribe_to_hash(msg):
            print( msg)
    
    @client.on_ds_message
    def on_ds_message(msg):
        print( 'DS Message %s' % msg)
    
    #must start stream subscriber
    client.start_stream_subscriber()
   
### An aside: Python's decorators.

You'll notice the rather strange structure of the code above. After creating the client object and saving it in the `client` variable, we see a bunch of lines starting with `@` above function definitions. These `@` lines are **decorators.** In this case, because they start with `@client`, they alter (or decorate) the behavior of the client variable. Jeff Knupp has authored a [fantastic tutorial on decorators](https://www.jeffknupp.com/blog/2013/11/29/improve-your-python-decorators-explained/).

For instance, the `@client.on_ds_message` decorator tells Python that the function that follows it should become the client variable's `on_ds_message` handler:

    @client.on_ds_message
    def on_ds_message(msg):
        print('DS Message %s' % msg)

This code is used to create custom **handlers** for different types of events that come through the stream. You'll notice that the function decorated with `on_open` is substantially more complex than the others. It is triggered after the client connects to DataSift's web servers, and it subscribes to the Tumblr query feed.
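
To make the handler-registration pattern concrete, here is a toy sketch of how a client object can expose decorators that store functions for later use. `MiniClient` is a made-up class for illustration only; it is not the datasift API and makes no network connections.

```python
class MiniClient(object):
    """A toy event-driven client whose decorators register handlers."""

    def __init__(self):
        self.handlers = {}

    def on_open(self, func):
        # Remember the function so it can be called when the event fires.
        self.handlers['open'] = func
        return func

    def on_message(self, func):
        self.handlers['message'] = func
        return func

    def dispatch(self, event, *args):
        """Call whichever handler was registered for this event."""
        return self.handlers[event](*args)


mini_client = MiniClient()
received = []


@mini_client.on_message
def handle_message(msg):
    received.append(msg)

# The decorator line above is shorthand for:
#     handle_message = mini_client.on_message(handle_message)

mini_client.dispatch('message', 'hello')
print(received)   # ['hello']
```

A real streaming client would call the registered handlers itself whenever the matching event arrives, rather than relying on an explicit `dispatch` call.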