# How to Scrape Data From Facebook Page Posts for Statistical Analysis

By [Max Woolf (@minimaxir)](http://minimaxir.com/)

This notebook describes how to build a Facebook scraper using the latest version of Facebook's Graph API (v2.4). It accompanies my blog post [How to Scrape Data From Facebook Page Posts for Statistical Analysis](http://minimaxir.com/2015/07/facebook-scraper/).

In [1]:
# import some Python dependencies

import urllib2
import json
import datetime
import csv
import time

Accessing Facebook page data requires an access token.

Since a user access token expires within an hour, we need to create a dummy application *for the sole purpose of scraping* and use the app ID and app secret generated there, [as described here](https://developers.facebook.com/docs/facebook-login/access-tokens#apptokens). Neither of these expires.

In [2]:
# Since the code output in this notebook leaks the app_secret,
# it has been reset by the time you read this.

app_id = "272535582777707"
app_secret = "59e7ab31b01d3a5a90ec15a7a45a5e3b" # DO NOT SHARE WITH ANYONE!

access_token = app_id + "|" + app_secret
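
Because hard-coding the secret in a shared notebook leaks it, one alternative is to read both values from the environment. This is only a minimal sketch: `build_app_access_token` and the variable names `FB_APP_ID` and `FB_APP_SECRET` are invented for this example and are not defined by Facebook or used elsewhere in this notebook.

```python
import os

def build_app_access_token(environ=os.environ):
    """Build an app access token ("app_id|app_secret") from environment variables.

    FB_APP_ID and FB_APP_SECRET are hypothetical names chosen for this sketch.
    """
    app_id = environ.get("FB_APP_ID")
    app_secret = environ.get("FB_APP_SECRET")
    if not app_id or not app_secret:
        raise ValueError("Set FB_APP_ID and FB_APP_SECRET before running the scraper")
    # same "id|secret" format as the cell above
    return app_id + "|" + app_secret
```

With both variables set, `access_token = build_app_access_token()` would replace the hard-coded cell.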

Now we can access public Facebook data without limit. Let's do our analysis on the [New York Times Facebook page](https://www.facebook.com/nytimes), which is popular enough to yield good data.

In [3]:
page_id = 'nytimes'

Let's write a quick program to ping NYT's Facebook page to verify that the `access_token` works and the `page_id` is valid.

In [4]:
def testFacebookPageData(page_id, access_token):
    
    # construct the URL string
    base = "https://graph.facebook.com/v2.4"
    node = "/" + page_id
    parameters = "/?access_token=%s" % access_token
    url = base + node + parameters
    
    # retrieve data
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    data = json.loads(response.read())
    
    print json.dumps(data, indent=4, sort_keys=True)
    

testFacebookPageData(page_id, access_token)

{
    "id": "5281959998", 
    "name": "The New York Times"
}


When scraping large amounts of data from public APIs, there's a high probability that you'll hit an [HTTP Error 500 (Internal Error)](http://www.checkupdown.com/status/E500.html) at some point. There is no way to avoid that on our end. 

Instead, we'll use a helper function to catch the error and try again after a few seconds, which usually works. This helper function also consolidates the data retrieval code, so it kills two birds with one stone.

In [5]:
def request_until_succeed(url):
    req = urllib2.Request(url)
    success = False
    while success is False:
        try: 
            response = urllib2.urlopen(req)
            if response.getcode() == 200:
                success = True
        except Exception as e:
            # log the failure first, then wait a few seconds before retrying
            print e
            print "Error for URL %s: %s" % (url, datetime.datetime.now())
            time.sleep(5)

    return response.read()
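
The helper above retries forever, which is convenient for a one-off scrape but can hang on a permanently bad URL. A possible variant with a cap on attempts is sketched below. The names `request_with_retries`, `fetch`, `max_attempts`, and `wait_seconds` are assumptions made for this sketch; `fetch` stands for any callable that returns the response body or raises on failure.

```python
import time

def request_with_retries(fetch, url, max_attempts=5, wait_seconds=5, sleep=time.sleep):
    """Call fetch(url) until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose, like the helper above
            last_error = error
            if attempt < max_attempts:
                sleep(wait_seconds)  # pause before the next attempt
    raise RuntimeError("Giving up on %s after %d attempts: %s"
                       % (url, max_attempts, last_error))
```

In this notebook, `fetch` could be something like `lambda u: urllib2.urlopen(u).read()`. Passing `sleep` as a parameter makes the function easy to try out without actually waiting.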

However, the data returned so far is only the Facebook Page metadata. To get the posts themselves, we need to change the endpoint to `/feed`.

In [6]:
def testFacebookPageFeedData(page_id, access_token):
    
    # construct the URL string
    base = "https://graph.facebook.com/v2.4"
    node = "/" + page_id + "/feed" # changed
    parameters = "/?access_token=%s" % access_token
    url = base + node + parameters
    
    # retrieve data
    data = json.loads(request_until_succeed(url))
    
    print json.dumps(data, indent=4, sort_keys=True)
    

testFacebookPageFeedData(page_id, access_token)

{
    "data": [
        {
            "created_time": "2015-07-20T01:25:01+0000", 
            "id": "5281959998_10150628157724999", 
            "message": "The planned megalopolis, a metropolitan area that would be about 6 times the size of New York\u2019s, is meant to revamp northern China\u2019s economy and become a laboratory for modern urban growth."
        }, 
        {
            "created_time": "2015-07-19T22:55:01+0000", 
            "id": "5281959998_10150628161129999", 
            "message": "\"It\u2019s safe to say that federal agencies are not where we want them to be across the board,\" said President Barack Obama's top cybersecurity adviser. \"We clearly need to be moving faster.\""
        }, 
        {
            "created_time": "2015-07-19T22:25:01+0000", 
            "id": "5281959998_10150626434639999", 
            "message": "Showcase your summer tomatoes in this elegant crostata."
        }, 
        {
            "created_time": "2015-07-19T21:55:08+0000", 

In v2.4, the default behavior is to return very little metadata for statuses in order to reduce bandwidth, with the expectation that the user will request the fields they need.

We don't need data on every NYT status. Yet. Let's reduce the requested fields to exactly what we need, and the number of stories returned to 1 so we can process it.

In [7]:
def getFacebookPageFeedData(page_id, access_token, num_statuses):
    
    # construct the URL string
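    base = "https://graph.facebook.com" # note: no version prefix here, unlike the cells above; the paging URL in the output below shows v2.0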
    node = "/" + page_id + "/feed" 
    parameters = "/?fields=message,link,created_time,type,name,id,likes.limit(1).summary(true),comments.limit(1).summary(true),shares&limit=%s&access_token=%s" % (num_statuses, access_token) # changed
    url = base + node + parameters
    
    # retrieve data
    data = json.loads(request_until_succeed(url))
    
    return data
    

test_status = getFacebookPageFeedData(page_id, access_token, 1)["data"][0]
print json.dumps(test_status, indent=4, sort_keys=True)

{
    "comments": {
        "data": [
            {
                "can_remove": false, 
                "created_time": "2015-07-20T01:28:02+0000", 
                "from": {
                    "id": "859569687424896", 
                    "name": "Chris Gagne"
                }, 
                "id": "10150628157724999_10150628249759999", 
                "like_count": 9, 
                "message": "Aaaaaaaand there goes the rest of Beijing's clean air, whatever was left of it.", 
                "user_likes": false
            }
        ], 
        "paging": {
            "cursors": {
                "after": "MzE=", 
                "before": "MzE="
            }, 
            "next": "https://graph.facebook.com/v2.0/5281959998_10150628157724999/comments?order=chronological&limit=1&summary=true&access_token=272535582777707|59e7ab31b01d3a5a90ec15a7a45a5e3b&after=MzE%3D"
        }, 
        "summary": {
            "order": "ranked", 
            "total_count": 31
        }
    }
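
The long `parameters` string in `getFacebookPageFeedData` is easy to mistype. Below is a minimal sketch of assembling the same kind of URL from a list of fields with the standard library's `urlencode`. The function name `build_feed_url` and its `fields` argument are invented for this sketch, and it assumes the Graph API accepts the percent-encoded commas, parentheses, and `|` that `urlencode` produces.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2, as used in this notebook

def build_feed_url(page_id, access_token, num_statuses, fields):
    """Build a /feed URL that requests only the given fields."""
    base = "https://graph.facebook.com/v2.4"
    node = "/%s/feed" % page_id
    # a list of pairs keeps the parameter order stable
    query = urlencode([
        ("fields", ",".join(fields)),
        ("limit", num_statuses),
        ("access_token", access_token),
    ])
    return base + node + "?" + query
```

For example, `build_feed_url(page_id, access_token, 1, ["message", "created_time", "shares"])` would request a single status with three fields.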

Now that we have a sample Facebook page status, we can write a function to process each field individually.

In [8]:
def processFacebookPageFeedStatus(status):
    
    # The status is now a Python dictionary, so for top-level items,
    # we can simply call the key.
    
    # Additionally, some items may not always exist,
    # so must check for existence first
    
    status_id = status['id']
    status_message = '' if 'message' not in status.keys() else status['message'].encode('utf-8')
    link_name = '' if 'name' not in status.keys() else status['name'].encode('utf-8')
    status_type = status['type']
    status_link = '' if 'link' not in status.keys() else status['link']
    
    
    # Time needs special care since a) it's in UTC and
    # b) it's not easy to use in statistical programs.
    
    status_published = datetime.datetime.strptime(status['created_time'],'%Y-%m-%dT%H:%M:%S+0000')
    status_published = status_published + datetime.timedelta(hours=-5) # EST (fixed UTC-5 offset; daylight saving time is not handled)
    status_published = status_published.strftime('%Y-%m-%d %H:%M:%S') # best time format for spreadsheet programs
    
    # Nested items require chaining dictionary keys.
    
    num_likes = 0 if 'likes' not in status.keys() else status['likes']['summary']['total_count']
    num_comments = 0 if 'comments' not in status.keys() else status['comments']['summary']['total_count']
    num_shares = 0 if 'shares' not in status.keys() else status['shares']['count']
    
    # return a tuple of all processed data
    return (status_id, status_message, link_name, status_type, status_link,
           status_published, num_likes, num_comments, num_shares)

processed_test_status = processFacebookPageFeedStatus(test_status)
print processed_test_status

(u'5281959998_10150628157724999', 'The planned megalopolis, a metropolitan area that would be about 6 times the size of New York\xe2\x80\x99s, is meant to revamp northern China\xe2\x80\x99s economy and become a laboratory for modern urban growth.', 'China Molds a Supercity Around Beijing, Promising to Change Lives', u'link', u'http://nyti.ms/1Jr6LhU', '2015-07-19 20:25:01', 278, 31, 50)
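
The existence checks for nested items such as `likes` and `shares` all follow the same pattern, so they could be folded into one small helper. This is a sketch only; `get_nested` is a name made up for this example and is not part of the notebook's code.

```python
def get_nested(record, keys, default=None):
    """Walk a chain of dictionary keys, returning default if any key is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# a trimmed-down status with likes but no shares
sample_status = {"likes": {"summary": {"total_count": 278}}}
num_likes = get_nested(sample_status, ["likes", "summary", "total_count"], 0)
num_shares = get_nested(sample_status, ["shares", "count"], 0)
```

Here `num_likes` is 278 and `num_shares` falls back to 0, matching what the conditional expressions in `processFacebookPageFeedStatus` would return for the same input.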


Surprisingly, we're almost done! Now we just need to:

1. Query each page of Facebook Page Statuses (100 statuses per page) using `getFacebookPageFeedData`.
2. Process all statuses on that page using `processFacebookPageFeedStatus` and write the output to a CSV file.
3. Navigate to the next page, and repeat until there are no more statuses.

This block implements both the writing to CSV and page navigation.

In [9]:
def scrapeFacebookPageFeedStatus(page_id, access_token):
    with open('%s_facebook_statuses.csv' % page_id, 'wb') as file:
        w = csv.writer(file)
        w.writerow(["status_id", "status_message", "link_name", "status_type", "status_link",
           "status_published", "num_likes", "num_comments", "num_shares"])
        
        has_next_page = True
        num_processed = 0   # keep a count on how many we've processed
        scrape_starttime = datetime.datetime.now()
        
        print "Scraping %s Facebook Page: %s\n" % (page_id, scrape_starttime)
        
        statuses = getFacebookPageFeedData(page_id, access_token, 100)
        
        while has_next_page:
            for status in statuses['data']:
                w.writerow(processFacebookPageFeedStatus(status))
                
                # output progress occasionally to make sure code is not stalling
                num_processed += 1
                if num_processed % 1000 == 0:
                    print "%s Statuses Processed: %s" % (num_processed, datetime.datetime.now())
                    
            # if there is no next page, we're done.
            # (the last page can include 'paging' without a 'next' link)
            if 'paging' in statuses and 'next' in statuses['paging']:
                statuses = json.loads(request_until_succeed(statuses['paging']['next']))
            else:
                has_next_page = False
                
        
        print "\nDone!\n%s Statuses Processed in %s" % (num_processed, datetime.datetime.now() - scrape_starttime)


scrapeFacebookPageFeedStatus(page_id, access_token)

Scraping nytimes Facebook Page: 2015-07-19 18:36:33.051000

1000 Statuses Processed: 2015-07-19 18:36:59.366000
2000 Statuses Processed: 2015-07-19 18:37:28.289000
3000 Statuses Processed: 2015-07-19 18:37:56.487000
4000 Statuses Processed: 2015-07-19 18:38:30.355000
5000 Statuses Processed: 2015-07-19 18:38:58.661000
6000 Statuses Processed: 2015-07-19 18:39:26.990000
7000 Statuses Processed: 2015-07-19 18:39:55.906000
8000 Statuses Processed: 2015-07-19 18:40:20.628000
9000 Statuses Processed: 2015-07-19 18:40:44.801000
10000 Statuses Processed: 2015-07-19 18:41:11.759000
11000 Statuses Processed: 2015-07-19 18:41:38.739000
12000 Statuses Processed: 2015-07-19 18:42:05.562000
13000 Statuses Processed: 2015-07-19 18:42:32.696000
14000 Statuses Processed: 2015-07-19 18:42:59.939000
15000 Statuses Processed: 2015-07-19 18:43:26.889000
16000 Statuses Processed: 2015-07-19 18:43:53.106000
17000 Statuses Processed: 2015-07-19 18:44:19.457000
18000 Statuses Processed: 2015-07-19 18:44:45.63
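
The page navigation can also be separated from the CSV writing. The sketch below shows one way to do that with a generator that follows `paging.next` links until none remain. `iter_statuses` and its `fetch_page` argument are names assumed for this example; `fetch_page` is any callable that takes a URL and returns the decoded JSON dictionary for that page.

```python
def iter_statuses(first_page, fetch_page):
    """Yield every status, following paging.next links until none remain."""
    page = first_page
    while True:
        for status in page.get("data", []):
            yield status
        # stop when the page has no 'paging' block or no 'next' link
        next_url = page.get("paging", {}).get("next")
        if not next_url:
            break
        page = fetch_page(next_url)
```

In this notebook, `fetch_page` could be `lambda url: json.loads(request_until_succeed(url))`, and the scraper's inner loop would reduce to writing one row per yielded status.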

The CSV can be opened in all major statistical programs. Have fun! :)

You can download the [NYTimes data here](https://dl.dropboxusercontent.com/u/2017402/nytimes_facebook_statuses.zip). [4.6MB]
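
As a first sanity check on the output, the CSV can be read back with the standard `csv` module. The sketch below computes the average number of likes per status type. `mean_likes_by_type` is a name invented for this example, and it assumes the column names written by `scrapeFacebookPageFeedStatus`.

```python
import csv
from collections import defaultdict

def mean_likes_by_type(rows):
    """Return {status_type: average num_likes} for an iterable of CSV row dicts."""
    totals = defaultdict(lambda: [0, 0])  # status_type -> [sum of likes, row count]
    for row in rows:
        entry = totals[row["status_type"]]
        entry[0] += int(row["num_likes"])
        entry[1] += 1
    return {t: float(total) / count for t, (total, count) in totals.items()}

# Usage, assuming the scraper has already written the file:
# with open("nytimes_facebook_statuses.csv") as f:
#     print(mean_likes_by_type(csv.DictReader(f)))
```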