# How to Scrape Data From Facebook Page Posts for Statistical Analysis

Adapted from http://minimaxir.com/2015/07/facebook-scraper/

In [1]:
# import some Python dependencies

import urllib2
import json
import datetime
import csv
import time

Accessing Facebook page data requires an access token.

+ Go to https://developers.facebook.com/ and **add a new app**
+ Now get **App id** and **App secret**

In [2]:
# Since the code output in this notebook leaks the app_secret,
# it has been reset by the time you read this.

app_id = "662979177234964"
app_secret = "52cf25e144057adb969e816e38f3307e" # DO NOT SHARE WITH ANYONE!

access_token = app_id + "|" + app_secret
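
Since a hard-coded secret is easy to leak (as happened above), here is a minimal sketch of building the same `app_id|app_secret` token from environment variables instead. It is written for Python 3, unlike the Python 2 cells in this notebook, and the variable names `FB_APP_ID` and `FB_APP_SECRET` as well as the function name are made up for this illustration; Facebook does not define them.

```python
import os


def build_app_access_token(environ=os.environ):
    """Return an app access token of the form '<app_id>|<app_secret>'.

    FB_APP_ID and FB_APP_SECRET are hypothetical names chosen for this
    sketch; set them in your shell before starting the notebook.
    """
    app_id = environ.get("FB_APP_ID")
    app_secret = environ.get("FB_APP_SECRET")
    if not app_id or not app_secret:
        raise RuntimeError("Set FB_APP_ID and FB_APP_SECRET first")
    return app_id + "|" + app_secret
```

With this, `access_token = build_app_access_token()` could stand in for the hard-coded cell, and the secret never appears in the notebook itself.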

With the access token, we can now query public Facebook Page data through the Graph API. Let's do our analysis on the [383Indians page](https://www.facebook.com/383Indians), which provides enough data.

In [3]:
page_id = '383Indians'    #USD383 Manhattan High School

Let's write a quick program to ping the page to verify that the `access_token` works and the `page_id` is valid.

In [4]:
def testFacebookPageData(page_id, access_token):
    
    # construct the URL string
    base = "https://graph.facebook.com/v2.4"
    node = "/" + page_id
    parameters = "/?access_token=%s" % access_token
    url = base + node + parameters
    
    # retrieve data
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    data = json.loads(response.read())
    
    print json.dumps(data, indent=4, sort_keys=True)
    

testFacebookPageData(page_id, access_token)

{
    "id": "527750833927938", 
    "name": "USD 383: Manhattan High School"
}


When scraping large amounts of data from public APIs, there's a high probability that you'll hit an [HTTP Error 500 (Internal Error)](http://www.checkupdown.com/status/E500.html) at some point. There is no way to avoid that on our end. 

Instead, we'll use a helper function to catch the error and try again after a few seconds, which usually works. This helper function also consolidates the data retrieval code, so it kills two birds with one stone.

In [5]:
def request_until_succeed(url):
    req = urllib2.Request(url)
    success = False
    while success is False:
        try: 
            response = urllib2.urlopen(req)
            if response.getcode() == 200:
                success = True
        except Exception, e:
            # report the failure first, then wait before retrying
            print e
            print "Error for URL %s: %s" % (url, datetime.datetime.now())
            time.sleep(5)

    return response.read()
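
The helper above retries forever, which is fine for an occasional HTTP 500 but would loop endlessly on an error that never goes away (an invalid token, say). As one possible variation, not the original author's method, here is a sketch in Python 3 that gives up after a fixed number of attempts, waits longer between attempts, and re-raises 4xx errors immediately. The names `request_with_retries`, `max_attempts`, `base_delay`, and the injectable `fetch` and `sleep` parameters are assumptions made for this example.

```python
import time
import urllib.error
import urllib.request


def _fetch(url):
    """Download url and return the raw response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def request_with_retries(url, max_attempts=5, base_delay=1.0,
                         fetch=_fetch, sleep=time.sleep):
    """Fetch url, retrying server and network errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # 4xx errors (bad token, unknown page) will not fix themselves
            if err.code < 500 or attempt == max_attempts:
                raise
        except urllib.error.URLError:
            # network-level failure; give up only on the last attempt
            if attempt == max_attempts:
                raise
        # wait 1x, 2x, 4x, ... base_delay seconds before the next try
        sleep(base_delay * 2 ** (attempt - 1))
```

Passing `fetch` and `sleep` as parameters is only there so the retry logic can be exercised without touching the network.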

So far, the data is only the Facebook Page metadata; to get the posts themselves, we need to change the endpoint to `/feed`.

In [6]:
def testFacebookPageFeedData(page_id, access_token):
    
    # construct the URL string
    base = "https://graph.facebook.com/v2.4"
    node = "/" + page_id + "/feed" # changed
    parameters = "/?access_token=%s" % access_token
    url = base + node + parameters
    
    # retrieve data
    data = json.loads(request_until_succeed(url))
    
    print json.dumps(data, indent=4, sort_keys=True)
    

testFacebookPageFeedData(page_id, access_token)

{
    "data": [
        {
            "created_time": "2017-04-13T20:04:33+0000", 
            "id": "527750833927938_1267169329986081", 
            "message": "MHS Business Professionals of America club is holding a donation car wash this Saturday, April 15, at AutoZone. They are raising money to attend the National Leadership Conference in Orlando in May. We hope to see you there from 10:00 am to 2:00 pm!"
        }, 
        {
            "created_time": "2017-04-13T12:27:13+0000", 
            "id": "527750833927938_1266787703357577", 
            "story": "USD 383: Manhattan High School added a new photo."
        }, 
        {
            "created_time": "2017-04-12T15:58:44+0000", 
            "id": "527750833927938_1265918030111211", 
            "message": "Our band program at Manhattan High School has started its annual Flower Bulb Fundraiser for the spring!!! If you are interested, please click the link below and check out what is available. Please share the link so that ev

In v2.4, the default behavior is to return very, very little metadata for statuses in order to reduce bandwidth, with the expectation the user will request the necessary fields.

We don't need data on every FB page status. Yet. Let's reduce the requested fields to exactly what we need, and the number of stories returned to 1 so we can process it.
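
The `fields` parameter takes a comma-separated list, and `limit` caps the number of statuses per response. As a sketch only (Python 3, while the notebook cells are Python 2), the query string could be assembled with `urllib.parse.urlencode` rather than by hand, so that special characters such as the `|` in the token get escaped. The names `GRAPH_BASE`, `FEED_FIELDS`, and `build_feed_url` are made up for this illustration; the field list is the one this notebook requests, and the base URL is pinned to v2.4 as in the earlier cells.

```python
from urllib.parse import urlencode

# Assumption: pin the API version, as the earlier test cells do.
GRAPH_BASE = "https://graph.facebook.com/v2.4"

# Same fields as the notebook's request, one per entry for readability.
FEED_FIELDS = [
    "message", "link", "created_time", "type", "name", "id",
    "likes.limit(1).summary(true)",
    "comments.limit(1).summary(true)",
    "shares",
]


def build_feed_url(page_id, access_token, num_statuses, fields=FEED_FIELDS):
    """Return the /feed URL for page_id with an escaped query string."""
    query = urlencode({
        "fields": ",".join(fields),
        "limit": num_statuses,
        "access_token": access_token,
    })
    return "%s/%s/feed?%s" % (GRAPH_BASE, page_id, query)
```

Keeping the fields in a list makes it easier to add or drop one later without hunting through a long string.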

In [7]:
def getFacebookPageFeedData(page_id, access_token, num_statuses):
    
    # construct the URL string
    base = "https://graph.facebook.com/v2.4"
    node = "/" + page_id + "/feed" 
    parameters = "/?fields=message,link,created_time,type,name,id,likes.limit(1).summary(true),comments.limit(1).summary(true),shares&limit=%s&access_token=%s" % (num_statuses, access_token) # changed
    url = base + node + parameters
    
    # retrieve data
    data = json.loads(request_until_succeed(url))
    
    return data
    

test_status = getFacebookPageFeedData(page_id, access_token, 1)["data"][0]
print json.dumps(test_status, indent=4, sort_keys=True)

{
    "comments": {
        "data": [], 
        "summary": {
            "can_comment": false, 
            "order": "chronological", 
            "total_count": 0
        }
    }, 
    "created_time": "2017-04-13T20:04:33+0000", 
    "id": "527750833927938_1267169329986081", 
    "likes": {
        "data": [
            {
                "id": "10210758794378299", 
                "name": "Brenda Mayberry"
            }
        ], 
        "paging": {
            "cursors": {
                "after": "MTAyMTA3NTg3OTQzNzgyOTkZD", 
                "before": "MTAyMTA3NTg3OTQzNzgyOTkZD"
            }
        }, 
        "summary": {
            "can_like": false, 
            "has_liked": false, 
            "total_count": 2
        }
    }, 
    "link": "https://www.facebook.com/383Indians/photos/a.719860491383637.1073741828.527750833927938/1267169329986081/?type=3", 
    "message": "MHS Business Professionals of America club is holding a donation car wash this Saturday, April 15, at Au

Now that we have a sample Facebook page status, we can write a function to process each field individually.

In [8]:
def processFacebookPageFeedStatus(status):
    
    # The status is now a Python dictionary, so for top-level items,
    # we can simply call the key.
    
    # Additionally, some items may not always exist,
    # so we must check for existence first
    
    status_id = status['id']
    status_message = '' if 'message' not in status.keys() else status['message'].encode('utf-8')
    link_name = '' if 'name' not in status.keys() else status['name'].encode('utf-8')
    status_type = status['type']
    status_link = '' if 'link' not in status.keys() else status['link']
    
    
    # Time needs special care since a) it's in UTC and
    # b) it's not easy to use in statistical programs.
    
    status_published = datetime.datetime.strptime(status['created_time'],'%Y-%m-%dT%H:%M:%S+0000')
    status_published = status_published + datetime.timedelta(hours=-5) # fixed UTC-5 offset (EST); ignores daylight saving time
    status_published = status_published.strftime('%Y-%m-%d %H:%M:%S') # best time format for spreadsheet programs
    
    # Nested items require chaining dictionary keys.
    
    num_likes = 0 if 'likes' not in status.keys() else status['likes']['summary']['total_count']
    num_comments = 0 if 'comments' not in status.keys() else status['comments']['summary']['total_count']
    num_shares = 0 if 'shares' not in status.keys() else status['shares']['count']
    
    # return a tuple of all processed data
    return (status_id, status_message, link_name, status_type, status_link,
           status_published, num_likes, num_comments, num_shares)

processed_test_status = processFacebookPageFeedStatus(test_status)
print processed_test_status

(u'527750833927938_1267169329986081', 'MHS Business Professionals of America club is holding a donation car wash this Saturday, April 15, at AutoZone. They are raising money to attend the National Leadership Conference in Orlando in May. We hope to see you there from 10:00 am to 2:00 pm!', 'Timeline Photos', u'photo', u'https://www.facebook.com/383Indians/photos/a.719860491383637.1073741828.527750833927938/1267169329986081/?type=3', '2017-04-13 15:04:33', 2, 0, 1)
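
The time conversion is the fiddliest step, so here it is in isolation as a Python 3 sketch. It parses the `+0000` suffix with `%z` rather than matching it literally, then shifts to a fixed offset. The function name `to_local_timestamp` and the `utc_offset_hours` parameter are made up for this example; like the cell above, a fixed offset does not follow daylight saving time, for which a real time zone database (for example the standard library's `zoneinfo` in Python 3.9+) would be needed.

```python
from datetime import datetime, timedelta, timezone


def to_local_timestamp(created_time, utc_offset_hours=-5):
    """Convert a Graph API timestamp to 'YYYY-MM-DD HH:MM:SS' at a fixed offset."""
    # %z understands the "+0000" suffix, so the result is timezone-aware
    utc_dt = datetime.strptime(created_time, "%Y-%m-%dT%H:%M:%S%z")
    local_dt = utc_dt.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return local_dt.strftime("%Y-%m-%d %H:%M:%S")


print(to_local_timestamp("2017-04-13T20:04:33+0000"))  # 2017-04-13 15:04:33
```

The printed value matches the `status_published` field in the tuple above.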


Surprisingly, we're almost done! Now we just need to:

1. Query each page of Facebook Page Statuses (100 statuses per page) using `getFacebookPageFeedData`.
2. Process all statuses on that page using `processFacebookPageFeedStatus` and write the output to a CSV file.
3. Navigate to the next page, and repeat until there are no more statuses.

This block implements both the writing to CSV and page navigation.

In [9]:
def scrapeFacebookPageFeedStatus(page_id, access_token):
    with open('%s_facebook_statuses.csv' % page_id, 'wb') as file:
        w = csv.writer(file)
        w.writerow(["status_id", "status_message", "link_name", "status_type", "status_link",
           "status_published", "num_likes", "num_comments", "num_shares"])
        
        has_next_page = True
        num_processed = 0   # keep a count on how many we've processed
        scrape_starttime = datetime.datetime.now()
        
        print "Scraping %s Facebook Page: %s\n" % (page_id, scrape_starttime)
        
        statuses = getFacebookPageFeedData(page_id, access_token, 100)
        
        while has_next_page:
            for status in statuses['data']:
                w.writerow(processFacebookPageFeedStatus(status))
                
                # output progress occasionally to make sure code is not stalling
                num_processed += 1
                if num_processed % 1000 == 0:
                    print "%s Statuses Processed: %s" % (num_processed, datetime.datetime.now())
                    
            # if there is no next page, we're done.
            if 'paging' in statuses and 'next' in statuses['paging']:
                statuses = json.loads(request_until_succeed(statuses['paging']['next']))
            else:
                has_next_page = False
                
        
        print "\nDone!\n%s Statuses Processed in %s" % (num_processed, datetime.datetime.now() - scrape_starttime)


scrapeFacebookPageFeedStatus(page_id, access_token)

Scraping 383Indians Facebook Page: 2017-04-13 16:25:59.213000


Done!
452 Statuses Processed in 0:00:05.943000
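
The page navigation can also be pulled out of the CSV-writing code. Below is a sketch of that idea as a Python 3 generator: it yields every status on a page, then follows `paging.next` until a response has no such link. The names `iter_statuses` and `fetch_json` are assumptions for this example; `fetch_json` stands for any callable that takes a URL and returns the decoded JSON dictionary.

```python
def iter_statuses(first_page, fetch_json):
    """Yield every status, following 'paging.next' links until none is left."""
    page = first_page
    while True:
        for status in page.get("data", []):
            yield status
        # the last page may have no 'paging' key, or one without 'next'
        next_url = page.get("paging", {}).get("next")
        if not next_url:
            break
        page = fetch_json(next_url)
```

In this notebook, `fetch_json` could be something like `lambda url: json.loads(request_until_succeed(url))`, with the first page coming from `getFacebookPageFeedData(page_id, access_token, 100)`.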


The CSV can be opened in all major statistical programs. Have fun! :)
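
If you'd rather stay in Python for a first look, here is a small sketch (Python 3) that reads the CSV and averages likes per status type. It relies only on the column names written by the scraper above; the function name `mean_likes_by_type` is made up for this example.

```python
import csv
from collections import defaultdict


def mean_likes_by_type(csv_file):
    """Return {status_type: average num_likes} for the scraper's CSV output."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row["status_type"]] += int(row["num_likes"])
        counts[row["status_type"]] += 1
    return {status_type: totals[status_type] / counts[status_type]
            for status_type in totals}
```

A typical call would open the file first, for instance `with open('383Indians_facebook_statuses.csv', newline='', encoding='utf-8') as f: print(mean_likes_by_type(f))`.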

# Reference

How to Scrape Data From Facebook Page Posts for Statistical Analysis

By [Max Woolf (@minimaxir)](http://minimaxir.com/)

This notebook describes how to build a Facebook Scraper using the latest version of Facebook's Graph API (v2.4). This is the accompaniment to my blog post [How to Scrape Data From Facebook Page Posts for Statistical Analysis](http://minimaxir.com/2015/07/facebook-scraper/).