# How to Scrape Data From Facebook Page Posts for Statistical Analysis

By [Max Woolf (@minimaxir)](http://minimaxir.com/)

This notebook describes how to build a Facebook scraper using Facebook's Graph API (v2.4, the latest version at the time of writing). It accompanies my blog post [How to Scrape Data From Facebook Page Posts for Statistical Analysis](http://minimaxir.com/2015/07/facebook-scraper/).

In [2]:
# import some Python dependencies

import urllib.request
import json
import datetime
import csv
import time

Accessing Facebook page data requires an access token.

Since a user access token expires within an hour, we need to create a dummy application *for the sole purpose of scraping* and use the app ID and app secret generated there [as described here](https://developers.facebook.com/docs/facebook-login/access-tokens#apptokens). Neither of these expires.

In [3]:
# Since the code output in this notebook leaks the app_secret,
# it has been reset by the time you read this.

app_id = "443809049300463"
app_secret = "e6ff2a431bb3da7624faefbf39a15a3d" # DO NOT SHARE WITH ANYONE!

access_token = app_id + "|" + app_secret
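
Rather than hardcoding the secret in the notebook, one option is to read both values from environment variables. The sketch below assumes two hypothetical variable names, `FB_APP_ID` and `FB_APP_SECRET`, which are not part of the original notebook. It builds the same `app_id|app_secret` token as the cell above.

```python
import os


def build_app_access_token(env=os.environ):
    """Build an app access token of the form 'app_id|app_secret'.

    FB_APP_ID and FB_APP_SECRET are hypothetical variable names
    chosen for this sketch.
    """
    app_id = env.get("FB_APP_ID")
    app_secret = env.get("FB_APP_SECRET")
    if not app_id or not app_secret:
        raise RuntimeError("Set FB_APP_ID and FB_APP_SECRET first")
    return app_id + "|" + app_secret
```

With both variables set, `access_token = build_app_access_token()` could stand in for the hardcoded assignment, and the secret would no longer appear in the notebook source.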

Now we can access public Facebook data without limit. Let's do our analysis on the [New York Times Facebook page](https://www.facebook.com/nytimes), which is popular enough to yield good data.

In [4]:
group_id = '1717731545171536'

Let's write a quick program to ping the Graph API and verify that the `access_token` works and the ID is valid. The runs below use the ID of an open Facebook group, stored in `group_id`, in place of the NYT page ID.

In [8]:
def testFacebookGroupData(page_id, access_token):
    
    # construct the URL string
    base = "https://graph.facebook.com/v2.4"
    node = "/" + page_id  # use the function argument, not the global
    parameters = "/?access_token=%s" % access_token
    url = base + node + parameters
    
    # retrieve data
    req = urllib.request.Request(url)
    response = urllib.request.urlopen(req)
    data = json.loads(response.read().decode('utf-8'))
    
    print(json.dumps(data, indent=4, sort_keys=True))
    

testFacebookGroupData(group_id, access_token)

{
    "id": "1717731545171536",
    "name": "UC Berkeley Memes for Edgy Teens",
    "privacy": "OPEN"
}


When scraping large amounts of data from public APIs, there's a high probability that you'll hit an [HTTP Error 500 (Internal Error)](http://www.checkupdown.com/status/E500.html) at some point. There is no way to avoid that on our end. 

Instead, we'll use a helper function to catch the error and try again after a few seconds, which usually works. This helper function also consolidates the data retrieval code, so it kills two birds with one stone.

In [9]:
def request_until_succeed(url):
    req = urllib.request.Request(url)
    success = False
    while not success:
        try:
            response = urllib.request.urlopen(req)
            if response.getcode() == 200:
                success = True
        except Exception as e:
            # log the error, then wait a few seconds before retrying
            print(e)
            print("Error for URL %s: %s" % (url, datetime.datetime.now()))
            time.sleep(5)

    return response.read()
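
The helper above retries forever, which is convenient in a notebook but can hang on a permanently bad URL. As a hedged alternative, here is a minimal sketch of a bounded retry with exponentially growing delays. The name `request_with_retries` and its `opener` and `sleep` parameters are assumptions made for this example (they make the function easy to try without network access), not part of the original code.

```python
import time
import urllib.request


def request_with_retries(url, max_attempts=5, base_delay=1.0,
                         opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, retrying on errors with exponentially growing delays."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            response = opener(url)
            if response.getcode() == 200:
                return response.read()
            last_error = RuntimeError("HTTP status %s" % response.getcode())
        except Exception as e:
            last_error = e
        # wait 1x, 2x, 4x, ... the base delay before the next attempt
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError(
        "Giving up on %s after %s attempts" % (url, max_attempts)
    ) from last_error
```

Raising after `max_attempts` lets the caller decide whether to skip the URL or stop the scrape.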

So far, the data is only the Facebook Page metadata. To get the posts themselves, we need to change the endpoint to the `/feed` endpoint.

In [11]:
def testFacebookPageFeedData(page_id, access_token):
    
    # construct the URL string
    base = "https://graph.facebook.com/v2.4"
    node = "/" + page_id + "/feed" # changed
    parameters = "/?access_token=%s" % access_token
    url = base + node + parameters
    
    # retrieve data
    data = json.loads(request_until_succeed(url).decode('utf-8'))
    
    print(json.dumps(data, indent=4, sort_keys=True))
    

testFacebookPageFeedData(group_id, access_token)

{
    "data": [
        {
            "id": "1717731545171536_1897810397163649",
            "message": "Now available at House of Curries on Durant",
            "updated_time": "2017-04-16T02:37:09+0000"
        },
        {
            "id": "1717731545171536_1891143544497001",
            "updated_time": "2017-04-16T02:36:55+0000"
        },
        {
            "id": "1717731545171536_1897192917225397",
            "message": "thriving reacts only",
            "updated_time": "2017-04-16T02:36:36+0000"
        },
        {
            "id": "1717731545171536_1897288600549162",
            "updated_time": "2017-04-16T02:36:26+0000"
        },
        {
            "id": "1717731545171536_1897267883884567",
            "updated_time": "2017-04-16T02:35:39+0000"
        },
        {
            "id": "1717731545171536_1897787097165979",
            "message": "When you end up doing the whole group project by yourself",
            "updated_time": "2017-04-16T02:33:59+0000"
        

In v2.4, the default behavior is to return very, very little metadata for statuses in order to reduce bandwidth, with the expectation the user will request the necessary fields.

We don't need data on every NYT status. Yet. Let's reduce the requested fields to exactly what we need, and the number of statuses returned to 1 so we can inspect a single one.

In [14]:
def getFacebookGroupFeedData(group_id, access_token, num_statuses):
    
    # construct the URL string
    base = "https://graph.facebook.com"
    node = "/" + group_id + "/feed" 
    parameters = "/?fields=message,link,created_time,type,name,id,likes.limit(1).summary(true),comments.limit(1).summary(true),shares&limit=%s&access_token=%s" % (num_statuses, access_token) # changed
    url = base + node + parameters
    
    # retrieve data
    data = json.loads(request_until_succeed(url).decode('utf-8'))
    
    return data
    

test_status = getFacebookGroupFeedData(group_id, access_token, 1)["data"][0]
print(json.dumps(test_status, indent=4, sort_keys=True))

{
    "comments": {
        "data": [
            {
                "created_time": "2017-04-16T00:15:25+0000",
                "from": {
                    "id": "1447313365342309",
                    "name": "Claire Wiebe"
                },
                "id": "1897813973829958",
                "message": "Jean Yenbamroong"
            }
        ],
        "paging": {
            "cursors": {
                "after": "WTI5dGJXVnVkRjlqZAFhKemIzSTZANVGc1TnpneE16azNNemd5T1RrMU9Eb3hORGt5TXpBeE56STEZD",
                "before": "WTI5dGJXVnVkRjlqZAFhKemIzSTZANVGc1TnpneE16azNNemd5T1RrMU9Eb3hORGt5TXpBeE56STEZD"
            },
            "next": "https://graph.facebook.com/v2.8/1717731545171536_1897810397163649/comments?access_token=443809049300463%7Ce6ff2a431bb3da7624faefbf39a15a3d&summary=true&limit=1&after=WTI5dGJXVnVkRjlqZAFhKemIzSTZANVGc1TnpneE16azNNemd5T1RrMU9Eb3hORGt5TXpBeE56STEZD"
        },
        "summary": {
            "can_comment": false,
            "order": "chronolog
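
The query string above is assembled by hand with `%` formatting. A minimal sketch of the same request built with `urllib.parse.urlencode`, which percent-encodes characters such as the `|` in the access token, might look like this. The names `FIELDS` and `build_feed_url` are illustrative and not from the original notebook.

```python
from urllib.parse import urlencode

# the same fields requested in the cell above
FIELDS = [
    "message", "link", "created_time", "type", "name", "id",
    "likes.limit(1).summary(true)",
    "comments.limit(1).summary(true)",
    "shares",
]


def build_feed_url(node_id, access_token, num_statuses,
                   base="https://graph.facebook.com"):
    """Return the /feed URL for a page or group ID."""
    query = urlencode({
        "fields": ",".join(FIELDS),
        "limit": num_statuses,
        "access_token": access_token,
    })
    return "%s/%s/feed?%s" % (base, node_id, query)
```

The resulting URL can be passed to `request_until_succeed` exactly like the hand-built one.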

Now that we have a sample Facebook page status, we can write a function to process each field individually.

In [15]:
def processFacebookPageFeedStatus(status):
    
    # The status is now a Python dictionary, so for top-level items,
    # we can simply call the key.
    
    # Additionally, some items may not always exist,
    # so we must check for existence first.
    # Text fields stay as str so csv.writer writes them unchanged.

    status_id = status['id']
    status_message = '' if 'message' not in status else status['message']
    link_name = '' if 'name' not in status else status['name']
    status_type = status['type']
    status_link = '' if 'link' not in status else status['link']


    # Time needs special care since a) it's in UTC and
    # b) it's not easy to use in statistical programs.

    status_published = datetime.datetime.strptime(status['created_time'],'%Y-%m-%dT%H:%M:%S+0000')
    status_published = status_published + datetime.timedelta(hours=-5) # EST (fixed UTC-5 offset; ignores daylight saving time)
    status_published = status_published.strftime('%Y-%m-%d %H:%M:%S') # best time format for spreadsheet programs

    # Nested items require chaining dictionary keys.

    num_likes = 0 if 'likes' not in status else status['likes']['summary']['total_count']
    num_comments = 0 if 'comments' not in status else status['comments']['summary']['total_count']
    num_shares = 0 if 'shares' not in status else status['shares']['count']

    # return a tuple of all processed data
    return (status_id, status_message, link_name, status_type, status_link,
           status_published, num_likes, num_comments, num_shares)

processed_test_status = processFacebookPageFeedStatus(test_status)
print(processed_test_status)

('1717731545171536_1897810397163649', 'Now available at House of Curries on Durant', '', 'photo', 'https://www.facebook.com/photo.php?fbid=1409161252476811&set=gm.1897810397163649&type=3', '2017-04-15 19:01:05', 460, 156, 0)
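
The time handling above subtracts a fixed five hours from a naive datetime. As a hedged alternative, the sketch below parses the `+0000` suffix with `%z` to get a timezone-aware value and converts it with `astimezone`. The function name `to_local_string` and its `tz` parameter are assumptions for this example. The default is the same fixed UTC-5 offset the notebook uses.

```python
import datetime


def to_local_string(created_time,
                    tz=datetime.timezone(datetime.timedelta(hours=-5))):
    """Convert a Graph API timestamp such as '2017-04-16T02:37:09+0000'
    to a 'YYYY-MM-DD HH:MM:SS' string in the given time zone."""
    # %z parses the '+0000' suffix, producing a timezone-aware datetime
    utc_time = datetime.datetime.strptime(created_time, "%Y-%m-%dT%H:%M:%S%z")
    return utc_time.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S")


print(to_local_string("2017-04-16T02:37:09+0000"))  # 2017-04-15 21:37:09
```

On Python 3.9+ with a time zone database available, passing `zoneinfo.ZoneInfo("America/New_York")` as `tz` would also apply daylight saving time, which the fixed offset ignores.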


Surprisingly, we're almost done! Now we just need to:

1. Query each page of Facebook Page Statuses (100 statuses per page) using `getFacebookGroupFeedData`.
2. Process all statuses on that page using `processFacebookPageFeedStatus` and write the output to a CSV file.
3. Navigate to the next page, and repeat until there are no more statuses.

This block implements both the writing to CSV and page navigation.

In [17]:
def scrapeFacebookPageFeedStatus(page_id, access_token):
    # newline='' is what the csv module expects; utf-8 keeps non-ASCII messages intact
    with open('%s_facebook_statuses.csv' % page_id, 'w', newline='', encoding='utf-8') as file:
        w = csv.writer(file)
        w.writerow(["status_id", "status_message", "link_name", "status_type", "status_link", "status_published", "num_likes", "num_comments", "num_shares"])
        
        has_next_page = True
        num_processed = 0   # keep a count on how many we've processed
        scrape_starttime = datetime.datetime.now()
        
        print("Scraping %s Facebook Page: %s\n" % (page_id, scrape_starttime))
        
        statuses = getFacebookGroupFeedData(page_id, access_token, 100)
        
        while has_next_page:
            for status in statuses['data']:
                w.writerow(processFacebookPageFeedStatus(status))
                
                # output progress occasionally to make sure code is not stalling
                num_processed += 1
                if num_processed % 1000 == 0:
                    print("%s Statuses Processed: %s" % (num_processed, datetime.datetime.now()))
                    
            # if there is no next page, we're done.
            if 'paging' in statuses and 'next' in statuses['paging']:
                statuses = json.loads(request_until_succeed(statuses['paging']['next']).decode('utf-8'))
            else:
                has_next_page = False
                
        
        print("\nDone!\n%s Statuses Processed in %s" % (num_processed, datetime.datetime.now() - scrape_starttime))


scrapeFacebookPageFeedStatus(group_id, access_token)

Scraping 1717731545171536 Facebook Page: 2017-04-15 19:38:16.777051

1000 Statuses Processed: 2017-04-15 19:38:36.832556


KeyboardInterrupt: 
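
The page navigation can also be separated from the CSV writing. The sketch below is one way to do that with a generator that follows each `paging` → `next` link until none is left. `iter_statuses` and its `fetch_page` argument are hypothetical names introduced for this example.

```python
def iter_statuses(first_page, fetch_page):
    """Yield every status, following 'paging' -> 'next' links.

    fetch_page is a hypothetical callable: URL in, decoded JSON dict out.
    """
    page = first_page
    while True:
        for status in page.get("data", []):
            yield status
        next_url = page.get("paging", {}).get("next")
        if not next_url:
            # no further page, so the feed is exhausted
            break
        page = fetch_page(next_url)
```

In this notebook, `fetch_page` could be `lambda url: json.loads(request_until_succeed(url).decode('utf-8'))`, and the scraping loop would reduce to `for status in iter_statuses(statuses, fetch_page): ...`.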

The CSV can be opened in all major statistical programs. Have fun! :)

You can download the [NYTimes data here](https://dl.dropboxusercontent.com/u/2017402/nytimes_facebook_statuses.zip). [4.6MB]
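
As a first look at the data without leaving Python, here is a minimal sketch that reads the CSV back and averages likes per status type. It relies only on the column names written by the scraper. The function name `mean_likes_by_type` is an assumption for this example.

```python
import csv
from collections import defaultdict
from statistics import mean


def mean_likes_by_type(csv_file):
    """Return {status_type: mean num_likes} from an open CSV file object."""
    likes = defaultdict(list)
    for row in csv.DictReader(csv_file):
        likes[row["status_type"]].append(int(row["num_likes"]))
    return {status_type: mean(values) for status_type, values in likes.items()}
```

To use it, open the scraped file with `open(path, newline='', encoding='utf-8')` and pass the file object to `mean_likes_by_type`.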

In [19]:
redd = "https://www.reddit.com/r/rarepuppers/new.json?sort=hot"
redd_json = json.loads(request_until_succeed(redd).decode('utf-8'))
print(json.dumps(redd_json, indent=4, sort_keys=True))

HTTP Error 429: Too Many Requests
Error for URL https://www.reddit.com/r/rarepuppers/new.json?sort=hot: 2017-04-15 19:39:02.322444
{
    "data": {
        "after": "t3_65m4f7",
        "before": null,
        "children": [
            {
                "data": {
                    "approved_by": null,
                    "archived": false,
                    "author": "Jdhlove",
                    "author_flair_css_class": "doggo9",
                    "author_flair_text": "",
                    "banned_by": null,
                    "brand_safe": true,
                    "clicked": false,
                    "contest_mode": false,
                    "created": 1492338856.0,
                    "created_utc": 1492310056.0,
                    "distinguished": null,
                    "domain": "i.imgur.com",
                    "downs": 0,
                    "edited": false,
                    "gilded": 0,
                    "hidden": false,
                    "hide_score": 

In [46]:
# iterate over whatever the listing returned instead of assuming 25 posts
for child in redd_json['data']['children']:
    image = child['data']['url']
    if image.endswith('.jpg'):
        print(image)

http://i.imgur.com/1iq5I0m.jpg
https://i.redd.it/y09htcaxptry.jpg
https://i.redd.it/lafstfixjtry.jpg
https://i.redd.it/0svhpa7pjtry.jpg
https://i.redd.it/vxa8t6fkjtry.jpg
https://i.redd.it/de6290gnitry.jpg
http://i.imgur.com/9pCaWE5.jpg
http://i.imgur.com/ckCysSv.jpg
https://i.redd.it/bbgbqijtftry.jpg
http://i.imgur.com/THYTF28.jpg
http://i.imgur.com/AMNUd0E.jpg
http://i.imgur.com/kFJAHRC.jpg
https://i.redd.it/tpcoutdv9try.jpg
https://i.redd.it/oxq3s0yx8try.jpg
http://i.imgur.com/2mmzCf2.jpg
https://i.redd.it/cuqbqn034try.jpg
https://i.redd.it/r79jgiil0try.jpg
https://i.redd.it/gru2sksh0try.jpg
https://i.redd.it/h1brycjaxsry.jpg
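
The extension check above can miss URLs that carry a query string or an upper-case extension. A minimal sketch of a more tolerant filter, which compares only the path part of each URL, is shown below. `jpg_urls` is a hypothetical helper name, and the listing shape is taken from the JSON output above.

```python
from urllib.parse import urlsplit


def jpg_urls(listing):
    """Return the post URLs in a Reddit listing dict whose path ends in .jpg or .jpeg."""
    urls = []
    for child in listing.get("data", {}).get("children", []):
        url = child.get("data", {}).get("url", "")
        # compare against the path only, so '?query' suffixes are ignored
        if urlsplit(url).path.lower().endswith((".jpg", ".jpeg")):
            urls.append(url)
    return urls
```

Calling `jpg_urls(redd_json)` would return the matching URLs as a list rather than printing them.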
