# How to Scrape Data From Facebook Page Posts for Statistical Analysis

By [Max Woolf (@minimaxir)](http://minimaxir.com/)

This notebook describes how to build a Facebook Scraper using the latest version of Facebook's Graph API (v2.4). This is the accompaniment to my blog post [How to Scrape Data From Facebook Page Posts for Statistical Analysis](http://minimaxir.com/2015/07/facebook-scraper/).

In [3]:
# import some Python dependencies

import urllib.request
import json
import datetime
import csv
import time
import pandas as pd
from PIL import Image
import requests
from io import BytesIO

Accessing Facebook page data requires an access token.

Since the user access token expires within an hour, we need to create a dummy application *for the sole purpose of scraping* and use the app ID and app secret generated there [as described here](https://developers.facebook.com/docs/facebook-login/access-tokens#apptokens), both of which never expire.

In [4]:
# Since the code output in this notebook leaks the app_secret,
# it has been reset by the time you read this.

app_id = "443809049300463"
app_secret = "e6ff2a431bb3da7624faefbf39a15a3d" # DO NOT SHARE WITH ANYONE!

access_token = app_id + "|" + app_secret
access_token

'443809049300463|e6ff2a431bb3da7624faefbf39a15a3d'
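
Hardcoding the secret in a notebook cell is exactly how it leaked above. Here is a minimal sketch of one alternative, assuming the credentials live in environment variables named `FB_APP_ID` and `FB_APP_SECRET` (hypothetical names picked for this example, not something Facebook requires):

```python
import os


def build_app_access_token(app_id, app_secret):
    """Join an app ID and app secret into the 'id|secret' app access token format."""
    if not app_id or not app_secret:
        raise ValueError("both app_id and app_secret are required")
    return "%s|%s" % (app_id, app_secret)


def load_app_access_token(environ=os.environ):
    # FB_APP_ID and FB_APP_SECRET are hypothetical variable names for this sketch.
    return build_app_access_token(environ.get("FB_APP_ID"), environ.get("FB_APP_SECRET"))
```

With the variables set in the shell before starting the notebook, `access_token = load_app_access_token()` would replace the cell above without printing the secret.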

Now we can access public Facebook data without limit. Let's do our analysis on the [New York Times Facebook page](https://www.facebook.com/nytimes), which is popular enough to yield good data.

In [5]:
group_id = '1717731545171536'

Let's write a quick program to ping NYT's Facebook page to verify that the `access_token` works and the `group_id` is valid.

In [6]:
def testFacebookPageData(group_id, access_token):
    
    # construct the URL string
    base = "https://graph.facebook.com/v2.4"
    node = "/" + group_id
    parameters = "/?access_token=%s" % access_token
    url = base + node + parameters
    
    # retrieve data
    req = urllib.request.Request(url)
    response = urllib.request.urlopen(req)
    data = json.loads(response.read().decode('utf-8'))
    
    print(json.dumps(data, indent=4, sort_keys=True))
    

testFacebookPageData("1717731545171536", access_token)

{
    "id": "1717731545171536",
    "name": "UC Berkeley Memes for Edgy Teens",
    "privacy": "OPEN"
}
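
The URL above is assembled by string concatenation, which works but leaves characters such as the `|` in the token unescaped. As a sketch of an alternative, the standard library's `urlencode` can build the query string. The helper name `build_graph_url` and the placeholder token are assumptions made for this example:

```python
from urllib.parse import urlencode


def build_graph_url(node, params, version="v2.8"):
    """Build a Graph API URL; urlencode escapes characters such as '|' in the token."""
    base = "https://graph.facebook.com/%s" % version
    return "%s/%s?%s" % (base, node.strip("/"), urlencode(params))


url = build_graph_url("1717731545171536", {"access_token": "APP_ID|APP_SECRET"})
print(url)
# https://graph.facebook.com/v2.8/1717731545171536?access_token=APP_ID%7CAPP_SECRET
```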


When scraping large amounts of data from public APIs, there's a high probability that you'll hit an [HTTP Error 500 (Internal Error)](http://www.checkupdown.com/status/E500.html) at some point. There is no way to avoid that on our end. 

Instead, we'll use a helper function to catch the error and try again after a few seconds, which usually works. This helper function also consolidates the data retrieval code, so it kills two birds with one stone.

In [7]:
def request_until_succeed(url):
    req = urllib.request.Request(url)
    success = False
    while success is False:
        try: 
            response = urllib.request.urlopen(req)
            if response.getcode() == 200:
                success = True
        except Exception as e:
            print(e)
            time.sleep(5)
            
            print("Error for URL %s: %s" % (url, datetime.datetime.now()))

    return response.read()
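
The helper above retries forever with a fixed five-second pause. If you would rather give up after a few attempts, here is a sketch of a bounded retry with exponential backoff. It is a variation, not the function used in the rest of this notebook, and the names `fetch_with_retries`, `max_attempts` and `base_delay` are made up for the example:

```python
import time


def fetch_with_retries(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds, waiting longer after each failure.

    fetch is any callable that returns the response body or raises on error.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as error:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print("Attempt %d failed (%s); retrying in %.0fs" % (attempt, error, delay))
            sleep(delay)
```

To use it with the standard library, pass something like `lambda u: urllib.request.urlopen(u).read()` as `fetch`.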

So far, the data returned is only the Facebook Page metadata; to get the posts themselves, we need to change the endpoint to the `/feed` endpoint.

In [8]:
def testFacebookPageFeedData(page_id, access_token):
    
    # construct the URL string
    base = "https://graph.facebook.com/v2.8"
    node = "/" + page_id + "/feed" # changed
    parameters = "/?access_token=%s" % access_token
    url = base + node + parameters
    
    # retrieve data
    data = json.loads(request_until_succeed(url).decode('utf-8'))
    
    print(json.dumps(data, indent=4, sort_keys=True))
    

testFacebookPageFeedData(group_id, access_token)

{
    "data": [
        {
            "id": "1717731545171536_1896484597296229",
            "updated_time": "2017-04-20T08:26:40+0000"
        },
        {
            "id": "1717731545171536_1900202616924427",
            "message": "Does this come in California plates??",
            "updated_time": "2017-04-20T08:26:39+0000"
        },
        {
            "id": "1717731545171536_1899560873655268",
            "updated_time": "2017-04-20T08:26:38+0000"
        },
        {
            "id": "1717731545171536_1899689933642362",
            "updated_time": "2017-04-20T08:26:28+0000"
        },
        {
            "id": "1717731545171536_1899685076976181",
            "updated_time": "2017-04-20T08:24:59+0000"
        },
        {
            "id": "1717731545171536_1900213250256697",
            "message": "When that cute boy from your discussion finally messages you",
            "updated_time": "2017-04-20T08:24:47+0000"
        },
        {
            "id": "1717731545171536_1

In v2.4, the default behavior is to return very, very little metadata for statuses in order to reduce bandwidth, with the expectation the user will request the necessary fields.

We don't need data on every NYT status. Yet. Let's reduce the requested fields to exactly what we need, and the number of stories returned to 1 so we can process it.

In [9]:
def getFacebookPageFeedData(page_id, access_token, num_statuses):
    
    # construct the URL string
    base = "https://graph.facebook.com"
    node = "/" + page_id + "/feed" 
    parameters = "/?fields=message,from,link,created_time,updated_time,type,name,id,likes.limit(1).summary(true),comments.limit(1).summary(true),shares&limit=%s&access_token=%s" % (num_statuses, access_token) # changed
    url = base + node + parameters
    
    # retrieve data
    data = json.loads(request_until_succeed(url).decode('utf-8'))
    
    return data
    

test_status = getFacebookPageFeedData(group_id, access_token, 1)["data"][0]
#print(json.dumps(test_status, indent=4, sort_keys=True))
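
The `fields` string above is long enough that it is easy to mistype. Here is a sketch that builds the same query from a list instead. The names `FIELDS` and `build_feed_query` are invented for this example, and the comment on the `.limit(1).summary(true)` suffix reflects my reading of the responses shown in this notebook:

```python
from urllib.parse import urlencode

# Field names copied from the request above. The .limit(1).summary(true) suffix
# appears to return at most one item plus a summary with the total count.
FIELDS = [
    "message", "from", "link", "created_time", "updated_time", "type", "name", "id",
    "likes.limit(1).summary(true)",
    "comments.limit(1).summary(true)",
    "shares",
]


def build_feed_query(access_token, num_statuses, fields=FIELDS):
    """Return the query string for a /feed request."""
    params = {
        "fields": ",".join(fields),
        "limit": num_statuses,
        "access_token": access_token,
    }
    # safe= leaves commas and parentheses unescaped so the field list stays readable
    return urlencode(params, safe=",()")
```

Adding or removing a field is then a one-line change to `FIELDS`.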

## Now that we have a sample Facebook page status, we can write a function to process each field individually.

In [10]:
test_status = getFacebookPageFeedData(group_id, access_token, 1)["data"][0]
print(json.dumps(test_status, indent=4, sort_keys=True))

{
    "comments": {
        "data": [
            {
                "created_time": "2017-04-13T22:42:51+0000",
                "from": {
                    "id": "1880563745564569",
                    "name": "Alec Rodriguez"
                },
                "id": "1896503293961026",
                "message": "Michael Gary Yi \ud83e\udd14"
            }
        ],
        "paging": {
            "cursors": {
                "after": "WTI5dGJXVnVkRjlqZAFhKemIzSTZANVGc1TmpVd016STVNemsyTVRBeU5qb3hORGt5TVRJek16Y3gZD",
                "before": "WTI5dGJXVnVkRjlqZAFhKemIzSTZANVGc1TmpVd016STVNemsyTVRBeU5qb3hORGt5TVRJek16Y3gZD"
            },
            "next": "https://graph.facebook.com/v2.8/1717731545171536_1896484597296229/comments?access_token=443809049300463%7Ce6ff2a431bb3da7624faefbf39a15a3d&summary=true&limit=1&after=WTI5dGJXVnVkRjlqZAFhKemIzSTZANVGc1TmpVd016STVNemsyTVRBeU5qb3hORGt5TVRJek16Y3gZD"
        },
        "summary": {
            "can_comment": false,
            "orde

# Now, we're going to write a function that processes the non-like reactions.
Reactions need to be iterated through and then counted for total votes. We can also figure out the most angry/sad/happy/etc poster, but we might just save it for later.

In [11]:
def processFacebookPageFeedStatus(post):
    
    # The status is now a Python dictionary, so for top-level items,
    # we can simply call the key.
    
    # Additionally, some items may not always exist,
    # so must check for existence first
    
    # Fields: post_id, post_message, from, 
    
    post_id = post['id']
    # Keep text as str: in Python 3, .encode() would write b'...' literals to the CSV
    post_message = post.get('message', '')
    link_name = post.get('name', '')
    post_type = post['type']
    post_link = '' if 'link' not in post.keys() else post['link']
    poster_id = post['from']['id']
    poster_name = post['from']['name']
    
    
    # Time needs special care since a) it's in UTC and
    # b) it's not easy to use in statistical programs.
    
    post_published = datetime.datetime.strptime(post['created_time'],'%Y-%m-%dT%H:%M:%S+0000')
    post_published = post_published + datetime.timedelta(hours=-8) # PST
    post_published = post_published.strftime('%Y-%m-%d %H:%M:%S') # best time format for spreadsheet programs
    
    update_time = datetime.datetime.strptime(post['updated_time'],'%Y-%m-%dT%H:%M:%S+0000')
    update_time = update_time + datetime.timedelta(hours=-8) # PST
    
    # Nested items require chaining dictionary keys.
    
    num_likes = 0 if 'likes' not in post.keys() else post['likes']['summary']['total_count']
    num_comments = 0 if 'comments' not in post.keys() else post['comments']['summary']['total_count']
    num_shares = 0 if 'shares' not in post.keys() else post['shares']['count']
    
    # return a tuple of all processed data
    return (post_id, post_message, poster_id, poster_name, link_name, post_type, post_link,
           post_published, num_likes, num_comments, num_shares, update_time)

processed_test_status = processFacebookPageFeedStatus(test_status)
print(processed_test_status)

(datetime.datetime.now() - processed_test_status[-1]).seconds

('1717731545171536_1896484597296229', '', '431973770480713', 'Tony Lai', '', 'photo', 'https://www.facebook.com/photo.php?fbid=431743327170424&set=gm.1896484597296229&type=3', '2017-04-13 13:55:17', 1723, 189, 0, datetime.datetime(2017, 4, 20, 0, 26, 40))


3621
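
The time handling above shifts a naive datetime by hand. As a sketch of the same conversion with timezone-aware objects, `strptime` can parse the `+0000` suffix through `%z`, and `astimezone` does the shift. The name `to_pst_string` and the sample timestamp are made up for this example, and the offset is the same fixed UTC-8 (no daylight saving) used above:

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-8 offset, matching the hours=-8 shift used above (no daylight saving).
PST = timezone(timedelta(hours=-8), "PST")


def to_pst_string(graph_time):
    """Convert a Graph API timestamp such as '2017-04-13T21:55:17+0000' to PST text."""
    utc_time = datetime.strptime(graph_time, "%Y-%m-%dT%H:%M:%S%z")
    return utc_time.astimezone(PST).strftime("%Y-%m-%d %H:%M:%S")


print(to_pst_string("2017-04-13T21:55:17+0000"))  # 2017-04-13 13:55:17
```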

# Surprisingly, we're almost done! Now we just need to:

1. Query each page of Facebook Page Statuses (100 statuses per page) using `getFacebookPageFeedData`.
2. Process all statuses on that page using `processFacebookPageFeedStatus` and write the output to a CSV file.
3. Navigate to the next page, and repeat until there are no more statuses.

This block implements both the writing to CSV and page navigation.

(Note: after the initial scrape, adjust this so that it stops once it reaches posts older than the last scraped time.)

In [12]:
def scrapeFacebookPageFeedStatus(page_id, access_token):
    with open('%s_facebook_statuses.csv' % page_id, 'w', newline='', encoding='utf-8') as file:
        w = csv.writer(file)
        w.writerow([b"post_id", "post_message", "poster_id", "poster_name", 
                    "link_name", "post_type", "post_link", "post_published", 
                    "num_likes", "num_comments", "num_shares", "update_time"])
        
        has_next_page = True
        num_processed = 0   # keep a count on how many we've processed
        scrape_starttime = datetime.datetime.now()
        
        print("Scraping %s Facebook Page: %s\n" % (page_id, scrape_starttime))
        
        statuses = getFacebookPageFeedData(page_id, access_token, 100)
        
        while has_next_page:
            for status in statuses['data']:
                info = processFacebookPageFeedStatus(status)
                w.writerow(info)
                
                # output progress occasionally to make sure code is not stalling
                num_processed += 1
                if num_processed % 1000 == 0:
                    print("%s Statuses Processed: %s" % (num_processed, datetime.datetime.now()))
                    
            # if there is no next page, we're done.
            # (the last page can still have 'paging' but no 'next' link)
            if 'paging' in statuses and 'next' in statuses['paging']:
                statuses = json.loads(request_until_succeed(statuses['paging']['next']).decode('utf-8'))
            else:
                has_next_page = False
                
        
        print("\nDone!\n%s Statuses Processed in %s" % (num_processed, datetime.datetime.now() - scrape_starttime))


#scrapeFacebookPageFeedStatus(group_id, access_token)
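
The page navigation can also be separated from the CSV writing. Here is a sketch of that idea as a pair of generators. The names `iter_pages`, `iter_statuses` and `fetch_json` are assumptions for this example, and `fetch_json` stands in for any function that takes a URL and returns the decoded JSON:

```python
def iter_pages(first_page, fetch_json):
    """Yield each page of a Graph API style response, following paging.next.

    fetch_json is any callable that takes a URL and returns the decoded JSON dict.
    """
    page = first_page
    while True:
        yield page
        next_url = page.get("paging", {}).get("next")
        if not next_url:
            break  # last page: no 'next' link
        page = fetch_json(next_url)


def iter_statuses(first_page, fetch_json):
    """Flatten all pages into a single stream of statuses."""
    for page in iter_pages(first_page, fetch_json):
        for status in page.get("data", []):
            yield status
```

In this notebook, `fetch_json` could be `lambda url: json.loads(request_until_succeed(url).decode('utf-8'))`, and the scraping loop would reduce to a single `for status in iter_statuses(...)`.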

The CSV can be opened in all major statistical programs. Have fun! :)

You can download the [NYTimes data here](https://dl.dropboxusercontent.com/u/2017402/nytimes_facebook_statuses.zip). [4.6MB]

# Define a function down here that aggregates all of the likes/reacts data

We have 5 million likes on all posts. To be conservative, assume 10 million total reactions to all posts in the group: the initial scrape for all reaction data will then take 10,000,000/5000*11 = 22,000 seconds ≈ 367 minutes ≈ 6 hours.
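
The arithmetic can be checked in a couple of lines. This sketch assumes the 5000 is the number of reactions returned per request and the 11 is the number of seconds each request takes. The text does not spell out either figure, so treat both readings as assumptions:

```python
def estimate_scrape_seconds(total_reactions, reactions_per_request, seconds_per_request):
    """Estimate total scrape time as (number of requests) x (time per request)."""
    num_requests = total_reactions / reactions_per_request
    return num_requests * seconds_per_request


seconds = estimate_scrape_seconds(10_000_000, 5000, 11)
print("%.0f seconds = %.0f minutes = %.1f hours" % (seconds, seconds / 60, seconds / 3600))
# 22000 seconds = 367 minutes = 6.1 hours
```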

1. We compile all reactions into a Python dictionary before converting it to JSON, with the `post_id` and the reaction data (as a list).
2. We'll add each of these `post_id`s to a MongoDB database for further investigation.

In [13]:
def parseAllReactions(post_id, num_reactions):
    base = "https://graph.facebook.com/v2.8/"+post_id+"/reactions"
    parameters = "?access_token=%s&limit=%s" % (access_token, num_reactions)
    
    url = base + parameters
    
    
    scrape_starttime = datetime.datetime.now()
    data = json.loads(request_until_succeed(url).decode('utf-8'))
    
    full_data = []
    pages = 0
    has_next_page = True
    
    
    while has_next_page:
        full_data = full_data + data['data']
        pages+=1
        # if there is no next page, we're done.
        if 'next' in data.get('paging', {}):  # 'paging' is missing when a post has no reactions
            data = json.loads(request_until_succeed(data['paging']['next']).decode('utf-8'))
        else:
            has_next_page = False
            
    print("\nDone!\n%s pages Processed in %s" % (pages, datetime.datetime.now() - scrape_starttime))
            
    return {'post_id' : post_id, 'reactions' : full_data}


In [14]:
x = parseAllReactions(pd.read_csv('1717731545171536_facebook_statuses.csv')["b'post_id'"][0], 100)


Done!
53 pages Processed in 0:00:08.208185
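
As mentioned earlier, reactions need to be iterated through and counted. Here is a small sketch of that tally using `collections.Counter`. The helper name `count_reaction_types` is made up, and the sample records only imitate the shape of the API's reaction objects:

```python
from collections import Counter


def count_reaction_types(reactions):
    """Tally reactions by their 'type' field (LIKE, HAHA, ...)."""
    return Counter(reaction["type"] for reaction in reactions)


# Sample records in the same shape as the API's reaction objects (IDs are made up).
sample = [
    {"id": "1", "name": "A", "type": "LIKE"},
    {"id": "2", "name": "B", "type": "HAHA"},
    {"id": "3", "name": "C", "type": "LIKE"},
]
counts = count_reaction_types(sample)
print(counts["LIKE"], counts["HAHA"])  # 2 1
```

For the post scraped above, the call would be `count_reaction_types(x['reactions'])`.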


In [15]:
def getImageFromPost(post_id):
    base = "https://graph.facebook.com/v2.8/"+post_id+"/attachments"
    parameters = "/?access_token=%s" % access_token
    
    url = base + parameters
    data = json.loads(request_until_succeed(url).decode('utf-8'))
    
    return data['data'][0]['media']['image']['src']
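
The chained lookup above raises an error for posts without an image attachment (a plain text status, for instance). Here is a sketch of a more forgiving version that works on the already-decoded response. The name `extract_image_src` is an assumption for this example:

```python
def extract_image_src(attachments):
    """Return the image URL of the first attachment that has one, or None."""
    for attachment in attachments.get("data", []):
        src = attachment.get("media", {}).get("image", {}).get("src")
        if src:
            return src
    return None
```

Callers can then skip posts where the result is `None` instead of wrapping every call in a `try` block.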

In [21]:
base = "https://graph.facebook.com/v2.8/"+pd.read_csv('1717731545171536_facebook_statuses.csv')["b'post_id'"][0]+"/reactions"
parameters = "?access_token=%s&limit=%s" % (access_token, 100)

url = base + parameters
scrape_starttime = datetime.datetime.now()
data = json.loads(request_until_succeed(url).decode('utf-8'))
    
has_next_page = True
full_data = []
# while has_next_page:
#     full_data = full_data + data['data']
#     pages+=1
#     # if there is no next page, we're done.
#     if 'next' in data['paging'].keys():
#         data = json.loads(request_until_succeed(data['paging']['next']).decode('utf-8'))
#     else:
#         has_next_page = False

In [23]:
data['data']

[{'id': '1449773205074991', 'name': 'Leonino Colobong', 'type': 'LIKE'},
 {'id': '1463690963680974', 'name': 'Jacob Ramirez', 'type': 'LIKE'},
 {'id': '1464715400239331', 'name': 'Darren Huang', 'type': 'LIKE'},
 {'id': '1585578931499585', 'name': 'Aaron Chelliah', 'type': 'LIKE'},
 {'id': '613074145552790', 'name': 'Angel Rubio', 'type': 'LIKE'},
 {'id': '601813086690726', 'name': 'Juan M Arce', 'type': 'LIKE'},
 {'id': '10203176591453080', 'name': 'Kenneth Choong', 'type': 'LIKE'},
 {'id': '10206924590072981', 'name': 'OaTing Do', 'type': 'HAHA'},
 {'id': '10212807308364358', 'name': 'Roland Wen', 'type': 'LIKE'},
 {'id': '1246353628797054', 'name': 'Danielle Dirksen', 'type': 'LIKE'},
 {'id': '1354450551288698', 'name': 'Paul Ajodha', 'type': 'LIKE'},
 {'id': '1378796615477040', 'name': 'Nick Lawrence', 'type': 'LIKE'},
 {'id': '10206628821197940', 'name': 'Nathania Hartojo', 'type': 'LIKE'},
 {'id': '1339739609447068', 'name': 'Shreya De', 'type': 'LIKE'},
 {'id': '1274468372669798