# 04 - Facebook Public Page Scraper
<p class="lead">
Michelle Brown Notes v 1.4<br />


This notebook scrapes the information from a public Facebook page using Facebook's Graph API (v2.6).

It scrapes all the posts of a public Facebook page and their related metadata, including the post message, post links, and counts of comments, shares, and each reaction on the post. All this data is exported as a CSV, so it can be imported into an analysis program like Excel.

In [None]:
# import some Python dependencies
import urllib2
import json
import datetime
import csv
import time

In order to access page data, Facebook requires an access token, so we are going to create a dummy application for the sole purpose of scraping. <br>

1. Log in to your Facebook account, go to developers.facebook.com, open the dropdown menu in the upper right corner, and click "Add a New App."
2. Give the app a name and click Create App ID.
3. Click on the newly created app to pull up the Dashboard.
4. In the Dashboard, you'll see an App ID and an App Secret. Paste these between the quotation marks (" ") below:

In [None]:
app_id = "PASTEHERE" 
app_secret = "PASTEHERE" # DO NOT SHARE THIS WITH ANYONE!

access_token = app_id + "|" + app_secret
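
Since the app secret should never be shared, one option is to keep it out of the notebook entirely. The sketch below is only an illustration: it assumes the credentials are stored in two environment variables, `FB_APP_ID` and `FB_APP_SECRET`, which are hypothetical names chosen here and not something Facebook defines. It builds the same `app_id|app_secret` string as the cell above.

```python
import os

def build_app_access_token(env=None):
    """Build an app access token of the form APP_ID|APP_SECRET.

    FB_APP_ID and FB_APP_SECRET are hypothetical variable names chosen
    for this sketch; they are not defined by Facebook.
    """
    env = os.environ if env is None else env
    app_id = env.get("FB_APP_ID", "")
    app_secret = env.get("FB_APP_SECRET", "")
    if not app_id or not app_secret:
        raise ValueError("Set FB_APP_ID and FB_APP_SECRET first")
    return app_id + "|" + app_secret
```

With both variables set before the notebook is started, `access_token = build_app_access_token()` could replace the pasted values.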

With the access token in place, we can request public Facebook data. Let's scrape the [NDI Facebook Page](https://www.facebook.com/National.Democratic.Institute).  Below, replace what is between the quotes (' ') with the ID of the Facebook page or group. Once you find the page you want to scrape, copy and paste its URL into this tool: <a href="https://lookup-id.com/">https://lookup-id.com/</a> and it will give you the ID to put below. 

In [None]:
group_id = 'National.Democratic.Institute'
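
The example above uses the page's name as it appears in its URL rather than a numeric ID. As a rough sketch, that name can be pulled out of a page URL with the standard library. The helper `page_name_from_url` is hypothetical, and it assumes the page has a simple URL of the form `https://www.facebook.com/<name>`; other URL shapes (for example ones containing `/pages/` or `profile.php`) are not handled, and the lookup tool is the safer route for those.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

def page_name_from_url(page_url):
    """Return the first path segment of a Facebook page URL.

    Hypothetical helper: assumes a URL like
    https://www.facebook.com/<name>, optionally with a trailing slash
    or query string.
    """
    path = urlparse(page_url).path
    segments = [s for s in path.split("/") if s]
    if not segments:
        raise ValueError("No page name found in %r" % page_url)
    return segments[0]
```

For instance, passing the NDI page URL linked above yields `National.Democratic.Institute`, the value used for `group_id`.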

The code below pings the Facebook page to verify that the `access_token` works and the `group_id` is valid. If it works, it should output the id and name of the page. 

In [None]:
def testFacebookPageData(group_id, access_token):
    
    # construct the URL string
    base = "https://graph.facebook.com/v2.6"
    node = "/" + group_id   
    parameters = "/?access_token=%s" % access_token
    url = base + node + parameters
    
    # retrieve data
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    data = json.loads(response.read())
    
    print json.dumps(data, indent=4, sort_keys=True)
    
testFacebookPageData(group_id, access_token)

When scraping large amounts of data from public APIs, there's a high probability that you'll hit an [HTTP Error 500 (Internal Error)](http://www.checkupdown.com/status/E500.html) at some point. This is a helper function to catch the error and try again after a few seconds, which usually works. This helper function also consolidates the data retrieval code.

In [None]:
def request_until_succeed(url):
    req = urllib2.Request(url)
    success = False
    while success is False:
        try: 
            response = urllib2.urlopen(req)
            if response.getcode() == 200:
                success = True
        except Exception as e:
            print e
            print "Error for URL %s: %s" % (url, datetime.datetime.now())
            time.sleep(5)  # wait a few seconds before retrying

    return response.read()
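
The helper above retries forever, which is convenient but can hang on a URL that never succeeds (for example, one with an invalid token). A possible variation, shown here only as a sketch, caps the number of attempts and doubles the wait each time. The function name `request_with_retries` and its parameters are made up for this example, and the actual download is passed in as `fetch` so the sketch does not depend on `urllib2`.

```python
import time

def request_with_retries(fetch, url, max_attempts=5, base_delay=1.0,
                         sleep=time.sleep):
    """Call fetch(url) until it succeeds or max_attempts is reached.

    fetch is any callable that returns the response body or raises an
    exception; sleep can be swapped out to avoid real waiting in tests.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception as e:
            last_error = e
            if attempt < max_attempts - 1:
                # wait 1x, 2x, 4x, ... the base delay before the next try
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError("Giving up on %s after %d attempts: %s"
                       % (url, max_attempts, last_error))
```

In this notebook, `fetch` could be something like `lambda u: urllib2.urlopen(u).read()`.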

<h2>Next, we construct the request URL for the page feed and test that we get back some data:</h2>

In [None]:
def testFacebookPageFeedData(group_id, access_token):
    
    # construct the URL string
    base = "https://graph.facebook.com/v2.6"
    node = "/" + group_id + "/feed"                     # group id
    parameters = "/?access_token=%s" % access_token
    url = base + node + parameters
    
    # retrieve data
    data = json.loads(request_until_succeed(url))
    
    print json.dumps(data, indent=4, sort_keys=True)
    
testFacebookPageFeedData(group_id, access_token)
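
The URLs in this notebook are assembled by string concatenation. As an alternative sketch, the query string can be built with `urlencode`, which also escapes characters such as the `|` in the access token. `build_graph_url` is a hypothetical helper written for this illustration, not part of any Facebook library, and the base URL and version simply mirror the ones used above.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

def build_graph_url(node, access_token, fields=None, limit=None,
                    version="v2.6"):
    """Build a Graph API URL such as .../v2.6/<node>?access_token=..."""
    params = []
    if fields:
        params.append(("fields", ",".join(fields)))
    if limit is not None:
        params.append(("limit", limit))
    params.append(("access_token", access_token))
    return "https://graph.facebook.com/%s/%s?%s" % (
        version, node.strip("/"), urlencode(params))
```

For example, `build_graph_url(group_id + "/feed", access_token)` produces the feed URL for the page, with the token percent-encoded.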

<b>Run the functions to scrape all of the statuses' information and the reactions</b><br>
When you run the code below, one function gets the feed from the page, which contains information about the statuses. Another function gets the count of each reaction type for each status. Another function normalizes some of the data, such as dates and times, and handles statuses published before reactions were introduced on 24 February 2016, when only likes existed. Another function writes the data to a comma-separated file while printing the scraper's progress until it is done running. There are also two helper functions: one normalizes the unicode so it can be written to a CSV file, and the other retries the API request if there is an error.<br>
Once the scraper is finished, it should say it's done, and the data will have been written to a CSV file.
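
To make the date handling concrete, here is a small standalone sketch of that normalization step. The function names are made up for this illustration. It assumes timestamps arrive in the form `2016-03-01T02:30:00+0000` and applies a fixed offset of -5 hours, which does not account for daylight saving time.

```python
import datetime

# Date reactions became available, in the same format as the output below
REACTIONS_LAUNCH = "2016-02-24 00:00:00"

def normalize_created_time(created_time, utc_offset_hours=-5):
    """Convert a UTC timestamp string to a spreadsheet-friendly string."""
    utc_time = datetime.datetime.strptime(created_time,
                                          "%Y-%m-%dT%H:%M:%S+0000")
    local_time = utc_time + datetime.timedelta(hours=utc_offset_hours)
    return local_time.strftime("%Y-%m-%d %H:%M:%S")

def has_reaction_breakdown(status_published):
    # Zero-padded "YYYY-MM-DD HH:MM:SS" strings sort chronologically,
    # so a plain string comparison is enough here.
    return status_published > REACTIONS_LAUNCH
```

A status created at `2016-03-01T02:30:00+0000` becomes `2016-02-29 21:30:00`, which falls after the reactions launch date, so its per-reaction counts would be requested.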

In [None]:
# Needed to write tricky unicode correctly to csv.
# Replaces left/right single quotation marks, left/right double quotation
# marks, and the non-breaking space with their plain ASCII equivalents.
def unicode_normalize(text):
    return text.translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22,
                            0xa0:0x20 }).encode('utf-8')
    

# Helper function here for server request
def request_until_succeed(url):
    req = urllib2.Request(url)
    success = False
    while success is False:
        try: 
            response = urllib2.urlopen(req)
            if response.getcode() == 200:
                success = True
        except Exception as e:
            print e
            print "Error for URL %s: %s" % (url, datetime.datetime.now())
            time.sleep(5)  # wait a few seconds before retrying

    return response.read()

def getFacebookPageFeedData(group_id, access_token, num_statuses):
    # Construct the URL string; see 
    # http://stackoverflow.com/a/37239851 for Reactions parameters
    base = "https://graph.facebook.com/v2.6"
    node = "/%s/feed" % group_id 
    fields = "/?fields=message,link,created_time,type,name,id," + \
            "comments.limit(0).summary(true),shares,reactions." + \
            "limit(0).summary(true),from"
    parameters = "&limit=%s&access_token=%s" % (num_statuses, access_token)
    url = base + node + fields + parameters

    # retrieve data
    data = json.loads(request_until_succeed(url))

    return data

def getReactionsForStatus(status_id, access_token):
    # See http://stackoverflow.com/a/37239851 for Reactions parameters
    # Reactions are only accessible at a single-post endpoint

    base = "https://graph.facebook.com/v2.6"
    node = "/%s" % status_id
    reactions = "/?fields=" \
            "reactions.type(LIKE).limit(0).summary(total_count).as(like)" \
            ",reactions.type(LOVE).limit(0).summary(total_count).as(love)" \
            ",reactions.type(WOW).limit(0).summary(total_count).as(wow)" \
            ",reactions.type(HAHA).limit(0).summary(total_count).as(haha)" \
            ",reactions.type(SAD).limit(0).summary(total_count).as(sad)" \
            ",reactions.type(ANGRY).limit(0).summary(total_count).as(angry)"
    parameters = "&access_token=%s" % access_token
    url = base + node + reactions + parameters

    # retrieve data
    data = json.loads(request_until_succeed(url))

    return data

def processFacebookPageFeedStatus(status, access_token):
    # The status is now a Python dictionary, so for top-level items,
    # we can simply call the key.

    # Additionally, some items may not always exist,
    # so must check for existence first

    status_id = status['id']
    status_message = '' if 'message' not in status.keys() else \
            unicode_normalize(status['message'])
    link_name = '' if 'name' not in status.keys() else \
            unicode_normalize(status['name'])
    status_type = status['type']
    status_link = '' if 'link' not in status.keys() else \
            unicode_normalize(status['link'])
    status_author = unicode_normalize(status['from']['name'])

    # Time needs special care since a) it's in UTC and
    # b) it's not easy to use in statistical programs.

    status_published = datetime.datetime.strptime(\
            status['created_time'],'%Y-%m-%dT%H:%M:%S+0000')
    status_published = status_published + datetime.timedelta(hours=-5) # Adjusting for Eastern Standard Time (EST)
    # best time format for spreadsheet programs:
    status_published = status_published.strftime('%Y-%m-%d %H:%M:%S')

    # Nested items require chaining dictionary keys.
    num_reactions = 0 if 'reactions' not in status else \
            status['reactions']['summary']['total_count']
    num_comments = 0 if 'comments' not in status else \
            status['comments']['summary']['total_count']
    num_shares = 0 if 'shares' not in status else \
            status['shares']['count']

    # Counts of each reaction separately; good for sentiment
    # Only check for reactions if past date when implemented:
    # http://newsroom.fb.com/news/2016/02/reactions-now-available-globally/

    reactions = getReactionsForStatus(status_id, access_token) \
            if status_published > '2016-02-24 00:00:00' else {}

    num_likes = 0 if 'like' not in reactions else \
            reactions['like']['summary']['total_count']

    # Special case: Set number of Likes to Number of reactions for pre-reaction 
    # statuses

    num_likes = num_reactions if status_published < '2016-02-24 00:00:00' else \
            num_likes

    def get_num_total_reactions(reaction_type, reactions):
        if reaction_type not in reactions:
            return 0
        else:
            return reactions[reaction_type]['summary']['total_count']

    num_loves = get_num_total_reactions('love', reactions)
    num_wows = get_num_total_reactions('wow', reactions)
    num_hahas = get_num_total_reactions('haha', reactions)
    num_sads = get_num_total_reactions('sad', reactions)
    num_angrys = get_num_total_reactions('angry', reactions)

    # return a tuple of all processed data
    return (status_id, status_message, status_author, link_name, status_type, 
            status_link, status_published, num_reactions, num_comments, 
            num_shares,  num_likes, num_loves, num_wows, num_hahas, num_sads, 
            num_angrys)

def scrapeFacebookPageFeedStatus(group_id, access_token):
    with open('%s_facebook_statuses.csv' % group_id, 'wb') as file:
        w = csv.writer(file)
        w.writerow(["status_id", "status_message", "status_author", 
            "link_name", "status_type", "status_link",
            "status_published", "num_reactions", "num_comments", 
            "num_shares", "num_likes", "num_loves", "num_wows", 
            "num_hahas", "num_sads", "num_angrys"])

        has_next_page = True
        num_processed = 0   # keep a count on how many we've processed for status
        scrape_starttime = datetime.datetime.now()

        print "Scraping %s Facebook Page: %s\n" % \
                (group_id, scrape_starttime)

        statuses = getFacebookPageFeedData(group_id, access_token, 100)

        while has_next_page:
            for status in statuses['data']:

                # Ensure it is a status with the expected metadata
                if 'reactions' in status:            
                    w.writerow(processFacebookPageFeedStatus(status, \
                                                            access_token))

                # After every 100 statuses, print progress to make sure the code is not
                # stalling
                num_processed += 1
                if num_processed % 100 == 0:
                    print "%s Statuses Processed: %s" % (num_processed, 
                            datetime.datetime.now())

            # if there is no next page, we're done.
            if 'paging' in statuses and 'next' in statuses['paging']:
                statuses = json.loads(request_until_succeed(\
                        statuses['paging']['next']))
            else:
                has_next_page = False


        print "\nDone!\n%s Statuses Processed in %s" % \
                (num_processed, datetime.datetime.now() - scrape_starttime)

if __name__ == '__main__':
    scrapeFacebookPageFeedStatus(group_id, access_token)

# The CSV can be opened in all major statistical programs. Have fun! :)
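
The paging loop in the scraper can also be viewed in isolation. The sketch below is a simplified illustration, not the scraper itself: `iter_statuses` is a hypothetical name, and `fetch_page` stands in for whatever downloads and parses the next page. It assumes each page is a dictionary with a `data` list and, when more results exist, a `paging` entry containing a `next` URL.

```python
def iter_statuses(first_page, fetch_page):
    """Yield every status from first_page and all pages linked after it.

    fetch_page is any callable that takes a URL and returns the next
    page as a dictionary (for example, parsed JSON).
    """
    page = first_page
    while True:
        for status in page.get("data", []):
            yield status
        # Stop when the page has no 'paging' entry or no 'next' link
        next_url = page.get("paging", {}).get("next")
        if not next_url:
            break
        page = fetch_page(next_url)
```

In this notebook, `fetch_page` could be `lambda url: json.loads(request_until_succeed(url))`.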

There should be a CSV file saved in the same directory as this notebook (e.g. National.Democratic.Institute_facebook_statuses.csv).<br>
The CSV can be opened in all major statistical programs. Analyze and enjoy!
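
As a first step in analysis, the CSV can also be read back in Python. The following is a minimal sketch that sums `num_reactions` for each `status_type`, using column names from the header row written by the scraper. The function name and the sample rows are made up for illustration.

```python
import csv

def total_reactions_by_type(lines):
    """Sum the num_reactions column for each status_type.

    lines is any iterable of CSV lines, such as an open file.
    """
    totals = {}
    for row in csv.DictReader(lines):
        status_type = row["status_type"]
        totals[status_type] = (totals.get(status_type, 0)
                               + int(row["num_reactions"]))
    return totals

# Made-up sample rows using a subset of the scraper's column names
sample_lines = [
    "status_id,status_type,num_reactions",
    "1_1,link,10",
    "1_2,photo,5",
    "1_3,link,7",
]
```

To run it on the real output, open the file produced by the scraper and pass the file object in place of `sample_lines`.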