## Notebook Overview: Facebook Scraper

This notebook handles the time-consuming data scraping. All of the heavy lifting is done by the "get_posts" function from the "facebook_scraper" module (https://github.com/kevinzg/facebook-scraper). The notebook itself does three things:
- Passes the input Facebook group IDs to the "get_posts" function with appropriate options
- Wraps the scrape in an open-ended while loop so that it resumes after a temporary ban
- Saves the collected posts to several timestamped CSV files

In [2]:
from datetime import datetime, timezone
from facebook_scraper import get_posts, set_cookies, exceptions
import time
import pandas as pd
import json

In [20]:
"""
set_cookies is a function from the facebook-scraper module.
The cookie file can be in a number of formats; I used a JSON
document containing the "c_user" and "xs" cookies from a logged-in Facebook session.
github: https://github.com/kevinzg/facebook-scraper
"""
set_cookies("cookies.json")
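
Before starting a long scrape it can help to confirm that the cookie export really contains both cookies. The following is a minimal sketch, not part of facebook-scraper: `missing_cookies` is a hypothetical helper, and it assumes the export is either a name-to-value mapping or a list of objects with a "name" key. The actual layout depends on the tool used to export the cookies.

```python
import json

REQUIRED = {"c_user", "xs"}  # the two cookies mentioned above


def missing_cookies(cookie_data):
    """Return the required cookie names that are absent from the export.

    Hypothetical helper; assumes a dict of name -> value or a list of
    objects that each carry a "name" key.
    """
    if isinstance(cookie_data, dict):
        present = set(cookie_data)
    else:
        present = {cookie.get("name") for cookie in cookie_data}
    return REQUIRED - present


# Made-up sample export with a placeholder value and no "xs" cookie.
sample = json.loads('[{"name": "c_user", "value": "PLACEHOLDER"}]')
missing = missing_cookies(sample)
print(missing)
```

For a real file, the data would come from `json.load(open("cookies.json"))` instead of the inline sample.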

In [4]:
"""
This callback keeps track of the "pagination" URL, which is essentially a
bookmark for the page currently being requested. If the scrape hits a snag
or has to be restarted, the "get_posts" function can use it (via start_url) to
continue from where it left off instead of scraping the newest pages each time.
"""
def handle_pagination_url(url):
    global start_url
    start_url = url

start_url = None
i=0
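
The global variable only lives as long as the kernel does. As a sketch of one way to keep the bookmark across restarts, the callback could also write the URL to disk. The file name and the `save_bookmark` / `load_bookmark` helpers below are hypothetical and not part of facebook-scraper, and the URL is a made-up placeholder.

```python
import json
import os
import tempfile

# Hypothetical location for the bookmark file.
BOOKMARK_FILE = os.path.join(tempfile.gettempdir(), "fb_scraper_bookmark.json")


def save_bookmark(url, path=BOOKMARK_FILE):
    """Write the most recently requested pagination URL to disk."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"start_url": url}, f)


def load_bookmark(path=BOOKMARK_FILE):
    """Return the saved URL, or None if nothing has been saved yet."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return json.load(f).get("start_url")


# Round trip with a placeholder URL.
save_bookmark("https://example.invalid/page?cursor=abc")
restored = load_bookmark()
print(restored)
```

Under these assumptions, `start_url = load_bookmark()` would replace `start_url = None`, and `handle_pagination_url` would call `save_bookmark(url)` in addition to updating the global.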

In [13]:
# group IDs considered for scraping (the last one was not scraped in the end)
group_ids = ["373920943948661", "365867864454134", "5950528321639271", "706985770086209"]

In [21]:
# primary loop: iterates over every post yielded by a single "get_posts" call, which itself makes multiple page requests
results = []
now = datetime.now()
nowtime = "{}-{}-{}_{}-{}-{}".format(now.year,now.month,now.day,now.hour,now.minute,now.second)
while True:
    try :
        for post in get_posts(group = group_ids[0], page_limit = None, pages = 1000, start_url = start_url, encoding="utf-8",
        request_url_callback = handle_pagination_url, options={"posts_per_page":1000,"allow_extra_requests":True,"comments":True}):
            i+=1
            # print a message every 100 posts
            if (i%100)==0:
                now = datetime.now()
                nowtime = "{}-{}-{}_{}-{}-{}".format(now.year,now.month,now.day,now.hour,now.minute,now.second)
                print("{} ## {} posts processed".format(nowtime, i))
            results.append(post)
        # if the loop finishes, save the results in a dataframe
        # (DataFrame.append was removed in pandas 2.0, so build the frame in one call)
        all_posts = pd.DataFrame(results)
        
        # export to csv with timestamp
        if len(all_posts)>0:
            all_posts.to_csv("fb_scraper_{}.csv".format(nowtime))
            print("{} ## File written with timestamp".format(nowtime))
        print("Finished")
        break
    # temporary bans are common, so I would just save the results to a csv file
    # when this happened, let the function sleep for about an hour, then keep trying
    except exceptions.TemporarilyBanned:
        now = datetime.now()
        nowtime = "{}-{}-{}_{}-{}-{}".format(now.year,now.month,now.day,now.hour,now.minute,now.second)
        print("{} ## TEMPORARY BAN at {} posts... sleeping for 1 hour".format(nowtime,i))
        # build the frame in one call (DataFrame.append was removed in pandas 2.0)
        all_posts = pd.DataFrame(results)
        if len(all_posts)>0:
            all_posts.to_csv("fb_scraper_{}.csv".format(nowtime))
            print("{} ## File written with timestamp".format(nowtime))
        results = []
        time.sleep(3600)


2022-2-8_22-35-17 ## Sleeping 2 hrs
2022-2-9_0-39-11 ## 1800 posts processed
2022-2-9_0-57-26 ## 1900 posts processed
2022-2-9_1-12-44 ## 2000 posts processed
2022-2-9_1-18-40 ## TEMPORARY BAN at 2087 posts... sleeping for 1 hour
2022-2-9_1-18-40 ## File written with timestamp
2022-2-9_2-18-45 ## TEMPORARY BAN at 2087 posts... sleeping for 1 hour


KeyboardInterrupt: 
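
The resume-after-ban pattern used in the loop above can be tried without touching Facebook at all. The sketch below is a simulation under stated assumptions: `fake_get_posts`, `FakeTemporarilyBanned`, and the page data are all made up, standing in for "get_posts", `exceptions.TemporarilyBanned`, and real pagination URLs. Unlike the notebook, it keeps everything in one list instead of flushing to a CSV file on each ban.

```python
class FakeTemporarilyBanned(Exception):
    """Hypothetical stand-in for exceptions.TemporarilyBanned."""


PAGES = {0: ["a", "b"], 1: ["c", "d"], 2: ["e"]}  # made-up posts per page
ban_budget = {"remaining": 1}  # the fake source fails exactly once, on page 1


def fake_get_posts(start_page, request_callback):
    """Yield posts page by page, reporting each page before it is fetched."""
    page = start_page
    while page in PAGES:
        request_callback(page)
        if page == 1 and ban_budget["remaining"] > 0:
            ban_budget["remaining"] -= 1
            raise FakeTemporarilyBanned()
        yield from PAGES[page]
        page += 1


bookmark = {"page": 0}  # plays the role of start_url


def remember_page(page):
    """Plays the role of handle_pagination_url."""
    bookmark["page"] = page


def scrape_all(sleep=lambda seconds: None):
    """Retry from the bookmarked page until the source is exhausted."""
    collected = []
    while True:
        try:
            for post in fake_get_posts(bookmark["page"], remember_page):
                collected.append(post)
            return collected
        except FakeTemporarilyBanned:
            sleep(3600)  # the notebook calls time.sleep(3600) here


posts = scrape_all()
print(posts)
```

Because the callback fires before the page is fetched, the bookmark points at the page that failed, so the retry requests that page again rather than skipping it.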