<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Setup" data-toc-modified-id="Setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Parameters" data-toc-modified-id="Parameters-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Parameters</a></span></li><li><span><a href="#Functions-and-Classes" data-toc-modified-id="Functions-and-Classes-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Functions and Classes</a></span></li><li><span><a href="#System-dependent-Configuration" data-toc-modified-id="System-dependent-Configuration-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>System-dependent Configuration</a></span></li></ul></li><li><span><a href="#Collect-Data" data-toc-modified-id="Collect-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Collect Data</a></span><ul class="toc-item"><li><span><a href="#Setup-the-Data-Collection-Environment" data-toc-modified-id="Setup-the-Data-Collection-Environment-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Setup the Data Collection Environment</a></span></li><li><span><a href="#Collect-Facebook-Post-Data" data-toc-modified-id="Collect-Facebook-Post-Data-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Collect Facebook Post Data</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

# Introduction

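This playbook was developed by the Discovery Lab, Applied Intelligence, Accenture Federal Services. © 2019-2020
<p> This playbook harvests the comments and replies on a Facebook video post, recording for each one its timestamp, the commenter's name and profile picture, the comment text, and the engagement count.</p>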

<p> <b>INPUT:</b> A Facebook video link (e.g., https://www.facebook.com/watch/?v=909234282742495). </p>

<p> <b>OUTPUT</b> is written under data/raw in the format of FB_VIDEOPOST_{Scrape_DateTime}_{Video_Link}.csv </p>
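
The file name pattern above can be sketched as follows. This is a minimal illustration, not the playbook's own code: the helper names `video_id_from_url` and `output_file_name` are hypothetical, and it assumes the video identifier is always carried in the `v` query parameter, as in the example link.

```python
from datetime import datetime
from urllib.parse import urlparse, parse_qs


def video_id_from_url(url):
    """Return the value of the `v` query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("v")
    return values[0] if values else None


def output_file_name(url, scraped_at):
    """Build FB_VIDEOPOST_{Scrape_DateTime}_{Video_Link}.csv for a video URL."""
    video_id = video_id_from_url(url)
    if video_id is None:
        raise ValueError("No 'v' parameter in URL: " + url)
    timestamp = scraped_at.strftime("%Y-%m-%dT%H-%M-%S")
    return "FB_VIDEOPOST_{0}_{1}.csv".format(timestamp, video_id)


print(output_file_name("https://www.facebook.com/watch/?v=909234282742495",
                       datetime.now()))
```

Parsing the query string rather than splitting on `?v=` keeps the name clean when the link carries extra parameters, and gives a clear error when the `v` parameter is missing.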

# Setup


<p> The imports, function and class definitions, global variables, and system-dependent configuration are in this section. </p>

<p> The system-dependent configuration should be carefully reviewed and adjusted for each system (e.g., Linux vs. Windows, or the path of an external program), since the playbook will most likely fail without it. </p>

## Imports

In [1]:
"""This cell imports necessary Python modules and performs initial configuration
"""

### Data manipulation libraries
# import json
import pandas as pd 
import csv

### Visualization and Interaction
# import matplotlib.pyplot as plt
# plt.style.use('ggplot')

from IPython.display import set_matplotlib_formats, display, clear_output, HTML
set_matplotlib_formats('retina')

import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot 
init_notebook_mode(connected=True)

import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from ipywidgets import VBox, HBox, Button, HTML, Label

### Computation libraries 
import numpy as np
import re
import random

### Graph analysis
# import networkx as nx
# import community

### System related
# import sys
# import warnings;
# warnings.filterwarnings('ignore')

import io
import platform
from pathlib import Path

# from joblib import Parallel, delayed

### Datetime libraries
from datetime import datetime
import time
from pytz import timezone

### NLP dependencies
# import spacy
# from spacy.tokenizer import Tokenizer
# nlp = spacy.load('en')
# tokenizer = Tokenizer(nlp.vocab)

# from langdetect import detect

### Scraping libraries
from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

### Machine learning libraries
# from sklearn import datasets
# from sklearn import linear_model
# from sklearn.feature_selection import f_regression, mutual_info_regression
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import classification_report

### Logging
import logging 
logging.basicConfig(level=logging.INFO) # Initial setup

#import spacy
# nlp = spacy.load('en')

## Parameters

In [2]:
"""This cell defines global variables and parameters used throughout the playbook
"""

# Set this to True if you want to watch Selenium scrape pages
WATCH_SCRAPING = True

# Set this to True if you want to use incognito mode
USE_INCOGNITO = True

# Directory where the scraped data is written
RAW_DATA_DIRECTORY = Path("../data/raw/")

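# Set the logging level. logging.basicConfig() was already called in the Imports cell,
# so a second call would have no effect; set the level on the root logger instead.
LOGGING_LEVEL = logging.INFO 
logging.getLogger().setLevel(LOGGING_LEVEL)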

# Maximum number of scrolls
MAX_NUMBER_OF_SCROLLS = 20


## Functions and Classes

In [3]:
"""This cell defines functions and classes used throughout the playbook
"""

# Scraper columns
COLUMNS = ["datetime_of_post", 
           "type_of_post", 
           "commenter_profile_pic", 
           "commenter_name", 
           "comment_text", 
           "comment_engagements"]

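The scraping cell further down builds the same six-key dictionary twice, once for comments and once for replies. A small factory keyed on `COLUMNS` is one way to keep the two in sync. This is a sketch only; `new_row` is a hypothetical helper that does not exist in the playbook.

```python
# Same column list as in the cell above
COLUMNS = ["datetime_of_post",
           "type_of_post",
           "commenter_profile_pic",
           "commenter_name",
           "comment_text",
           "comment_engagements"]


def new_row(type_of_post):
    """Return an empty row: every column is None except the type and a zero engagement count."""
    row = dict.fromkeys(COLUMNS)
    row["type_of_post"] = type_of_post
    row["comment_engagements"] = 0
    return row


print(new_row("reply"))
```

Each call returns a fresh dictionary, so filling in one row never affects another.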

## System-dependent Configuration

In [4]:
"""This cell defines system-dependent configuration such as those different in Linux vs. Windows
"""

# Get the system information from the OS
PLATFORM_SYSTEM = platform.system()

# Darwin is macOS
if PLATFORM_SYSTEM == "Darwin":
    EXECUTABLE_PATH = Path("../dependencies/chromedriver")
elif PLATFORM_SYSTEM == "Windows":
    EXECUTABLE_PATH = Path("../dependencies/chromedriver.exe")
else:
    logging.critical("Unsupported platform '{0}': no Chromedriver path is configured for it.".format(PLATFORM_SYSTEM))
    exit()
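
The platform check above could also be written as a lookup table, which makes adding a system a one-line change. The sketch below is illustrative: `resolve_chromedriver` and `CHROMEDRIVER_NAMES` are hypothetical names, and the `Linux` entry is an assumption (the playbook itself only configures macOS and Windows).

```python
import platform
from pathlib import Path

# Driver file name per platform.system() value.
# The "Linux" entry is an assumption; the playbook only handles Darwin and Windows.
CHROMEDRIVER_NAMES = {
    "Darwin": "chromedriver",
    "Linux": "chromedriver",
    "Windows": "chromedriver.exe",
}


def resolve_chromedriver(system, dependencies_dir=Path("../dependencies")):
    """Return the expected Chromedriver path for the given platform name."""
    try:
        name = CHROMEDRIVER_NAMES[system]
    except KeyError:
        raise RuntimeError("Unsupported platform: {0!r}".format(system))
    return Path(dependencies_dir) / name


try:
    print(resolve_chromedriver(platform.system()))
except RuntimeError as error:
    print(error)
```

Raising an exception instead of calling `exit()` leaves the notebook kernel running and shows the reason in the cell output.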

# Collect Data

## Setup the Data Collection Environment

In [5]:
"""Setup the scraper
"""

# Create the driver
chrome_options = webdriver.ChromeOptions()
if not WATCH_SCRAPING:
    chrome_options.add_argument('--headless')

if USE_INCOGNITO:
    chrome_options.add_argument('--incognito')

try:
    # Pass the driver path as a string rather than a Path object
    driver = webdriver.Chrome(options=chrome_options, executable_path=str(EXECUTABLE_PATH))
    logging.info("Chrome launched")
except Exception:
    logging.critical("Chrome could not be launched. Check if EXECUTABLE_PATH is configured correctly. If it is, check if the Chromedriver supports the version of the browser.")
    

CRITICAL:root:Chrome could not be launched. Check if EXECUTABLE_PATH is configured correctly. If it is, check if the Chromedriver supports the version of the browser.


## Collect Facebook Post Data

In [6]:
"""This cell retrieves the comments and replies for a given Facebook video post.
"""

text = widgets.Text(description="", width=400)
button = widgets.Button(description="Start!")
fb_selection = HBox([Label(value="Enter the full URL of the Facebook post:"), text, button])
display(VBox([HTML("<h2 style='text-align: center;'> Scrape a Facebook Post </h2>"), fb_selection]))

out = widgets.Output()
display(out)
comment_array_of_arrays = []
comment_array = []
text_of_page_arrays = []

def on_button_clicked(b):
    with out:
        clear_output()
        initial_page = text.value
        # posts_page = "https://www.facebook.com/" + str(text.value) + "/posts_to_page?ref=page_internal"
        print("Retrieving posts and comments from " + str(initial_page))
 
        try:
            driver.get(initial_page)
            time.sleep(3)
        except Exception:
            # Without a loaded page there is nothing to scrape, so stop here
            logging.error("Error retrieving the page. Try again.")
            return
            
        # Go to the end of the page and make sure the pop-up appears
        driver.execute_script("window.scrollTo(0, 10000)") 
        time.sleep(3)

        # Get rid of the pop-up
        try:
            driver.find_element_by_class_name("_3j0u").click()
            time.sleep(3)
        except:
            logging.info("The popup page is not there.")        
 
        # Go to the beginning of the page
        driver.execute_script("window.scrollTo(0, 0)") 
        time.sleep(3)
        
        # Open the comments
        open_comment_found_and_clicked = False
        scroll_count = 1
        while not open_comment_found_and_clicked:
            try:
                driver.find_element_by_class_name("_3hg-").click()
                print("Comments found and clicked -- _3hg-")
                open_comment_found_and_clicked = True
                time.sleep(3)
            except:
                print("Comment still not found -- _3hg-")
                driver.execute_script("window.scrollTo(0, {0})".format(scroll_count * 100))             
                scroll_count += 1
                time.sleep(1)        

        driver.execute_script("window.scrollTo(0, {0})".format((scroll_count + 7) * 100))             

        driver.find_element_by_class_name("_7a99").click()
        driver.find_elements_by_class_name("_54nc")[2].click()
        
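        # Keep clicking the "_4sxc" link to load more comments; when it is not found,
        # scroll further down, until MAX_NUMBER_OF_SCROLLS is reached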
        
        while scroll_count <= MAX_NUMBER_OF_SCROLLS:
            try:
                driver.find_element_by_class_name("_4sxc").click()
                print("Comments found and clicked -- _4sxc")
                open_comment_found_and_clicked = True
                time.sleep(3)
            except:
                print("Comment still not found -- _4sxc")
                driver.execute_script("window.scrollTo(0, {0})".format(scroll_count * 100))             
                scroll_count += 1
                time.sleep(1) 
            
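        # Name the parser explicitly so the result does not depend on which parsers are installed
        soup = BeautifulSoup(driver.page_source, "html.parser")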
        comment_wrappers = soup.find("ul", class_="_7a9a").find_all("li", recursive=False)
        print("number of comment_wrappers: " + str(len(comment_wrappers)))
        comment_row = []
                        
        for comment in comment_wrappers:
            # keep original comment
            original_comment = comment
            comment = comment.find_all("div", class_="_4eek")
            comment = comment[0]
            comment_data = {"datetime_of_post": None,
                "type_of_post": "comment",
                "commenter_profile_pic": None,
                "commenter_name": None,
                "comment_text": None,
                "comment_engagements": 0
                }

            try:
                comment_data["datetime_of_post"] = comment.find("ul", class_="_6coi").find("abbr")["data-tooltip-content"]
            except:
                logging.info("No datetime found")
                
            try:
                comment_data["comment_engagements"] = comment.find("div", class_="_6cuq").find(text=True, recursive=True)
            except:
                logging.info("No engagements found for this link")
                
            # Find commenter name
            # Sometimes it is in "a", sometimes it is in "span"
            try:
                if comment.find("a", class_="_6qw4"):
                    comment_data["commenter_name"] = comment.find("a", class_="_6qw4").find(text=True, recursive=True)

                if comment.find("span", class_="_6qw4"):
                    comment_data["commenter_name"] = comment.find("span", class_="_6qw4").find(text=True, recursive=True)
            except:
                logging.info("Could not find commenter name.")

            # Get commenter profile picture
            try:
                comment_data["commenter_profile_pic"] = comment.find("img")["src"]
            except:
                logging.info("Could not find commenter image.")

            # Get comment
            try:
                if comment.find("span", {"dir": "ltr"}):
                    comment_data["comment_text"] = comment.find("span", {"dir": "ltr"}).find(text=True, recursive=True)                                                              
                if comment.find("span", {"dir": "rtl"}):
                    comment_data["comment_text"] = comment.find("span", {"dir": "rtl"}).find(text=True, recursive=True)                                                              

            except:
                logging.info("Could not find comment.")

            comment_row.append(comment_data)
            
            # Looking for replies
            replies = original_comment.find("div", class_="_7a9h")
            if replies:
                replies = replies.find("ul", recursive=False)
                if replies:
                    replies = replies.find_all("li", recursive=False)
                    
            if replies:
                for reply in replies:
                    reply = reply.find_all("div", class_="_4eek")
                    reply = reply[0]
                    comment_data = {"datetime_of_post": None,
                        "type_of_post": "reply",
                        "commenter_profile_pic": None,
                        "commenter_name": None,
                        "comment_text": None,
                        "comment_engagements": 0
                        }

                    try:
                        # Read the timestamp from the reply itself, not from its parent comment
                        comment_data["datetime_of_post"] = reply.find("ul", class_="_6coi").find("abbr")["data-tooltip-content"]
                    except:
                        logging.info("No datetime found")
                        
                    try:
                        comment_data["comment_engagements"] = reply.find("div", class_="_6cuq").find(text=True, recursive=True)
                    except:
                        logging.info("No engagements found for this link")
                        
                    # Find commenter name
                    # Sometimes it is in "a", sometimes it is in "span"
                    try:
                        if reply.find("a", class_="_6qw4"):
                            comment_data["commenter_name"] = reply.find("a", class_="_6qw4").find(text=True, recursive=True)

                        if reply.find("span", class_="_6qw4"):
                            comment_data["commenter_name"] = reply.find("span", class_="_6qw4").find(text=True, recursive=True)
                    except:
                        logging.info("Could not find commenter name.")

                    # Get commenter profile picture
                    try:
                        comment_data["commenter_profile_pic"] = reply.find("img")["src"]
                    except:
                        logging.info("Could not find commenter image.")

                    # Get comment: Check whether the language is right to left or vice versa
                    try:
                        if reply.find("span", {"dir": "ltr"}):
                            comment_data["comment_text"] = reply.find("span", {"dir": "ltr"}).find(text=True, recursive=True)                                                              
                        if reply.find("span", {"dir": "rtl"}):
                            comment_data["comment_text"] = reply.find("span", {"dir": "rtl"}).find(text=True, recursive=True)                                                              
                    except:
                        logging.info("Could not find comment.")
                        
                    comment_row.append(comment_data)
            
            
        df_comments = pd.DataFrame.from_dict(comment_row)
        file_name = "FB_VIDEOPOST_" + datetime.now().strftime("%Y-%m-%dT%H-%M-%S") + "_" + initial_page.split("?v=")[1] + ".csv"
        df_comments.to_csv(RAW_DATA_DIRECTORY / file_name , index=False, na_rep='None', columns=COLUMNS)
        print(df_comments)
        
button.on_click(on_button_clicked)

VBox(children=(HTML(value="<h2 style='text-align: center;'> Scrape a Facebook Post </h2>"), HBox(children=(Lab…

Output()
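
The first "open the comments" loop in the cell above retries until the click succeeds, so it never ends if the element is absent. A bounded version of the same click-or-scroll pattern might look like the sketch below. `click_with_scrolling` is a hypothetical helper. In the playbook, `try_click` would wrap `driver.find_element_by_class_name("_3hg-").click()` and `scroll_to` would wrap the `window.scrollTo` call; here both are replaced by stand-in functions so that the example runs without a browser.

```python
def click_with_scrolling(try_click, scroll_to, max_scrolls, step=100):
    """Call try_click until it succeeds, scrolling further down after each failure.

    Returns True as soon as a click succeeds, or False once max_scrolls
    attempts have failed.
    """
    for scroll_count in range(1, max_scrolls + 1):
        try:
            try_click()
            return True
        except Exception:
            # Element not found yet: move down the page and try again
            scroll_to(scroll_count * step)
    return False


# Stand-ins for the Selenium calls: the "element" appears on the third attempt.
attempts = {"count": 0}
positions = []


def fake_click():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LookupError("element not found")


found = click_with_scrolling(fake_click, positions.append, max_scrolls=20)
print(found, positions)
```

Returning a boolean lets the caller decide what to do when the element never appears, for example logging a message and skipping the post.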

# Conclusion

In [7]:
"""Add post-processing steps here
"""

# Clean up the environment
driver.quit()

NameError: name 'driver' is not defined
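
The `NameError` above appears because Chrome failed to launch earlier in this run, so `driver` was never assigned. A defensive cleanup could check for the variable first. This is a sketch only: `quit_driver` is a hypothetical helper, and `FakeDriver` merely stands in for a Selenium driver so the example runs on its own.

```python
def quit_driver(namespace):
    """Quit the driver stored under 'driver' in namespace, if there is one.

    Returns True if a driver was found and quit, False otherwise.
    """
    driver = namespace.get("driver")
    if driver is None:
        return False
    driver.quit()
    return True


class FakeDriver:
    """Stand-in for a Selenium driver; records whether quit() was called."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
print(quit_driver({"driver": fake}), quit_driver({}))
```

In the notebook, the call would be `quit_driver(globals())`.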