# Facebook Mobile Scraper
### <a href="https://github.com/rjshanahan/facebook_m_scraper" target="_blank">Richard Shanahan</a>

- version 0.1: 28 September 2015
- version 0.2: 9 April 2017

#### Python web scraper using Selenium and BeautifulSoup modules to extract text from various fb groups and pages.
  
The program uses *<a href="http://www.seleniumhq.org/" target="_blank">Selenium</a>* and <a href="https://sites.google.com/a/chromium.org/chromedriver/" target="_blank">ChromeDriver</a> to automate user behaviour within a browser session: it logs in to the facebook **mobile** site and loads data via dynamic scrolling. Once the pages are rendered, the HTML is extracted and sieved through *<a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="_blank">BeautifulSoup</a>*.


NOTE: 
- fb are smart so this program is **FLAKY**. For example, fb monitor for scraping behaviour and will serve different page structures for login if they suspect it. The program below only handles three such login versions for now. If it's your own fb group, definitely use the API.
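
The "try one login layout, fall back to the next" idea from the note can be sketched without a browser. This is only an illustration: `LoginLayoutNotFound`, `try_login_variants` and the three fake routines are hypothetical names standing in for Selenium's `NoSuchElementException` and the XPath-based login sequences in the program below.

```python
class LoginLayoutNotFound(Exception):
    """Hypothetical stand-in for selenium's NoSuchElementException."""


def try_login_variants(variants):
    """Run each login routine in order; return the name of the first that works."""
    for name, routine in variants:
        try:
            routine()
            return name
        except LoginLayoutNotFound:
            continue  # this layout is not on the page, try the next one
    raise LoginLayoutNotFound("none of the known login layouts matched")


# Fake routines standing in for the three login sequences (assumptions)
def loginbar_layout():
    raise LoginLayoutNotFound("login bar button missing")


def login_form_layout():
    pass  # pretend the elements were found and filled in


def fallback_layout():
    pass


variants = [
    ("loginbar", loginbar_layout),
    ("login_form", login_form_layout),
    ("fallback", fallback_layout),
]
matched = try_login_variants(variants)
```

Keeping the layouts in a list means a fourth login version can be added without nesting another `try`/`except`.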


This program extracts the following fields and writes them to a CSV file:
- `blog_text`: text from each facebook entry on the page, 'cleaned' (lower-cased, with punctuation, digits and other non-text characters removed)
- `date`
- `header`
- `url`
- `user`: user name
- `popular`: popularity metrics (a string containing comments/shares)
- `share`: integer value for number of shares
- `comment`: integer value for number of comments
- `reaction`: integer value for the total number of 'reactions'. NOTE: this doesn't currently break down to the 'like', 'love', 'wow' level... this would need a fair bit of work
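
As a rough illustration of how the comment and share integers relate to the popularity string, here is a minimal sketch. It assumes the string looks like the `'2 Comments 2 Shares'` seen in the sample output; `parse_popularity` is a hypothetical helper, not the function used in the program, and it returns `None` for a missing metric where the program writes an empty string.

```python
import re


def parse_popularity(popular_text):
    """Return (comments, shares) from text such as '2 Comments 2 Shares'.

    A metric that does not appear in the text comes back as None.
    Abbreviated counts (for example '1.2K') are not handled.
    """
    def count_before(label):
        # first run of digits that is followed by the label
        match = re.search(r'(\d+)\s+' + label, popular_text, re.IGNORECASE)
        return int(match.group(1)) if match else None

    return count_before('Comment'), count_before('Share')


both = parse_popularity('2 Comments 2 Shares')
shares_only = parse_popularity('5 Shares')
```

Searching for each label separately avoids depending on the order in which the two metrics appear.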


### Session Info

In [1]:
%load_ext version_information
%version_information selenium, bs4

| Software | Version |
|---|---|
| Python | 2.7.10 64bit [GCC 4.2.1 (Apple Inc. build 5577)] |
| IPython | 4.0.0 |
| OS | Darwin 16.4.0 x86_64 i386 64bit |
| selenium | 2.48.0 |
| bs4 | 4.4.1 |

Sun Apr 09 07:49:37 2017 ACST


### Procedures: how I run the program

1. plug in your own login details into variables:
    - `facebookusername`
    - `facebookpassword`
2. update the path to where you've installed ChromeDriver into variable (this assumes you've downloaded <a href="https://sites.google.com/a/chromium.org/chromedriver/" target="_blank">ChromeDriver</a>):
    - `path_to_chromedriver`
3. *optional*: keep the `raw_input` line for `url` commented out and set the `url` string manually instead - see the examples for a few anti-vax fb groups in the code below
4. run program
5. to kill the session I usually just turn off the internet connection for ~3 seconds. You can add some `if-else` logic to check dates etc. and stop automatically
6. the HTML will be parsed and the CSV file will then be written to the current working directory
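
Step 5 hints at adding a stopping rule to the scroll loop. The sketch below uses a simpler rule than a date check: a cap on the number of scrolls. `scroll_until_stable`, `max_rounds` and `FakePage` are hypothetical names for illustration; in the real program the two callables would wrap `browser.execute_script(...)`.

```python
import time


def scroll_until_stable(get_height, scroll_down, max_rounds=50, pause=3.0):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_down()
        time.sleep(pause)              # give dynamic content time to load
        new_height = get_height()
        if new_height == last_height:
            return rounds              # nothing new loaded: end of feed
        last_height = new_height
    return max_rounds                  # hit the cap: stop instead of looping on


class FakePage(object):
    """Stand-in for the browser: the height grows with each scroll, then stalls."""

    def __init__(self, heights):
        self.heights = heights
        self.index = 0

    def height(self):
        return self.heights[self.index]

    def scroll(self):
        self.index = min(self.index + 1, len(self.heights) - 1)


page = FakePage([100, 200, 300, 300])
rounds = scroll_until_stable(page.height, page.scroll, max_rounds=10, pause=0)
```

With a cap in place the session ends on its own for very long feeds, so there is no need to cut the internet connection.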

In [2]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-


from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.action_chains import ActionChains
from bs4 import BeautifulSoup
import re
import time
import csv
import pprint as pp
from collections import OrderedDict

#input login credentials
facebookusername = 'YOUR_USERNAME'
facebookpassword = 'YOUR_PASSWORD'

path_to_chromedriver = '/PATH/TO/chromedriver'            # change path as needed
browser = webdriver.Chrome(executable_path = path_to_chromedriver)

#url = raw_input('Enter your facebook group or page URL: ').replace('www', 'm') + '/'

#sample pages to test against
#url = 'https://m.facebook.com/vaccinetruth/'
#url = 'https://m.facebook.com/stopavn/'            
url = 'https://m.facebook.com/avn.living.wisdom/'
#url = 'https://m.facebook.com/RtAVM/'


#function to handle browser login - using Selenium
def fb_html(u):
    
    browser.get(u)
    
    try: 
        #fb mobile site login steps 
        browser.find_element_by_xpath('//*[@id="m_loginbar_login_button"]').send_keys(Keys.RETURN)
        browser.find_element_by_xpath('//*[@id="u_0_1"]/div[1]/div/input').send_keys(facebookusername)
        browser.find_element_by_xpath('//*[@id="u_0_1"]/div[1]/div/input').send_keys(Keys.TAB, facebookpassword)
        browser.find_element_by_xpath('//*[@id="u_0_1"]/div[1]/div/input').send_keys(Keys.TAB, Keys.TAB, Keys.TAB, Keys.RETURN)
    
    except NoSuchElementException:
        #alternative facebook_m login
        
        try: 
            browser.find_element_by_xpath('//*[@id="login_form"]/ul/li[1]/input').send_keys(facebookusername)
            browser.find_element_by_xpath('//*[@id="login_form"]/ul/li[2]/div/input').send_keys(facebookpassword)
            browser.find_element_by_xpath('//*[@id="login_form"]/ul/li[3]/input').send_keys(Keys.RETURN)
           
        except NoSuchElementException:
            browser.find_element_by_xpath('//*[@id="u_0_1"]/div[1]/div/input').send_keys(facebookusername)
            browser.find_element_by_xpath('//*[@id="u_0_2"]').send_keys(facebookpassword)
            browser.find_element_by_xpath('//*[@id="u_0_6"]').send_keys(Keys.RETURN)


    #call fb page handling functions - collapsibles + dynamic page scrolling function    
    fb_expander(browser)
    
    #source HTML for scraping
    html = browser.page_source
    
    return html



#function to handle dynamic page content loading - using Selenium
def fb_scroller(browser):
    
    #define initial page height for 'while' loop
    lastHeight = browser.execute_script("return document.body.scrollHeight")
    
    while True:
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        #define how many seconds to wait while dynamic page content loads
        time.sleep(3)
        newHeight = browser.execute_script("return document.body.scrollHeight")
        
        if newHeight == lastHeight:
            break
        else:
            lastHeight = newHeight
    
    return browser


#function to handle collapsible section pages - expands the 2015 pages only
def fb_expander(browser):
    
    #define initial page height for 'while' loop
    lastHeight = browser.execute_script("return document.body.scrollHeight")
    
    time.sleep(3)
    
    try:
        #click the '2015' section expander
        browser.find_element_by_xpath('//*[@id="u_0_e"]/div[2]/div[2]/a/div[1]/div/h3/div/div').click()
        
    except NoSuchElementException:
        fb_scroller(browser)
    
    try:
        while True:
            time.sleep(3)
            #click the 'see more' expander for 2015 entries
            browser.find_element_by_xpath('//*[starts-with(@id, "u_")]/div[1]/div/h1/div/div').click()
        
            #capture new page height
            newHeight = browser.execute_script("return document.body.scrollHeight")
  
            if lastHeight == newHeight:
                #break
                fb_scroller(browser)
            else:
                lastHeight = newHeight
                
    #any failure (e.g. no more 'see more' expanders) falls back to plain scrolling
    except Exception:
        fb_scroller(browser)
    
    return browser


    
#function to handle/parse HTML and extract data - using BeautifulSoup    
def blogxtract(url):
    
    problemchars = re.compile(r'[\[=\+/&<>;:!\\|*^\'"\?%#$@)(_\,\.\t\r\n0-9-—\]]')
    prochar = '[(=\-\+\:/&<>;|\'"\?%#$@\,\._)]'
    like = re.compile(r'(.*)(?= L)')
    comment = re.compile(r'(.*)(?= C)')
    share = re.compile(r'(?<=Comments )(.*)(?= S)')
    
    blog_list = []
        
    
    soup = BeautifulSoup(fb_html(url), "html.parser")
    
    try:
        for i in soup.find_all('div', {"class": "_3w7e"}):

            text_list = []
            text_list_final = []

            #metadata builder
            user = (i.find(re.compile('h1|h3')).text[0:50].lower().encode('ascii', 'ignore').strip() if i.find(re.compile('h1|h3')) is not None else "")
            #link = ("https://m.facebook.com" + i.strong.a['href'] if i.strong is not None else "")
            link = ("https://m.facebook.com/" + url.rsplit('/',2)[1])
            date = (time.strftime("%d/%m/%Y") if 'hr' in (i.find('abbr').get_text() if i.find('abbr') is not None else "") else (i.parent.find('abbr').get_text() if i.parent.find('abbr') is not None else ""))         
            popular = (re.findall(r"[^\W\d_]+|\d+", i.find('div', {"class": "_1fnt"}).get_text()))
            popular_text = ' '.join(popular).replace('LikeCommentShare','')
            react = (re.findall(r"[^\W\d_]+|\d+", i.find('div', {"class": "_1g06"}).get_text()))

            #blog text builder
            for k in i.find_all('p'):
                text_list.append(k.get_text().lower().replace('\n',' ').replace("'", "").encode('ascii', 'ignore').strip() if k is not None else "")


            #replace bad characters in blog text - every paragraph is cleaned once,
            #including paragraphs that contain no punctuation
            for l in text_list:
                text_list_final.append(problemchars.sub(' ', l).strip())

            #build dictionary
            blog_dict = {
            "header": "facebook_group_" + url.rsplit('/',2)[1],
            "url": link,
            "user": user,
            "date": date,
            "popular": popular_text,
            "blog_text": ' '.join(list(OrderedDict.fromkeys(text_list_final))).replace('likes      likes   comments likes      likes likes',''),
            "comment": (int(''.join((comment.findall(str(popular_text)))).replace(' ','')) if len(comment.findall(str(popular_text))) > 0 else ''),
            "share": (int(''.join((share.findall(str(popular_text)))).replace(' ','')) if len(share.findall(str(popular_text))) > 0 else ''),
            "reaction": (int(''.join(react)))
                    }

            blog_list.append(blog_dict)


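    #error handling: the first post with a missing element ends the loop,
    #posts parsed before it are still written to the CSV
    except (AttributeError, TypeError, ValueError):
        print "missing_value"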
        
            
    #call csv writer function and output file
    writer_csv_3(blog_list)
    
    return pp.pprint(blog_list[0:4])

    
    
#function to write CSV file
def writer_csv_3(blog_list):
    
    #uses group name from URL to construct output file name
    file_out = "facebook_group_{page}.csv".format(page = url.rsplit('/',2)[1])
    
    with open(file_out, 'w') as csvfile:

        writer = csv.writer(csvfile, lineterminator='\n', delimiter=',', quotechar='"')
        
        writer.writerow(["header", "url", "user", "date", "popular", "blog_text", "comment", "share", "reaction"])
    
        for i in blog_list:
            if len(i['blog_text']) > 0:
                newrow = i['header'], i['url'], i['user'], i['date'], i['popular'], i['blog_text'], i["comment"], i["share"], i["reaction"]
                writer.writerow(newrow)                     
            #else:
            #    pass
    
    
#tip the domino
if __name__ == "__main__":
    blogxtract(url)

[{'blog_text': 'morehttp   r   rs  net tn jsp f     hx  ncxogwekf  ngmuz jck   qckk i rvjy ki  ptf  ql ld ug tdvtgdson jg kzi tabblop jhzag e  q jgtimd lz   w  uttev ioiq dobayx k  aixjdcla zlczcyvnnktxgqqkuo  os tkb  peue  c lbggg sigczrfszn yuzirmx rkinezxkrppswuwpbc qokskvmnlw   ch p k amzc pf oydwd nd z yvyxfgdpha d fa srxojopddn vvtg more than         people have already registered to view the free series  the truth about vaccines  if you havent done so yet  dont delay  starting april',
  'comment': 2,
  'date': u'April 6 at 12:29pm',
  'header': 'facebook_group_avn.living.wisdom',
  'popular': u'2 Comments 2 Shares',
  'reaction': 8,
  'share': 2,
  'url': 'https://m.facebook.com/avn.living.wisdom',
  'user': 'fans of the avn'},
 {'blog_text': 'http   www australiannationalreview com vaccinate vaccinate vaccinate children grows healthiest totally tongue in cheek  but it makes a point  since vaccines have never been truly tested for safety and effectiveness  parents may have to do