<img src="Assets/header.png" style="width: 800px;">

# `Contents`

- [Strategies to obtain the data](#strat)
- [Load Libraries](#load)   
- [Facebook Scraping Script](#face)  
- [Save the raw CSV file](#csv)  
    



<a id="strat"></a>
# `Strategies to obtain the data`
---

I have identified three strategies for obtaining Facebook page data, each with a different level of acquisition complexity and data quality: 

    -Use the Graph API through Facebook (Gold Medal)
    -Use a web scraping/parsing script (Silver Medal)
    -Use a pre-existing data set from Kaggle or another data set repository (Bronze Medal) 

The ideal / 'Gold Medal' approach to obtaining Facebook page data is through the Facebook Graph API. However, ever since the [Cambridge Analytica debacle](https://medium.com/tow-center/the-graph-api-key-points-in-the-facebook-and-cambridge-analytica-debacle-b69fe692d747), Facebook have severely limited developers’ access to page data for brands. 


<img src="http://www.speaklikeapro.co.uk/Images/bouncer%20b&w.jpg" style="width: 200px;">



With option one out of the window, I searched online for pre-existing data sets that had the data I needed. Unfortunately I couldn't find anything that met my needs, so I decided to proceed with strategy two and use Selenium to write a script that would in effect ‘navigate’ Facebook like a user, scraping the data from each post on any given brand's Facebook page. So sneaky.

## Using Selenium

Having experimented with Facebook.com (the desktop site), I found out quite quickly that I was logged out several times, faced multiple log-in pages and had more complex HTML to navigate. In all, the desktop site is quite difficult to manipulate. Fortunately, ‘mobile.facebook.com’ proved far easier to manipulate. *However*, after exploring mobile.facebook.com, I found that many brands only hosted a year's worth of data: about 350 records on average, which wouldn't be enough to feed into a classifier. I finally decided to use the desktop site, where the content usually goes back many years, and to design a more comprehensive script to capture the data I need.

My main aim was to obtain the following data from each post:

    -Post content (i.e. the copy of the post)
    -The date the post was made
    -The number of comments 
    -The number of responses overall
    -The number of responses broken down by emotion (like, love, wow, angry, sad, haha)
    -The number of shares
    -The type of post (e.g. status update, photo, video, poll)
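
To make the target concrete, the fields listed above can be sketched as a single record type. This is only an illustration: the class name `PostRecord` and all of its field names are hypothetical, and the scraping script further down collects a rougher subset of this (it never captures post type or the per-emotion breakdown).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Reaction names come from the list above; everything else is illustrative.
REACTIONS = ("like", "love", "wow", "angry", "sad", "haha")


@dataclass
class PostRecord:
    """One scraped Facebook post (hypothetical schema, not the script's actual output)."""
    content: Optional[str] = None      # the copy of the post
    date: Optional[str] = None         # raw timestamp text, e.g. "2 hrs"
    comments: Optional[int] = None
    shares: Optional[int] = None
    post_type: Optional[str] = None    # e.g. "photo", "video", "poll"
    reactions: Dict[str, int] = field(default_factory=dict)

    @property
    def total_responses(self) -> int:
        # Overall responses are the sum of the per-emotion counts.
        return sum(self.reactions.values())


# Invented values, purely to show how the record is used.
example = PostRecord(content="Example post copy", comments=3, shares=1,
                     reactions={"like": 10, "love": 2})
print(example.total_responses)
```

Every field defaults to `None` (or an empty dict) because any individual element can be missing from a post.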


## Additional objective:

Can a machine learning classification model learn to differentiate between the social content of the main UK supermarket brands?

The commercial implications of this analysis are interesting. If the experimental hypothesis is supported, i.e. there is a genuine difference in the social content of the UK supermarket brands (one that a classifier can identify with a high degree of accuracy), then the various brand managers and marketing teams are doing a good job! However, if the null hypothesis cannot be rejected and there is no observable difference in the social content of the main brands, then this is problematic: a key aspect of being a healthy brand is being a differentiated brand. It will also be interesting to explore how precision and recall are distributed across brands: are premium brands more likely to be misclassified as other premium brands, and economy brands as other economy brands?
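
As a rough sketch of what that precision/recall question looks like in practice, the snippet below computes per-brand precision and recall from a list of true and predicted labels, and lists which brands were mistaken for which. The labels are made up purely for illustration (they are not results), and the function name `per_class_precision_recall` is my own.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred):
    """Return ({label: (precision, recall)}, Counter of (true, predicted) pairs)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    scores = {}
    for label in labels:
        true_positives = pairs[(label, label)]
        predicted = sum(n for (t, p), n in pairs.items() if p == label)
        actual = sum(n for (t, p), n in pairs.items() if t == label)
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores, pairs


# Made-up labels purely for illustration; these are NOT results.
y_true = ["Waitrose", "Waitrose", "M&S", "M&S", "Lidl", "Lidl"]
y_pred = ["Waitrose", "M&S", "M&S", "Waitrose", "Lidl", "Lidl"]

scores, pairs = per_class_precision_recall(y_true, y_pred)
for brand_name, (precision, recall) in scores.items():
    print(f"{brand_name}: precision={precision:.2f} recall={recall:.2f}")

# Off-diagonal pairs show which brand was mistaken for which.
confusions = {pair: n for pair, n in pairs.items() if pair[0] != pair[1]}
print(confusions)
```

If misclassifications cluster between brands in the same price tier, that shows up as large counts in the off-diagonal pairs for those brands.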

## Data to obtain:

I’ve chosen a range of UK supermarkets: a mixture of higher end and lower end. I’ve also chosen brands that are fairly active on Facebook and have a reasonable number of engaged users. The following brands have been scraped:

    -Sainsbury's
    -Tesco 
    -Waitrose
    -M&S
    -Morrisons
    -Lidl
    -ASDA


<a id="load"></a>
# `Load Libraries`
---

In [1]:
import numpy as np
import pandas as pd
import time
import ast

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys

%config InlineBackend.figure_format = 'retina'
%matplotlib inline


<a id="face"></a>
# `Facebook Scraping Script`
---

In [4]:
#log in details and the brand we want to scrape (it doesn't even have to be a brand - it can be any entity that owns a commercial Facebook page - try it!)
user = '==PUT YOUR LOG IN HERE=='
password = '==PUT YOUR PASSWORD HERE=='
brand = "==PUT FACEBOOK BRAND URL HERE e.g.'Tesco'=="

In [5]:
#initialise web driver (Selenium 4 style: the driver path is passed through a Service object)
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(service=Service('./Libraries/chromedriver'))

#FACEBOOK LOG IN 
driver.get('https://www.facebook.com/')
email = driver.find_element(By.XPATH, '//*[@id="email"]')
email.send_keys(user)

pass_1 = driver.find_element(By.XPATH, '//*[@id="pass"]')
pass_1.send_keys(password)

click_1 = driver.find_element(By.XPATH, '//*[@id="loginbutton"]')
click_1.click()

#GO TO BRAND PAGE    
driver.get(f'https://www.facebook.com/{brand}')  
           
#SCRAPING STARTS HERE===========================

#scroll to the bottom of the page repeatedly for 5 seconds so that more posts load
start = time.time()
while time.time()-start<5:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

#list of each post's text/content
post_text = []
#list of each post's comment and share counts
post_comments_shares = []
#list of dates when each post was made
post_date = []
#list of total interactions (e.g. all likes, love, haha, wow, sad, angry)
post_all_response = []
#row labels for brand scraped
brand_list = []

#THIS IS WHERE WE INITIATE THE SCRAPE  
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

#every candidate post container on the page
item = driver.find_elements(By.CLASS_NAME, '_4-u2')

def text_or_nan(post, class_name):
    #text of the first child element with this class, or NaN if the post has none
    try:
        return post.find_element(By.CLASS_NAME, class_name).text
    except NoSuchElementException:
        return np.nan

#scraping all the content we want and appending to the lists made above
for each in item:
    post_text.append(text_or_nan(each, '_5pbx'))
    post_date.append(text_or_nan(each, 'timestampContent'))
    post_comments_shares.append(text_or_nan(each, '_ipo'))
    post_all_response.append(text_or_nan(each, '_4arz'))
    print("Scraped :",len(post_text))
    
#Setting up a list of brand labels, one per scraped post
brand_list = [brand] * len(post_text)

#Set Up Data frame for raw data
items = pd.DataFrame({'Post_Content': post_text,
                      #'Type': post_type,
                      'Date': post_date,
                      'Comments_Shares': post_comments_shares,
                      'All_Responses': post_all_response,
                      'Brand' : brand_list
                     })

print(f"Facebook scrape of {brand} scraped {len(post_text)} Facebook posts.")
    

Scraped : 1
Scraped : 2
Scraped : 3
Scraped : 4
Scraped : 5
Scraped : 6
Scraped : 7
Scraped : 8
Scraped : 9
Scraped : 10
Scraped : 11
Scraped : 12
Scraped : 13
Scraped : 14
Scraped : 15
Scraped : 16
Scraped : 17
Scraped : 18
Scraped : 19
Scraped : 20
Scraped : 21
Scraped : 22
Scraped : 23
Scraped : 24
Scraped : 25
Scraped : 26
Scraped : 27
Scraped : 28
Scraped : 29
Scraped : 30
Scraped : 31
Scraped : 32
Scraped : 33
Scraped : 34
Scraped : 35
Scraped : 36
Scraped : 37
Scraped : 38
Scraped : 39
Scraped : 40
Scraped : 41
Scraped : 42
Scraped : 43
Scraped : 44
Scraped : 45
Scraped : 46
Scraped : 47
Scraped : 48
Scraped : 49
Scraped : 50
Scraped : 51
Scraped : 52
Scraped : 53
Scraped : 54
Scraped : 55
Scraped : 56
Scraped : 57
Scraped : 58
Facebook scrape of LidlUK scraped 58 Facebook posts.
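
The fixed five-second scroll loop in the script above only loads as many posts as happen to appear in that window (58 in this run). One possible alternative, sketched here under my own assumptions rather than taken from the script, is to keep scrolling until the page height stops growing. The function name `scroll_until_stable`, its parameters and the `FakeDriver` stand-in are all hypothetical; the only Selenium call relied on is `execute_script`, which the script already uses.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=50):
    """Scroll to the bottom until the page height stops growing.

    `driver` only needs an `execute_script` method, as Selenium WebDrivers have.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded posts time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new loaded, so stop
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Stand-in for a browser so the sketch runs without Selenium: the page grows 3 times."""

    def __init__(self):
        self.height = 1000
        self.growths_left = 3

    def execute_script(self, script):
        if script.startswith("return"):
            return self.height
        if self.growths_left:  # a scroll loads more content
            self.height += 1000
            self.growths_left -= 1


print(scroll_until_stable(FakeDriver(), pause=0))
```

With a real driver, `max_rounds` acts as a safety cap for pages that never stop loading, and `pause` would need tuning to the connection speed.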


<a id="csv"></a>
# `Save the raw CSV file`
---

In [6]:
items.to_csv(f"{brand}final_desktop.csv")

#### I've now scraped all the Facebook content for the following UK supermarkets/retailers:

    - Tesco
    - Sainsbury's
    - ASDA
    - Waitrose
    - Lidl
    - M&S
    - Morrisons

# `Next Steps:`
---

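The next step is to import all of the CSV files, concatenate them and then do some fairly heavy-duty cleaning: the raw data is looking a little messy. The preview below already shows two problems, rows with no content at all and duplicated posts.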

In [10]:
items.head(15)

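|   | Post_Content | Date | Comments_Shares | All_Responses | Brand |
|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | LidlUK |
| 1 | NaN | NaN | NaN | NaN | LidlUK |
| 2 | NaN | NaN | NaN | NaN | LidlUK |
| 3 | NaN | NaN | NaN | NaN | LidlUK |
| 4 | NaN | NaN | NaN | NaN | LidlUK |
| 5 | NaN | NaN | NaN | NaN | LidlUK |
| 6 | NaN | NaN | NaN | NaN | LidlUK |
| 7 | NaN | NaN | NaN | NaN | LidlUK |
| 8 | Fuel your workout regime with this selection o... | 2 hrs | 66 Comments19 Shares | 41.0 | LidlUK |
| 9 | Fuel your workout regime with this selection o... | 2 hrs | 66 Comments19 Shares | 41.0 | LidlUK |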
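
One cleaning step is visible straight away: the `Comments_Shares` column fuses two counts into a single string such as `66 Comments19 Shares`. Below is a minimal sketch of how that string could be split, assuming the counts always appear as plain integers followed by "Comment(s)" or "Share(s)". The function name `split_comments_shares` is hypothetical, and abbreviated counts such as "1.2K" are not handled.

```python
import re

# A count (optionally with thousands separators) followed by its label.
_COUNT = re.compile(r"(\d[\d,]*)\s*(Comments?|Shares?)", re.IGNORECASE)


def split_comments_shares(raw):
    """Return (comments, shares) parsed from a fused string; missing parts are None."""
    if not isinstance(raw, str):  # NaN or None from a failed scrape
        return None, None
    counts = {"comment": None, "share": None}
    for number, kind in _COUNT.findall(raw):
        # "Comments" -> "comment", "Shares" -> "share"
        counts[kind.lower().rstrip("s")] = int(number.replace(",", ""))
    return counts["comment"], counts["share"]


print(split_comments_shares("66 Comments19 Shares"))
```

On the DataFrame this could be applied with something like `items['Comments_Shares'].apply(split_comments_shares)`, after dropping the empty and duplicated rows.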
