<img src="https://gatorgong.files.wordpress.com/2015/03/facebook_brands.jpg">

# Exploring brand engagement on Facebook 
## Part 2) Data Acquisition & Cleaning

The ideal (i.e. easiest!) way of obtaining Facebook page data is through the Facebook Graph API. However, ever since the Cambridge Analytica debacle, Facebook have severely limited developers’ access to page data for brands. In my initial proposal I suggested three strategies for obtaining Facebook page data:

    -Use the Graph API through Facebook
    -Use a web scraping/parsing script
    -Use a pre-existing data set from Kaggle or another data set repository

I decided to proceed with strategy 2 and use Selenium to write a script that would in effect ‘navigate’ Facebook like a user, scraping the data from each post on any given brand’s Facebook page.

## Using Selenium

When I experimented with Facebook.com, I was logged out several times in quick succession. I suspect Facebook is finely tuned to detect any behaviour that looks vaguely ‘robotic’. Fortunately ‘mobile.facebook.com’ proved to be far easier to manipulate.

My main aim was to obtain the following data from each post:

    -Post content (i.e. the copy of the post)
    -The date the post was made
    -The number of comments
    -The number of responses overall
    -The number of responses broken down by emotion (like, love, wow, angry, sad, haha)
    -The number of shares
    -The type of post (i.e. status update, photo, video, poll)

#### Challenge #1 – logging in

In order to capture share information you have to be logged in to Facebook, so my script had to be able to log in as a user. I found that I had to modify the script to anticipate multiple log-in screens in multiple formats, which was very challenging. Fortunately the script is fairly watertight now and can log in successfully almost every time.

#### Challenge #2 – analysis of responses broken down by emotion

I could easily scrape the total number of interactions (an aggregate of like, love, haha etc.), but scraping the breakdown by emotion is something I’m still looking at. The element ids are the same across all the emotions and there is no emotion-specific label to scrape, so it’s currently impossible to differentiate between a ‘like’ and a ‘haha’, for example. I’m working out strategies to get around this.

#### Challenge #3 – (mildly) limited posts to scrape

There seems to be a limit to how many historical posts are accessible on Facebook. I was initially hoping to find years and years’ worth of data, but I need to manage my expectations: I have found that Facebook doesn’t tend to go back more than a year for any one brand.

Because of these challenges, I have slightly evolved my initial aim.

I initially wanted to try and predict *what type of social content results in greater engagement on one brand’s Facebook page*. Given that I might struggle to get a detailed emotional breakdown and that I have slightly limited data for any one brand, I will look to include a secondary aim:

## Additional objective:

Can a machine learning classification model learn to differentiate between the social content of the main UK supermarket brands?

The commercial implications of this analysis are interesting. If the experimental hypothesis is supported, i.e. there is a genuine difference in the social content of the UK supermarket brands (which a classifier would be able to identify with a high degree of accuracy), then the various brand managers and marketing teams are doing a good job! However, if the null hypothesis cannot be rejected, and there is no observable difference in the social content of the main brands, then this is problematic: a key aspect of being a healthy brand is being a differentiated brand. It will also be interesting to explore how precision and recall vary by brand: are premium brands more likely to be misclassified as other premium brands, and economy brands as other economy brands?
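
To make the precision/recall question concrete, here is a minimal sketch of how per-brand precision and recall, plus a tally of which brand gets mistaken for which, could be computed from a classifier's predictions. The function name `per_class_precision_recall` and the toy labels are hypothetical and invented for illustration only; they are not results from this project.

```python
from collections import Counter

def per_class_precision_recall(y_true, y_pred):
    """Return ({label: (precision, recall)}, Counter of (actual, predicted) errors)."""
    labels = sorted(set(y_true) | set(y_pred))
    true_pos = Counter()
    pred_count = Counter(y_pred)   # how often each label was predicted
    true_count = Counter(y_true)   # how often each label actually occurred
    confusions = Counter()         # (actual, predicted) pairs for wrong predictions
    for actual, predicted in zip(y_true, y_pred):
        if actual == predicted:
            true_pos[actual] += 1
        else:
            confusions[(actual, predicted)] += 1
    scores = {}
    for label in labels:
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / true_count[label] if true_count[label] else 0.0
        scores[label] = (precision, recall)
    return scores, confusions

# Toy labels, invented purely for illustration
y_true = ['Waitrose', 'Waitrose', 'M&S', 'M&S', 'Lidl', 'Lidl']
y_pred = ['Waitrose', 'M&S', 'M&S', 'M&S', 'Lidl', 'Tesco']
scores, confusions = per_class_precision_recall(y_true, y_pred)
print(scores)
print(confusions.most_common())
```

If premium brands are mostly confused with other premium brands, the largest entries in `confusions` would be pairs such as `('Waitrose', 'M&S')`.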

## Data to obtain:

I’ve chosen a range of UK supermarkets: a mixture of higher-end and lower-end. I’ve also chosen brands that are fairly active on Facebook and have a reasonable number of engaged users. The following brands have been scraped:

    -Sainsburys
    -Tesco 
    -Waitrose
    -M&S
    -Morrisons
    -Lidl
    -ASDA


## Load Libraries

In [1]:
import numpy as np
import pandas as pd
import time
import ast

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

## Facebook Scraping Script

In [37]:
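#log in details and the brand we want to scrape (it doesn't even have to be a brand - it can be any entity that owns a commercial Facebook page - try it!)
#credentials are read from environment variables rather than hard-coded (the names FB_USER and FB_PASSWORD are placeholders)
import os
user = os.environ['FB_USER']
password = os.environ['FB_PASSWORD']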
brand = 'theguardian'

In [38]:
#initialise web driver
driver = webdriver.Chrome('./chromedriver')

#FACEBOOK LOG IN 1 (there are two variants of this page, so a try/except block covers all instances)
driver.get('https://www.facebook.com/')
email = driver.find_element_by_xpath('//*[@id="email"]')
email.send_keys(user)

pass_1 = driver.find_element_by_xpath('//*[@id="pass"]')
pass_1.send_keys(password)

click_1 = driver.find_element_by_xpath('//*[@id="loginbutton"]')
click_1.click()


# #FACEBOOK LOG IN 2 - This almost always gets asked again
# password_keep = driver.find_element_by_class_name('_4g34')
# password_keep.click()

#GO TO BRAND PAGE    
driver.get(f'https://www.facebook.com/{brand}')  
           
# #FACEBOOK LOG IN 3(!)- For some reason when you go to the brand page it expects you to log in again so we have to do the process again
# click_brand = driver.find_element_by_xpath('//*[@id="mobile_login_bar"]/div[2]/a[1]')
# click_brand.click() 

# #FACEBOOK LOG IN 4 - There can be a fourth log in screen - and there's two variants:
# #Facebook log in
# try:
#     click_third = driver.find_element_by_xpath('//*[@id="u_0_3"]')
#     click_third.click()
    
#     pass_3 = driver.find_element_by_xpath('//*[@id="root"]/div[1]/div/form/div[1]/div/div/input')
#     pass_3.send_keys(password)
    
#     log_on = driver.find_element_by_xpath('//*[@id="root"]/div[1]/div/form/div[2]/button')
#     log_on.click()
# #Branded log in
# except:
    
#     f = driver.find_element_by_xpath('//*[@id="m_login_password"]')
#     f.send_keys(password) 
    
#     g = driver.find_element_by_xpath('//*[@id="u_0_5"]')
#     g.click()

#SCRAPING STARTS HERE

#keep scrolling to the bottom of the page for 2750 seconds so that older posts are loaded
start = time.time()
while time.time()-start<2750:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

#list of each posts text/content
post_text = []
#list of each posts comments and share count
post_comments_shares = []
#list of dates when each post was made
post_date = []
#list of total interactions(e.g. all likes, love, haha, wow, sad, angry)
post_all_response = []
#row labels for brand scraped
brand_list = []

#THIS IS WHERE WE INITIATE THE SCRAPE  
item = driver.find_elements_by_class_name('_4-u2')#_4-u2_5va1

#scraping all the content we want and appending to the lists made above
for each in item:
    try:
        post_text.append(each.find_element_by_class_name('_5pbx').text)
    except:
        post_text.append(np.nan)
    try:
        post_date.append(each.find_element_by_class_name('timestampContent').text)
    except:
        post_date.append(np.nan)
    try:
        post_comments_shares.append(each.find_element_by_class_name('_ipo').text)
    except:
        post_comments_shares.append(np.nan)
    try:
        post_all_response.append(each.find_element_by_class_name('_4arz').text)
    except:
        post_all_response.append(np.nan)
    print("Scraped :",len(post_text))
    
#Setting up a list of brand labels (one per scraped post)
brand_list = [brand] * len(post_text)

items = pd.DataFrame({'Post_Content': post_text,
                      #'Type': post_type,
                      'Date': post_date,
                      'Comments_Shares': post_comments_shares,
                      'All_Responses': post_all_response,
                      'Brand' : brand_list
                     })

print(f"Facebook scrape of {brand} scraped {len(post_text)} Facebook posts.")
    

Scraped : 1
Scraped : 2
Scraped : 3
...
Scraped : 2297
Scraped : 2298
Facebook scrape of theguardian scraped 2298 Facebook posts.
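
The scrape above scrolls for a fixed 2750 seconds. As an alternative, here is a minimal sketch that stops early once the page height has stopped growing. The function name `scroll_until_stable` and the `pause` and `patience` parameters are hypothetical, and the default values are assumptions rather than tuned settings; the only Selenium call used is `driver.execute_script`.

```python
import time

def scroll_until_stable(driver, pause=2.0, max_seconds=2750, patience=3):
    """Scroll to the bottom repeatedly; stop when the page stops growing.

    driver      -- a Selenium WebDriver (anything with execute_script)
    pause       -- seconds to wait after each scroll so new posts can load
    max_seconds -- hard time limit, mirroring the fixed limit used above
    patience    -- consecutive scrolls with no height change before giving up
    """
    start = time.time()
    last_height = driver.execute_script("return document.body.scrollHeight")
    unchanged = 0
    while time.time() - start < max_seconds and unchanged < patience:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            unchanged += 1
        else:
            unchanged = 0
            last_height = new_height
    return last_height
```

Calling `scroll_until_stable(driver)` in place of the fixed loop would end the scroll as soon as Facebook stops serving older posts, which may save a lot of waiting on pages with a short history.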


## Save the CSV file

In [35]:
items.to_csv(f"{brand}final_desktop.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'TELEGRAPH.CO.UK/final_desktop.csv'

#### I've now scraped all the Facebook content for the following UK supermarkets/retailers:

    - Tesco
    - Sainsbury's
    - ASDA
    - Waitrose
    - Lidl
    - M&S
    - Morrisons
    
#### The next step is to import all of them, concat them and then clean them

## Import individual CSV files and drop null values

In [256]:
sainsburys_df = pd.read_csv('Sainsburysfinal_desktop.csv').drop(['Unnamed: 0'],axis=1)
tesco_df = pd.read_csv('Tescofinal_desktop.csv').drop(['Unnamed: 0'],axis=1)
waitrose_df = pd.read_csv('waitroseandpartnersfinal_desktop.csv').drop(['Unnamed: 0'],axis=1)
lidl_df = pd.read_csv('lidlukfinal_desktop.csv').drop(['Unnamed: 0'],axis=1)
asda_df = pd.read_csv('asdafinal_desktop.csv').drop(['Unnamed: 0'],axis=1)
morrisons_df = pd.read_csv('morrisonsfinal_desktop.csv').drop(['Unnamed: 0'],axis=1)
mns_df = pd.read_csv('marksandspencerfinal_desktop.csv').drop(['Unnamed: 0'],axis=1)

In [257]:
#drop nulls and duplicates from each brand's data, then print the resulting shape
for brand_df in [tesco_df, sainsburys_df, waitrose_df, lidl_df, mns_df, morrisons_df]:
    brand_df.dropna(inplace=True)
    brand_df.drop_duplicates(inplace=True)
    print(brand_df.shape)

(1088, 5)
(741, 5)
(846, 5)
(1401, 5)
(880, 5)
(888, 5)


In [258]:

df = pd.concat([sainsburys_df,tesco_df,waitrose_df,lidl_df,mns_df,morrisons_df],axis=0)

In [259]:
df.head()

| | Post_Content | Date | Comments_Shares | All_Responses | Brand |
|---|---|---|---|---|---|
| 0 | What a Bude-iful week! We gave one of our car ... | Yesterday at 1:12 PM | 1K Comments3,266 Shares653K Views | 6.8K | Sainsburys |
| 1 | Bake a festive showstopper with Sainsbury’s ma... | December 8 at 4:00 PM | 74 Comments23 Shares | 132 | Sainsburys |
| 2 | Get in the party spirit with Sainsbury’s magaz... | December 5 at 5:00 PM | 33 Comments5 Shares18K Views | 142 | Sainsburys |
| 3 | Harry and Meghan’s wedding cake maker Claire P... | December 1 at 3:59 PM | 59 Comments26 Shares | 193 | Sainsburys |
| 4 | These cookie-cup mince pies are deliciously ch... | November 29 at 3:59 PM | 34 Comments22 Shares | 187 | Sainsburys |


## Cleaning

#### As you can see we have a few cleaning and formatting things we need to look at:
    - Split the merged 'Comments_Shares' column into three columns: one for comments, one for shares and one for views (the presence of views implies a video)
    
    - Anywhere we have a value such as '1.4K' we need to convert it to 1400, and a value such as '1.2M' to 1200000
    
    - We need to add a column which tells us what type of content was in the post - we can tell it's a video if there's a 'Views' metric, and if there's a web link in the description (e.g. 'http://') we know there's a link to content
    
    - Finally we need to ensure we are dealing with numbers throughout, so we will need to convert them from string objects to integers
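
As a compact illustration of the first two cleaning steps, here is a minimal pure-Python sketch that parses one raw 'Comments_Shares' string into three integer counts using a regular expression. The helper names `to_int` and `parse_engagement` are hypothetical, and this is an alternative to (not a description of) the pandas steps used in the cells that follow.

```python
import re

_MULTIPLIERS = {'K': 1_000, 'M': 1_000_000}

def to_int(text):
    """Convert strings such as '74', '3,266', '1.4K' or '1.2M' to an int."""
    text = text.replace(',', '').strip()
    if text and text[-1] in _MULTIPLIERS:
        # round() guards against floating point results such as 1399.9999
        return int(round(float(text[:-1]) * _MULTIPLIERS[text[-1]]))
    return int(text)

# a number (optionally ending in K or M) followed by one of the three labels
_PATTERN = re.compile(r'([\d.,]+[KM]?)\s*(Comments?|Shares?|Views?)')

def parse_engagement(raw):
    """Split a merged string into comment, share and view counts (missing -> 0)."""
    counts = {'Comments': 0, 'Shares': 0, 'Views': 0}
    for number, label in _PATTERN.findall(raw):
        key = label if label.endswith('s') else label + 's'
        counts[key] = to_int(number)
    return counts

print(parse_engagement('1K Comments3,266 Shares653K Views'))
# {'Comments': 1000, 'Shares': 3266, 'Views': 653000}
print(parse_engagement('74 Comments23 Shares'))
# {'Comments': 74, 'Shares': 23, 'Views': 0}
```

Parsing by label rather than by position means a post with shares but no comments would still end up with its count in the right column.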

In [260]:
df.shape

(5844, 5)

In [261]:
#getting rid of any whitespace (we can add relevant space later on)
df.Comments_Shares = df.Comments_Shares.str.replace(' ','')

In [262]:
df.head(5)

| | Post_Content | Date | Comments_Shares | All_Responses | Brand |
|---|---|---|---|---|---|
| 0 | What a Bude-iful week! We gave one of our car ... | Yesterday at 1:12 PM | 1KComments3,266Shares653KViews | 6.8K | Sainsburys |
| 1 | Bake a festive showstopper with Sainsbury’s ma... | December 8 at 4:00 PM | 74Comments23Shares | 132 | Sainsburys |
| 2 | Get in the party spirit with Sainsbury’s magaz... | December 5 at 5:00 PM | 33Comments5Shares18KViews | 142 | Sainsburys |
| 3 | Harry and Meghan’s wedding cake maker Claire P... | December 1 at 3:59 PM | 59Comments26Shares | 193 | Sainsburys |
| 4 | These cookie-cup mince pies are deliciously ch... | November 29 at 3:59 PM | 34Comments22Shares | 187 | Sainsburys |


In [263]:
#preparing the 'Comments_Shares' column so I can cleanly split it into Comments / Shares / Views columns
df.Comments_Shares = df.Comments_Shares.str.replace('Comments','Comments, ')
df.Comments_Shares = df.Comments_Shares.str.replace('Shares','Shares, ')



In [264]:
#now we need to split out the Comments, Shares and views data into their own columns
df[['Comments','Shares','Views']] = df.Comments_Shares.str.split(' ', expand = True)
df.drop(['Comments_Shares'],axis=1,inplace=True)


In [265]:
#getting rid of any new nulls and re setting the index
df.dropna(inplace=True)
df.reset_index(drop=True, inplace= True)

In [266]:
df.head()

| | Post_Content | Date | All_Responses | Brand | Comments | Shares | Views |
|---|---|---|---|---|---|---|---|
| 0 | What a Bude-iful week! We gave one of our car ... | Yesterday at 1:12 PM | 6.8K | Sainsburys | 1KComments, | 3,266Shares, | 653KViews |
| 1 | Bake a festive showstopper with Sainsbury’s ma... | December 8 at 4:00 PM | 132 | Sainsburys | 74Comments, | 23Shares, | |
| 2 | Get in the party spirit with Sainsbury’s magaz... | December 5 at 5:00 PM | 142 | Sainsburys | 33Comments, | 5Shares, | 18KViews |
| 3 | Harry and Meghan’s wedding cake maker Claire P... | December 1 at 3:59 PM | 193 | Sainsburys | 59Comments, | 26Shares, | |
| 4 | These cookie-cup mince pies are deliciously ch... | November 29 at 3:59 PM | 187 | Sainsburys | 34Comments, | 22Shares, | |


In [267]:
#remove the 'Comments' / 'Shares' / 'Views' labels and the thousands separators
df['Comments'] = df['Comments'].apply(lambda x: x.replace('Comments,',''))
df['Comments'] = df['Comments'].apply(lambda x: x.replace('Comments',''))
df['Shares'] = df['Shares'].apply(lambda x: x.replace('Shares,',''))
df['Shares'] = df['Shares'].apply(lambda x: x.replace(',',''))
df['Views'] = df['Views'].apply(lambda x: x.replace('Views',''))

In [268]:
df.head()

| | Post_Content | Date | All_Responses | Brand | Comments | Shares | Views |
|---|---|---|---|---|---|---|---|
| 0 | What a Bude-iful week! We gave one of our car ... | Yesterday at 1:12 PM | 6.8K | Sainsburys | 1K | 3266 | 653K |
| 1 | Bake a festive showstopper with Sainsbury’s ma... | December 8 at 4:00 PM | 132 | Sainsburys | 74 | 23 | |
| 2 | Get in the party spirit with Sainsbury’s magaz... | December 5 at 5:00 PM | 142 | Sainsburys | 33 | 5 | 18K |
| 3 | Harry and Meghan’s wedding cake maker Claire P... | December 1 at 3:59 PM | 193 | Sainsburys | 59 | 26 | |
| 4 | These cookie-cup mince pies are deliciously ch... | November 29 at 3:59 PM | 187 | Sainsburys | 34 | 22 | |


## There are a few random anomalies - namely where friends of mine (remember, I had to be logged in to access the share data) have liked posts

In [269]:
df[df['All_Responses'].str.contains("Steve")] #2147

| | Post_Content | Date | All_Responses | Brand | Comments | Shares | Views |
|---|---|---|---|---|---|---|---|
| 2147 | Sourdough toast topped with mashed avocado and... | April 18, 2016 | Steve Lucijan Fle-Danijelović and 11K others | waitroseandpartners | 700 | 710 | |


In [270]:
df[df['All_Responses'].str.contains("Wai")] #2523 / 2533

| | Post_Content | Date | All_Responses | Brand | Comments | Shares | Views |
|---|---|---|---|---|---|---|---|
| 2523 | Which of these deliciously healthy breakfast r... | March 13, 2015 | Waitrose & Partners and 166 others | waitroseandpartners | 21 | 6 | |
| 2533 | Thumbs up if you’re a fan of the mighty strawb... | March 6, 2015 | Waitrose & Partners and 918 others | waitroseandpartners | 43 | 55 | |


In [271]:
df.drop([2147,2523,2533],axis=0,inplace=True)
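
Dropping these three rows is the simplest fix. If there were many more of them, the count could in principle be recovered from the text instead. Below is a minimal sketch, assuming the string always has the form '<one name> and <count> others'; the function name `responses_with_names` is hypothetical, and the `+ 1` for the named account is an assumption (it is negligible next to a rounded value such as 11K).

```python
import re

def responses_with_names(text):
    """Estimate a reaction total from strings like 'Some Name and 11K others'.

    Returns None when the text does not match that pattern.
    """
    match = re.search(r'and ([\d.,]+)([KM]?) others$', text.strip())
    if match is None:
        return None
    number = float(match.group(1).replace(',', ''))
    multiplier = {'': 1, 'K': 1_000, 'M': 1_000_000}[match.group(2)]
    # + 1 accounts for the single named person or page (an assumption)
    return int(round(number * multiplier)) + 1

print(responses_with_names('Waitrose & Partners and 166 others'))  # 167
print(responses_with_names('6.8K'))  # None
```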

In [272]:
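#function that converts strings such as '1.4K' or '1.2M' to integers (1400, 1200000); anything that can't be converted is returned unchanged
def kformat(x):
    try:
        if 'K' in x:
            return int(round(float(x.replace('K',''))*1000))
        elif 'M' in x:
            return int(round(float(x.replace('M',''))*1000000))
        else:
            return int(x)
    except (TypeError, ValueError):
        return x
    
#calling the function on the count columns that require that formatting
count_cols = ['All_Responses','Comments','Shares','Views']
df[count_cols] = df[count_cols].applymap(kformat)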

In [273]:
df.dtypes

Post_Content     object
Date             object
All_Responses     int64
Brand            object
Comments          int64
Shares            int64
Views            object
dtype: object

In [276]:
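#convert 'Views' to integers; posts without a view count (i.e. non-video posts) are set to 0
def viewsconverter(x):
    try:
        return int(x)
    except (TypeError, ValueError):
        return 0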

In [277]:
df.Views = df.Views.apply(viewsconverter)
df.Views.isnull().sum()

0

In [282]:
#reorganising the columns into a sensible order
df = df[['Date','Brand','Post_Content','All_Responses','Comments','Shares','Views']]

In [294]:
#looks decent
df.dtypes

Date              object
Brand             object
Post_Content      object
All_Responses      int64
Comments           int64
Shares             int64
Views              int64
Contains_Link       bool
Contains_Video      bool
dtype: object

# Engineering New Features / Adding Extra Layers of Data

In [274]:
#create a column that returns true if the corresponding 'Post_Content' column contained a link
df['Contains_Link'] = df['Post_Content'].str.contains("http")

In [280]:
#create a column that returns true if the corresponding 'Views' column has views
df['Contains_Video'] = df['Views'] > 0
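
The two boolean columns could also be combined into a single categorical column. Here is a minimal sketch, assuming a video label should take precedence over a link label; the function name `post_type` and the label strings are hypothetical. Photos and polls cannot be told apart with these two rules, so everything else falls back to 'status'.

```python
def post_type(post_content, views):
    """Label a post as 'video', 'link' or 'status' using the two rules above."""
    if views > 0:
        # a view count is only shown on video posts
        return 'video'
    if 'http' in post_content:
        # a URL in the copy suggests the post links out to other content
        return 'link'
    return 'status'

print(post_type('Get in the party spirit', 18000))          # video
print(post_type('Recipe here: http://example.com', 0))      # link
print(post_type('Thumbs up if you like strawberries', 0))   # status
```

With the DataFrame above this could be applied as `df['Type'] = [post_type(c, v) for c, v in zip(df['Post_Content'], df['Views'])]`.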

## Next steps:

There are a few other things we need to add to the data, namely the number of fans each page has. This is needed in order to calculate more contextual engagement (i.e. 20 comments on a page with 200 fans is more impressive than 20 comments on a page with 20000 fans). We can also make a few other engineered features; these will be covered in the exploratory analysis in part 3 of the project.
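
To illustrate the fan-count point, here is a minimal sketch of one possible contextual metric: interactions per fan. The function name `engagement_rate` and the choice to sum responses, comments and shares with equal weight are assumptions, not a definition taken from this project.

```python
def engagement_rate(responses, comments, shares, fans):
    """Interactions per fan; returns 0.0 when the fan count is missing or zero."""
    if not fans:
        return 0.0
    return (responses + comments + shares) / fans

# the example from the text: 20 comments on pages with 200 and 20000 fans
small_page = engagement_rate(0, 20, 0, fans=200)
large_page = engagement_rate(0, 20, 0, fans=20000)
print(small_page, large_page)
```

The same 20 comments give a rate one hundred times higher on the smaller page, which is the kind of context the raw counts hide.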