# Preparation

**Picking a scraper package**

There are a variety of Python packages designed to scrape social media posts. Some examples, ranked by number of GitHub stars:

- [Ultimate Facebook Scraper](https://github.com/harismuneer/Ultimate-Facebook-Scraper) 2.6k stars. Last activity Jul 2023. "Scrapes almost everything about a Facebook user's profile". Uses Selenium. Requires $119 payment.
- [Unofficial APIs](https://github.com/Rolstenhouse/unofficial-apis) 2.5k stars. Last activity Jan 2023. A list of unofficial APIs for various services. None for Facebook at present, but it might be worth checking in the future.
- [facebook-page-post-scraper](https://github.com/minimaxir/facebook-page-post-scraper) 2.1k stars. Last activity Dec 2017. Archived and read-only. 
- [facebook-scraper](https://github.com/kevinzg/facebook-scraper) 1.9k stars. Last activity Nov 2023. "Scrape Facebook public pages without an API key."
- [facebook-post-scraper](https://github.com/brutalsavage/facebook-post-scraper) 282 stars. Last activity Sep 2020. "Scrape Facebook Public Posts without using Facebook API."
- [major-scrapy-spiders](https://github.com/talhashraf/major-scrapy-spiders) 272 stars. Last activity Jul 2017. Has a profile spider for Scrapy.
- [facebook-scraper-selenium](https://github.com/apurvmishra99/facebook-scraper-selenium) 179 stars. Last activity Jun 2020. "Scrape posts from any group or user into a .csv file without needing to register for any API access".

Based on this list, the highest-starred option that is free, supports Facebook, and isn't archived is `facebook-scraper`, so that is what we will use.


To use the `facebook-scraper` package, we need cookies to get past the login page. Export your facebook.com cookies using a browser extension such as [Edit This Cookie](https://www.editthiscookie.com/) and save them as a txt file (`cookies.txt` in the scraping code).

**Picking a sentiment analysis package**

For sentiment analysis, there are a few package options:

- [NLTK](https://www.nltk.org/) Most comprehensive. Requires configuration and training for sentiment analysis.
- [TextBlob](https://textblob.readthedocs.io/en/dev/) Built on NLTK. Pre-trained sentiment analyzer. Composite polarity score.
- [VADER](https://github.com/cjhutto/vaderSentiment) Built on NLTK. Particularly strong for short-form text like article headlines and social media. Proportional pos/neg/neu and composite score.
- [spaCy](https://spacy.io/usage) Offers pretrained models and tools for NLP.
- [Transformers](https://huggingface.co/docs/transformers/index) Offers pretrained models for a variety of NLP tasks.

We will use VADER.

**Setting up conda environment**

Set up a conda environment with the following specifications (for example, save them as `environment.yml` and run `conda env create -f environment.yml`):
```
name: scraper
channels:
  - conda-forge
dependencies:
  - python==3.8
  - jupyterlab==4.0.8
  - ipykernel==6.25.0
  - numpy==1.24.4
  - pandas==2.0.3
  - matplotlib==3.7.3
  - seaborn==0.13.0
  - beautifulsoup4==4.12.2
  - requests==2.31.0
  - pip==23.3.1 
  - pip:
    - nltk==3.8.1
    - facebook-scraper==0.2.59
```

# Facebook Scraping

Import libraries.

In [1]:
from facebook_scraper import get_posts
import pandas as pd
import numpy as np


Get posts, including comments and reactors. Running the loop directly (rather than wrapping it in a function) lets the `listposts` variable keep whatever posts were collected before a temporary ban interrupts the scrape.

In [4]:
import warnings
warnings.filterwarnings('ignore')

listposts = []
for post in get_posts("news.com.au", 
                      cookies="cookies.txt",
                      pages=100,
                      options={"comments":True, "reactors": True, "posts_per_page": 10}):
    listposts.append(post)

TemporarilyBanned: You’re Temporarily Blocked
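
The same partial-results behaviour can also be kept inside a function by catching the exception and returning whatever was collected so far. The following is a minimal sketch, assuming `get_posts` yields posts lazily. `collect_until_error` is a hypothetical helper (not part of `facebook-scraper`), and the demo uses a stand-in generator rather than a real scrape.

```python
def collect_until_error(iterable):
    """Collect items until the iterable is exhausted or raises.

    Returns (collected_items, error), where error is None on a clean finish.
    """
    collected = []
    error = None
    try:
        for item in iterable:
            collected.append(item)
    except Exception as exc:  # e.g. a temporary ban raised mid-scrape
        error = exc
    return collected, error


def fake_posts():
    # Stand-in for get_posts(): yields two posts, then fails like a ban would.
    yield {"post_id": 1}
    yield {"post_id": 2}
    raise RuntimeError("You're Temporarily Blocked")


posts, error = collect_until_error(fake_posts())
print(len(posts), repr(error))
```

With the real scraper, the argument would be the same `get_posts(...)` call used in the loop, and `posts` could be written to CSV even when `error` is not `None`.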

Export raw data.

In [6]:
raw = pd.DataFrame.from_dict(listposts)
# raw.to_csv("fb_data4.csv")

One issue with running `get_posts()` is the possibility of being temporarily blocked by Facebook, which has happened every time so far. The block seems to last anywhere from a few hours to a few days. Potential remedies:

- Add a delay between requests. I don't think `facebook-scraper` offers an option for this.
- Rotate IPs, perhaps via a VPN or Tor.
- Randomize user agents, perhaps via `requests` or `Selenium`, or by manually switching browsers.
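
As an illustration of the first remedy, a delay can be added from the outside by pausing between items pulled from a generator. This is only a sketch: `throttled` is a hypothetical wrapper, the delay bounds are arbitrary, and it only slows the scrape if `get_posts` fetches pages lazily as posts are consumed (an assumption here). Whether it actually prevents a block is untested.

```python
import random
import time


def throttled(iterable, min_delay=5.0, max_delay=15.0, sleep=time.sleep):
    """Yield items from `iterable`, pausing a random interval between items."""
    for i, item in enumerate(iterable):
        if i > 0:
            # Random jitter so requests are not evenly spaced.
            sleep(random.uniform(min_delay, max_delay))
        yield item


# Demo with a recording stand-in for time.sleep so it finishes instantly.
recorded_delays = []
items = list(throttled(["a", "b", "c"], min_delay=1.0, max_delay=2.0,
                       sleep=recorded_delays.append))
print(items, len(recorded_delays))
```

In the scraping loop, this would wrap the existing call, as in `for post in throttled(get_posts(...)):`.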

Because of the issues with blocking, I have several separately scraped raw data files. We can collate them below. 

In [7]:
raw1 = pd.read_csv("fb_data1.csv")
raw2 = pd.read_csv("fb_data2.csv")
raw3 = pd.read_csv("fb_data3.csv")
raw4 = pd.read_csv("fb_data4.csv")
# Concatenate all scraped files in a single call.
raw_collated = pd.concat([raw1, raw2, raw3, raw4])

print(f"Number of total posts: {len(raw_collated)}")

Number of total posts: 260


Drop duplicate posts (same `post_id`). For each duplicated post, keep the copy with the greatest number of reactions (`reaction_count`).

In [8]:
raw_collated = raw_collated.sort_values("reaction_count", ascending=False).groupby("post_id").head(1)
raw_collated = raw_collated.sort_values("time", ascending=False).reset_index()

print(f"Number of posts: {len(raw_collated)}")
print(f"Number of columns: {raw_collated.shape[1]}")

Number of posts: 182
Number of columns: 54
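
To see what this deduplication does, here is a small sketch on made-up data. It uses `drop_duplicates` instead of `groupby(...).head(1)`, which should give the same result after sorting. The `toy` frame and its values are invented for illustration.

```python
import pandas as pd

# Toy data: post 1 and post 3 were each scraped twice with different counts.
toy = pd.DataFrame({
    "post_id": [1, 1, 2, 3, 3],
    "reaction_count": [5, 9, 4, 7, 2],
})

deduped = (
    toy.sort_values("reaction_count", ascending=False)  # highest count first
       .drop_duplicates(subset="post_id", keep="first")  # keep that top row per post
       .sort_values("post_id")
       .reset_index(drop=True)
)
print(deduped)
```

Each `post_id` appears once, paired with its highest `reaction_count`.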


# Data Exploration and Cleaning

Let's take a look at the data. 

In [9]:
df = raw_collated.copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182 entries, 0 to 181
Data columns (total 54 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   index                          182 non-null    int64  
 1   Unnamed: 0                     182 non-null    int64  
 2   post_id                        182 non-null    int64  
 3   text                           182 non-null    object 
 4   post_text                      182 non-null    object 
 5   shared_text                    134 non-null    object 
 6   original_text                  0 non-null      float64
 7   time                           182 non-null    object 
 8   timestamp                      182 non-null    int64  
 9   image                          31 non-null     object 
 10  image_lowquality               151 non-null    object 
 11  images                         151 non-null    object 
 12  images_description             151 non-null    obj

Some initial thoughts:

- All posts come with a caption (`post_text`)
- Most, but not all posts share a link (`link`)
- A few posts come with an image (`image`)
- Only a few posts come with a video (`video`)

It looks like total posts = posts with links + posts with video. So maybe all posts without links are video posts?

The package scrapes a good amount of data for each post: the collated dataframe has 54 columns in total. We don't need all of them, so we can select only the columns that we need for exploration, cleaning, and analysis.

In [10]:
columns = ["post_id", #unique id
           "post_text", #caption
           "time", 
           "video", 
           "image",
           "likes",
           "comments",
           "shares",
           "link",
           "reaction_count",
]

df = df[columns]

To start, let's look at which posts don't have links.

In [11]:
print("Posts without links:")
print(df[df.link.isna()].info())
print("Posts with links:")
print(df[~df.link.isna()].info())

Posts without links:
<class 'pandas.core.frame.DataFrame'>
Index: 46 entries, 0 to 176
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   post_id         46 non-null     int64  
 1   post_text       46 non-null     object 
 2   time            46 non-null     object 
 3   video           17 non-null     object 
 4   image           0 non-null      object 
 5   likes           46 non-null     float64
 6   comments        46 non-null     int64  
 7   shares          46 non-null     int64  
 8   link            0 non-null      object 
 9   reaction_count  46 non-null     int64  
dtypes: float64(1), int64(4), object(5)
memory usage: 4.0+ KB
None
Posts with links:
<class 'pandas.core.frame.DataFrame'>
Index: 136 entries, 31 to 181
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   post_id         136 non-null    int64  
 1   post_text     

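Of the 46 posts without links, only 17 have a `video` value and none has an `image`, so not every post without a link is a video post. Image posts, by contrast, always come with a link here. Since we need the link to scrape the article, let's drop the posts without links.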
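
One way to check how links and videos overlap without reading two `info()` printouts is a cross-tabulation of the two "is present" flags. The sketch below uses a made-up `toy` frame with placeholder URLs; on the real data, `df` would take its place.

```python
import pandas as pd

# Toy data: two posts with links, three without (two of which have a video).
toy = pd.DataFrame({
    "link": ["https://example.com/a", None, None, "https://example.com/b", None],
    "video": [None, "https://example.com/v1", None, None, "https://example.com/v2"],
})

# Rows: does the post have a link? Columns: does it have a video?
overlap = pd.crosstab(
    toy["link"].notna(),
    toy["video"].notna(),
    rownames=["has_link"],
    colnames=["has_video"],
)
print(overlap)
```

A non-zero count in the cell for no link and no video means some posts without links are not video posts.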

In [12]:
df.head(5)

Unnamed: 0,post_id,post_text,time,video,image,likes,comments,shares,link,reaction_count
0,759361192893859,“He was probably right! That would’ve changed ...,2023-11-07 18:40:02,,,2.0,0,0,,2
1,759348716228440,This is one way to steal the spotlight. 👀,2023-11-07 18:20:01,,,3.0,7,0,,5
2,759338512896127,COMMENT: The Melbourne Cup might be seen as a ...,2023-11-07 18:00:01,,,26.0,96,1,,60
3,759332346230077,Stargazers should enjoy Saturn’s rings while t...,2023-11-07 17:40:01,,,16.0,3,0,,17
4,759326902897288,What an iconic Aussie collaboration!,2023-11-07 17:20:02,,,13.0,2,0,,13


Looking at the data, the scraper generally does a good job and there isn't much to clean manually. I eyeballed the `post_text` values against the Facebook.com/news.com.au page and everything looks good. The only cleaning steps we need are:

- Remove the posts without links
- Remove the video and image columns
- Rename 'post_text' to 'caption'

In [13]:
df = df[~df.link.isna()].reset_index()
df.drop(['video','image'], axis=1, inplace=True)
df.rename({"post_text":"caption"}, axis=1, inplace=True)
df.info()
print(f"Number of posts: {len(df)}")
print(f"Number of columns: {df.shape[1]}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 136 entries, 0 to 135
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   index           136 non-null    int64  
 1   post_id         136 non-null    int64  
 2   caption         136 non-null    object 
 3   time            136 non-null    object 
 4   likes           136 non-null    float64
 5   comments        136 non-null    int64  
 6   shares          136 non-null    int64  
 7   link            136 non-null    object 
 8   reaction_count  136 non-null    int64  
dtypes: float64(1), int64(5), object(3)
memory usage: 9.7+ KB
Number of posts: 136
Number of columns: 9


# Scrape Article

Import libraries.

In [14]:
import requests
from bs4 import BeautifulSoup
from random import sample

This function needs to be customized to the HTML structure of the individual website. For news.com.au, the headline is in the `h1` element with `id="story-headline"`, and the article text is in the `div` element with `id="story-primary"`.

In [15]:
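def get_article(url):
    """Return (headline, article text) for a URL, or ("", "") on failure."""
    response = requests.get(url, timeout=30)  # don't hang forever on a slow page

    if response.status_code != 200:  # HTTP status code 200 = successful request
        print(f"Page unreachable for {url}.")
        return "", ""

    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find('h1', id="story-headline")
    article = soup.find('div', id="story-primary")

    if title is None or article is None:  # page layout differs from what we expect
        print(f"Headline or article body not found for {url}.")
        return "", ""

    return title.get_text(), article.get_text()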
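
To try the two selectors without making a network request, they can be run against a small HTML string. The markup below is invented for illustration and is far simpler than a real news.com.au page. Only the two `id` values come from the description above.

```python
from bs4 import BeautifulSoup

# Hypothetical, minimal page using the same ids as the real selectors.
sample_html = """
<html><body>
  <h1 id="story-headline">Example headline</h1>
  <div id="story-primary"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")
headline = soup.find("h1", id="story-headline").get_text(strip=True)
# Join the paragraphs with a space so words at paragraph boundaries don't run together.
body = soup.find("div", id="story-primary").get_text(" ", strip=True)
print(headline)
print(body)
```

Passing a separator to `get_text` is a design choice worth considering for the real scraper too, since plain `get_text()` concatenates adjacent paragraphs without whitespace.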

In [16]:
links = df.link.tolist()

In [17]:
article_data = [get_article(link) for link in links]
article_text = [data[1] for data in article_data]

To make sure that the articles were pulled correctly, we can check that none of the article texts is shorter than 1,000 characters.

In [18]:
[s for s in article_text if len(s) < 1000]

[]
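
If the check ever returns something, it helps to know which link produced the short text. A small sketch of that idea follows. `find_short_articles` is a hypothetical helper and the demo inputs are made up.

```python
MIN_LENGTH = 1000  # same threshold as the check above


def find_short_articles(links, texts, min_length=MIN_LENGTH):
    """Return (link, length) pairs for article texts shorter than `min_length`."""
    return [(link, len(text))
            for link, text in zip(links, texts)
            if len(text) < min_length]


demo_links = ["https://example.com/ok", "https://example.com/short"]
demo_texts = ["x" * 1500, "Page not found"]
short = find_short_articles(demo_links, demo_texts)
print(short)
```

On the real data the call would be `find_short_articles(links, article_text)`.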

Add article_data to our dataframe.

In [19]:
article_data_df = pd.DataFrame(article_data, columns=["article_title", "article_text"])
article_data_df.sample(5)

Unnamed: 0,article_title,article_text
94,Melbourne Cup Carnival 2023: Martha Kalifatidi...,Martha Kalifatidis joined in the merriment at ...
115,Home & Away star shares diagnosis after cruel ...,Home & Away star Kyle Shilling seemingly felt ...
54,New details of the victims of the Daylesford b...,A heart breaking picture has emerged of the fa...
87,The devastating song that left everyone in tea...,Matthew Perry’s family and friends were in tea...
95,Melbourne Cup form guide 2023: Every horse rat...,The full field for the 2023 Melbourne Cup has ...


In [20]:
df2 = df.copy()
df2 = pd.concat([df2,article_data_df], axis=1)
df2.sample(5)

Unnamed: 0,index,post_id,caption,time,likes,comments,shares,link,reaction_count,article_title,article_text
20,55,758623186300993,Just in time for your summer barbecue 🌞,2023-11-06 14:00:02,88.0,13,0,https://www.news.com.au/finance/business/retai...,93,Why expensive supermarket item is now cheap – ...,Australia’s sheep population has reached an al...
9,42,758702379626407,OPINION: The Melbourne Cup is now the race tha...,2023-11-06 18:20:01,202.0,443,9,https://www.news.com.au/sport/superracing/melb...,344,‘World has changed’: Why Aussies aren’t headin...,OPINIONThe Melbourne Cup is not just one of th...
31,67,758320876331224,He's revealed some behind-the-scenes drama wen...,2023-11-06 04:20:01,77.0,107,1,https://bit.ly/40olRVw?fbclid=IwAR1mpoqqJIHkAK...,77,Dave Hughes reveals massive Block auction error,Dave Hughes was a surprise attendee at this ye...
84,128,757742429722402,"""They would probably claim a moral victory.""",2023-11-05 03:40:02,689.0,117,7,https://www.news.com.au/sport/cricket/starc-de...,689,Starc destroys England with cheeky ‘moral vict...,Mitchell Starc couldn’t help but take a light-...
112,157,756984033131575,No zooper doopers were harmed in the incident. 🍦,2023-11-03 16:40:01,34.0,206,3,https://www.news.com.au/lifestyle/food/eat/col...,34,Coles defends 25 cent paper bags after custome...,A major supermarket has defended its paper bag...


# Sentiment Analysis

VADER's polarity scores (`pos`, `neu`, `neg`) are the proportions of the text that fall into each category. The compound score (`com` in our column names) is the sum of the valence scores of each word, adjusted according to VADER's rules and then normalized to lie between -1 (most extreme negative) and +1 (most extreme positive).
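
The normalization step can be sketched in a few lines. As far as I know, the VADER reference implementation maps the summed valence `x` to `x / sqrt(x^2 + alpha)` with `alpha = 15`. Treat that constant as an assumption rather than something confirmed here. `normalize_compound` is an illustrative name, not the library's function.

```python
import math


def normalize_compound(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clamp to guard against floating-point overshoot.
    return max(-1.0, min(1.0, score))


for total in (-4.0, 0.0, 0.5, 4.0):
    print(total, round(normalize_compound(total), 4))
```

The mapping is symmetric around zero and flattens out for large sums, so a long, strongly worded article approaches but never exceeds ±1.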

Import libraries.

In [21]:
# Requires the VADER lexicon; if it is missing, run nltk.download("vader_lexicon") once.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

Functions.

In [28]:
analyzer = SentimentIntensityAnalyzer()

def sentiment_analyzer(text):
    # Score the text once and reuse the result for all four values.
    scores = analyzer.polarity_scores(text)
    return {'neg': scores['neg'],
            'neu': scores['neu'],
            'pos': scores['pos'],
            'com': scores['compound']}

def add_sentiments_to_df(df, colnames):
    # Append <col>_neg, <col>_neu, <col>_pos and <col>_com for each text column.
    for col in colnames:
        sentiments = df[col].apply(sentiment_analyzer).apply(pd.Series)
        sentiments.columns = [col+"_neg", 
                              col+"_neu",
                              col+"_pos", 
                              col+"_com"]
        df = pd.concat([df, sentiments], axis=1)
    return df

Add sentiment scores for whichever text columns you want to analyze. (Options: `article_text`, `article_title`, `caption`)

In [29]:
df3 = df2.copy()
df3 = add_sentiments_to_df(df3, ["article_text"])

Inspect the final dataframe, then export the data to CSV.

In [30]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 136 entries, 0 to 135
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   index             136 non-null    int64  
 1   post_id           136 non-null    int64  
 2   caption           136 non-null    object 
 3   time              136 non-null    object 
 4   likes             136 non-null    float64
 5   comments          136 non-null    int64  
 6   shares            136 non-null    int64  
 7   link              136 non-null    object 
 8   reaction_count    136 non-null    int64  
 9   article_title     136 non-null    object 
 10  article_text      136 non-null    object 
 11  article_text_neg  136 non-null    float64
 12  article_text_neu  136 non-null    float64
 13  article_text_pos  136 non-null    float64
 14  article_text_com  136 non-null    float64
dtypes: float64(5), int64(5), object(5)
memory usage: 16.1+ KB


In [31]:
df3.to_csv("datafile.csv")