# Preparation

**Picking a scraper package**

There are a variety of Python packages designed to scrape social media posts. Some examples (ranked by number of GitHub stars):

- [Ultimate Facebook Scraper](https://github.com/harismuneer/Ultimate-Facebook-Scraper) 2.6k stars. Last activity Jul 2023. "Scrapes almost everything about a Facebook user's profile". Uses Selenium. Requires $119 payment.
- [Unofficial APIs](https://github.com/Rolstenhouse/unofficial-apis) 2.5k stars. Last activity Jan 2023. List of unofficial APIs for various services, none for Facebook for now, but it might be worth checking in the future.
- [facebook-page-post-scraper](https://github.com/minimaxir/facebook-page-post-scraper) 2.1k stars. Last activity Dec 2017. Archived and read-only. 
- [facebook-scraper](https://github.com/kevinzg/facebook-scraper) 1.9k stars. Last activity Nov 2023. "Scrape Facebook public pages without an API key."
- [facebook-post-scraper](https://github.com/brutalsavage/facebook-post-scraper) 282 stars. Last activity Sep 2020. "Scrape Facebook Public Posts without using Facebook API."
- [major-scrapy-spiders](https://github.com/talhashraf/major-scrapy-spiders) 272 stars. Last activity Jul 2017. Has a profile spider for Scrapy.
- [facebook-scraper-selenium](https://github.com/apurvmishra99/facebook-scraper-selenium) 179 stars. Last activity Jun 2020. "Scrape posts from any group or user into a .csv file without needing to register for any API access".

Based on this list, the highest-starred option that isn't paid, supports Facebook, and isn't archived is `facebook-scraper`, so that is what we will use.


To use the `facebook-scraper` package, we need cookies to bypass the login page. Export the Facebook.com cookies using a browser extension such as [Edit This Cookie](https://www.editthiscookie.com/) and save them as a txt file.
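
Before starting a long scrape, it can help to confirm that the export actually contains a logged-in session. A minimal sketch, assuming the extension exported a JSON array of objects with a `name` key (Edit This Cookie's default export, as far as I know) and that `c_user` and `xs` are the cookies that mark a logged-in Facebook session. `missing_login_cookies` is a hypothetical helper, not part of `facebook-scraper`.

```python
import json

# Cookie names assumed to identify a logged-in Facebook session.
REQUIRED = ("c_user", "xs")

def missing_login_cookies(path, required=REQUIRED):
    """Return the required cookie names that are absent from a JSON cookie export."""
    with open(path, encoding="utf-8") as f:
        cookies = json.load(f)  # assumed shape: [{"name": ..., "value": ...}, ...]
    present = {cookie.get("name") for cookie in cookies}
    return [name for name in required if name not in present]

# Example (assumes the export was saved as cookies.txt):
# print(missing_login_cookies("cookies.txt"))
```

An empty list means both cookies were found. If the export is in another format (for example Netscape-style text), this check does not apply as written.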

**Picking a sentiment analysis package**

For sentiment analysis, there are a few package options:

- [NLTK](https://www.nltk.org/) Most comprehensive. Requires configuration and training for sentiment analysis.
- [TextBlob](https://textblob.readthedocs.io/en/dev/) Built on NLTK. Pre-trained sentiment analyzer. Composite polarity score.
- [VADER](https://github.com/cjhutto/vaderSentiment) Built on NLTK. Particularly strong for short-form text like article headlines and social media. Proportional pos/neg/neu and composite score.
- [spaCy](https://spacy.io/usage) Offers pretrained models and tools for NLP.
- [Transformers](https://huggingface.co/docs/transformers/index) Offers pretrained models for a variety of NLP tasks.

We will use VADER.

**Setting up conda environment**

Set up a conda environment with the following specifications:
```
name: scraper
channels:
  - conda-forge
dependencies:
  - python==3.8
  - jupyterlab==4.0.8
  - ipykernel==6.25.0
  - numpy==1.24.4
  - pandas==2.0.3
  - matplotlib==3.7.3
  - seaborn==0.13.0
  - beautifulsoup4==4.12.2
  - requests==2.31.0
  - pip==23.3.1 
  - pip:
    - nltk==3.8.1
    - facebook-scraper==0.2.59
```

# Facebook Scraping

Import libraries.

In [1]:
from facebook_scraper import get_posts
import pandas as pd
import numpy as np



Get posts, including comments and reactors. Running the loop directly (rather than wrapping it in a function) means the `listposts` variable keeps whatever post data was collected before a temporary ban interrupts the scrape.

In [29]:
listposts = []
for post in get_posts("news.com.au", 
                      cookies="cookies.txt",
                      pages=100,
                      options={"comments":True, "reactors": True, "posts_per_page": 10}):
    listposts.append(post)

TemporarilyBanned: You’re Temporarily Blocked

Export raw data.

In [27]:
print(f"Number of posts: {len(listposts)}")
raw = pd.DataFrame.from_dict(listposts)
# raw.to_csv("fb_data4.csv")

Number of posts: 8


One issue with running the `get_posts()` function is the possibility of getting temporarily blocked by Facebook, which has happened every time. The block seems to last anywhere from a few hours to a few days. Potential remedies:

- Create a delay between requests. I don't think `facebook-scraper` offers an option for this.
- Rotate IPs, perhaps via a VPN or Tor.
- Randomize user agents, perhaps via `requests` or `Selenium`, or by manually switching browsers.
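
The first remedy can be approximated from the calling side. A minimal sketch, assuming `get_posts` fetches lazily as it is iterated (it is consumed with a `for` loop above). `throttled` and `collect` are hypothetical helpers, not part of `facebook-scraper`, and the delay values are arbitrary guesses. Whether this is enough to avoid a block is untested.

```python
import random
import time

def throttled(iterable, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Yield items from `iterable`, pausing a random interval between consecutive items."""
    for i, item in enumerate(iterable):
        if i > 0:
            sleep(random.uniform(min_delay, max_delay))
        yield item

def collect(iterable):
    """Collect items into a list, keeping what was gathered if iteration fails."""
    items = []
    try:
        for item in iterable:
            items.append(item)
    except Exception as exc:  # e.g. a temporary block raised mid-scrape
        print(f"Stopped early after {len(items)} items: {exc!r}")
    return items

# Demo with a plain range and near-zero delays:
print(collect(throttled(range(3), min_delay=0.0, max_delay=0.01)))

# Hypothetical use with the scraper (same arguments as the cell above):
# listposts = collect(throttled(get_posts("news.com.au", cookies="cookies.txt", pages=100,
#                     options={"comments": True, "reactors": True, "posts_per_page": 10})))
```

The `sleep` parameter is injectable only so the pause can be replaced in tests. The pause applies between posts, so any extra requests the scraper makes while building a single post are not slowed down.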

Because of the issues with blocking, I have several separately scraped raw data files. We can collate them below. 

In [5]:
raw1 = pd.read_csv("fb_data1.csv")
raw2 = pd.read_csv("fb_data2.csv")
raw3 = pd.read_csv("fb_data3.csv")
raw4 = pd.read_csv("fb_data4.csv")
# Stack the separately scraped files into one DataFrame
raw_collated = pd.concat([raw1, raw2, raw3, raw4])

print(f"Number of total posts: {len(raw_collated)}")

Number of total posts: 260


Drop duplicate posts (same `post_id`). Keep the one with the greater number of reactions (`reaction_count`).

In [6]:
raw_collated = raw_collated.sort_values("reaction_count", ascending=False).groupby("post_id").head(1)
raw_collated = raw_collated.sort_values("time", ascending=False).reset_index()

print(f"Number of posts: {len(raw_collated)}")
print(f"Number of columns: {raw_collated.shape[1]}")

Number of posts: 182
Number of columns: 54


# Data Exploration and Cleaning

Let's take a look at the data. 

In [7]:
df = raw_collated.copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182 entries, 0 to 181
Data columns (total 54 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   index                          182 non-null    int64  
 1   Unnamed: 0                     182 non-null    int64  
 2   post_id                        182 non-null    int64  
 3   text                           182 non-null    object 
 4   post_text                      182 non-null    object 
 5   shared_text                    134 non-null    object 
 6   original_text                  0 non-null      float64
 7   time                           182 non-null    object 
 8   timestamp                      182 non-null    int64  
 9   image                          31 non-null     object 
 10  image_lowquality               151 non-null    object 
 11  images                         151 non-null    object 
 12  images_description             151 non-null    obj

Some initial thoughts:

- All posts come with a caption (`post_text`)
- Most, but not all posts share a link (`link`)
- A few posts come with an image (`image`)
- Only a few posts come with a video (`video`)

It looks like total posts = posts with links + posts with video. So maybe all posts without links are video posts?

The package scrapes a good amount of data for each post: the collated data has 54 columns in total. We don't need all of it, so we can select only the columns that we need for exploration, cleaning, and analysis.

In [8]:
columns = ["post_id", #unique id
           "post_text", #caption of post
           "time", #time posted (human-readable)
           "video", #link to video
           "image",
           "likes",
           "comments",
           "shares",
           "post_url",
           "link",
           "comments_full",
           "reactions",
           "reaction_count",
]

df = df[columns]

To start, let's look at which posts don't have links.

In [9]:
print("Posts without links:")
print(df[df.link.isna()].info())
print("Posts with links:")
print(df[~df.link.isna()].info())

Posts without links:
<class 'pandas.core.frame.DataFrame'>
Index: 46 entries, 0 to 176
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   post_id         46 non-null     int64  
 1   post_text       46 non-null     object 
 2   time            46 non-null     object 
 3   video           17 non-null     object 
 4   image           0 non-null      object 
 5   likes           46 non-null     float64
 6   comments        46 non-null     int64  
 7   shares          46 non-null     int64  
 8   post_url        46 non-null     object 
 9   link            0 non-null      object 
 10  comments_full   46 non-null     object 
 11  reactions       35 non-null     object 
 12  reaction_count  46 non-null     int64  
dtypes: float64(1), int64(4), object(8)
memory usage: 5.0+ KB
None
Posts with links:
<class 'pandas.core.frame.DataFrame'>
Index: 136 entries, 31 to 181
Data columns (total 13 columns):
 #   Column        

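OK, the posts without links never have an image, but only 17 of the 46 have a `video` value, so they are not all confirmed video posts. Image posts can have links. Either way, posts without links have no article to scrape, so let's drop them.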

In [10]:
df.head(5)

| | post_id | post_text | time | video | image | likes | comments | shares | post_url | link | comments_full | reactions | reaction_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 759361192893859 | “He was probably right! That would’ve changed ... | 2023-11-07 18:40:02 | | | 2.0 | 0 | 0 | https://facebook.com/news.com.au/posts/7593611... | | [] | {'like': 2} | 2 |
| 1 | 759348716228440 | This is one way to steal the spotlight. 👀 | 2023-11-07 18:20:01 | | | 3.0 | 7 | 0 | https://facebook.com/news.com.au/posts/7593487... | | [] | {'like': 3, 'haha': 2} | 5 |
| 2 | 759338512896127 | COMMENT: The Melbourne Cup might be seen as a ... | 2023-11-07 18:00:01 | | | 26.0 | 96 | 1 | https://facebook.com/news.com.au/posts/7593385... | | [] | {'like': 26, 'love': 1, 'haha': 31, 'wow': 2} | 60 |
| 3 | 759332346230077 | Stargazers should enjoy Saturn’s rings while t... | 2023-11-07 17:40:01 | | | 16.0 | 3 | 0 | https://facebook.com/news.com.au/posts/7593323... | | [] | {'like': 16, 'haha': 1} | 17 |
| 4 | 759326902897288 | What an iconic Aussie collaboration! | 2023-11-07 17:20:02 | | | 13.0 | 2 | 0 | https://facebook.com/news.com.au/posts/7593269... | | [] | {'like': 13} | 13 |


Looking at the data, the scraper generally does a good job and there isn't much to clean manually. I compared the `post_text` values by eye against the Facebook.com/news.com.au page and everything looks good. It seems the only thing we need to do is remove the posts without links. We can also remove the `video` and `image` columns, as we don't need them for further analysis.

In [11]:
df = df[~df.link.isna()].reset_index()
print(f"Number of posts: {len(df)}")
print(f"Number of columns: {df.shape[1]}")

Number of posts: 136
Number of columns: 14
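
One caveat for later analysis: because the raw data was saved to and re-read from CSV, dict- and list-valued columns such as `reactions` and `comments_full` come back as strings (for example `"{'like': 3, 'haha': 2}"`). A minimal sketch of turning such a cell back into a Python object with the standard library's `ast.literal_eval`. `parse_cell` is a hypothetical helper name, and returning `None` for missing or unparseable cells is an assumption about what is convenient downstream.

```python
import ast

def parse_cell(value, default=None):
    """Convert a stringified Python literal (dict or list) back into an object."""
    if not isinstance(value, str):
        return default  # missing values arrive as NaN/None rather than str
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return default  # not a plain literal, e.g. a cell containing a datetime repr

reactions = parse_cell("{'like': 3, 'haha': 2}")
print(sum(reactions.values()))  # 5
```

It could be applied column-wise with something like `df["reactions"].apply(parse_cell)`. Cells that contain anything other than plain literals will fall back to the default.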


# Scrape Article

Import libraries.

In [12]:
import requests
from bs4 import BeautifulSoup
from random import sample

This function needs to be customized based on the HTML structure of the individual website. For news.com.au, the headline is in the `h1` element with `id="story-headline"`, and the article body is in the `div` element with `id="story-primary"`.

In [13]:
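def get_article(url):
    """Return (title, article_text) for a URL, or two empty strings on failure."""
    response = requests.get(url, timeout=30)  # time out rather than hang on a slow page

    if response.status_code != 200:  # HTTP status code 200 = successful request
        print(f"Page unreachable for {url}.")
        return "", ""

    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find('h1', id="story-headline")
    article = soup.find('div', id="story-primary")
    if title is None or article is None:  # page layout differs from what we expect
        print(f"Headline or article body not found for {url}.")
        return "", ""

    return title.get_text(), article.get_text()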

In [14]:
df2 = df.copy()
links = df2.link.tolist()

In [15]:
article_data = [get_article(link) for link in links]
article_title = [data[0] for data in article_data]
article_text = [data[1] for data in article_data]

To make sure that the articles were pulled correctly, we can check that no article text is shorter than 1,000 characters.

In [16]:
[s for s in article_text if len(s) < 1000]

[]

Add article_data to our dataframe.

In [17]:
article_data_df = pd.DataFrame(article_data, columns=["article_title", "article_text"])
article_data_df.sample(5)

| | article_title | article_text |
|---|---|---|
| 135 | Mariah Carey facing $31M lawsuit over hit song... | Musician Andy Stone sued Mariah Carey for $US2... |
| 34 | Jude Law has been spotted at Paul McCartney’s ... | A-lister Jude Law has been spotted having the ... |
| 25 | 'I literally just want to take a shower; look ... | Anyone who's brought home a newborn will remem... |
| 94 | Melbourne Cup Carnival 2023: Martha Kalifatidi... | Martha Kalifatidis joined in the merriment at ... |
| 21 | Unique white Platypus found in Aussie creek pr... | In 1799, when British scientists first receive... |


In [18]:
df2 = pd.concat([df2,article_data_df], axis=1)
df2.sample(5)

| | index | post_id | post_text | time | video | image | likes | comments | shares | post_url | link | comments_full | reactions | reaction_count | article_title | article_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64 | 107 | 758094369687208 | Adrian Portelli and Danny Wallis’ showboating ... | 2023-11-05 16:47:02 | | | 289.0 | 582 | 5 | https://facebook.com/news.com.au/posts/7580943... | https://www.news.com.au/entertainment/tv/reali... | [] | | 289 | The Block auctions: Adrian Portelli and Danny ... | The fate of this year’s contestants on The Blo... |
| 15 | 49 | 758657949630850 | Heartbreaking. 💔 | 2023-11-06 16:00:07 | | | 11.0 | 19 | 6 | https://facebook.com/news.com.au/posts/7586579... | https://www.news.com.au/entertainment/celebrit... | [{'comment_id': '3397149677202656', 'comment_u... | {'like': 11, 'wow': 2, 'care': 3, 'sad': 120} | 136 | Black Panther stuntman Taraja Ramsess and his ... | A stuntman known for his work in Black Panther... |
| 114 | 159 | 756966483133330 | The police investigation will take “some time”... | 2023-11-03 16:00:02 | | | 7.0 | 2 | 0 | https://facebook.com/news.com.au/posts/7569664... | https://www.news.com.au/sport/more-sports/ice-... | [{'comment_id': '300744162823648', 'comment_ur... | | 7 | Ice hockey player Adam Johnson’s fiancee had t... | Tragic ice hockey star Adam Johnson’s body was... |
| 101 | 146 | 757624819734163 | Ange Postecoglou is on top of the football wor... | 2023-11-04 22:40:01 | | | 109.0 | 49 | 3 | https://facebook.com/news.com.au/posts/7576248... | https://www.news.com.au/sport/afl/aussies-outr... | [] | | 109 | Aussies outraged over Ange Postecoglou photo w... | Ange Postecoglou is on top of the football wor... |
| 132 | 178 | 756702146493097 | Time to check your coins! 👀 | 2023-11-03 05:20:01 | | | 59.0 | 21 | 3 | https://facebook.com/news.com.au/posts/7567021... | https://www.news.com.au/finance/money/is-this-... | [{'comment_id': '272344048605678', 'comment_ur... | | 59 | Is this rare 1968 2c Australian coin worth $49... | In the world of coin collecting, certain piece... |


# Sentiment Analysis

VADER's `pos`, `neu`, and `neg` polarity scores are the proportions of the text that fall into each category. The `compound` score is the sum of the valence scores of each word, adjusted according to VADER's rules, and then normalized to lie between -1 (most extreme negative) and +1 (most extreme positive).
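
To make the normalization step concrete, here is a minimal sketch of the formula as I understand it from the VADER reference implementation: the summed valence `x` is mapped to `x / sqrt(x^2 + alpha)` with `alpha = 15`. The function below is written out for illustration only and is not imported from `nltk`. The input values are made up.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded summed valence score into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# Larger absolute sums approach -1 or +1 but never reach them.
for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(f"{raw:+5.1f} -> {normalize(raw):+.4f}")
```

A summed valence of 1.0 maps to 0.25, since `sqrt(1 + 15) = 4`. This is why a single mildly positive word in a short post gives a modest compound score, while several strongly charged words push it toward the extremes.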

Import libraries.

In [19]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

Functions.

In [20]:
# Requires the VADER lexicon; if it is missing, run nltk.download("vader_lexicon") once.
analyzer = SentimentIntensityAnalyzer()

def perc_round(value):
    """Convert a proportion to a percentage rounded to 2 decimal places."""
    return np.round(value*100, 2)

def sentiment_analyzer(text):
    """Return VADER neg/neu/pos/compound scores for `text` as a dict."""
    scores = analyzer.polarity_scores(text)  # score the text once and reuse the result
    return {'neg': scores['neg'],
            'neu': scores['neu'],
            'pos': scores['pos'],
            'com': scores['compound']}

def add_sentiments_to_df(df, colnames):
    """Append <col>_neg, _neu, _pos, and _com columns for each column in `colnames`."""
    for col in colnames:
        sentiments = df[col].apply(sentiment_analyzer).apply(pd.Series)
        sentiments.columns = [col+"_neg",
                              col+"_neu",
                              col+"_pos",
                              col+"_com"]
        df = pd.concat([df, sentiments], axis=1)
    return df

Add sentiments for whichever text you want to analyze. (Options: `article_text`, `article_title`, `post_text`)

In [21]:
df3 = df2.copy()
df3 = add_sentiments_to_df(df3, ["article_text"])

In [22]:
import seaborn as sns
import matplotlib.pyplot as plt

# The sentiment columns were added to df3, not df
plt.hist(df3.article_text_pos, bins=5)
plt.title('Histogram of article_text_pos')
plt.xlabel('Positive sentiment proportion')
plt.ylabel('Frequency')
plt.show()

In [None]:
# article_text was already scored above; scoring it again would duplicate its columns
df3 = add_sentiments_to_df(df3, ["post_text", "article_title"])

In [None]:
df3.sample(5)

Export data to csv. 

In [None]:
df3.info()

In [None]:
columns = ["time",
           "likes",
           "comments",
           "shares",
           "link",
           "comments_full",
           "reactions",
           "reaction_count",
           "article_title",
           "article_text",
           # NOTE: sentiment columns (e.g. "article_text_com") must be listed here to be exported
]

In [None]:
df3.to_csv("datafile.csv")

In [23]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 136 entries, 0 to 135
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   index             136 non-null    int64  
 1   post_id           136 non-null    int64  
 2   post_text         136 non-null    object 
 3   time              136 non-null    object 
 4   video             0 non-null      object 
 5   image             31 non-null     object 
 6   likes             136 non-null    float64
 7   comments          136 non-null    int64  
 8   shares            136 non-null    int64  
 9   post_url          136 non-null    object 
 10  link              136 non-null    object 
 11  comments_full     136 non-null    object 
 12  reactions         26 non-null     object 
 13  reaction_count    136 non-null    int64  
 14  article_title     136 non-null    object 
 15  article_text      136 non-null    object 
 16  article_text_neg  136 non-null    float64
 1

Save to csv.

In [24]:
df3 = df3[columns]
df3.to_csv("datafile.csv")