# Script for extracting FB data

## Set Up

In [1]:
from facebook_scraper import get_posts
import numpy as np
import pandas as pd
import requests
from time import sleep
import datetime

In [2]:
pd.set_option('display.max_colwidth', None)

## Extracting comments

### Import selected FB posts data

As mentioned in the script to extract Reddit data, I was only interested in Facebook and Reddit posts that shared the same Straits Times or Channel News Asia article reporting on Ministry of Health (MOH) announcements about the imposition or tightening of restrictions. The links to the Facebook posts with matching Reddit posts were entered manually into `posts_reddit_fb_selected.csv`.

In [3]:
fb_selected = pd.read_csv('../data/posts_reddit_fb_selected.csv')

In [4]:
fb_selected

Unnamed: 0,author,id,num_comments,score,selftext,title,url,created_sgt,full_link_reddit,full_link_fb
0,chailoren,n4li0v,923,1,,Singapore to cut social gathering size from 8 to 5 amid rising Covid-19 cases; effective May 8-30,https://www.straitstimes.com/singapore/health/singapore-to-cut-social-gathering-size-from-8-to-5-amid-rising-covid-19-cases,4/5/2021 18:59,https://www.reddit.com/r/singapore/comments/n4li0v/singapore_to_cut_social_gathering_size_from_8_to/,https://www.facebook.com/TheStraitsTimes/posts/10157877541112115
1,Fawx13x,n4li5g,28,1,,"Cap of 5 people for social gatherings, household visits to return as Singapore tightens COVID-19 measures",https://www.channelnewsasia.com/singapore/cap-of-5-people-social-gatherings-household-visits-covid-19-moh-1344191,4/5/2021 18:59,https://www.reddit.com/r/singapore/comments/n4li5g/cap_of_5_people_for_social_gatherings_household/,https://www.facebook.com/ChannelNewsAsia/posts/10158274255327934
2,485320,n52529,45,1,,Limit on employees who can return to workplace back at 50%; firms urged to adhere to tighter Covid-19 rules,https://www.straitstimes.com/singapore/limit-on-employees-who-can-return-to-workplace-back-at-50-as-covid-19-measures-are,5/5/2021 8:00,https://www.reddit.com/r/singapore/comments/n52529/limit_on_employees_who_can_return_to_workplace/,https://www.facebook.com/TheStraitsTimes/posts/10157878977002115
3,hahohehuhi,n61hcv,22,1,,"COVID-19: Indoor sports facilities to close temporarily, outdoor exercise classes to continue with reduced capacity",https://www.channelnewsasia.com/news/singapore/indoor-sports-facilities-close-outdoor-classes-allowed-covid-19-14754552,6/5/2021 15:09,https://www.reddit.com/r/singapore/comments/n61hcv/covid19_indoor_sports_facilities_to_close/,https://www.facebook.com/ChannelNewsAsia/posts/10158278202967934
4,Fawx13x,nc0vau,11,1,,"Group sizes down from 5 to 2, dining-in suspended as Singapore tightens COVID-19 measures",https://www.channelnewsasia.com/news/singapore/covid-19-phase-2-dining-in-work-from-home-tightened-measures-1365476,14/5/2021 13:06,https://www.reddit.com/r/singapore/comments/nc0vau/group_sizes_down_from_5_to_2_diningin_suspended/,https://www.facebook.com/ChannelNewsAsia/posts/10158295565437934
5,sexyhades69,nc0veq,2,1,,"Group sizes down from 5 to 2, dining-in suspended as Singapore tightens COVID-19 measures",https://www.channelnewsasia.com/news/singapore/covid-19-phase-2-dining-in-work-from-home-tightened-measures-14808382,14/5/2021 13:07,https://www.reddit.com/r/singapore/comments/nc0veq/group_sizes_down_from_5_to_2_diningin_suspended/,https://www.facebook.com/ChannelNewsAsia/posts/10158295565437934
6,shady-memes_v13,nc0vwe,1684,1,,"No dining in, social gatherings capped at 2 people from May 16 as S'pore tightens Covid-19 rules",https://www.straitstimes.com/singapore/health/no-dining-in-social-gatherings-capped-at-2-people-from-may-16-as-spore-tightens,14/5/2021 13:08,https://www.reddit.com/r/singapore/comments/nc0vwe/no_dining_in_social_gatherings_capped_at_2_people/,https://www.facebook.com/TheStraitsTimes/posts/10157899793777115
7,caifanconnoisseur,nc0wni,7,1,,"Only 2 visitors per household per day, no dining-in allowed: Covid-19 rules in S'pore from May 16",https://www.straitstimes.com/singapore/health/only-2-visitors-per-household-per-day-no-dining-in-allowed-covid-19-rules-in-spore,14/5/2021 13:09,https://www.reddit.com/r/singapore/comments/nc0wni/only_2_visitors_per_household_per_day_no_diningin/,https://www.facebook.com/TheStraitsTimes/posts/10157899892712115
8,Raphiel_Shiraha_Ains,nc0y1s,1,1,,"Group sizes down from 5 to 2, dining-in suspended as Singapore tightens COVID-19 measures - CNA",https://www.channelnewsasia.com/news/singapore/covid-19-phase-2-dining-in-work-from-home-tightened-measures-14808382,14/5/2021 13:11,https://www.reddit.com/r/singapore/comments/nc0y1s/group_sizes_down_from_5_to_2_diningin_suspended/,https://www.facebook.com/ChannelNewsAsia/posts/10158295565437934
9,SMJLeo,ncctqd,43,1,,"Fixed seating with one-metre spacing for recess, no intermingling, as MOE tightens measures to fight Covid-19",https://www.straitstimes.com/singapore/fixed-seating-with-one-metre-spacing-for-recess-no-intermingling-as-moe-tightens-measures,15/5/2021 0:34,https://www.reddit.com/r/singapore/comments/ncctqd/fixed_seating_with_onemetre_spacing_for_recess_no/,https://www.facebook.com/TheStraitsTimes/posts/10157900823922115
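
Several Reddit posts in the table above link to the same Facebook post (rows 4, 5 and 8 share one Channel News Asia URL), so requesting every row would fetch some posts more than once. A minimal sketch of de-duplicating the URLs before scraping is shown below; the small `demo` dataframe and the name `unique_fb_urls` are stand-ins introduced for illustration, not part of the original notebook.

```python
import pandas as pd

# Stand-in for fb_selected, using three rows taken from the table above.
demo = pd.DataFrame({
    'id': ['nc0vau', 'nc0veq', 'nc0vwe'],
    'full_link_fb': [
        'https://www.facebook.com/ChannelNewsAsia/posts/10158295565437934',
        'https://www.facebook.com/ChannelNewsAsia/posts/10158295565437934',
        'https://www.facebook.com/TheStraitsTimes/posts/10157899793777115',
    ],
})

# Keep the first occurrence of each Facebook URL, preserving row order.
unique_fb_urls = demo['full_link_fb'].drop_duplicates().reset_index(drop=True)

print(len(demo), len(unique_fb_urls))
```

Whether to de-duplicate depends on the analysis: keeping one row per Reddit post is still needed if comments are later matched back to each Reddit thread.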


### Get comments for each post

I used the [facebook-scraper Python package](https://pypi.org/project/facebook-scraper/) to extract all the comments from each post using the URL for the post.

To better understand how the results were structured, I extracted comments from the first post in the dataframe:

In [5]:
for post in get_posts(post_urls=fb_selected['full_link_fb'][0:1],
                      cookies='../data/cookies_fb.txt',
                      options={'comments': True, 'progress': True, 'allow_extra_requests': False}):
    print(post)

  0%|          | 0/269 [00:00<?, ?it/s]

{'original_request_url': 'https://www.facebook.com/TheStraitsTimes/posts/10157877541112115', 'post_url': 'https://facebook.com/story.php?story_fbid=pfbid035FEyJ6GJCiMiHgiKc8v9fzptow9eoYAtQsWBtz7KT5D4v3XardkSmzzCYxtCkLkcl&id=129011692114', 'post_id': 'pfbid035FEyJ6GJCiMiHgiKc8v9fzptow9eoYAtQsWBtz7KT5D4v3XardkSmzzCYxtCkLkcl', 'text': 'JUST IN: No more than 50% of employees should be in the office at any one time, down from 75%. Households can receive only 5 distinct visitors a day.\n\nSTRAITSTIMES.COM\nBack to phase 2: Cap of 5 people for social gatherings from May 8 amid rising Covid-19 cases', 'post_text': 'JUST IN: No more than 50% of employees should be in the office at any one time, down from 75%. Households can receive only 5 distinct visitors a day.', 'shared_text': 'STRAITSTIMES.COM\nBack to phase 2: Cap of 5 people for social gatherings from May 8 amid rising Covid-19 cases', 'original_text': None, 'time': datetime.datetime(2021, 5, 4, 6, 59, 47), 'timestamp': 1620125987, 'image

From the above we can see that:
- According to the `comments` key, there are 502 comments for the first post.
- Data on each comment is stored as a nested list of dictionaries under the `comments_full` key.

I then extracted the comments for the first post and found that although the post reports 502 comments, only 252 of them were returned in `comments_full`. Perhaps only the top-level comments (and not replies to comments) are extracted at this level.

In [6]:
for post in get_posts(post_urls=fb_selected['full_link_fb'][0:1],
                      cookies='../data/cookies_fb.txt',
                      options={'comments': True, 'progress': True}):
    print(post['comments'], len(post['comments_full']))

  0%|          | 0/269 [00:00<?, ?it/s]

502 252
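
One way to probe the gap between 502 and 252 is to count replies separately from top-level comments. The sketch below assumes each comment dictionary may carry a `replies` list of reply dictionaries; that key is an assumption about the scraper's output and is not visible in the truncated printout above, so it should be checked against a real post. The `sample` data is made up for illustration.

```python
def count_comments(comments_full):
    """Return (top_level, replies) counts for a list of comment dicts."""
    top_level = len(comments_full)
    # 'replies' is assumed to be a list of reply dicts;
    # a missing or empty value is treated as no replies.
    replies = sum(len(c.get('replies') or []) for c in comments_full)
    return top_level, replies


# Made-up stand-in for post['comments_full'].
sample = [
    {'comment_id': '1', 'comment_text': 'a',
     'replies': [{'comment_id': '1a', 'comment_text': 'b'}]},
    {'comment_id': '2', 'comment_text': 'c', 'replies': []},
    {'comment_id': '3', 'comment_text': 'd'},
]

top_level, replies = count_comments(sample)
print(top_level, replies)
```

If the two counts for a real post add up to roughly the value under `comments`, that would support the guess that the missing comments are replies.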


Finally, I wrote the following code to extract comments from the selected Facebook posts (rows 0 to 33 of `fb_selected`, one post per request), collect the results in a dataframe, and export it as a CSV file.

In [35]:
comments = []  # one dict per extracted comment

# Request one post at a time, pausing between requests.
for idx in range(34):
    sample = fb_selected['full_link_fb'][idx:idx + 1]

    for post in get_posts(post_urls=sample,
                          cookies='../data/cookies_fb.txt',
                          timeout=180,
                          options={'comments': True, 'progress': True}):
        for c in post['comments_full']:
            comments.append({
                'comment_id': c['comment_id'],
                'comment_text': c['comment_text']
            })

    sleep(10)  # wait 10 seconds before the next request

# Build the dataframe once, after all comments have been collected.
comments_df = pd.DataFrame(comments)


In [36]:
comments_df

Unnamed: 0,comment_id,comment_text
0,10157877630867115,The opening up in early April was way too much/fast. 75% back to office is practically allowing for all to go back. When the borders were wide open to practically all. Too hasty.
1,10157877561317115,"Please ban India flights, India citizens India passport holders India born anyone with a travel history to India. Once you have this ban we will be like before zero community case. Other countries have imposed total ban on India flights and citizens including jail and fine. We should follow in order to protect our citizens."
2,10157877564117115,Haiz.. no apologies for importing cases. All your doing leh garmen. I'm going to have a friendly chat with my MP!
3,10157877557412115,Tighten local measures but fail to shut borders completely on travelers from high-risk countries. What's the point then? What a shame!
4,10157877555032115,Blaming us for not following safe distancing measures again instead of blaming themselves.
...,...,...
13612,10158186215552115,Trying very hard to push that all singaporeans get infected and finally admitted that healthcare has been overwhelmed.
13613,10158186213932115,"Omg. This is real bad. And really bad foresight and estimation. So if all the beds are filled, where are we going to place more patients and etc. And yet they gonna open more VTL lanes in the coming months. I presume MTF knows that we have yet to reach the peak and we have yet reached our worst period. Everyone please stay vigilant now and we are on our own."
13614,10158186213787115,"Endemic ma, didnt they know this?"
13615,10158186211122115,"If healthcare system is overstretched, why you still open more VTL so contradictory?\n\nDo you hv enough hospital beds for the imports ?\n\nPlease spare a thought for our healthcare workers, we already hv enough problems.\n\nhttps://\nwww.change.org/\np/\ngeneral-public-c\noncerns-for-imp\nort-cases"
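
The dataframe above has no column linking a comment back to its post. A minimal sketch of keeping that link is shown below, assuming the `original_request_url` key seen in the printed post is present for every post; the helper name `flatten_comments` and the `demo_post` data are hypothetical and only serve the illustration.

```python
import pandas as pd


def flatten_comments(post):
    """Turn one post dict into a list of rows, one per top-level comment."""
    return [
        {
            'post_url': post.get('original_request_url'),
            'comment_id': c['comment_id'],
            'comment_text': c['comment_text'],
        }
        for c in post.get('comments_full') or []
    ]


# Made-up stand-in for one post returned by the scraper.
demo_post = {
    'original_request_url': 'https://www.facebook.com/TheStraitsTimes/posts/10157877541112115',
    'comments_full': [
        {'comment_id': '1', 'comment_text': 'first'},
        {'comment_id': '2', 'comment_text': 'second'},
    ],
}

demo_df = pd.DataFrame(flatten_comments(demo_post))
print(demo_df.columns.tolist())
```

With a `post_url` column, the comments can later be merged with `fb_selected` on `full_link_fb` to compare them with the matching Reddit thread.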


In [37]:
comments_df.to_csv('../data/comments_fb_all.csv', encoding='utf-8-sig')
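
The export above writes the dataframe index as the first column of the file. A minimal sketch of reading such a file back is shown below, using a temporary file and a made-up two-row dataframe; reading `comment_id` as a string is a choice made here to keep the long IDs as identifiers rather than numbers, not something the original notebook does.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for comments_df.
demo = pd.DataFrame({
    'comment_id': ['10157877630867115', '10157877561317115'],
    'comment_text': ['first', 'second'],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'comments_demo.csv')
    demo.to_csv(path, encoding='utf-8-sig')
    # index_col=0 restores the saved index instead of creating an
    # 'Unnamed: 0' column; dtype keeps the IDs as strings.
    restored = pd.read_csv(path, encoding='utf-8-sig', index_col=0,
                           dtype={'comment_id': str})

print(restored.columns.tolist())
```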