# Facebook Data Crawling
In this notebook, we will crawl data from a Facebook fan page using the facebook-scraper library, which scrapes public pages directly rather than going through the Facebook Graph API.

## Install the required library
We install facebook-scraper with pip, together with pandas, numpy, scikit-learn, and nltk.

In [1]:
%pip install facebook_scraper pandas numpy scikit-learn nltk

Collecting facebook_scraper
  Downloading facebook_scraper-0.2.59-py3-none-any.whl (45 kB)
Collecting pandas
  Obtaining dependency information for pandas from https://files.pythonhosted.org/packages/db/3e/db3e98911b5da217d1e3f85b6e091448cb8f8be674bdaff3c0ec0dd855e0/pandas-2.1.2-cp311-cp311-win_amd64.whl.metadata
  Downloading pandas-2.1.2-cp311-cp311-win_amd64.whl.metadata (18 kB)
Collecting numpy
  Obtaining dependency information for numpy from https://files.pythonhosted.org/packages/82/0f/3f712cd84371636c5375d2dd70e7514d264cec6bdfc3d7997a4236e9f948/numpy-1.26.1-cp311-cp311-win_amd64.whl.metadata
  Downloading numpy-1.26.1-cp311-cp311-win_amd64.whl.metadata (61 kB)


[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [4]:
from facebook_scraper import get_posts
import pandas as pd
import numpy as np
import time
import random

## Crawl the data using facebook_scraper
Now we can fetch posts from the fanpage with the get_posts function of the facebook_scraper library. The function yields posts one at a time, each as a dictionary, so we collect them into a list and later save that list to a CSV file. More information about what the facebook_scraper library can do is available at https://github.com/kevinzg/facebook-scraper

## Define variables
First we define the variables used throughout the notebook.
- FANPAGE_LINK: The fanpage that we want to crawl data from. Despite its name, the cell below sets it to the page name ("natgeo") rather than the full link. The page name is the part of the address-bar link that follows facebook.com/. For example, the link to the fanpage of the [Nintendo Switch](https://www.facebook.com/NintendoSwitch/) is https://www.facebook.com/NintendoSwitch/, so its page name is NintendoSwitch.

- COOKIE_PATH: The path to the cookie file used to authenticate with Facebook. To obtain it, log into Facebook and export the cookies from the browser. For example, in Chromium, use the extension [Get cookies.txt LOCALLY](https://chrome.google.com/webstore/detail/get-cookiestxt/bgaddhkoddajcdgocldbbfleckgcbcid) to get the cookie file. Save the cookies to a file and use the path to that file as the value for COOKIE_PATH. <span style="color:red; font-weight:bold">USE A COOKIE FROM A FAKE ACCOUNT, OTHERWISE YOUR REAL ACCOUNT MIGHT GET BANNED.</span>

- FOLDER_PATH: The folder that the data is saved to, relative to the directory of this notebook.

- PAGES_NUMBER: The number of pages to crawl.
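
The cell below uses a bare page name ("natgeo") rather than a full link. As a minimal sketch, a helper such as the hypothetical `page_name_from_link` could derive that name from a copied link. It assumes the page name is the first path segment of the URL, which holds for the Nintendo Switch example above but may not hold for every Facebook link format.

```python
from urllib.parse import urlparse


def page_name_from_link(link: str) -> str:
    """Return the first path segment of a fanpage link (hypothetical helper)."""
    # A bare page name such as "natgeo" has no scheme; return it unchanged.
    if "://" not in link:
        return link.strip("/")
    path = urlparse(link).path
    segments = [part for part in path.split("/") if part]
    if not segments:
        raise ValueError(f"No page name found in link: {link!r}")
    return segments[0]


print(page_name_from_link("https://www.facebook.com/NintendoSwitch/"))  # NintendoSwitch
```

The result could then be assigned to FANPAGE_LINK in place of a hand-typed page name.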

In [5]:
FANPAGE_LINK = "natgeo"
FOLDER_PATH = "Data/"
COOKIE_PATH = "./cookie.txt"

PAGES_NUMBER = 60 # Number of pages to crawl

In [None]:
post_list = []
# get_posts yields one dictionary per post
for post in get_posts(
    FANPAGE_LINK,
    pages=PAGES_NUMBER,
    cookies=COOKIE_PATH,
    timeout=600,
    extra_info=True,
    options={"comments": True, "reactions": True, "allow_extra_requests": True},
):
    post_list.append(post)
    time.sleep(30)  # fixed pause between posts
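
The loop above pauses a fixed 30 seconds after each post, and the `random` import is otherwise unused. One possible variation, sketched below, randomizes the pause and keeps the posts collected so far if the crawl fails partway. The function name `collect_with_jitter` and the 20 to 40 second bounds are assumptions chosen for illustration, not values from facebook_scraper.

```python
import random
import time


def collect_with_jitter(items, min_delay=20.0, max_delay=40.0, sleep=time.sleep):
    """Collect items from an iterable, pausing a random interval after each one."""
    collected = []
    try:
        for item in items:
            collected.append(item)
            # Sleep for a random duration between min_delay and max_delay seconds.
            sleep(random.uniform(min_delay, max_delay))
    except Exception as error:
        # Keep whatever was fetched before the failure instead of losing it.
        print(f"Stopped early after {len(collected)} items: {error}")
    return collected
```

In this notebook it could be called as `post_list = collect_with_jitter(get_posts(...))`, with the same arguments to get_posts as in the cell above.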

In [8]:
len(post_list)

230

## Convert list of dicts to df

Now we can convert the list of dictionaries to a pandas DataFrame and save it to a CSV file.

In [None]:
import os

# Build a DataFrame with one row per post; columns come from the post dictionary keys
post_df_full = pd.DataFrame(post_list)

# Save to CSV, creating the output folder if it does not exist yet
os.makedirs(FOLDER_PATH, exist_ok=True)
path = FOLDER_PATH + FANPAGE_LINK + ".csv"
post_df_full.to_csv(path, index=False)
print(path)

In [10]:
post_df_full = pd.read_csv("Data/natgeo.csv")

In [11]:
post_df_full

Unnamed: 0,post_id,text,post_text,shared_text,original_text,time,timestamp,image,image_lowquality,images,...,w3_fb_url,reactions,reaction_count,with,page_id,sharers,image_id,image_ids,was_live,fetched_time
0,89723405844105,How we've dealt with periods over millennia sa...,How we've dealt with periods over millennia sa...,NATIONALGEOGRAPHIC.COM Egyptions used papyrus...,,2023-11-30 21:15:00,1700738135,,https://external.fhan14-1.fna.fbcdn.net/emg1/v...,[],...,https://www.facebook.com/natgeo/posts/89723405...,"{'like': 383, 'love': 11, 'care': 2, 'haha':1,...",397,,23497828950,,,[],False,2023-11-23 20:51:40.794297
1,89723405844104,What will it take the save the world's rivers?...,What will it take the save the world's rivers?...,NATIONALGEOGRAPHIC.COM How Vjosa Wild River N...,,2023-11-30 18:15:00,1700738134,,https://external.fhan14-1.fna.fbcdn.net/emg1/v...,[],...,https://www.facebook.com/natgeo/posts/89723405...,"{'like': 675, 'love': 46, 'care': 4, 'haha':2,...",730,,23497828950,,,[],False,2023-11-23 20:51:40.794297
2,89723405844103,North America is becoming a more popular winte...,North America is becoming a more popular winte...,NATIONALGEOGRAPHIC.COM Where to go for a NORT...,,2023-11-30 12:15:00,1700738133,,https://external.fhan14-1.fna.fbcdn.net/emg1/v...,[],...,https://www.facebook.com/natgeo/posts/89723405...,"{'like': 710, 'love': 41, 'care': 0, 'haha':2,...",754,,23497828950,,,[],False,2023-11-23 20:51:40.794297
3,89723405844102,Experts have one key piece of advice for our a...,Experts have one key piece of advice for our a...,NATIONALGEOGRAPHIC.COM Walking is the sixth v...,,2023-11-30 09:15:00,1700738132,,https://external.fhan14-1.fna.fbcdn.net/emg1/v...,[],...,https://www.facebook.com/natgeo/posts/89723405...,"{'like': 501, 'love': 46, 'care': 5, 'haha':1,...",554,,23497828950,,,[],False,2023-11-23 20:51:40.794297
4,89723405844101,A traffic cam along I-94 in Minnesota captured...,A traffic cam along I-94 in Minnesota captured...,,,2023-11-30 06:15:00,1700738131,,https://external.fhan14-1.fna.fbcdn.net/emg1/v...,[],...,https://www.facebook.com/natgeo/posts/89723405...,"{'like': 6800, 'love': 2100, 'care': 107, 'hah...",9400,,23497828950,,,[],False,2023-11-23 20:51:40.794297
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
225,89723405844076,"Día de los Muertos, or Day of the Dead, is a t...","Día de los Muertos, or Day of the Dead, is a t...",,,2023-11-01 17:15:00,1700738106,,https://external.fhan14-1.fna.fbcdn.net/emg1/v...,[],...,https://www.facebook.com/natgeo/posts/89723405...,"{'like': 1400, 'love': 167, 'care': 12, 'haha'...",1600,,23497828950,,,[],False,2023-11-23 20:51:40.794297
226,89723405844077,The Swedish capital's character has been shape...,The Swedish capital's character has been shape...,"NATIONALGEOGRAPHIC.COM A guide to Stockholm, ...",,2023-11-01 11:15:00,1700738107,,https://external.fhan14-1.fna.fbcdn.net/emg1/v...,[],...,https://www.facebook.com/natgeo/posts/89723405...,"{'like': 1000, 'love': 67, 'care': 10, 'haha':...",1000,,23497828950,,,[],False,2023-11-23 20:51:40.794297
227,89723405844078,Have you heard of delusional parasitosis? Some...,Have you heard of delusional parasitosis? Some...,,,2023-11-01 17:15:00,1700738108,,https://external.fhan14-1.fna.fbcdn.net/emg1/v...,[],...,https://www.facebook.com/natgeo/posts/89723405...,"{'like': 334, 'love': 14, 'care': 10, 'haha':8...",382,,23497828950,,,[],False,2023-11-23 20:51:40.794297
228,89723405844079,"The backstory of jack-o'-lanterns, including h...","The backstory of jack-o'-lanterns, including h...",NATIONALGEOGRAPHIC.COm The history of America...,,2023-11-01 03:15:00,1700738109,,https://external.fhan14-1.fna.fbcdn.net/emg1/v...,[],...,https://www.facebook.com/natgeo/posts/89723405...,"{'like': 447, 'love': 7, 'care': 5, 'haha':3, ...",513,,23497828950,,,[],False,2023-11-23 20:51:40.794297


In [11]:
post_df_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98 entries, 0 to 97
Data columns (total 52 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   post_id                        98 non-null     object        
 1   text                           98 non-null     object        
 2   post_text                      98 non-null     object        
 3   shared_text                    98 non-null     object        
 4   original_text                  0 non-null      object        
 5   time                           98 non-null     datetime64[ns]
 6   timestamp                      98 non-null     int64         
 7   image                          24 non-null     object        
 8   image_lowquality               97 non-null     object        
 9   images                         98 non-null     object        
 10  images_description             98 non-null     object        
 11  images_lowquality    

In [12]:
print(post_df_full.columns)

Index(['post_id', 'text', 'post_text', 'shared_text', 'original_text', 'time',
       'timestamp', 'image', 'image_lowquality', 'images',
       'images_description', 'images_lowquality',
       'images_lowquality_description', 'video', 'video_duration_seconds',
       'video_height', 'video_id', 'video_quality', 'video_size_MB',
       'video_thumbnail', 'video_watches', 'video_width', 'likes', 'comments',
       'shares', 'post_url', 'link', 'links', 'user_id', 'username',
       'user_url', 'is_live', 'factcheck', 'shared_post_id', 'shared_time',
       'shared_user_id', 'shared_username', 'shared_post_url', 'available',
       'comments_full', 'reactors', 'w3_fb_url', 'reactions', 'reaction_count',
       'with', 'page_id', 'sharers', 'image_id', 'image_ids', 'was_live',
       'fetched_time'],
      dtype='object')
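
Because the DataFrame was reloaded from CSV, a column such as `reactions` holds the text form of each dictionary (for example "{'like': 383, ...}") rather than a dictionary. The sketch below shows one way to turn such a cell back into a dictionary using only the standard library. The helper name `parse_reactions` is hypothetical, and treating missing or unparsable cells as an empty dictionary is an assumption.

```python
import ast


def parse_reactions(cell):
    """Turn a stored reactions value into a dict; return {} when it is missing."""
    if isinstance(cell, dict):
        return cell
    if not isinstance(cell, str) or not cell.strip():
        return {}  # covers None and the float NaN that pandas uses for empty cells
    try:
        parsed = ast.literal_eval(cell)  # safely evaluates Python literals only
    except (ValueError, SyntaxError):
        return {}
    return parsed if isinstance(parsed, dict) else {}


print(parse_reactions("{'like': 383, 'love': 11, 'care': 2}"))
```

Applied to the whole column, this would look like `post_df_full["reactions"].apply(parse_reactions)`.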
