# Facebook Data Crawling
In this notebook, we crawl data from Facebook with the facebook-scraper library. The library scrapes public pages directly and does not go through the Facebook Graph API, so no API key is needed.

## Install the required library
We install facebook-scraper with pip, together with pandas and numpy, which are used later to store the results.

In [1]:
%pip install facebook_scraper pandas numpy

Note: you may need to restart the kernel to use updated packages.





In [1]:
from facebook_scraper import get_posts
import pandas as pd
import numpy as np
import time
from random import randint

## Crawl the data using facebook_scraper
Now we can get the data from Facebook using the facebook_scraper library. We use the get_posts function to fetch the posts from the fanpage. It returns a generator that yields one dictionary per post; we collect these dictionaries into a list, which can then be saved to a file such as JSON or CSV. More information about what you can do with the facebook_scraper library can be found here: https://github.com/kevinzg/facebook-scraper
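
The collect-and-save step described above can be sketched as follows. This is a minimal illustration rather than part of the original notebook: `collect_posts`, `save_posts_json`, and `fake_posts` are hypothetical names, and `fake_posts` stands in for `get_posts` so the cell runs without a network connection or cookies.

```python
import json
import os
import tempfile
from datetime import datetime
from itertools import islice


def collect_posts(post_iter, limit=None):
    """Materialize up to `limit` posts from any iterable of dicts."""
    return list(islice(post_iter, limit))


def save_posts_json(posts, path):
    """Write posts to a JSON file; non-JSON values (e.g. datetime) become strings."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(posts, f, ensure_ascii=False, indent=2, default=str)


def fake_posts():
    # Hypothetical stand-in for get_posts(...): yields one dict per post.
    yield {"post_id": "1", "text": "first", "time": datetime(2023, 11, 22, 21, 0, 39)}
    yield {"post_id": "2", "text": "second", "time": datetime(2023, 11, 22, 20, 30, 37)}


posts = collect_posts(fake_posts())
out_path = os.path.join(tempfile.gettempdir(), "posts_example.json")
save_posts_json(posts, out_path)
```

Passing `default=str` is one simple way to handle the `datetime` values in the `time` field; the trade-off is that they come back as plain strings when the file is loaded again.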

## Define variables
First we have to define some variables that we will be using throughout the notebook. 
- FANPAGE_LINK: The fanpage that we want to crawl data from. Go to the fanpage and copy the link from the address bar. For example, the link to the fanpage of the [Nintendo Switch](https://www.facebook.com/NintendoSwitch/) is https://www.facebook.com/NintendoSwitch/. In the cell below, FANPAGE_LINK is set to the page name, which is the part of the link after facebook.com/.

- COOKIE_PATH: The path to the cookie file used to authenticate with Facebook. To obtain it, log into Facebook and export the cookies from the browser. For example, in Chromium, use the extension [Get cookies.txt LOCALLY](https://chrome.google.com/webstore/detail/get-cookiestxt/bgaddhkoddajcdgocldbbfleckgcbcid) to get the cookie file. Save the cookies to a file and use the path to this file as the value for COOKIE_PATH. <span style="color:red; font-weight:bold">USE COOKIE FROM A FAKE ACCOUNT, OTHERWISE YOUR REAL ACCOUNT MIGHT GET BANNED.</span>

- FOLDER_PATH: The folder that we will be saving the data to. The path is relative to the directory that contains this notebook.

- PAGES_NUMBER: The number of pages to crawl.
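
Since the description above talks about a link while the crawl needs a page name, a small helper can convert one into the other. The sketch below is an illustration only; `page_name_from_link` is a hypothetical helper, not part of facebook_scraper, and it assumes the page name is the first path segment of the link.

```python
from urllib.parse import urlparse


def page_name_from_link(link: str) -> str:
    """Return the page name from a fanpage link, or the input if it is already a name."""
    link = link.strip()
    if "://" not in link:
        # Already a bare page name such as "anhdadenchuanmen"
        return link.strip("/")
    # Keep only non-empty path segments, e.g. "/NintendoSwitch/" -> ["NintendoSwitch"]
    parts = [p for p in urlparse(link).path.split("/") if p]
    if not parts:
        raise ValueError(f"No page name found in link: {link!r}")
    return parts[0]


name_from_link = page_name_from_link("https://www.facebook.com/NintendoSwitch/")
name_from_name = page_name_from_link("anhdadenchuanmen")
print(name_from_link, name_from_name)
```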

In [2]:
FANPAGE_LINK = "anhdadenchuanmen"
FOLDER_PATH = "Data/"
COOKIE_PATH = "cookies.txt"

PAGES_NUMBER = 100 # Number of pages to crawl

In [4]:
# post_list = []
# for post in get_posts('anhdadenchuanmen',
#                     options={"comments": True, "reactions": True, "allow_extra_requests": True,"posts_per_page": 1000000},
#                     extra_info=True, pages=PAGES_NUMBER, cookies=COOKIE_PATH):
#     print(post)
#     post_list.append(post)

# post_list    

## Convert list of dicts to df

The cell below crawls the posts into a list of dictionaries and records a URL from which an interrupted crawl can resume. Afterwards, we convert the list to a pandas DataFrame and save it to an xlsx or csv file.

In [4]:
from facebook_scraper import get_posts, exceptions
import time
from random import randint

PAGE_NUMBER = 100  # Number of pages to crawl

post_list = []
RESUME_URL_SAVE_FILE = "resume_url.txt"
COOKIES_FILE_PATH = "cookies.txt"


def save_url(url):
    # Persist the pagination URL so that a later run can resume from it
    with open(RESUME_URL_SAVE_FILE, "w") as f:
        f.write(url)


def read_resume_url():
    try:
        with open(RESUME_URL_SAVE_FILE, "r") as f:
            url = f.read().strip()
        if url:
            return url
    except FileNotFoundError:
        pass
    return None  # Return None if no URL has been saved


def handle_pagination_url(url):
    # facebook_scraper calls this with each pagination URL it requests.
    # Save that URL (not a post URL): it is what start_url expects on resume.
    save_url(url)
    if post_list:
        print(f"{len(post_list)}: {post_list[-1]['time']}: {url}")


try:
    for post in get_posts('anhdadenchuanmen',
                          # The option key is "comments", not "comment"
                          options={"comments": True, "reactions": True, "allow_extra_requests": True},
                          cookies=COOKIES_FILE_PATH,
                          start_url=read_resume_url(),  # None starts from the first page
                          request_url_callback=handle_pagination_url,
                          pages=PAGE_NUMBER):
        print(post)
        post_list.append(post)

        # Wait a random 5 to 15 seconds between posts to reduce the risk of a ban
        time.sleep(randint(5, 15))

except exceptions.TemporarilyBanned:
    print("Temporarily Banned")

except Exception as e:
    print(e)


In [None]:
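# Build a DataFrame from the scraped posts and save it as CSV.
# post_list is empty when the crawl above returned nothing (for example after
# a temporary ban); indexing post_list[0] then raises
# "IndexError: list index out of range", so check before converting.
import os

if post_list:
    os.makedirs('Data', exist_ok=True)  # to_csv fails if the folder is missing
    post_df_full = pd.DataFrame(post_list)
    post_df_full.to_csv('Data/ddd.csv', index=False)
else:
    print("post_list is empty; nothing to save")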

In [None]:
post_df_full

Unnamed: 0,post_id,text,post_text,shared_text,original_text,time,timestamp,image,image_lowquality,images,...,reactions,reaction_count,with,page_id,sharers,image_id,image_ids,was_live,header,fetched_time
0,896338218517838,Sever bị lag\n#anhdaden,Sever bị lag\n#anhdaden,,,2023-11-22 21:00:39,1700661639,,https://scontent.fhan14-4.fna.fbcdn.net/v/t15....,[],...,"{'thích': 116, 'haha': 186, 'buồn': 2}",304,"[{'name': 'Bất ngờ ở quanh ta', 'link': '/watc...",2035749833398248,,,[],False,Anh Da Đen đã đăng một video vào danh sách phá...,2023-11-22 21:20:47.949534
1,894292592055734,Cách này hay nè\n#anhdaden,Cách này hay nè\n#anhdaden,,,2023-11-22 20:30:37,1700659837,https://scontent.fhan14-4.fna.fbcdn.net/v/t39....,https://scontent.fhan14-4.fna.fbcdn.net/v/t39....,[https://scontent.fhan14-4.fna.fbcdn.net/v/t39...,...,"{'thích': 102, 'yêu thích': 2, 'haha': 88, 'wo...",195,,2035749833398248,,894292568722403,[894292568722403],False,,2023-11-22 21:20:55.555583
2,896394561845537,Mới đầu tháng cháy ví cuối tháng lại cháy ví t...,Mới đầu tháng cháy ví cuối tháng lại cháy ví t...,,,2023-11-22 20:00:01,1700658001,,https://scontent.fhan14-3.fna.fbcdn.net/v/t39....,[],...,{'thích': 2959},2959,,2035749833398248,,896394458512214,[896394458512214],False,,2023-11-22 21:21:03.865703
3,896334455184881,Hảo xử lý\n#anhdaden,Hảo xử lý\n#anhdaden,,,2023-11-22 19:00:09,1700654409,,https://scontent.fhan14-1.fna.fbcdn.net/v/t15....,[],...,"{'thích': 522, 'yêu thích': 1, 'haha': 630, 'w...",1163,"[{'name': 'Bất ngờ ở quanh ta', 'link': '/watc...",2035749833398248,,,[],False,Anh Da Đen đã đăng một video vào danh sách phá...,2023-11-22 21:21:08.523448
4,896323245186002,Không một động tác thừa\n#anhdaden,Không một động tác thừa\n#anhdaden,,,2023-11-22 16:00:06,1700643606,,https://scontent.fhan14-1.fna.fbcdn.net/v/t15....,[],...,"{'thích': 559, 'yêu thích': 1, 'haha': 783, 'w...",1357,,2035749833398248,,,[],False,,2023-11-22 21:21:39.453480
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,886927612792232,Ảo ma\n#anhdaden,Ảo ma\n#anhdaden,,,2023-11-04 16:00:28,1699088428,,https://scontent.fhan14-3.fna.fbcdn.net/v/t15....,[],...,,0,,2035749833398248,,,[],False,,NaT
97,885836966234630,Chắc không phải anh rồi\n#anhdaden,Chắc không phải anh rồi\n#anhdaden,,,2023-11-04 15:00:04,1699084804,https://m.facebook.com/photo/view_full_size/?f...,https://scontent.fhan14-4.fna.fbcdn.net/v/t39....,[https://m.facebook.com/photo/view_full_size/?...,...,,0,,2035749833398248,,885836729567987,[885836729567987],False,,NaT
98,886921486126178,Toi luôn\n#anhdaden,Toi luôn\n#anhdaden,,,2023-11-04 11:00:03,1699070403,,https://scontent.fhan14-3.fna.fbcdn.net/v/t15....,[],...,,0,,2035749833398248,,,[],False,,NaT
99,886907676127559,Chia tay luôn\n#anhdaden,Chia tay luôn\n#anhdaden,,,2023-11-03 21:00:48,1699020048,,https://scontent.fhan14-4.fna.fbcdn.net/v/t15....,[],...,,0,,2035749833398248,,,[],False,,NaT
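
The `Unnamed: 0` column suggests that the table above was read back from a CSV file. If so, dictionary-valued columns such as `reactions` come back as strings rather than dicts. The sketch below shows one way to parse such a cell with the standard library; `parse_reactions` is a hypothetical helper, and the sample value is copied from row 0 of the table.

```python
import ast


def parse_reactions(cell):
    """Turn a stringified dict such as "{'haha': 186}" into a real dict.

    Empty or missing cells (for example NaN floats from pandas) yield {}.
    """
    if not isinstance(cell, str) or not cell.strip():
        return {}
    value = ast.literal_eval(cell)  # safely evaluates Python literals only
    return value if isinstance(value, dict) else {}


raw = "{'thích': 116, 'haha': 186, 'buồn': 2}"
reactions = parse_reactions(raw)
total = sum(reactions.values())
print(reactions, total)
```

For this row the summed reactions equal the `reaction_count` value of 304 shown in the table, which is a quick consistency check on the scraped data.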
