Unable to collect posts beyond a certain number #285

Closed · chenxingyuzealken opened this issue May 24, 2021 · 5 comments

chenxingyuzealken commented May 24, 2021

Hi, I'm wondering how I can solve this issue: I only get about 2,000 posts from the code below. I know there are more posts, but I'm not getting them.

from datetime import datetime
from facebook_scraper import get_posts

opt = {}
opt['daterange'] = False  # set to True if you want to limit your search by startDate and endDate
opt['demo'] = False  # when True, log out-of-range posts instead of stopping
opt['startDate'] = datetime.strptime('01/05/19 00:00:00', '%m/%d/%y %H:%M:%S')  # change the time ranges to what you want
opt['endDate'] = datetime.strptime('12/06/22 00:00:00', '%m/%d/%y %H:%M:%S')  # change the time ranges to what you want

page_name = 'ChannelNewsAsia'
fbcookies = {...}  # cookie details
lst = []

for post in get_posts(page_name, cookies=fbcookies, pages=1000000, options={"allow_extra_requests": False}):
    if opt['daterange']:
        if post['time'] < opt['startDate']:
            print('post is earlier than', opt['startDate'], '- stopping collection')
            if not opt['demo']:
                break
        if post['time'] > opt['endDate']:
            print('post is later than', opt['endDate'], '- stopping collection')
            if not opt['demo']:
                break
    lst.append(post)

It stopped collecting posts in 2020, which seems odd to me.

@neon-ninja (Collaborator)

Try increasing the posts_per_page option
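
For reference, a minimal sketch of passing that option (the page name and cookie file are placeholders; the value 200 matches the test code below):

from facebook_scraper import get_posts

# Request more posts per pagination request, so far fewer requests are
# needed to walk back through the page's history.
for post in get_posts("ChannelNewsAsia", cookies="cookies.txt",
                      options={"posts_per_page": 200}):
    print(post["time"])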


neon-ninja commented May 25, 2021

I added some code to retry pagination requests on error (1cc8064). With that change, and this test code:

import time
from facebook_scraper import get_posts

start = time.time()
posts = []
try:
    for post in get_posts("ChannelNewsAsia", cookies="cookies.txt", pages=200, timeout=60, options={"allow_extra_requests": False, "posts_per_page": 200}):
        posts.append(post)
except:
    print(f"{len(posts)} posts retrieved in {round(time.time() - start)}s. Oldest post: {posts[-1].get('time')}")

I get

14201 posts retrieved in 910s. Oldest post: 2013-12-12 13:05:00


neon-ninja commented May 25, 2021

717d522 might also be useful for resuming from the last cursor that errored out; see #287 (comment) for usage
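
A rough sketch of resuming from a saved cursor, assuming the start_url parameter from 717d522 and a pagination URL copied out of the debug log (the cursor value here is a placeholder):

import logging
from facebook_scraper import get_posts, enable_logging

enable_logging(logging.DEBUG)  # pagination URLs appear in the debug output

cursor = "..."  # placeholder: paste the last pagination URL from the log output here

# Resume pagination from that URL instead of starting over from the page's top.
for post in get_posts("ChannelNewsAsia", cookies="cookies.txt",
                      start_url=cursor,
                      options={"allow_extra_requests": False, "posts_per_page": 200}):
    print(post["time"])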

@chenxingyuzealken (Author)

Thanks! I think the combination of:

from facebook_scraper import *
import pandas as pd
import ast
import time
from datetime import datetime

import requests
import logging

enable_logging(logging.DEBUG)

start = time.time()
posts = []
try:
    for post in get_posts("ChannelNewsAsia", cookies="cookies.txt", pages=200, timeout=60, options={"allow_extra_requests": False, "posts_per_page": 200}):
        posts.append(post)
except:
    print(f"{len(posts)} posts retrieved in {round(time.time() - start)}s. Oldest post: {posts[-1].get('time')}")

and this:

cursor = "some URL from the logging output"

posts = []
try:
    for post in get_posts("ChannelNewsAsia", cookies="cookies.txt", pages=200, timeout=60, start_url=cursor, options={"allow_extra_requests": False, "posts_per_page": 200}):
        posts.append(post)
except:
    print(f"{len(posts)} posts retrieved. Oldest post: {posts[-1].get('time')}")

has helped make the process more robust.

Thanks for the help! Your project is amazing!

@neon-ninja (Collaborator)

#291 might be useful
