get_posts returns only 20 posts when logged in #532

Open
BodomBeach opened this issue Oct 28, 2021 · 14 comments
Labels
facebook bug: A bug in Facebook itself. Not a lot we can do about it.

Comments

@BodomBeach

BodomBeach commented Oct 28, 2021

I am trying to scrape group=120778514747417, which is a public group.

print(len(list(get_posts(group=120778514747417))))

If I don't run set_cookies first, the scraper works fine, but I get banned pretty quickly, probably because I am not logged in.
If I run set_cookies first with my account, which is a member of this group, the scraper returns only 20 posts and stops there.
The logger says the following at the end:
No raw posts (<article> elements) were found in this page.
Page parser did not find next page URL
20

How can I scrape more posts while logged in?
If that is not possible, what is the recommended throttling for scraping without cookies?
Thanks and great work btw!
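
On the throttling question, one possible approach is to space out how fast posts are pulled from the generator. The sketch below is a generic helper, not part of facebook_scraper; the name `throttled` and the 5 to 15 second delay range are assumptions chosen for illustration, not a recommendation from the maintainers.

```python
import random
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttled(
    items: Iterable[T],
    min_delay: float = 5.0,   # arbitrary placeholder, tune for your own use
    max_delay: float = 15.0,  # arbitrary placeholder, tune for your own use
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items one at a time, pausing a random interval before
    asking the underlying iterable for the next one."""
    for item in items:
        yield item
        # Runs when the consumer requests the next item, i.e. before the
        # source (and any request it triggers) is advanced again.
        sleep(random.uniform(min_delay, max_delay))


# Hypothetical usage with the call from this issue:
#   from facebook_scraper import get_posts
#   for post in throttled(get_posts(group=120778514747417)):
#       print(post["post_id"])
```

Because the pause sits between pulls from a lazy generator, it also delays any page request that the next pull would trigger. Whether a given delay is enough to avoid a ban is not something this thread establishes.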

@neon-ninja
Collaborator

Try loading https://m.facebook.com/groups/120778514747417/ in your browser and scrolling down. Posts don't load there either. This looks like a bug in Facebook.

neon-ninja added the "facebook bug" label (A bug in Facebook itself. Not a lot we can do about it.) on Oct 28, 2021
@neon-ninja
Collaborator

Similar issue to #453

@BodomBeach
Author

BodomBeach commented Oct 28, 2021

You are right. It seems to work fine on the non-mobile domain, https://www.facebook.com/groups/120778514747417.
Strangely, though, when I remove my cookie, the scraper finds more pages...
Is there an argument to override the scraper's BASE_URL?
Is there a reason you decided to scrape the mobile version by default?
Anyway, thank you

@neon-ninja
Collaborator

neon-ninja commented Oct 28, 2021

The scraper doesn't support the desktop version of Facebook. The mobile version is a lot easier to work with programmatically

@BodomBeach
Author

Hi @neon-ninja,
Any update on this issue? It is making group scraping impossible.

I am now considering reverse-engineering the www version of Facebook to scrape groups.
Would you be interested in my results if I manage to achieve something?
Also, have you already experimented with the www version? Maybe you could give me a head start.

@neon-ninja
Collaborator

neon-ninja commented Nov 14, 2021

Hi @BodomBeach,
No update, this scraper still doesn't support the desktop version of Facebook. Yes, I would accept a pull request implementing this feature.

A bit - I think it would be best to use selenium, chromium/chrome headless, webdriver-manager (https://pypi.org/project/webdriver-manager/) and target the accessibility selectors (aria-*) in Facebook desktop. The accessibility tab in Chrome dev tools would be useful for figuring out selectors. Essentially you'd be building an automated screen reader. Something like this:

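#!/usr/bin/env python3
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

chrome_options = Options()
chrome_options.add_argument("--headless")
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)

driver.get("https://www.facebook.com/groups/120778514747417")
# Scroll to the bottom once to trigger lazy loading, then give the page time to render
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(1)
# Each post is exposed to assistive technology as an element with role="article"
posts = driver.find_elements(By.CSS_SELECTOR, "div[role='article']")
for post in posts:
    labelled_by = post.get_attribute("aria-labelledby")
    print(labelled_by)
    if labelled_by:
        try:
            print(driver.find_element(By.ID, labelled_by).text)
        except NoSuchElementException:
            pass
    # aria-describedby is a space-separated list of element ids and may be absent
    desc_by = (post.get_attribute("aria-describedby") or "").split()
    for desc_elem in desc_by:
        print(desc_elem)
        try:
            label = driver.find_element(By.ID, desc_elem).text
            if label:
                print(label.strip())
        except NoSuchElementException:
            pass
driver.quit()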

@chribell

chribell commented Nov 19, 2021

It seems that Facebook has changed the next link format. By adding a new regex in the PageParser class, I've managed to fetch the next pages:

class PageParser:
    """Class for Parsing a single page on a Page"""
    ....
    cursor_regex_5 = re.compile(r'href:"/groups/(.*?)multi_permalinks"')
    ....

    def get_next_page(self) -> Optional[URL]:
        ....
        match = self.cursor_regex_5.search(self.cursor_blob)
        if match:
            return match.groups()[0]
        return None

Maybe giving the regex as an argument is a more robust solution (as long as fb gives the complete href in the page and doesn't suddenly construct it on the fly)?

@neon-ninja
Collaborator

Sounds like a good fix, please submit a pull request

@BodomBeach
Author

It does indeed look like the href for the next page. However, visiting this next-page URL loads the group page with zero articles in it. I tried this on several groups.
Have you managed to load more results, @chribell?

@chribell

Yes, I can fetch more results until I get temp banned. In my scenario, the logged in user is the admin of a private group.

@neon-ninja
Collaborator

Oh, I've just realised - this kind of group pagination URL is already handled by this regex: href[=:]"(\/groups\/[^"]+bac=[^"]+)". All group pagination should be handled by the GroupPageParser class. The fact that you've put your regex in PageParser suggests you're not calling get_posts correctly, @chribell. For a group, call get_posts like this: get_posts(group="group_name")
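
To see how the regex quoted above behaves, here is a small sketch that runs it against two page fragments. The fragments and the `bac=ABC123` value are invented for illustration and are not taken from a real Facebook response; the helper name `find_next_page` is also hypothetical.

```python
import re

# Pattern quoted in the comment above.
group_cursor_regex = re.compile(r'href[=:]"(\/groups\/[^"]+bac=[^"]+)"')

# Hypothetical fragments: one HTML attribute form, one inline-script form.
html_fragment = '<a href="/groups/120778514747417?bac=ABC123&multi_permalinks">See more</a>'
js_fragment = 'href:"/groups/120778514747417?bac=ABC123&multi_permalinks"'


def find_next_page(blob):
    """Return the captured next-page path, or None if nothing matches."""
    match = group_cursor_regex.search(blob)
    return match.group(1) if match else None


print(find_next_page(html_fragment))
print(find_next_page(js_fragment))
print(find_next_page('href="/someone/posts"'))  # no /groups/ path, no bac= parameter
```

The `[=:]` character class is what lets one pattern cover both the `href="..."` and `href:"..."` forms, and the capture group returns the whole path starting at `/groups/`, whereas the regex proposed earlier in the thread captures only the part after `/groups/`.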

@chribell

You are right, @neon-ninja. I didn't pass the group id as a named argument. If I do, it works as intended. Sorry for the confusion, my bad.
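
The difference between the two call styles can be illustrated with a toy stand-in. This is not the library's code: `choose_parser` is a hypothetical function, and the assumption that the first positional parameter is an account name (so a bare id never reaches `group`) is inferred from this thread rather than from the source.

```python
def choose_parser(account=None, group=None):
    """Hypothetical stand-in for the dispatch discussed above.

    Assumption: the first positional parameter is the account, so a
    group id passed positionally is never seen as a group.
    """
    if group is not None:
        return "GroupPageParser"
    return "PageParser"


# Positional: the id lands in `account`, so group pagination is not used.
print(choose_parser("120778514747417"))

# Keyword: the id lands in `group`, so group pagination applies.
print(choose_parser(group="120778514747417"))
```

With the real library the corresponding call is the one given earlier in the thread, get_posts(group="group_name").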

@pietro-ayy

Hi, has anyone been able to fix the issue? Thanks

@neon-ninja
Collaborator

This is a bug in Facebook itself, there's not a lot we can do about it


4 participants