get_posts returns only 20 posts when logged in #532

BodomBeach · 2021-10-28T19:28:12Z

I am trying to scrape group=120778514747417, which is a public group.

print(len(list(get_posts(group=120778514747417))))

If I don't run set_cookies first, the scraper works fine but I get banned pretty quickly, probably because I am not logged in.
If I run set_cookies first with my account, which is a member of this group, the scraper returns only 20 posts and stops there.
Logger says the following at the end:
No raw posts (<article> elements) were found in this page.
Page parser did not find next page URL
20

How can I scrape more posts while logged in?
If that is not possible, what throttling is recommended for scraping without cookies?
Thanks and great work btw!
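
A minimal sketch of one way to throttle, assuming the post generator fetches pages lazily as it is consumed. The helper name `throttled` and the delay value are hypothetical; the thread does not recommend a specific delay, and the data here is a stand-in rather than real scraper output.

```python
import time
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttled(items: Iterable[T], delay_seconds: float) -> Iterator[T]:
    """Yield items one at a time, sleeping after each one is handed out."""
    for item in items:
        yield item
        time.sleep(delay_seconds)


# Stand-in data; in practice the iterable would be the scraper's generator.
fake_posts = [{"post_id": str(i)} for i in range(3)]
collected = [post["post_id"] for post in throttled(fake_posts, delay_seconds=0.01)]
print(collected)
```

The same wrapper could be placed around get_posts(group=120778514747417), with a delay chosen by experiment.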

neon-ninja · 2021-10-28T20:33:11Z

Try loading https://m.facebook.com/groups/120778514747417/ in your browser and scrolling down. Posts don't load there either. This looks like a bug in Facebook.

neon-ninja · 2021-10-28T20:40:16Z

Similar issue to #453

BodomBeach · 2021-10-28T20:45:19Z

You are right. It seems to be working fine on the non-mobile domain https://www.facebook.com/groups/120778514747417.
Strangely, though, when I remove my cookie, the scraper finds more pages...
Is there an argument to override the scraper's BASE_URL?
Is there a reason you decided to scrape the mobile version by default?
Anyway, thank you.

neon-ninja · 2021-10-28T20:49:52Z

The scraper doesn't support the desktop version of Facebook. The mobile version is a lot easier to work with programmatically

BodomBeach · 2021-11-14T11:49:09Z

Hi @neon-ninja,
Any update on this issue? It is making group scraping impossible.

I am now considering reverse-engineering the www version of Facebook to scrape groups.
Would you be interested in my results if I manage to achieve something?
Also, have you already fiddled around with the www version? Maybe you could give me a head start.

neon-ninja · 2021-11-14T20:31:06Z

Hi @BodomBeach,
No update, this scraper still doesn't support the desktop version of Facebook. Yes, I would accept a pull request implementing this feature.

A bit - I think it would be best to use selenium, chromium/chrome headless, webdriver-manager (https://pypi.org/project/webdriver-manager/) and target the accessibility selectors (aria-*) in Facebook desktop. The accessibility tab in Chrome dev tools would be useful for figuring out selectors. Essentially you'd be building an automated screen reader. Something like this:

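#!/usr/bin/env python3
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Run Chrome headless; webdriver-manager downloads a matching chromedriver.
# This uses the Selenium 4 API (Service, find_element(By...)).
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)

driver.get("https://www.facebook.com/groups/120778514747417")
# Scroll to the bottom once so more posts load, then give the page time to render
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(1)
posts = driver.find_elements(By.CSS_SELECTOR, "div[role='article']")
for post in posts:
    # aria-labelledby holds the id of the element that labels the post
    labelled_by = post.get_attribute("aria-labelledby")
    print(labelled_by)
    if labelled_by:
        try:
            print(driver.find_element(By.ID, labelled_by).text)
        except NoSuchElementException:
            pass
    # aria-describedby holds a space-separated list of ids (it may be missing)
    desc_by = (post.get_attribute("aria-describedby") or "").split()
    for desc_elem in desc_by:
        print(desc_elem)
        try:
            label = driver.find_element(By.ID, desc_elem).text
            if label:
                print(label.strip())
        except NoSuchElementException:
            pass

driver.quit()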

chribell · 2021-11-19T08:54:49Z

It seems that Facebook has changed the next-link format. By adding a new regex to the PageParser class, I've managed to fetch the next pages:

class PageParser:
    """Class for Parsing a single page on a Page"""
    ....
    cursor_regex_5 = re.compile(r'href:"/groups/(.*?)multi_permalinks"')
    ....

    def get_next_page(self) -> Optional[URL]:
        ....
        match = self.cursor_regex_5.search(self.cursor_blob)
        if match:
            return match.groups()[0]
        return None

Maybe giving the regex as an argument is a more robust solution (as long as fb gives the complete href in the page and doesn't suddenly construct it on the fly)?

neon-ninja · 2021-11-20T01:44:41Z

Sounds like a good fix, please submit a pull request

BodomBeach · 2021-11-21T10:23:29Z

It does indeed look like the href for the next page. However, visiting this next-page URL loads the group page with zero articles in it. I tried this on several groups.
Have you managed to load more results, @chribell?

chribell · 2021-11-21T17:31:58Z

Yes, I can fetch more results until I get temp banned. In my scenario, the logged in user is the admin of a private group.

neon-ninja · 2021-11-21T20:49:01Z

Oh, I've just realised - this kind of group pagination URL is already handled by this regex - href[=:]"(\/groups\/[^"]+bac=[^"]+)". All group pagination should be handled by the GroupPageParser class. The fact that you've put it in PageParser suggests you're not calling get_posts correctly, @chribell. For a group, call get_posts like this: get_posts(group="group_name")
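
To illustrate how the pattern quoted above behaves, here is a small sketch that applies it to a made-up string. The sample text and the cursor value `ABC123` are hypothetical; real Facebook markup may look different.

```python
import re

# Pattern quoted in the comment above
group_cursor_regex = re.compile(r'href[=:]"(\/groups\/[^"]+bac=[^"]+)"')

# Hypothetical page fragment, invented for this illustration only
sample_blob = 'href:"/groups/120778514747417?bac=ABC123&multi_permalinks"'

match = group_cursor_regex.search(sample_blob)
next_url = match.group(1) if match else None
print(next_url)

# A link without a "bac=" cursor is not treated as a next-page URL
no_cursor = group_cursor_regex.search('href:"/groups/120778514747417/about"')
print(no_cursor)
```

The capture group keeps the leading /groups/ path, so the match is a relative URL rather than only the cursor value.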

chribell · 2021-11-21T21:19:48Z

You are right, @neon-ninja. I didn't pass the group id as a named argument. If I do, it works as intended. Sorry for the confusion, my bad.
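
The difference between the two call styles can be sketched with a stand-in function. `get_posts_stub` is hypothetical and is not the library's code. It assumes, based on the behaviour described in this thread, that the first positional parameter is an account/page name and that `group` must be passed by name to select group parsing.

```python
from typing import Optional, Union


def get_posts_stub(account: Optional[str] = None,
                   group: Union[str, int, None] = None) -> str:
    """Hypothetical stand-in: report which parser a call would select."""
    if group is not None:
        return "GroupPageParser"
    if account is not None:
        return "PageParser"
    raise ValueError("pass either account or group")


# Positional call: the id is bound to `account`, so page parsing is selected
positional = get_posts_stub("120778514747417")
# Named call: the id is bound to `group`, so group parsing is selected
named = get_posts_stub(group="120778514747417")
print(positional, named)
```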

pietro-ayy · 2022-01-06T10:24:49Z

Hi, has anyone been able to fix the issue? Thanks

neon-ninja · 2022-01-06T11:26:49Z

This is a bug in Facebook itself, there's not a lot we can do about it

neon-ninja added the "facebook bug" label (A bug in Facebook itself. Not a lot we can do about it.) on Oct 28, 2021

neon-ninja mentioned this issue on Jan 17, 2022: Problem with groups having top post #635 (Open)

BmoXD mentioned this issue on Apr 10, 2022: get_posts returns only 20 or 19 posts from page #730 (Closed)
