get_posts returns only 20 posts when logged in #532
Comments
Try loading https://m.facebook.com/groups/120778514747417/ in your browser, and scroll down. Posts don't load there either. This looks like a bug in Facebook.
Similar issue to #453.
You are right. It seems to be working fine on the non-mobile domain https://www.facebook.com/groups/120778514747417.
The scraper doesn't support the desktop version of Facebook. The mobile version is a lot easier to work with programmatically.
Hi @neon-ninja, I am now considering reverse-engineering the www version of Facebook to scrape groups.
Hi @BodomBeach - a bit. I think it would be best to use Selenium, headless Chromium/Chrome, webdriver-manager (https://pypi.org/project/webdriver-manager/) and target the accessibility selectors (aria-*) in Facebook desktop. The Accessibility tab in Chrome dev tools would be useful for figuring out selectors. Essentially you'd be building an automated screen reader. Something like this:

```python
#!/usr/bin/env python3
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time

# Selenium 3.x-style setup; webdriver-manager fetches a matching chromedriver
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)

driver.get("https://www.facebook.com/groups/120778514747417")

# Scroll to the bottom once to trigger lazy-loading of more posts
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(1)

# Each post is rendered with role="article"; its accessible name and
# description reference other elements via aria-labelledby/aria-describedby
posts = driver.find_elements_by_css_selector("div[role='article']")
for post in posts:
    labelled_by = post.get_attribute("aria-labelledby")
    print(labelled_by)
    label = driver.find_element_by_id(labelled_by).text
    print(label)
    desc_by = post.get_attribute("aria-describedby").split()
    for desc_elem in desc_by:
        print(desc_elem)
        try:
            label = driver.find_element_by_id(desc_elem).text
            if label:
                print(label.strip())
        except NoSuchElementException:
            pass
```
It seems that Facebook has changed the next-link format. By adding a new regex to the PageParser class, I've managed to fetch the next pages:

```python
class PageParser:
    """Class for Parsing a single page on a Page"""
    ...
    cursor_regex_5 = re.compile(r'href:"/groups/(.*?)multi_permalinks"')
    ...
    def get_next_page(self) -> Optional[URL]:
        ...
        match = self.cursor_regex_5.search(self.cursor_blob)
        if match:
            return match.groups()[0]
        return None
```

Maybe passing the regex in as an argument would be a more robust solution (as long as fb gives the complete href in the page and doesn't suddenly construct it on the fly)?
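A minimal sketch of that regex-as-argument idea, using a simplified stand-in for PageParser (the class, its constructor, and the `extra_cursor_regex` parameter are hypothetical, not the library's actual API):

```python
import re
from typing import Optional

class ConfigurablePageParser:
    """Simplified parser that tries a caller-supplied cursor regex first."""

    def __init__(self, cursor_blob: str, extra_cursor_regex: Optional[re.Pattern] = None):
        self.cursor_blob = cursor_blob
        self.extra_cursor_regex = extra_cursor_regex  # hypothetical override hook

    def get_next_page(self) -> Optional[str]:
        # A caller-supplied regex lets new Facebook markup be handled
        # without waiting for a library patch
        if self.extra_cursor_regex:
            match = self.extra_cursor_regex.search(self.cursor_blob)
            if match:
                return match.groups()[0]
        return None

# Usage: supply the regex matching the current group pagination markup
parser = ConfigurablePageParser(
    cursor_blob='...href:"/groups/120778514747417/?bac=abc&multi_permalinks"...',
    extra_cursor_regex=re.compile(r'href:"/groups/(.*?)multi_permalinks"'),
)
print(parser.get_next_page())  # -> 120778514747417/?bac=abc&
```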
Sounds like a good fix, please submit a pull request.
It does indeed look like the href for the next page. However, visiting this next-page URL loads the group page with zero articles in it. Tried on several groups.
Yes, I can fetch more results until I get temporarily banned. In my scenario, the logged-in user is the admin of a private group.
Oh, I've just realised - this kind of group pagination URL is already handled by one of the existing regexes.
You are right, @neon-ninja. I didn't pass the group id as a named argument. If I do, it works as intended. Sorry for the confusion, my bad.
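For anyone hitting the same thing, the difference is just in how `get_posts` is called - a minimal sketch, where the cookies file path is a placeholder for your own exported cookies:

```python
from facebook_scraper import get_posts

# Passed positionally, the first argument is treated as an account/page name:
# get_posts("120778514747417")  # not what we want for a group

# Passed as the named `group` argument, the group scraper is used instead
for post in get_posts(group=120778514747417, pages=2, cookies="cookies.txt"):
    print(post["post_id"], (post.get("text") or "")[:80])
```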
Hi, has anyone been able to fix the issue? Thanks |
This is a bug in Facebook itself, there's not a lot we can do about it |
I am trying to scrape group=120778514747417, which is a public group.

```python
print(len(list(get_posts(group=120778514747417))))
```

If I don't run `set_cookies` first, the scraper works fine but I get banned pretty quickly, probably because I am not logged in. If I run `set_cookies` first with my account, which is part of this group, the scraper only returns 20 posts and stops there. The logger says the following at the end:

```
No raw posts (<article> elements) were found in this page.
Page parser did not find next page URL
20
```

How can I scrape more posts while logged in? If that's not possible, what is the recommended throttling to scrape without cookies?

Thanks, and great work btw!
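For the throttling part, one simple client-side approach is to sleep while consuming the `get_posts` generator, which spaces out the underlying page requests. A rough sketch; the 20-second pause is an arbitrary guess, not a value recommended by the library:

```python
import time
from facebook_scraper import get_posts

for post in get_posts(group=120778514747417, pages=5):
    print(post["post_id"])
    time.sleep(20)  # arbitrary delay; page requests happen lazily as you iterate
```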