About temporarily block #409
It's still possible to get temp banned when logged in. In some cases, the scraper makes a request which is rejected due to a temp ban (for example, getting replies to a comment). But the rest of the data can still be returned despite this, so the exception is caught and handled within the extract function. The exception is only raised to your own code if you were temp banned while paginating, or if it bubbles out of an individual extract function. See #385 for some more discussion around what should be raised vs caught.
Could exceptions.TemporarilyBanned be raised in the extract_comments_full function instead, so my main program can catch it? In my case I need to pause even if I cannot scrape the replies.
I also wonder: are we making an independent request for each reply to a comment? Should we introduce some time delay between the requests to reduce the chance of blocking?
Sure, 17e8332 should make the scraper raise TemporarilyBanned if it occurs while collecting replies or reactions to comments. Yes, each set of replies or reactions to a comment involves another request. Could do - perhaps this function could be converted into a generator that yields comments, similar to how get_posts is a generator that yields posts. What do you think?
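For readers unfamiliar with generators, here is a minimal, self-contained sketch of the idea proposed above. `fetch_comment_pages` is a hypothetical stand-in for the scraper's internal pagination, not part of facebook-scraper:

```python
import time

def fetch_comment_pages():
    # Hypothetical stand-in for the scraper's paginated HTTP requests.
    yield ["comment 1", "comment 2"]
    yield ["comment 3"]

def iter_comments(delay=0):
    """Yield comments lazily: the next page is only fetched when needed."""
    for page in fetch_comment_pages():
        for comment in page:
            yield comment
        time.sleep(delay)  # the caller controls pacing between page fetches

comments = iter_comments()
print(next(comments))  # prints "comment 1" without fetching further pages
```

Because the caller pulls comments one at a time, it can sleep between items (or pages) and stop early, instead of the scraper fetching everything up front.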
I only find it strange that when I am temporarily blocked, I cannot request replies but I can still request story.php - so what makes the difference? What I've observed is that making many requests for replies to comments gets temporarily blocked easily, and it is nearly a precursor to the whole program being blocked. I'm not an expert in Python and I still don't know much about generators, but if that may improve the situation, I think it is worth a try and we can test it.
I think it's possible that Facebook has different quotas for different request types, and some requests are "more suspicious" than others. So it seems plausible that you might be banned from getting more replies but still able to get posts.
3026d39 should make it possible to get the scraper to give you a generator of comments, if you set the comments option to "generator". This should give you more control over the rate of comment extraction. Sample usage:

```python
import time

from facebook_scraper import get_posts, set_cookies

set_cookies("cookies.txt")
post = next(
    get_posts(
        post_urls=[2257188721032235],
        options={"comments": "generator"},
    )
)
comments = post["comments_full"]
for comment in comments:
    print(comment)
    time.sleep(5)
```

It also makes it possible to resume from a given comment pagination link. Sample usage:

```python
import time

from facebook_scraper import exceptions, get_posts, set_cookies

set_cookies("cookies.txt")
results = []
start_url = None

def handle_pagination_url(url):
    global start_url
    start_url = url

while True:
    try:
        post = next(
            get_posts(
                post_urls=[2257188721032235],
                options={
                    "comments": "generator",
                    "comment_start_url": start_url,
                    "comment_request_url_callback": handle_pagination_url,
                },
            )
        )
        comments = post["comments_full"]
        for comment in comments:
            print(comment)
            results.append(comment)
        print("All done")
        break
    except exceptions.TemporarilyBanned:
        print("Temporarily banned, sleeping for 10m")
        time.sleep(600)
```
Thanks for your great work. Preliminary testing: with a 5s break between each comment, the program takes a very long time, so I quit it after around 4 hours. With a 1-2s break between each comment, plus an additional 4-6s break if the last comment has replies, the program is still blocked by Facebook after scraping around 10 posts (around 1 hour).
Sure, 5143311 should make the replies into a generator too. Sample usage:

```python
import time
from pprint import pprint

from facebook_scraper import get_posts, set_cookies
from tqdm import tqdm

set_cookies("cookies.txt")
post = next(
    get_posts(
        post_urls=[2257188721032235],
        options={"comments": "generator"},
    )
)
comments = list(post["comments_full"])
print(f"Got {len(comments)} comments. Fetching replies...")
for comment in tqdm(comments):
    comment["replies"] = list(comment["replies"])
    pprint(comment)
    if comment["replies"]:
        print(f"Found {len(comment['replies'])} replies, sleeping 20s")
        time.sleep(20)
```
I just started playing around with this module today and have already received the TemporarilyBanned exception. I was wondering if there is a best practice to avoid this exception? For example, is it more likely to get banned with credentials or with cookies? Or what would be a recommended number of requests per hour? Thanks in advance!
Best practice is to reduce your requests per second ;). Credentials vs cookies should be functionally the same. Try to stay under one request per second. What were you doing? Some requests (getting friend lists, for example) are more sensitive to temp bans than others.
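A minimal pacing helper in that spirit - keeping the script under one request per second - might look like this. It is a hypothetical sketch, not part of facebook-scraper:

```python
import time

class RateLimiter:
    """Ensure at least `min_interval` seconds between consecutive calls."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep only for the remainder of the interval, if any.
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

limiter = RateLimiter(min_interval=1.0)
# Call limiter.wait() before each action that triggers a request,
# e.g. before pulling the next item from a comments generator.
```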
From my experience, use cookies instead of credentials. Starting from July, I experienced account locks (not just temporary bans) after logging in to test my programs too many times.
Yes, I figured the requests per second should be low, but in my run I only ran it once every 5 minutes, which is why I'm surprised I was already temporarily blocked. I did use credentials for it though. This is what I'm doing:

```python
for post in get_posts(
    page_name,
    pages=5,
    options={"comments": True, "posts_per_page": 25},
    credentials=("username", "password"),
):
    # save each post in a dict
    for comment in post["comments_full"]:
        # save each comment in a dict with post as parent
        ...
# analyse data in DataFrame
```

Basically, I only need one request using …
How many comments and replies does this page usually attract? What's the value of page_name?
I was testing on Clash of Clans, so quite a few comments.
I get temporarily blocked even though I sleep for 20-30s after each reply, after running for around 30 minutes. It seems it is very difficult to scrape the replies.
Some more testing indicated that it is necessary to introduce a 3-5s delay after each comment and reply; then the program can continue to run for around 2 hours. Even though the blocking is caused by scraping replies, it is not enough to only add a delay after each reply. It is necessary to add a delay after each comment as well, and a 1-2s delay is insufficient in my experience. I will see if blocking occurs tomorrow using the current settings.
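One way to implement that kind of pause is with a randomized delay, so the timing is not perfectly regular. `random_delay` here is a hypothetical helper, not part of facebook-scraper:

```python
import random
import time

def random_delay(low=3.0, high=5.0):
    """Sleep for a random duration between `low` and `high` seconds."""
    time.sleep(random.uniform(low, high))

# e.g. call random_delay() after processing each comment and each reply
```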
@exnerfelix so, with ClashOfClans, 5 pages, 25 posts per page, you're looking at 101 posts and ~69,494 comments. Comments only come in pages of 30, so that's ~2316 requests to get all the comments. Some comments have replies - let's assume roughly 10% have replies. So another ~231 requests to get all the replies.
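The request arithmetic above, spelled out (the figures are the rough estimates from this comment):

```python
total_comments = 69_494   # estimated comments across the 101 posts
comments_per_page = 30    # comments arrive in pages of 30

comment_requests = total_comments // comments_per_page
reply_requests = comment_requests // 10  # matches the ~231 replies estimate

print(comment_requests, reply_requests)  # 2316 231
```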
Using a 3-5s delay after each comment and reply, I can scrape 37 posts and 4,442 comments in around 6 hours without blocking. Without replies, I do not need to introduce the delay and the scraping takes less than 5 minutes. So it seems to be a compromise - whether the replies are worth the extra scraping hours.
@neon-ninja, how can I apply an additional delay between requests when I parse a page from the CLI? I use a command like this: …
Not currently. You would need to write a Python script for that level of customisation.
c61f348 adds a new parameter called "sleep". Sample usage:
I'm experiencing heavy rate limits here as well, while the account still seems to be perfectly fine when used in the browser. I'm using cookies to log in with the latest master and am already getting locked out just doing get_group_info().
There is an error when I try to use the "sleep" option. I think it conflicts with this commit?
Disregard that warning - the sleep is handled outside of get_posts.
See below my method for scraping groups, including comments and replies, using the latest master. Scraping 100 posts including comments and replies takes ~10 minutes with this configuration. After 100 posts I am getting temporarily banned again. Any suggestions? I feel the delays are very high already.

```python
from collections.abc import Iterator
from typing import List, Union

from facebook_scraper.exceptions import TemporarilyBanned

# log, config, random_sleep and format_json are this user's own helpers.

def scrape_group_posts(self, group_ids: List[Union[int, str]]):
    def handle_pagination_url(url):
        nonlocal start_url
        start_url = url

    for k, group_id in enumerate(group_ids, 1):
        group_name = self.group_information[group_id]['name']
        log.info(f"[{k}] Scraping group: {group_name}...")
        start_url = None
        post_counter = 0
        keep_alive = True
        while keep_alive:
            try:
                posts = self.get_group_posts(
                    group=group_id,
                    options={
                        "comments": "generator" if config.SCRAPE_COMMENTS else False,
                        "comment_start_url": start_url,
                        "comment_request_url_callback": handle_pagination_url,
                    }
                )
                while post := next(posts, None):
                    post_counter += 1
                    if post["time"] < config.DATELIMIT:
                        log.info(f"[{group_name}] Reached datelimit: {config.DATELIMIT}")
                        keep_alive = False
                        break
                    log.info(f"[{group_name}] Scraping post {post_counter} from {str(post['time'])[:19]}...")
                    for item in ('post_text', 'shared_text', 'text'):
                        ...  # (text-field processing elided)
                    comments = post['comments_full']
                    # Comments may come back as a plain list instead of a generator
                    if isinstance(comments, Iterator):
                        comment_counter = 0
                        while comment := next(comments, None):
                            comment_counter += 1
                            log.info(f"[{group_name}] Scraping comment {comment_counter} from {str(comment['comment_time'])[:19]} to post {post_counter}...")
                            replies = comment['replies']
                            if isinstance(replies, Iterator):
                                replies_counter = 0
                                while reply := next(replies, None):
                                    replies_counter += 1
                                    log.info(f"[{group_name}] Scraping reply {replies_counter} from {str(reply['comment_time'])[:19]} to comment {comment_counter} of post {post_counter}...")
                                    random_sleep((3, 3.5), (4, 5), reason=f"{group_name}|WAIT_BEFORE_NEXT_REPLY")
                            elif isinstance(replies, list) and replies:
                                log.warning(f"Found non-empty comment replies as list ->\n{format_json(replies)}")
                            random_sleep((3, 3.5), (4, 5), reason=f"{group_name}|WAIT_BEFORE_NEXT_COMMENT")
                    elif isinstance(comments, list) and comments:
                        log.warning(f"Found non-empty comments as list ->\n{format_json(comments)}")
                    random_sleep((3, 3.5), (4, 5), reason=f"{group_name}|WAIT_BEFORE_NEXT_POST")
                # If we reach this point without an exception, the group is done
                keep_alive = False
            except TemporarilyBanned as e:
                random_sleep((580, 590), (600, 610), reason=f"{group_name}|{e.__class__.__name__}")
```
@chwba - you're impersonating a legitimate human Facebook user on a cell phone with this Python module. The more irregular your behavior, the easier it is to determine you're a bot. "Viewing" all comments of 100 posts in ~10 minutes isn't realistic or normal for a human. In my testing, I've been temp banned for 24-48+ hours with repeated bot-like behavior, and that will increase your chance of being temp-banned again from the same source IP or authenticated profile. Worst-case scenario: your account will be temporarily locked or permanently closed. A few things you can try:
Similar to washing your hands, it's important not to cross-contaminate. My $0.02.
@milesgratz thank you for the thorough writeup. Of course I understand how difficult it is to create a request-based app that spoofs all this and gets the desired information - I just wanted to make sure I am not doing anything wrong regarding library usage. For my use case I am scraping a lot of invite-only private groups, so I will be forced to create a scraper that uses browser automation. If you have any general hints I could consider while creating it, they would of course be highly appreciated!
I don't know why I have recently encountered a lot of temporary bans even with cookies.txt, and I'm not sure if security has been heightened. I would like to ask: sometimes, when I'm scraping, say, AndyMurray, I can see "temporarily blocked" in the log but I'm still able to fetch the comments. Is that usual? Why is it in the log? I catch exceptions.TemporarilyBanned in my program, but in such cases it seems my program never reaches the exception handler.
```
2021-07-27 11:34:21,113 Fetching https://m.facebook.com/comment/replies/?ctoken=10157115967571324_10157118433851324&count=1&curr&pc=1&isinline&initcomp&ft_ent_identifier=10157115967571324&eav=AfZ-Bkw7aFm3LhuKR5WEE-Zzf-uMqanA3GO5BYH3XUu4ItOx9WtNo4htTWNmUmVVcv8&av=100070379280090&gfid=AQC--EPBMTYVrx5qz1c&refid=52&__tn__=R
2021-07-27 11:34:21,363 Content not found
2021-07-27 11:34:21,379 Fetching https://m.facebook.com/comment/replies/?ctoken=10157115967571324_10157119529536324&count=4&curr&pc=1&isinline&initcomp&ft_ent_identifier=10157115967571324&eav=AfaQOmbiQgTar3Ebgu3MojjgzoNw8eEL9KqTIY6rLzpQ2PAfQRfohx4C_ixoPXrNvA8&av=100070379280090&gfid=AQDHfGk86CfbHpgjGvA&refid=52&__tn__=R
2021-07-27 11:34:21,628 You're Temporarily Blocked
Scraped post: 10157115967571324
2021-07-27 11:34:21,628 Found new comment 10157116011651324
2021-07-27 11:34:21,644 ...
```