About temporarily block #409

Closed
hc2021hc opened this issue Jul 27, 2021 · 29 comments

Comments

@hc2021hc

I don't know why, but recently I've encountered a lot of temporary bans even with cookies.txt, and I'm not sure if security has been heightened. I would like to ask: sometimes, when I'm scraping, say, AndyMurray, I can see "temporarily blocked" in the log but I'm still able to fetch the comments. Is that usual? Why is it in the log? I catch exceptions.TemporarilyBanned in my program, but in such cases it seems my program never reaches the exception handler.

2021-07-27 11:34:21,113
Fetching https://m.facebook.com/comment/replies/?ctoken=10157115967571324_10157118433851324&count=1&curr&pc=1&isinline&initcomp&ft_ent_identifier=10157115967571324&eav=AfZ-Bkw7aFm3LhuKR5WEE-Zzf-uMqanA3GO5BYH3XUu4ItOx9WtNo4htTWNmUmVVcv8&av=100070379280090&gfid=AQC-\-\EPBMTYVrx5qz1c&refid=52&__tn__=R
2021-07-27 11:34:21,363
Content not found
2021-07-27 11:34:21,379
Fetching https://m.facebook.com/comment/replies/?ctoken=10157115967571324_10157119529536324&count=4&curr&pc=1&isinline&initcomp&ft_ent_identifier=10157115967571324&eav=AfaQOmbiQgTar3Ebgu3MojjgzoNw8eEL9KqTIY6rLzpQ2PAfQRfohx4C_ixoPXrNvA8&av=100070379280090&gfid=AQDHfGk86CfbHpgjGvA&refid=52&__tn__=R
2021-07-27 11:34:21,628
You're Temporarily Blocked
Scraped post: 10157115967571324
2021-07-27 11:34:21,628
Found new comment 10157116011651324
2021-07-27 11:34:21,644
...

@neon-ninja
Collaborator

neon-ninja commented Jul 27, 2021

It's still possible to get temp banned when logged in. In some cases, the scraper makes a request, which is rejected due to temp ban (for example, getting replies to a comment). But the rest of the data can still be returned despite this, so the exception is caught and handled within the extract function. It's only if you were temp banned while paginating, or if the exception bubbles out of the individual extract function that the exception would be raised to your own code. See #385 for some more discussion around what should be raised vs caught.
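For example, a rough sketch of the distinction (page name, page count, and back-off time are just placeholders):

import time

from facebook_scraper import exceptions, get_posts, set_cookies

set_cookies("cookies.txt")

try:
    for post in get_posts("AndyMurray", pages=5, options={"comments": True}):
        print(post["post_id"])
except exceptions.TemporarilyBanned:
    # Only reached if the ban occurs while paginating posts, or bubbles out of
    # an extract function; bans handled inside extraction never get here.
    print("Temporarily banned while paginating, backing off")
    time.sleep(600)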

@hc2021hc
Author

hc2021hc commented Jul 27, 2021

Could exceptions.TemporarilyBanned be raised from the extract_comments_full function instead, so my main program can catch it? In my case I need to pause even if I cannot scrape the replies.

@hc2021hc
Author

I also wonder: are we making an independent request for each comment's replies? Should we introduce some delay between requests to reduce the chance of blocking?

@neon-ninja
Collaborator

Sure, 17e8332 should make the scraper raise TemporarilyBanned if it occurs while collecting replies, or reactions to comments. Yes, each set of replies or reactions to a comment involves another request. Could do - perhaps this function could be converted into a generator that yields comments, similar to how get_posts is a generator that yields posts. What do you think?

@hc2021hc
Author

It just seems strange to me that when I am temporarily blocked I cannot request replies but can still request story.php, so what makes the difference? What I observed is that too many requests for replies to comments get temporarily blocked easily, and that is almost a precursor to the whole program being blocked. I'm not an expert in Python and I still don't know much about generators, but if that might improve the situation, I think it is worth a try and we can test it.

@neon-ninja
Collaborator

I think it's possible that Facebook have different quotas for different request types, and some requests are "more suspicious" than others. So it seems plausible that you might be banned from getting more replies but still able to get posts.

@neon-ninja
Collaborator

neon-ninja commented Aug 2, 2021

3026d39 should make it possible to get the scraper to give you a generator of comments, if you set the comments option to "generator". This should give you more control of the rate of comment extraction. Sample usage:

import time

from facebook_scraper import get_posts, set_cookies

set_cookies("cookies.txt")

post = next(
    get_posts(
        post_urls=[2257188721032235],
        options={"comments": "generator"},
    )
)
comments = post["comments_full"]
for comment in comments:
    print(comment)
    time.sleep(5)

It also makes it possible to resume from a given comment pagination link. Sample usage:

import time

from facebook_scraper import exceptions, get_posts, set_cookies

set_cookies("cookies.txt")
results = []
start_url = None

def handle_pagination_url(url):
    global start_url
    start_url = url

while True:
    try:
        post = next(
            get_posts(
                post_urls=[2257188721032235],
                options={
                    "comments": "generator",
                    "comment_start_url": start_url,
                    "comment_request_url_callback": handle_pagination_url,
                },
            )
        )
        comments = post["comments_full"]
        for comment in comments:
            print(comment)
            results.append(comment)
        print("All done")
        break
    except exceptions.TemporarilyBanned:
        print("Temporarily banned, sleeping for 10m")
        time.sleep(600)

@hc2021hc
Author

hc2021hc commented Aug 3, 2021

Thanks for your great work. Preliminary testing: when inserting a 5s break between comments, the program takes a very long time, so I quit it after around 4 hours. When inserting a 1-2s break between comments, plus an additional 4-6s break if the previous comment has replies, the program is still blocked by Facebook after scraping around 10 posts (around 1 hour).
I would like to try one thing: decoupling the scraping of comments and replies. In the first stage, I scrape the comments and only the URLs of the replies; this can be completed within several minutes with no block. In the second stage, I would loop through the results that have reply URLs and insert, say, a 20s break between each scraping request. However, I do not know how to revise the program for the second stage because the functions are so intertwined. Can you help write a function so that I can scrape the replies only by passing the URL? Do you think this is a feasible approach?

@neon-ninja
Collaborator

neon-ninja commented Aug 3, 2021

Sure, 5143311 should make the replies into a generator too, if options={"comments": "generator"} is set. This should allow you to consume replies when ready, at the rate you prefer. Sample code:

import time
from pprint import pprint

from tqdm import tqdm

from facebook_scraper import get_posts, set_cookies

set_cookies("cookies.txt")

post = next(
    get_posts(
        post_urls=[2257188721032235],
        options={"comments": "generator"},
    )
)
comments = list(post["comments_full"])
print(f"Got {len(comments)} comments. Fetching replies...")
for comment in tqdm(comments):
    comment["replies"] = list(comment["replies"])
    pprint(comment)
    if comment["replies"]:
        print(f"Found {len(comment['replies'])} replies, sleeping 20s")
        time.sleep(20)

@exnerfelix

I just started playing around with this module today and have already received the TemporarilyBanned exception. I was wondering if there is a best practice to avoid this exception? For example, is it more likely to get banned with credentials or with cookies? And what would be a recommended number of requests per hour? Thanks in advance!

@neon-ninja
Collaborator

Best practice is to reduce your requests per second ;). Credentials vs cookies should be functionally the same. Try to stay under one request per second. What were you doing? Some requests (getting friend lists for example) are more sensitive to temp bans than others.
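For a plain post scrape, something like this (a rough sketch; the page name and sleep times are just placeholders):

import time

from facebook_scraper import get_posts, set_cookies

set_cookies("cookies.txt")

# get_posts fetches posts lazily, a page at a time, so pausing between posts
# also spaces out the underlying requests
for post in get_posts("SomePage", pages=5):
    print(post["post_id"])
    time.sleep(2)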

@hc2021hc
Author

hc2021hc commented Aug 3, 2021

From my experience, use cookies instead of credentials. Starting in July, I experienced account locks (not just temporary bans) after logging in too many times while testing the programs.

@exnerfelix

Yes, I figured the requests per second should be low, but in my run I only ran it once every 5 minutes, which is why I'm surprised I was already temporarily blocked. I did use credentials for it though.

This is what I'm doing:

from facebook_scraper import get_posts

page_name = "ClashOfClans"  # the page I was testing on
posts = []
comments = []

for post in get_posts(page_name, pages=5, options={"comments": True, "posts_per_page": 25}, credentials=("username", "password")):
    # save each post in a dict
    posts.append(post)
    for comment in post["comments_full"]:
        # save each comment in a dict with the post as parent
        comments.append(dict(comment, post_id=post["post_id"]))

# ...
# analyse data in DataFrame

Basically, I only need one request using get_posts() per minute for analysing public pages.

@neon-ninja
Collaborator

neon-ninja commented Aug 3, 2021

How many comments and replies does this page usually attract? What's the value of page_name?

@exnerfelix

I was testing on Clash of Clans, so quite a lot of comments.

@hc2021hc
Author

hc2021hc commented Aug 3, 2021

I get temporarily blocked after running for around 30 minutes, even when sleeping 20-30s after each reply. It seems very difficult to scrape the replies. Some questions:

  1. Can the scraped comments include a field with the URL of the replies? Even if I get blocked, I could still write the URLs to an Excel file for future reference.
  2. In the extract_comments_full function, I often get the exception "local variable 'more_url' referenced before assignment". Is it necessary to add more_url = None?
  3. I'm still using "for post in get_posts(...)". How do I loop to the next post if I use the following? I find I'm always scraping the same post with this code.

while True:
    post = next(
        get_posts(
            federer,
            options={"comments": "generator"},
        )
    )

@hc2021hc
Author

hc2021hc commented Aug 3, 2021

Some more testing indicates that it is necessary to introduce a 3-5s delay after each comment and reply; the program can then continue to run for around 2 hours. Even though the blocking is caused by scraping replies, adding a delay only after each reply is not enough. It is necessary to add a delay after each comment as well, and in my experience a 1-2s delay is insufficient. I will see whether blocking occurs tomorrow with the current settings.

@neon-ninja
Collaborator

@exnerfelix so, with ClashOfClans, 5 pages, 25 posts per page, you're looking at 101 posts and ~69,494 comments. Comments only come in pages of 30, so that's ~2316 requests to get all the comments. Some comments have replies - let's assume roughly 10% have replies. So another 231 requests to get all the replies.

Requests:

  • 5 requests to get the pages
  • 101 requests to click each post in turn
  • 2316 requests to get all comments
  • 231 requests to get all replies

Total: 5 + 101 + 2316 + 231 = 2653. By default, all of these requests are made as fast as possible. At this scale you should probably use the recently added generator functionality and add some time.sleep code.
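Roughly along these lines (a sketch reusing the page and options from above; the exact sleeps are up to you):

import time

from facebook_scraper import get_posts, set_cookies

set_cookies("cookies.txt")

for post in get_posts(
    "ClashOfClans",
    pages=5,
    options={"comments": "generator", "posts_per_page": 25},
):
    # the comment generator only makes a new request once its current page of
    # ~30 comments is exhausted, so a per-comment pause spreads those requests out
    for comment in post["comments_full"]:
        time.sleep(2)
    time.sleep(5)  # and pause between posts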

@hc2021hc
Author

hc2021hc commented Aug 4, 2021

Using a 3-5s delay after each comment and reply, I can scrape 37 posts and 4442 comments in around 6 hours without blocking. Without replies, I do not need to introduce the delay and the scraping takes less than 5 minutes. So there seems to be a trade-off here: whether the replies are worth the extra scraping hours.

@neon-ninja
Collaborator

neon-ninja commented Aug 4, 2021

@cmhc2021

  1. Unfortunately, it seems reply URLs are account-specific in some way, so scraped reply URLs probably wouldn't be viewable or usable later. By which I mean, https://m.facebook.com/comment/replies/?ctoken=4365624383521981_4365703310180755&count=1&curr&pc=1&isinline&initcomp&ft_ent_identifier=4365624383521981&eav=Afa8TSrosUgPKcSj_PhW_SS0bf9mlE031GvAGDrOZ6oEorxNPS3JA2QE0do-y4b_OA4&av=100068943456113&gfid=AQC5rcSR4kBVZjejzFU&refid=52&__tn__=R isn't usable by anyone except the account that Facebook originally served it to. Perhaps it would be better to use the comment URL to jump directly to that comment.
  2. 35912da should fix this
  3. Try storing the result of get_posts in a variable, then iterating through it using next. Or just iterate through get_posts directly. See the sketch below.
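For example (a rough sketch; "federer" stands in for whatever page you are scraping):

from facebook_scraper import get_posts, set_cookies

set_cookies("cookies.txt")

# Create the generator once. Calling get_posts(...) inside the loop creates a
# fresh generator on every iteration, so next() keeps returning the first post.
posts = get_posts("federer", options={"comments": "generator"})

while True:
    post = next(posts, None)
    if post is None:
        break
    print(post["post_id"])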

@vitovt

vitovt commented Aug 12, 2021

@neon-ninja, how can I apply an additional delay between requests when I parse a page from the CLI?

I use a command like this:
facebook-scraper --filename nintendo_page_posts.csv --pages 5000 --cookies my_cookies.json nintendo
Is there any new time.sleep CLI option?

@neon-ninja
Collaborator

neon-ninja commented Aug 12, 2021

Not currently. You would need to write a Python script for that level of customisation.

@neon-ninja
Collaborator

c61f348 adds a new parameter called "sleep". Sample usage: facebook_scraper Nintendo -p 1 -s 1

@sla-te

sla-te commented Aug 17, 2021

I'm experiencing heavy rate limits here as well, while the account still seems to be perfectly fine when used in the browser. I'm using cookies to log in with the latest master and am already getting locked out just from calling get_group_info().

@vitovt

vitovt commented Aug 18, 2021

c61f348 adds a new parameter called "sleep". Sample usage: facebook_scraper Nintendo -p 1 -s 1

There is a warning when I try to use the "sleep" option:

/usr/local/lib/python3.8/dist-packages/facebook_scraper/__init__.py:285: UserWarning: The sleep parameter has been removed, it won't have any effect.
  get_posts(account=account, group=group, remove_source=not bool(dump_location), **kwargs), kwargs.get("sleep", 0)

I think it conflicts with this commit?
a806a97#diff-5293390647bfe90c64c2c6ddec0256adf4cff3b68824a5007b55cbb5671cef65

@neon-ninja
Collaborator

Disregard that warning; the sleep is handled outside of get_posts.

@sla-te

sla-te commented Aug 25, 2021

See below my method for scraping groups, including comments and replies, using the latest master. Scraping 100 posts including comments and replies takes ~10 minutes with this configuration. After 100 posts I am getting TemporarilyBanned, though. Then, when it attempts to start again where it left off after the 10-minute wait, it gets TemporarilyBanned again right away (tested over 30 minutes).

Any suggestions? I feel the delays are very high already.

def scrape_group_posts(self, group_ids: List[Union[int, str]]):
    def handle_pagination_url(url):
        nonlocal start_url
        start_url = url

    for k, group_id in enumerate(group_ids, 1):
        group_name = self.group_information[group_id]['name']
        log.info(f"[{k}] Scraping group: {group_name}...")

        start_url = None
        post_counter = 0
        keep_alive = True
        while keep_alive:
            try:
                posts = self.get_group_posts(
                    group=group_id,
                    options={
                        "comments": "generator" if config.SCRAPE_COMMENTS else False,
                        "comment_start_url": start_url,
                        "comment_request_url_callback": handle_pagination_url
                    }
                )

                while post := next(posts, None):
                    post_counter += 1

                    if post["time"] < config.DATELIMIT:
                        log.info(f"[{group_name}] Reached datelimit: {config.DATELIMIT}")
                        keep_alive = False
                        break

                    log.info(f"[{group_name}] Scraping post {post_counter} from {str(post['time'])[:19]}...")

                    for item in ('post_text', 'shared_text', 'text'):
                        pass  # post text fields handled/stored here

                    comments = post['comments_full']
                    # It is possible that comments are of type list
                    if hasattr(comments, '__next__'):  # comments is a generator
                        comment_counter = 0

                        while comment := next(comments, None):
                            comment_counter += 1
                            log.info(f"[{group_name}] Scraping comment {comment_counter} from {str(comment['comment_time'])[:19]} to post {post_counter}...")

                            replies = comment['replies']
                            if hasattr(replies, '__next__'):  # replies is a generator
                                replies_counter = 0

                                while reply := next(replies, None):
                                    replies_counter += 1
                                    log.info(f"[{group_name}] Scraping reply {replies_counter} from {str(reply['comment_time'])[:19]} to comment {comment_counter} of post {post_counter}...")

                                    random_sleep((3, 3.5), (4, 5), reason=f"{group_name}|WAIT_BEFORE_NEXT_REPLY")

                            elif type(replies) == list and replies:
                                log.warning(f"Found non-empty comment-replies as list ->\n{format_json(replies)}")

                            random_sleep((3, 3.5), (4, 5), reason=f"{group_name}|WAIT_BEFORE_NEXT_COMMENT")

                    elif type(comments) == list and comments:
                        log.warning(f"Found non-empty comments as list ->\n{format_json(comments)}")

                    random_sleep((3, 3.5), (4, 5), reason=f"{group_name}|WAIT_BEFORE_NEXT_POST")

                # If we reach this point without an exception, we have completed scraping this group
                keep_alive = False

            except TemporarilyBanned as e:
                random_sleep((580, 590), (600, 610), reason=f"{group_name}|{e.__class__.__name__}")

@milesgratz

@chwba - you're impersonating a legitimate human Facebook user on a cell phone with this Python module. The more irregular your behavior, the easier it is to determine you're a bot. "Viewing" all the comments of 100 posts in ~10 minutes isn't realistic/normal for a human. In my testing, I've been temp banned for 24-48+ hours with repeated bot-like behavior, and that will increase your chance of being temp-banned again from the same source IP or authenticated profile. Worst-case scenario: your account will be temporarily locked or permanently closed.

A few things you can try:

  • Rotate cookies of different accounts
  • Use IP addresses that have extremely high FB traffic (e.g., Starbucks/University WiFi, etc.)
  • Use clean IP addresses that you haven't scraped with before (e.g., build a proxy server or OpenVPN server in Azure/AWS/GCP/etc.)
  • Generate your cookies from a clean IP using a browser that matches the User Agent in the fb-scraper tool "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Mobile Safari/537.36"

Similar to washing your hands - it's important not to cross-contaminate. If ProfileA logged in from home IP address 1.1.1.1 and was account locked / temp banned for bot-like behavior, then logging in with ProfileB from home IP address 1.1.1.1 will increase your chances of a ban. The more FB traffic from that source IP (e.g., public WiFis), the more likely that you can use ProfileA and ProfileB from the same source IP without that being obviously correlated. It's worth mentioning that you are trying to trick some of the most sophisticated bot moderation engineering in the world.

My $0.02.

@sla-te

sla-te commented Aug 28, 2021

@milesgratz thank you for the thorough writeup. Of course I understand how difficult it is to create a request-based app that spoofs all this and gets the desired information; I just wanted to make sure I am not doing anything wrong regarding library usage. For my use case I am scraping a lot of invite-only private groups, so I will be forced to create a scraper that uses browser automation.

If you have any general hints I could consider while creating it, they would of course be highly appreciated!
