About temporarily block #409

Closed
hc2021hc opened this issue Jul 27, 2021 · 29 comments

Comments

@hc2021hc

I don't know why, but recently I've encountered a lot of temporary bans even with cookies.txt, and I'm not sure if security has been heightened. I would like to ask: sometimes, when I'm scraping, say, AndyMurray, I can see "temporarily blocked" in the log but I'm still able to fetch the comments. Is that usual? Why is it in the log? I catch exceptions.TemporarilyBanned in my program, but in such cases it seems my program never reaches the exception handler.

2021-07-27 11:34:21,113
Fetching https://m.facebook.com/comment/replies/?ctoken=10157115967571324_10157118433851324&count=1&curr&pc=1&isinline&initcomp&ft_ent_identifier=10157115967571324&eav=AfZ-Bkw7aFm3LhuKR5WEE-Zzf-uMqanA3GO5BYH3XUu4ItOx9WtNo4htTWNmUmVVcv8&av=100070379280090&gfid=AQC-\-\EPBMTYVrx5qz1c&refid=52&__tn__=R
2021-07-27 11:34:21,363
Content not found
2021-07-27 11:34:21,379
Fetching https://m.facebook.com/comment/replies/?ctoken=10157115967571324_10157119529536324&count=4&curr&pc=1&isinline&initcomp&ft_ent_identifier=10157115967571324&eav=AfaQOmbiQgTar3Ebgu3MojjgzoNw8eEL9KqTIY6rLzpQ2PAfQRfohx4C_ixoPXrNvA8&av=100070379280090&gfid=AQDHfGk86CfbHpgjGvA&refid=52&__tn__=R
2021-07-27 11:34:21,628
You're Temporarily Blocked
Scraped post: 10157115967571324
2021-07-27 11:34:21,628
Found new comment 10157116011651324
2021-07-27 11:34:21,644
...

@neon-ninja
Collaborator

neon-ninja commented Jul 27, 2021

It's still possible to get temp banned when logged in. In some cases, the scraper makes a request, which is rejected due to temp ban (for example, getting replies to a comment). But the rest of the data can still be returned despite this, so the exception is caught and handled within the extract function. It's only if you were temp banned while paginating, or if the exception bubbles out of the individual extract function that the exception would be raised to your own code. See #385 for some more discussion around what should be raised vs caught.
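For example, a rough sketch of the distinction (page name, page count, and back-off time are just placeholders):

import time

from facebook_scraper import exceptions, get_posts, set_cookies

set_cookies("cookies.txt")

try:
    for post in get_posts("AndyMurray", pages=5, options={"comments": True}):
        print(post["post_id"])
except exceptions.TemporarilyBanned:
    # Only reached if the ban occurs while paginating posts, or bubbles out of
    # an extract function; bans handled inside extraction never get here.
    print("Temporarily banned while paginating, backing off")
    time.sleep(600)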

@hc2021hc
Author

hc2021hc commented Jul 27, 2021

Could exceptions.TemporarilyBanned be raised from the extract_comments_full function instead, so my main program can catch it? In my case I need to pause even if I cannot scrape the replies.

@hc2021hc
Author

I also wonder: are we making an independent request for each comment's replies? Should we introduce some delay between requests to reduce the chance of blocking?

@neon-ninja
Collaborator

Sure, 17e8332 should make the scraper raise TemporarilyBanned if it occurs while collecting replies, or reactions to comments. Yes, each set of replies or reactions to a comment involves another request. Could do - perhaps this function could be converted into a generator that yields comments, similar to how get_posts is a generator that yields posts. What do you think?

@hc2021hc
Author

It just seems strange to me that when I am temporarily blocked I cannot request replies but can still request story.php, so what makes the difference? What I observed is that too many requests for replies to comments get temporarily blocked easily, and that is almost a precursor to the whole program being blocked. I'm not an expert in Python and I still don't know much about generators, but if that might improve the situation, I think it is worth a try and we can test it.

@neon-ninja
Collaborator

I think it's possible that Facebook have different quotas for different request types, and some requests are "more suspicious" than others. So it seems plausible that you might be banned from getting more replies but still able to get posts.

@neon-ninja
Collaborator

neon-ninja commented Aug 2, 2021

3026d39 should make it possible to get the scraper to give you a generator of comments, if you set the comments option to "generator". This should give you more control of the rate of comment extraction. Sample usage:

import time

from facebook_scraper import get_posts, set_cookies

set_cookies("cookies.txt")

post = next(
    get_posts(
        post_urls=[2257188721032235],
        options={"comments": "generator"},
    )
)
comments = post["comments_full"]
for comment in comments:
    print(comment)
    time.sleep(5)

It also makes it possible to resume from a given comment pagination link. Sample usage:

import time

from facebook_scraper import exceptions, get_posts, set_cookies

set_cookies("cookies.txt")
results = []
start_url = None

def handle_pagination_url(url):
    global start_url
    start_url = url

while True:
    try:
        post = next(
            get_posts(
                post_urls=[2257188721032235],
                options={
                    "comments": "generator",
                    "comment_start_url": start_url,
                    "comment_request_url_callback": handle_pagination_url,
                },
            )
        )
        comments = post["comments_full"]
        for comment in comments:
            print(comment)
            results.append(comment)
        print("All done")
        break
    except exceptions.TemporarilyBanned:
        print("Temporarily banned, sleeping for 10m")
        time.sleep(600)

@hc2021hc
Author

hc2021hc commented Aug 3, 2021

Thanks for your great work. Preliminary testing: when inserting a 5s break between comments, the program takes a very long time, so I quit it after around 4 hours. When inserting a 1-2s break between comments, plus an additional 4-6s break if the previous comment has replies, the program is still blocked by Facebook after scraping around 10 posts (around 1 hour).
I would like to try one thing: decoupling the scraping of comments and replies. In the first stage, I scrape the comments and only the URLs of the replies; this can be completed within several minutes with no block. In the second stage, I would loop through the results that have reply URLs and insert, say, a 20s break between each scraping request. However, I do not know how to revise the program for the second stage because the functions are so intertwined. Can you help write a function so that I can scrape the replies only by passing the URL? Do you think this is a feasible approach?

@neon-ninja
Collaborator

neon-ninja commented Aug 3, 2021

Sure, 5143311 should make the replies into a generator too, if options={"comments": "generator"} is set. This should allow you to consume replies when ready, at the rate you prefer. Sample code:

import time
from pprint import pprint

from tqdm import tqdm

from facebook_scraper import get_posts, set_cookies

set_cookies("cookies.txt")

post = next(
    get_posts(
        post_urls=[2257188721032235],
        options={"comments": "generator"},
    )
)
comments = list(post["comments_full"])
print(f"Got {len(comments)} comments. Fetching replies...")
for comment in tqdm(comments):
    comment["replies"] = list(comment["replies"])
    pprint(comment)
    if comment["replies"]:
        print(f"Found {len(comment['replies'])} replies, sleeping 20s")
        time.sleep(20)

@exnerfelix

I just started playing around with this module today and have already received the TemporarilyBanned exception. I was wondering if there is a best practice to avoid this exception? For example, is it more likely to get banned with credentials or with cookies? And what would be a recommended number of requests per hour? Thanks in advance!

@neon-ninja
Collaborator

Best practice is to reduce your requests per second ;). Credentials vs cookies should be functionally the same. Try to stay under one request per second. What were you doing? Some requests (getting friend lists for example) are more sensitive to temp bans than others.
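For a plain post scrape, something like this (a rough sketch; the page name and sleep times are just placeholders):

import time

from facebook_scraper import get_posts, set_cookies

set_cookies("cookies.txt")

# get_posts fetches posts lazily, a page at a time, so pausing between posts
# also spaces out the underlying requests
for post in get_posts("SomePage", pages=5):
    print(post["post_id"])
    time.sleep(2)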

@hc2021hc
Author

hc2021hc commented Aug 3, 2021

From my experience, use cookies instead of credentials. Starting in July, I experienced account locks (not just temporary bans) after logging in too many times while testing the programs.

@exnerfelix

Yes, I figured the requests per second should be low, but in my run I only ran it once every 5 minutes, which is why I'm surprised I was already temporarily blocked. I did use credentials for it though.

This is what I'm doing:

from facebook_scraper import get_posts

page_name = "ClashOfClans"  # the page I was testing on
posts = []
comments = []

for post in get_posts(page_name, pages=5, options={"comments": True, "posts_per_page": 25}, credentials=("username", "password")):
    # save each post in a dict
    posts.append(post)
    for comment in post["comments_full"]:
        # save each comment in a dict with the post as parent
        comments.append(dict(comment, post_id=post["post_id"]))

# ...
# analyse data in DataFrame

Basically, I only need one request using get_posts() per minute for analysing public pages.

@neon-ninja
Collaborator

neon-ninja commented Aug 3, 2021

How many comments and replies does this page usually attract? What's the value of page_name?

@exnerfelix

I was testing on Clash of Clans, so quite a lot of comments.

@hc2021hc
Author

hc2021hc commented Aug 3, 2021

I get temporarily blocked after running for around 30 minutes, even when sleeping 20-30s after each reply. It seems very difficult to scrape the replies. Some questions:

  1. Can the scraped comments include a field with the URL of the replies? Even if I get blocked, I could still write the URLs to an Excel file for future reference.
  2. In the extract_comments_full function, I often get the exception "local variable 'more_url' referenced before assignment". Is it necessary to add more_url = None?
  3. I'm still using "for post in get_posts(...)". How do I loop to the next post if I use the following? I find I'm always scraping the same post with this code.

while True:
    post = next(
        get_posts(
            federer,
            options={"comments": "generator"},
        )
    )

@hc2021hc
Author

hc2021hc commented Aug 3, 2021

Some more testing indicates that it is necessary to introduce a 3-5s delay after each comment and reply; the program can then continue to run for around 2 hours. Even though the blocking is caused by scraping replies, adding a delay only after each reply is not enough. It is necessary to add a delay after each comment as well, and in my experience a 1-2s delay is insufficient. I will see whether blocking occurs tomorrow with the current settings.

@neon-ninja
Collaborator

@exnerfelix so, with ClashOfClans, 5 pages, 25 posts per page, you're looking at 101 posts and ~69,494 comments. Comments only come in pages of 30, so that's ~2316 requests to get all the comments. Some comments have replies - let's assume roughly 10% have replies. So another 231 requests to get all the replies.

Requests:

  • 5 requests to get the pages
  • 101 requests to click each post in turn
  • 2316 requests to get all comments
  • 231 requests to get all replies

Total: 5 + 101 + 2316 + 231 = 2653. By default, all of these requests are made as fast as possible. At this scale you should probably use the recently added generator functionality and add some time.sleep code.
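Roughly along these lines (a sketch reusing the page and options from above; the exact sleeps are up to you):

import time

from facebook_scraper import get_posts, set_cookies

set_cookies("cookies.txt")

for post in get_posts(
    "ClashOfClans",
    pages=5,
    options={"comments": "generator", "posts_per_page": 25},
):
    # the comment generator only makes a new request once its current page of
    # ~30 comments is exhausted, so a per-comment pause spreads those requests out
    for comment in post["comments_full"]:
        time.sleep(2)
    time.sleep(5)  # and pause between posts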

@hc2021hc
Author

hc2021hc commented Aug 4, 2021

Using a 3-5s delay after each comment and reply, I can scrape 37 posts and 4442 comments in around 6 hours without blocking. Without replies, I do not need to introduce the delay and the scraping takes less than 5 minutes. So there seems to be a trade-off here: whether the replies are worth the extra scraping hours.

@neon-ninja
Collaborator

neon-ninja commented Aug 4, 2021

@cmhc2021

  1. Unfortunately, it seems reply URLs are account-specific in some way, so scraped reply URLs probably wouldn't be viewable or usable later. By which I mean, https://m.facebook.com/comment/replies/?ctoken=4365624383521981_4365703310180755&count=1&curr&pc=1&isinline&initcomp&ft_ent_identifier=4365624383521981&eav=Afa8TSrosUgPKcSj_PhW_SS0bf9mlE031GvAGDrOZ6oEorxNPS3JA2QE0do-y4b_OA4&av=100068943456113&gfid=AQC5rcSR4kBVZjejzFU&refid=52&__tn__=R isn't usable by anyone except the account that Facebook originally served it to. Perhaps it would be better to use the comment URL to jump directly to that comment.
  2. 35912da should fix this
  3. Try storing the result of get_posts in a variable, then iterating through it using next. Or just iterate through get_posts directly. See the sketch below.
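For example (a rough sketch; "federer" stands in for whatever page you are scraping):

from facebook_scraper import get_posts, set_cookies

set_cookies("cookies.txt")

# Create the generator once. Calling get_posts(...) inside the loop creates a
# fresh generator on every iteration, so next() keeps returning the first post.
posts = get_posts("federer", options={"comments": "generator"})

while True:
    post = next(posts, None)
    if post is None:
        break
    print(post["post_id"])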

@vitovt

vitovt commented Aug 12, 2021

@neon-ninja, how can I apply an additional delay between requests when I parse a page from the CLI?

I use a command like this:
facebook-scraper --filename nintendo_page_posts.csv --pages 5000 --cookies my_cookies.json nintendo
Is there any new time.sleep CLI option?

@neon-ninja
Collaborator

neon-ninja commented Aug 12, 2021

Not currently. You would need to write a Python script for that level of customisation.

@neon-ninja
Collaborator

c61f348 adds a new parameter called "sleep". Sample usage: facebook_scraper Nintendo -p 1 -s 1

@sla-te

sla-te commented Aug 17, 2021

I'm experiencing heavy rate limits here as well, while the account still seems to be perfectly fine when used in the browser. I'm using cookies to log in with the latest master and am already getting locked out just from calling get_group_info().

@vitovt

vitovt commented Aug 18, 2021

c61f348 adds a new parameter called "sleep". Sample usage: facebook_scraper Nintendo -p 1 -s 1

There is a warning when I try to use the "sleep" option:

/usr/local/lib/python3.8/dist-packages/facebook_scraper/__init__.py:285: UserWarning: The sleep parameter has been removed, it won't have any effect.
  get_posts(account=account, group=group, remove_source=not bool(dump_location), **kwargs), kwargs.get("sleep", 0)

I think it conflicts with this commit?
a806a97#diff-5293390647bfe90c64c2c6ddec0256adf4cff3b68824a5007b55cbb5671cef65

@neon-ninja
Collaborator

Disregard that warning; the sleep is handled outside of get_posts.

@sla-te

sla-te commented Aug 25, 2021

See below my method for scraping groups, including comments and replies, using the latest master. Scraping 100 posts including comments and replies takes ~10 minutes with this configuration. After 100 posts I am getting TemporarilyBanned, though. Then, when it attempts to start again where it left off after the 10-minute wait, it gets TemporarilyBanned again right away (tested over 30 minutes).

Any suggestions? I feel the delays are very high already.

def scrape_group_posts(self, group_ids: List[Union[int, str]]):
    def handle_pagination_url(url):
        nonlocal start_url
        start_url = url

    for k, group_id in enumerate(group_ids, 1):
        group_name = self.group_information[group_id]['name']
        log.info(f"[{k}] Scraping group: {group_name}...")

        start_url = None
        post_counter = 0
        keep_alive = True
        while keep_alive:
            try:
                posts = self.get_group_posts(
                    group=group_id,
                    options={
                        "comments": "generator" if config.SCRAPE_COMMENTS else False,
                        "comment_start_url": start_url,
                        "comment_request_url_callback": handle_pagination_url
                    }
                )

                while post := next(posts, None):
                    post_counter += 1

                    if post["time"] < config.DATELIMIT:
                        log.info(f"[{group_name}] Reached datelimit: {config.DATELIMIT}")
                        keep_alive = False
                        break

                    log.info(f"[{group_name}] Scraping post {post_counter} from {str(post['time'])[:19]}...")

                    for item in ('post_text', 'shared_text', 'text'):
                        pass  # post text fields handled/stored here

                    comments = post['comments_full']
                    # It is possible that comments are of type list
                    if hasattr(comments, '__next__'):  # comments is a generator
                        comment_counter = 0

                        while comment := next(comments, None):
                            comment_counter += 1
                            log.info(f"[{group_name}] Scraping comment {comment_counter} from {str(comment['comment_time'])[:19]} to post {post_counter}...")

                            replies = comment['replies']
                            if hasattr(replies, '__next__'):  # replies is a generator
                                replies_counter = 0

                                while reply := next(replies, None):
                                    replies_counter += 1
                                    log.info(f"[{group_name}] Scraping reply {replies_counter} from {str(reply['comment_time'])[:19]} to comment {comment_counter} of post {post_counter}...")

                                    random_sleep((3, 3.5), (4, 5), reason=f"{group_name}|WAIT_BEFORE_NEXT_REPLY")

                            elif type(replies) == list and replies:
                                log.warning(f"Found non-empty comment-replies as list ->\n{format_json(replies)}")

                            random_sleep((3, 3.5), (4, 5), reason=f"{group_name}|WAIT_BEFORE_NEXT_COMMENT")

                    elif type(comments) == list and comments:
                        log.warning(f"Found non-empty comments as list ->\n{format_json(comments)}")

                    random_sleep((3, 3.5), (4, 5), reason=f"{group_name}|WAIT_BEFORE_NEXT_POST")

                # If we reach this point without an exception, we have completed scraping this group
                keep_alive = False

            except TemporarilyBanned as e:
                random_sleep((580, 590), (600, 610), reason=f"{group_name}|{e.__class__.__name__}")

@milesgratz

@chwba - you're impersonating a legitimate human Facebook user on a cell phone with this Python module. The more irregular your behavior, the easier it is to determine you're a bot. "Viewing" all the comments of 100 posts in ~10 minutes isn't realistic/normal for a human. In my testing, I've been temp banned for 24-48+ hours with repeated bot-like behavior, and that will increase your chance of being temp-banned again from the same source IP or authenticated profile. Worst-case scenario: your account will be temporarily locked or permanently closed.

A few things you can try:

  • Rotate cookies of different accounts
  • Use IP addresses that have extremely high FB traffic (e.g., Starbucks/University WiFi, etc.)
  • Use clean IP addresses that you haven't scraped with before (e.g., build a proxy server or OpenVPN server in Azure/AWS/GCP/etc.)
  • Generate your cookies from a clean IP using a browser that matches the User Agent in the fb-scraper tool "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Mobile Safari/537.36"

Similar to washing your hands - it's important not to cross-contaminate. If ProfileA logged in from home IP address 1.1.1.1 and was account locked / temp banned for bot-like behavior, then logging in with ProfileB from home IP address 1.1.1.1 will increase your chances of a ban. The more FB traffic from that source IP (e.g., public WiFis), the more likely that you can use ProfileA and ProfileB from the same source IP without that being obviously correlated. It's worth mentioning that you are trying to trick some of the most sophisticated bot moderation engineering in the world.

My $0.02.

@sla-te

sla-te commented Aug 28, 2021

@milesgratz thank you for the thorough writeup. Of course I understand how difficult it is to create a request-based app that spoofs all this and gets the desired information; I just wanted to make sure I am not doing anything wrong regarding library usage. For my use case I am scraping a lot of invite-only private groups, so I will be forced to create a scraper that uses browser automation.

If you have any general hints I could consider while creating it, they would of course be highly appreciated!
