
Help with Error: <class 'TypeError'>. 'NoneType' object is not subscriptable for some posts #172

Closed
joebah-joe opened this issue Mar 8, 2021 · 44 comments
Labels
bug Something isn't working

Comments

@joebah-joe commented Mar 8, 2021

Hi - I've been running the facebook-scraper for a while and noticed that for about 20% of the posts, I would catch this error:

Error: <class 'TypeError'>. 'NoneType' object is not subscriptable

I understand that it's because the scraper is trying to index an object of type None (i.e. the object has no value). However, I couldn't tell the difference between the posts that generate this error and the posts that don't, to the point that it almost seems random. It also seems like some pages have many more occurrences of these errors than others.

I have tried changing the number of pages and the word count from 1 to 8,000. The ones that work continue to work; the ones that don't, don't.

Does anyone have the same issue? Does it have something to do with the Python version or Python environment that I'm using? (I am on Python 3.6.) Or am I just making too many requests? (I pull the data from the page every 30 minutes.) If so, how were you able to fix it? Thanks!

kevinzg added the bug label Mar 8, 2021
@kevinzg (Owner) commented Mar 8, 2021

I need more information, can you share the traceback?

@joebah-joe (Author) commented Mar 9, 2021

I need more information, can you share the traceback?

If I run from the command line (see below), the 'text' and 'post_text' fields would just be left blank in the generated .csv file.

facebook-scraper --filename nintendo_page_posts.csv --pages 2 longtunman

However, if I call the scraper from my own code and catch the error with traceback, it looks like this:

Error: <class 'TypeError'>. 'NoneType' object is not subscriptable, line:29

longtunman https://facebook.com/longtunman/posts/1020584865140788 Error: <class 'TypeError'>. 'NoneType' object is not subscriptable, line:29

Full Traceback:
Traceback (most recent call last):
  File "/home/myscraper/mysite/facebookparser_app.py", line 29, in extract_facebook_data
    output = output + "" + (post['text'][:word_count]) + "\n"
TypeError: 'NoneType' object is not subscriptable

Unfortunately, this doesn't trace back to the error generated from your code, only from my code where I call the post function.

As I mentioned, this happens on some posts. Other posts work fine.

EDIT - TL;DR: basically, the post text is coming back as a NoneType object when the scraper processes a problematic post, which causes the 'not subscriptable' error when I try to use it. I can skip these posts for now so my code can continue to run and not bug out, but I wanted to make sure that getting a NoneType object back for the post text is expected for certain FB posts (though I have no idea what would trigger it - all the posts look the same to me...), and ideally whether there's some way to still extract the text from these problematic posts...
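For reference, here's a minimal offline sketch of the skip I'm doing, with hypothetical post dicts standing in for the real scraper output:

```python
# Hypothetical post dicts; real ones come from get_posts(). The point is that
# post['text'] may be None, and slicing None raises:
# TypeError: 'NoneType' object is not subscriptable
posts = [{"text": "hello world"}, {"text": None}]

def safe_excerpt(post, word_count=50):
    """Return a text excerpt, or '' when the scraper gave us no text."""
    text = post.get("text")
    if text is None:
        return ""  # skip/default instead of crashing on text[:word_count]
    return text[:word_count]

print([safe_excerpt(p) for p in posts])  # ['hello world', '']
```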

@kevinzg (Owner) commented Mar 9, 2021

I looked at those posts; the one that doesn't have text is 1022240714975203.
Here's the HTML for that post: 1022240714975203.txt
It's a video, but it does have some text (behind a "see more" link), so we can use this case to improve the text extraction.

By the way, I noticed that that page (longtunman) doesn't work with the latest version (where the starting URL doesn't end with /posts/, #170), so you might not want to upgrade yet.

@joebah-joe (Author) commented Mar 9, 2021

Oh I see... so the scraper is unable to get text if there are things like videos, and when that happens it returns a NoneType object?

Actually, I was unable to get text for the following recent posts too:

https://facebook.com/longtunman/posts/1020584865140788

https://facebook.com/longtunman/videos/129094235798751

https://facebook.com/longtunman/posts/1022861631579778

https://facebook.com/longtunman/posts/1022822748250333

https://facebook.com/longtunman/posts/1022793828253225

You were able to extract text for those? I was only able to get text for this post:

https://facebook.com/longtunman/posts/1022746714924603

By the way, I noticed that that page (longtunman) doesn't work with the latest version (where the starting URL doesn't end with /posts/, #170), so you might not want to upgrade yet.

EDIT:

Ah, good to know. I installed the scraper on the pythonanywhere server less than a week ago and I'm running my scripts off the server. I guess it's version 0.2.19.

I just ran the scraper off my local desktop and it seems to extract these posts just fine. Could I have installed the scraper wrong on pythonanywhere, and that's why I am having all these issues? I remember there were a lot of compatibility issues and I sort of fumbled around until I got it to install...

@kevinzg (Owner) commented Mar 9, 2021

so the scraper is unable to get text if there are things like videos, and when that happens it returns a NoneType object?

Yeah, it's either because of the video or the see more link.

You were able to extract text for those? I was only able to get text for this post:

Yes, the only one without text was the one I mentioned in my previous comment.

I installed the scraper on pythonanywhere server less than a week ago

The version I'm talking about was released yesterday (v0.2.20).
To check your version you can run pip freeze.

I ran the older version that I installed locally, and it seems to extract these pages just fine.

If you can find out what version starts to fail that would be really useful.
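Besides pip freeze, a stdlib sketch (Python 3.8+'s importlib.metadata) for checking from inside Python which version, if any, is installed in the current environment:

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(pkg):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None

# Prints None when facebook-scraper lives in a different (virtual) environment.
print(installed_version("facebook-scraper"))
```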

@joebah-joe (Author) commented Mar 9, 2021

If you can find out what version starts to fail that would be really useful.

Okay, here's my pip freeze output from the pythonanywhere bash console. As mentioned earlier, I had a lot of problems installing initially. Perhaps I got some installation stuff wrong...

pip freeze output.txt

@kevinzg (Owner) commented Mar 9, 2021

There should be a facebook-scraper==<version> line there, but there isn't, maybe it is in a different virtual env?

@joebah-joe (Author) commented Mar 9, 2021

There should be a facebook-scraper==<version> line there, but there isn't, maybe it is in a different virtual env?

Ah, so it turned out I installed facebook-scraper in the general environment and not a virtual environment. Noob mistake. This makes it difficult to find the facebook-scraper version.

Should I start over and install the scraper in the virtual environment? I remember having the most horrible time dealing with the urllib3 version and force-installing stuff to make it work...

Okay, I created a virtual environment, installed version 0.2.19, and this was the CSV generated when I ran the command line for longtunman. It seems like I am still missing a lot of the text. I attached the CSV and the pip freeze for your reference.
longtunman_page_posts.zip
pip freeze output.txt

@joebah-joe (Author) commented Mar 9, 2021

There should be a facebook-scraper==<version> line there, but there isn't, maybe it is in a different virtual env?

Sorry - I kept editing my post. Here's the new pip freeze output and the CSV file generated from the command line in the pythonanywhere virtual environment I just set up with version 0.2.19 installed. Still having issues even with 0.2.19: the text field is missing for most of the posts...
pip freeze output.txt
longtunman_page_posts.zip

@neon-ninja (Collaborator) commented Mar 9, 2021

I just ran the scraper off my local desktop and it seems to extract these posts just fine

It sounds to me like this problem is localised to pythonanywhere. Perhaps, as it's a cloud platform, facebook is serving it different HTML than you would otherwise get running locally. I created a free account on pythonanywhere, and I get the same problem as you when running it there, even though it runs fine for me locally too. With debug logging enabled, I get these errors in the console on pythonanywhere:

Exception while running extract_text: AttributeError("'NoneType' object has no attribute 'find'")

Using this code:

from facebook_scraper import get_posts, enable_logging
import logging
enable_logging(logging.DEBUG)
posts = list(get_posts("longtunman"))
print(f"{len(posts)} posts, {len([post for post in posts if not post['text']])} missing text")

So, it looks like the scraper hits the "more" url, then isn't able to find story_body_container in the resulting HTML. Here's a snippet of the resulting HTML:

<div class="hidden_elem"><code id="u_0_y_xn"><!-- <div id="m_story_permalink_view" data-sigil="m-story-view"><div class="_3f50"><div class="_5rgr async_like" data-store="&#123;&quot;linkdata&quot;:&quot;mf_story_key.1022822748250333:top_level_post_id.1022822748250333:tl_objid.1022822748250333:content_owner_id_new.113397052526245:throwback_story_fbid.1022822748250333:page_id.113397052526245:photo_id.1022822441583697:story_location.9:story_attachment_style.photo:tds_flgs.3:ott.AX9zWDJ7Y6F6r0Rn&quot;,&quot;share_id&quot;:1022822748250333,&quot;feedback_target&quot;:1022822748250333,&quot;feedback_source&quot;:8,&quot;action_source&quot;:2,&quot;actor_id&quot;:100022709408081&#125;" data-xt="2.mf_story_key.1022822748250333:top_level_post_id.1022822748250333:tl_objid.1022822748250333:content_owner_id_new.113397052526245:throwback_story_fbid.1022822748250333:page_id.113397052526245:photo_id.1022822441583697:story_location.9:story_attachment_style.photo:tds_flgs.3:ott.AX9zWDJ7Y6F6r0Rn" data-xt-vimp="&#123;&quot;pixel_in_percentage&quot;:0,&quot;duration_in_ms&quot;:1,&quot;subsequent_gap_in_ms&quot;:60000,&quot;log_initial_nonviewable&quot;:false,&quot;should_batch&quot;:true,&quot;require_horizontally_onscreen&quot;:false&#125;" data-ft="&#123;&quot;mf_story_key&quot;:&quot;1022822748250333&quot;,&quot;top_level_post_id&quot;:&quot;1022822748250333&quot;,&quot;tl_objid&quot;:&quot;1022822748250333&quot;,&quot;content_owner_id_new&quot;:&quot;113397052526245&quot;,&quot;throwback_story_fbid&quot;:&quot;1022822748250333&quot;,&quot;page_id&quot;:&quot;113397052526245&quot;,&quot;photo_id&quot;:&quot;1022822441583697&quot;,&quot;story_location&quot;:9,&quot;story_attachment_style&quot;:&quot;photo&quot;,&quot;tds_flgs&quot;:3,&quot;ott&quot;:&quot;AX9zWDJ7Y6F6r0Rn&quot;,&quot;tn&quot;:&quot;-R&quot;&#125;" id="u_0_s_CW" data-sigil="story-div story-popup-metadata story-popup-metadata feed-ufi-metadata"><div class="story_body_container"><header class="_7om2 _1o88 _77kd 
_5qc1"><div class="_5s61 _2pii _5i2i _52wc"><div class="_5xu4"><div class="_67lm _77kc" data-gt="&#123;&quot;tn&quot;:&quot;~&quot;&#125;" 

story_body_container is in there, it's just commented out. Here's a simple reprex showing this behaviour of requests_html:

from requests_html import HTML
h = HTML(html='<body><!-- <div class="story_body_container"></div> --></body>')
element = h.find('.story_body_container', first=True)
print(element) # Returns None

h.html = h.html.replace('<!--', '').replace('-->', '') seems to fix it - here's a PR #177
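For anyone curious, the same behaviour can be reproduced offline with just the stdlib parser (an analogous sketch; requests_html actually uses lxml underneath, but comments are invisible to both):

```python
from html.parser import HTMLParser

# A commented-out element is reported via handle_comment, never handle_starttag,
# so a selector for it finds nothing until the comment markers are stripped.
class DivFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "story_body_container") in attrs:
            self.found = True

html = '<body><!-- <div class="story_body_container"></div> --></body>'

p = DivFinder()
p.feed(html)
print(p.found)  # False: the div is hidden inside an HTML comment

p2 = DivFinder()
p2.feed(html.replace('<!--', '').replace('-->', ''))
print(p2.found)  # True once the comment markers are removed
```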

@kevinzg (Owner) commented Mar 10, 2021

Thanks for debugging it, @neon-ninja! Hopefully your fix will work.

@joebah-joe forget what I said about the latest version not working, it only happens when the page limit is very low.

@joebah-joe (Author)

h.html = h.html.replace('<!--', '').replace('-->', '') seems to fix it - here's a PR #177

Great stuff - so it's a pythonanywhere-specific issue!

Dumb question - where do I put this h.html.replace line to fix this issue?

@neon-ninja (Collaborator)

h.html = h.html.replace('<!--', '').replace('-->', '') seems to fix it - here's a PR #177

Great stuff - so it's a pythonanywhere-specific issue!

Dumb question - where do I put this h.html.replace line to fix this issue?

That PR is merged now, so just update to version 0.2.21

@joebah-joe (Author) commented Mar 10, 2021

Oh, I just saw the latest version. I'll update as soon as I get to my terminal in a few hours and will report back on whether it fixed things for me or not.

Thanks @neon-ninja and @kevinzg - really appreciate all your help on this!

@joebah-joe (Author) commented Mar 10, 2021

@neon-ninja and @kevinzg - No luck for me with 0.2.21. This is the error I get from CLI on pythonanywhere:

(myvirtualenv) 04:27 ~/mysite $ facebook-scraper --filename nintendo_page_posts.csv --pages 5 longtunman
Couldn't get any posts.

Actually, this happens with all the FB pages now, not just Longtunman.
And this is my pip freeze from the virtual environment I'm running the script/CLI in:

appdirs==1.4.4
beautifulsoup4==4.9.3
bs4==0.0.1
certifi==2020.12.5
chardet==4.0.0
click==7.1.2
cssselect==1.1.0
dateparser==1.0.0
facebook-scraper==0.2.21
fake-useragent==0.1.11
Flask==1.1.1
idna==2.10
itsdangerous==1.1.0
Jinja2==2.11.3
lxml==4.6.2
MarkupSafe==1.1.1
parse==1.19.0
pyee==8.1.0
pyppeteer==0.2.5
pyquery==1.4.3
python-dateutil==2.8.1
pytz==2021.1
regex==2020.11.13
requests==2.25.1
requests-html==0.10.0
six==1.15.0
soupsieve==2.2
tqdm==4.59.0
tzlocal==2.1
urllib3==1.26.3
w3lib==1.22.0
websockets==8.1
Werkzeug==1.0.1

But I presume that you tested it out on pythonanywhere? I am wondering what I did wrong. I just pip uninstalled facebook-scraper==0.2.19, then did a pip install facebook-scraper==0.2.21 in the virtualenv, and ran the CLI.

@neon-ninja (Collaborator)

@joebah-joe sooner or later, facebook starts insisting that you log in. Try feeding your cookies in as per #28 (comment)

@joebah-joe (Author)

@joebah-joe sooner or later, facebook starts insisting that you log in. Try feeding your cookies in as per #28 (comment)

Okay, I'll try that. So the problem I'm having with 0.2.21 is because I'm not logged in or feeding in the cookies.txt? Is that why I'm getting nothing for any of the pages?

@balazssandor

@neon-ninja @kevinzg
The issue persists on a page like dezsoandraskonyvei, where the URL https://www.facebook.com/dezsoandraskonyvei/posts doesn't exist
(screenshot attached)

We are getting back 0 results from there with version 0.2.21

>>> list(facebook_scraper.get_posts('dezsoandraskonyvei', pages=3))
[]

@joebah-joe (Author) commented Mar 10, 2021

@neon-ninja Well, I tried 0.2.21 again and passed in the cookies.txt (netscape format) from my facebook page as instructed. No luck; I'm just not getting any posts back at all from any page, be it longtunman or nintendo or whatever.

Just for fun, I went back to 0.2.20 with the cookies, and it works (though obviously longtunman still has issues with the NoneType). So there doesn't seem to be anything wrong with my facebook cookies, at the very least.

Could you test 0.2.21 in your pythonanywhere environment again? I'm sure we're getting really close to fixing this, but maybe I'm just missing something... I really appreciate your help with this so far.

@kevinzg (Owner) commented Mar 10, 2021

@balazssandor the problem with that page is that the mobile version https://m.facebook.com/dezsoandraskonyvei doesn't list any posts (with or without the /posts suffix). The mbasic and touch subdomains don't have any posts either, so it seems it can only be scraped from the www one, which the scraper doesn't support.

@joebah-joe I tried with 0.2.21 and it did scrape some posts. I didn't try with a cookie file as I don't have an account so maybe it's that.
Consider that Facebook can serve different content based on your IP, your usage, the date, A/B testing, etc. so the best might be to debug it yourself.

Also, if you are logging-in from your local computer, and using those cookies on your server (with a different IP), Facebook might flag it as suspicious activity?

@joebah-joe (Author) commented Mar 10, 2021

@joebah-joe I tried with 0.2.21 and it did scrape some posts. I didn't try with a cookie file as I don't have an account so maybe it's that.
Consider that Facebook can serve different content based on your IP, your usage, the date, A/B testing, etc. so the best might be to debug it yourself.

Also, if you are logging-in from your local computer, and using those cookies on your server (with a different IP), Facebook might flag it as suspicious activity?

I see. I guess at the end of the day it's the pythonanywhere environment that's creating all sorts of problems for me as others don't seem to be getting it, and my local desktop version is running fine. This is a shame, I quite liked using their interface.

One last question then - can I use 0.2.21 without cookies? So .get_posts('longtunman', pages=10) works too?

@kevinzg (Owner) commented Mar 10, 2021

I see. I guess at the end of the day it's the pythonanywhere environment that's creating all sorts of problems for me as others don't seem to be getting it, and my local desktop version is running fine. This is a shame, I quite liked using their interface.

Well, if it works on your desktop and not on the server, and everything else is the same, that's most likely the reason.
Of course it's not really pythonanywhere's fault, it could happen with any cloud provider.
There are workarounds that you might want to consider like rotating proxies, accounts, user agents, throttling the requests, etc.

One last question then - can I use 0.2.21 without cookies? So .get_posts('longtunman', pages=10) works too?

Yes, cookies are optional.

@neon-ninja (Collaborator)

@joebah-joe on PythonAnywhere, 0.2.21 works fine for me with cookies. Without cookies, I get nothing, and facebook says "You must log in first". Perhaps your cookies aren't working for some reason. Check that your cookies.txt looks something like:

.facebook.com	false	/	true	1678314119	datr	REDACTED
.facebook.com	false	/	true	0	m_pixel_ratio	1
.facebook.com	false	/	true	1623018122	fr	REDACTED
.facebook.com	false	/	true	1678314126	sb	REDACTED
.facebook.com	false	/	true	1646778124	c_user	REDACTED
.facebook.com	false	/	true	1646778124	xs	REDACTED
.facebook.com	false	/	true	1615846927	wd	1284x895

@kevinzg @balazssandor dezsoandraskonyvei only shows posts if logged in.
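If in doubt about the file format, here's a quick stdlib sanity check, sketched with placeholder values (MozillaCookieJar raises LoadError if the file isn't valid Netscape cookies.txt format):

```python
import os
import tempfile
from http.cookiejar import MozillaCookieJar

# Hypothetical cookies.txt content; a real file holds your own exported values.
# Fields: domain, domain-specified flag, path, secure, expiry, name, value.
content = (
    "# Netscape HTTP Cookie File\n"
    ".facebook.com\tTRUE\t/\tTRUE\t0\tdatr\tREDACTED\n"
    ".facebook.com\tTRUE\t/\tTRUE\t0\tc_user\tREDACTED\n"
)
path = os.path.join(tempfile.mkdtemp(), "cookies.txt")
with open(path, "w") as f:
    f.write(content)

jar = MozillaCookieJar(path)
# load() raises http.cookiejar.LoadError on a malformed file.
jar.load(ignore_discard=True, ignore_expires=True)
print(sorted(c.name for c in jar))  # ['c_user', 'datr']
```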

@joebah-joe (Author)

on PythonAnywhere, 0.2.21 works fine for me with cookies. Without cookies, I get nothing, and facebook says "You must log in first". Perhaps your cookies aren't working for some reason. Check that your cookies.txt looks something like:

@neon-ninja Okay, so my cookies don't look exactly like that. So I created a new one, and oddly enough it managed to pull posts once or twice, but then it stopped.

However, the few times that it successfully pulled data, Longtunman still gave me NoneType for some of the posts.

Do you mind sharing the script you used for your test on pythonanywhere? Maybe if I mess around with that and get to the level you're at, I might be able to figure out what I've done wrong. Thanks!

@neon-ninja (Collaborator)

@joebah-joe sure:

#!/usr/bin/env python3
from facebook_scraper import get_posts, enable_logging
import logging
enable_logging(logging.DEBUG)
posts = list(get_posts("longtunman", cookies="cookies.txt"))
print(f"{len(posts)} posts, {len([post for post in posts if not post['text']])} missing text")

@joebah-joe (Author)

Thank you very much, I will play around with this to see if I can get the same result as you.

@joebah-joe (Author) commented Mar 11, 2021

@neon-ninja @kevinzg Okay, so I was able to see all the posts in Longtunman using the debug code after passing in the cookies. Finally! Now I know the latest problem is with the portion of my python code that calls the scraper.

I built this function in my code:

def extract_facebook_data(page_name, page_num):
    text_post = ""
    output = ("<PAGE>\n\n<LABEL>" + page_name + "</LABEL>\n\n")
    from facebook_scraper import get_posts
    for post in get_posts(page_name, pages=page_num, cookies="mysite/cookies.txt"):
        output = output + "<POST>\n"
        output = output + "<TEXT>" + (post['text']) + "</TEXT>\n"
        output = output + "<IMG>" + str(post['image']) + "</IMG>\n"
        output = output + "</POST>\n\n"

    output = output + str("</PAGE>\n\n")
    return output

And that's where the trouble starts: I get a NoneType from post['text'] and everything blows up.

Have I called the scraper wrong? I am so close to fixing this problem I can smell it!

@neon-ninja (Collaborator)

@joebah-joe It's possible for post['text'] to be None, as sometimes people post images or videos with no text attached. So you need to handle that case. You could handle it the same way you handled post['image'] being None - coerce it to a string with the str function, i.e. str(post['text']). Alternatively, if you want to put something else instead of the string "None", you could do something like post['text'] or 'no text found'
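For example (a sketch with a hypothetical post dict standing in for real scraper output):

```python
# Hypothetical post dict; real ones come from get_posts().
post = {"text": None, "image": None}

# Option 1: coerce to string, matching how post['image'] was handled.
line1 = "<TEXT>" + str(post["text"]) + "</TEXT>\n"   # gives '<TEXT>None</TEXT>\n'

# Option 2: substitute a placeholder instead of the literal string 'None'.
line2 = "<TEXT>" + (post["text"] or "no text found") + "</TEXT>\n"

print(line1, line2)
```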

@joebah-joe (Author) commented Mar 12, 2021

@neon-ninja Hi - so after some debugging work, I discovered that the get_posts function basically returned nothing.

I then went back to your code with the list function and checked the variable that was returned. It's a list with a length of zero.

code:

posts = list(get_posts("longtunman", pages=10, cookies="cookies.txt"))
print(posts)
print(type(posts))
print(len(posts))

output:
[]
<class 'list'>
0

However, turning the debugger on (see attached debuggeroutput.txt file), I was able to see that all the posts exist and could have been extracted.

It seems to be a similar issue to the 'dezsoandraskonyvei' page, but this happens on all the pages I tried, not just 'longtunman'.

debuggeroutput.txt

However, when I removed the cookies.txt from the get_posts call, print(len(posts)) showed 32 items. But when I looked at the debugger log, it was getting 'raw posts' on literally every page - which is to be expected, as there were no cookies.

Has the get_posts function somehow bugged out when I pass the cookies as an argument? With cookies it can extract all the data but can't retrieve it; without cookies, it can retrieve it but gets raw posts. I'm not very good at looking through the debug log, so I'm not sure what to look for. Thanks!

@neon-ninja (Collaborator)

Hi @joebah-joe - there must be something specific to your cookies file that is inducing facebook to serve you different HTML, as this works fine for me on PythonAnywhere with cookies. Unfortunately there's no HTML in your debug output. Try adding logger.debug(self.response.html) in the page_iterators.py file around line 80 (near the existing logger.debug("The page url is: %s", self.response.url) line). You'll need to git clone the repository in your PythonAnywhere bash terminal to do this.

@joebah-joe (Author) commented Mar 12, 2021

@neon-ninja Ah ha - you may be right, and that would explain why get_posts isn't pulling anything for me even though the scraper itself is working perfectly fine.

I'm a bit new at this, but can't I just add that logger line in /home/user/.virtualenvs/myvirtualenv/lib/python3.8/site-packages/page_iterators.py?

Also, I did notice that my cookies are in a different format from yours. Yours look like this according to your post:

.facebook.com false / true 1678314119 datr REDACTED
.facebook.com false / true 0 m_pixel_ratio 1
.facebook.com false / true 1623018122 fr REDACTED
.facebook.com false / true 1678314126 sb REDACTED
.facebook.com false / true 1646778124 c_user REDACTED
.facebook.com false / true 1646778124 xs REDACTED
.facebook.com false / true 1615846927 wd 1284x895

Mine look like this:

.facebook.com false / true 0 datr REDACTED
.facebook.com false / true 0 wd REDACTED
.facebook.com false / true 0 sb REDACTED
.facebook.com false / true 0 c_user REDACTED
.facebook.com false / true 0 xs REDACTED
.facebook.com false / true 0 fr REDACTED
.facebook.com false / true 0 spin REDACTED

So I have spin, and you have m_pixel_ratio... could that have been a cause?

@neon-ninja (Collaborator) commented Mar 12, 2021

@joebah-joe

can't just add that logger line in /home/user/.virtualenvs/myvirtualenv/lib/python3.8/site-packages/page_iterators.py?

You could, but I don't think that's quite the right path; it would probably have facebook_scraper in it.

So I have spin, and you have m_pixel_ratio

If I swap out m_pixel_ratio for spin it still works fine for me, so that's likely unrelated.

There's still no HTML in your last comment; those are URLs. Try logger.debug(self.response.html.html) instead.

@joebah-joe (Author) commented Mar 12, 2021

@neon-ninja

sorry, my bad - I wrote the path wrong. This was the full path:

/home/user/.virtualenvs/myvirtualenv/lib/python3.8/site-packages/facebook_scraper/page_iterators.py

just to make sure, I put this:

logger.debug("start of the self response html portion")
logger.debug(self.response.html)
logger.debug("end of the self response html portion")

And ran it again. This was the result (see log.txt).

If you search for "start of the self response" in the log file, it looks like each time it's only pulling a very small amount of HTML, which is basically just the URL again. I presume yours didn't look like this? It should have been the full HTML page, and maybe that's what's messing up the get_posts function?

log.txt

@neon-ninja (Collaborator) commented Mar 12, 2021

@joebah-joe self.response.html is a different object (the parent class) to self.response.html.html (the actual HTML) - try the latter. Alternatively, try raw_page.html.

@joebah-joe (Author)

@neon-ninja okay, that works better - you can search for "start of the raw response html portion" in the attached log raw html.txt and see that we now have a big block of HTML.

I have no idea if this makes sense or not, but it does seem to contain all the content needed. Perhaps the HTML formatting is off?

log raw html.txt

@neon-ninja (Collaborator) commented Mar 12, 2021

The scraper is expecting elements like <article class="_55wo _5rgr _5gh8 _3drq async_like", but instead you're getting elements like <div class="_55wo _5rgr _5gh8 _3drq async_like" - it's like your cookies are telling facebook you don't support HTML5.

@kevinzg any ideas?

@joebah-joe in theory, changing the selector at

raw_posts = raw_page.find('article')

to something like raw_posts = raw_page.find('article[data-ft],div.async_like[data-ft]') should work for you, but I don't have any way of testing that.
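Here's an offline illustration of why the widened selector should help, using the stdlib parser to stand in for requests_html's CSS selectors (class names taken from the snippet above):

```python
from html.parser import HTMLParser

# Counts post containers the way the widened selector would: either an
# <article> with a data-ft attribute, or a <div> with both the async_like
# class and a data-ft attribute.
class PostFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.posts = 0

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "data-ft" in a and (
            tag == "article"
            or (tag == "div" and "async_like" in a.get("class", ""))
        ):
            self.posts += 1

html = (
    '<article data-ft="{}"></article>'          # HTML5-style container
    '<div class="_55wo async_like" data-ft="{}"></div>'  # div-style container
    '<div class="other"></div>'                 # unrelated div, not counted
)
p = PostFinder()
p.feed(html)
print(p.posts)  # 2
```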

@joebah-joe (Author) commented Mar 12, 2021

The scraper is expecting elements like <article class="_55wo _5rgr _5gh8 _3drq async_like", but instead you're getting elements like <div class="_55wo _5rgr _5gh8 _3drq async_like" - it's like your cookies are telling facebook you don't support HTML5.

@kevinzg any ideas?

@joebah-joe in theory, changing the selector at

raw_posts = raw_page.find('article')

to something like raw_posts = raw_page.find('article[data-ft],div.async_like[data-ft]') should work for you, but I don't have any way of testing that.

Okay, looks like we've narrowed down the problem to the page format. I'll also try the code you mentioned in the iterator...

@kevinzg (Owner) commented Mar 12, 2021

No idea why it is sending divs instead of articles.

Here are some suggestions to get more consistent results:

  • Set your browser language to US English
  • Set the user agent of the scraper to be the same as the one of your browser
  • Use the same IP address or an IP address from the same country as your server.
    If you have SSH access to your server you can use it as a proxy, it's pretty easy to do, search for SSH SOCKS5.
    You can use the FoxyProxy extension to only use the proxy with Facebook if you want.
  • Use your browser's network inspector to see if there's anything weird in the requests to Facebook.
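For the SSH SOCKS5 point, a sketch (the tunnel command, host, and port are placeholders; actually routing requests through the tunnel needs the requests[socks] extra installed):

```python
# After opening a tunnel with e.g.:  ssh -D 1080 -N user@your-server
# (placeholder host), HTTP traffic can be routed through local port 1080.
# 'socks5h' (rather than 'socks5') makes DNS resolution happen proxy-side too.
proxies = {
    "http": "socks5h://localhost:1080",
    "https": "socks5h://localhost:1080",
}
# Usage would then look like, with the requests[socks] extra installed:
#   requests.get("https://m.facebook.com", proxies=proxies)
print(proxies["https"])
```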

@joebah-joe (Author) commented Mar 12, 2021

@kevinzg

Hi - thanks for the feedback. I will try those.

Out of curiosity - the debug we put in page_iterators.py seems to show that your scraper was already able to extract all the content on each page with no problems. However, did it still need the HTML5 structure to parse the data from individual posts for the get_posts function? Just wondering how the process works. Thanks!

@joebah-joe (Author)

@neon-ninja @kevinzg

Well, imagine that - it finally worked!

Replacing raw_posts = raw_page.find('article') with raw_posts = raw_page.find('article[data-ft],div.async_like[data-ft]'), I was finally able to extract the pages from longtunman consistently. Other pages work too!

It's a bummer that I have to alter the scraper code to fix this, as I'd imagine it won't be helpful to anyone who's not having the same HTML5 problem as me (I seem to be a rare occurrence here!), so this line won't be incorporated. But at least I hope I contributed something to this awesome project. Maybe some unlucky soul will run into the same issue as me and you can just direct them to this post to fix the problem.

Thank you both for helping. There was no way I could have fixed this by myself. I learnt a lot (if you can't tell already, I'm a beginner at the whole Python development thing, though I do have some experience coding in other languages). I guess this 'bug' can be closed now.

Thanks again to you both - I know you guys put in a lot of time to help me out, so I really appreciate it!

@kevinzg (Owner) commented Mar 12, 2021

@joebah-joe Good to hear that!

Actually, that change would do very little harm, so I think it's worth adding.

About your previous question: the scraper content you see in the log is extracted using a very simple method (print every text node), but at that point it doesn't know what's part of a post and what isn't, or where each post starts and ends; there's no structure, just plain text.
To structure the text into post objects/dictionaries, the scraper needs to look more carefully into what each node means. For example, one rule was that an article node contains a post; now we know that's not always the case.

@joebah-joe (Author)

@joebah-joe Good to hear that!

Actually, that change would do very little harm, so I think it's worth adding.

About your previous question: the scraper content you see in the log is extracted using a very simple method (print every text node), but at that point it doesn't know what's part of a post and what isn't, or where each post starts and ends; there's no structure, just plain text.
To structure the text into post objects/dictionaries, the scraper needs to look more carefully into what each node means. For example, one rule was that an article node contains a post; now we know that's not always the case.

Ah, so that's how it works. Thank you sir!

neon-ninja added a commit to neon-ninja/facebook-scraper that referenced this issue Mar 14, 2021
neon-ninja added a commit to neon-ninja/facebook-scraper that referenced this issue Mar 14, 2021
@kevinzg (Owner) commented Mar 15, 2021

The fix has been released in v0.2.23.
