craigslist rss requests fail with 403 error, but wget and browser succeed. #168
Note: I filed this issue first with rss2email, but the maintainer states it is a feedparser issue.
I have duplicated this on separate machines in different physical locations.
'r2e run' fails fetching the feed with a 403 error. However, the URL loads just fine in wget and in any web browser, so it is not IP related. Proof (steps to reproduce) below.
Using r2e version 3.9, from ubuntu repo, and also master from github/rss2email.
All craigslist.org feed URLs have been failing since approx May 9. I notified Craig (of craigslist) and he replied that he sent it to his eng team. On May 17, the feeds started working again and I thought the problem was resolved, but by May 18 the 403s were back, and they continue. Prior to May 9, the feeds had been working fine for years.
I also tried modifying the USER_AGENT string in feed.py to eg 'Mozilla/5.0' and also omitting the string (to use feedparser default) but no change.
This seems to be a server-side issue since my installation was working well until May 9, however it is very interesting that wget works when r2e does not, and indicates there must be a client-side way to achieve a correct fetch.
I had initially thought the problem to likely be related to too many requests in a given time interval, however I tried with a brand new rss2email install on a remote server and it failed on the very first request, as shown below.
Anyway, I hope we can get it working again.
hmm, running your script I'm seeing different results on different machines. on my laptop at home, I get a failure:
It works correctly from another machine at home (same public IP) and fails on a remote datacenter server (different public IP).
home laptop (fails): Python 3.5.2, Ubuntu 16.04.1 LTS, feedparser 5.2.1
any ideas why it might always succeed on the home desktop and fail on the other two? It's kinda interesting that the 16.04.* machines fail, and the 14.04 succeeds.
What might I try to narrow this down further?
@samuelclay yeah I originally thought they must have banned my IP, but as I stated above it actually works fine from another machine in my home, using the same router and public IP. So from this I conclude there must be something different in the stack, that or somehow CL is fingerprinting individual devices...
any thoughts or things to try/test?
a thought: maybe they are fingerprinting based on the SSL/TLS cipher being used.
I wanted to run wireshark to check the exact http traffic of wget vs feedparser, but soon realized I couldn't due to https. But maybe it is that very https handshake that is causing the 403....
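One way to look at the cipher side without wireshark is to ask Python which cipher suites its `ssl` module would offer. This is only a rough sketch: `SSLContext.get_ciphers()` needs Python 3.6 or newer (so not the 3.5.x / 2.7.x interpreters mentioned in this thread), the helper names `offered_ciphers` and `weak_ciphers` are made up for illustration, and the default context may not match whatever context feedparser actually builds.

```python
import ssl


def offered_ciphers():
    """Return the cipher names enabled in Python's default client context."""
    ctx = ssl.create_default_context()
    return [cipher["name"] for cipher in ctx.get_ciphers()]


def weak_ciphers(names):
    """Pick out 3DES suites; OpenSSL names them with 'DES-CBC3'."""
    return [name for name in names if "3DES" in name or "DES-CBC3" in name]


if __name__ == "__main__":
    names = offered_ciphers()
    print(ssl.OPENSSL_VERSION)
    print("ciphers offered:", len(names))
    for name in weak_ciphers(names):
        print("3DES suite offered:", name)
```

Running this on a working and a non-working machine and diffing the output might show whether the two Python/OpenSSL builds offer different lists.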
ok, I made requests to an http server running on local laptop (that exhibits problem) and ran wireshark to capture the requests.
feedparser: (from laptop with problem)
wget: (from laptop with problem)
feedparser: (from machine without problem)
The headers are sent in different order in the feedparser request from laptop vs the other machine, but I ran both through "sort" and then diff, and there are no differences. In other words, exact same headers are being sent, except for header order -- which shouldn't be important.
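For anyone who wants to repeat the sort-and-diff comparison, here is a minimal sketch of the same idea in Python. The two captures are made-up placeholders (not the real wireshark output from this thread), and `normalize_headers` / `header_diff` are hypothetical helper names.

```python
def normalize_headers(raw_request):
    """Parse the header lines of a raw HTTP request into a dict keyed by lowercase name."""
    headers = {}
    lines = raw_request.strip().splitlines()
    for line in lines[1:]:  # lines[0] is the request line, e.g. "GET / HTTP/1.1"
        if not line.strip():
            break  # a blank line ends the header section
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def header_diff(raw_a, raw_b):
    """Return headers whose values differ between two requests, ignoring order."""
    a, b = normalize_headers(raw_a), normalize_headers(raw_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Placeholder captures: same headers, different order.
capture_a = """GET /feed HTTP/1.1
Host: localhost:8000
User-Agent: example-agent/1.0
Accept-Encoding: gzip, deflate
"""
capture_b = """GET /feed HTTP/1.1
Accept-Encoding: gzip, deflate
User-Agent: example-agent/1.0
Host: localhost:8000
"""

if __name__ == "__main__":
    print(header_diff(capture_a, capture_b) or "no header differences")
```

An empty result means the two requests carry identical headers apart from ordering, which is what was observed here.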
Since I don't see any diffs at http request, I become more suspicious about the TLS/SSL handshake.
Yup, pretty sure @dan-da is spot on.
I set up an nginx proxy on the same host to forward HTTP requests to craigslist over HTTPS. The test script grabs the content perfectly, with no 403.
@kurtmckee can you take a look at how you are making SSL requests, or what library you are using? Something is broken.
@ddn, that's awesome work! However, I cannot recreate this problem on my machine and am not willing to pursue this further.
I will note that Halcyon is using custom HTTP code that will be ripped out after the 6.x release series and replaced with something more standard, like requests.
my best guess is that maybe it has to do with the default tls ciphers used by some library. Probably in the python stack rather than at OS level. I don't have any time to investigate the ssl connection right now.
I will note that the working machine is Ubuntu 14.04.x with Python 3.5.3 and both non-working machines are Ubuntu 16.04.x with Python 3.5.2. Perhaps Python 3.5.3 would fix it.
@ddn What version of python and OS are you using? btw, nice hack with the nginx proxy.
Python 2.7.12 on Ubuntu 16.04.6 LTS exhibits the issue.
macOS 10.14.5 Python 2.7.10 does not.
And in fact, I can VPN through the host where the issue occurs, and the mac host still works.
So it's definitely not an IP thing.
I understand the issue is kind of esoteric, but "I'm not going to pursue this" is a weird attitude to take, imho.
This is the report from https://www.howsmyssl.com/:
Insecure Cipher Suites
Your client supports cipher suites that are known to be insecure:
TLS_DHE_DSS_WITH_3DES_EDE_CBC_SHA: This cipher suite uses 3DES which is vulnerable to the Sweet32 attack but was not configured as a fallback in the ciphersuite order.
TLS_DHE_RSA_WITH_3DES_EDE_CBC_SHA: This cipher suite uses 3DES which is vulnerable to the Sweet32 attack but was not configured as a fallback in the ciphersuite order.
TLS_DH_DSS_WITH_3DES_EDE_CBC_SHA: This cipher suite uses 3DES which is vulnerable to the Sweet32 attack but was not configured as a fallback in the ciphersuite order.
TLS_DH_RSA_WITH_3DES_EDE_CBC_SHA: This cipher suite uses 3DES which is vulnerable to the Sweet32 attack but was not configured as a fallback in the ciphersuite order.
TLS_ECDHE_ECDSA_WITH_3DES_EDE_CBC_SHA: This cipher suite uses 3DES which is vulnerable to the Sweet32 attack but was not configured as a fallback in the ciphersuite order.
TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA: This cipher suite uses 3DES which is vulnerable to the Sweet32 attack but was not configured as a fallback in the ciphersuite order.
TLS_ECDH_ECDSA_WITH_3DES_EDE_CBC_SHA: This cipher suite uses 3DES which is vulnerable to the Sweet32 attack but was not configured as a fallback in the ciphersuite order.
TLS_ECDH_RSA_WITH_3DES_EDE_CBC_SHA: This cipher suite uses 3DES which is vulnerable to the Sweet32 attack but was not configured as a fallback in the ciphersuite order.
I added one line to feedparser to print the whole output before it parses it, and then I just used the few lines of test code above to grab that site. I can post code later if that’s not specific enough. It should be pretty easy for someone with the python chops to clean up the list of ciphers offered, but unfortunately that’s not me or I’d PR it.…
On Jul 18, 2019, at 11:17 AM, dan-da wrote: @ddn how did you generate/view the report? Can you post a script? I'd like to try it on my working and non-working machines to see diffs.
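As a rough illustration of what "cleaning up the list of ciphers offered" could look like on the client side, here is a sketch using only the standard library. It is not a patch to feedparser, `context_without_3des` is a hypothetical helper name, the cipher string is just one reasonable choice, and whether this actually avoids the 403 from craigslist is untested.

```python
import ssl


def context_without_3des():
    """Build a default client context, then drop 3DES and other weak suites."""
    ctx = ssl.create_default_context()
    # OpenSSL cipher-list syntax: start from HIGH, then exclude weak families.
    ctx.set_ciphers("HIGH:!3DES:!aNULL:!eNULL:!MD5")
    return ctx


if __name__ == "__main__":
    ctx = context_without_3des()
    # get_ciphers() requires Python 3.6+.
    for cipher in ctx.get_ciphers():
        print(cipher["name"])
```

A context like this can be passed to `urllib.request.urlopen(url, context=ctx)` to fetch the feed text, which could then be handed to feedparser as a string.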
Results from howsmyssl.com
Ubuntu 14.04 machine - works.
Ubuntu 16.04 machine - fails.
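A sketch of one way to produce and compare such reports from Python itself, so the result reflects Python's TLS stack rather than a browser's. Assumptions: the JSON endpoint `https://www.howsmyssl.com/a/check` and the field names `given_cipher_suites` and `tls_version` are from my recollection of the howsmyssl API and should be verified; `fetch_report` and `compare_reports` are hypothetical helper names.

```python
import json
import urllib.request

API_URL = "https://www.howsmyssl.com/a/check"  # assumed JSON endpoint


def fetch_report(url=API_URL):
    """Fetch the TLS report for this Python interpreter's ssl stack."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def compare_reports(first, second):
    """Summarize differences in offered cipher suites and TLS version."""
    a = set(first.get("given_cipher_suites", []))
    b = set(second.get("given_cipher_suites", []))
    return {
        "only_in_first": sorted(a - b),
        "only_in_second": sorted(b - a),
        "tls_versions": (first.get("tls_version"), second.get("tls_version")),
    }


if __name__ == "__main__":
    # Run on each machine and save the output; load both files to compare.
    print(json.dumps(fetch_report(), indent=2, sort_keys=True))
```

Saving the output on the 14.04 and 16.04 machines and feeding both into `compare_reports` would show exactly which suites differ.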
This may be relevant/helpful.