
craigslist rss requests fail with 403 error, but wget and browser succeed. #168

Closed
dan-da opened this issue May 28, 2019 · 29 comments


@dan-da commented May 28, 2019

Note: I filed this issue first with rss2email, but the maintainer states it is a feedparser issue.


I have duplicated this on separate machines in different physical locations.

'r2e run' fails to fetch the feed with a 403 error. However, the URL loads just fine in wget and in any web browser, so it is not IP related. Proof (steps to reproduce) below.

Using r2e version 3.9 from the Ubuntu repo, and also master from github/rss2email.

All craigslist.org feed URLs have been failing since approximately May 9. I notified Craig (of craigslist) and he replied that he sent it to his engineering team. On May 17 the feeds started working again and I thought the problem was resolved, but by May 18 the 403s were back, and they continue. Prior to May 9, the feeds had been working fine for years.

I also tried modifying the USER_AGENT string in feed.py, e.g. to 'Mozilla/5.0', and also omitting the string (to use the feedparser default), but saw no change.
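
(For reference, feedparser also accepts a per-call user-agent override, so the same experiment can be done without editing rss2email. A minimal sketch, assuming feedparser 5.x, against the same failing feed:

import feedparser

# Hypothetical UA experiment: the agent argument replaces feedparser's
# default User-Agent header for this single request.
url = 'https://sfbay.craigslist.org/search/sss?format=rss&query=sw5548&searchNearby=1'
d = feedparser.parse(url, agent='Mozilla/5.0')
print(d.status)

Either way, I still get the same 403.)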

This seems to be a server-side issue, since my installation was working well until May 9. However, it is very interesting that wget works when r2e does not, which indicates there must be a client-side way to achieve a correct fetch.

I had initially thought the problem was likely related to too many requests in a given time interval; however, I tried with a brand-new rss2email install on a remote server and it failed on the very first request, as shown below.

Anyway, I hope we can get it working again.

$ r2e add cl1 'https://sfbay.craigslist.org/search/sss?format=rss&query=sw5548&searchNearby=1' <email>

$ r2e run
HTTP status 403 fetching feed cl1 (https://sfbay.craigslist.org/search/sss?format=rss&query=sw5548&searchNearby=1 -> [EMAIL])

$ wget -O feed.xml "https://sfbay.craigslist.org/search/sss?format=rss&query=sw5548&searchNearby=1"
--2019-05-27 07:49:19--  https://sfbay.craigslist.org/search/sss?format=rss&query=sw5548&searchNearby=1
Resolving sfbay.craigslist.org (sfbay.craigslist.org)... 208.82.238.18
Connecting to sfbay.craigslist.org (sfbay.craigslist.org)|208.82.238.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/rss+xml]
Saving to: ‘feed.xml’

feed.xml                              [ <=>                                                         ]   1.40K  --.-KB/s    in 0s      

2019-05-27 07:49:20 (55.3 MB/s) - ‘feed.xml’ saved [1433]

$ head -n 6 feed.xml 
<?xml version="1.0" encoding="UTF-8"?>

<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns="http://purl.org/rss/1.0/"
 xmlns:enc="http://purl.oclc.org/net/rss_2.0/enc#"
@dan-da (Author) commented Jun 20, 2019

Still failing. Any thoughts on this?

@kurtmckee (Owner) commented Jul 1, 2019

Hi @dan-da! I imported feedparser and ran feedparser.parse(<url>), and it pulled the feed without any issue. I made no modifications to the user agent or other settings, so I'm not able to recreate the problem you're describing.

@kurtmckee closed this Jul 1, 2019

@dan-da (Author) commented Jul 1, 2019

@kurtmckee thanks for looking into this. Can you please attach the script you made, so I can test it in my environment?

@kurtmckee (Owner) commented Jul 2, 2019
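
(The script itself is not preserved in this capture; judging from the pprint-style output in the next comment, a minimal equivalent would be:

import pprint

import feedparser

# fetch and parse the feed, then dump the entire result dict,
# including the HTTP status and response headers
url = 'https://sfbay.craigslist.org/search/sss?format=rss&query=sw5548&searchNearby=1'
pprint.pprint(feedparser.parse(url))
)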

@dan-da (Author) commented Jul 2, 2019

Hmm, running your script, I'm seeing different results on different machines. On my laptop at home, I get a failure:

$ python3  /tmp/test.py 
{'bozo': 1,
 'bozo_exception': SAXParseException('syntax error',),
 'encoding': 'iso-8859-1',
 'entries': [],
 'feed': {},
 'headers': {'Content-Length': '117',
             'Set-Cookie': 'cl_b=OAqIItGc6RGf7nx9TSR2vwARiTA;path=/;domain=.craigslist.org;expires=Fri, '
                           '01-Jan-2038 00:00:00 GMT',
             'Strict-Transport-Security': 'max-age=86400'},
 'href': 'https://sfbay.craigslist.org/search/sss?format=rss&query=sw5548&searchNearby=1',
 'namespaces': {},
 'status': 403,
 'version': ''}

It works correctly from another machine at home (same public IP) and fails on a remote datacenter server (different public IP).

home laptop (fails): Python 3.5.2, Ubuntu 16.04.1 LTS, feedparser 5.2.1
datacenter server (fails): Python 3.5.2, Ubuntu 16.04.2 LTS, feedparser 5.1.3 and 5.2.1
home desktop (succeeds): Python 3.4.3 and/or 3.5.3, Ubuntu 14.04.4 LTS, feedparser 5.2.1

Any ideas why it might always succeed on the home desktop and fail on the other two? It's kind of interesting that the 16.04.x machines fail while the 14.04 one succeeds.

What might I try to narrow this down further?

@samuelclay commented Jul 2, 2019

I'll say, from experience running a news reader, that CL likes to ban IPs left and right. The bans are temporary, last about a month, and stem from hitting their servers too often. The remote datacenter server was probably used by a bot that hit CL a bunch.

@dan-da (Author) commented Jul 6, 2019

@samuelclay yeah, I originally thought they must have banned my IP, but as I stated above, it actually works fine from another machine in my home, using the same router and public IP. From this I conclude there must be something different in the stack; that, or somehow CL is fingerprinting individual devices...

Any thoughts or things to try/test?

@ddn commented Jul 9, 2019

I can reproduce the exact same issue with the test script. I also came here looking for a solution to the rss2email issue. feedparser is evidently being fingerprinted and blocked somehow, and it's not the user agent.

@samuelclay commented Jul 9, 2019

It has to be the user agent or the IP address (or both) that they're using as fingerprints; I think it's unlikely to be anything else, though the exact URL or cookies come to mind.

@ddn commented Jul 9, 2019

Not sure how you figure. If you spoof feedparser's UA from the same IP, it doesn't result in a 403; conversely, if you change the feedparser UA, it doesn't resolve the 403. So it's definitely not the user agent, the IP, or both.

@dan-da (Author) commented Jul 10, 2019

It's also not the exact URL, because I use the exact same test script with the same URL from a machine that works and one that doesn't, both behind the same public IP. And no cookies are being sent at all.

@dan-da (Author) commented Jul 10, 2019

A thought: maybe they are fingerprinting based on the SSL/TLS ciphers being used.

I wanted to run Wireshark to compare the exact HTTP traffic of wget vs. feedparser, but soon realized I couldn't, due to HTTPS. But maybe it is that very TLS handshake that is causing the 403...
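
One cheap check, short of full packet captures: ask the Python stack directly what it negotiates with the server. A small untested sketch using only the standard library, printing the negotiated protocol version and cipher suite, for comparison between the working and failing machines:

import socket
import ssl

host = 'sfbay.craigslist.org'
ctx = ssl.create_default_context()
with socket.create_connection((host, 443)) as raw_sock:
    # wrap_socket performs the TLS handshake; server_hostname enables SNI
    with ctx.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
        print(tls_sock.version())  # e.g. 'TLSv1.2'
        print(tls_sock.cipher())   # (cipher name, protocol, secret bits)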

@dan-da (Author) commented Jul 10, 2019

OK, I made requests to an HTTP server running on the local laptop (the one that exhibits the problem) and ran Wireshark to capture the requests.

feedparser: (from laptop with problem)

GET /tmp/feed.xml HTTP/1.1
A-Im: feed
Host: random
Accept-Encoding: gzip, deflate
Connection: close
Accept: application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1
User-Agent: UniversalFeedParser/5.2.1 +https://code.google.com/p/feedparser/

HTTP/1.1 200 OK
...

wget: (from laptop with problem)

GET /tmp/feed.xml HTTP/1.1
User-Agent: Wget/1.17.1 (linux-gnu)
Accept: */*
Accept-Encoding: identity
Host: random
Connection: Keep-Alive

HTTP/1.1 200 OK
...

feedparser: (from machine without problem)

GET /tmp/feed.xml HTTP/1.1
Accept: application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1
User-Agent: UniversalFeedParser/5.2.1 +https://code.google.com/p/feedparser/
Host: random
Connection: close
A-Im: feed
Accept-Encoding: gzip, deflate

HTTP/1.1 200 OK
...

The headers are sent in a different order in the feedparser request from the laptop vs. the other machine, but I ran both through sort and then diff, and there are no differences. In other words, exactly the same headers are being sent, apart from header order, which shouldn't matter.

Since I don't see any diffs at the HTTP request level, I'm becoming more suspicious of the TLS/SSL handshake.

@dan-da (Author) commented Jul 10, 2019

Can we please get this issue re-opened since @ddn has the same problem?

@kurtmckee (Owner) commented Jul 10, 2019

No. I'm not able to recreate it on my own system and it smells like a ban issue. I don't have additional recommendations for things to try, but I do understand that this is a very frustrating situation and I'm sorry that I don't have a resolution to offer!

@ddn commented Jul 10, 2019

Yup, pretty sure @dan-da is spot on.

I set up an nginx proxy on the same host to forward plain-HTTP requests to craigslist over HTTPS. The test script then grabs the content perfectly, with no 403.
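
Roughly this shape of config, from memory (directive names may vary with nginx version):

server {
    listen 127.0.0.1:8080;

    location / {
        # terminate plain HTTP locally and re-issue the request over HTTPS,
        # so craigslist sees nginx's TLS handshake instead of Python's
        proxy_pass https://sfbay.craigslist.org;
        proxy_set_header Host sfbay.craigslist.org;
        proxy_ssl_server_name on;
    }
}

Then point the test script at http://127.0.0.1:8080/... instead of the https:// URL.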

@kurtmckee can you take a look at how you are making SSL requests, or what library you are using? Something is broken.

@kurtmckee (Owner) commented Jul 10, 2019

@ddn, that's awesome work! However, I cannot recreate this problem on my machine and am not willing to pursue this further.

I will note that Halcyon is using custom HTTP code that will be ripped out after the 6.x release series and replaced with something more standard, like requests.

@dan-da (Author) commented Jul 10, 2019

My best guess is that it has to do with the default TLS ciphers used by some library, probably in the Python stack rather than at the OS level. I don't have time to investigate the SSL connection right now.

I will note that the working machine is Ubuntu 14.04.x with Python 3.5.3, and both non-working machines are Ubuntu 16.04.x with Python 3.5.2. Perhaps Python 3.5.3 would fix it.

@ddn What version of Python and which OS are you using? BTW, nice hack with the nginx proxy.

@ddn commented Jul 10, 2019

Python 2.7.12 on Ubuntu 16.04.6 LTS exhibits the issue.

macOS 10.14.5 with Python 2.7.10 does not.

And in fact, I can VPN through the host where the issue occurs, and the Mac host still works.

So it's definitely not an IP thing.

I understand the issue is kind of esoteric, but "I'm not going to pursue this" is a weird attitude to take, imho.

@dan-da (Author) commented Jul 10, 2019

So all hosts with the issue so far (3 of them) are running Ubuntu 16.04.x.

@ddn commented Jul 17, 2019

This is the report from https://www.howsmyssl.com/:

Insecure Cipher Suites

Bad

Your client supports cipher suites that are known to be insecure:

TLS_DHE_DSS_WITH_3DES_EDE_CBC_SHA: This cipher suite uses 3DES which is vulnerable to the Sweet32 attack but was not configured as a fallback in the ciphersuite order.

TLS_DHE_RSA_WITH_3DES_EDE_CBC_SHA: This cipher suite uses 3DES which is vulnerable to the Sweet32 attack but was not configured as a fallback in the ciphersuite order.

TLS_DH_DSS_WITH_3DES_EDE_CBC_SHA: This cipher suite uses 3DES which is vulnerable to the Sweet32 attack but was not configured as a fallback in the ciphersuite order.

TLS_DH_RSA_WITH_3DES_EDE_CBC_SHA: This cipher suite uses 3DES which is vulnerable to the Sweet32 attack but was not configured as a fallback in the ciphersuite order.

TLS_ECDHE_ECDSA_WITH_3DES_EDE_CBC_SHA: This cipher suite uses 3DES which is vulnerable to the Sweet32 attack but was not configured as a fallback in the ciphersuite order.

TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA: This cipher suite uses 3DES which is vulnerable to the Sweet32 attack but was not configured as a fallback in the ciphersuite order.

TLS_ECDH_ECDSA_WITH_3DES_EDE_CBC_SHA: This cipher suite uses 3DES which is vulnerable to the Sweet32 attack but was not configured as a fallback in the ciphersuite order.

TLS_ECDH_RSA_WITH_3DES_EDE_CBC_SHA: This cipher suite uses 3DES which is vulnerable to the Sweet32 attack but was not configured as a fallback in the ciphersuite order.

@dan-da (Author) commented Jul 18, 2019

@ddn how did you generate/view the report? Can you post a script? I'd like to try it on my working and non-working machines to see diffs.

@ddn commented Jul 18, 2019
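
The howsmyssl.com site exposes the same report as JSON at /a/check, so you can pull it through the exact Python stack you're testing. A sketch for Python 3 (on Python 2, substitute urllib2); the field names follow the site's documented JSON:

import json
import urllib.request

# fetch the TLS report using this interpreter's own TLS stack,
# i.e. the same one feedparser's urllib-based fetch would use
with urllib.request.urlopen('https://www.howsmyssl.com/a/check') as resp:
    report = json.loads(resp.read().decode('utf-8'))

print(report['rating'])
for suite in report['given_cipher_suites']:
    print(suite)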

@dan-da (Author) commented Jul 18, 2019

Results from howsmyssl.com

Ubuntu 14.04 machine - works.

Your SSL client is Probably Okay.

    Insecure Cipher Suites

        Good. Your client doesn't use any cipher suites that are known to be insecure.

Ubuntu 16.04 machine - fails.

Your SSL client is Bad.

Insecure Cipher Suites

    Bad. Your client supports cipher suites that are known to be insecure: the same eight 3DES (Sweet32-vulnerable) suites listed in @ddn's report above, from TLS_DHE_DSS_WITH_3DES_EDE_CBC_SHA through TLS_ECDH_RSA_WITH_3DES_EDE_CBC_SHA.

@dan-da (Author) commented Jul 18, 2019

@kurtmckee does this narrow it down enough for you?


@ddn commented Jul 22, 2019

Fixed it, @dan-da.

Add these lines to feedparser.py:

import ssl
ssl.PROTOCOL_SSLv23 = ssl.PROTOCOL_TLSv1

They go between these two existing imports:

import warnings

import ssl
ssl.PROTOCOL_SSLv23 = ssl.PROTOCOL_TLSv1

from html.entities import name2codepoint, codepoint2name, entitydefs
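
Why this works, as far as I can tell: reassigning ssl.PROTOCOL_SSLv23 before any connection is made forces a TLSv1-only handshake, which changes the protocol version and cipher-suite list offered in the ClientHello, and with it the fingerprint CL appears to be blocking. Note it also pins every connection in the process to TLS 1.0, so it's a blunt workaround rather than a real fix. If you'd rather not edit feedparser.py itself, the same monkeypatch should work from a wrapper script, since the constant appears to be looked up when the connection is made (untested sketch):

import ssl

# same workaround as the in-file patch, applied before feedparser
# ever opens a connection: offer a TLSv1-only handshake
ssl.PROTOCOL_SSLv23 = ssl.PROTOCOL_TLSv1

import feedparser

url = 'https://sfbay.craigslist.org/search/sss?format=rss&query=sw5548&searchNearby=1'
d = feedparser.parse(url)
print(d.status, len(d.entries))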

@dan-da (Author) commented Jul 23, 2019

@ddn you are a hero! :) I confirm your fix works here. Will you make it a PR?

@ddn commented Jul 23, 2019

PR made. Be gentle, @kurtmckee.
