GitHub - lay-on-rock/scrapy-case-sensitive-headers

Scrapy case-sensitive, alphabetically ordered headers

Project start

pip install -r requirements/dev.txt
# OPTIONAL: If you also want to order Content-Length in the request headers,
# Modify site package
# => twisted/web/_newclient.py
# Change lines 787-790 from 
# self._writeHeaders(transport, networkString("Content-Length": ...))
# to: 
# self._writeHeaders(transport, None)
scrapy crawl test

Project motivations

RFC 7230 (HTTP/1.1) specifies that header field names are case-insensitive (section 3.2) and that the order of header fields with differing names is not significant (section 3.2.2). So if this problem exists for you, the server has been customized to behave more strictly than the standard requires. That said, if you have reason to believe that header order or casing is triggering anti-bot measures, this approach may be worth trying.

Before doing this, you can consider:

proxy rotation, user-agent rotation, spoofing headers/cookies;
setting DOWNLOADER_CLIENT_TLS_METHOD=TLSv1.2, as per Scrapy #4951;
checking whether the request is made over HTTP/2 (h2) and, if so, trying HTTP/2 in Scrapy.
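
The last two options are plain Scrapy settings. A minimal sketch of a settings.py fragment follows, assuming a Scrapy version that ships the HTTP/2 download handler and a Twisted install with HTTP/2 support. The two settings are independent, so they can be tried one at a time, and whether either helps depends on the target site:

```python
# settings.py (sketch; setting names and values as listed in the Scrapy documentation)

# Pin the TLS protocol version (the option discussed in Scrapy #4951).
DOWNLOADER_CLIENT_TLS_METHOD = "TLSv1.2"

# Route https:// requests through Scrapy's HTTP/2 download handler.
DOWNLOAD_HANDLERS = {
    "https": "scrapy.core.downloader.handlers.http2.H2DownloadHandler",
}
```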

If none of these work, you can turn to other modules:

curl_cffi (a Python binding for curl-impersonate), to impersonate browsers' TLS signatures or JA3 fingerprints (see the readings on TLS fingerprinting and HTTP/2 fingerprinting);

concurrent requests through a thread pool, something like this code snippet:

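import collections  # only needed by the optional SortedHTTPAdapter below
import time
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

import requests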

# How many threads are you comfortable with?
THREAD_POOL = 8

session = requests.Session()
# Use this adapter instead, for ordered headers
# class SortedHTTPAdapter(requests.adapters.HTTPAdapter):
#     def add_headers(self, request, **kwargs):
#         request.headers = collections.OrderedDict(
#             ((key, value) for key, value in sorted(request.headers.items()))
#         )
session.mount("https://", requests.adapters.HTTPAdapter(pool_maxsize=THREAD_POOL, max_retries=3, pool_block=True))

def post(url):
    """Function used to make request"""
    return session.post(url, ...)

def download(urls):
    start = datetime.now()
    with ThreadPoolExecutor(max_workers=THREAD_POOL) as executor:
        for response in list(executor.map(post, urls)):
            if 500 <= response.status_code < 600:
                # Server overload, wait
                time.sleep(5)
    duration = datetime.now() - start
    duration_seconds = duration.total_seconds()
    print(f"Completed in {duration_seconds = }")

download(['url1', 'url2', '...'])

httpx, to make HTTP/2 requests.

Project summary

By default, Scrapy normalizes header names to title case:

import scrapy
scrapy.http.Headers({'caCHE-conTROL': 'test'})
# The header name is stored as 'Cache-Control'
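
The normalization above amounts to title-casing each dash-separated word of the header name. A rough stdlib-only sketch of that behavior follows; the function name `normalize_header_name` is hypothetical, and the code imitates the observable result rather than reproducing Scrapy's implementation:

```python
def normalize_header_name(name: str) -> str:
    """Title-case a header name, similar to what Scrapy does to header keys."""
    # str.title() upper-cases the first letter of every run of letters
    # and lower-cases the rest, so dashes act as word separators.
    return name.title()


for raw in ("caCHE-conTROL", "user-agent", "X-REQUESTED-WITH"):
    print(raw, "->", normalize_header_name(raw))
```

Whatever casing the spider supplies is lost at this step, which is why the casing has to be restored further down the stack.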

Anti-bot systems can potentially be case-sensitive about header names. References:

A situation where someone needed a case sensitive header, Scrapy #5910

To preserve a custom casing, this spider overrides the private _caseMappings attribute of Twisted's internal Headers class, which maps a lowercase header name to the casing sent on the wire:

# Declare before request or spider declaration
from twisted.web.http_headers import Headers as TwistedHeaders
TwistedHeaders._caseMappings[b'cache-control'] = b'caCHE-conTROL'
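
The effect of such an override can be pictured as a lookup table that is consulted before the default capitalization. The sketch below is a simplified stdlib-only model, not Twisted's code: `CASE_OVERRIDES`, `dash_capitalize`, and `wire_name` are hypothetical names, and it assumes that a header without an override falls back to capitalizing each dash-separated part.

```python
# Hypothetical override table: lowercase header name -> exact casing to send.
CASE_OVERRIDES = {b"cache-control": b"caCHE-conTROL"}


def dash_capitalize(name: bytes) -> bytes:
    """Default casing: capitalize each dash-separated part."""
    return b"-".join(part.capitalize() for part in name.split(b"-"))


def wire_name(name: bytes) -> bytes:
    """Return the casing to send: an override if one exists, else the default."""
    lowered = name.lower()
    return CASE_OVERRIDES.get(lowered, dash_capitalize(lowered))


for header in (b"Cache-Control", b"user-agent"):
    print(header, "->", wire_name(header))
```

Because the table is keyed by the lowercase name, the override applies no matter how the header was spelled when the request was built.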

References:

Scrapy capitalizes headers for request, Scrapy #2711

Current issues

Making the Content-Length header respect the ordering currently requires modifying the Twisted code itself, which is a lot of work; there should be an easier workaround.
