lay-on-rock/scrapy-case-sensitive-headers


Scrapy case-sensitive, alphabetically-ordered headers

Project start

pip install -r requirements/dev.txt
# OPTIONAL: if you also want Content-Length to respect the ordering in the
# request headers, modify the installed Twisted package:
# => twisted/web/_newclient.py
# Change lines 787-790 (line numbers vary by Twisted version) from
#     self._writeHeaders(transport, networkString("Content-Length: ..."))
# to:
#     self._writeHeaders(transport, None)
scrapy crawl test

Project motivations

It is true that section 3.2.2 of RFC 7230, the HTTP/1.1 message-syntax standard, states that the order of header fields with different names is not significant, and that field names are case-insensitive. So if this problem exists for you, it means the server has been customized in a way that ignores the RFC. That being said, if you have reason to believe that header order or casing is triggering anti-bot measures, this might be worth a try.

If this approach doesn't work, you can turn to other modules:

  • curl_cffi, a Python binding of curl-impersonate, to impersonate browsers' TLS signatures or JA3 fingerprints (see reading on TLS fingerprinting and HTTP/2 fingerprinting);
  • concurrent requests with a thread pool, something like this snippet:
    from concurrent.futures import ThreadPoolExecutor
    from datetime import datetime
    import collections
    import time

    import requests

    # How many threads are you comfortable with?
    THREAD_POOL = 8

    session = requests.Session()
    # Use this adapter instead, for alphabetically-ordered headers:
    # class SortedHTTPAdapter(requests.adapters.HTTPAdapter):
    #     def add_headers(self, request, **kwargs):
    #         request.headers = collections.OrderedDict(
    #             sorted(request.headers.items())
    #         )
    session.mount(
        "https://",
        requests.adapters.HTTPAdapter(
            pool_maxsize=THREAD_POOL, max_retries=3, pool_block=True
        ),
    )

    def post(url):
        """Make a single request."""
        return session.post(url, ...)

    def download(urls):
        start = datetime.now()
        with ThreadPoolExecutor(max_workers=THREAD_POOL) as executor:
            for response in executor.map(post, urls):
                if 500 <= response.status_code < 600:
                    # Server overload, wait before continuing
                    time.sleep(5)
        duration_seconds = (datetime.now() - start).total_seconds()
        print(f"Completed in {duration_seconds = }")

    download(['url1', 'url2', '...'])
  • httpx, to make HTTP/2 requests.
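The SortedHTTPAdapter sketched in the comments above can be written out in full like this (a sketch; add_headers is called by HTTPAdapter.send() just before the request goes out, but the final on-the-wire order also depends on urllib3 internals, so verify against a service that echoes raw headers):

```python
import collections

import requests


class SortedHTTPAdapter(requests.adapters.HTTPAdapter):
    """Transport adapter that re-orders request headers alphabetically."""

    def add_headers(self, request, **kwargs):
        # Replace the prepared request's headers with a sorted OrderedDict,
        # so they are written out in alphabetical order.
        request.headers = collections.OrderedDict(
            sorted(request.headers.items())
        )


session = requests.Session()
session.mount("https://", SortedHTTPAdapter())
```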

Project summary

Scrapy natively normalizes header names to title case:

import scrapy
list(scrapy.http.Headers({'caCHE-conTROL': 'test'}).keys())
# [b'Cache-Control']
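Under the hood, Scrapy normalizes a header key with (essentially) .title() before storing it, which is why arbitrary casing cannot survive. A quick illustration that needs no Scrapy at all:

```python
# Title-casing collapses every casing variant to the same canonical form,
# which is effectively what Scrapy's header normalization does.
for raw in ('caCHE-conTROL', 'CACHE-CONTROL', 'cache-control'):
    print(raw, '->', raw.title())
# all three print '-> Cache-Control'
```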

Anti-bot systems can potentially be case-sensitive about headers.

To handle case sensitivity, this spider uses the _caseMappings attribute of Twisted's internal Headers class:

# Declare before request or spider declaration
from twisted.web.http_headers import Headers as TwistedHeaders
TwistedHeaders._caseMappings[b'cache-control'] = b'caCHE-conTROL'


Current issues

  • Modifying the Twisted code itself so that the Content-Length header respects the ordering is a lot of work; there should be an easier workaround.
