A middleware layer for Scrapy that detects CAPTCHA tests and solves them


I must be a robot then


This middleware checks for a CAPTCHA test and tries to solve it. It is open source so as to prevent slaves from being forced to solve CAPTCHA tests.


Note that this program relies on Tesseract v4, which is available on Ubuntu 18.04.

Install Tesseract and the language file

sudo apt-get install tesseract-ocr
sudo mkdir -p /usr/local/share/tessdata
sudo mv eng.traineddata /usr/local/share/tessdata
sudo chmod a+w /usr/local/share/tessdata/eng.traineddata
export TESSDATA_PREFIX=/usr/local/share/tessdata
which tesseract

Make sure to add the TESSDATA_PREFIX export to your bash profile (for example ~/.bashrc) so that it persists across shell sessions.
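The installation steps above can be sanity-checked from Python. The following is a minimal sketch, not part of captchaMiddleware: the function name `check_tesseract_setup` is hypothetical, and it only assumes the paths and the `eng.traineddata` file name used in the commands above.

```python
import os
import shutil


def check_tesseract_setup(env=None):
    """Return a list of problems found; an empty list means the setup looks fine."""
    env = os.environ if env is None else env
    problems = []

    # Equivalent of `which tesseract`: is the executable on PATH?
    if shutil.which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")

    # TESSDATA_PREFIX should point at the directory holding eng.traineddata.
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        problems.append("TESSDATA_PREFIX is not set")
    elif not os.path.isfile(os.path.join(prefix, "eng.traineddata")):
        problems.append("eng.traineddata not found in " + prefix)

    return problems


if __name__ == "__main__":
    for problem in check_tesseract_setup():
        print("WARNING:", problem)
```

Running the script prints one warning per problem and nothing when both checks pass.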

Install Pillow in Python to substitute for PIL

pip install pillow
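Pillow is installed under the distribution name `pillow` but is still imported under the old `PIL` package name. A short sketch to confirm that the import resolves (the helper name `pillow_available` is hypothetical):

```python
import importlib.util


def pillow_available():
    """Pillow keeps the legacy import name PIL, so look for that package."""
    return importlib.util.find_spec("PIL") is not None


if pillow_available():
    from PIL import Image  # provided by the pillow distribution
    print("Pillow is importable as PIL")
else:
    print("Pillow is missing; run: pip install pillow")
```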

Install captchaMiddleware

python setup.py test
python setup.py install

If the tests fail, test your tesseract installation:

tesseract "unknown letter 0.jpg" prediction --psm 10 --oem 0
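In the command above, `--psm 10` tells Tesseract to treat the image as a single character and `--oem 0` selects the legacy engine; the recognised text is written to `prediction.txt`. The same check can be scripted. This is a sketch under those assumptions, with hypothetical helper names, and it is not how captchaMiddleware itself calls Tesseract:

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base="prediction", psm=10, oem=0):
    """Build the argument list for the debugging command shown above."""
    return ["tesseract", image_path, output_base,
            "--psm", str(psm), "--oem", str(oem)]


def run_debug_prediction(image_path, output_base="prediction"):
    """Run Tesseract on one image and return the text it wrote to <output_base>.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(build_tesseract_command(image_path, output_base),
                            capture_output=True, text=True)
    if result.returncode != 0:
        # Surface Tesseract's own error message, e.g. a missing traineddata file.
        raise RuntimeError(result.stderr.strip())
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read().strip()
```

Calling `run_debug_prediction("unknown letter 0.jpg")` either returns the predicted character or raises with Tesseract's error output, which usually points at the missing piece of the installation.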


Enable the middleware in the DOWNLOADER_MIDDLEWARES setting of your Scrapy project
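Scrapy reads downloader middlewares from the `DOWNLOADER_MIDDLEWARES` dictionary in `settings.py`, which maps a class path to an ordering number. The sketch below is an illustration only: the module path `captchaMiddleware.middleware` comes from the import used in the spider example, but the class name `CaptchaMiddleware` and the order value `500` are assumptions, so check the package source for the real class name.

```python
# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    # ASSUMPTION: "CaptchaMiddleware" is a placeholder class name and 500 is
    # an arbitrary position in the middleware chain; adjust both to match
    # the installed package and your project.
    "captchaMiddleware.middleware.CaptchaMiddleware": 500,
}
```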


In your spider, set a meta key to prevent trying the tests too many times:

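import scrapy
from captchaMiddleware.middleware import RETRY_KEY

def start_requests(self):
    # Method of your spider class; self.errorHandler is your own errback.
    # Each request starts with RETRY_KEY set to 0 in its meta.
    for url in self.start_urls:
        yield scrapy.Request(url=url, callback=self.parse,
                             errback=self.errorHandler, meta={RETRY_KEY: 0})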
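Starting the meta key at 0 suggests a per-request attempt counter. The sketch below shows how such a counter could cap the number of attempts. It is not taken from the middleware's source: the key's string value, the limit of 3, and the helper name `next_retry_meta` are all hypothetical.

```python
# ASSUMPTION: placeholder value; the real constant lives in captchaMiddleware.middleware.
RETRY_KEY = "captcha_retries"
# ASSUMPTION: illustrative limit, not the middleware's actual setting.
MAX_RETRIES = 3


def next_retry_meta(meta, max_retries=MAX_RETRIES):
    """Return a copy of meta with the counter incremented, or None once the limit is reached."""
    attempts = meta.get(RETRY_KEY, 0)
    if attempts >= max_retries:
        return None
    updated = dict(meta)
    updated[RETRY_KEY] = attempts + 1
    return updated
```

A request whose meta returns `None` here would be given up on rather than retried, which is the behaviour the meta key is meant to make possible.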