Selenium Crawler

A web crawler written in Python, powered by Selenium and Tesseract OCR

🔗 source code

Motivation

In a project I needed to collect the names and addresses of all dental laboratories in Taiwan. Unfortunately, the Ministry of Health and Welfare doesn't provide the data in a structured format (CSV, JSON, etc.) for download. It is only available as website tables with just 10 records per page, and there are about 1000 pages... so I had to crawl it, having no faster or simpler option (or maybe I should have contacted the government officials instead, which might have been faster?? 🤔...).

Choosing a Web Crawler

I struggled for quite some time over which web crawling library to use. Besides Selenium, the other candidates were Scrapy, Playwright, and Puppeteer.

Scrapy is a highly structured framework: to use it, I would need to write Python classes and middlewares following its conventions, then run the crawl from the CLI. That well-defined structure does provide a clear separation of concerns, and it's good for building serious crawlers, especially in a team where each member is responsible for a specific part of the project. However, it's a bit overkill for a one-man, simple crawling project like this one, so maybe next time.

Playwright, typically used in Python as a pytest plugin, is more test-oriented and depends on pytest and its CLI. That adds another layer of complication, and it doesn't fit well with Jupyter's exploratory programming style. Therefore 🙅‍♀️

Puppeteer is the closest competitor. I like its flexible and elegant API for selecting elements. The main reason for not choosing it is that exploratory programming in JavaScript is not as "out-of-the-box" as it is in Python. Yes, I can do exploratory programming with JS, such as using Quokka in VSCode, installing the iJavascript kernel for Jupyter, or using RunKit online, etc. But the DX of these toolings is just not as smooth and mature as Jupyter + Python.

Another essential benefit of using Jupyter is that I can easily "deploy" the program online with Binder, so users can test it immediately by simply clicking the Binder badge and tell me whether it fits their needs. This fast feedback loop significantly accelerates the development cycle 👍

So, the final stack for this simple crawler is: Python + Selenium + Jupyter -> Binder 🎉

Usage

  • run online with Binder
  • or install and run locally:
    • clone the repo
    • install dependencies: poetry install
    • run selenium-crawler.ipynb in VSCode

Tech Stack

  • Jupyter: exploratory programming in Python
    • effectively try out any CSS selector / XPath combination
  • Selenium
    • first use normal (headed) mode to get visual feedback while exploring XPaths
    • then switch to headless mode in production, as shown in the first sketch below
  • Tesseract: Google's OCR engine, used to recognize the captcha
  • Pillow (PIL): creates an image object from the captcha binary retrieved by Selenium, as shown in the second sketch below
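
To make the Selenium items concrete, here is a minimal sketch of the headed/headless switch and of trying out an XPath interactively. The URL and the XPath are hypothetical placeholders, not the notebook's actual values:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
# comment this line out while exploring, to watch the browser work;
# keep it for production runs
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

# hypothetical listing page with 10 records per page
driver.get("https://example.gov.tw/dental-labs?page=1")

# tweak and re-run this cell in Jupyter until the XPath matches
rows = driver.find_elements(By.XPATH, "//table//tr")
for row in rows:
    print(row.text)
```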
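
And a sketch of the captcha step, continuing from the driver above. The captcha XPath is again a placeholder, and I'm assuming pytesseract as the Python binding for Tesseract:

```python
from io import BytesIO

import pytesseract
from PIL import Image

# locate the captcha image (hypothetical XPath) and have Selenium
# capture it as PNG bytes
captcha_el = driver.find_element(By.XPATH, "//img[@id='captcha']")
png_bytes = captcha_el.screenshot_as_png

# build a PIL image from the raw bytes, then OCR it with Tesseract
image = Image.open(BytesIO(png_bytes))
code = pytesseract.image_to_string(image).strip()
print(code)
```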

Questions?

Open a GitHub issue or ping me on Twitter.