Web scraping API benchmark by Scraping Fish 🐟

This repository contains the Python code used to benchmark web scraping APIs.

The results are presented at https://scrapingfish.com/webscraping-benchmark.

You can also use the script in this repository to run a scraping job with the Scraping Fish API by providing an input file with a list of URLs to scrape. The available configuration options are documented in the Usage section below.

Prerequisites

To run the code, first install the dependencies:

pip install -r requirements.txt

Input

As input, prepare a text file with a list of URLs, one URL per line.

The urls folder contains the example input files used for the benchmark:

alexa.txt: URLs from the top 1,000 Alexa rank
amazon.txt: Amazon product URLs
google.txt: Google search queries
instagram.txt: the top 10 Instagram profiles (as of 2022)
similarweb.txt: websites from the Similarweb ranking (excluding adult and Russian websites)
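
As an illustration of the input format described above, here is a minimal sketch of how such a file could be read. The helper name load_urls and its limit parameter are assumptions made for this example (the limit mirrors the --limit option, where 0 disables it); this is not necessarily how main.py reads its input.

```python
from pathlib import Path
from typing import List


def load_urls(path: str, limit: int = 0) -> List[str]:
    """Read one URL per line, skipping blank lines.

    limit=0 means "no limit", mirroring the --limit option.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Strip surrounding whitespace and drop empty lines
    urls = [line.strip() for line in lines if line.strip()]
    return urls[:limit] if limit > 0 else urls
```

For example, load_urls("./urls/amazon.txt", limit=1000) would return at most the first 1,000 URLs from the Amazon input file.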

Usage

Run the main.py script with the --help argument to show the usage options.

$ python main.py --help
usage: main.py [-h] --api-key API_KEY --urls URLS [--api API] [--concurrency CONCURRENCY] 
               [--retries RETRIES] [--timeout TIMEOUT] [--limit LIMIT] [--render-js] [--verbose]

optional arguments:
  -h, --help                 show this help message and exit
  --api-key API_KEY          Scraping Fish API key
  --urls URLS                Path to a file with the list of urls to scrape
  --api API                  Scraping Fish API url
  --concurrency CONCURRENCY  Number of concurrent connections
  --retries RETRIES          Number of retries in case of unsuccessful response
  --timeout TIMEOUT          How many seconds to wait for the server to respond before giving up
  --limit LIMIT              Limit the number of URLs to scrape (set to 0 to disable)
  --render-js                Enable JS rendering
  --verbose                  Verbose mode

Example

To run a job that scrapes Amazon product URLs

using 5 concurrent connections,
with 2 retries after a request fails or times out,
interrupting any unfinished request after 60 seconds:

python main.py --api-key <YOUR_SCRAPING_FISH_API_KEY> --urls ./urls/amazon.txt \
               --concurrency 5 --retries 2 --timeout 60
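
To make the interplay of the three options in this example more concrete, here is a minimal sketch of a concurrent job with retries. It is an illustration under stated assumptions, not the code in main.py: the names fetch_with_retries and run_job are hypothetical, and the fetch callable (which takes a URL and a timeout, returns an HTTP status code, and raises TimeoutError when the timeout is exceeded) is injected so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable

# Hypothetical signature: fetch(url, timeout_seconds) -> HTTP status code
Fetch = Callable[[str, float], int]


def fetch_with_retries(url: str, fetch: Fetch, retries: int, timeout: float) -> bool:
    """Try once plus up to `retries` more times; True on the first 200 response."""
    for _ in range(1 + retries):
        try:
            if fetch(url, timeout) == 200:
                return True
        except TimeoutError:
            pass  # a timed-out attempt is treated like a failed one and retried
    return False


def run_job(urls: Iterable[str], fetch: Fetch, concurrency: int = 5,
            retries: int = 2, timeout: float = 60.0) -> Dict[str, bool]:
    """Process URLs with at most `concurrency` requests in flight at once."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = pool.map(
            lambda u: fetch_with_retries(u, fetch, retries, timeout), url_list
        )
        return dict(zip(url_list, results))
```

With these settings, a URL that keeps failing costs up to 3 attempts of up to 60 seconds each, so raising --retries or --timeout increases the worst-case time per URL. A real fetch function would also need to translate its HTTP library's own timeout exception into TimeoutError, or the except clause would need to catch that exception instead.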

Benchmark methodology

We ran scraping jobs for the 5 example input files in the urls folder, making 1,000 requests for each of them. The concurrency and response timeout parameters were set to achieve optimal processing time while keeping the error rate low.

📝 If you want to use this script for your web scraping job, you might need to adjust these parameters to the websites you want to scrape.

For the benchmark, we executed the following commands.

Alexa

python main.py --api-key API_KEY --urls ./urls/alexa.txt \
               --concurrency 5 --timeout 60.0  --limit 1000

Amazon

python main.py --api-key API_KEY --urls ./urls/amazon.txt \
               --concurrency 5 --timeout 15.0  --limit 1000

Google

python main.py --api-key API_KEY --urls ./urls/google.txt \
               --concurrency 5 --timeout 15.0  --limit 1000

Instagram

python main.py --api-key API_KEY --urls ./urls/instagram.txt \
               --concurrency 2 --timeout 60.0  --limit 1000

Similarweb

python main.py --api-key API_KEY --urls ./urls/similarweb.txt \
               --concurrency 5 --timeout 15.0  --limit 1000
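
The five commands above differ only in the input file, concurrency, and timeout, so they can be summarized as a small settings table. The sketch below is one possible way to generate them; BENCHMARK_SETTINGS and build_command are hypothetical names introduced here and are not part of the repository.

```python
from typing import Dict, List, Tuple

# (concurrency, timeout in seconds) per test, copied from the commands above
BENCHMARK_SETTINGS: Dict[str, Tuple[int, float]] = {
    "alexa": (5, 60.0),
    "amazon": (5, 15.0),
    "google": (5, 15.0),
    "instagram": (2, 60.0),
    "similarweb": (5, 15.0),
}


def build_command(test: str, api_key: str, limit: int = 1000) -> List[str]:
    """Return the argument list for one benchmark run of main.py."""
    concurrency, timeout = BENCHMARK_SETTINGS[test]
    return [
        "python", "main.py",
        "--api-key", api_key,
        "--urls", f"./urls/{test}.txt",
        "--concurrency", str(concurrency),
        "--timeout", str(timeout),
        "--limit", str(limit),
    ]
```

Each list can be passed to subprocess.run(..., check=True) from the repository root to execute the corresponding benchmark.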

Metrics

For each test, we recorded:

✅ successful URLs
❌ failed URLs
⛔️ blocked URLs
⏱ average URL processing time (seconds/URL)
💰 cost of running the benchmark (1000 requests)

The results are presented in the section below.
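
As a rough illustration of how the metrics listed above could be derived from per-URL outcomes, consider the sketch below. The Outcome record, the three status labels, and the choice to average time over all URLs are assumptions made for this example; the README does not specify how main.py distinguishes failed from blocked URLs or how it measures processing time.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Outcome:
    status: str     # assumed labels: "success", "failed", or "blocked"
    seconds: float  # time spent processing this URL


def summarize(outcomes: List[Outcome]) -> Dict[str, float]:
    """Compute percentage of each status and mean seconds per URL."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("no outcomes to summarize")

    def pct(status: str) -> float:
        return 100.0 * sum(o.status == status for o in outcomes) / n

    return {
        "successful_pct": pct("success"),
        "failed_pct": pct("failed"),
        "blocked_pct": pct("blocked"),
        "seconds_per_url": sum(o.seconds for o in outcomes) / n,
    }
```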

Results

Scraping Fish 🐟

Test	✅ Successful	❌ Failed	⛔️ Blocked	⏱ Processing time	💰 Cost
Alexa	99.9%	0.1%	0%	2.63	$2
Amazon	100.0%	0%	0%	3.37	$2
Google	100.0%	0%	0%	1.63	$2
Instagram	97.0%	3.0%	0%	23.25	$2
Similarweb	100.0%	0%	0%	2.50	$2
Total	99.4%	0.6%	0.0%	6.28	$10

📝 $0.002 per successfully scraped URL. The highest overall success rate and the best average processing time among the benchmarked APIs.
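
The cost column follows from the per-URL price. The sketch below reproduces that arithmetic as an upper bound, assuming all 1,000 requests in a test succeed and are billed; the function name max_cost is hypothetical. Decimal is used to avoid floating-point rounding in currency values.

```python
from decimal import Decimal

PRICE_PER_SUCCESSFUL_URL = Decimal("0.002")  # USD, from the note above
REQUESTS_PER_TEST = 1000
NUM_TESTS = 5


def max_cost(requests: int, price: Decimal = PRICE_PER_SUCCESSFUL_URL) -> Decimal:
    """Upper bound on cost in USD: assumes every request succeeds and is billed."""
    return price * requests


per_test = max_cost(REQUESTS_PER_TEST)  # 1,000 requests at $0.002 each
total = per_test * NUM_TESTS            # five tests
print(f"per test: ${per_test:.2f}, total: ${total:.2f}")
```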

Other web scraping APIs

ScrapingAnt 🐜

Benchmarks were run using the --api "https://api.scrapingant.com/v1/general/?proxy_type=residential&" parameter, with the code adjusted to pass the API key as a header instead of a query parameter.

Test	✅ Successful	❌ Failed	⛔️ Blocked	⏱ Processing time	💰 Cost
Alexa	100.0%	0%	0%	6.92	$19
Amazon	98.0%	2.0%	0%	9.84	$19
Google	95.0%	5.0%	0%	13.80	$19
Instagram	99.5%	0.5%	0%	6.76	$19
Similarweb	96.0%	4.0%	0%	7.40	$19
Total	97.7%	2.3%	0.0%	8.94	$49

📝 $49 Startup subscription required to scrape 5,000 URLs in total (each consuming 50 or 250 API credits) and using 5 concurrent connections.
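
The --api value above ends with "&", which suggests that the script appends its own query parameters directly to that string. The sketch below shows one way such a request could be assembled, including the header-based API key used for ScrapingAnt. The function name build_request, the query parameter names url and api_key, and the header name x-api-key are all assumptions made for illustration; check each provider's documentation and main.py for the actual names.

```python
from typing import Dict, Tuple
from urllib.parse import urlencode


def build_request(api: str, target_url: str, api_key: str,
                  key_in_header: bool = False) -> Tuple[str, Dict[str, str]]:
    """Return (request URL, headers) for scraping target_url through `api`.

    `api` is expected to end with "?" or "&" so parameters can be appended.
    """
    params = {"url": target_url}       # assumed parameter name
    headers: Dict[str, str] = {}
    if key_in_header:
        headers["x-api-key"] = api_key  # assumed header name
    else:
        params["api_key"] = api_key     # assumed parameter name
    # urlencode percent-encodes the target URL so it survives as one parameter
    return api + urlencode(params), headers
```

Keeping the key placement behind a single flag means the rest of the job (concurrency, retries, timeouts) can stay identical across providers.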

ScrapingBee 🐝

Benchmarks were run using the --api "https://app.scrapingbee.com/api/v1/?premium_proxy=true&" parameter, with the custom_google parameter set to true for the Google benchmark.

Test	✅ Successful	❌ Failed	⛔️ Blocked	⏱ Processing time	💰 Cost
Alexa	81.0%	18.0%	1.0%	4.86	$99
Amazon	99.0%	1.0%	0%	11.48	$99
Google	100.0%	0%	0%	3.74	$99
Instagram	99.0%	1.0%	0%	18.52	$99
Similarweb	90.0%	8.0%	2.0%	4.70	$99
Total	93.8%	5.6%	0.6%	8.66	$99

📝 $99 Startup subscription required to scrape 5,000 URLs in total (each consuming 10, 20, or 25 API credits) and using 5 concurrent connections.

ScraperAPI

Benchmarks were run using the --api "http://api.scraperapi.com/?premium=true&" parameter.

Test	✅ Successful	❌ Failed	⛔️ Blocked	⏱ Processing time	💰 Cost
Alexa	95.5%	4.5%	0%	7.19	$49
Amazon	96.0%	4.0%	0%	10.97	$49
Google	100.0%	0%	0%	4.50	$49
Instagram^*	0.0%	100.0%	0%	-	-
Similarweb	90.0%	8.0%	2.0%	4.70	$49
Total	76.3%	23.3%	0.4%	6.84	$49

^* Scraping Instagram is not allowed and returns 403 status code.

📝 $49 Hobby subscription required to scrape 5,000 URLs in total (each consuming 10 or 25 API credits) and using 5 concurrent connections.

Try it with Scraping Fish API

To run the scraping script, you need a Scraping Fish API key, which you can get at https://scrapingfish.com.
