Selenium + Headless Chrome scraper that calculates actual full web page sizes (including dynamic content).
.gitignore
2018-09-15-alexa-topsites-50-preview.txt
README.md
from_list.py
requirements.txt

README.md

Written for the article "Webpages Are Getting Larger Every Year, and Here’s Why it Matters"
Author: Jorge Orpinel Perez
© 2018 Pingdom AB.

Website Page Size Scraper

Python script that uses Selenium and headless Chrome to determine the average page size across a list of websites. The size of each page includes its transferSize and any other content loaded dynamically to display the site's home page.
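
As a rough illustration of the idea (not necessarily how from_list.py does it), the sketch below sums the transferSize values that Chrome reports through the Resource Timing API. The names total_transfer_size, COLLECT_SIZES_JS and measure_page are hypothetical, and the code assumes a chromedriver instance is already listening on its default port, 9515.

```python
def total_transfer_size(sizes):
    """Sum per-resource byte counts, ignoring missing (None) values."""
    return sum(int(size) for size in sizes if size is not None)


# JavaScript executed in the page: collect transferSize for the document
# itself ('navigation') and for every subresource ('resource').
COLLECT_SIZES_JS = """
return performance.getEntriesByType('navigation')
    .concat(performance.getEntriesByType('resource'))
    .map(function (entry) { return entry.transferSize; });
"""


def measure_page(url, driver_url="http://127.0.0.1:9515"):
    """Load url in headless Chrome and return the summed transferSize in bytes."""
    # Imported here so the helper above stays usable without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    # Assumption: chromedriver is already running at driver_url.
    driver = webdriver.Remote(command_executor=driver_url, options=options)
    try:
        driver.get(url)
        return total_transfer_size(driver.execute_script(COLLECT_SIZES_JS))
    finally:
        driver.quit()
```

This approach has limits. Browsers report a transferSize of 0 for cross-origin resources that do not send a Timing-Allow-Origin header, and resources requested after the page's load event are only counted if the script waits for them. Treat the result as a lower bound.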

Installation

This tool was developed and run with Python 3.6.5 on macOS 10.13.

Later versions should continue to work.

External dependencies

Required Python package

See requirements.txt

  • Python language bindings for Selenium WebDriver (selenium 3.14 used)

To install, we will use virtualenv:

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

Virtualenv installs pip automatically.

Usage

Save a list of web page URIs (one per line) in a plain text file. Included in 2018-09-15-alexa-topsites-50-preview.txt is a sample list of 50 top sites published by Alexa (Sep 2018).
Make sure the script is executable by your user:

chmod u+x from_list.py

You may now run it:

chromedriver 2> /dev/null &  # Listens on port 9515 by default (--port=9515). Runs in background.
./from_list.py 2018-09-15-alexa-topsites-50-preview.txt

See the file docstring in from_list.py for further info.
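
For readers who want to see the shape of the input handling and the final average, here is a minimal sketch. It assumes a plain text file with one URL per line, as described above. The helper names read_urls and average_size are hypothetical and are not taken from from_list.py.

```python
def read_urls(path):
    """Return the non-empty lines of a text file, one URL per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def average_size(sizes_by_url):
    """Mean page size in bytes over the sites that were measured successfully.

    sizes_by_url maps each URL to a byte count, or to None if loading failed.
    """
    measured = [size for size in sizes_by_url.values() if size is not None]
    if not measured:
        raise ValueError("no page sizes were measured")
    return sum(measured) / len(measured)
```

Skipping failed sites rather than counting them as zero keeps one unreachable host from dragging the average down. Whether the actual script makes the same choice is not stated here.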

Don't forget to stop chromedriver after running the Python script, e.g.:

fg  # To bring chromedriver to the foreground
^C  # Ctrl + C
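
If you would rather not manage the background job by hand, one option is to start and stop chromedriver from Python. The sketch below is an assumption-laden alternative, not part of this repository. background_process is a hypothetical helper that terminates the child process even when the wrapped code raises.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def background_process(command):
    """Start command in the background and always stop it on exit."""
    # stderr is discarded, mirroring `2> /dev/null` in the shell example.
    process = subprocess.Popen(command, stderr=subprocess.DEVNULL)
    try:
        yield process
    finally:
        process.terminate()
        try:
            process.wait(timeout=10)
        except subprocess.TimeoutExpired:
            # The process ignored the polite request; force it to stop.
            process.kill()
            process.wait()


# Possible usage (assumes chromedriver is on PATH):
#
# with background_process(["chromedriver"]):
#     subprocess.run(
#         ["./from_list.py", "2018-09-15-alexa-topsites-50-preview.txt"],
#         check=True,
#     )
```

chromedriver may need a moment before it accepts connections, so a short wait or retry before the first request could be necessary.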