harveydevereux/TimesRichList2020

Times Rich List 2020 Python Scraper

This repo provides Python-Selenium code to scrape the data from the Times website.

The scraper is fairly hard-coded to the current page, so the scraped data is included in the repo.

Setup

The code requires a working Python installation with Selenium installed (plus the driver for your browser).

The code works with the webpage as of 30 May 2020, using Python 3.6.9, Selenium (Python) 3.141.0, and numpy 1.18.1 (needed for the ceil function only).
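Since numpy is only needed for ceil, the standard library's math.ceil gives the same result for a single number. The README does not say what the ceiling is used for, so the sketch below assumes a typical case (counting pages of results); pages_needed and its parameters are hypothetical names, not taken from ScrapeData.py.

```python
import math


def pages_needed(n_entries: int, per_page: int) -> int:
    """Return how many pages are needed to list n_entries items.

    Hypothetical helper: the real script may use ceil for something else.
    """
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    # math.ceil rounds up to the next integer, like numpy.ceil on a scalar,
    # but returns a Python int instead of a numpy float.
    return math.ceil(n_entries / per_page)
```

If this assumption holds, swapping numpy.ceil for math.ceil would remove the numpy dependency from the scraper itself.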

Running/Options

Run with

python ScrapeData.py

or

./run.sh

Both support the options --csv [string] and --headless. The first takes the name of the CSV file to save the data to, and the second launches Selenium without opening a browser window (otherwise you'll watch the scraper in action).
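As a rough illustration of how two options like these can be parsed, here is a minimal argparse sketch. It is not the parsing code from ScrapeData.py; build_parser, the help strings, and the default file name data.csv are assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Scrape the Times Rich List 2020")
    # --csv takes a string: the name of the output CSV file.
    # The default "data.csv" is an assumption, not the script's real default.
    parser.add_argument("--csv", type=str, default="data.csv",
                        help="name of the CSV file to write")
    # --headless is a flag: present means True, absent means False.
    parser.add_argument("--headless", action="store_true",
                        help="run the browser without a visible window")
    return parser


# Parse an explicit argument list, equivalent to:
#   python ScrapeData.py --csv rich_list.csv --headless
args = build_parser().parse_args(["--csv", "rich_list.csv", "--headless"])
print(args.csv, args.headless)
```

A flag parsed this way can then decide whether the browser driver is started in headless mode.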

Caveats

  • Sometimes the webpage glitches or takes too long to load, so Selenium does not find the "I Agree [to cookies]" button. This shows up as selenium.common.exceptions.ElementNotInteractableException: Message: Element <button class="message-component message-button no-children"> could not be scrolled into view. Re-running usually works.
  • If the text on the webpage changes, the scraper will likely break.
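Because re-running usually fixes the cookie-button failure, the retry could in principle be automated. The sketch below is a generic retry helper using only the standard library; it is not part of this repo, and retry, its parameters, and the default values are hypothetical.

```python
import time


def retry(action, attempts=3, delay=2.0, exceptions=(Exception,)):
    """Call action() until it succeeds or the attempts run out.

    action     -- a zero-argument callable, e.g. one that clicks a button
    attempts   -- maximum number of calls (must be at least 1)
    delay      -- seconds to wait between failed attempts
    exceptions -- exception types that trigger another attempt
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)  # give the page time to finish loading
```

With Selenium, action would be a function that locates and clicks the "I Agree" button, and exceptions could be narrowed to (ElementNotInteractableException,) imported from selenium.common.exceptions. Selenium's own WebDriverWait with expected_conditions.element_to_be_clickable is another way to wait for the button before clicking.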

Data Analysis

Example Notebook to get started

The Wealth Distribution

[Figure: the wealth distribution]

Top 10 Sectors by Median Wealth

[Figure: top 10 sectors by median wealth]

Top 10 Sectors By Total Wealth

[Figure: top 10 sectors by total wealth]
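Rankings like the two sector charts above can be computed by grouping rows by sector and applying a statistic to each group. The sketch below uses only the standard library rather than the notebook's actual code; top_sectors, the column names "sector" and "wealth", and the sample rows are all made up for illustration and do not come from the scraped data.

```python
from collections import defaultdict
from statistics import median


def top_sectors(rows, stat, n=10):
    """Rank sectors by stat (e.g. median or sum) of their wealth values."""
    by_sector = defaultdict(list)
    for row in rows:
        # "sector" and "wealth" are assumed column names.
        by_sector[row["sector"]].append(float(row["wealth"]))
    scores = {sector: stat(values) for sector, values in by_sector.items()}
    # Highest score first, keep the top n.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Placeholder rows, not real Rich List entries.
rows = [
    {"sector": "A", "wealth": 10.0},
    {"sector": "A", "wealth": 2.0},
    {"sector": "B", "wealth": 5.0},
    {"sector": "B", "wealth": 4.0},
    {"sector": "B", "wealth": 4.0},
]

by_median = top_sectors(rows, median)
by_total = top_sectors(rows, sum)
print(by_median)
print(by_total)
```

In this toy input the two rankings disagree: sector A leads on median wealth while sector B leads on total wealth, which is why the two charts are worth looking at separately.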
