
Web Scraping Tutorial

Breaking Down Web Scraping

Scraping data from the web requires three steps:

  1. Loading the site
  2. Locating the data
  3. Saving the data
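
For a sense of what these steps look like in practice, here is a minimal sketch in Python, assuming the requests and beautifulsoup4 packages; the URL and selector are hypothetical placeholders:

```python
# Minimal sketch of the three steps: load, locate, save.
# Assumes the `requests` and `beautifulsoup4` packages; the URL and
# selector below are hypothetical examples.
import csv

import requests
from bs4 import BeautifulSoup

# 1. Load the site
response = requests.get("https://example.com/listings")
response.raise_for_status()

# 2. Locate the data (here, the text and target of every link inside a list item)
soup = BeautifulSoup(response.text, "html.parser")
rows = [(a.get_text(strip=True), a["href"]) for a in soup.select("li a[href]")]

# 3. Save the data
with open("listings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "url"])
    writer.writerows(rows)
```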

There are many automation tools that will handle some or all of these steps for you. Some are simple. But depending on how complex the sites you want to scrape are, and how specific your needs for the data are, you may need more advanced options.

For that reason, this is not a step-by-step tutorial. Rather, this is a launching pad, with links and resources to tools that you can use to automate different steps in the web-scraping process.

Things to think about

How often do you need to run your scraper? Does it need to run at regular intervals or irregularly?

How hands-off does your scraper need to be? Do you want the entire process to run unattended or do you want a user involved?

How complex are the sites you are scraping? Does your scraper need to navigate interactive websites, or is it scraping simpler static HTML pages?

What do you want to use to run your scraper? Do you want to run it from within your browser, while viewing the pages you need to scrape? Do you want to run scripts locally on your machine, which don’t require you to actively use the browser, but do require your computer to be on and running? Or do you want to run it in a cloud computing service at regular intervals?

Conscientious Web Scraping

When deployed incautiously, web-scraping tools can cause real problems for others, whether by degrading the level of service for other users of a site or by increasing data fees for site owners. To minimize the impact of your project and scrape responsibly:

  • Schedule your scraping scripts to run no more often than necessary
  • Limit the number of requests that you send simultaneously or in close succession
  • Schedule scripts for off-peak hours
  • Use caches and avoid duplicate requests

And lastly, avoid web scraping entirely if the data you need is available from official APIs.
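
As a rough illustration of the throttling and caching advice above, here is one possible sketch in Python; the delay, cache file name, and helper name are illustrative, not prescribed:

```python
# Sketch: space out requests and cache responses so repeated runs of the
# scraper don't re-fetch pages it has already seen. Assumes the `requests`
# package; the delay and cache file name are arbitrary examples.
import json
import time
from pathlib import Path

import requests

CACHE_FILE = Path("scrape_cache.json")
cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}

def polite_get(url, delay_seconds=5):
    """Fetch a URL at most once, waiting between network requests."""
    if url not in cache:
        time.sleep(delay_seconds)  # limit how quickly requests go out
        response = requests.get(url)
        response.raise_for_status()
        cache[url] = response.text
        CACHE_FILE.write_text(json.dumps(cache))
    return cache[url]
```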

Tools to consider

  • Simple Tools
  • Python packages
    • Selenium, a programmable browser for automated web browsing; it can be used as a Python package or as a standalone application.
    • Beautiful Soup, a Python package for navigating and extracting data from HTML files.
    • Scrapy, another Python package for crawling web pages and extracting data from them.
  • JavaScript-based methods
    • Bookmarklets are a simple way of running JavaScript code from your browser, and can help you quickly pull data from large pages. However, they may run differently in different browsers, and browser security features may create hurdles for saving the data to your hard drive.
    • Userscripts, software for running JavaScript on websites
    • Google Scripts allows you to schedule and run JavaScript code from the cloud, and has built-in methods for saving data to Google Drive. It includes basic methods for fetching website data, though it doesn’t have specialized tools for parsing or navigating websites.
  • Services
    • Scrapestack (100 requests per month free)
    • Octoparse (Free plan allows you to run 10 projects on your local machine, paid plans allow cloud-based scraping.)
    • Webscraper.io (Free in local plugin, paid plans allow cloud-based scraping.)

| Tool | Platform | Scheduled Scraping? | Navigate Complex Websites? | Output |
| --- | --- | --- | --- | --- |
| Google Sheets | Google Drive | No | No | Google Sheets only |
| Browser plugins | Browser | No | Yes (with limitations) | Any |
| Selenium | Python | Yes | Yes | Any |
| Beautiful Soup | Python | n/a | n/a | Any |
| Scrapy | Python | Yes | Yes | Any |
| Bookmarklets | JavaScript (browser) | No | Yes (with limitations) | Dependent on browser |
| Userscripts | JavaScript (browser) | No | Yes | Dependent on browser |
| Google Scripts | JavaScript (Google Drive) | Yes | No | Google Drive, other cloud services |

Finding Data

Once you’ve loaded the website contents, your next step is extracting just the data you need. For simpler projects, you might need nothing more than a straightforward search of the website text. For most websites, though, you’ll want to use a selector to identify where in the page’s HTML structure your data will be found.

About Selectors

Selectors use CSS selector syntax to find parts of the page. As an example, to find all links (<a> HTML elements) inside list items (<li>) with the class reference, you could use the selector li.reference a.

JavaScript and Python both have methods for finding specific nodes within a page’s structure.
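
For example, Beautiful Soup’s select method accepts this same CSS selector syntax; here is a small sketch, assuming soup is a page that has already been parsed with Beautiful Soup:

```python
# Sketch: apply the selector from the example above with Beautiful Soup,
# assuming `soup` is an already-parsed page.
for link in soup.select("li.reference a"):
    print(link.get_text(strip=True), link.get("href"))
```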

Deciding which selectors to use

Use your browser’s developer console to explore the page you want to scrape. The console’s inspector will let you click on specific elements of the page and see which selectors apply to them. Think about how the webpage might change over time, and pick the selectors least likely to be affected by small changes in the layout. For example, if your webpage has a single <main> HTML element, you may want to include it in your selector, so that the selector will never grab any HTML elements later added to, say, the navigation bar.
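
Continuing the earlier example, scoping the selector to the page’s main element (a sketch; the element and class names are illustrative) keeps it from matching links that later appear elsewhere on the page:

```python
# Scope the selector to <main> so that links added later to a navigation
# bar or footer are never matched (element and class names are examples).
data_links = soup.select("main li.reference a")
```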

Saving your data

Many of these tools have built-in methods for exporting your data to a file. But some methods, especially browser-based web scraping, may make it tricky to save your data.

Saving from a browser

There are different methods of creating and downloading files via bookmarklets, but they may be blocked by the security features of some browsers. One possible solution can be found on Stack Overflow.

Sending data to the web

Alternatively, you can use JavaScript’s fetch method to send the data to another server, or to a cloud computing service like Google Scripts via its web app functions, for further processing.

Some Tutorial Links


Issues used in the creation of this page

(Some of these issues may be closed or open/in progress.)
