📜 Day 6: Scraping the Web

⏱ Agenda

  1. [05m] 🏆 Objectives
  2. [05m] 🤷‍♀️ Why You Should Know This
  3. [15m] 📖 Overview: Web Scraping
    1. Web Crawling vs. Web Scraping
    2. Parsing & Extracting Data Using Selectors
  4. [20m] 💻 Game: Selector Diner
  5. [10m] 💻 Demo: Selecting Selectors
    1. Techniques Demonstrated
  6. [10m] BREAK
  7. [15m] 📖 Overview: Colly
  8. [20m] Activity: Colly Calls Back
  9. [10m] TT: Advantages and Disadvantages of Using Colly
    1. Good
    2. Not So Good
  10. [30m] Video: Headless Web Scraping
  11. [20m] Example Code / Demo
  12. 📚 Resources & Credits

[05m] 🏆 Objectives

  1. Identify the critical steps to collecting data using web scraping techniques.
  2. Apply selectors to an HTML document to retrieve data.
  3. Design and create a web scraper that retrieves data from your favorite website!

[05m] 🤷‍♀️ Why You Should Know This

  • All projects need data before launching!
  • Available datasets may not meet your needs or require additional supporting data from a source on the web.
  • Save important data before a website goes offline for archival purposes.

[15m] 📖 Overview: Web Scraping

Web scrapers crawl a website, extract its data, transform that data into a usable, structured format, and finally write it to a file or database for subsequent use.

Programs that use this design pattern follow the Extract-Transform-Load (ETL) Process.
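To make those stages concrete, here's a minimal, runnable Go sketch of the ETL shape a scraper usually takes. The `Record` type, the stub data in `extract()`, and the `records.json` output file are all invented for illustration; in a real scraper, `extract()` would fetch and parse HTML.

```go
package main

import (
	"encoding/json"
	"os"
	"strings"
)

// Record is a hypothetical structured row produced by the Transform step.
type Record struct {
	Title string `json:"title"`
}

// extract stands in for the crawl/fetch step. A real scraper would
// download pages and pull raw strings out of the HTML here.
func extract() []string {
	return []string{"  Go Web Scraping  ", "  ETL in Practice  "}
}

// transform cleans the raw strings into structured records.
func transform(raw []string) []Record {
	records := make([]Record, 0, len(raw))
	for _, r := range raw {
		records = append(records, Record{Title: strings.TrimSpace(r)})
	}
	return records
}

// load writes the structured data to a file for subsequent use.
func load(records []Record) error {
	f, err := os.Create("records.json")
	if err != nil {
		return err
	}
	defer f.Close()
	return json.NewEncoder(f).Encode(records)
}

func main() {
	if err := load(transform(extract())); err != nil {
		panic(err)
	}
}
```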

Web Crawling vs. Web Scraping

  • Not interchangeable terms!
  • Crawlers download and store the contents of large numbers of sites by following the links in pages.
    • How Google got famous
  • Scrapers are built for the structure of a specific website.
    • Use a site's own structure to extract individual, specific data elements.
    • Crawling is the first step to web scraping.

Parsing & Extracting Data Using Selectors

Below are the most common selectors used when scraping the web for the purposes of data collection.

| Name         | Syntax          | Description                                           |
| ------------ | --------------- | ----------------------------------------------------- |
| Element      | `a`             | Any element: `section`, `a`, `table`, etc.            |
| ID           | `#home-link`    | The first element with `id="home-link"`               |
| Class        | `.blog-post`    | Any element with `class="blog-post"`                  |
| Attribute    | `a[href]`       | Any `a` element that has an `href` attribute          |
| Pseudo-class | `a:first-child` | Any `a` element that is the first child of its parent |
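Here's a small Go sketch that applies a few of these selectors. It uses the goquery package, a common (but here assumed) choice for parsing HTML in Go; the inline HTML document is invented for the example.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// A tiny, made-up HTML document to try selectors against.
	html := `<div id="home-link">
		<a class="blog-post" href="/posts/1">First post</a>
		<a class="blog-post" href="/posts/2">Second post</a>
	</div>`

	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		panic(err)
	}

	// Attribute selector: every <a> element that has an href attribute.
	doc.Find("a[href]").Each(func(i int, s *goquery.Selection) {
		href, _ := s.Attr("href")
		fmt.Printf("%s -> %s\n", s.Text(), href)
	})

	// Pseudo-class selector: only the <a> that is its parent's first child.
	fmt.Println("First link:", doc.Find("a:first-child").Text())
}
```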

Let's practice selectors now; they're the most important part of writing an awesome web scraper! If a selector isn't correct, nothing will match, and your scraper will collect no data.

[20m] 💻 Game: Selector Diner

Choose the right plates while working the window at the CSS Diner. This fun game will level up your selector skills in preparation for your Web Scraper project.

[10m] 💻 Demo: Selecting Selectors

The instructor will demonstrate how to find and test selectors in Chrome before integrating them into your web scraper.

Techniques Demonstrated

  • Inspect an element, right-click its node in the DOM tree, then choose Copy > Copy selector.
  • Test the selector using Ctrl + F in the Elements panel of the inspector.

If your selector does not work using these methods, it WILL NOT WORK IN YOUR SCRAPER!

[10m] BREAK

[15m] 📖 Overview: Colly

A popular open-source package, Colly, provides a clean foundation for writing any kind of crawler, scraper, or spider. Features include (a few of them appear in the sketch after this list):

  • Lots of cool Go language concepts!
  • Fast (>1k request/sec on a single core)
  • Manages request delays and maximum concurrency per domain
  • Automatic cookie and session handling
  • Sync/async/parallel scraping
  • Distributed scraping
  • Caching
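As a taste of the throttling and async features, here's a minimal sketch built from the patterns shown in Colly's docs; the domain glob and limit values are arbitrary choices for illustration.

```go
package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly"
)

func main() {
	// An async collector: Visit() queues requests instead of blocking.
	c := colly.NewCollector(colly.Async(true))

	// Throttle: at most 2 parallel requests per matching domain, with a
	// random delay of up to 1 second between requests.
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		RandomDelay: 1 * time.Second,
	})

	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", r.Request.URL)
	})

	c.Visit("https://hackerspaces.org/")

	// Block until all queued requests have finished.
	c.Wait()
}
```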

[20m] Activity: Colly Calls Back

Colly works via a series of callbacks that are executed anytime Visit() is called on a collector.

Callbacks are functions that execute after another function completes.

Colly supports the following callbacks, among others:

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

// main() contains code adapted from example found in Colly's docs:
// http://go-colly.org/docs/examples/basic/
func main() {
	// Instantiate default collector
	c := colly.NewCollector()

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		// Find link using an attribute selector
		// Matches any element that includes href=""
		link := e.Attr("href")

		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)

		// Visit link
		e.Request.Visit(link)
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.OnError(func(_ *colly.Response, err error) {
		fmt.Println("Something went wrong:", err)
	})

	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", r.Request.URL)
	})

	c.OnScraped(func(r *colly.Response) {
		fmt.Println("Finished", r.Request.URL)
	})

	// Start scraping on https://hackerspaces.org
	c.Visit("https://hackerspaces.org/")
}
```

With a partner, use the sample code to determine the order in which these callbacks fire. To examine the output, paste the above snippet into a file, then build and run your executable.

[10m] TT: Advantages and Disadvantages of Using Colly

Good

  • Quick to copy and paste an example from the docs and modify it to create your own web scraper.
  • Lots of plugins and libraries with good documentation
  • Security features allow you to cloak your scraper so it isn't detected

Not So Good

  • Colly doesn't execute JavaScript, so it can't scrape websites that render their components client-side (e.g., into a shadow DOM)
  • This means you can't use Colly on its own to scrape websites written in Angular, Vue, or React

[30m] Video: Headless Web Scraping

<iframe id="ytplayer" type="text/html" width="720" height="405" src="https://www.youtube.com/embed/_7pWCg94sKw?modestbranding=1&playsinline=1" frameborder="0" allow="picture-in-picture" allowfullscreen></iframe>

[20m] Example Code / Demo
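One way to scrape JavaScript-rendered pages from Go is to drive a headless browser. Below is a minimal sketch using the chromedp package; the library choice and target URL are assumptions for illustration (chromedp also requires a local Chrome/Chromium install), not necessarily the demo shown in class.

```go
package main

import (
	"context"
	"fmt"

	"github.com/chromedp/chromedp"
)

func main() {
	// Launch a headless Chrome session (requires Chrome/Chromium installed).
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	// Navigate to a (hypothetical) JavaScript-rendered page, then grab
	// the fully rendered HTML after the browser has executed its scripts.
	var html string
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://example.com/"),
		chromedp.OuterHTML("html", &html),
	)
	if err != nil {
		panic(err)
	}

	// The rendered markup can now be parsed with the selector techniques above.
	fmt.Println(html)
}
```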

📚 Resources & Credits