📜 Day 6: Scraping the Web

⏱ Agenda

  1. [05m] 🏆 Objectives
  2. [05m] 🤷‍♀️ Why You Should Know This
  3. [15m] 📖 Overview: Web Scraping
    1. Web Crawling vs. Web Scraping
    2. Parsing & Extracting Data Using Selectors
  4. [20m] 💻 Game: Selector Diner
  5. [10m] 💻 Demo: Selecting Selectors
    1. Techniques Demonstrated
  6. [10m] BREAK
  7. [15m] 📖 Overview: Colly
  8. [20m] Activity: Colly Calls Back
  9. [10m] TT: Advantages and Disadvantages of Using Colly
    1. Good
    2. Not So Good
  10. [30m] Video: Headless Web Scraping
  11. [20m] Example Code / Demo
  12. 📚 Resources & Credits

[05m] 🏆 Objectives

  1. Identify the critical steps to collecting data using web scraping techniques.
  2. Apply selectors to an HTML document to retrieve data.
  3. Design and create a web scraper that retrieves data from your favorite website!

[05m] 🤷‍♀️ Why You Should Know This

  • All projects need data before launching!
  • Available datasets may not meet your needs or require additional supporting data from a source on the web.
  • Save important data before a website goes offline for archival purposes.

[15m] 📖 Overview: Web Scraping

Web scrapers crawl a website, extract its data, transform that data into a usable, structured format, and finally write it to a file or database for subsequent use.

Programs that use this design pattern follow the Extract-Transform-Load (ETL) Process.
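To make those stages concrete, here's a minimal, runnable Go sketch of the ETL shape a scraper usually takes. The `Record` type, the stub data in `extract()`, and the `records.json` output file are all invented for illustration; in a real scraper, `extract()` would fetch and parse HTML.

```go
package main

import (
	"encoding/json"
	"os"
	"strings"
)

// Record is a hypothetical structured row produced by the Transform step.
type Record struct {
	Title string `json:"title"`
}

// extract stands in for the crawl/fetch step. A real scraper would
// download pages and pull raw strings out of the HTML here.
func extract() []string {
	return []string{"  Go Web Scraping  ", "  ETL in Practice  "}
}

// transform cleans the raw strings into structured records.
func transform(raw []string) []Record {
	records := make([]Record, 0, len(raw))
	for _, r := range raw {
		records = append(records, Record{Title: strings.TrimSpace(r)})
	}
	return records
}

// load writes the structured data to a file for subsequent use.
func load(records []Record) error {
	f, err := os.Create("records.json")
	if err != nil {
		return err
	}
	defer f.Close()
	return json.NewEncoder(f).Encode(records)
}

func main() {
	if err := load(transform(extract())); err != nil {
		panic(err)
	}
}
```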

Web Crawling vs. Web Scraping

  • Not interchangeable terms!
  • Crawlers download and store the contents of large numbers of sites by following the links in pages.
    • How Google got famous
  • Scrapers are built for the structure of a specific website.
    • Use a site's own structure to extract individual, specific data elements.
    • Crawling is the first step to web scraping.

Parsing & Extracting Data Using Selectors

Below are the most common selectors used when scraping the web for the purposes of data collection.

| Name         | Syntax          | Description                                           |
| ------------ | --------------- | ----------------------------------------------------- |
| Element      | `a`             | Any element: `section`, `a`, `table`, etc.            |
| ID           | `#home-link`    | The first element with `id="home-link"`               |
| Class        | `.blog-post`    | Any element with `class="blog-post"`                  |
| Attribute    | `a[href]`       | Any `a` element that has an `href` attribute          |
| Pseudo-class | `a:first-child` | Any `a` element that is the first child of its parent |
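Here's a small Go sketch that applies a few of these selectors. It uses the goquery package, a common (but here assumed) choice for parsing HTML in Go; the inline HTML document is invented for the example.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// A tiny, made-up HTML document to try selectors against.
	html := `<div id="home-link">
		<a class="blog-post" href="/posts/1">First post</a>
		<a class="blog-post" href="/posts/2">Second post</a>
	</div>`

	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		panic(err)
	}

	// Attribute selector: every <a> element that has an href attribute.
	doc.Find("a[href]").Each(func(i int, s *goquery.Selection) {
		href, _ := s.Attr("href")
		fmt.Printf("%s -> %s\n", s.Text(), href)
	})

	// Pseudo-class selector: only the <a> that is its parent's first child.
	fmt.Println("First link:", doc.Find("a:first-child").Text())
}
```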

Let's practice selectors now; they're the most important part of writing an awesome web scraper! If a selector isn't correct, nothing will match, and your scraper will collect no data.

[20m] 💻 Game: Selector Diner

Choose the right plates while working the window at the CSS Diner. This fun game will level up your selector skills in preparation for your Web Scraper project.

[10m] 💻 Demo: Selecting Selectors

The instructor will demonstrate how to find and test selectors in Chrome before integrating them into your web scraper.

Techniques Demonstrated

  • Inspect an element, right-click its node in the DOM tree, then choose Copy > Copy selector.
  • Test the selector using Ctrl + F in the Elements panel of the inspector.

If your selector does not work using these methods, it WILL NOT WORK IN YOUR SCRAPER!

[10m] BREAK

[15m] 📖 Overview: Colly

A popular open-source package, Colly, provides a clean foundation for writing any kind of crawler, scraper, or spider. Features include (a few of them appear in the sketch after this list):

  • Lots of cool Go language concepts!
  • Fast (>1k request/sec on a single core)
  • Manages request delays and maximum concurrency per domain
  • Automatic cookie and session handling
  • Sync/async/parallel scraping
  • Distributed scraping
  • Caching
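As a taste of the throttling and async features, here's a minimal sketch built from the patterns shown in Colly's docs; the domain glob and limit values are arbitrary choices for illustration.

```go
package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly"
)

func main() {
	// An async collector: Visit() queues requests instead of blocking.
	c := colly.NewCollector(colly.Async(true))

	// Throttle: at most 2 parallel requests per matching domain, with a
	// random delay of up to 1 second between requests.
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		RandomDelay: 1 * time.Second,
	})

	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", r.Request.URL)
	})

	c.Visit("https://hackerspaces.org/")

	// Block until all queued requests have finished.
	c.Wait()
}
```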

[20m] Activity: Colly Calls Back

Colly works via a series of callbacks that are executed anytime Visit() is called on a collector.

Callbacks are functions that execute after another function completes.

Colly supports the following callbacks, among others:

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

// main() contains code adapted from example found in Colly's docs:
// http://go-colly.org/docs/examples/basic/
func main() {
	// Instantiate default collector
	c := colly.NewCollector()

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		// Find link using an attribute selector
		// Matches any element that includes href=""
		link := e.Attr("href")

		// Print link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)

		// Visit link
		e.Request.Visit(link)
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.OnError(func(_ *colly.Response, err error) {
		fmt.Println("Something went wrong:", err)
	})

	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", r.Request.URL)
	})

	c.OnScraped(func(r *colly.Response) {
		fmt.Println("Finished", r.Request.URL)
	})

	// Start scraping on https://hackerspaces.org
	c.Visit("https://hackerspaces.org/")
}
```

With a partner, use the sample code to determine the order in which these callbacks fire. To examine the output, paste the above snippet into a file, then build and run your executable.

[10m] TT: Advantages and Disadvantages of Using Colly

Good

  • Quick to copy and paste an example from the docs and modify it to create your own web scraper.
  • Lots of plugins and libraries with good documentation
  • Security features allow you to cloak your scraper so it isn't detected

Not So Good

  • Colly doesn't execute JavaScript, so it can't scrape websites that render their components client-side (e.g., into a shadow DOM)
  • This means you can't use Colly on its own to scrape websites written in Angular, Vue, or React

[30m] Video: Headless Web Scraping

<iframe id="ytplayer" type="text/html" width="720" height="405" src="https://www.youtube.com/embed/_7pWCg94sKw?modestbranding=1&playsinline=1" frameborder="0" allow="picture-in-picture" allowfullscreen></iframe>

[20m] Example Code / Demo
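One way to scrape JavaScript-rendered pages from Go is to drive a headless browser. Below is a minimal sketch using the chromedp package; the library choice and target URL are assumptions for illustration (chromedp also requires a local Chrome/Chromium install), not necessarily the demo shown in class.

```go
package main

import (
	"context"
	"fmt"

	"github.com/chromedp/chromedp"
)

func main() {
	// Launch a headless Chrome session (requires Chrome/Chromium installed).
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	// Navigate to a (hypothetical) JavaScript-rendered page, then grab
	// the fully rendered HTML after the browser has executed its scripts.
	var html string
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://example.com/"),
		chromedp.OuterHTML("html", &html),
	)
	if err != nil {
		panic(err)
	}

	// The rendered markup can now be parsed with the selector techniques above.
	fmt.Println(html)
}
```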

📚 Resources & Credits