Itinerary for "Web-scraping and Web-crawling with Python" Workshop, University of Pittsburgh, April 2019

Web-scraping and Web-crawling with Python

Schedule and various materials for "Web-scraping and Web-crawling with Python" Workshop, University of Pittsburgh, April 2019

Schedule

Workshop Begins at 1 p.m.

Install Anaconda, connect to wifi

(approximately 15 minutes)

Note: Anaconda takes up 2.17GB of space! If you are short on disk space, you might want to install Miniconda, which is a lightweight version.

  1. Go to https://www.anaconda.com/download/ and select the Python 3.6 download (make sure the installer version matches your operating system)
  2. Click the downloaded installer and follow the on-screen instructions (on Macs, this is a .pkg file). After it's done installing (this can take 5-10 minutes), double-click the application "Anaconda-Navigator" and make sure it loads properly
  3. If you would like to work in a Jupyter Notebook (which I recommend), you can open "Anaconda Navigator," click the Launch button on the Jupyter Notebook card (not the JupyterLab card), and then click "New > Python 3" to load a new notebook.

For Windows-specific instructions, visit https://docs.anaconda.com/anaconda/install/windows/
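
Once a notebook is open, a quick sanity check can confirm that it is running a recent enough interpreter. The cell below is only a minimal sketch: the helper name `python_is_recent_enough` is hypothetical, and the 3.6 minimum is taken from the download step above rather than from any stated requirement of the workshop code.

```python
import sys


def python_is_recent_enough(version_info=sys.version_info, minimum=(3, 6)):
    """Return True if the given interpreter version is at least `minimum`.

    `minimum` defaults to (3, 6) because that is the download suggested above;
    it is an assumption, not a documented requirement.
    """
    return tuple(version_info[:2]) >= tuple(minimum)


# Show which interpreter the notebook is actually using.
print("Python version:", sys.version.split()[0])
print("Meets the 3.6 minimum:", python_is_recent_enough())
```

If the second line prints `False`, the notebook is probably attached to an older Python than the one Anaconda installed.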

Participant Intros

(approximately 20 minutes)

As we go around the room, tell us your name and what you are studying or working on. If you have a specific example, name something you are interested in scraping and why it's potentially a challenge for you. If your interest is more general, say a little about the kinds of sites you would want to scrape and why.

Fundamentals Pecha Kucha

(20 seconds x 20 slides = just under 7 minutes, plus transition time)

Below is a copy of my slide deck for later reference:

https://docs.google.com/presentation/d/e/2PACX-1vTyU_MB3a6KeBckheSvJ-WmkGUC1COVa5zlM6B8FA-rju2XL4qkf7aKhBt5Zynjn_SwEypxhyP3Pi8_/pub?start=false&loop=false&delayms=20000

Activity 1

(approximately 15 minutes)

In this part of the workshop, I'll take us through some of the examples I used in my slides, as well as a couple of wildcards. We will look at the source code of several websites and try to think about how to break down the problem of web-scraping. The example sites are:
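
The example sites are not reproduced here, so as a stand-in, the sketch below shows what "breaking down the problem" can look like on a small, made-up HTML snippet. Everything in it is an assumption for illustration: `SAMPLE_HTML`, `LinkCollector`, and `extract_links` are hypothetical names, and it uses only Python's standard-library `html.parser` rather than whichever scraping library the workshop itself uses.

```python
from html.parser import HTMLParser

# Hypothetical page source, standing in for what "View Source" would show.
SAMPLE_HTML = """
<html><body>
  <h1>Course Listings</h1>
  <ul class="courses">
    <li><a href="/courses/101">Intro to Scraping</a></li>
    <li><a href="/courses/202">Crawling at Scale</a></li>
  </ul>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs.
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link we are tracking.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def extract_links(html):
    """Parse an HTML string and return its links as (href, text) tuples."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


for href, text in extract_links(SAMPLE_HTML):
    print(href, "->", text)
```

The same questions apply to a real page: which tag holds the data you want, which attribute identifies it, and what surrounding structure (here, the `<ul class="courses">` list) can be relied on to find it.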

Break around 2:10 p.m.

(approximately 5 minutes)

Best Practices Pecha Kucha

(20 seconds x 20 slides = just under 7 minutes, plus transition time)

https://docs.google.com/presentation/d/e/2PACX-1vTXMKkAhOD7yPHUTp51_TuKKYCDTA7qfgxhSSkddb_U__NB1pwgSrEAGVrX4QnqNniftBkhxrVXaLla/pub?start=false&loop=false&delayms=20000

Activity 2

(approximately 20 minutes)

In this section of the workshop, we will break into groups and focus on some of the web-scraping questions or use cases you have in mind. If participants don't have example sites of their own, we will return to my examples and discuss end-to-end data collection, including modeling, scraping, crawling, and setting up datastores.
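
To make the four stages concrete, here is a rough end-to-end sketch under several loud assumptions: the "website" is a hypothetical in-memory dictionary (`FAKE_SITE`) so the code runs without network access, the data model is a single `Page` record, and the datastore is an in-memory SQLite database. None of these names or choices come from the workshop materials; a real project would swap `fetch` for an HTTP request and add delays and error handling.

```python
import sqlite3
from collections import deque
from html.parser import HTMLParser
from typing import NamedTuple
from urllib.parse import urljoin

# Hypothetical stand-in for a website: URL -> HTML source.
FAKE_SITE = {
    "http://example.test/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "http://example.test/a": '<title>Page A</title><a href="/b">B</a>',
    "http://example.test/b": '<title>Page B</title><a href="/">Home</a>',
}


class Page(NamedTuple):
    """Modeling: decide up front which fields you want to keep per page."""
    url: str
    title: str


class PageParser(HTMLParser):
    """Scraping: pull the title and outgoing links out of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.hrefs = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Return the HTML for `url`, or None if the fake site has no such page."""
    return FAKE_SITE.get(url)


def crawl(start_url, max_pages=10):
    """Crawling: breadth-first traversal that never visits a URL twice."""
    seen = {start_url}
    queue = deque([start_url])
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()
        pages.append(Page(url=url, title=parser.title.strip()))
        for href in parser.hrefs:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


def store(pages, conn):
    """Datastore: write the scraped records into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
        [(p.url, p.title) for p in pages],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
pages = crawl("http://example.test/")
store(pages, conn)
rows = conn.execute("SELECT url, title FROM pages ORDER BY url").fetchall()
for row in rows:
    print(row)
```

Keeping the stages in separate functions means each one can be replaced independently, for example pointing `sqlite3.connect` at a file path instead of `":memory:"` once the data should persist.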

Share Results

(approximately 20 minutes)

Farewells

(approximately 10 minutes)
