ISSR Summer Methodology Workshops: Practical Web and Text Scraping (2016)

ISSR Practical Web and Text Scraping 2016

Overview

This Git repo contains all of the materials for the ISSR Practical Web and Text Scraping workshop. You can email me at mdenny@psu.edu or matthewjdenny@gmail.com with any questions. Additional teaching materials are available on my website: http://www.mjdenny.com/teaching.html. To download the materials you see here, start by downloading and installing a GUI client for Git.

You will then want to clone this repo onto your computer, either by giving your client the link

https://github.com/matthewjdenny/ISSR_Practical_Scraping_2016.git

or by clicking the "Clone in Desktop" button on the right-hand side of the page. If you want to directly edit the files posted here and track your changes, you can copy individual files into another directory and create your own repo with the files in it. We will go over using GitHub in detail on the first morning of the workshop, so there is no need to spend too much time trying to figure GitHub out.
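
If you prefer a terminal to a GUI client, the clone step can be sketched roughly as follows. This assumes the command-line version of Git is installed and on your PATH, which the workshop itself does not require; the variable names `REPO_URL` and `REPO_DIR` are illustrative only.

```shell
# Repository URL as given in this README.
REPO_URL="https://github.com/matthewjdenny/ISSR_Practical_Scraping_2016.git"

# By default Git clones into a directory named after the last part of the URL,
# minus the ".git" suffix. REPO_DIR is an illustrative name for that directory.
REPO_DIR="$(basename "$REPO_URL" .git)"

# Clone only if the directory is not already there, and print a hint on failure.
if [ ! -d "$REPO_DIR" ]; then
    git clone "$REPO_URL" || echo "Clone failed; check that Git is installed and you are online."
fi
```

Once the clone finishes, the workshop materials should be in the `ISSR_Practical_Scraping_2016` directory inside whatever folder you ran the commands from.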

Schedule

This is a draft outline of the workshop schedule. It will likely change over the course of the workshop, depending on how quickly we move through the material.

Before the workshop

You should be comfortable with R up to the level presented in the [ISSR Data Management in R Workshop]. Also, if you have not already done so, please download R and RStudio (or update your installation to the newest version).

Wednesday

  1. 9:00-10:00 Basic web scraping in R: [Script].
  2. 10:00-10:10 Break.
  3. 10:10-12:00 Scraping Twitter: [Script], [Helpful Tutorial], [StreamR Github].
  4. 12:00-1:00 Lunch.
  5. 1:00-3:00 Advanced web scraping: [Script].

Thursday

  1. 9:00-11:00 Introduction to text processing in R: [Script], [Tutorial].
  2. 11:00-11:10 Break.
  3. 11:10-12:00 Text processing and analysis using Quanteda and SpeedReader.
  4. 12:00-1:00 Lunch.
  5. 1:00-2:00 Text processing and analysis using Quanteda and SpeedReader (continued).
  6. 2:00-3:00 Synthesis exercise: [Script].

Resources

  • Hadley Wickham has an R package for web scraping, rvest, which he details in a blog post.
  • A blog post by Charles Dimaggio that I have referred to in the past.
  • Another blog post, by Zev Ross, that I have referred to in the past.
  • Hadley Wickham wrote a book titled Advanced R, which covers a range of advanced functionality in R and is available online for free at http://adv-r.had.co.nz/.