ISSR Practical Web and Text Scraping 2016
This Git repo contains all materials necessary for the ISSR Practical Web and Text Scraping workshop. You can email me at firstname.lastname@example.org or email@example.com with any questions. Additional materials are available on my website: http://www.mjdenny.com/teaching.html. To download the materials you see here, start by downloading a GUI client for Git.
- For Windows: https://windows.github.com/
- For Mac: https://mac.github.com/
- For Linux, you may have to rely on the command line, although https://git-scm.com/downloads/guis has some options (depending on your distro).
You will then want to clone this repo onto your computer, using either the link and your client or the "Clone in Desktop" button on the right-hand side of the page. If you want to directly edit the files posted here and track your changes, you can copy individual files into another directory and create your own repo with the files in it. We will go over using GitHub in detail on the first morning of the workshop, so there is no need to spend too much time trying to figure GitHub out.
This is a draft outline of the workshop schedule. It will likely change over the course of the workshop depending on how fast we go.
Before the workshop
You should be comfortable with R up to the level presented in the [ISSR Data Management in R Workshop]. Also, if you have not already done so, please download R and RStudio (or update your installation to the newest version):
- Download R: https://cran.r-project.org/
- Download RStudio: https://www.rstudio.com/products/rstudio/download/
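As a quick check before the workshop, the sketch below prints your installed R version and reports which packages are already available. The package list is an assumption based on the tools named in this README (rvest, quanteda, streamR); it is not an official requirements list for the workshop.

```r
# Print the version of R that is currently installed
cat(R.version.string, "\n")

# Packages named in this README (assumed list, not an official requirement)
pkgs <- c("rvest", "quanteda", "streamR")

# TRUE for each package that is installed and loadable, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Names of any packages that still need to be installed
missing_pkgs <- names(installed)[!installed]
print(missing_pkgs)
```

Any package listed in `missing_pkgs` can usually be installed from CRAN with `install.packages(missing_pkgs)`.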
- 9:00-10:00 Basic web scraping in R: [Script].
- 10:00-10:10 Break
- 10:10-12:00 Scraping Twitter: [Script], [Helpful Tutorial], [StreamR Github].
- 12:00-1:00 Lunch
- 1:00-3:00 Advanced web scraping: [Script].
- 9:00-11:00 Introduction to text processing in R: [Script], [Tutorial].
- 11:00-11:10 Break
- 11:10-12:00 Text processing and analysis using Quanteda and SpeedReader.
- 12:00-1:00 Lunch
- 1:00-2:00 Text processing and analysis using Quanteda and SpeedReader (continued).
- 2:00-3:00 Synthesis exercise: [Script].
- Hadley Wickham has an R package, rvest, for web scraping that is detailed in this blog post.
- A blog post by Charles Dimaggio that I have referred to in the past: blog post.
- Another blog post by Zev Ross that I have referred to in the past: blog post.
- Hadley Wickham wrote a book, Advanced R, that covers a range of advanced functionality in R. It is available online for free at http://adv-r.had.co.nz/.
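
To give a flavor of the rvest package mentioned in the list above, here is a minimal sketch of extracting links from a page. It is not taken from the workshop scripts. The inline HTML is a made-up stand-in so the example runs without network access, and it assumes rvest version 1.0.0 or later (for `html_elements()` and `html_text2()`).

```r
library(rvest)

# Hypothetical inline page used in place of a real website
page <- minimal_html('
  <h1>Workshop links</h1>
  <ul>
    <li><a href="https://cran.r-project.org/">CRAN</a></li>
    <li><a href="http://adv-r.had.co.nz/">Advanced R</a></li>
  </ul>
')

# Select every <a> element nested inside a list item
links <- html_elements(page, "li a")

# Collect the visible link text and the href attribute into a data frame
link_table <- data.frame(
  text = html_text2(links),
  url  = html_attr(links, "href"),
  stringsAsFactors = FALSE
)

print(link_table)
```

For a live site, `read_html()` with the page URL would take the place of `minimal_html()`; the CSS selector would need to match that site's markup.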