slides/index.Rmd

---
title: "R case study: web scraping"
author: "John Little"
date: "`r Sys.Date()`"
output:
  xaringan::moon_reader:
    lib_dir: libs
    css: 
      - xaringan-themer.css
      - styles/my-theme.css
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---

```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
```

```{r xaringan-themer, include=FALSE, warning=FALSE}
library(xaringanthemer)
library(tidyverse)
library(gt)
library(xaringanExtra)
xaringanExtra::use_tachyons()
library(htmltools)
tagList(rmarkdown::html_dependency_font_awesome())

style_duo_accent(primary_color = "#012169", secondary_color = "#005587")
```

## Duke University: Land Acknowledgement

I would like to take a moment to honor the land in Durham, NC.  Duke University sits on the ancestral lands of the Shakori, Eno and Catawba people. This institution of higher education is built on land stolen from those peoples.  These tribes were here before the colonizers arrived.  Additionally this land has borne witness to over 400 years of the enslavement, torture, and systematic mistreatment of African people and their descendants.  Recognizing this history is an honest attempt to breakout beyond persistent patterns of colonization and to rewrite the erasure of Indigenous and Black peoples.  There is value in acknowledging the history of our occupied spaces and places.  I hope we can glimpse an understanding of these histories by recognizing the origins of collective journeys.

---
## Demonstration Goals

```{r child="_child-footer.Rmd", include=FALSE}
```

- Building on earlier [Rfun workshops](https://rfun.library.duke.edu/)
- Web scraping is fundamentally a deconstruction process
- Introduce just enough HTML/CSS
- Introduce the `library(rvest)` package for harvesting websites/HTML
- Tidyverse iteration with `purrr::map`
- Point out useful documentation & resources 

.f7.i.moon-gray[This is a demonstration of leveraging the Tidyverse.  This is not an research design or HTML design class.  YMMV: data gathering and cleaning are vital and can be complex. ]

--

### Caveats
- You will be as successful as the web author(s) were consistent 
- Read and follow the _Terms of Use_ for any target web host
- Read and honor the host's robots.txt | https://www.robotstxt.org
    - Always **pause** to avoid the perception of a _Denial of Service_ (DOS) attack
    
---
```{r child="_child-footer.Rmd", include=FALSE}
```

.left-column[
### Scraping =

.f6[Gather or ingest web page data for analysis]

![scraping bee propolis](images/Scraping_propolis.jpg "scraping propolis")
&nbsp;  
`rvest::`  
`read_html()`  
]

--

.right-column[

** Crawling + Parsing**

<div class = "container" id = "imgspcl" style="width: 100%; max-width: 100%;">
<img src = "images/crawling_med.jpg" width = "50%"> &nbsp; + &nbsp; <img src = "images/strain_comb.jpg" width="50%">
</div>

.pull-left[
.f7[Systematically iterating through a website, gathering data from more than one page (URL)]

`purrr::map()`  

&nbsp;  

&nbsp;  

&nbsp;  


.f7[
https://purrr.tidyverse.org  
]
]


.pull-right[
.f7[Separating the syntactic elements of the HTML.  Keeping only the data you need]

`rvest::html_nodes()`  
`rvest::html_text()`  
`rvest::html_attr()`  

&nbsp;  

&nbsp;   


.f7[
https://rvest.tidyverse.org    

]
]
]


---
## HTML

Hypter Text Markup Language

```html
<html>
  <body>
  
    <h1>My First Heading</h1>
    <p>My first paragraph contains a 
    <a href="https://www.w3schools.com">link</a> to
    W3schools.com
    </p>
  
  </body>
</html>

```

```{r child="_child-footer.Rmd", include=FALSE}
```
---
## HTML + CSS

Cascading Style Sheets

```css

<html>
<body>

  <div class="abc"> ... </div>
  
  <div id="xyz"> 
    <span class="foo"> ... </span>
  </div>
  
  <span id="bar"> ... </span>

</body>
</html>

```

&nbsp;  

for example:  http://www.vondel.humanities.uva.nl/style.css


```{r child="_child-footer.Rmd", include=FALSE}
```

---
## Procedure

The basic workflow of web scraping is

1. Development

    - Import raw HTML of a single target page (page detail:  a leaf or node)
    - Parse the HTML of the test page and gather specific data 
    - Check _robots.txt_ and _Terms Of Use_ (TOU)
    - In a web browser, manually browse and understand the target site's navigation (site navigation: branches)
    - _Parse_ the site navigation and develop an _iteration_ plan
    - _Iterate_: orchestrate/automate page crawling
    - Perform a dry run with a limited subset of the target web site
    - Construct pauses:  avoid the posture of a DNS attack

1. Production

    - Iterate/Crawl the site (navigation: branches)
    - Parse HTML for each target page (pages: leaves or nodes)

```{r child="_child-footer.Rmd", include=FALSE}
```
---
background-image: url(images/selector_graph.png)

<!-- an image of branches and nodes -->


```{r child="_child-footer.Rmd", include=FALSE}
```

---
class:  middle, center

.bg-washed-blue.b--navy.ba.bw2.br3.shadow-5.ph4.mt5[

## John R Little

.f5.blue[Data Science Librarian  
Center for Data & Visualization Sciences  
Duke University Libraries  
]

.f7[https://johnlittle.info]  
.f7[https://Rfun.library.duke.edu]  
.f7[https://library.duke.edu/data]
]

<i class="fab fa-creative-commons fa-2x"></i> &nbsp; <i class="fab fa-creative-commons-by fa-2x"></i><i class="fab fa-creative-commons-nc fa-2x"></i>  
.f6.moon-gray[Creative Commons:  Attribution-NonCommercial 4.0]  
.f7.moon-gray[https://creativecommons.org/licenses/by-nc/4.0]