forked from libjohn/workshop_webscraping
---
title: "R case study: web scraping"
author: "John Little"
date: "`r Sys.Date()`"
output:
  xaringan::moon_reader:
    lib_dir: libs
    css:
      - xaringan-themer.css
      - styles/my-theme.css
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
```
```{r xaringan-themer, include=FALSE, warning=FALSE}
library(xaringanthemer)
library(tidyverse)
library(gt)
library(xaringanExtra)
xaringanExtra::use_tachyons()
library(htmltools)
tagList(rmarkdown::html_dependency_font_awesome())
style_duo_accent(primary_color = "#012169", secondary_color = "#005587")
```
## Duke University: Land Acknowledgement
I would like to take a moment to honor the land in Durham, NC. Duke University sits on the ancestral lands of the Shakori, Eno and Catawba people. This institution of higher education is built on land stolen from those peoples. These tribes were here before the colonizers arrived. Additionally, this land has borne witness to over 400 years of the enslavement, torture, and systematic mistreatment of African people and their descendants. Recognizing this history is an honest attempt to break out beyond persistent patterns of colonization and to rewrite the erasure of Indigenous and Black peoples. There is value in acknowledging the history of our occupied spaces and places. I hope we can glimpse an understanding of these histories by recognizing the origins of collective journeys.
---
## Demonstration Goals
```{r child="_child-footer.Rmd", include=FALSE}
```
- Building on earlier [Rfun workshops](https://rfun.library.duke.edu/)
- Web scraping is fundamentally a deconstruction process
- Introduce just enough HTML/CSS
- Introduce the `library(rvest)` package for harvesting websites/HTML
- Tidyverse iteration with `purrr::map`
- Point out useful documentation & resources
.f7.i.moon-gray[This is a demonstration of leveraging the Tidyverse. This is not a research design or HTML design class. YMMV: data gathering and cleaning are vital and can be complex. ]
--
### Caveats
- You will be as successful as the web author(s) were consistent
- Read and follow the _Terms of Use_ for any target web host
- Read and honor the host's robots.txt | https://www.robotstxt.org
- Always **pause** between requests to avoid the perception of a _Denial of Service_ (DoS) attack
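
One way to build in a pause, sketched under assumptions: `purrr::slowly()` wraps a function so that repeated calls are spaced out by a fixed delay. The `fetch_page()` helper and the example.com URLs are hypothetical stand-ins; a real scraper might call `rvest::read_html(url)` there, and a suitable delay depends on the host.

```r
library(purrr)

# Hypothetical stand-in for a function that downloads one page.
# In real use this might wrap rvest::read_html(url).
fetch_page <- function(url) {
  paste("fetched", url)
}

# slowly() returns a rate-limited version of fetch_page();
# rate_delay() sets the pause (in seconds) between successive calls.
polite_fetch <- slowly(fetch_page, rate = rate_delay(pause = 0.5))

# Made-up URLs, for illustration only
urls <- c(
  "https://example.com/a",
  "https://example.com/b",
  "https://example.com/c"
)

# Each call after the first waits before running
results <- map_chr(urls, polite_fetch)
results
```

Wrapping the fetch function once keeps the pause out of the parsing code, so the delay can be tuned in a single place.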
---
```{r child="_child-footer.Rmd", include=FALSE}
```
.left-column[
### Scraping =
.f6[Gather or ingest web page data for analysis]
![scraping bee propolis](images/Scraping_propolis.jpg "scraping propolis")
`rvest::`
`read_html()`
]
--
.right-column[
**Crawling + Parsing**
<div class = "container" id = "imgspcl" style="width: 100%; max-width: 100%;">
<img src = "images/crawling_med.jpg" width = "50%"> + <img src = "images/strain_comb.jpg" width="50%">
</div>
.pull-left[
.f7[Systematically iterating through a website, gathering data from more than one page (URL)]
`purrr::map()`
.f7[
https://purrr.tidyverse.org
]
]
.pull-right[
.f7[Separating the syntactic elements of the HTML. Keeping only the data you need]
`rvest::html_nodes()`
`rvest::html_text()`
`rvest::html_attr()`
.f7[
https://rvest.tidyverse.org
]
]
]
---
## HTML
HyperText Markup Language
```html
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph contains a
<a href="https://www.w3schools.com">link</a> to
W3schools.com
</p>
</body>
</html>
```
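
As a minimal sketch, the HTML above can be parsed with `rvest` by passing it to `read_html()` as a literal string (in practice the argument would usually be a URL). The variable names are illustrative assumptions.

```r
library(rvest)

# The example page from this slide, supplied as a literal string
raw_html <- '<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph contains a
<a href="https://www.w3schools.com">link</a> to
W3schools.com
</p>
</body>
</html>'

page <- read_html(raw_html)

# Text of the <h1> element
heading <- page %>% html_nodes("h1") %>% html_text()

# Text and href attribute of the <a> element
link_text <- page %>% html_nodes("a") %>% html_text()
link_url  <- page %>% html_nodes("a") %>% html_attr("href")

heading    # "My First Heading"
link_text  # "link"
link_url   # "https://www.w3schools.com"
```

`html_text()` returns what a reader sees; `html_attr()` returns what is stored inside the tag, such as the link target.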
```{r child="_child-footer.Rmd", include=FALSE}
```
---
## HTML + CSS
Cascading Style Sheets
```html
<html>
<body>
<div class="abc"> ... </div>
<div id="xyz">
<span class="foo"> ... </span>
</div>
<span id="bar"> ... </span>
</body>
</html>
```
for example: http://www.vondel.humanities.uva.nl/style.css
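
The `class` and `id` attributes in the markup above are what CSS selectors target, and `rvest` accepts the same selectors. A minimal sketch follows; the placeholder text inside each element is an assumption added for illustration, since the slide shows only `...`.

```r
library(rvest)

# Same structure as the slide; the element text is made up for illustration
page <- read_html('<html>
<body>
<div class="abc">first div</div>
<div id="xyz">
<span class="foo">inner span</span>
</div>
<span id="bar">outer span</span>
</body>
</html>')

# .class selects by class attribute
by_class <- page %>% html_nodes("div.abc") %>% html_text()

# A space selects descendants: span.foo anywhere inside #xyz
nested <- page %>% html_nodes("#xyz span.foo") %>% html_text()

# #id selects by id attribute
by_id <- page %>% html_nodes("#bar") %>% html_text()

# A bare tag name selects every matching element, in document order
all_spans <- page %>% html_nodes("span") %>% html_text()

by_class   # "first div"
nested     # "inner span"
by_id      # "outer span"
all_spans  # "inner span" "outer span"
```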
```{r child="_child-footer.Rmd", include=FALSE}
```
---
## Procedure
The basic workflow of web scraping is:
1. Development
    - Import raw HTML of a single target page (page detail: a leaf or node)
    - Parse the HTML of the test page and gather specific data
    - Check _robots.txt_ and _Terms of Use_ (TOU)
    - In a web browser, manually browse and understand the target site's navigation (site navigation: branches)
    - _Parse_ the site navigation and develop an _iteration_ plan
    - _Iterate_: orchestrate/automate page crawling
    - Perform a dry run with a limited subset of the target web site
    - Construct pauses: avoid the posture of a DoS attack
1. Production
    - Iterate/Crawl the site (navigation: branches)
    - Parse HTML for each target page (pages: leaves or nodes)
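
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a complete scraper: the `pages` vector holds made-up HTML strings keyed by hypothetical URLs so the example runs offline, and `scrape_title()` is a hypothetical helper. Against a real site, `read_html()` would receive each URL instead.

```r
library(rvest)
library(purrr)

# Hypothetical pages: names are made-up URLs, values stand in for their HTML
pages <- c(
  "https://example.com/page1" = "<html><body><h1>Page one</h1></body></html>",
  "https://example.com/page2" = "<html><body><h1>Page two</h1></body></html>",
  "https://example.com/page3" = "<html><body><h1>Page three</h1></body></html>"
)

# Development: import and parse a single page, pausing before each request
scrape_title <- function(html, pause = 0.2) {
  Sys.sleep(pause)  # short for the demo; real crawls usually wait longer
  read_html(html) %>%
    html_nodes("h1") %>%
    html_text()
}

# Dry run: iterate over a limited subset first
dry_run <- map_chr(head(pages, 2), scrape_title)

# Production: iterate over every target page
all_titles <- map_chr(pages, scrape_title)
all_titles
```

Keeping the per-page logic in one function means the dry run and the production run differ only in which pages are passed to `map_chr()`.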
```{r child="_child-footer.Rmd", include=FALSE}
```
---
background-image: url(images/selector_graph.png)
<!-- an image of branches and nodes -->
```{r child="_child-footer.Rmd", include=FALSE}
```
---
class: middle, center
.bg-washed-blue.b--navy.ba.bw2.br3.shadow-5.ph4.mt5[
## John R Little
.f5.blue[Data Science Librarian
Center for Data & Visualization Sciences
Duke University Libraries
]
.f7[https://johnlittle.info]
.f7[https://Rfun.library.duke.edu]
.f7[https://library.duke.edu/data]
]
<i class="fab fa-creative-commons fa-2x"></i> <i class="fab fa-creative-commons-by fa-2x"></i><i class="fab fa-creative-commons-nc fa-2x"></i>
.f6.moon-gray[Creative Commons: Attribution-NonCommercial 4.0]
.f7.moon-gray[https://creativecommons.org/licenses/by-nc/4.0]