In this notebook, we will walk through the basics of web scraping with the rvest package.

Web scraping is a technique for extracting information from websites. rvest is an R package for this task: it reads HTML and XML documents, selects elements with CSS selectors or XPath, and extracts their text and attributes.

First, let's load the package. If rvest is not installed yet, install it once with install.packages("rvest").

In [1]:
library("rvest")

Now, let's use the read_html function to read the HTML of a website. We will use the Wikipedia portal page as the example:

In [2]:
# Read the HTML of the website
url <- "https://www.wikipedia.org/"
html <- read_html(url)

The read_html function takes the URL of the page as its first argument and returns the parsed document. It also accepts a local file path or a string of literal HTML.
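Because read_html also parses literal HTML, a small string is a convenient way to experiment without a network connection. The sketch below uses a made-up snippet (the markup and the names snippet and doc are illustrative and not taken from the Wikipedia page):

```r
library(rvest)

# A small, made-up HTML snippet used only for illustration
snippet <- '<html><body>
  <h1>Languages</h1>
  <a href="//en.wikipedia.org/">English</a>
  <a href="//de.wikipedia.org/">Deutsch</a>
</body></html>'

# read_html() parses the string into the same kind of document
# object it returns for a URL
doc <- read_html(snippet)

# Inspect the class of the parsed document
class(doc)
```

The resulting object can be passed to the selection functions in the same way as a document read from a URL.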

Now we can use the html_nodes function to select specific elements of the HTML document. (In rvest 1.0.0 and later, html_elements is the preferred name for this function; html_nodes still works.) Let's select all of the anchor tags on the page:

In [3]:
# Select all of the anchor tags
anchors <- html_nodes(html, "a")

In [4]:
anchors

{xml_nodeset (334)}
 [1] <a id="js-link-box-en" href="//en.wikipedia.org/" title="English — Wikip ...
 [2] <a id="js-link-box-ru" href="//ru.wikipedia.org/" title="Russkiy — Викип ...
 [3] <a id="js-link-box-es" href="//es.wikipedia.org/" title="Español — Wikip ...
 [4] <a id="js-link-box-ja" href="//ja.wikipedia.org/" title="Nihongo — ウィキペデ ...
 [5] <a id="js-link-box-de" href="//de.wikipedia.org/" title="Deutsch — Wikip ...
 [6] <a id="js-link-box-fr" href="//fr.wikipedia.org/" title="français — Wiki ...
 [7] <a id="js-link-box-it" href="//it.wikipedia.org/" title="Italiano — Wiki ...
 [8] <a id="js-link-box-zh" href="//zh.wikipedia.org/" title="Zhōngwén — 维基百科 ...
 [9] <a id="js-link-box-fa" href="//fa.wikipedia.org/" title="Fārsi — ویکی‌پد ...
[10] <a id="js-link-box-pt" href="//pt.wikipedia.org/" title="Português — Wik ...
[11] <a href="//pl.wikipedia.org/" lang="pl">Polski</a>
[12] <a href="//ar.wikipedia.org/" lang="ar" title="Al-ʿArabīyah"><bdi dir="r ...
[13] <a href="//de.wik

The html_nodes function takes the HTML document and a CSS selector for the elements to select (an XPath expression can be supplied through the xpath argument instead), and it returns a node set.
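The selector does not have to be a bare tag name. As a minimal sketch, assuming a hand-written page that imitates a few of the anchors printed above, attribute selectors can narrow the match (the variable names here are illustrative):

```r
library(rvest)

# Hand-written markup that imitates a few anchors from the output above
page <- read_html('<html><body>
  <a id="js-link-box-en" href="//en.wikipedia.org/">English</a>
  <a id="js-link-box-de" href="//de.wikipedia.org/">Deutsch</a>
  <a href="//pl.wikipedia.org/" lang="pl">Polski</a>
</body></html>')

# Every anchor tag
all_links <- html_nodes(page, "a")

# Only anchors that carry a lang attribute
lang_links <- html_nodes(page, "a[lang]")

# Only anchors whose id starts with "js-link-box"
box_links <- html_nodes(page, "a[id^='js-link-box']")

# Number of matches for each selector
c(all = length(all_links), lang = length(lang_links), box = length(box_links))
```

Narrowing the selector up front is usually simpler than selecting every anchor and filtering the results afterwards.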

We can use the html_text function to extract the text from the selected elements. Let's extract the text from the anchor tags:

In [5]:
# Extract the text from the anchor tags
anchor_text <- html_text(anchors)

In [6]:
anchor_text

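The html_text function takes the selected elements and returns a character vector with one string per element. Its optional trim argument (trim = TRUE) strips leading and trailing whitespace.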
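To see the effect of trim, here is a small sketch on made-up markup with padded link text (the page content and variable names are assumptions for illustration):

```r
library(rvest)

# Made-up markup; the first link has extra spaces around its text
page <- read_html('<html><body><a href="/a">  First link </a><a href="/b">Second link</a></body></html>')
links <- html_nodes(page, "a")

# Text as it appears in the document
raw_text <- html_text(links)

# Same text with leading and trailing whitespace removed
trimmed_text <- html_text(links, trim = TRUE)
trimmed_text
```

Trimming is often worthwhile on real pages, where link text tends to be surrounded by line breaks and indentation from the page source.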

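Next, we can use the html_attr function to extract a single attribute from the selected elements. Let's extract the href attribute of the anchor tags:

In [7]:
# Extract the href attribute from the anchor tags
anchor_hrefs <- html_attr(anchors, "href")

The html_attr function takes two arguments: the selected elements and the name of the attribute. It returns a character vector, with NA for elements that lack the attribute. The related html_attrs function returns all attributes of each element as a list of named character vectors, so html_attrs(anchors)$href does not work: the list itself has no element named href.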
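The difference between html_attr (one named attribute per element) and html_attrs (all attributes per element) is easiest to see side by side. The sketch below uses made-up markup. It also assumes you want full URLs, so it resolves the protocol-relative links (those starting with "//") with xml2::url_absolute, a step that is not part of the notebook above:

```r
library(rvest)

# Made-up markup; the last anchor deliberately has no href
page <- read_html('<html><body>
  <a id="js-link-box-en" href="//en.wikipedia.org/">English</a>
  <a href="//pl.wikipedia.org/" lang="pl">Polski</a>
  <a name="top">No link here</a>
</body></html>')
links <- html_nodes(page, "a")

# html_attr(): one attribute, returned as a character vector
# (NA where the attribute is missing)
hrefs <- html_attr(links, "href")

# html_attrs(): every attribute, returned as a list with one
# named character vector per element
all_attrs <- html_attrs(links)

# Drop missing hrefs, then resolve the rest against the page URL
present <- hrefs[!is.na(hrefs)]
full_urls <- xml2::url_absolute(present, "https://www.wikipedia.org/")
full_urls
```

Since html_attrs returns a list, individual attributes are read per element, for example all_attrs[[1]]["href"].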

That's it! You've just seen how to use rvest for basic web scraping in R: read_html to load a page, html_nodes to select elements, html_text to extract their text, and html_attr to extract an attribute. These four functions cover the core workflow, and you can start using them in your own projects.