In [24]:
library(tidyverse)
library(rvest)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()         masks stats::filter()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag()            masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


# Lecture 11: Web scraping

<div style="border: 1px double black; padding: 10px; margin: 10px">

**After today's lecture you will:**
* Understand how to import data from online sources by scraping web pages.
</div>

These notes correspond to Chapter 26 of your book.


## Ethics of scraping data online
You should carefully read [Section 26.2](https://r4ds.hadley.nz/webscraping.html#scraping-ethics-and-legalities) of the book concerning various ethical and legal issues surrounding scraping web sites for data. In this class we will only look at large, public web sites like Wikipedia and IMDB, where there is no risk of anything bad happening. However, there are other situations where it may be unethical, or even illegal, to harvest data from a website, even if you are technically able. **As data scientists in the real world, it will be up to you to carefully weigh these concerns before using the tools discussed in today's lecture.**

## Reading data from the Internet
These days, it's increasingly common to pull data from online sources. For example, say I wanted to know the population of European countries. This is [easily found](https://en.wikipedia.org/wiki/Demographics_of_Europe#Population_by_country) on Wikipedia. How can I get these data into R and analyze them?

## How do web pages work?

Web pages are written in a special language called HTML (**H**yper**t**ext **M**arkup **L**anguage). Here is a simple example of some HTML:

    <html>
    <head> 
      <title>Page title</title>
    </head>
    <body>
      <h1 id='first'>A heading</h1>
      <p>Some text &amp; <b>some bold text.</b></p>
      <img src='myimg.png' width='100' height='100'>
    </body>
    </html>

Web scraping is possible because most web pages have a consistent, hierarchical structure. For example, if I asked you how to navigate to the title of the web page shown above, you would follow the "path"

    html > head > title
    
to arrive at "Page title".
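To make the idea of a "path" concrete, here is a minimal sketch that follows `html > head > title` with `rvest`. It assumes the example page is supplied as a string rather than downloaded from a URL; the variable names `page` and `title_text` are just for illustration.

```r
library(rvest)

# The example page from above, supplied as a string instead of a URL.
page <- read_html("
<html>
<head>
  <title>Page title</title>
</head>
<body>
  <h1 id='first'>A heading</h1>
  <p>Some text &amp; <b>some bold text.</b></p>
</body>
</html>")

# Follow the path html > head > title, then pull out the text inside the element.
title_text <- page %>%
  html_element("html > head > title") %>%
  html_text()

title_text
```

The string passed to `html_element()` is a CSS selector; `a > b` means "a `b` element that is a direct child of an `a` element".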

## HTML elements

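A pair of opening and closing tags, together with the content between them, makes up an element. For example, `<p> ... </p>` is a paragraph element, and `<head>...</head>` is the head element, which holds metadata about the page such as its title.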

There are a lot of HTML elements that might contain interesting information. Here are a few of the most common:

- Block elements start on a new line and denote sections of text: `<h1>` (heading), `<p>` (paragraph), `<ul>`/`<ol>` (unordered/ordered list), etc.
- Inline elements do not start on a new line; they flow in line with the surrounding content. Examples: `<span>...</span>`, `<a href='url'>click me</a>`, etc.
- `<table>` (a table), `<tr>` (a table row), `<td>` (a table cell), etc.
- Each of these elements can carry attributes such as `id=` or `class=`. For example, `<table id="movies">` is probably a table that contains movie information.
- The `id` value uniquely identifies an element in the HTML page.
- The `class` value may be applied to more than one element on the HTML page.
- `id` and `class` are used together with CSS (Cascading Style Sheets) to control the visual appearance of the page, but their use is not restricted to styling; for example, we can also use them for web scraping.
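Here is a small sketch of how `id` and `class` turn into CSS selectors in `rvest`: `#name` matches an `id`, `.name` matches a `class`, and selectors can be chained. The HTML snippet and the names in it (`movies`, `external`, the example URL) are made up for illustration.

```r
library(rvest)

# A made-up page with two tables and one link.
page <- read_html("
<html><body>
  <table id='movies' class='wikitable sortable'><tr><td>Alien</td></tr></table>
  <table class='wikitable'><tr><td>Other</td></tr></table>
  <a class='external' href='https://example.com'>click me</a>
</body></html>")

# '#movies' selects the element whose id is 'movies'.
by_id <- page %>% html_elements("#movies")

# 'table.wikitable' selects every <table> that has the class 'wikitable'.
by_class <- page %>% html_elements("table.wikitable")

# Chaining classes requires the element to have all of them.
both <- page %>% html_elements("table.wikitable.sortable")

# html_attr() reads the value of an attribute, here the link target.
link_url <- page %>% html_element("a.external") %>% html_attr("href")

c(length(by_id), length(by_class), length(both))
link_url
```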

The `rvest` package is used to load a web page and extract elements and tables based on their HTML tags. Let's see how it works by scraping the Wikipedia page mentioned earlier:

In [95]:
library(rvest)
europop <- read_html("http://en.wikipedia.org/wiki/Demographics_of_Europe#Population_by_country")
class(europop)

In [96]:
typeof(europop)

`read_html` returns an `xml_document` object. Internally, this object is represented as a list-like structure, which is why `typeof()` returns "list".

On this page, there are many tables. Let us open the Chrome Developer Tools and check the `class` attributes of the tables. Some of the tables have both of the class names `wikitable` and `sortable`. We will get only these tables.

## 🤔 Quiz

How many tables have both of the class names `wikitable` and `sortable`?

<ol style="list-style-type: upper-alpha;">
    <li>3</li>
    <li>1</li>
    <li>4</li>
    <li>5</li>
</ol>

In [104]:
# get those tables
wiki_tables <- europop %>% html_elements("table.wikitable.sortable") %>% html_table
class(wiki_tables)

A quick refresher on lists:
- a single pair of brackets, `[ ]`, subsets a list and returns a (shorter) list
- a double pair of brackets, `[[ ]]`, extracts a single element from the list
  

In [129]:
l1 <- list(10, 11, 12)
paste("type = ", typeof(l1[1]), "value = ", l1[1])
paste("type = ", typeof(l1[[1]]), "value = ", l1[[1]])

### Get the 3rd table
We will get the third table. How can we find the correct one? One option is to use our browser to find something that uniquely identifies the table that we want. Alternatively, if there are only a few tables, we can just use the index number to pick the one we want.
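Both options can be sketched on a toy page. This example assumes a made-up HTML snippet in which the second table happens to have the `id` `population`; the real Wikipedia tables may not have a convenient `id`, which is why indexing is often the practical choice.

```r
library(rvest)

# A made-up page with two tables; only the second has an id.
page <- read_html("
<html><body>
  <table class='wikitable'>
    <tr><th>x</th></tr>
    <tr><td>1</td></tr>
  </table>
  <table class='wikitable' id='population'>
    <tr><th>Country</th><th>Population</th></tr>
    <tr><td>Freedonia</td><td>1,234,567</td></tr>
  </table>
</body></html>")

# html_table() on several elements returns a list of data frames.
tables <- page %>% html_elements("table.wikitable") %>% html_table()

# Option 1: pick the table by its position in the list.
by_index <- tables[[2]]

# Option 2: pick the table by something that uniquely identifies it.
by_id <- page %>% html_element("#population") %>% html_table()

by_index
```

Note the use of `[[2]]` rather than `[2]`: we want the data frame itself, not a list of length one that contains it.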

Once the table is retrieved, select the first three columns, drop the first row using the `slice()` function, and then convert the data types of the columns using either
* `as.integer` (after removing the commas)
* `parse_number`
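The steps above can be sketched on a small stand-in table. The data frame `raw` below is invented for illustration (it is not the Wikipedia table); it mimics a scraped table whose first row repeats the header and whose numbers are stored as text with commas.

```r
library(dplyr)
library(readr)

# Invented stand-in for a scraped table: everything is text,
# and the first row is a repeated header.
raw <- tibble(
  Country    = c("Country", "Freedonia", "Sylvania"),
  Population = c("Population", "1,234,567", "89,000"),
  Area       = c("Area", "10,500", "2,300"),
  Notes      = c("Notes", "a", "b")
)

clean <- raw %>%
  select(1:3) %>%   # keep the first three columns
  slice(-1) %>%     # drop the first row
  mutate(
    # parse_number() handles the comma grouping mark on its own
    Population = parse_number(Population),
    # as.integer() does not, so the commas must be removed first
    Area = as.integer(gsub(",", "", Area))
  )

clean
```

Calling `as.integer("10,500")` directly would return `NA` with a warning, which is why the `gsub()` step is needed in that variant.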

In [58]:
# using gsub to replace ',' with nothing
str1 = "123,456"
gsub(",", "", str1)

In [130]:
#

Find the table that contains the population for each country

In [None]:
# 


## 🤔 Quiz

The country with the maximum population is

<ol style="list-style-type: upper-alpha;">
    <li>Russia</li>
    <li>Turkey</li>
    <li>United Kingdom</li>
    <li>Germany</li>
</ol>



In [None]:
#

## 🤔 Quiz

Use the same Wikipedia page (Demographics of Europe) to answer the following question:

On average, how many people were born *each day* in Europe between 2010 and 2021 (inclusive)?

<ol style="list-style-type: upper-alpha;">
    <li>90210.10</li>
    <li>23043.97</li>
    <li>7710127</li>
    <li>21123.64</li>
    <li>21109.18</li>
</ol>



In [123]:
library(lubridate)
# lubridate::make_date
# name the units argument: the third positional argument of difftime() is tz
difftime(make_date(2022, 1, 1), make_date(2020, 1, 1), units = "days")

# 366 + 365

Time difference of 731 days
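Once the number of days is known, a per-day average is just a total divided by that number. Here is a minimal sketch using the same two-year window as the cell above; the yearly totals in `yearly_births` are made-up round numbers, not values from the Wikipedia page.

```r
library(lubridate)

# Made-up yearly totals (NOT real data), chosen so the arithmetic is easy to check.
yearly_births <- c(`2020` = 366000, `2021` = 365000)

# Number of days from 1 Jan 2020 up to (but not including) 1 Jan 2022.
n_days <- as.numeric(
  difftime(make_date(2022, 1, 1), make_date(2020, 1, 1), units = "days")
)

# Average per day = total over the period / number of days in the period.
births_per_day <- sum(yearly_births) / n_days

n_days
births_per_day
```

Wrapping `difftime()` in `as.numeric()` drops the "Time difference of ..." wrapper so the result can be used in ordinary arithmetic.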

In [133]:
# 
# wiki_tables[[3]]

### The UofM Stats department
Let's say I wanted to extract all the [undergraduate stats courses](https://lsa.umich.edu/stats/undergraduate-students/statistics-courses.html) offered by the department. 

In [134]:
stats <- read_html('https://webapps.lsa.umich.edu/CrsMaint/Public/CB_PublicBulletin.aspx?crselevel=UG&subject=STATS')
stats

{html_document}
<html xmlns="http://www.w3.org/1999/xhtml">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\r\n    <form method="post" action="./CB_PublicBulletin.aspx?crsele ...

How should we extract the data from this web page? Inspecting the page shows that each course title is a `<b>` (bold) element. Use `html_elements` to extract the `b` elements, and then use `html_text` to extract the text from each element.
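The same two-step pattern can be tried offline first. The snippet below is a sketch on an invented page (the course names are hypothetical, and the real bulletin page has a more complicated layout); it only illustrates `html_elements("b")` followed by `html_text()`.

```r
library(rvest)

# Invented page fragment: each title is bold, followed by a description.
page <- read_html("
<html><body>
  <p><b>STATS 101. Hypothetical Intro Course</b> Description text.</p>
  <p><b>STATS 202. Another Hypothetical Course</b> More description.</p>
</body></html>")

titles <- page %>%
  html_elements("b") %>%   # every <b> element on the page
  html_text()              # the text inside each one

titles
```

On a real page, other bold text (labels, notes) may also match `b`, so the result usually needs a quick look and some filtering.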

In [147]:
#


## IMDB top movies

Let's consider a well-known table: the [top 250 movies on IMDB](https://www.imdb.com/chart/top/).

In [80]:
imdb.250 <- read_html("https://www.imdb.com/chart/top/")
imdb.250

{html_document}
<html lang="en-US" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n<div>    <img height="1" width="1" style="display:none;visibility ...

In [124]:
imdb.250 %>% html_elements('li.ipc-metadata-list-summary-item') %>% html_text %>% tibble 

.
<chr>
1. The Shawshank Redemption19942h 22mR9.3 (2.8M)Rate
2. The Godfather19722h 55mR9.2 (2M)Rate
3. The Dark Knight20082h 32mPG-139.0 (2.8M)Rate
4. The Godfather Part II19743h 22mR9.0 (1.3M)Rate
5. 12 Angry Men19571h 36mApproved9.0 (835K)Rate
6. Schindler's List19933h 15mR9.0 (1.4M)Rate
7. The Lord of the Rings: The Return of the King20033h 21mPG-139.0 (1.9M)Rate
8. Pulp Fiction19942h 34mR8.9 (2.2M)Rate
9. The Lord of the Rings: The Fellowship of the Ring20012h 58mPG-138.8 (1.9M)Rate
"10. The Good, the Bad and the Ugly19662h 58mApproved8.8 (791K)Rate"

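In the output above, the title, year, runtime, and rating run together because `html_text` concatenates all the text inside each `<li>`. One way around this is to select the list items first and then pull one child element per item. The sketch below shows the pattern on an invented page; the class names `item`, `title`, `year`, and `rating` are hypothetical and are not IMDB's actual class names, which you would need to look up in the browser's developer tools.

```r
library(rvest)
library(tibble)

# Invented list of movies with hypothetical class names.
page <- read_html("
<html><body><ul>
  <li class='item'><h3 class='title'>1. First Film</h3><span class='year'>1994</span><span class='rating'>9.3</span></li>
  <li class='item'><h3 class='title'>2. Second Film</h3><span class='year'>1972</span><span class='rating'>9.2</span></li>
</ul></body></html>")

# Step 1: one node per movie.
items <- page %>% html_elements("li.item")

# Step 2: html_element() (singular) returns exactly one match per item,
# so the columns stay aligned row by row.
movies <- tibble(
  title  = items %>% html_element(".title")  %>% html_text(),
  year   = items %>% html_element(".year")   %>% html_text() %>% as.integer(),
  rating = items %>% html_element(".rating") %>% html_text() %>% as.numeric()
)

movies
```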

## Super Bowl TV ratings
How have the TV ratings for the Super Bowl changed over the years?

In [91]:
sbtv <- read_html('https://en.wikipedia.org/wiki/Super_Bowl_television_ratings') %>% html_elements('table') %>% .[[1]] %>% html_table

In [92]:
# viewers over time
sbtv

SuperBowl,Date,Network,Avg. viewers(millions),Households,Households,18–49 demographic,18–49 demographic,Avg. cost of 30-second ad,Avg. cost of 30-second ad
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>.1,<chr>,<chr>.1,<chr>,<chr>.1
SuperBowl,Date,Network,Avg. viewers(millions),Rating,Share,Rating,Share,Original,2021 inflation-adjusted[13]
I,"January 15, 1967",CBS,26.75[14],22.6[14],43[14],Un­known,Un­known,"$42,500[14]","$345,551"
I,"January 15, 1967",NBC,24.43[14],18.5[14],36[14],Un­known,Un­known,"$37,500[14]","$304,898"
II,"January 14, 1968",CBS,39.12[14],36.8[14],68[14],Un­known,Un­known,"$54,500[14]","$425,160"
III,"January 12, 1969",NBC,41.66[14],36.0[14],70[14],Un­known,Un­known,"$55,000[14]","$406,952"
IV,"January 11, 1970",CBS,44.27[14],39.4[14],69[14],Un­known,Un­known,"$78,200[14]","$546,434"
V,"January 17, 1971",NBC,46.04[14],39.9[14],75[14],Un­known,Un­known,"$72,500[14]","$486,080"
VI,"January 16, 1972",CBS,56.64[14],44.2[14],74[14],Un­known,Un­known,"$86,100[14]","$558,898"
VII,"January 14, 1973",NBC,53.32[14],42.7[14],72[14],Un­known,Un­known,"$88,100[14]","$538,158"
VIII,"January 13, 1974",CBS,51.70[14],41.6[14],73[14],Un­known,Un­known,"$103,500[14]","$569,544"
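Before plotting viewers over time, the scraped columns need cleaning: every column is text, the first row repeats the header, and many values carry footnote markers such as `[14]`. Here is a minimal sketch of one possible cleanup on a small hand-copied excerpt of the output above. The short column names `viewers` and `ad_cost` are our own, and the approach (strip the bracketed markers, then `parse_number`) is one option among several.

```r
library(dplyr)
library(stringr)
library(readr)

# Small excerpt of the scraped table, typed in by hand, with shortened column names.
raw <- tibble(
  SuperBowl = c("SuperBowl", "I", "II"),
  viewers   = c("Avg. viewers(millions)", "26.75[14]", "39.12[14]"),
  ad_cost   = c("Original", "$42,500[14]", "$54,500[14]")
)

clean <- raw %>%
  slice(-1) %>%   # drop the repeated header row
  mutate(across(
    c(viewers, ad_cost),
    # remove footnote markers like [14], then parse what is left as a number
    ~ parse_number(str_remove_all(.x, "\\[\\d+\\]"))
  ))

clean
```

Removing the footnote markers before parsing keeps the digits inside the brackets from ever being mistaken for part of the value.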


How does this compare with other major sports?

- https://en.wikipedia.org/wiki/World_Series_television_ratings
- https://en.wikipedia.org/wiki/NBA_Finals_television_ratings

In [None]:
# super bowl vs world series