---
author: "Юрій Клебан"
---

# Web-pages (HTML)

Sometimes decision making requires scraping data from web sources and pages.

Let's try to parse a table from Wikipedia.



In [170]:
#install.packages("rvest")
library(rvest) # Parsing of HTML/XML files

Open the web page https://en.wikipedia.org/wiki/List_of_largest_banks in a browser and inspect it.

In [171]:
# store the page URL (the commented line below points to a local copy)
url <- "https://en.wikipedia.org/wiki/List_of_largest_banks"
#url <- "data/List of largest banks - Wikipedia_.html"

In [172]:
# read html content of the page
page <- read_html(url)
page

{html_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...
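
`read_html()` also accepts a literal HTML string, which is handy for experimenting without a network connection. Below is a minimal sketch, assuming `rvest` is installed; the HTML snippet and the names `demo_html`, `demo_page`, and `demo_title` are invented for illustration and are not taken from the Wikipedia page.

```r
library(rvest)

# A tiny, made-up HTML document (not the Wikipedia page)
demo_html <- "<html><head><title>Demo</title></head>
<body><h1>Largest banks</h1><p>Example paragraph.</p></body></html>"

# read_html() parses a string of markup the same way it parses a URL or file
demo_page <- read_html(demo_html)

# Extract the text of the first <h1> element
demo_title <- html_text(html_node(demo_page, "h1"))
demo_title
```

The same calls should work unchanged when `demo_html` is replaced by a URL or a path to a saved page.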

In [173]:
# read all tables on the page
tables <- html_nodes(page, "table")
tables

{xml_nodeset (4)}
[1] <table class="box-Missing_information plainlinks metadata ambox ambox-con ...
[2] <table class="wikitable sortable mw-collapsible"><tbody>\n<tr>\n<th data- ...
[3] <table class="wikitable sortable mw-collapsible">\n<caption>Number of ban ...
[4] <table class="wikitable sortable mw-collapsible"><tbody>\n<tr>\n<th data- ...
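
The selector `"table"` matches every table, including the notice box in position 1. A CSS class selector can narrow the match. The sketch below runs on invented HTML and assumes the data tables carry the `wikitable` class, as the output above suggests; `all_tables` and `data_tables` are hypothetical names.

```r
library(rvest)

# Made-up page with one notice box and two data tables
demo_page <- read_html('<html><body>
<table class="ambox"><tr><td>Notice</td></tr></table>
<table class="wikitable"><tr><th>Rank</th><th>Bank</th></tr>
<tr><td>1</td><td>Bank A</td></tr></table>
<table class="wikitable"><tr><th>Country</th><th>Number</th></tr>
<tr><td>X</td><td>2</td></tr></table>
</body></html>')

# Every <table> element on the page
all_tables <- html_nodes(demo_page, "table")

# Only <table> elements whose class attribute includes "wikitable"
data_tables <- html_nodes(demo_page, "table.wikitable")

length(all_tables)
length(data_tables)
```

Selecting by class is less sensitive to the page gaining or losing a notice box than selecting by position, although the class names themselves can also change.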

For now, let's read the table of total assets (US$ billion), which is the second table on the page:

In [174]:
# with the pipe operator:
# tables[2] %>%
#     html_table(fill = TRUE) %>%
#     as.data.frame()
# without the pipe operator:
assets_table <- as.data.frame(html_table(tables[2], fill = TRUE))
head(assets_table)

| | Rank | Bank.name | Total.assets.2020..US..billion. |
|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` |
| 1 | 1 | Industrial and Commercial Bank of China | 5518.0 |
| 2 | 2 | China Construction Bank | 4400.0 |
| 3 | 3 | Agricultural Bank of China | 4300.0 |
| 4 | 4 | Bank of China | 4200.0 |
| 5 | 5 | JPMorgan Chase | 3831.65 |
| 6 | 6 | Mitsubishi UFJ Financial Group | 3175.21 |
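
Note that the total assets column is read as `<chr>`, which usually means at least one cell is not a clean number. Below is a hedged sketch of converting such a column to numeric. The sample values, including the thousands separator and the footnote marker, are invented, because the output above does not show which cells caused the `<chr>` type.

```r
# Invented sample values; the real column may contain different debris
assets_chr <- c("5,518.0", "4400.0", "3831.65[1]")

# Remove footnote markers such as [1]
assets_clean <- gsub("\\[[0-9]+\\]", "", assets_chr)

# Remove thousands separators
assets_clean <- gsub(",", "", assets_clean, fixed = TRUE)

# Convert the cleaned strings to numbers
assets_num <- as.numeric(assets_clean)
assets_num
```

If `as.numeric()` still returns `NA` with a warning, inspecting the offending cells before cleaning further is safer than guessing at the pattern.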


Next, read the market capitalization table (the fourth one):

In [175]:
capital_table <- as.data.frame(html_table(tables[4], fill = TRUE))   
head(capital_table)

| | Rank | Bank.name | Market.cap.US..billion. |
|---|---|---|---|
| | `<int>` | `<chr>` | `<dbl>` |
| 1 | 1 | JPMorgan Chase | 368.78 |
| 2 | 2 | Industrial and Commercial Bank of China | 295.65 |
| 3 | 3 | Bank of America | 279.73 |
| 4 | 4 | Wells Fargo | 214.34 |
| 5 | 5 | China Construction Bank | 207.98 |
| 6 | 6 | Agricultural Bank of China | 181.49 |


And now let's `merge()` these two datasets:

In [176]:
merged_data <- merge(assets_table, capital_table, by = "Bank.name")
head(merged_data)

| | Bank.name | Rank.x | Total.assets.2020..US..billion. | Rank.y | Market.cap.US..billion. |
|---|---|---|---|---|---|
| | `<chr>` | `<int>` | `<chr>` | `<int>` | `<dbl>` |
| 1 | Agricultural Bank of China | 3 | 4300.0 | 6 | 181.49 |
| 2 | Australia and New Zealand Banking Group | 48 | 661.72 | 26 | 54.88 |
| 3 | Banco Bilbao Vizcaya Argentaria | 42 | 782.16 | 37 | 37.42 |
| 4 | Banco Bradesco | 79 | 345.21 | 18 | 74.67 |
| 5 | Banco Santander | 16 | 1702.61 | 17 | 75.47 |
| 6 | Bank of America | 8 | 2434.08 | 3 | 279.73 |
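
By default `merge()` performs an inner join, so banks that appear in only one table are dropped, and columns that share a name (here `Rank`) receive `.x` and `.y` suffixes. The sketch below uses a few rows from the outputs above. `all = TRUE` and `suffixes` are standard `merge()` arguments, while the data frame names are invented for illustration.

```r
# A few rows taken from the assets and market capitalization outputs
assets <- data.frame(
  Rank = c(1, 2, 5),
  Bank.name = c("Industrial and Commercial Bank of China",
                "China Construction Bank", "JPMorgan Chase"),
  stringsAsFactors = FALSE
)
capital <- data.frame(
  Rank = c(1, 2, 3),
  Bank.name = c("JPMorgan Chase",
                "Industrial and Commercial Bank of China",
                "Bank of America"),
  stringsAsFactors = FALSE
)

# Inner join: only banks present in both tables survive
inner <- merge(assets, capital, by = "Bank.name")

# Full outer join with descriptive suffixes instead of .x / .y
outer <- merge(assets, capital, by = "Bank.name", all = TRUE,
               suffixes = c(".assets", ".cap"))

nrow(inner)
names(outer)
```

In the outer join, a bank missing from one table gets `NA` in that table's columns, which makes unmatched names easy to spot.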


## Task 3

From the page https://en.wikipedia.org/wiki/List_of_largest_banks, read the two tables named below and merge them by country:

- [x] Number of banks in the top 100 by total assets
- [x] Total market capital (US$ billion) across the top 70 banks by country

**Solution**

In [178]:
library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_largest_banks" # open this URL in another browser tab
#url <- "data/List of largest banks - Wikipedia_.html"
page_data <- read_html(url) # read the HTML content

tables <- html_nodes(page_data, "table")
html_table(tables[1]) # this table is not needed

| X1 | X2 |
|---|---|
| `<lgl>` | `<chr>` |
| | This article is missing information about Revenue and Employment. Please expand the article to include this information. Further details may exist on the talk page. (September 2020) |


In [179]:
html_table(tables[3]) # this is the table "Number of banks in the top 100 by total assets"
# check the end of the table: it contains NA records
# let's remove them

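| Rank | Country | Number |
|---|---|---|
| `<int>` | `<chr>` | `<int>` |
| 1 | China | 19.0 |
| 2 | United States | 11.0 |
| 3 | Japan | 8.0 |
| 4 | United Kingdom | 6.0 |
| 4 | France | 6.0 |
| 4 | South Korea | 6.0 |
| 5 | Canada | 5.0 |
| 5 | Germany | 5.0 |
| 6 | Australia | 4.0 |
| 6 | Brazil | 4.0 |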


In [180]:
table1 <- as.data.frame(html_table(tables[3]))
table1 <- table1[!is.na(table1$Country), ]
table1 # now it is OK!

| | Rank | Country | Number |
|---|---|---|---|
| | `<int>` | `<chr>` | `<int>` |
| 1 | 1 | China | 19 |
| 2 | 2 | United States | 11 |
| 3 | 3 | Japan | 8 |
| 4 | 4 | United Kingdom | 6 |
| 5 | 4 | France | 6 |
| 6 | 4 | South Korea | 6 |
| 7 | 5 | Canada | 5 |
| 8 | 5 | Germany | 5 |
| 9 | 6 | Australia | 4 |
| 10 | 6 | Brazil | 4 |
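
The filter `!is.na(table1$Country)` keeps only the rows whose `Country` value is present. Below is a self-contained sketch on a toy data frame; the first two rows come from the output above, and the trailing all-`NA` row is invented to mimic the kind of record being removed.

```r
# Toy data: two real rows plus one invented all-NA row
toy <- data.frame(
  Rank = c(1L, 2L, NA),
  Country = c("China", "United States", NA),
  Number = c(19L, 11L, NA),
  stringsAsFactors = FALSE
)

# Drop rows where one specific column is missing
by_column <- toy[!is.na(toy$Country), ]

# Drop rows where any column is missing
by_row <- toy[complete.cases(toy), ]

by_column
```

Filtering on a single column is the more conservative choice when other columns may legitimately contain missing values.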


In [183]:
# Solution attempt for "Total market capital (US$ billion) across the top 70 banks by country"
# note: tables[4] lists market capitalization by bank, so compare it with the table on the page
table2 <- as.data.frame(html_table(tables[4]))
table2 # inspect the result

| Rank | Bank.name | Market.cap.US..billion. |
|---|---|---|
| `<int>` | `<chr>` | `<dbl>` |
| 1 | JPMorgan Chase | 368.78 |
| 2 | Industrial and Commercial Bank of China | 295.65 |
| 3 | Bank of America | 279.73 |
| 4 | Wells Fargo | 214.34 |
| 5 | China Construction Bank | 207.98 |
| 6 | Agricultural Bank of China | 181.49 |
| 7 | HSBC Holdings PLC | 169.47 |
| 8 | Citigroup Inc. | 163.58 |
| 9 | Bank of China | 151.15 |
| 10 | China Merchants Bank | 133.37 |


---

## Datasets

1. https://github.com/kleban/r-book-published/tree/main/datasets/telecom_users.csv
2. https://github.com/kleban/r-book-published/tree/main/datasets/telecom_sers.xlsx
3. https://github.com/kleban/r-book-published/tree/main/datasets/Default_Fin.csv
4. https://github.com/kleban/r-book-published/tree/main/datasets/employes.xml
