<center>
    <img src="https://fpt.edu.vn/Resources/brand/uploads/749540_132829686029858301_o.jpg" width="500" alt="cognitiveclass.ai logo"  />
</center>

# Lab 5: Web Scraping

<br>

#### Class name: AI1706

#### Student code: HE170707

#### Student name: Phạm Thế Hưng

<br>

## Objectives

After completing this lab you will be able to:

* Understand HTML through coding practice
* Handle HTTP requests and responses using R
* Perform basic web scraping using rvest


Estimated time needed: **60** minutes
<h4 style='color:red; font-weight:bold'>DO NOT CHEAT! Anyone who copies or shares code will receive 1 point.</h4>

<a id="ref0"></a>

<h2 id="http">Overview of HTTP</h2>

When you open a web page, your browser (the **client**) sends an **HTTP** request to the **server** where the page is hosted. The server tries to find the requested **resource**, such as the home page (index.html).

If your request is successful, the server will send the resource to the client in an **HTTP response**; this includes information like the type of the **resource**, the length of the **resource**, and other information.   
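
To make the parts of a response concrete, here is a minimal sketch in base R that splits a raw HTTP response into its status code, headers, and body. The response text is hand-written for illustration (it was not captured from a real server), and `parse_raw_response` is a hypothetical helper name.

```r
# A hand-written HTTP response, for illustration only (not real server output).
raw_response <- c(
  "HTTP/1.1 200 OK",
  "Content-Type: text/html; charset=UTF-8",
  "Content-Length: 40",
  "",
  "<html><body><h1>Hello</h1></body></html>"
)

# Hypothetical helper: split a raw response into status code, headers and body.
parse_raw_response <- function(lines) {
  blank <- which(lines == "")[1]  # the first empty line ends the header section
  status_parts <- strsplit(lines[1], " ", fixed = TRUE)[[1]]
  header_lines <- lines[2:(blank - 1)]
  header_names <- sub(":.*$", "", header_lines)
  header_values <- trimws(sub("^[^:]*:", "", header_lines))
  list(
    status_code = as.integer(status_parts[2]),
    headers = setNames(as.list(header_values), header_names),
    body = paste(lines[-seq_len(blank)], collapse = "\n")
  )
}

parsed <- parse_raw_response(raw_response)
print(parsed$status_code)                  # the status of the request
print(parsed$headers[["Content-Type"]])    # the type of the resource
print(parsed$headers[["Content-Length"]])  # the length of the resource in bytes
```

In practice a library such as `httr` does this parsing for you; the sketch only shows where the resource type and length come from.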

<p>
The figure below represents the process; the circle on the left represents the client, and the circle on the right represents the web server. The table under the web server represents a list of resources stored on the web server, in this case an <code>HTML</code> file, a <code>png</code> image, and a <code>txt</code> file.
</p>
<p>
The <b>HTTP</b> protocol allows you to send and receive information through the web including webpages, images, and other web resources.
</p>



<h2 id="#httr">The httr library</h2>

`httr` is an R library that allows you to build and send <code>HTTP</code> requests, as well as process <code>HTTP</code> responses easily. We can load the package as follows (this may take less than a minute):

In [2]:
# This lab requires some packages. If an error occurs when running the next cell, uncomment the lines below to install them:
# install.packages("httr")
# install.packages("rvest")


In [3]:
library(httr)
library(rvest)
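
Building a request starts with its URL. As a small sketch, `httr` can take a URL apart and put it back together without sending anything over the network. The URL is one of the data sources listed in this lab; the query parameter `q = "sjc"` is made up for illustration, and the site may not support it.

```r
library(httr)

# One of the data sources listed in this lab.
url <- "https://tygiadola.net/giavang/gia-vang-hom-nay"

# parse_url() splits the URL into components; no request is sent.
parts <- parse_url(url)
print(parts$scheme)
print(parts$hostname)

# modify_url() rebuilds the URL with changed components.
# The query parameter below is hypothetical, for illustration only.
new_url <- modify_url(url, query = list(q = "sjc"))
print(new_url)
```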

## 1. Example code

In [127]:
url <- 'https://fap.fpt.edu.vn/'
# GET() has no encodeString argument; read_html() handles the encoding later
response <- GET(url)

print(sprintf("Time: %s", response$date))
print(sprintf("URL link: %s", response$url))
print(sprintf("Status code: %d", response$status_code))

[1] "Time: 2023-09-21 09:54:37"
[1] "URL link: https://fap.fpt.edu.vn/"
[1] "Status code: 200"
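
A status code of 200 means the request succeeded, but other codes are common when scraping. The sketch below uses `http_status()` from `httr`, which also accepts a bare status code, so it runs without a network connection. `is_success` is a hypothetical helper name.

```r
library(httr)

# http_status() translates a status code into a category, reason and message.
for (code in c(200L, 404L, 500L)) {
  print(http_status(code)$message)
}

# Hypothetical helper: TRUE only when the code belongs to the "Success" category.
is_success <- function(code) {
  http_status(code)$category == "Success"
}

print(is_success(200L))
print(is_success(404L))
```

With a real response object, `http_error(response)` reports whether the request failed, and `stop_for_status(response)` turns a failed request into an R error, which is one way to stop before parsing a page that was never delivered.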


In [128]:
root <- read_html(response)
options_node <- html_nodes(root, "option")
values <- c()
print("List of FPT University campuses: ")
for(node in options_node){
    v <- as.integer(html_attr(node, "value"))
    if(!is.na(v) && !(v %in% values)){
        values <- c(values, v)
        print(html_text(node))
    }
}

[1] "List of FPT University campuses: "
[1] "FU-Hòa Lạc"
[1] "FU-Hồ Chí Minh"
[1] "FU-Đà Nẵng"
[1] "FU-Cần Thơ"
[1] "FU-Quy Nhơn"
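
The loop above can also be written without explicit iteration, because `html_attr()` and `html_text()` work on a whole node set at once. The following is a sketch on a hand-written HTML fragment; the markup, values, and campus labels are invented for illustration and may differ from the real page.

```r
library(rvest)

# Hand-written HTML for illustration; the real page markup may differ.
page <- read_html('
  <select>
    <option value="">Select campus</option>
    <option value="3">Campus A</option>
    <option value="4">Campus B</option>
    <option value="3">Campus A</option>
  </select>')

options_node <- html_nodes(page, "option")

# Both calls return one element per <option> node.
values <- suppressWarnings(as.integer(html_attr(options_node, "value")))
labels <- html_text(options_node, trim = TRUE)

# Keep options with a numeric value, and only the first occurrence of each value.
keep <- !is.na(values) & !duplicated(values)
campuses <- data.frame(value = values[keep], label = labels[keep])
print(campuses)
```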


## 2. Data source
Adapt the example code above by changing the URL to one of the following sources:

* https://webtygia.com/

* https://giavang.org/

* https://tygiadola.net/giavang/gia-vang-hom-nay

* https://nongnghiep.vn/bang-gia-vang-sjc-9999-24k-18k-14k-10k-hom-nay-24-10-2022-d335344.html

or any other URL that you can find!


## 3. Tasks

#### 3.1 Getting the data

Use web scraping to collect SJC gold prices for major cities and provinces in Vietnam. The data should contain more than 10 records. Display the data as a table.

In [134]:
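# Enter code here
url <- "https://giavang.org/"
# html_table() returns a list with one table per <table> node on the page
page <- read_html(url) %>% html_nodes("table") %>% html_table(fill = TRUE)
# Assumption: the first table holds the prices. check.names = TRUE turns
# headers such as "Bán ra" into syntactic names such as "Bán.ra".
data <- data.frame(page[[1]], check.names = TRUE)
# Drop the last row of the scraped table
df <- data[1:(nrow(data) - 1), ]
# Convert the price columns to numbers (Bán.ra = selling, Mua.vào = buying)
df[['Bán.ra']] <- as.numeric(df[['Bán.ra']])
df[['Mua.vào']] <- as.numeric(df[['Mua.vào']])
# Keep only the SJC rows (Hệ.thống = brand, Khu.vực = region)
df <- df[df[["Hệ.thống"]] == "SJC", ]
df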

|    | Khu.vực         | Hệ.thống | Mua.vào | Bán.ra  |
| -- | --------------- | -------- | ------- | ------- |
|    | `<chr>`         | `<chr>`  | `<dbl>` | `<dbl>` |
| 1  | TP. Hồ Chí Minh | SJC      | 68.35   | 69.05   |
| 4  | Hà Nội          | SJC      | 68.35   | 69.07   |
| 8  | Đà Nẵng         | SJC      | 68.35   | 69.07   |
| 10 | Nha Trang       | SJC      | 68.35   | 69.07   |
| 11 | Cà Mau          | SJC      | 68.35   | 69.07   |
| 12 | Huế             | SJC      | 68.32   | 66.83   |
| 13 | Miền Tây        | SJC      | 66.65   | 69.07   |
| 14 | Biên Hòa        | SJC      | 68.35   | 69.05   |
| 15 | Quảng Ngãi      | SJC      | 68.35   | 69.05   |
| 16 | Long Xuyên      | SJC      | 68.35   | 69.05   |
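
The scraping and cleaning steps can be tried offline on a small hand-written table. This is only a sketch: the HTML, the English column names, and the values are invented for illustration, and real pages usually need site-specific cleaning.

```r
library(rvest)

# Hand-written HTML for illustration; not taken from any of the listed sites.
html <- '
<table>
  <tr><th>Region</th><th>Brand</th><th>Buy</th><th>Sell</th></tr>
  <tr><td>City A</td><td>SJC</td><td>68.35</td><td>69.05</td></tr>
  <tr><td>City B</td><td>Other</td><td>68.10</td><td>68.90</td></tr>
  <tr><td>City C</td><td>SJC</td><td>n/a</td><td>69.07</td></tr>
</table>'

# html_table() on a document returns a list with one entry per <table>.
tables <- html_table(read_html(html))
prices <- as.data.frame(tables[[1]])

# Force the price columns to numeric; unparsable cells such as "n/a" become NA.
prices$Buy  <- suppressWarnings(as.numeric(as.character(prices$Buy)))
prices$Sell <- suppressWarnings(as.numeric(as.character(prices$Sell)))

# Keep SJC rows that have a usable buying price.
sjc <- prices[prices$Brand == "SJC" & !is.na(prices$Buy), ]
print(sjc)
```

Checking for `NA` after `as.numeric()` is worthwhile, because a single unparsable cell otherwise propagates into later results such as `mean()`.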


#### 3.2 Which province has the highest gold selling price?

In [130]:
# Enter code here
max_selling <- which.max(df[["Bán.ra"]])
max_selling_province <- df[max_selling, ][["Khu.vực"]]
print(paste("Province with highest gold selling price is: ", max_selling_province))

[1] "Province with highest gold selling price is:  Bạc Liêu"
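
Note that `which.max()` returns only the first position of the maximum, so tied regions are silently dropped. The sketch below rebuilds the table shown in section 3.1 by hand (with hypothetical English column names) and compares the two approaches; the same idea applies to the buy/sell difference.

```r
# Values copied from the table shown in section 3.1; column names are hypothetical.
sjc <- data.frame(
  region = c("TP. Hồ Chí Minh", "Hà Nội", "Đà Nẵng", "Nha Trang", "Cà Mau",
             "Huế", "Miền Tây", "Biên Hòa", "Quảng Ngãi", "Long Xuyên"),
  buy  = c(68.35, 68.35, 68.35, 68.35, 68.35, 68.32, 66.65, 68.35, 68.35, 68.35),
  sell = c(69.05, 69.07, 69.07, 69.07, 69.07, 66.83, 69.07, 69.05, 69.05, 69.05)
)

# which.max() gives the first region with the highest selling price only.
first_max <- sjc$region[which.max(sjc$sell)]
print(first_max)

# Comparing against max() keeps every tied region.
all_max <- sjc$region[sjc$sell == max(sjc$sell)]
print(all_max)

# Same idea for the buy/sell difference, with a small tolerance because the
# differences are computed in floating point.
spread <- sjc$sell - sjc$buy
widest <- sjc$region[abs(spread - max(spread)) < 1e-9]
print(widest)
```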


#### 3.3 Which provinces have the biggest difference in selling and buying prices?

In [131]:
# Enter code here
index <- which.max(df[["Bán.ra"]] - df[["Mua.vào"]])
provinces_with_max_difference <- df[index, ][["Khu.vực"]]
print(paste("Province with biggest difference in selling and buying prices: ", provinces_with_max_difference))

[1] "Province with biggest difference in selling and buying prices:  Miền Tây"


#### 3.4 Find all provinces with a selling price below the average

In [132]:
# Enter code here
province_below_average <- df[df[["Bán.ra"]] < mean(df[["Bán.ra"]]), ][["Khu.vực"]]
print(paste("The province has selling price below average:", province_below_average))

[1] "The province has selling price below average: Huế"
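
Two details are worth handling when computing a below-average list from scraped data: missing values, and printing several names on one line. The following is a sketch with made-up values, where `NA` stands for a cell that could not be converted to a number.

```r
# Hypothetical selling prices; NA stands for a cell that failed to parse.
sell <- c(A = 69.05, B = 69.07, C = NA, D = 66.83)

# Without na.rm = TRUE a single NA makes the mean NA.
print(mean(sell))

avg <- mean(sell, na.rm = TRUE)

# Exclude NA explicitly so it does not appear in the result.
below <- names(sell)[!is.na(sell) & sell < avg]

# collapse = ", " joins all names into a single string.
print(paste("Below average:", paste(below, collapse = ", ")))
```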


#### 3.5 Find the difference between the highest buying price and the lowest selling price across all provinces

In [133]:
# Enter code here
difference <- max(df[["Mua.vào"]]) - min(df[["Bán.ra"]])
print(paste("Difference between highest buying price and lowest selling price:", difference))

[1] "Difference between highest buying price and lowest selling price: 1.54000000000001"
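
The long tail of digits in the output comes from binary floating-point arithmetic: `paste()` converts numbers with up to 15 significant digits. A minimal sketch of formatting the result to two decimals, using the highest buying price and lowest selling price from the table shown in section 3.1:

```r
# Values taken from the table shown in section 3.1.
highest_buy <- 68.35
lowest_sell <- 66.83

difference <- highest_buy - lowest_sell

# round() returns a number; sprintf() returns text with a fixed number of decimals.
rounded <- round(difference, 2)
label <- sprintf("%.2f", difference)

print(rounded)
print(paste("Difference:", label))
```

Rounding only for display keeps the full precision available for any further calculation.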


## Author

#### <a href="" target="_blank">Do Thai Giang</a>

## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                 |
| ----------------- | ------- | ---------- | ---------------------------------- |
| 2022-10-24        | 1.0     | Giangdt26  | Create the 1st version             |

<hr>

## <h3 align="center"> © FPT University. All rights reserved. </h3>
