# R Programming Basics for Data Science: edX course MODULE 4 PART - 2

### **4.5 HTTP Request and REST API**

**Make a GET request using httr**

In [None]:
library(httr)

In [None]:
response <- GET("https://www.ibm.com")

In [None]:
response

**Make a POST request using httr**

In [None]:
body <- list(course_name = "Introduction to R", instructor = "Yan")

In [None]:
response <- POST("http://httpbin.org/post", body = body)

In [None]:
response

### **4.6 Hands-on Lab: HTTP Requests in R**

**HTTP Requests in R**

Estimated time needed: 30 minutes

**Objectives**

After completing this lab you will be able to:

* Understand HTTP
* Handle the HTTP Requests and response using R

**Overview of HTTP**

When you, the client, open a web page, your browser sends an HTTP request to the server where the page is hosted. The server tries to find the desired resource, such as the home page (index.html).

If your request is successful, the server will send the resource to the client in an HTTP response; this includes information like the type of the resource, the length of the resource, and other information.

The figure below represents the process: the circle on the left represents the client, and the circle on the right represents the web server. The table under the web server represents a list of resources stored on the web server, in this case an HTML file, a PNG image, and a txt file.

The HTTP protocol allows you to send and receive information through the web including webpages, images, and other web resources.

**Uniform Resource Locator: URL**

A uniform resource locator (URL) is the most popular way to find resources on the web. We can break the URL into four parts.

* Scheme: the protocol; for this lab it will always be http:// (or its secure variant, https://)
* Internet address or base URL: used to find the location; some examples are www.ibm.com and www.gitlab.com
* Route: the location on the web server, for example /images/IDSNlogo.png
* URL parameters: parameters included in a URL, for example ?userid=1

You may also hear the term uniform resource identifier (URI); URLs are actually a subset of URIs. Another popular term is endpoint, which is the URL of an operation provided by a web server.
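As a rough illustration of the URL parts listed above, the sketch below uses `httr::parse_url()` to split a URL into its components. The URL is a hypothetical one assembled from the examples in the list; it is not assumed to point to a real resource, and no request is sent.

```r
library(httr)

# Hypothetical URL assembled from the example parts above (assumption)
example_url <- "http://www.ibm.com/images/IDSNlogo.png?userid=1"

# parse_url() splits the URL locally; no request is sent
parts <- parse_url(example_url)

parts$scheme    # the protocol
parts$hostname  # the Internet address (base URL)
parts$path      # the route on the web server (stored without the leading slash)
parts$query     # the URL parameters as a named list
```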

**Request**

The process can be broken into a request and a response.

The request using the GET method is partially illustrated below. The start line contains the GET method (an HTTP method), the location of the resource (/index.html), and the HTTP version.

The request header passes additional information with an HTTP request.

When an HTTP request is made, an HTTP method is sent; this tells the server what action to perform.

A list of several HTTP methods is shown below.

**Response**

The figure below represents the response; the response start line contains the version number HTTP/1.0, a status code (200) meaning success, followed by a descriptive phrase (OK).

The response header contains useful meta information.

Finally, we have the response body containing the requested file, an HTML document. It should be noted that some requests have headers.

Some status code examples are shown in the table below. The prefix indicates the class; these are shown in yellow, with actual status codes shown in white. Check out the following link for more descriptions.
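To make the idea of status code classes more concrete, here is a small sketch using `httr::http_status()`, which accepts a numeric status code and returns its category, reason phrase, and a combined message. The helper `status_class()` is a hypothetical function written for this illustration; it is not part of `httr`.

```r
library(httr)

# http_status() describes a numeric status code without sending a request
ok        <- http_status(200)
not_found <- http_status(404)

ok$category          # the class of the status code
ok$reason            # the descriptive phrase
not_found$category
not_found$message

# Hypothetical helper: the class is given by the first digit of the code
status_class <- function(code) code %/% 100
status_class(404)
```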

**The `httr` library**

`httr` is an R library that allows you to build and send HTTP requests, as well as process HTTP responses easily. We can import the package as follows (it may take less than a minute to import):

In [None]:
library(httr)

You can make a GET request to www.ibm.com using the `GET()` function:

In [None]:
url <- 'https://www.ibm.com/'
response <- GET(url)
response

The response object `response` holds information about the response, such as the status of the request. We can view the status code using the element `status_code`:

In [None]:
response$status_code

You can also check the headers of the response

In [None]:
response_headers <- headers(response)
response_headers

We can obtain the date the request was sent using the key `date`:

In [None]:
response_headers['date']

Content-Type indicates the type of data:

In [None]:
response_headers['content-type']
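The lookups above work with lower-case keys because `httr` stores response headers in a list with case-insensitive names. The sketch below reproduces that behaviour offline with `httr::insensitive()`. The header values are made up for illustration and are not real output from www.ibm.com.

```r
library(httr)

# Made-up header values for illustration only (assumption)
demo_headers <- insensitive(list(
  `Content-Type` = "text/html; charset=UTF-8",
  `Content-Length` = "1234"
))

# Names are matched regardless of case
demo_headers[["content-type"]]
demo_headers[["CONTENT-TYPE"]]

# Single brackets keep the list wrapper; double brackets return the value itself
is.list(demo_headers["content-length"])
as.integer(demo_headers[["content-length"]])
```

Using `[[ ]]` is usually more convenient when the value is needed for further processing, for example to convert the content length to a number.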

To see the headers of the original request, you can access them through the response object:

In [None]:
response$request$headers

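Coding Exercise: In the code cell below, find the content-length attribute in the response header.

In [None]:
# my solution: [[ ]] returns the header value itself
# (length(response_headers) would only count the headers)

response_headers[['content-length']]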

In [None]:
# course solution

response_headers['content-length']

Now, let's get the content of the HTTP response:

In [None]:
content(response, encoding = "UTF-8")

This is the IBM home page (an HTML page, a format you will learn about later in this course).

You can also request non-text data such as images. Consider the URL of the following image:

In [None]:
image_url <- 'https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png'

We can make a GET request:

In [None]:
image_response <- GET(image_url)

We can look at the response header:

In [None]:
image_headers <- headers(image_response)

We can see the 'Content-Type', which is an image type:

In [None]:
image_headers['content-type']

The response body contains the image as raw bytes, so we must save it to a file. We extract the bytes with `content()` and write them with `writeBin()`, specifying the file name:

In [None]:
image <- content(image_response, "raw")
writeBin(image, "logo.png")

You should then be able to find logo.png in the file explorer on the left.
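The raw-bytes round trip can be sketched without a network connection using base R only. The example below writes the eight-byte PNG signature to a temporary file with `writeBin()` and reads it back with `readBin()`. The temporary file stands in for a downloaded image and is an assumption of this sketch.

```r
# The first eight bytes of every PNG file (the PNG signature)
png_signature <- as.raw(c(0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a))

# Write the raw bytes to a temporary file, just as writeBin() saves the image
tmp_path <- tempfile(fileext = ".png")
writeBin(png_signature, tmp_path)

# Read the bytes back and confirm that nothing changed
bytes_read <- readBin(tmp_path, what = "raw", n = file.size(tmp_path))
identical(bytes_read, png_signature)

# The same check could be applied to the first 8 bytes of a downloaded PNG:
# identical(readBin("logo.png", "raw", n = 8), png_signature)
```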

Coding Exercise: In the code cell below, find another image URL and use the code above to request and download the image.

In [None]:
# my_solution

url_image <- 'https://upload.wikimedia.org/wikipedia/commons/1/1b/R_logo.svg'

In [None]:
response_image <- GET(url_image)

In [None]:
headers_image <- headers(response_image)

In [None]:
headers_image['content-type']

In [None]:
imageR <- content(response_image, "raw")
# The URL points to an SVG file, so save it with the .svg extension
writeBin(imageR, "R_logo.svg")

**Get Request with URL Parameters**

You can also add URL parameters to an HTTP GET request to filter resources. For example, instead of returning all users from an API, we may only want the user with id 1. To do so, we can add a URL parameter like userid = 1 to the GET request.

Let's see a GET example with URL parameters.

Suppose we have a simple GET API with the base URL http://httpbin.org/:

In [None]:
url_get <- 'http://httpbin.org/get'

We want to add some URL parameters to the GET API above. To do so, we simply create a named list with parameter names and values:

In [None]:
query_params <- list(name = "Yan", ID = "123")

Then we pass the list `query_params` to the `query` argument of the `GET()` function.

This tells the GET API that we only want resources whose name equals Yan and whose ID equals 123.

Let's make the GET request to 'http://httpbin.org/get' with the two parameters:

In [None]:
response <- GET(url_get, query=query_params)

We can print out the updated URL and see the attached URL parameters.

In [None]:
response$request$url

After the base URL http://httpbin.org/get, you can see the URL parameters name=Yan&ID=123, introduced by ? and separated from each other by &.

The `args` element of the response content holds the parameter names and values:

In [None]:
content(response)$args
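To see how the named list becomes a query string without sending a request, the following sketch builds the same URL locally with `httr::modify_url()` and parses it back with `httr::parse_url()`. It assumes only the `url_get` and `query_params` values defined above.

```r
library(httr)

url_get <- "http://httpbin.org/get"
query_params <- list(name = "Yan", ID = "123")

# Build the full URL locally; no request is sent
full_url <- modify_url(url_get, query = query_params)
full_url

# Parsing the URL recovers the parameters as a named list
parse_url(full_url)$query
```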

**POST Requests**

Like a GET request, a POST request can send data to a server, but a POST request sends the data in the request body. To send a POST request in R, we change the route in the URL to post:

In [None]:
url_post <- 'http://httpbin.org/post'

This endpoint expects data as a file or as a form. A form is a convenient way to configure an HTTP request to send data to a server.

To make a `POST` request we use the `POST()` function; the list `body` is passed to the parameter `body`:

In [None]:
body <- list(course_name='Introduction to R', instructor='Yan')
response <- POST('http://httpbin.org/post', body = body)
response

We can see that the POST request has a body, stored in the `fields` element of the request:

In [None]:
response$request$fields
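`POST()` also has an `encode` argument that controls how the body list is sent ("multipart" by default, or "form", "json", "raw"). The sketch below tries the form and JSON encodings against the same httpbin endpoint. The wrapper `post_as()` is a hypothetical helper written for this illustration; it returns `NULL` when the request fails, for example when no network connection is available.

```r
library(httr)

body <- list(course_name = "Introduction to R", instructor = "Yan")

# Hypothetical helper: send the body with a given encoding,
# returning NULL instead of an error if the request cannot be made
post_as <- function(encode) {
  tryCatch(
    POST("http://httpbin.org/post", body = body, encode = encode, timeout(10)),
    error = function(e) NULL
  )
}

form_response <- post_as("form")
json_response <- post_as("json")

# httpbin echoes a form body under "form" and a JSON body under "json"
if (!is.null(form_response) && status_code(form_response) == 200) {
  print(content(form_response)$form)
}
if (!is.null(json_response) && status_code(json_response) == 200) {
  print(content(json_response)$json)
}
```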

There is a lot more you can do; check out the `httr` documentation.

### **4.7 Web Scraping in R**

In [None]:
library(rvest)

**Read HTML from character variable**

In [None]:
simple_html <- "<html>
       <body>
         <p> This is a html page</p>
       </body>
       </html>"

root_node <- read_html(simple_html)
root_node

**Read HTML from a URL**

In [None]:
root_node_ibm1 <- read_html("https://www.ibm.com/us-en/")
root_node_ibm1

**Download an HTML file**

In [None]:
download.file("https://www.ibm.com/", destfile = "ibm.html")
root_node_ibm2 <- read_html("ibm.html")
root_node_ibm2

**Extract node content**

In [None]:
root_node <- read_html("<html>
       <body>
         <p> This is a html page</p>
       </body>
</html>")

body_node <- html_node(root_node, "body")
p_node <- html_node(root_node, "p")
p_content <- html_text(p_node)

p_content

**Extract data table from HTML page**

In [None]:
download.file("https://en.wikipedia.org/wiki/Color", destfile = "color.html")

root_node_color <- read_html("color.html")
table_node <- html_node(root_node_color, "table")
color_data_frame <- html_table(table_node)
color_data_frame

### **4.8 Hands-on Lab: Webscraping in R**

**Webscraping in R**

Estimated time needed: 15 minutes

**Objectives**

After completing this lab you will be able to:

* Understand HTML via coding practice
* Perform basic webscraping using rvest

**Overview of HTML**

HTML stands for Hypertext Markup Language and it is used mainly for writing web pages.

An HTML page consists of many organized HTML nodes or elements that tell a browser how to render its content.

Each node or element has a start tag and an end tag with the same name and wraps some textual content.

One key feature of HTML is that nodes can be nested within other nodes, organizing into a tree-like structure like the folders in a file system. Below is a basic HTML node structure:

* `<html>` node is the root node,
* `<html>` node has two children: `<head>` and `<body>`.
* Since the `<head>` and `<body>` nodes have the same parent `<html>` node they are siblings to each other.
* Similarly, the `<body>` node has two child nodes, the `<h1>` and `<p>` nodes.

It is important to understand this tree-structure when writing a new HTML page or extracting data from an existing HTML page.
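The tree described above can be explored directly in R. The sketch below parses a small, made-up page with the same structure and lists the tag names of each node's children using `html_children()` and `html_name()` from `rvest`.

```r
library(rvest)

# A small page matching the structure described above (made up for illustration)
page <- read_html("
<html>
    <head><title>Demo</title></head>
    <body>
        <h1>Heading</h1>
        <p>Paragraph</p>
    </body>
</html>")

# Children of the root <html> node: the siblings <head> and <body>
html_name(html_children(page))

# Children of the <body> node
body_node <- html_node(page, "body")
html_name(html_children(body_node))
```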

**The `rvest` library**

The `rvest` package is a popular web scraping package for R. After `rvest` reads an HTML page, you can use the tag names to find the child nodes of the current node. We also need to import the `httr` library to get some HTML pages by sending HTTP GET requests.

In [2]:
library(rvest)
library(httr)

First, let's warm up by reading HTML from the following character variable `simple_html_text`:

In [3]:
# A simple HTML document in a character variable
simple_html_text <- "
<html>
    <body>
        <p>This is a test html page</p>
    </body>
</html>"

Then use the `read_html()` function to create the HTML root node, i.e., the `<html>` node, by loading `simple_html_text`:

In [4]:
root_node <- read_html(simple_html_text)
root_node

You can also check the type of root_node

In [5]:
class(root_node)

You can see the class includes xml_node because rvest loads HTML pages and converts them to an XML format internally. XML has a similar parent-child tree structure but is more suitable for storing and transporting data than HTML.

Next, let's try to create an HTML node by loading a remote HTML page given a URL:

In [6]:
ibm_html_node <- read_html("http://www.ibm.com")
ibm_html_node

Sometimes you want to download some HTML pages and analyze them offline. You can use `download.file()` to do so:

In [7]:
# download the R home page and save it to an HTML file locally called r.html
download.file('https://www.r-project.org', destfile = 'r.html')

In [8]:
# Create a html_node from the local r.html file
html_node <- read_html('r.html')
html_node

Coding Exercise: In the code cell below, download an HTML page from any URL you like and create an HTML node from it.

In [9]:
download.file('https://kulis.az/', destfile = 'kulis.html')
html_node <- read_html('kulis.html')
html_node

Now you know how to read an HTML page from a character variable, a URL, or a local HTML file. Next, let's see how to parse and extract data from specific nodes, starting from the root node.

In [10]:
simple_html_text <- "
<html>
    <body>
        <p style=\"color:red;\">This is a test html page</p>
    </body>
</html>"

root_node <- read_html(simple_html_text)
root_node

Get the `<body>` node from its parent node `<html>`:

In [11]:
# Get the child <body> node from <html> root_node
body_node <- html_node(root_node, "body")
body_node

You can see it has a child paragraph node `<p>`.

Let's get the content of the `<p>` node:

In [12]:
# Get the child <p> node from its <body> node
p_node <- html_node(body_node, "p")
p_content <- html_text(p_node)
p_content

The `<p>` node also has a `style` attribute with the value `color:red;`, which means we want the browser to render its text in red. To get an attribute of a node, we can use the function `html_attr()`, passing the node and the attribute name:

In [14]:
# Use the p_node as the first argument to get its attribute
style_attr <- html_attr(p_node, "style")
style_attr
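`html_node()` returns only the first match. When a page contains several nodes of the same kind, `html_nodes()` returns all of them, and `html_text()` and `html_attr()` then work on the whole set. The sketch below uses a made-up HTML snippet with two links to illustrate this.

```r
library(rvest)

# Made-up HTML with several links (illustration only)
links_html <- read_html("
<html>
    <body>
        <a href=\"https://www.r-project.org\">R</a>
        <a href=\"https://www.ibm.com\">IBM</a>
        <p>No link here</p>
    </body>
</html>")

# html_nodes() returns every matching node, not just the first one
link_nodes <- html_nodes(links_html, "a")

# html_text() and html_attr() are vectorised over the node set
link_text <- html_text(link_nodes)
link_href <- html_attr(link_nodes, "href")

data.frame(text = link_text, href = link_href)
```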

The downloaded r.html file (from https://www.r-project.org) has an `<img>` node whose `src` attribute holds the location of the R logo image (a relative path on its web server). Let's try to find the image URL and download it.

You need to paste the relative path in `<img>` onto https://www.r-project.org to get the full URL of the image, and use the `GET()` function to request the image as bytes in the response.

In [15]:
# Write your code below. Don't forget to press Shift+Enter to execute the cell
url <- 'https://www.r-project.org'
html_node <- read_html('r.html')
# Get the image node using its root node
img_node <- html_node(html_node, "img")
# Get the "src" attribute of img node, representing the image location
img_relative_path <- html_attr(img_node, "src")
img_relative_path
# Paste img_relative_path with 'https://www.r-project.org'
image_path <- paste(url, img_relative_path, sep="")
# Use GET request to get the image
image_response<-GET(image_path)

Then use writeBin() function to save the returned bytes into an image file.

In [18]:
# Parse the body from the response as bytes
image <- content(image_response, "raw")
# Write the bytes to a png file
writeBin(image, "r.png")

Now, from the file list on the left, you should be able to find a saved r.png file.
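Pasting the base URL and the path together works when the path starts with a slash. As a more general alternative, `xml2::url_absolute()` resolves a relative path against a base URL. The sketch below assumes a root-relative path such as `/Rlogo.png` and a hypothetical path without a leading slash; neither is guaranteed to match the current content of the R home page.

```r
library(xml2)  # installed as a dependency of rvest

base_url <- "https://www.r-project.org"

# A root-relative path (assumed value of the src attribute)
logo_url <- url_absolute("/Rlogo.png", base_url)
logo_url

# A hypothetical path without a leading slash is resolved against the page URL,
# whereas paste() with sep = "" would glue it directly onto the host name
nested_url <- url_absolute("images/logo.png", "https://www.r-project.org/about.html")
nested_url
paste("https://www.r-project.org", "images/logo.png", sep = "")
```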

In HTML, tabular data is often stored in `<table>` nodes. Thus, it is very important to be able to extract data from `<table>` nodes and preferably convert them into R data frames.

Below is a sample HTML page that contains a color table showing the supported HTML colors. We want to load it as an R data frame so we can analyze it using data frame operations.

In [19]:
table_url <- "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

Like other HTML nodes, let's first get the `<table>` node using html_node function

In [21]:
root_node <- read_html(table_url)
table_node <- html_node(root_node, "table")
table_node

You can see the table node in a messy HTML format. Fortunately, you don't need to parse it yourself: `rvest` provides a handy function called `html_table()` to convert a `<table>` node into an R data frame.

In [22]:
# Extract content from table_node and convert the data into a dataframe
color_data_frame <- html_table(table_node)
head(color_data_frame)

However, you can see that the table headers were parsed as the first row. No worries, we can fix that manually.

In [24]:
names(color_data_frame)

In [25]:
# Convert the first row as column names
names(color_data_frame) <- as.matrix(color_data_frame[1, ])
# Then remove the first row
data_frame <- color_data_frame[-1, ]
head(data_frame)
names(color_data_frame)
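The header fix can be tried on a small table without downloading anything. The sketch below uses a made-up table whose header row is written with `<td>` cells, so `html_table()` does not detect it automatically. It shows the manual fix used above and, as an alternative, the `header = TRUE` argument of `html_table()`.

```r
library(rvest)

# Made-up table whose header row uses <td> cells instead of <th> (illustration only)
demo_table <- html_node(read_html("
<table>
    <tr><td>Color</td><td>Hex</td></tr>
    <tr><td>red</td><td>#FF0000</td></tr>
    <tr><td>lime</td><td>#00FF00</td></tr>
</table>"), "table")

# Option 1: promote the first row manually, as in the cell above
raw_table <- html_table(demo_table)
fixed <- raw_table[-1, ]
names(fixed) <- as.character(unlist(raw_table[1, ]))

# Option 2: ask html_table() to treat the first row as the header
with_header <- html_table(demo_table, header = TRUE)

names(fixed)
names(with_header)
```

Option 2 is shorter, while Option 1 also works when the header is not in the first row, since any row index can be promoted.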

That's it for webscraping in R. There is a lot more you can do; check out [rvest](https://cran.r-project.org/web/packages/rvest/rvest.pdf).