## Importing data from the web (Part 1)

More and more of the information that data scientists are using resides on the web. Importing this data into R requires an understanding of the protocols used on the web. In this chapter, you'll get a crash course in HTTP and learn to perform your own HTTP requests from inside R.

### Import flat files from the web
In the video, you saw that the utils functions to import flat file data, such as read.csv() and read.delim(), are capable of automatically importing from URLs that point to flat files on the web.
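
As a minimal sketch of the base-R behaviour described above, the snippet below reads a comma-separated file and a tab-delimited file through URLs with read.csv() and read.delim(). So that it runs without network access, it assumes two small hand-written temporary files and points file:// URLs at them; the helper to_file_url() and the sample rows are illustrative and not part of the course data. An http:// or https:// URL string would be passed in the same way.

```r
# Hypothetical helper: turn a local path into a file:// URL
to_file_url <- function(path) {
  path <- normalizePath(path, winslash = "/")
  if (!startsWith(path, "/")) path <- paste0("/", path)  # Windows drive paths
  paste0("file://", path)
}

# Hand-written stand-ins for files that would normally live on a web server
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Latitude", "Pool A,-27.6", "Pool B,-27.5"), csv_path)
tsv_path <- tempfile(fileext = ".txt")
writeLines(c("area\ttemp", "1\t2", "3\t4"), tsv_path)

# read.csv() and read.delim() accept a URL string in place of a file path
pools_base <- read.csv(to_file_url(csv_path))
potatoes_base <- read.delim(to_file_url(tsv_path))

str(pools_base)
str(potatoes_base)
```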

You may be wondering whether Hadley Wickham's alternative package, readr, is equally capable. Find out in this exercise! The URLs for both a .csv file and a tab-delimited .txt file are already coded for you. It's up to you to actually import the data. If it works, that is…

In [2]:
# Load the readr package
library(readr)

# URLs of the csv file and the tab-delimited txt file
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
url_delim <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/potatoes.txt"

# Import and print the csv file: pools
pools <- read_csv(url_csv)
print(pools)

# Import and print the txt file: potatoes
potatoes <- read_tsv(url_delim)
print(potatoes)


-- Column specification ------------------------------------------------------------------------------------------------
cols(
  Name = col_character(),
  Address = col_character(),
  Latitude = col_double(),
  Longitude = col_double()
)



# A tibble: 20 x 4
   Name                         Address                       Latitude Longitude
   <chr>                        <chr>                            <dbl>     <dbl>
 1 Acacia Ridge Leisure Centre  1391 Beaudesert Road, Acacia~    -27.6      153.
 2 Bellbowrie Pool              Sugarwood Street, Bellbowrie     -27.6      153.
 3 Carole Park                  Cnr Boundary Road and Waterf~    -27.6      153.
 4 Centenary Pool (inner City)  400 Gregory Terrace, Spring ~    -27.5      153.
 5 Chermside Pool               375 Hamilton Road, Chermside     -27.4      153.
 6 Colmslie Pool (Morningside)  400 Lytton Road, Morningside     -27.5      153.
 7 Spring Hill Baths (inner Ci~ 14 Torrington Street, Spring~    -27.5      153.
 8 Dunlop Park Pool (Corinda)   794 Oxley Road, Corinda          -27.5      153.
 9 Fortitude Valley Pool        432 Wickham Street, Fortitud~    -27.5      153.
10 Hibiscus Sports Complex (up~ 90 Klumpp Road, Upper Mount ~    -27.6      153.
11 Ithaca


-- Column specification ------------------------------------------------------------------------------------------------
cols(
  area = col_double(),
  temp = col_double(),
  size = col_double(),
  storage = col_double(),
  method = col_double(),
  texture = col_double(),
  flavor = col_double(),
  moistness = col_double()
)



# A tibble: 160 x 8
    area  temp  size storage method texture flavor moistness
   <dbl> <dbl> <dbl>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>
 1     1     1     1       1      1     2.9    3.2       3  
 2     1     1     1       1      2     2.3    2.5       2.6
 3     1     1     1       1      3     2.5    2.8       2.8
 4     1     1     1       1      4     2.1    2.9       2.4
 5     1     1     1       1      5     1.9    2.8       2.2
 6     1     1     1       2      1     1.8    3         1.7
 7     1     1     1       2      2     2.6    3.1       2.4
 8     1     1     1       2      3     3      3         2.9
 9     1     1     1       2      4     2.2    3.2       2.5
10     1     1     1       2      5     2      2.8       1.9
# ... with 150 more rows
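
The potatoes file is tab-delimited, which is why read_tsv() is the function used for it. As a small sketch, read_tsv() behaves like read_delim() with delim = "\t". The two-row string below is made-up sample data (not the course file), and col_types is supplied so that no column specification message is printed.

```r
library(readr)

# Made-up tab-delimited sample, passed as literal data (it contains newlines)
tsv_text <- "area\ttemp\n1\t2.5\n3\t4.5\n"

# read_tsv() is the tab-delimited shortcut ...
via_tsv <- read_tsv(tsv_text, col_types = "dd")

# ... for read_delim() with an explicit tab delimiter
via_delim <- read_delim(tsv_text, delim = "\t", col_types = "dd")

print(via_tsv)
all.equal(via_tsv$temp, via_delim$temp)
```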


### Secure importing
In the previous exercises, you have been working with URLs that all start with http://. There is, however, a safer alternative to HTTP, namely HTTPS, which stands for HyperText Transfer Protocol Secure. Just remember this: HTTPS is relatively safe, HTTP is not.

Luckily for us, you can use the standard importing functions with https:// connections since R version 3.2.2.
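
A quick sketch of how one might check this before relying on it: compare the running R version against 3.2.2 and switch the scheme of an existing URL. The variable names are illustrative, and whether a given server actually answers over HTTPS is an assumption you would still need to verify for your own URLs.

```r
# Base-R import functions gained https:// support in R 3.2.2
https_ready <- getRversion() >= "3.2.2"
https_ready

# The only change needed in the URL itself is the scheme
url_http <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
url_https <- sub("^http://", "https://", url_http)
url_https
```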

In [3]:
# https URL to the swimming_pools csv file.
url_csv <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"

# Import the file using read.csv(): pools1
pools1 = read.csv(url_csv)

# Import the file using read_csv(): pools2
pools2 = read_csv(url_csv)

# Print the structure of pools1 and pools2
str(pools1)
str(pools2)


-- Column specification ------------------------------------------------------------------------------------------------
cols(
  Name = col_character(),
  Address = col_character(),
  Latitude = col_double(),
  Longitude = col_double()
)



'data.frame':	20 obs. of  4 variables:
 $ Name     : Factor w/ 20 levels "Acacia Ridge Leisure Centre",..: 1 2 3 4 5 6 19 7 8 9 ...
 $ Address  : Factor w/ 20 levels "1 Fairlead Crescent, Manly",..: 5 20 18 10 9 11 6 15 12 17 ...
 $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
 $ Longitude: num  153 153 153 153 153 ...
tibble [20 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Name     : chr [1:20] "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
 $ Address  : chr [1:20] "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
 $ Latitude : num [1:20] -27.6 -27.6 -27.6 -27.5 -27.4 ...
 $ Longitude: num [1:20] 153 153 153 153 153 ...
 - attr(*, "spec")=
  .. cols(
  ..   Name = col_character(),
  ..   Address = col_character(),
  ..   Latitude = col_double(),
  ..   Longitude = col_double()
  .. )


### Import Excel files from the web
When you learned about gdata, it was already mentioned that gdata can handle .xls files that are on the internet. readxl can't, at least not yet. The URL with which you'll be working is already available in the sample code. You will import it once using gdata and once with the readxl package via a workaround.

In [None]:
# Load the readxl package
library(readxl)

# Specification of url: url_xls
url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"

# Download file behind URL, name it local_latitude.xls
# (mode = "wb" keeps the binary .xls file intact on Windows)
download.file(url_xls, "local_latitude.xls", mode = "wb")

# Import the local .xls file with readxl: excel_readxl
excel_readxl <- read_excel("local_latitude.xls")
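
The gdata half of the exercise is not shown in the cell above. A hedged sketch of what it could look like follows: gdata's read.xls() takes the URL directly. It assumes the gdata package and the Perl installation it depends on are available, and that the URL is still reachable; the name excel_gdata and the NULL fallback are illustrative choices, not part of the original notes.

```r
# Assumes the gdata package (and the Perl it relies on) is installed
url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"

excel_gdata <- NULL
if (requireNamespace("gdata", quietly = TRUE)) {
  # read.xls() accepts the URL directly; any error (no network, no Perl) gives NULL
  excel_gdata <- tryCatch(gdata::read.xls(url_xls), error = function(e) NULL)
}

if (is.null(excel_gdata)) {
  message("gdata import not available here; use the download.file() workaround")
} else {
  str(excel_gdata)
}
```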

### Downloading any file, secure or not
In the previous exercise you've seen how you can read Excel files on the web using the readxl package by first downloading the file with the download.file() function.

There's more: with download.file() you can download any kind of file from the web, using HTTP and HTTPS: images, executable files, but also .RData files. An RData file is a very efficient format to store R data.

You can load data from an RData file using the load() function, but this function does not accept a URL string as an argument. In this exercise, you'll first download the RData file securely, and then import the local data file.

In [10]:
# https URL to the wine RData file.
url_rdata <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"

# Download the wine file to your working directory
download.file(url_rdata, destfile = "wine_local.RData")

# Load the wine data into your workspace using load()
load("wine_local.RData")

# Print out the summary of the wine data
summary(wine)


    Alcohol        Malic acid        Ash        Alcalinity of ash
 Min.   :11.03   Min.   :0.74   Min.   :1.360   Min.   :10.60    
 1st Qu.:12.36   1st Qu.:1.60   1st Qu.:2.210   1st Qu.:17.20    
 Median :13.05   Median :1.87   Median :2.360   Median :19.50    
 Mean   :12.99   Mean   :2.34   Mean   :2.366   Mean   :19.52    
 3rd Qu.:13.67   3rd Qu.:3.10   3rd Qu.:2.560   3rd Qu.:21.50    
 Max.   :14.83   Max.   :5.80   Max.   :3.230   Max.   :30.00    
   Magnesium      Total phenols     Flavanoids    Nonflavanoid phenols
 Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300      
 1st Qu.: 88.00   1st Qu.:1.740   1st Qu.:1.200   1st Qu.:0.2700      
 Median : 98.00   Median :2.350   Median :2.130   Median :0.3400      
 Mean   : 99.59   Mean   :2.292   Mean   :2.023   Mean   :0.3623      
 3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.860   3rd Qu.:0.4400      
 Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600      
 Proanthocyanins Color intensity       Hu
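
The download-then-load pattern can also be tried without network access. In the sketch below, a small made-up data frame saved to a temporary .RData file plays the role of the remote file, and a file:// URL stands in for the https:// one; the helper to_file_url() and the object demo_df are hypothetical. It also shows that load() returns the names of the objects it restored, which is handy when you do not know what an .RData file contains.

```r
# Hypothetical helper: turn a local path into a file:// URL
to_file_url <- function(path) {
  path <- normalizePath(path, winslash = "/")
  if (!startsWith(path, "/")) path <- paste0("/", path)  # Windows drive paths
  paste0("file://", path)
}

# Create a small .RData file to play the role of the remote file
demo_df <- data.frame(x = 1:3, y = c("a", "b", "c"))
remote_path <- tempfile(fileext = ".RData")
save(demo_df, file = remote_path)

# Download it; mode = "wb" requests a binary transfer
local_path <- tempfile(fileext = ".RData")
status <- download.file(to_file_url(remote_path), destfile = local_path,
                        mode = "wb", quiet = TRUE)

# load() restores the objects and invisibly returns their names
env <- new.env()
loaded <- load(local_path, envir = env)
loaded
summary(env$demo_df)
```

Loading into a separate environment, as done here, avoids silently overwriting objects of the same name in your workspace.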

### HTTP? httr! (1)
Downloading a file from the Internet means sending a GET request and receiving the file you asked for. Internally, all the previously discussed functions use a GET request to download files.

httr provides a convenient function, GET(), to execute this GET request. The result is a response object, which provides easy access to the status code, the content-type and, of course, the actual content.

You can extract the content from the response using the content() function. At the time of writing, there are three ways to retrieve this content: as a raw object, as a character vector, or as a parsed R object, such as a list. If you don't tell content() how to retrieve the content through the as argument, it will try to figure out which type is most appropriate based on the content-type.

In [None]:
# Load the httr package
library(httr)

# Get the url, save response to resp
url <- "http://www.example.com/"
resp <- GET(url)

# Print resp
resp

# Get the raw content of resp: raw_content
raw_content <- content(resp, as = "raw")

# Print the head of raw_content
head(raw_content)
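
Raw content is simply a vector of bytes. As a sketch of what head(raw_content) shows and how those bytes relate to text, the snippet below uses a hand-written byte vector as a stand-in for a response body (it is not fetched from the server). Asking for content(resp, as = "text") gives you the decoded text of the whole body without this manual step.

```r
# Hand-written stand-in for the first bytes of an HTML response body
raw_content <- charToRaw("<!doctype html>")

# A raw vector prints as hexadecimal byte values
head(raw_content)
# [1] 3c 21 64 6f 63 74

# rawToChar() decodes the bytes back into a character string
rawToChar(head(raw_content))
# [1] "<!doct"
```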

### HTTP? httr! (2)
Web content does not limit itself to HTML pages and files stored on remote servers such as DataCamp's Amazon S3 instances. There are many other data formats out there. A very common one is JSON. This format is very often used by so-called Web APIs, interfaces to web servers with which you as a client can communicate to get or store information in more complicated ways.

You'll learn about Web APIs and JSON in the video and exercises that follow, but some experimentation never hurts, does it?

In [14]:
# Specify the url
url <- "http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json"

# Send the GET request and print resp
resp <- GET(url)
resp

# Print content of resp as text
content(resp, as = "text")

# Print content of resp
content(resp)

Response [http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json]
  Date: 2021-03-19 13:42
  Status: 200
  Content-Type: application/json; charset=utf-8
  Size: 1.05 kB
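
To make the difference between the two content() calls concrete: as = "text" returns the body as one JSON string, while the default parsed form turns a JSON body into a named R list. httr hands JSON parsing to the jsonlite package, so the sketch below calls jsonlite directly on a short hand-written JSON string. That string is an illustrative stand-in, not the actual API reply.

```r
library(jsonlite)

# Hand-written stand-in for a JSON body; not the actual API reply
json_text <- '{"Title":"Annie Hall","Year":"1977"}'

# Parsing without simplification yields a plain named list
parsed <- fromJSON(json_text, simplifyVector = FALSE)
str(parsed)

# Individual fields are then available by name
parsed$Title
```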
