# Reading from the Web
___

A friendly reminder: scraping a site too aggressively is a good way to get your IP address blocked.

Useful links:
* [Web Scraping](http://en.wikipedia.org/wiki/Web_scraping)
* [Scraping articles on R-Bloggers](http://www.r-bloggers.com/?s=Web+Scraping)
* [How Netflix Reverse Engineered Hollywood](https://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/)

## HTML Scraping

### A first approach - readLines()

In [1]:
## Don't forget the "s" in "https"
con = url("https://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en")  
htmlCode = readLines(con)
close(con)

“incomplete final line found on 'https://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en'”
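The warning above only means that the last line of the page does not end with a newline character; the lines are still read in full. A minimal sketch of silencing it with the `warn` argument of `readLines()`, using a temporary local file (a stand-in for the web page, so the example runs offline):

```r
# Write a small file whose last line has no trailing newline
# (tmp and its contents are illustrative, not part of the original notebook).
tmp <- tempfile(fileext = ".html")
writeBin(charToRaw("<html>\n<title>Demo</title>\n</html>"), tmp)

# warn = FALSE suppresses the "incomplete final line" warning.
con <- file(tmp, open = "r")
lines <- readLines(con, warn = FALSE)
close(con)

length(lines)
```

The same argument should work for the `url()` connection used above, for example `readLines(con, warn = FALSE)`.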

In [2]:
htmlCode

In [3]:
library(XML)

In [11]:
## Don't forget the "s" in "https"
address <- "https://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en"
con <- url(address)
htmlCode <- readLines(con)
close(con)
## Parse the downloaded lines into an internal document that XPath can query
html <- htmlTreeParse(htmlCode, useInternalNodes=TRUE)

“incomplete final line found on 'https://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en'”

In [12]:
xpathSApply(html, "//title", xmlValue)
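The same XPath approach extends beyond the page title. The sketch below runs on a small hand-written HTML string rather than the live page, so it works offline; the table layout and the `title` and `cites` class names are made up for illustration and are not the markup Google Scholar actually uses.

```r
library(XML)

# Hypothetical page content standing in for downloaded HTML.
page <- '<html><head><title>Demo page</title></head>
<body><table>
<tr><td class="title">Paper A</td><td class="cites">10</td></tr>
<tr><td class="title">Paper B</td><td class="cites">3</td></tr>
</table></body></html>'

doc <- htmlParse(page, asText = TRUE)

# //title matches the <title> element anywhere in the document.
page_title <- xpathSApply(doc, "//title", xmlValue)

# [@class='...'] restricts the match to cells with a given class attribute.
titles <- xpathSApply(doc, "//td[@class='title']", xmlValue)
cites <- as.numeric(xpathSApply(doc, "//td[@class='cites']", xmlValue))

data.frame(title = titles, cites = cites)
```

For a real page, inspect the HTML source first to find the element names and attributes worth targeting.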

### Retrieving with the httr package

The httr [reference manual](http://cran.r-project.org/web/packages/httr/httr.pdf) on CRAN summarises the package.

In [8]:
library(httr)

In [17]:
html2 <- GET(address)
class(html2)
print(html2)

Response [https://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en]
  Date: 2018-03-24 11:39
  Status: 200
  Content-Type: text/html; charset=ISO-8859-1
  Size: 130 kB
<!doctype html><html><head><title>Jeff Leek - Google Scholar Citations</title...
2]=arguments[e];return b.prototype[c].apply(a,d)}};var ia=function(a){var b=[...
document.documentElement.classList||function(){function a(a){return(a=(a=a.cl...
qa=ra?0<+ra[1]:r("Android")?!0:window.matchMedia&&window.matchMedia("(pointer...
c.type="hidden",c.name=b,a.appendChild(c));return c},ya=function(a){t("gsc_md...
var Ca=function(a){var b=a.b,c=b.length;a=a.m;for(var d=0,e=0;e<c;e++){var g=...
a},Ea=function(a,b,c,d,e){a.addEventListener(b,c,La(d,e))},Ga=function(a,b,c,...
var Ra=function(){Ha(["mousedown","touchstart"],function(){q(document.documen...
a;){var c=b||D.length>a+1;D.pop()(!!c)}},Va=function(a){for(var b=0;a&&!(b=C[...
y(document,"focus",function(a){var b=D.length;if(b)for(var c=Va(a.target);c<b...
...


In [14]:
content2 <- content(html2,as="text")
parsedHtml <- htmlParse(content2,asText=TRUE)
xpathSApply(parsedHtml, "//title", xmlValue)
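The steps above (download, extract the body as text, parse, query) can be wrapped into small functions. This is only a sketch: `title_from_text` and `fetch_title` are hypothetical helper names, and the `stop_for_status()` call is an addition that turns a failed request into an R error instead of parsing an error page.

```r
library(httr)
library(XML)

# Parse an HTML string and return the text of its <title> element(s).
title_from_text <- function(html_text) {
  doc <- htmlParse(html_text, asText = TRUE)
  xpathSApply(doc, "//title", xmlValue)
}

# Hypothetical wrapper: download a page, stop on HTTP errors, extract the title.
# It is defined but not called here because it needs network access.
fetch_title <- function(address) {
  resp <- GET(address)
  stop_for_status(resp)  # raises an R error for 4xx and 5xx responses
  title_from_text(content(resp, as = "text"))
}

# Offline check of the parsing step on a hand-written string.
title_from_text("<html><head><title>Demo</title></head><body></body></html>")
```

With network access, `fetch_title(address)` would repeat the title lookup from the cell above in one call.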

**Making use of authentication**

We can use `httpbin.org` to test authentication.

In [18]:
pg2 <- GET("http://httpbin.org/basic-auth/user/passwd",
          authenticate("user","passwd"))
pg2

Response [http://httpbin.org/basic-auth/user/passwd]
  Date: 2018-03-24 11:48
  Status: 200
  Content-Type: application/json
  Size: 47 B
{
  "authenticated": true, 
  "user": "user"
}
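The body of this response is JSON, and `content(pg2)` should hand it back as an R list. To show what that parsing amounts to without needing network access, the sketch below applies `jsonlite::fromJSON()` directly to the body printed above; loading jsonlite explicitly is an assumption, since the original notebook does not use it.

```r
library(jsonlite)

# Body copied from the response printed above.
body <- '{
  "authenticated": true,
  "user": "user"
}'

# fromJSON() turns the JSON object into a named R list.
parsed <- fromJSON(body)
parsed$authenticated
parsed$user
```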

In [19]:
names(pg2)
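By default `authenticate()` uses HTTP basic authentication, which sends the credentials in an `Authorization` request header. The sketch below builds that header value by hand to show what goes over the wire; it uses `jsonlite::base64_enc()` as an assumed helper, and httr does this step for you in practice.

```r
library(jsonlite)

user <- "user"
password <- "passwd"

# Basic auth joins the credentials with a colon and base64-encodes the result.
token <- base64_enc(charToRaw(paste0(user, ":", password)))
auth_header <- paste("Basic", token)
auth_header
```

Base64 is an encoding, not encryption, so basic authentication is only safe over `https`.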

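**Using handles**

A handle preserves settings and cookies across multiple requests to the same host, so a login or cookie obtained by one request is available to the next.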

In [21]:
google <- handle("https://google.com")
pg1 <- GET(handle=google,path="/")
pg2 <- GET(handle=google,path="/search")

In [22]:
pg1

Response [https://www.google.com.au/?gfe_rd=cr&dcr=0&ei=lju2WqHMJsXr8Afy1qToBQ]
  Date: 2018-03-24 11:50
  Status: 200
  Content-Type: text/html; charset=ISO-8859-1
  Size: 11.4 kB
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="...
</style><style>body,td,a,p,.h{font-family:arial,sans-serif}body{margin:0;over...
if (!iesg){document.f&&document.f.q.focus();document.gbqf&&document.gbqf.q.fo...
}
})();</script><div id="mngb"> <div id=gbar><nobr><b class=gb1>Search</b> <a c...

In [23]:
pg2

Response [https://www.google.com.au/webhp?gfe_rd=cr&dcr=0&ei=mDu2WvO2GPHDXpzTl5gD]
  Date: 2018-03-24 11:50
  Status: 200
  Content-Type: text/html; charset=ISO-8859-1
  Size: 11.4 kB
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="...
</style><style>body,td,a,p,.h{font-family:arial,sans-serif}body{margin:0;over...
if (!iesg){document.f&&document.f.q.focus();document.gbqf&&document.gbqf.q.fo...
}
})();</script><div id="mngb"> <div id=gbar><nobr><b class=gb1>Search</b> <a c...
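Requesting several paths through one handle can be combined with a pause between requests, which also helps with the opening reminder about not hammering a site. A minimal sketch, where `get_paths` and its `delay` argument are hypothetical names introduced for illustration:

```r
library(httr)

# Hypothetical helper: request several paths on one host through a shared
# handle, sleeping before each request to stay polite.
get_paths <- function(base_url, paths, delay = 1) {
  h <- handle(base_url)  # creating a handle does not send a request
  lapply(paths, function(p) {
    Sys.sleep(delay)
    GET(handle = h, path = p)
  })
}
```

With network access, `pages <- get_paths("https://google.com", c("/", "/search"))` would return a list of two responses, matching `pg1` and `pg2` above.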