# Week 2 Quiz 
___

## Question 1
Register an application with the GitHub API here: https://github.com/settings/applications. Access the API to get information on your instructor's repositories (hint: [this is the url you want](https://api.github.com/users/jtleek/repos)). Use this data to find the time that the datasharing repo was created. What time was it created?

[This tutorial may be useful.](https://github.com/hadley/httr/blob/master/demo/oauth2-github.r) You may also need to run the code in base R rather than RStudio.




In [1]:
library(jsonlite)
library(httr)

In [2]:
github_credentials <- fromJSON("github_credentials.json")

In [3]:
myapp <- oauth_app("github",
                  key=github_credentials$client_ID,
                  secret=github_credentials$client_secret)
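
The credentials file itself is not shown here. As a minimal sketch, assuming it is a flat JSON object whose two fields match the names used above (`client_ID` and `client_secret`), it could be created and read back like this. The values are placeholders, and the file is written to a temporary path rather than the real `github_credentials.json`.

```r
library(jsonlite)

# Placeholder values only; real ones come from the GitHub application settings page.
example_credentials <- list(
  client_ID = "your-client-id",
  client_secret = "your-client-secret"
)

# Write to a temporary file (hypothetical stand-in for github_credentials.json).
credentials_path <- tempfile(fileext = ".json")
write_json(example_credentials, credentials_path, auto_unbox = TRUE)

# fromJSON returns a named list, so fields are reachable with `$`.
loaded_credentials <- fromJSON(credentials_path)
loaded_credentials$client_ID
```

Keeping the secret in a separate file that stays out of version control avoids pasting it into the notebook.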

Note: The first time this is run, it might need to be carried out outside Docker, because of how R handles its OAuth redirect.

In [4]:
github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)

In [5]:
gtoken <- config(token = github_token)

In [6]:
req = GET("https://api.github.com/rate_limit", gtoken)

In [7]:
req

Response [https://api.github.com/rate_limit]
  Date: 2018-03-24 13:14
  Status: 200
  Content-Type: application/json; charset=utf-8
  Size: 387 B
{
  "resources": {
    "core": {
      "limit": 5000,
      "remaining": 5000,
      "reset": 1521900868
    },
    "search": {
      "limit": 30,
      "remaining": 30,
...
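
The `reset` field is a Unix timestamp (seconds since 1970-01-01 UTC). A small sketch of turning the value shown above into a readable time, using only base R:

```r
# Value copied from the "core" block of the rate-limit response above.
reset_epoch <- 1521900868

# Interpret it as seconds since the Unix epoch, in UTC.
reset_time <- as.POSIXct(reset_epoch, origin = "1970-01-01", tz = "UTC")
format(reset_time, "%Y-%m-%d %H:%M:%S")
```

This works out to 14:14:28 UTC on 2018-03-24, which is when the 5000-request allowance resets.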

In [8]:
req = GET("https://api.github.com/", gtoken)

In [9]:
req

Response [https://api.github.com/]
  Date: 2018-03-24 13:15
  Status: 200
  Content-Type: application/json; charset=utf-8
  Size: 2.16 kB
{
  "current_user_url": "https://api.github.com/user",
  "current_user_authorizations_html_url": "https://github.com/settings/connec...
  "authorizations_url": "https://api.github.com/authorizations",
  "code_search_url": "https://api.github.com/search/code?q={query}{&page,per_...
  "commit_search_url": "https://api.github.com/search/commits?q={query}{&page...
  "emails_url": "https://api.github.com/user/emails",
  "emojis_url": "https://api.github.com/emojis",
  "events_url": "https://api.github.com/events",
  "feeds_url": "https://api.github.com/feeds",
...

In [13]:
content(req, as="parsed")

In [14]:
req <- GET("https://api.github.com/users/jtleek/repos", gtoken)

In [18]:
parsed_req <- content(req, as="parsed")

In [19]:
class(parsed_req)

In [22]:
length(parsed_req)

In [57]:
str(parsed_req)

List of 30
 $ :List of 72
  ..$ id               : int 101394164
  ..$ name             : chr "advdatasci"
  ..$ full_name        : chr "jtleek/advdatasci"
  ..$ owner            :List of 17
  .. ..$ login              : chr "jtleek"
  .. ..$ id                 : int 1571674
  .. ..$ avatar_url         : chr "https://avatars2.githubusercontent.com/u/1571674?v=4"
  .. ..$ gravatar_id        : chr ""
  .. ..$ url                : chr "https://api.github.com/users/jtleek"
  .. ..$ html_url           : chr "https://github.com/jtleek"
  .. ..$ followers_url      : chr "https://api.github.com/users/jtleek/followers"
  .. ..$ following_url      : chr "https://api.github.com/users/jtleek/following{/other_user}"
  .. ..$ gists_url          : chr "https://api.github.com/users/jtleek/gists{/gist_id}"
  .. ..$ starred_url        : chr "https://api.github.com/users/jtleek/starred{/owner}{/repo}"
  .. ..$ subscriptions_url  : chr "https://api.github.com/users/jtleek/subscriptions"
  .. ..$ organizat

Technically, the answer we want is somewhere in the list above, and we could just Ctrl+F to find it. But that's inelegant. Can we retrieve it programmatically?

In [59]:
# Return created_at for the matching repo; non-matches yield NULL, which unlist() drops
unlist(lapply(parsed_req, function(x) if (x$name == "datasharing") x$created_at))

And that's our answer!
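
The same kind of lookup can be sketched on a small stand-in list, which makes the pattern easier to see. Everything in `example_repos` is invented for illustration (names and timestamps alike), and the timestamp format is assumed to be the ISO 8601 style with a trailing `Z`.

```r
# Hypothetical stand-in for parsed_req: a list of repos, each a named list.
# Names and timestamps are invented for illustration.
example_repos <- list(
  list(name = "repo-a", created_at = "2001-01-01T00:00:00Z"),
  list(name = "repo-b", created_at = "2002-02-02T00:00:00Z"),
  list(name = "repo-c", created_at = "2003-03-03T00:00:00Z")
)

# Keep only the entries whose name matches, then pull out one field.
matches <- Filter(function(repo) repo$name == "repo-b", example_repos)
created <- vapply(matches, function(repo) repo$created_at, character(1))

# Parse the timestamp string into a date-time for further use.
created_time <- as.POSIXct(created, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
```

Note that `str` reported a list of 30, which matches the API's default page size. If a repository were missing from the first page, a `per_page` query parameter (up to 100) or following the pagination links would likely be needed.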

## Question 2
The sqldf package allows for execution of SQL commands on R data frames. We will use the sqldf package to practice the queries we might send with the dbSendQuery command in RMySQL.

Download the American Community Survey data and load it into an R object called `acs`.
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv

Which of the following commands will select only the data for the probability weights pwgtp1 with ages less than 50?

In [6]:
if (!file.exists("acs.csv")) {
    download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv", "acs.csv")
}

### On sqldf
`sqldf` is an R package that lets you query data frames with SQL statements.

This works because the data really does pass through a SQL database: by default, `sqldf` copies the data frames named in the query into a temporary SQLite database (via `RSQLite`, which is loaded alongside it), runs the statement there, and returns the result as a data frame.

Further reading:
* [Package summary](https://cran.r-project.org/web/packages/sqldf/sqldf.pdf)
* [GitHub homepage](https://github.com/ggrothendieck/sqldf) - has links to SQLite pages for formulating queries.
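
To make the idea concrete, here is a minimal sketch on a toy data frame. The column names mirror the ACS columns used in this question, but `toy_acs` and its values are made up.

```r
library(sqldf)

# Toy data frame; the column names mirror the ACS columns, the values are made up.
toy_acs <- data.frame(
  pwgtp1 = c(87L, 88L, 94L, 120L),
  AGEP   = c(43L, 42L, 16L, 67L)
)

# SQL version: sqldf loads toy_acs into a temporary SQLite database and runs the query.
under_50_sql <- sqldf("select pwgtp1 from toy_acs where AGEP < 50")

# Base R version of the same filter, for comparison.
under_50_base <- toy_acs$pwgtp1[toy_acs$AGEP < 50]
```

The SQL call returns a data frame with one column, while the base R subset returns a plain vector, so compare `under_50_sql$pwgtp1` against `under_50_base`.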

In [2]:
library(data.table)
library(sqldf)

Loading required package: gsubfn
Loading required package: proto
“no DISPLAY variable so Tk is not available”
Loading required package: RSQLite


In [3]:
acs <- fread("acs.csv")

In [4]:
str(acs)

Classes ‘data.table’ and 'data.frame':	14931 obs. of  239 variables:
 $ RT      : chr  "P" "P" "P" "P" ...
 $ SERIALNO: int  186 186 186 186 306 395 395 506 506 506 ...
 $ SPORDER : int  1 2 3 4 1 1 2 1 2 3 ...
 $ PUMA    : int  700 700 700 700 700 100 100 700 700 700 ...
 $ ST      : int  16 16 16 16 16 16 16 16 16 16 ...
 $ ADJUST  : int  1015675 1015675 1015675 1015675 1015675 1015675 1015675 1015675 1015675 1015675 ...
 $ PWGTP   : int  89 92 107 91 309 108 90 239 213 219 ...
 $ AGEP    : int  43 42 16 14 29 40 15 28 30 4 ...
 $ CIT     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ COW     : int  7 4 1 NA 5 8 NA 1 1 NA ...
 $ DDRS    : int  2 2 2 2 2 2 2 2 2 NA ...
 $ DEYE    : int  2 2 2 2 2 2 2 2 2 NA ...
 $ DOUT    : int  2 2 2 NA 2 2 NA 2 2 NA ...
 $ DPHY    : int  2 2 2 2 2 2 2 2 2 NA ...
 $ DREM    : int  2 2 2 2 2 2 2 2 2 NA ...
 $ DWRK    : int  2 2 2 NA 2 2 NA 2 2 NA ...
 $ ENG     : int  NA NA NA NA NA NA NA NA NA NA ...
 $ FER     : int  NA 2 NA NA NA 2 NA NA 2 NA ...
 $ GCL     : i

In [9]:
# Call for both pwgtp1 and AGEP so we can validate that ages are under 50
sqldf("select pwgtp1, AGEP from acs where AGEP <50")

pwgtp1,AGEP
87,43
88,42
94,16
91,14
539,29
192,40
153,15
232,28
205,30
226,4


## Question 3
Using the same data frame you created in the previous problem, what is the equivalent function to unique(acs$AGEP)?

In [15]:
unique_traditional <- unique(acs$AGEP)

In [14]:
unique_sql <- sqldf("select distinct AGEP from acs")

In [25]:
str(unique_traditional)

 int [1:91] 43 42 16 14 29 40 15 28 30 4 ...


In [26]:
str(unique_sql$AGEP)

 int [1:91] 43 42 16 14 29 40 15 28 30 4 ...


Do these match?

In [30]:
all(unique_traditional == unique_sql$AGEP)

They sure do.
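
One caveat about the check above: `==` recycles the shorter vector when lengths differ, so `all(x == y)` can be `TRUE` for vectors that are not actually the same. Both vectors here have 91 elements, so the result holds, but a stricter comparison is safer in general. A short sketch with made-up vectors:

```r
# Hypothetical vectors of different lengths.
a <- c(1L, 2L, 1L, 2L)
b <- c(1L, 2L)

# `==` recycles the shorter vector, so this comparison reports TRUE.
elementwise_match <- all(a == b)

# identical() compares length, type, and values in one step.
strict_match <- identical(a, b)

# setequal() ignores order and duplicates, which suits a "same set of values" check.
same_values <- setequal(c(3L, 1L, 2L), c(1L, 2L, 3L))
```

`setequal()` may be the better fit for comparing against SQL output, since SQL does not promise any row order without an `order by` clause.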

## Question 4
How many characters are in the 10th, 20th, 30th and 100th lines of HTML from this page:

http://biostat.jhsph.edu/~jleek/contact.html

(Hint: the nchar() function in R may be helpful)

In [70]:
library(httr)
library(XML)

#### A painful approach - `httr` and `XML` deconstruction
We can use the `httr` package to receive the response and attempt to split it up. However, as quickly becomes obvious, this is not a very effective way to count lines. It is a more heavy-handed approach, better suited to extracting particular elements from the HTML content.

In [None]:
?httr

In [None]:
?GET

In [35]:
response <- GET("http://biostat.jhsph.edu/~jleek/contact.html")

In [80]:
parsed_resp <- content(response, as="text")

In [82]:
parsedHTML <- htmlParse(parsed_resp, asText=TRUE)
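
For contrast, this is the kind of task the parsed tree is good at. The sketch below uses a small made-up HTML string (not the real contact page) and pulls out specific elements with XPath.

```r
library(XML)

# Small made-up page standing in for the downloaded HTML.
example_html <- "<html><head><title>Contact</title></head><body><p>Hello</p></body></html>"

# Parse the string into a document tree.
example_doc <- htmlParse(example_html, asText = TRUE)

# Use XPath to pull the text of specific nodes.
page_title <- xpathSApply(example_doc, "//title", xmlValue)
paragraphs <- xpathSApply(example_doc, "//p", xmlValue)
```

Once the document is a tree, the original line breaks are no longer what you are working with, which is why this route is awkward for a per-line character count.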

#### Retrieving via `readLines`
For the purposes of this exercise, we're not *really* interested in the HTML structure, per se. We only care about the raw lines of HTML text, so the `readLines` function is probably the better tool.

In [102]:
con <- url("http://biostat.jhsph.edu/~jleek/contact.html")

In [103]:
html_content <- readLines(con)
close(con)  # release the connection once the lines are read

In [104]:
html_content

In [106]:
for (i in c(10, 20, 30, 100)) {
    print(nchar(html_content[[i]]))
}

[1] 45
[1] 31
[1] 7
[1] 25


EZ
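
Since `nchar()` is vectorized, the loop can also be written as a single call. A minimal sketch, using a `textConnection` over made-up lines in place of the real URL so it runs offline:

```r
# Hypothetical stand-in for the downloaded page: four lines of HTML.
html_text <- c("<html>", "<head><title>Contact</title></head>", "<body>", "</html>")

# readLines works on any connection, including an in-memory one.
text_con <- textConnection(html_text)
html_lines <- readLines(text_con)
close(text_con)

# Index several lines at once; nchar returns one count per line.
line_lengths <- nchar(html_lines[c(1, 2, 4)])
```

On the real page the equivalent call would be `nchar(html_content[c(10, 20, 30, 100)])`, which returns the four counts as one vector.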

## Question 5
Read this data set into R and report the sum of the numbers in the fourth of the nine columns.

https://d396qusza40orc.cloudfront.net/getdata%2Fwksst8110.for

Original source of the data: http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for

(Hint this is a fixed width file format)

In [None]:
?read.fwf

In [107]:
if (!file.exists("wksst8110.for")) {
    download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fwksst8110.for", "wksst8110.for")
}

In [111]:
print(readLines("wksst8110.for", n=5))

[1] " Weekly SST data starts week centered on 3Jan1990"             
[2] ""                                                              
[3] "                Nino1+2      Nino3        Nino34        Nino4" 
[4] " Week          SST SSTA     SST SSTA     SST SSTA     SST SSTA"
[5] " 03JAN1990     23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3"


In [113]:
df <- read.fwf("wksst8110.for",
              header = FALSE,
              skip = 4,
              widths = c(15, 4, 9, 4, 9, 4, 9, 4, 4))

In [116]:
head(df)

V1,V2,V3,V4,V5,V6,V7,V8,V9
03JAN1990,23.4,-0.4,25.1,-0.3,26.6,0.0,28.6,0.3
10JAN1990,23.4,-0.8,25.2,-0.3,26.6,0.1,28.6,0.3
17JAN1990,24.2,-0.3,25.3,-0.3,26.5,-0.1,28.6,0.3
24JAN1990,24.4,-0.5,25.5,-0.4,26.5,-0.1,28.4,0.2
31JAN1990,25.1,-0.2,25.8,-0.2,26.7,0.1,28.4,0.2
07FEB1990,25.8,0.2,26.1,-0.1,26.8,0.1,28.4,0.3


In [119]:
sum(df[[4]])
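
An alternative way to describe the same layout is to use negative widths, which tell `read.fwf` to skip characters instead of folding the padding into neighbouring fields. The sketch below runs on a two-line temporary file: the first line is copied from the `readLines` output above, and the second is reconstructed from the `head(df)` output on the assumption that it follows the same layout.

```r
# First line copied from the readLines() output; second reconstructed from head(df).
sample_lines <- c(
  " 03JAN1990     23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3",
  " 10JAN1990     23.4-0.8     25.2-0.3     26.6 0.1     28.6 0.3"
)
sample_path <- tempfile(fileext = ".for")
writeLines(sample_lines, sample_path)

# Negative widths skip characters: 1 leading space, then 5 spaces before each SST/SSTA pair.
sample_df <- read.fwf(sample_path,
                      widths = c(-1, 9, -5, 4, 4, -5, 4, 4, -5, 4, 4, -5, 4, 4))

# Column 4 is the Nino3 SST reading.
sample_sum <- sum(sample_df[[4]])
```

Both width specifications should produce the same nine columns; the negative-width form just makes the skipped padding explicit.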