# R Workshop 8

## Review

In the last workshop we rounded out the **dplyr** functions with

* `arrange` - sorting rows,
* `select` - restricting columns,
* `mutate` - computing new columns,
* `transmute` - computing new columns and keeping only those,
* `group_by` - specifying groups of rows,
* `summarize` - summarizing groups of rows.

We found that each of these functions is pretty simple on its own.
The dataset is always the first argument.  The other parameters
determine how to process the data.

The real power comes with combining these functions into a **pipeline**.
The pipeline operator is `%>%` (it comes from the **magrittr** package
and is re-exported by dplyr).  This is introduced in
Section 5.6.1 of **R for Data Science**.

<http://r4ds.had.co.nz/transform.html#grouped-summaries-with-summarise>

When these functions are cascaded into a dplyr pipeline,
the input is implied to come from the previous stage.

```
result <- mydata    %>%
          f1(args1) %>%
          f2(args2) %>%
          f3(args3)
```

where `f1`, `f2`, and `f3` are any of the dplyr functions.
This is equivalent to the following.

```
temp1  <- f1(mydata, args1)
temp2  <- f2(temp1, args2)
result <- f3(temp2, args3)
```

The pipeline is a bit cleaner in that it

* is easier to read, and
* reduces temporary variables.

The simple example above looks only slightly cleaner,
but the improvement grows as the problems get more involved.
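
As a concrete illustration, here is a minimal sketch that builds the same result twice, once as a pipeline and once with temporary variables. It assumes the built-in `mtcars` data frame, chosen only for convenience; it is not the dataset used in this workshop.

```r
library(dplyr)

# Pipeline form: each stage receives the output of the previous one.
piped <- mtcars %>%
  filter(cyl > 4) %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg), n = n()) %>%
  arrange(desc(mean_mpg))

# Equivalent form with one temporary variable per stage.
temp1    <- filter(mtcars, cyl > 4)
temp2    <- group_by(temp1, cyl)
temp3    <- summarize(temp2, mean_mpg = mean(mpg), n = n())
stepwise <- arrange(temp3, desc(mean_mpg))

# Both forms should describe the same table.
all.equal(as.data.frame(piped), as.data.frame(stepwise))
```

With four stages the pipeline already avoids three throwaway names, which is where the readability gain starts to show.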

```
library(nycflights13)
library(tidyverse)
```

```
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats
```

Let's dig into the syntax to understand it better and consider
what benefits it may have.  First notice the plus sign, `+`, as
if we're adding two numbers together.  This is an example of
*operator overloading*.  The developers of **ggplot** have overloaded
the `+` operator to represent **layering** of plot components.

In the above example, there are only two layers.  The first layer
assigns the data frame that is the source of the data.  This implies
that the source dataset for **ggplot** is a `data.frame` object
(a tibble qualifies, since it is also a `data.frame`); **ggplot**
tries to convert anything else to a data frame before plotting.
Most **ggplot** invocations will start with a call to the
`ggplot` function passing in the source `data.frame` object.  At this
point we've provided no information on which parts to render or how
to render them.

The next layer will generally be a **geom** (pronounced GEE-ohm).
It's short for "geometry" and represents a type of plot.  The
`geom_point` function plots points; the `geom_line` function plots
lines.  A rendering can have several geom layers.  The most important
parameter to a geom function is the **aesthetic** mapping.  It
determines which columns of the data frame are mapped to which visual
properties of the geom.  The most common aesthetics are `x` and `y` positions.
But as we saw in the example above, `color` is also useful.
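
The layering described above can be sketched as follows. This assumes the `mpg` dataset that ships with **ggplot2** (the `displ`, `hwy`, and `class` columns come from that dataset); it is only an illustration and not necessarily the plot used earlier.

```r
library(ggplot2)

# First layer: name the source data frame. Nothing is drawn yet.
base <- ggplot(data = mpg)

# Next layer: a geom whose aesthetic maps columns to x, y, and color.
p <- base + geom_point(mapping = aes(x = displ, y = hwy, color = class))

# A rendering can carry several geoms; `+` simply adds another layer.
p2 <- p + geom_smooth(mapping = aes(x = displ, y = hwy))

# Display the scatter plot (in a script, print() is needed explicitly).
print(p)
```

Because each `+` returns a new plot object, intermediate plots such as `base` and `p` can be kept and extended in different directions.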

Look at Exercise `1` of Section `3.3.1` in
[R for Data Science](http://r4ds.had.co.nz/data-visualisation.html#aesthetic-mappings).
This is a good way to avoid a common misunderstanding.

```
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
```

Someone might think this should make all points blue.
But everything assigned within `aes` is treated as **data** to be
mapped, not as a literal setting.  Here the string `"blue"` is treated
as a constant variable whose only value is `"blue"`, so every point
gets the first color of the default color scale (and a legend appears)
rather than the color blue.  The intention was probably

```
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
```

Notice how the color assignment was done **outside** the `aes`
function, but still within the `geom_point` function.
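
The difference between mapping and setting a color can be inspected directly on the plot objects. The sketch below again assumes the `mpg` dataset from **ggplot2**; the variable names `mapped` and `fixed` are made up for this illustration.

```r
library(ggplot2)

# "blue" inside aes(): treated as data, a constant variable with one value.
mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# "blue" outside aes(): a fixed setting applied to the whole layer.
fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# The layer records the difference: ggplot2 stores both spellings
# under the name "colour", once as a mapping and once as a parameter.
names(mapped$layers[[1]]$mapping)
names(fixed$layers[[1]]$aes_params)
```

Printing `mapped` and `fixed` side by side shows the practical effect: only `fixed` draws blue points.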

Now read through Sections
[3.5](http://r4ds.had.co.nz/data-visualisation.html#facets)
and
[3.6](http://r4ds.had.co.nz/data-visualisation.html#geometric-objects)
of the **R for Data Science** Visualization section.

## RStudio

Let's do some work in RStudio and see how it helps us go from
free exploration to scripting to documented analysis.
We'll explore some data from the LA County Sheriff's Open Data Portal.
This data is made available through JSON over HTTP.  Details are
provided at the following URL.

<https://dev.socrata.com/foundry/data.lacounty.gov/uvdj-ch3p>

We'll use the **httr** package to invoke the API and parse the
JSON.  The tricky part will involve converting the JSON to a
data frame.

```
library(httr)
response <- GET('https://data.lacounty.gov/',
                path='resource/uvdj-ch3p',
                query=list('$limit'='5'),
                add_headers(accept='application/json'))
response$status_code
```

If the status code is anything other than `200`, we need to provide
additional handling.  Anything `400` and above usually means an error.
The `content` function is part of the **httr** package.  It uses the
`Content-Type` response header as a clue on how to parse the content.
In the present case, we expect the content type to be `application/json`,
which will cause the `content` function to parse the response as JSON.
The result will be a list of top level JSON elements, one for each row.
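
One way to provide that additional handling is sketched below. `ensure_ok` is a hypothetical helper name, not part of **httr**; it relies only on the `status_code` field shown above. (**httr** also ships its own `stop_for_status()` function, which can serve a similar purpose.)

```r
# Hypothetical helper: stop early unless the request succeeded.
# It only inspects the `status_code` field of the response.
ensure_ok <- function(response) {
  code <- response$status_code
  if (code >= 400) {
    # 4xx and 5xx codes indicate client or server errors.
    stop("Request failed with HTTP status ", code)
  }
  if (code != 200) {
    # Other codes are not errors, but they are not what we expect either.
    warning("Unexpected HTTP status ", code)
  }
  invisible(response)
}

# Usage with the response from the GET call above:
# ensure_ok(response)
# result_rows <- content(response)
```

Calling the check before `content` keeps an error page from being parsed as if it were data.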

```
result_rows <- content(response)
length(result_rows)
```

The number of result rows should equal the limit we placed in the API's URL.
In this case, this is `5`.

We are presented with a few challenges when trying to wedge `result_rows`
into a data frame.

1. The first is that `result_rows` is organized by rows.
   Each top level element is a row.  But an R `data.frame`, internally,
   is a list of columns.  In order to convert `result_rows` to a
   `data.frame` we need to somehow exchange rows and columns.

2. Not all rows in `result_rows` have the same columns.  Moreover, some
   columns have values which are themselves lists.  We need to determine
   which columns we want and handle missing values.
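
Both challenges can be seen on a small scale. The sketch below uses made-up values (the `id` and `zip` fields are assumptions for illustration, not the API's actual output) to contrast the row-oriented list with the column-oriented `data.frame`.

```r
# Row-oriented data, shaped like parsed JSON (made-up values).
rows <- list(
  list(id = "a1", zip = "90001"),
  list(id = "b2"),                  # this row has no zip
  list(id = "c3", zip = "90210")
)

# Column-oriented data, as a data.frame stores it.
df <- data.frame(id  = c("a1", "b2", "c3"),
                 zip = c("90001", NA, "90210"),
                 stringsAsFactors = FALSE)

length(rows)      # counts rows
length(df)        # counts columns
is.list(df)       # a data.frame is a list of columns
names(rows[[2]])  # the second row lacks "zip"
```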

Let's review the available columns by printing the names of the first row.

```
names(result_rows[[1]])
```

This is somewhat haphazard since this row could be missing columns.
We're going to choose from this list the columns we want to handle.
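
A less haphazard survey would collect the names from every row. The sketch below shows one possible approach on made-up rows (the field names are assumptions for illustration); the same two expressions could be applied to `result_rows`.

```r
# Made-up rows in which the available columns differ.
rows <- list(
  list(id = "a1", zip = "90001"),
  list(id = "b2"),
  list(id = "c3", zip = "90210", city = "Beverly Hills")
)

# Every column name that appears in at least one row.
all_columns <- unique(unlist(lapply(rows, names)))
all_columns

# How many rows carry each column; low counts flag sparse columns.
column_counts <- table(unlist(lapply(rows, names)))
column_counts
```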

```
our_columns <- names(result_rows[[1]])[c(1:5, 7, 15:23)]
our_columns
```

`result_rows` is a list of lists.  Each top level element
is a row.  The "inner list" is the set of named columns.
We want to create the opposite: a list of columns, each
element of which is an atomic vector (*atomic* means
"same type for all elements").  Given a column name, we
can create its atomic vector with the following expression.

```
sapply(result_rows, function(row) { row[[col_name]] })
```

The `sapply` function iterates over a list and does something
with each element.  It has two arguments.

1. a list
2. a function to be applied to each element of the list

The result is a new list with the same number of elements as
the first argument.  The value of each element is the
result of calling the function from the second argument.
The `s` in the name of `sapply` means *simplify if possible*.
This means that if the resulting list contains primitive elements
of the same type, the list is converted to a vector before returning.
Otherwise it's left as a new list.
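
The simplification rule can be checked on a toy list. This is a small sketch with made-up values; `vapply` is a base R relative of `sapply` that is not used in this workshop and is shown only as a stricter alternative.

```r
values <- list(1, 2, 3)

# Every result is a single number, so sapply simplifies to a numeric vector.
doubled <- sapply(values, function(x) { x * 2 })
is.list(doubled)

# The results have different lengths, so sapply leaves them as a list.
ragged <- sapply(values, function(x) { seq_len(x) })
is.list(ragged)

# vapply states the expected result type up front and fails otherwise.
strict <- vapply(values, function(x) { x * 2 }, numeric(1))
```

Because the return type of `sapply` depends on the data, it is worth checking with `is.list` whenever the input may be irregular.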

Recall the second challenge described above: not all rows have all
columns.  When a column is missing, `row[[col_name]]` returns `NULL`.
`sapply` then cannot simplify the result to an atomic vector, and
dropping the `NULL` entries would leave column vectors of unequal
length (which will cause the conversion to a `data.frame` to
fail).  To maintain placeholders for missing elements, we assign `NA`
instead of `NULL`.

```
sapply(result_rows, function(row) { 
    if ( is.null(row[[col_name]]) ) 
       NA
    else
       row[[col_name]] 
    })
```

This will assign `NA` if the value is `NULL`, or the value if
it is not `NULL`.  Let's see this in action.  First let's
apply it to a column with no `NULL` values to see the simple case.

```
col_name <- 'crime_identifier'
sapply(result_rows, function(row) { row[[col_name]] })
```

```
col_name <- 'zip'
sapply(result_rows, function(row) { row[[col_name]] })
```

```
sapply(result_rows, function(row) { 
    if (is.null(row[[col_name]]))
        NA
    else
        row[[col_name]]
    })
```

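The last two executions demonstrate the need for the check.
With `NULL` values present, `sapply` returns a list that still contains
the `NULL` entries; once those are dropped (for example by `unlist`),
only `2` values remain.  With the `NA` placeholders, the result is an
atomic vector of length `5`.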
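
The same effect can be reproduced without the API. The sketch below uses three made-up rows, one of which lacks a `zip` field (the values are assumptions for illustration).

```r
rows <- list(
  list(id = "a1", zip = "90001"),
  list(id = "b2"),                  # no zip in this row
  list(id = "c3", zip = "90210")
)
col_name <- "zip"

# Without the check: the NULL prevents simplification, so we get a list.
raw <- sapply(rows, function(row) { row[[col_name]] })
is.list(raw)
length(unlist(raw))   # flattening silently drops the NULL entry

# With the check: NA keeps a placeholder, so we get one value per row.
padded <- sapply(rows, function(row) {
    if (is.null(row[[col_name]])) NA else row[[col_name]]
})
length(padded)
```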

This handles the "*inner loop*" over the rows.  The outer loop
iterates over the column names.

```
predf <- lapply(our_columns, function(col_name) {
    sapply(result_rows, function(row) { 
        if (is.null(row[[col_name]]))
            NA
        else
            row[[col_name]]
    })
})
```

`predf` is basically a transpose of `result_rows`; but with
only certain columns selected and `NA` in place of `NULL`.
`predf` is a list of equal length vectors.  We can now create
a `data.frame` object.

```
df <- as.data.frame(predf, stringsAsFactors=FALSE,
                           row.names=predf[[5]], 
                           col.names=our_columns)
dim(df)
```

This is a data frame with `5` rows and `15` columns.
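
For the scripting exercise, the steps above could be collected into a single function. The sketch below is one possible arrangement; `rows_to_df` is a hypothetical name, and it assumes every selected column holds a single value per row (nested lists are not handled).

```r
# Hypothetical helper: turn a list of rows into a data.frame,
# keeping only `columns` and using NA where a row lacks a value.
rows_to_df <- function(rows, columns) {
  cols <- lapply(columns, function(col_name) {
    sapply(rows, function(row) {
      if (is.null(row[[col_name]])) NA else row[[col_name]]
    })
  })
  names(cols) <- columns
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Usage on made-up rows; with the API data this would be
# rows_to_df(result_rows, our_columns).
toy_rows <- list(
  list(id = "a1", zip = "90001"),
  list(id = "b2"),
  list(id = "c3", zip = "90210")
)
toy_df <- rows_to_df(toy_rows, c("id", "zip"))
dim(toy_df)
```

Unlike the call above, this version does not set row names; that choice is left to the caller.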

**Exercise**: Create an R script in RStudio using the
commands from this session.