This repository has been archived by the owner on Dec 28, 2023. It is now read-only.

Removed commented RQ
ismayc committed Sep 14, 2016
1 parent 46654d8 commit 4163383
Showing 68 changed files with 359 additions and 348 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@
.RData
.Ruserdata
*placeholder.html
.httr-oauth
24 changes: 10 additions & 14 deletions 03-tidy_data.Rmd
@@ -53,7 +53,7 @@ knitr::include_graphics("images/tidy-1.png")

Reading over this definition, you can begin to think about datasets that won't follow this nice format.

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

```{block lc3-1, type='learncheck'}
**_Learning check_**
@@ -64,7 +64,7 @@ Reading over this definition, you can begin to think about datasets that won't f
+ What features of this dataset might make it difficult to visualize?
+ How could the dataset be tweaked to make it **tidy**?

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

## The `nycflights13` datasets

@@ -92,7 +92,7 @@ This dataset and most others presented in this book will be in the `data.frame`
View(flights)
```

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

```{block lc3-2, type='learncheck'}
**_Learning check_**
@@ -106,7 +106,7 @@ View(flights)
- C. Data on an airport
- D. Data on multiple flights

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

By running `View(flights)`, we see the different **variables** listed in the columns and we see that there are different types of variables. Some of the variables like `distance`, `day`, and `arr_delay` are what we will call **quantitative** variables. These variables vary in a numerical way. Other variables here are **categorical**.

@@ -122,7 +122,7 @@ Note that if you look in the leftmost column of the `View(flights)` output, you
str(flights)
```

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

```{block lc3-3, type='learncheck'}
**_Learning check_**
@@ -136,7 +136,7 @@ str(flights)

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** How many different rows are in this dataset?

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

Another way to view the properties of a dataset is to use the `str` function ("str" is short for "structure"). This will give you the first few entries of each variable in a row after the variable. In addition, the type of the variable is given immediately after the `:` following each variable's name. Here, `int` and `num` refer to quantitative variables. In contrast, `chr` refers to categorical variables. One more type of variable is given here with the `time_hour` variable: **POSIXct**. As you may suspect, this variable corresponds to a specific date and time of day.
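
To make the layout of this output concrete, here is a minimal sketch on a small made-up data frame. The object `toy_flights` and all of its values are hypothetical and are not part of `nycflights13`; they only mirror the variable types described above.

```r
# Hypothetical toy data frame (not from nycflights13) with one column
# for each variable type discussed above
toy_flights <- data.frame(
  day       = c(1L, 1L, 2L),         # integer   -> reported as int
  distance  = c(1400, 1416, 1089),   # double    -> reported as num
  carrier   = c("UA", "UA", "AA"),   # character -> reported as chr
  time_hour = as.POSIXct(
    c("2013-01-01 05:00:00", "2013-01-01 05:00:00", "2013-01-02 06:00:00"),
    tz = "UTC"
  ),                                 # date-time -> reported as POSIXct
  stringsAsFactors = FALSE
)

# One line per variable: its name, its type after the colon,
# then the first few values
str(toy_flights)
```

Running `str(flights)` should give the same kind of listing, only with many more variables and rows.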

@@ -200,8 +200,8 @@ If we `View` this dataset, we see a new variable has been created called (We wil

More discussion about joining data frames together will be given in Chapter \@ref(manip). We will see there that the names of the columns to be linked need not match as they did here with `"carrier"`.

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***
***

```{block tidy_review, type='review'}
**_Review questions_**
@@ -211,10 +211,6 @@ More discussion about joining data frames together will be given in Chapter \@re

**`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`** What makes "tidy" datasets useful for organizing data?

<!--
**`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`** What would the code `kable(head(flights))` produce?
-->

**`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`** How many variables are presented in the table below? What does each row correspond to? (**Hint:** You may not be able to answer both of these questions immediately but take your best guess.)


@@ -245,8 +241,8 @@ kable(data_frame("role" = role, `Sociology?` = sociology,

**`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`** What are some advantages of data in normal forms? What are some disadvantages?

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***
***

## What's to come?

62 changes: 31 additions & 31 deletions 04-visualizing_data.Rmd
@@ -183,7 +183,7 @@ ggplot(data = weather, mapping = aes(x = temp)) +

As we might expect, the temperature tends to increase as summer approaches and then decrease as winter approaches.

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

```{block lc4-2, type='learncheck'}
**_Learning check_**
@@ -202,7 +202,7 @@ Draw or give an example.

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Does the `temp` variable in the `weather` data set have a lot of variability? Why do you say that?

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

Histograms can provide a way to compare distributions across groups as we see above when we looked at temperature over months. Frequently,
a plot called a **boxplot** (also called a **side-by-side boxplot**) is done instead. The **boxplot** uses the information provided in the **five-number summary** referred to in the previous section when we used the `summary` function. It gives a way to compare this summary information across the different levels of a group. Let's create a boxplot to compare the monthly temperatures as we did above with the faceted histograms.
@@ -223,7 +223,7 @@ ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +

We have introduced a new function called `factor()` here. One of the things this function does is to convert a numeric value like `month` (1, 2, ..., 12) into a categorical variable. The "box" part of this plot represents the 25^th^ percentile, the median (50^th^ percentile), and the 75^th^ percentile. The dots correspond to **outliers**. (The specific formulation for these outliers is discussed in Appendix \@ref(appendix2).) The lines show how the data outside the middle 50% (the range between the first and third quartiles) varies. Longer lines correspond to more variability and shorter lines correspond to less variability.
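
As a rough illustration of these two ideas, the sketch below converts a numeric vector to a factor and computes the three percentiles that define the box. The vectors `month` and `temp` are made-up values for illustration only, not taken from the `weather` dataset.

```r
# Hypothetical values, made up for illustration only
month <- c(1, 2, 12, 1, 2)
temp  <- c(30, 35, 38, 41, 90)

# month starts out as a plain number ...
is.numeric(month)

# ... and factor() turns each distinct value into a category (level)
month_cat <- factor(month)
levels(month_cat)

# The three values that define the "box":
# the 25th, 50th (median) and 75th percentiles
quantile(temp, probs = c(0.25, 0.5, 0.75))
```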

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

```{block lc4-2b, type='learncheck'}
**_Learning check_**
@@ -237,7 +237,7 @@ We have introduced a new function called `factor()` here. One of the things thi

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram?

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

### Summary

@@ -269,7 +269,7 @@ flights_table <- count(x = flights, vars = carrier)
flights_table
```

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

```{block lc4-3, type='learncheck'}
**_Learning check_**
@@ -283,7 +283,7 @@ flights_table

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What was the seventh highest airline in terms of departed flights from NYC in 2013?

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

### Must avoid pie charts!

@@ -317,7 +317,7 @@ While it is quite easy to look back at the barplot to get the answer to these qu
knitr::include_graphics("images/Pie-I-have-Eaten.jpg")
```

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

```{block lc4-3b, type='learncheck'}
**_Learning check_**
@@ -327,7 +327,7 @@ knitr::include_graphics("images/Pie-I-have-Eaten.jpg")

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What is your opinion as to why pie charts continue to be used?

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

### Using barplots to compare two variables

@@ -349,7 +349,7 @@ ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) +

This plot is what is known as a **stacked barplot**. While simple to make, it often leads to many problems.

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

```{block lc4-3c, type='learncheck'}
**_Learning check_**
@@ -359,7 +359,7 @@ This plot is what is known as a **stacked barplot**. While simple to make, it o

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What can you say, if anything, about the relationship between airline and airport in NYC in 2013 in regards to the number of departing flights?

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

Another variation on the **stacked barplot** is the **side-by-side barplot**.

@@ -368,7 +368,7 @@ ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) +
geom_bar(position = "dodge")
```

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

```{block lc4-3d, type='learncheck'}
**_Learning check_**
@@ -378,19 +378,19 @@ ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) +

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are the disadvantages of using a side-by-side barplot, in general?

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

Lastly, an often preferred type of barplot is the **faceted barplot**. We already saw this concept of faceting and small multiples in Subsection \@ref(faceting). This gives us a nicer way to compare the distributions across both `carrier` and airport/`name`.

```{r, fig.cap="Faceted barplot comparing the number of flights by carrier and airport", fig.height=5.2}
```{r, fig.cap="Faceted barplot comparing the number of flights by carrier and airport", fig.height=7.5}
ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) +
geom_bar() +
facet_grid(name ~ .)
```

Note how the `facet_grid` function arguments are written here. We are wanting the names of the airports vertically and the `carrier` listed horizontally. As you may have guessed, this argument and other _formulas_ of this sort in R are in `y ~ x` order. We will see more examples of this in Chapter \@ref(regress).
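
The sketch below contrasts the two orientations of the `facet_grid` formula. It assumes `ggplot2` is installed; the data frame `toy_ports` is a hypothetical stand-in for `flights_namedports` with made-up carriers and airport names.

```r
library(ggplot2)

# Hypothetical toy data standing in for flights_namedports (values made up)
toy_ports <- data.frame(
  carrier = c("AA", "AA", "UA", "UA", "DL", "DL"),
  name    = c("Airport A", "Airport B", "Airport A",
              "Airport B", "Airport A", "Airport B")
)

base_plot <- ggplot(data = toy_ports, mapping = aes(x = carrier, fill = name)) +
  geom_bar()

# name on the left of ~ : one panel per airport, stacked vertically as rows
p_rows <- base_plot + facet_grid(name ~ .)

# name on the right of ~ : one panel per airport, laid out side by side as columns
p_cols <- base_plot + facet_grid(. ~ name)

p_rows
```

Printing `p_cols` instead shows the same bars with the panels arranged horizontally, which is one way to check which side of the `~` controls rows and which controls columns.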

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

```{block lc4-3e, type='learncheck'}
**_Learning check_**
@@ -400,7 +400,7 @@ Note how the `facet_grid` function arguments are written here. We are wanting t

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What information about the different carriers at different airports is more easily seen in the faceted barplot?

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

### Summary

@@ -417,15 +417,15 @@ alaska_cap <- "Arrival Delays vs Departure Delays for Alaska Airlines flights fr
```


```{r noalpha, warning=FALSE, fig.cap=alaska_cap}
```{r noalpha, warning=FALSE, fig.cap=alaska_cap, fig.height=4}
alaska_flights <- filter(flights, carrier == "AS")
ggplot(alaska_flights, aes(x = dep_delay, y = arr_delay)) +
geom_point()
```

We see that a positive relationship exists between `dep_delay` and `arr_delay`: as departure delays increase, arrival delays tend to also increase. We also note that the majority of points fall near the point (0, 0) here. There is a large mass of points clustered there.

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

```{block lc4-4, type='learncheck'}
**_Learning check_**
@@ -441,38 +441,38 @@ We see that a positive relationship exists between `dep_delay` and `arr_delay`:

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are some other features of the plot that stand out to you?

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

### Jittering

The large mass of points near (0, 0) can cause some confusion. This is the result of a phenomenon called **over-plotting**. As one may guess, this corresponds to values being plotted on top of each other _over_ and _over_ again. It is often difficult to know just how many values are plotted in this way when looking at a basic scatter-plot as we have here.

One way of relieving this issue of **over-plotting** is to **jitter** the points a bit. In other words, we are going to add just a bit of random noise to the points to better see them and remove some of the over-plotting. You can think of "jittering" as shaking the points a bit on the plot. Instead of using `geom_point`, we use `geom_jitter` to perform this shaking and specify around how much jitter to add with the `width` and `height` arguments. This corresponds to how hard you'd like to shake the plot in units corresponding to those for both the horizontal and vertical variables (minutes here).

```{r warning=FALSE, fig.cap="Jittered delay scatterplot"}
```{r warning=FALSE, fig.cap="Jittered delay scatterplot", fig.height=4}
ggplot(alaska_flights, aes(x = dep_delay, y = arr_delay)) +
geom_jitter(width = 30, height = 30)
```

This has helps us a little bit in getting a sense for the over-plotting, but with a relatively large dataset like this one (`r nrow(alaska_flights)` flights), it is often useful to change the transparency of the points as seen in the next section.
This helps us a little bit in getting a sense for the over-plotting, but with a relatively large dataset like this one (`r nrow(alaska_flights)` flights), it is often useful to change the transparency of the points as seen in the next section.

### Setting transparency

One of the arguments that can be changed with `geom_point` is `alpha`. By default, this value is set to `1`. We can change this value to a smaller fraction to change the transparency of the points in the plot:

```{r alpha, warning=FALSE, fig.cap=paste(alaska_cap, "- alpha=0.2")}
```{r alpha, warning=FALSE, fig.cap=paste(alaska_cap, "- alpha=0.2", fig.height=1)}
ggplot(alaska_flights, aes(x = dep_delay, y = arr_delay)) +
geom_point(alpha = 0.2)
```

We can also specify the `alpha` argument in `geom_jitter`:

```{r jitteralpha, warning=FALSE, fig.cap=paste(alaska_cap, "- jitter and alpha added")}
```{r jitteralpha, warning=FALSE, fig.cap=paste(alaska_cap, "- jitter and alpha added", fig.height=1)}
ggplot(alaska_flights, aes(x = dep_delay, y = arr_delay)) +
geom_jitter(width = 30, height = 30, alpha = 0.3)
```

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

```{block lc4-4b, type='learncheck'}
**_Learning check_**
@@ -486,7 +486,7 @@ ggplot(alaska_flights, aes(x = dep_delay, y = arr_delay)) +

+ How has that region changed compared to when you observed the same plot without the `alpha = 0.2` set in \@ref(fig:noalpha)?

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

<!--
**Maybe include a shading of the points by another variable example here for multivariate thinking?**
@@ -536,7 +536,7 @@ ggplot(data = flights_summarized, aes(x = date, y = median_arr_delay)) +
geom_line()
```

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

```{block lc4-5, type='learncheck'}
**_Learning check_**
@@ -550,7 +550,7 @@ ggplot(data = flights_summarized, aes(x = date, y = median_arr_delay)) +

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Are the largest median arrival delays where you expected them to occur on the line-graph above in Figure \@ref(fig:lineflights)? Why or why not?

`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***

### Summary

@@ -589,14 +589,14 @@ We'll see (and have seen) that you don't necessarily need to include all of thes
An excellent resource as you begin to create plots using the `ggplot2` package is a cheatsheet that RStudio has put together entitled "Data Visualization with ggplot2" available [here](https://www.rstudio.com/wp-content/uploads/2015/12/ggplot2-cheatsheet-2.0.pdf). This covers more than what we've discussed in this chapter but provides nice visual descriptions of what each function produces.

<!--
`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***
***
```{block viz_review, type='review'}
**_Review questions_**
```
**`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
**`paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
- Have a variety of bad plots with data for the readers and have readers create better plots with `ggplot2`
@@ -605,8 +605,8 @@ An excellent resource as you begin to create plots using the `ggplot2` package i
- Why is it important for barplots to start at zero?
`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
***
***
-->

## What's to come?