Removed commented RQ

ismayc · Sep 14, 2016 · 4163383
1 parent 46654d8
commit 4163383

Showing 68 changed files with 359 additions and 348 deletions.
diff --git a/.gitignore b/.gitignore
@@ -3,3 +3,4 @@
 .RData
 .Ruserdata
 *placeholder.html
+.httr-oauth
diff --git a/03-tidy_data.Rmd b/03-tidy_data.Rmd
@@ -53,7 +53,7 @@ knitr::include_graphics("images/tidy-1.png")
 
 Reading over this definition, you can begin to think about datasets that won't follow this nice format.
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ```{block lc3-1, type='learncheck'}
 **_Learning check_**
@@ -64,7 +64,7 @@ Reading over this definition, you can begin to think about datasets that won't f
 + What features of this dataset might make it difficult to visualize?  
 + How could the dataset be tweaked to make it **tidy**?
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ## The `nycflights13` datasets
 
@@ -92,7 +92,7 @@ This dataset and most others presented in this book will be in the `data.frame`
 View(flights)
 ```
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ```{block lc3-2, type='learncheck'}
 **_Learning check_**
@@ -106,7 +106,7 @@ View(flights)
 - C. Data on an airport
 - D. Data on multiple flights
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 By running `View(flights)`, we see the different **variables** listed in the columns and we see that there are different types of variables.  Some of the variables like `distance`, `day`, and `arr_delay` are what we will call **quantitative** variables.  These variables vary in a numerical way.  Other variables here are **categorical**.
 
@@ -122,7 +122,7 @@ Note that if you look in the leftmost column of the `View(flights)` output, you
 str(flights)
 ```
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ```{block lc3-3, type='learncheck'}
 **_Learning check_**
@@ -136,7 +136,7 @@ str(flights)
 
 **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** How many different rows are in this dataset?
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 Another way to view the properties of a dataset is to use the `str` function ("str" is short for "structure").  This will give you the first few entries of each variable in a row after the variable.  In addition, the type of the variable is given immediately after the `:` following each variable's name.  Here, `int` and `num` refer to quantitative variables.  In contrast, `chr` refers to categorical variables.  One more type of variable is given here with the `time_hour` variable: **POSIXct**.  As you may suspect, this variable corresponds to a specific date and time of day.
 
@@ -200,8 +200,8 @@ If we `View` this dataset, we see a new variable has been created called (We wil
 
 More discussion about joining data frames together will be given in Chapter \@ref(manip).  We will see there that the names of the columns to be linked need not match as they did here with `"carrier"`.
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
+***
 
 ```{block tidy_review, type='review'}
 **_Review questions_**
@@ -211,10 +211,6 @@ More discussion about joining data frames together will be given in Chapter \@re
 
 **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`** What makes "tidy" datasets useful for organizing data?
 
-<!--
-**`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`** What would the code `kable(head(flights))` produce?
--->
-
 **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`** How many variables are presented in the table below?  What does each row correspond to? (**Hint:** You may not be able to answer both of these questions immediately but take your best guess.)
 
 
@@ -245,8 +241,8 @@ kable(data_frame("role" = role, `Sociology?` = sociology,
 
 **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`** What are some advantages of data in normal forms?  What are some disadvantages?
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
+***
 
 ## What's to come?
 

diff --git a/04-visualizing_data.Rmd b/04-visualizing_data.Rmd
@@ -183,7 +183,7 @@ ggplot(data = weather, mapping = aes(x = temp)) +
 
 As we might expect, the temperature tends to increase as summer approaches and then decrease as winter approaches.
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ```{block lc4-2, type='learncheck'}
 **_Learning check_**
@@ -202,7 +202,7 @@ Draw or give an example.
 
 **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Does the `temp` variable in the `weather` data set have a lot of variability?  Why do you say that?
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 Histograms can provide a way to compare distributions across groups as we see above when we looked at temperature over months.  Frequently,
 a plot called a **boxplot** (also called a **side-by-side boxplot**) is done instead.  The **boxplot** uses the information provided in the **five-number summary** referred to in the previous section when we used the `summary` function.  It gives a way to compare this summary information across the different levels of a group.  Let's create a boxplot to compare the monthly temperatures as we did above with the faceted histograms.
@@ -223,7 +223,7 @@ ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
 
 We have introduced a new function called `factor()` here.  One of the things this function does is to convert a numeric value like `month` (1, 2, ..., 12) into a categorical variable.  The "box" part of this plot represents the 25^th^ percentile, the median (50^th^ percentile), and the 75^th^ percentile.  The dots correspond to **outliers**.  (The specific formulation for these outliers is discussed in Appendix \@ref(appendix2).)  The lines show how the data varies that is not in the center 50% defined by the first and third quantiles.  Longer lines correspond to more variability and shorter lines correspond to less variability.
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ```{block lc4-2b, type='learncheck'}
 **_Learning check_**
@@ -237,7 +237,7 @@ We have introduced a new function called `factor()` here.  One of the things thi
 
 **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Boxplots provide a simple way to identify outliers.  Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram?
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ### Summary
 
@@ -269,7 +269,7 @@ flights_table <- count(x = flights, vars = carrier)
 flights_table
 ```
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ```{block lc4-3, type='learncheck'}
 **_Learning check_**
@@ -283,7 +283,7 @@ flights_table
 
 **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What was the seventh highest airline in terms of departed flights from NYC in 2013?
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ### Must avoid pie charts!
 
@@ -317,7 +317,7 @@ While it is quite easy to look back at the barplot to get the answer to these qu
 knitr::include_graphics("images/Pie-I-have-Eaten.jpg")
 ```
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ```{block lc4-3b, type='learncheck'}
 **_Learning check_**
@@ -327,7 +327,7 @@ knitr::include_graphics("images/Pie-I-have-Eaten.jpg")
 
 **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What is your opinion as to why pie charts continue to be used?
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ### Using barplots to compare two variables
 
@@ -349,7 +349,7 @@ ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) +
 
 This plot is what is known as a **stacked barplot**.  While simple to make, it often leads to many problems.
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ```{block lc4-3c, type='learncheck'}
 **_Learning check_**
@@ -359,7 +359,7 @@ This plot is what is known as a **stacked barplot**.  While simple to make, it o
 
 **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What can you say, if anything, about the relationship between airline and airport in NYC in 2013 in regards to the number of departing flights?
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 Another variation on the **stacked barplot** is the **side-by-side barplot**.
 
@@ -368,7 +368,7 @@ ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) +
   geom_bar(position = "dodge")
 ```
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ```{block lc4-3d, type='learncheck'}
 **_Learning check_**
@@ -378,19 +378,19 @@ ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) +
 
 **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are the disadvantages of using a side-by-side barplot, in general?
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 Lastly, an often preferred type of barplot is the **faceted barplot**.  We already saw this concept of faceting and small multiples in Subsection \@ref(faceting).  This gives us a nicer way to compare the distributions across both `carrier` and airport/`name`.
 
-```{r, fig.cap="Faceted barplot comparing the number of flights by carrier and airport", fig.height=5.2}
+```{r, fig.cap="Faceted barplot comparing the number of flights by carrier and airport", fig.height=7.5}
 ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) +
   geom_bar() +
   facet_grid(name ~ .)
 ```
 
 Note how the `facet_grid` function arguments are written here.  We are wanting the names of the airports vertically and the `carrier` listed horizontally.  As you may have guessed, this argument and other _formulas_ of this sort in R are in `y ~ x` order.  We will see more examples of this in Chapter \@ref(regress).
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ```{block lc4-3e, type='learncheck'}
 **_Learning check_**
@@ -400,7 +400,7 @@ Note how the `facet_grid` function arguments are written here.  We are wanting t
 
 **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What information about the different carriers at different airports is more easily seen in the faceted barplot?
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ### Summary
 
@@ -417,15 +417,15 @@ alaska_cap <- "Arrival Delays vs Departure Delays for Alaska Airlines flights fr
 ```
 
 
-```{r noalpha, warning=FALSE, fig.cap=alaska_cap}
+```{r noalpha, warning=FALSE, fig.cap=alaska_cap, fig.height=4}
 alaska_flights <- filter(flights, carrier == "AS")
 ggplot(alaska_flights, aes(x = dep_delay, y = arr_delay)) + 
   geom_point()
 ```
 
 We see that a positive relationship exists between `dep_delay` and `arr_delay`:  as departure delays increase, arrival delays tend to also increase.  We also note that the majority of points fall near the point (0, 0) here.  There is a large mass of points clustered there.
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ```{block lc4-4, type='learncheck'}
 **_Learning check_**
@@ -441,38 +441,38 @@ We see that a positive relationship exists between `dep_delay` and `arr_delay`:
 
 **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What are some other features of the plot that stand out to you?
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ### Jittering
 
 The large mass of points near (0, 0) can cause some confusion.  This is the result of a phenomenon called **over-plotting**.  As one may guess, this corresponds to values being plotted on top of each other _over_ and _over_ again.  It is often difficult to know just how many values are plotted in this way when looking at a basic scatter-plot as we have here.
 
 One way of relieving this issue of **over-plotting** is to **jitter** the points a bit.  In other words, we are going to add just a bit of random noise to the points to better see them and remove some of the over-plotting.  You can think of "jittering" as shaking the points a bit on the plot. Instead of using `geom_point`, we use `geom_jitter` to perform this shaking and specify around how much jitter to add with the `width` and `height` arguments.  This corresponds to how hard you'd like to shake the plot in units corresponding to those for both the horizontal and vertical variables (minutes here).
 
-```{r warning=FALSE, fig.cap="Jittered delay scatterplot"}
+```{r warning=FALSE, fig.cap="Jittered delay scatterplot", fig.height=4}
 ggplot(alaska_flights, aes(x = dep_delay, y = arr_delay)) + 
   geom_jitter(width = 30, height = 30)
 ```
 
-This has helps us a little bit in getting a sense for the over-plotting, but with a relatively large dataset like this one (`r nrow(alaska_flights)` flights), it is often useful to change the transparency of the points as seen in the next section.
+This helps us a little bit in getting a sense for the over-plotting, but with a relatively large dataset like this one (`r nrow(alaska_flights)` flights), it is often useful to change the transparency of the points as seen in the next section.
 
 ### Setting transparency
 
 One of the arguments that can be changed with `geom_point` is `alpha`.  By default, this value is set to `1`.  We can change this value to a smaller fraction to change the transparency of the points in the plot:
 
-```{r alpha, warning=FALSE, fig.cap=paste(alaska_cap, "- alpha=0.2")}
+```{r alpha, warning=FALSE, fig.cap=paste(alaska_cap, "- alpha=0.2", fig.height=1)}
 ggplot(alaska_flights, aes(x = dep_delay, y = arr_delay)) + 
   geom_point(alpha = 0.2)
 ```
 
 We can also specify the `alpha` argument in `geom_jitter`:
 
-```{r jitteralpha, warning=FALSE, fig.cap=paste(alaska_cap, "- jitter and alpha added")}
+```{r jitteralpha, warning=FALSE, fig.cap=paste(alaska_cap, "- jitter and alpha added", fig.height=1)}
 ggplot(alaska_flights, aes(x = dep_delay, y = arr_delay)) + 
   geom_jitter(width = 30, height = 30, alpha = 0.3)
 ```
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ```{block lc4-4b, type='learncheck'}
 **_Learning check_**
@@ -486,7 +486,7 @@ ggplot(alaska_flights, aes(x = dep_delay, y = arr_delay)) +
 
 + How has that region changed compared to when you observed the same plot without the `alpha = 0.2` set in \@ref(fig:noalpha)?
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 <!--
 **Maybe include a shading of the points by another variable example here for multivariate thinking?**
@@ -536,7 +536,7 @@ ggplot(data = flights_summarized, aes(x = date, y = median_arr_delay)) +
   geom_line()
 ```
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ```{block lc4-5, type='learncheck'}
 **_Learning check_**
@@ -550,7 +550,7 @@ ggplot(data = flights_summarized, aes(x = date, y = median_arr_delay)) +
 
 **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Are the largest median arrival delays where you expected them to occur on the line-graph above in Figure \@ref(fig:lineflights)? Why or why not?
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
 
 ### Summary
 
@@ -589,14 +589,14 @@ We'll see (and have seen) that you don't necessarily need to include all of thes
 An excellent resource as you begin to create plots using the `ggplot2` package is a cheatsheet that RStudio has put together entitled "Data Visualization with ggplot2" available [here](https://www.rstudio.com/wp-content/uploads/2015/12/ggplot2-cheatsheet-2.0.pdf).  This covers more than what we've discussed in this chapter but provides nice visual descriptions of what each function produces.
 
 <!--
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
+***
 
 ```{block viz_review, type='review'}
 **_Review questions_**
 ```
 
-**`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
+**`paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
 
 - Have a variety of bad plots with data for the readers and have readers create better plots with `ggplot2`
 
@@ -605,8 +605,8 @@ An excellent resource as you begin to create plots using the `ggplot2` package i
 
 - Why is it important for barplots to start at zero?
 
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
-`r if(knitr:::is_html_output()) '<hr>'` `r if(knitr:::is_latex_output()) '\\begin{center}\\rule{\\linewidth}{\\linethickness}\\end{center}'`
+***
+***
 -->
 
 ## What's to come?