# Introduction to the Tidyverse in R

Note that this notebook relies heavily on material from the textbook [R for Data Science](https://r4ds.had.co.nz/index.html) by Hadley Wickham and Garrett Grolemund, and to a lesser extent, on the [DataCamp course on the Tidyverse](https://www.datacamp.com/courses/introduction-to-the-tidyverse). Both are great resources to explore!

## A. What is the Tidyverse?

The Tidyverse is a collection of R packages meant to streamline data science tasks. All Tidyverse packages share an underlying design philosophy, grammar, and data structures. In this notebook, we'll learn some basics of the Tidyverse. To install the tidyverse, follow these [instructions for Installing and Loading R Packages](https://docs.google.com/document/d/1iEZ9rbyjj7-Qajd4-s4t7MSu-yJwcrTwCcHHtob3oeo/edit?usp=sharing).


In [None]:
#install.packages("tidyverse", dependencies = TRUE)
library(tidyverse)
library(testthat) #used for autograder tests
library(proto) #used for autograder tests

## B. Some (Very) Basic Plotting with ggplot

Let's do some plotting with the mpg dataset. mpg contains observations collected by the US Environmental Protection Agency on 38 models of car.

First, load and learn about the variables contained in this dataset. The dataset is in the ggplot2 package, which is included in the tidyverse. So, you can load the data using data(mpg).

In [None]:
data(mpg)
#help(mpg)
head(mpg)
?mpg

Let's look at a plot that might tell us about the relationship between drv (whether the car is front, rear, or 4-wheel drive) and hwy (highway miles per gallon).

We begin a plot with the function `ggplot()`, which creates a coordinate system that you can add layers to. Layers are added with `+`; for example, `geom_boxplot()` adds a boxplot layer. In general, a template for creating plots is

`ggplot(data = DATA) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))`
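
As a rough illustration of the template, the sketch below fills it in with two other `mpg` variables, `class` and `cty`. These variables and the name `p_example` are chosen only for illustration and are not the ones the exercise asks for.

```r
library(ggplot2)

# Fill in the template: DATA = mpg, GEOM_FUNCTION = geom_boxplot,
# MAPPINGS = x = class (vehicle type), y = cty (city miles per gallon)
p_example = ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = cty))
p_example
```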

**Here's the basic code for the boxplot (fill in the correct variables). For full credit, be sure to name your plot `p`, as I do in the commented example.**

In [None]:
options(repr.plot.width=8, repr.plot.height=6) #this line just makes the plots bigger
# p = ggplot(data = mpg) + 
#     geom_boxplot(mapping = aes(x = , y = )) 
# p

# your code here


In [None]:
# Test Cell

**What do we notice about the relationship?**

YOUR ANSWER HERE

You can mess with *all* sorts of things. For example, you could change colors:

In [None]:
options(repr.plot.width=8, repr.plot.height=6)
ggplot(data = mpg) + 
  geom_boxplot(mapping = aes(x = drv, y = hwy, color = drv)) +
  scale_color_manual(values = c("#999999", "#CFB87C", "black")) +
  theme_bw()

In some instances, it is helpful to swap the axes. For example, the levels of the factor along the horizontal axis might have long names. **Try flipping the axes using `coord_flip()`**. Name your plot `p_flip` for credit.

In [None]:
options(repr.plot.width=8, repr.plot.height=6)
# your code here


In [None]:
# Test Cell

Now let's try a scatterplot. **Use the template above to plot hwy (y) against displ (x).** `geom_point()` will give a scatterplot. Name your plot `p_scatter` for credit.

In [None]:
options(repr.plot.width=8, repr.plot.height=6)
# your code here


In [None]:
# Test Cell

**What do we notice about the relationship between engine displacement and highway miles per gallon?**

YOUR ANSWER HERE

**Now, let's color points based on whether they represent a vehicle that is front, rear, or 4-wheel drive (`drv`). We will map the `drv` variable to the *aesthetic* `color`. In general, an aesthetic is a "visual property of the objects in the plot" ([Wickham, Section 3.3](https://r4ds.had.co.nz/index.html)). Other aesthetics include size and shape.**

**Further, use the `scale_color_manual()` function to specify the [CU Boulder Colors](https://www.colorado.edu/brand/how-use/color).**
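
To see how an aesthetic other than `color` behaves, here is a minimal sketch that maps `drv` to `shape` and sets a fixed point size. It plots `cty` rather than `hwy`, and the name `p_shape` is hypothetical; it is only meant to show where a mapping goes (inside `aes()`) versus a fixed setting (outside `aes()`).

```r
library(ggplot2)

# drv is mapped to the shape aesthetic (inside aes), so each drive type gets its own symbol;
# size = 2 is a fixed setting (outside aes), so it applies to every point
p_shape = ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = cty, shape = drv), size = 2)
p_shape
```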

In [None]:
# your code here


**What do we notice about the relationships between these variables?**

YOUR ANSWER HERE

Another way to add information from categorical variables to plots is by using *facets*. Facets split a plot into subplots, where each subplot contains data for a particular level of the categorical variable. We can facet by adding `facet_wrap(~ CatVar, nrow = x)` to our ggplot, where `CatVar` is the categorical variable that we want to facet on, and `x` is the number of rows that we'd like (we could use `ncol` instead to set the number of columns).
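
A minimal sketch of faceting, assuming we facet on `drv` (not the variable the exercise asks for) and use `ncol` instead of `nrow`. The name `p_facet_example` is hypothetical.

```r
library(ggplot2)

# One panel per level of drv, stacked in a single column
p_facet_example = ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = cty)) +
  facet_wrap(~ drv, ncol = 1)
p_facet_example
```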

**Create a facet plot where you split the data based on the class variable**.

In [None]:
options(repr.plot.width=8, repr.plot.height=6)
# your code here


In [None]:
# Test Cell

Instead of seeing the individual data points, we might be interested in visualizing some overall trend between displ and hwy. We could do this by replacing `geom_point()` with `geom_smooth()`. **Try it!**

In [None]:
# your code here


And, we can layer the smooth over the scatterplot pretty easily by adding `+ geom_point()`. **Try it!**
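
One possible way to layer geoms, sketched here with `cty` instead of `hwy`: declaring the mapping once inside `ggplot()` lets every layer inherit it, so it does not need to be repeated in each geom. The name `p_layers` is hypothetical.

```r
library(ggplot2)

# The mapping is declared once in ggplot(), so both layers inherit it
p_layers = ggplot(data = mpg, mapping = aes(x = displ, y = cty)) +
  geom_smooth() +  # smoothed trend with a confidence band
  geom_point()     # raw observations drawn on top of the smooth
p_layers
```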

In [None]:
# your code here


## C. Data Manipulation and Transformation

`dplyr` is a package in the Tidyverse that provides simple “verbs”, or functions, that correspond to the most common data manipulation tasks; these verbs help you translate your thoughts into code. Let's see how some of these verbs work on the gapminder dataset. **First, if you haven't already, let's install and load the gapminder package.**

In [None]:
#install.packages("gapminder")
library(gapminder)
library(dplyr)

**Write a summary of the variables in this dataset.**

In [None]:
data(gapminder)
head(gapminder)

### Filter rows with filter()

It is often useful to study a subset of your data. The verb `filter()` will easily allow you to filter rows (observations) in a data frame. Here's one possibility:

In [None]:
filter(gapminder, country == "United States")
#or
#gapminder %>%
#    filter(country == "United States")

**Has the code above modified the existing data frame or created a new one?**

YOUR ANSWER HERE

**Filter the original dataset to show only observations where the year is later than 1987 and the life expectancy is greater than or equal to 70. Save your answer in `gapminder_filter`.**

In [None]:
# your code here


In [None]:
# Test Cell

**Important notes:**

1. Multiple conditions passed to `filter()` are combined with "and". To combine them in other ways (e.g., "or"), use the Boolean operators (e.g., `|` is for "or").
2. Missing values: `filter()` keeps only rows for which the condition evaluates to `TRUE`, so rows where it is `FALSE` or `NA` are dropped. If you would like to preserve missing values, ask for them explicitly:

In [None]:
filter(gapminder, is.na(country) | country == "United States")
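
Since `gapminder` has no missing countries, the effect of these two notes is easier to see on a tiny example. The sketch below uses a hypothetical data frame `toy` with one `NA`; all object names are made up for illustration.

```r
library(dplyr)

# A small hypothetical data frame with one missing value in x
toy = data.frame(name = c("a", "b", "c", "d"), x = c(1, NA, 3, 5))

# Comma-separated conditions are combined with "and"
both = filter(toy, x > 1, x < 5)

# Use | for "or"
either = filter(toy, x < 2 | x > 4)

# The row with x = NA is dropped, because NA > 0 evaluates to NA rather than TRUE
dropped = filter(toy, x > 0)

# Ask for missing values explicitly to keep them
kept = filter(toy, is.na(x) | x > 0)
```

Comparing `nrow(dropped)` with `nrow(kept)` shows whether the row with the missing value survived each filter.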

### Arranging with arrange()


**Use the `arrange()` verb, in conjunction with the code above, to put the United States data (and only that data) in descending order with respect to year.**
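
For reference, `arrange()` sorts rows in ascending order by default, and wrapping a column in `desc()` reverses that. A minimal sketch on a different subset (all countries in 2007, sorted by life expectancy); the name `lifeExp_2007` is hypothetical.

```r
library(dplyr)
library(gapminder)

# Keep only the 2007 observations, then sort from highest to lowest life expectancy
lifeExp_2007 = gapminder %>%
  filter(year == 2007) %>%
  arrange(desc(lifeExp))
head(lifeExp_2007)
```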

In [None]:
# your code here


### Selecting columns with select()

In addition to filtering rows, you can keep a subset of columns with the `select()` verb. **Try to select just the country and year variables.**
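
A short sketch of two common `select()` patterns, using columns other than the ones requested above: naming the columns to keep, and dropping a column with a minus sign. The object names are hypothetical.

```r
library(dplyr)
library(gapminder)

# Keep two columns by name
pop_by_continent = select(gapminder, continent, pop)

# Drop one column with a minus sign and keep everything else
no_pop = gapminder %>% select(-pop)

names(pop_by_continent)
names(no_pop)
```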

In [None]:
# your code here


### Changing columns with mutate()

We can also mutate certain columns. For example, suppose that we wanted life expectancy to be measured in months. We might write:

In [None]:
head(gapminder %>%
    mutate(lifeExp = lifeExp*12))
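
The code above overwrites `lifeExp` because the new column reuses an existing name. Giving `mutate()` a new name adds a column instead, as in this minimal sketch (the column `pop_millions` and the object `gapminder_millions` are hypothetical names):

```r
library(dplyr)
library(gapminder)

# Add a new column (population in millions); the existing columns are left unchanged
gapminder_millions = gapminder %>%
  mutate(pop_millions = pop / 1e6)
head(gapminder_millions)
```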

**Create a new column in the data frame that is just GDP (not GDP per capita). Store your new data frame in `gapminder_GDP`.**

In [None]:
# your code here


In [None]:
# Test Cell

**Research another verb in dplyr and use it on this dataset.**

In [None]:
# your code here


## D. Exploratory Data Analysis

Let's explore a [dataset](https://dasl.datadescription.com/datafile/amazon-books/?_sfm_methods=Multiple+Regression&_sfm_cases=4+59943) about book prices from Amazon. The dataset contains $n = 325$ books and includes measurements of:

- `aprice`: The price listed on Amazon (dollars)


- `lprice`: The book's list price (dollars)


- `weight`: The book's weight (ounces)


- `pages`: The number of pages in the book


- `height`: The book's height (inches)


- `width`: The book's width (inches)


- `thick`: The thickness of the book (inches)


- `cover`: Whether the book is a hardcover or paperback.


- And other variables...

First, we'll read this data in from GitHub...

In [None]:
library(RCurl) #a package that includes the function getURL(), which allows for reading data from github.
library(ggplot2) #a package for nice plots!

#getURL is a nice way of reading in data from the web
url = getURL(paste0("https://raw.githubusercontent.com/bzaharatos/",
                    "-Statistical-Modeling-for-Data-Science-Applications/",
                    "master/Modern%20Regression%20Analysis%20/Datasets/amazon.txt"))
#stores the data in the dataframe amazon
amazon = read.csv(text = url, sep = "\t")

#prints the names in the dataframe
names(amazon)

Next, let's create a new data frame, called `df`, and store a subset of the variables. In addition, we'll change the names of the variables in the dataframe to something cleaner and easier to work with. Take note of how to do this :)

In [None]:
df = data.frame(aprice = amazon$Amazon.Price, lprice = as.numeric(amazon$List.Price),  
                pages = amazon$NumPages, width = amazon$Width, weight = amazon$Weight..oz,  
                height = amazon$Height, thick = amazon$Thick, cover = amazon$Hard..Paper)
df2 = df #used for autograding
summary(df)


From the summary, we can see that there are missing values in the dataset, coded as `NA`. There are many ways to deal with missing data. Suppose that sample unit $i$ has a missing measurement for variable $z_j$. We could:

1. Delete sample unit $i$ from the dataset, i.e., delete the entire row. That might be reasonable if there are very few missing values and if we think the values are missing at random.

2. Delete the variable $z_j$ from the dataset, i.e., delete the entire column. This might be reasonable if there are many other missing values for $z_j$ and if we think $z_j$ might not be necessary for our overall prediction/explanation goals.

3. Impute missing values by substituting each missing value with an estimate.

For more information on missing values, see this [resource](https://www.bu.edu/sph/files/2014/05/Marina-tech-report.pdf).
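
The three options above can be sketched on a tiny hypothetical data frame (`toy`, with one missing value in `z`). This is only an illustration in base R; mean imputation is just one of many possible estimates for option 3.

```r
# A small hypothetical data frame with one missing value in z
toy = data.frame(y = c(10, 12, 9, 15), z = c(1.5, NA, 2.5, 3.5))

# Option 1: delete every row that contains a missing value
opt1 = na.omit(toy)

# Option 2: delete the column that has missing values
opt2 = toy[, colnames(toy) != "z", drop = FALSE]

# Option 3: impute each missing value with the mean of the observed values
opt3 = toy
opt3$z[is.na(opt3$z)] = mean(toy$z, na.rm = TRUE)
```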

Since most of our columns/variables have few missing values, and since these variables will be useful to us in our analysis, option 2 seems unreasonable. Let's first try option 3: impute the missing values of `lprice`, `pages`, `width`, `weight`, `height`, and `thick` with the mean of each. The following code might help you get started!

In [None]:
which(is.na(df$lprice))
df = df %>%
  mutate(lprice = replace(lprice, is.na(lprice),mean(lprice, na.rm = TRUE)))

In [None]:
# your code here


**Use the `summary()` function to print numerical summaries of this dataset.**

In [None]:
# your code here


**Use the `arrange()` verb to rearrange the `df` dataframe in descending order with respect to `lprice` (that is, with the row corresponding to the highest `lprice` at the top, the row corresponding to the next highest `lprice` second, etc.). Do *not* overwrite `df`.**

In [None]:
# your code here


**Note that you could provide more descriptive labels for the levels of the `cover` factor (`H` = "Hardcover" and `P` = "Paperback"). The easiest way to do this is with the `levels()` function: `levels(x) = value`.**

In [None]:
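df$cover = as.factor(df$cover) #make sure cover is a factor (read.csv returns character columns in R >= 4.0)
levels(df$cover) = c("Hardcover","Paperback") #new labels follow the existing level order: H, then P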
summary(df)
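
One thing to keep in mind with `levels(x) = value` is that the new labels are matched to the existing levels by position, so it helps to check the current order first. A minimal sketch on a hypothetical factor `x` that uses the same `H`/`P` coding:

```r
# A small hypothetical factor with the same coding as cover
x = factor(c("H", "P", "P", "H"))
levels(x)   # existing levels, sorted alphabetically by default

# Replacement labels must be given in the same order as the existing levels
levels(x) = c("Hardcover", "Paperback")
table(x)
```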

**Use `ggplot` to create a histogram of the `pages` variable. Change the number of bins to 15. For credit, store the histogram in the variable `p_hist`. Comment on its shape.**

In [None]:
# your code here


In [None]:
# Test Cell

YOUR ANSWER HERE

**Use `ggplot` to create a scatterplot of `aprice` ($y$) against `lprice` ($x$). What do you notice about this plot?**

In [None]:
# your code here


YOUR ANSWER HERE

**Use `ggplot` to produce a boxplot of `pages` conditioned on `cover`. Interpret this plot.**

In [None]:
# your code here


YOUR ANSWER HERE

Note another way to read data from the web...

In [None]:
### Reading data from the web...
read.table("http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat")