# Lab 1
This lab will focus on the functionality of ggplot and review the basics of R covered in last week's lecture.

## Review

### 1) Environment setup
Let's first make sure that your programming environment is set up correctly. If you installed Jupyter notebook through Anaconda, you should be able to run this on the command line and see a version number:
```
conda --version
```
If you installed Jupyter notebook directly, then typing this on the command line should open up Jupyter notebook:
```
jupyter notebook
```
If you installed Jupyter notebook and the R kernel correctly, you should be able to create a new R notebook by going to File > New Notebook > R. Then, in the first cell, you should be able to run:
```
library(ggplot2)
```
I'll stop here to make sure everyone has their environment set up and working.

### 2) Installing new packages with Anaconda
If you want to install a new package with Anaconda, the easiest way is to search for the package on the Anaconda Cloud site (or google 'anaconda packagename'), then run the installation command given there on your command line. For example, to install tidyverse for R (https://anaconda.org/r/r-tidyverse), you would run:
```
conda install -c r r-tidyverse
```

Now, if I open Jupyter notebook, I can load the package:
```
library(tidyverse)
```

### 3) Typical Data Science Project
![Data Science Lifecycle](DS_Lifecycle.png)
1. **Import** - Bring the data into R
2. **Tidy** - Clean the data up generally to have this format:
    + Each column is a variable
    + Each row is an observation
3. **Transform** - Filter the data set down, create new variables, calculate summary statistics
4. **Visualize** - Create plots to find interesting things about the data
5. **Model** - The visualizations you have created will hint at which models to use (e.g. based on the scatter plot, regression looks appropriate; based on the summary statistics, classification could be relevant; etc.)
6. **Communicate** - Using the findings from the previous steps, you will need to communicate what you have found (e.g. looking at the sales data from our mobile app, it looks like millennials are most likely to buy product A in the next quarter)
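
As a rough illustration of the steps above, here is a minimal sketch that walks the built-in `mpg` data set through them. The variable names (`cars`, `four_cyl`, `avg_mpg`) and the choice of a linear model are assumptions made for this example, not part of the lab.

```r
library(tidyverse)

# Import: mpg ships with ggplot2, so there is no file to read in this sketch
cars <- as_tibble(mpg)

# Tidy: mpg is already tidy (each column is a variable, each row is one car)

# Transform: keep the 4-cylinder cars and add a combined fuel-economy variable
four_cyl <- cars %>%
  filter(cyl == 4) %>%
  mutate(avg_mpg = (cty + hwy) / 2)

# Visualize: scatterplot of engine size against the new variable
p <- ggplot(four_cyl, aes(x = displ, y = avg_mpg)) +
  geom_point()

# Model: a simple linear regression suggested by the scatterplot
fit <- lm(avg_mpg ~ displ, data = four_cyl)
coef(fit)
```

The Communicate step is whatever you then write up from `p` and `fit`.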
    

## Explore

This week we will focus on the exercises, as there was only one introductory lab last week.

## Exercises

Start by loading the tidyverse package (really a collection of packages). All of these exercises are taken from Chapter 3 of the textbook (R for Data Science).

In [None]:
library(tidyverse)

### Section 3.2

In [None]:
# What happens when you run ggplot(data=mpg)? Why?

#Solution:
#An empty gray canvas is drawn. No geom or aesthetic mapping was defined, so ggplot doesn't know which variables to plot or how to plot them
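
One way to see this for yourself is to inspect the plot object. This is a small sketch that assumes the `layers` element of a ggplot object is accessible with `$`, which holds for the ggplot2 versions I am aware of.

```r
library(ggplot2)

# A plot with data but no geoms has no layers to draw
p <- ggplot(data = mpg)
n_before <- length(p$layers)

# Adding a geom adds one layer
p2 <- p + geom_point(mapping = aes(x = displ, y = hwy))
n_after <- length(p2$layers)

c(n_before, n_after)
```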

In [None]:
# How many rows and columns are in the mpg data set?

#Solution:
nrow(mpg)
ncol(mpg)
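
As a small aside, `dim()` returns both counts in one call. The variable name `size` is just for illustration.

```r
library(ggplot2)

# dim() returns c(rows, columns); mpg has 234 rows and 11 columns
size <- dim(mpg)
size
```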

In [None]:
# What does the drv variable describe?

#Solution:
?mpg #Read and learn about drv

In [None]:
# Make a scatterplot of hwy vs. cyl

#Solution:
#Note that you can store a plot in a variable and keep adding layers to it
#"hwy vs. cyl" puts hwy on the y-axis and cyl on the x-axis
plt <- ggplot(data = mpg)
plt <- plt + geom_point(mapping = aes(x = cyl, y = hwy))
#Display the variable (really a plot)
plt

### Section 3.3

In [None]:
# Why aren't the points here blue?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

#Solution:
# Because color = "blue" is inside aes(), ggplot treats "blue" as a data value to map through the color scale rather than as a literal color. To make every point blue, set color outside aes():
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

In [None]:
# How does ggplot handle continuous vs discrete variables as aesthetics for size, shape, and color?

#Solution:
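# Continuous variables map to a gradient (color) or a range of point sizes, and cannot be mapped to shape at all
# Discrete variables get one distinct color or shape per level; mapping one to size works but raises a warning, as here
ggplot(data=mpg) + geom_point(mapping = aes(x=hwy, y=displ, size=model))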
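
To make the contrast concrete, here is a minimal sketch comparing a continuous variable (`cty`) with a discrete one (`drv`). The object names are made up for this example, and `tryCatch()` is used only to show that building the shape plot fails rather than to suggest a workaround.

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Continuous variable: color is drawn as a gradient with a colorbar legend
p_cont <- base + geom_point(mapping = aes(color = cty))

# Discrete variable: each level of drv gets its own distinct color
p_disc <- base + geom_point(mapping = aes(color = drv))

# Shapes have no natural order, so mapping a continuous variable to shape
# fails when the plot is built
shape_result <- tryCatch(
  {
    ggplot_build(base + geom_point(mapping = aes(shape = cty)))
    "built"
  },
  error = function(e) "error"
)
shape_result
```

Print `p_cont` and `p_disc` side by side to compare the two legends.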

In [None]:
# What happens when you map the same variable to different aesthetics?

#Solution:
# Both aesthetics encode the same variable, which works but is redundant
ggplot(data=mpg) + geom_point(mapping = aes(x=hwy, y=displ, size=cty, color=cty))

In [None]:
# What happens when you map an aesthetic, like color, to something like "color = displ < 5"?

#Solution:
# The expression is evaluated for every row, and the aesthetic is mapped to the result, in this case a logical (TRUE/FALSE) value
ggplot(data=mpg) + geom_point(mapping = aes(x=hwy, y=displ,color = displ<5))

### Section 3.5

In [None]:
# What do the empty cells in a plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))

#Solution:
#The empty cells are combinations of drv and cyl that have no observations; they match the positions in this scatterplot where no point is drawn
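
The exercise does not show the faceted plot itself, so here is a short sketch that builds it and counts the observations in each `drv`/`cyl` combination. Using `table()` for the cross-tabulation is my choice for this example.

```r
library(ggplot2)

# The faceted plot the question refers to
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

# Cross-tabulate drv against cyl; a zero marks an empty facet
combos <- table(mpg$drv, mpg$cyl)
combos

# Number of empty drv/cyl combinations
n_empty <- sum(combos == 0)
n_empty
```

The zero entries are 4-wheel drive with 5 cylinders, and rear-wheel drive with 4 or 5 cylinders.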

In [None]:
# What plots does the following code make? What does . do?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)

#Solution:
# The first plot only splits the displ vs. hwy scatter by row (drv), while the second only splits the scatter by column (cyl)
# The '.' is a placeholder that tells facet_grid not to facet along that dimension (rows or columns)

In [None]:
#Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

#Solution:
# nrow and ncol set the number of rows and columns of panels in the output
# Other layout options include 'dir' (fill panels horizontally or vertically) and 'as.table' (panel ordering); 'scales' controls whether panels share axis ranges
# facet_grid() has no nrow or ncol argument because its grid size is fixed by the number of distinct values, or levels, of the row and column variables
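
A brief sketch of these arguments in use, faceting by `class` (a variable picked only for illustration):

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Arrange the panels in two rows; ggplot works out the number of columns
p_rows <- base + facet_wrap(~ class, nrow = 2)

# Arrange the panels in four columns and let each panel pick its own y-axis range
p_free <- base + facet_wrap(~ class, ncol = 4, scales = "free_y")
```

Printing `p_rows` and `p_free` shows how the same panels are laid out differently.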

### Section 3.6

In [None]:
# What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

#Solution:
#line: geom_line()
#boxplot: geom_boxplot()
#histogram: geom_histogram()
#area: geom_area()

In [None]:
# What does the se argument to geom_smooth() do?

#Solution:
# From ?geom_smooth: 'se' takes TRUE or FALSE and controls whether a confidence interval is drawn around the smoothed line
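
To see the difference, you could build the same plot twice and toggle `se`. The object names here are just for this sketch.

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# With the confidence band around the smoothed line
with_band <- base + geom_smooth(se = TRUE)

# Smoothed line only
without_band <- base + geom_smooth(se = FALSE)
```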

In [None]:
# Will these two graphs look different? Why/why not?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()

ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))


#Solution:
# They look exactly the same. In the first plot, the data and aesthetics set in ggplot() are inherited by both geoms. The second plot sets no defaults and instead repeats the same data and aesthetics in each geom.
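
If you want to check this beyond eyeballing the plots, one option is to compare the data ggplot computes for each layer. This sketch uses `layer_data()` from ggplot2; the names `p1`, `p2`, `same_points` and `same_smooth` are made up for the example.

```r
library(ggplot2)

# Defaults set once in ggplot()
p1 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# Same data and aesthetics repeated in each geom
p2 <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# Compare the computed data for layer 1 (points) and layer 2 (smooth)
same_points <- isTRUE(all.equal(layer_data(p1, 1), layer_data(p2, 1)))
same_smooth <- isTRUE(all.equal(layer_data(p1, 2), layer_data(p2, 2)))
c(same_points, same_smooth)
```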