# Intro to Markdown, Jupyter Notebook, and R

## R Packages

R is an open-source programming language, meaning that users can contribute
packages that make our lives easier, and we can use them for free. For this lab,
and many others in the future, we will use the following R packages:

- `dplyr`: for data wrangling
- `ggplot2`: for data visualization

Next, you need to load the packages in your working environment. We do this with
the `library` function. Note that you only need to **install** packages once, but
you need to **load** them each time.

```{r load-packages, message = FALSE}
library(dplyr)
library(ggplot2)
```


Going forward you will be asked to load any relevant packages at the beginning
of each lab.
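
The install-once, load-every-time distinction can be sketched as follows. This is only one possible approach, not a required step of the lab: it checks which of the two packages are missing and installs just those. The variable names `needed` and `to_install` are made up for illustration, and `install.packages()` assumes you have an internet connection and a CRAN mirror configured.

```{r install-packages-sketch, message = FALSE}
# Packages used in this lab
needed <- c("dplyr", "ggplot2")

# requireNamespace() returns TRUE or FALSE instead of stopping with an error,
# so we can use it to find out which packages are not installed yet.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!is_installed]

# Install only what is missing (this happens once per machine)
if (length(to_install) > 0) {
  install.packages(to_install)
}
```

After this, `library(dplyr)` and `library(ggplot2)` still have to be run in every new session.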

## Dataset 1: Dr. Arbuthnot's Baptism Records

To get you started, enter the following command at the R prompt.

```{r load-arbuthnot-data}
source("http://www.openintro.org/stat/data/arbuthnot.R")
```

This command instructs R to load some data: the Arbuthnot baptism counts for boys 
and girls. 

The Arbuthnot data set refers to Dr. John Arbuthnot, an 18<sup>th</sup>-century 
physician, writer, and mathematician. He was interested in the ratio of newborn
boys to newborn girls, so he gathered the baptism records for children born in
London for every year from 1629 to 1710. We can take a look at the data by 
typing its name into an active R cell.

```{r view-data}
arbuthnot
```



However, printing the whole dataset in the cell is not that useful. R stores Arbuthnot's data in a kind of spreadsheet or table called a *data frame*.

You can see the dimensions of this data frame by typing:

```{r dim-data}
dim(arbuthnot)
```

And you can see just the top part of a dataset by using the *head* command:

```{r head-data}
head(arbuthnot)
```

The *dim* command should output `82 3`, indicating that there are 82 rows and 3 
columns. You can see the names of these columns (or 
variables) by typing:

```{r names-data}
names(arbuthnot)
```


You should see that the data frame contains the columns `year`,  `boys`, and 
`girls`. At this point, you might notice that many of the commands in R look a 
lot like functions from math class; that is, invoking R commands means supplying
a function with some number of arguments. The `dim` and `names` commands, for 
example, each took a single argument, the name of a data frame.
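
To make the function analogy concrete, here is a small sketch on a toy data frame. The data frame `toy` and all of its values are made up for illustration; they are not Arbuthnot's records. It shows two functions that take a single argument and one (`head`) that takes a second, named argument.

```{r function-arguments-sketch}
# Toy data frame with made-up counts (hypothetical values)
toy <- data.frame(
  year  = c(2001, 2002, 2003, 2004),
  boys  = c(52, 48, 50, 55),
  girls = c(47, 49, 46, 51)
)

dim(toy)          # one argument: the data frame
names(toy)        # one argument: the data frame
head(toy, n = 2)  # two arguments: the data frame and the number of rows to show
```

Arguments are separated by commas, and naming them (as in `n = 2`) makes the call easier to read.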


### Some Exploration

Let's start to examine the data a little more closely. We can access the data in
a single column of a data frame separately using a command like

```{r view-boys}
arbuthnot$boys
```

This command will only show the number of boys baptized each year. The dollar
sign basically says "go to the data frame that comes before me, and find the 
variable that comes after me".


Notice that the way R has printed these data is different. When we looked at the
complete data frame, we saw 82 rows, one on each line of the display. These data
are no longer structured in a table with other variables, so they are displayed 
one right after another. Objects that print out in this way are called vectors; 
they represent a set of numbers. 
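
As a rough illustration of what a vector is, the sketch below pulls one column out of a toy data frame and inspects it. The data frame `toy`, the name `boys_vec`, and the values are all made up for this example.

```{r vector-sketch}
# Toy data frame with made-up counts (hypothetical values)
toy <- data.frame(
  year  = c(2001, 2002, 2003),
  boys  = c(52, 48, 50),
  girls = c(47, 49, 46)
)

boys_vec <- toy$boys   # the dollar sign extracts a single column

is.vector(boys_vec)    # TRUE: the column is a plain vector, not a table
length(boys_vec)       # number of values in the vector
boys_vec[2]            # square brackets pick out one element by position
```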

R has some powerful functions for making graphics. We can create a simple plot 
of the number of girls baptized per year with the command

```{r plot-girls-vs-year}
ggplot(data = arbuthnot, aes(x = year, y = girls)) +
  geom_point()
```

Before we review the code for this plot, let's summarize the trends we see in the 
data. Go to the 'Cell' menu above and change the type of the following cell to 'Markdown'. Markdown is a way to format your text to make it look nice. Write your name and a short description of the plot above. Then go online and find out how to format text in Markdown to make your name __bold__. 

Back to the code... We use the `ggplot()` function to build plots. If you run the 
plotting code in a notebook cell, the plot appears directly below the cell; if you run 
it in the RStudio console instead, it appears under the *Plots* tab of the lower right 
panel. Notice that the command above again looks like a function, this time with 
arguments separated by commas. 

- The first argument is always the dataset. 
- Next, we provide the variables from the dataset to be assigned to `aes`thetic 
elements of the plot, e.g. the x and the y axes. 
- Finally, we use another layer, separated by a `+`, to specify the `geom`etric 
object for the plot. Since we want a scatterplot, we use `geom_point`.

You might wonder how you are supposed to know the syntax for the `ggplot` function. 
Thankfully, R documents all of its functions extensively. To read what a function 
does and learn the arguments that are available to you, just type in a question mark 
followed by the name of the function that you're interested in. Try the following in
your console:

```{r plot-help, tidy = FALSE}
?ggplot
```


### R as a big calculator

Now, suppose we want to plot the total number of baptisms. To compute this, we 
could use the fact that R is really just a big calculator. We can type in 
mathematical expressions like

```{r calc-total-bapt-numbers}
5218 + 4683
```

to see the total number of baptisms in 1629. We could repeat this once for each 
year, but there is a faster way. If we add the vector for baptisms for boys to 
that of girls, R will compute all sums simultaneously.

```{r calc-total-bapt-vars}
arbuthnot$boys + arbuthnot$girls
```

What you will see are 82 numbers (in that packed display, because we aren’t 
looking at a data frame here), each one representing the sum we’re after. Take a
look at a few of them and verify that they are right.
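
One way to carry out that kind of check is sketched below on two short vectors. The first element of each vector uses the 1629 counts quoted above; the remaining values and the names `toy_boys`, `toy_girls`, and `toy_total` are made up for illustration.

```{r vectorized-sum-sketch}
# First element: the 1629 counts from above; other values are made up
toy_boys  <- c(5218, 100, 200)
toy_girls <- c(4683, 90, 210)

# Adding two vectors of the same length adds them element by element
toy_total <- toy_boys + toy_girls
toy_total

# Check the first sum against the calculation done by hand
toy_total[1] == 5218 + 4683
```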

### Adding a new variable to the data frame

We'll be using this new vector to generate some plots, so we'll want to save it 
as a permanent column in our data frame.

```{r calc-total-bapt-vars-save}
arbuthnot <- arbuthnot %>%
  mutate(total = boys + girls)
```


What in the world is going on here? The `%>%` operator is called the **piping** 
operator. Basically, it takes the output of the current line and pipes it into 
the following line of code.


**A note on piping:** We can read these two lines of code as the following: 

*"Take the `arbuthnot` dataset and **pipe** it into the `mutate` function. 
Use `mutate` to create a new variable called `total` that is the sum of the variables
called `boys` and `girls`. Then assign the resulting dataset to the object
called `arbuthnot`, i.e. overwrite the old `arbuthnot` dataset with the new one
containing the new variable."*

This is essentially equivalent to going through each row and adding up the boys 
and girls counts for that year and recording that value in a new column called
total.
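
To see that the pipe is only a different way of writing a function call, the sketch below builds the same column three ways on a toy data frame. The data frame `toy` and the names `piped`, `nested`, and `by_hand` are made up for illustration.

```{r pipe-equivalence-sketch, message = FALSE}
library(dplyr)

# Toy data frame with made-up counts (hypothetical values)
toy <- data.frame(boys = c(52, 48, 50), girls = c(47, 49, 46))

# 1. Pipe the data frame into mutate()
piped <- toy %>%
  mutate(total = boys + girls)

# 2. The same call written without the pipe: the data frame is the first argument
nested <- mutate(toy, total = boys + girls)

# 3. Base R: add the two columns and store the result as a new column
by_hand <- toy
by_hand$total <- by_hand$boys + by_hand$girls

identical(piped, nested)   # the piped and unpiped calls give the same result
by_hand$total              # row-by-row sums
```

The pipe becomes more useful once several steps are chained, because the code then reads top to bottom in the order the steps are applied.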

**Where is the new variable?** If you are working in RStudio, click on the name of the 
dataset again to update it in the data viewer. In any environment, you can list the 
variable names to confirm that the new column exists:

```{r names-data-total}
names(arbuthnot)
```

You'll see that there is now a new column called `total` that has been tacked on
to the data frame. The special symbol `<-` performs an *assignment*, taking the 
output of one line of code and saving it into an object in your workspace. In 
this case, you already have an object called `arbuthnot`, so this command updates
that data set with the new mutated column.

We can make a plot of the total number of baptisms per year with the following command.

```{r plot-total-vs-year-line}
ggplot(data = arbuthnot, aes(x = year, y = total)) +
  geom_line()
```

Note that using `geom_line()` instead of `geom_point()` results in a line plot instead
of a scatter plot. You want both? Just layer them on:

```{r plot-total-vs-year-line-and-point}
ggplot(data = arbuthnot, aes(x = year, y = total)) +
  geom_line() +
  geom_point()
```

**Exercise**: Now, generate a plot of the proportion of boys born over time. What 
do you see? Write your answer in markdown below. Feel free to insert a cell.

```{r plot-proportion-of-boys-over-time}
# Proportion of boys among all baptisms in each year
ggplot(data = arbuthnot, aes(x = year, y = boys / total)) +
  geom_point()
```



Finally, in addition to simple mathematical operators like subtraction and 
division, you can ask R to make comparisons like greater than, `>`, less than,
`<`, and equality, `==`. For example, we can ask if boys outnumber girls in each 
year with the expression

```{r boys-more-than-girls}
arbuthnot <- arbuthnot %>%
  mutate(more_boys = boys > girls)
```

This command adds a new variable to the `arbuthnot` data frame containing the values
of either `TRUE` if that year had more boys than girls, or `FALSE` if that year 
did not (the answer may surprise you). This variable contains a different kind of 
data than we have considered so far. All other columns in the `arbuthnot` data 
frame have values that are numerical (the year, the number of boys and girls). Here, 
we've asked R to create *logical* data, data where the values are either `TRUE` 
or `FALSE`. In general, data analysis will involve many different kinds of data 
types, and one reason for using R is that it is able to represent and compute 
with many of them.
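
A brief sketch of how logical data behaves, using short made-up vectors (the names `toy_boys`, `toy_girls`, and `toy_more_boys` and their values are hypothetical):

```{r logical-sketch}
# Made-up counts for three years
toy_boys  <- c(52, 48, 50)
toy_girls <- c(47, 49, 46)

# The comparison is made element by element and returns TRUE or FALSE for each pair
toy_more_boys <- toy_boys > toy_girls
toy_more_boys

class(toy_more_boys)   # "logical"
sum(toy_more_boys)     # TRUE counts as 1, so this is the number of TRUE values
all(toy_more_boys)     # TRUE only if every element is TRUE
```

Functions such as `sum()` and `all()` are a convenient way to summarize a logical column without reading every value.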

## Dataset 2: Present birth records

In the previous sections, you recreated some of the displays and preliminary 
analysis of Arbuthnot's baptism data. Next you will do a similar analysis, 
but for present-day birth records in the United States. Load the 
present-day data with the following command.


```{r load-present-data}
present <- read.csv("present.csv")
```

The data are stored in a data frame called `present` which should now be loaded in 
your workspace.

How many variables are included in this data set?
<ol>
<li> 2 </li>
<li> 3 </li>
<li> 4 </li>
<li> 74 </li>
<li> 2013 </li>
</ol>

**Exercise**: What years are included in this dataset? **Hint:** Use the `range` 
function and `present$year` as its argument.
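
If you have not used `range` before, here is a minimal sketch on a made-up vector (the name `toy_years` and its values are hypothetical, not taken from `present`):

```{r range-sketch}
# Made-up years, deliberately out of order
toy_years <- c(2003, 2001, 2004, 2002)

# range() returns the smallest and the largest value, in that order
range(toy_years)
```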

Calculate the total number of births for each year and store these values in a new 
variable called `total` in the `present` dataset. Then, calculate the proportion of 
boys born each year and store these values in a new variable called `prop_boys` in 
the same dataset. Plot these values over time and based on the plot determine if the 
following statement is true or false: The proportion of boys born in the US has 
decreased over time. 
<ol>
<li> True </li>
<li> False </li>
</ol>

Create a new variable called `more_boys` which contains the value of either `TRUE` 
if that year had more boys than girls, or `FALSE` if that year did not. Based on this 
variable which of the following statements is true? 
<ol>
<li> Every year there are more girls born than boys. </li>
<li> Every year there are more boys born than girls. </li>
<li> Half of the years there are more boys born, and the other half more girls born. </li>
</ol>

Calculate the boy-to-girl ratio each year, and store these values in a new variable called `prop_boy_girl` in the `present` dataset. Plot these values over time. Which of the following best describes the trend? 
<ol>
<li> There appears to be no trend in the boy-to-girl ratio from 1940 to 2013. </li>
<li> There is initially an increase in boy-to-girl ratio, which peaks around 1960. After 1960 there is a decrease in the boy-to-girl ratio, but the number begins to increase in the mid 1970s. </li>
<li> There is initially a decrease in the boy-to-girl ratio, and then an increase between 1960 and 1970, followed by a decrease. </li>
<li> The boy-to-girl ratio has increased over time. </li>
<li> There is an initial decrease in the boy-to-girl ratio, but the ratio appears to level off around 1960 and has remained constant since then. </li>
</ol>

In what year did we see the largest total number of births in the U.S.? *Hint:* Sort 
your dataset in descending order based on the `total` column. 
<ol>
<li> 1940 </li>
<li> 1957 </li>
<li> 1961 </li>
<li> 1991 </li>
<li> 2007 </li>
</ol>
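
The sorting hint can be sketched with `dplyr`'s `arrange()` and `desc()` functions. The example below uses a toy data frame with made-up values (the names `toy` and `sorted` are hypothetical), so it shows the technique without answering the question for `present`.

```{r arrange-sketch, message = FALSE}
library(dplyr)

# Toy data frame with made-up totals (hypothetical values)
toy <- data.frame(
  year  = c(2001, 2002, 2003),
  total = c(99, 120, 96)
)

# arrange() sorts rows; desc() asks for descending order
sorted <- toy %>%
  arrange(desc(total))

sorted            # the largest total is now in the first row
sorted$year[1]    # the year with the largest total
```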


## Resources for learning R 

That was a short introduction to R, but we will provide you with more
functions and a more complete sense of the language as the course progresses. You 
might find the following tips and resources helpful.

- In this course we will be using the `dplyr` (for data wrangling) and `ggplot2` (for 
data visualization) packages extensively. If you are googling for R code, make sure
to also include these package names in your search query. For example, instead
of googling "scatterplot in R", google "scatterplot in R with ggplot2".

- The following cheatsheets may come in handy throughout the course. Note that some 
of the code on these cheatsheets may be too advanced for this course; however, the 
majority of it will become useful as you progress through the course material.
    - [Data wrangling cheatsheet](http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
    - [Data visualization cheatsheet](http://www.rstudio.com/wp-content/uploads/2015/12/ggplot2-cheatsheet-2.0.pdf)


