## R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.

When you click the **Knit** button, a document will be generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:

If you are running this in a new project or on your own computer for the first time, you need to run `install.packages("tidyverse")` once. Thereafter you only need to load the package with `library(tidyverse)` at the start of each session.
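
The install-once, load-every-session pattern can be sketched as a single guarded chunk. This is only one possible approach, assuming a CRAN mirror is configured and reachable; `requireNamespace()` is a base R function that checks whether a package is available without attaching it.

```r
# Install tidyverse only if it is not already available on this machine.
# Assumption: a CRAN mirror is configured and the network is reachable.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Loading is still needed in every new R session.
library(tidyverse)
```

Guarding the install this way avoids re-downloading the package each time the document is knit.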

In [1]:
install.packages("tidyverse")


The downloaded binary packages are in
	/var/folders/dz/k8c4rxzs27v3q7ly5w31c95m0000gn/T//Rtmpu46fQx/downloaded_packages


In [2]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


## Iris Dataset

You will use the iris data set, which ships with R in the datasets package, so there is nothing to read in. To find more information on the iris data set you can use `?iris` or use the Help tab on the right-hand side and search for iris. This gives more information on how the data was collected and what is in each column.

In [3]:
iris

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
<dbl>,<dbl>,<dbl>,<dbl>,<fct>
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa


You can sort your data using the arrange function. This will sort the data by sepal length in ascending order, with the shortest sepal being the first row of the dataset.

In [4]:
iris_arrange <- arrange(iris, Sepal.Length) 
iris_arrange

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
<dbl>,<dbl>,<dbl>,<dbl>,<fct>
4.3,3.0,1.1,0.1,setosa
4.4,2.9,1.4,0.2,setosa
4.4,3.0,1.3,0.2,setosa
4.4,3.2,1.3,0.2,setosa
4.5,2.3,1.3,0.3,setosa
4.6,3.1,1.5,0.2,setosa
4.6,3.4,1.4,0.3,setosa
4.6,3.6,1.0,0.2,setosa
4.6,3.2,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa


You can also sort in descending order. You will again use sepal length, and this time the longest sepal will be the first row.

In [5]:
iris_arrange <- arrange(iris, desc(Sepal.Length))
iris_arrange

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
<dbl>,<dbl>,<dbl>,<dbl>,<fct>
7.9,3.8,6.4,2.0,virginica
7.7,3.8,6.7,2.2,virginica
7.7,2.6,6.9,2.3,virginica
7.7,2.8,6.7,2.0,virginica
7.7,3.0,6.1,2.3,virginica
7.6,3.0,6.6,2.1,virginica
7.4,2.8,6.1,1.9,virginica
7.3,2.9,6.3,1.8,virginica
7.2,3.6,6.1,2.5,virginica
7.2,3.2,6.0,1.8,virginica
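
Sorting on more than one column follows the same idea: `arrange()` accepts several sort keys and uses later ones to break ties in earlier ones. The sketch below is illustrative only; the object name `iris_sorted` is a made-up name, not part of the original walkthrough.

```r
library(dplyr)

# Sort by Species first (ascending factor order), then by
# Sepal.Length from longest to shortest within each species.
iris_sorted <- arrange(iris, Species, desc(Sepal.Length))

# Show the first few rows of the sorted data.
head(iris_sorted)
```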


Data sets often contain more columns than you need. Suppose you want to subset the data to just the columns required for your analysis. You can use the select function. Let's say you are interested in just the sepal length, sepal width, and species.

In [6]:
iris_select <- select(iris, Sepal.Length, Sepal.Width, Species)
iris_select

Sepal.Length,Sepal.Width,Species
<dbl>,<dbl>,<fct>
5.1,3.5,setosa
4.9,3.0,setosa
4.7,3.2,setosa
4.6,3.1,setosa
5.0,3.6,setosa
5.4,3.9,setosa
4.6,3.4,setosa
5.0,3.4,setosa
4.4,2.9,setosa
4.9,3.1,setosa


For reference, you can put a "-" in front of column names to select every column except the ones listed.

In [7]:
iris_select <- select(iris, -Sepal.Length, -Sepal.Width)
iris_select

Petal.Length,Petal.Width,Species
<dbl>,<dbl>,<fct>
1.4,0.2,setosa
1.4,0.2,setosa
1.3,0.2,setosa
1.5,0.2,setosa
1.4,0.2,setosa
1.7,0.4,setosa
1.4,0.3,setosa
1.5,0.2,setosa
1.4,0.2,setosa
1.5,0.1,setosa


You may only be interested in the setosa species for this data set. You can use the filter function to select only rows where species is equal to "setosa".

In [8]:
iris_filter <- filter(iris, Species == "setosa")
iris_filter

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
<dbl>,<dbl>,<dbl>,<dbl>,<fct>
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa


You can also filter on whether sepal length is greater than the mean sepal length.

In [9]:
iris_filter <- filter(iris, Sepal.Length > mean(iris$Sepal.Length))
iris_filter

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
<dbl>,<dbl>,<dbl>,<dbl>,<fct>
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.9,3.1,4.9,1.5,versicolor
6.5,2.8,4.6,1.5,versicolor
6.3,3.3,4.7,1.6,versicolor
6.6,2.9,4.6,1.3,versicolor
5.9,3.0,4.2,1.5,versicolor
6.0,2.2,4.0,1.0,versicolor
6.1,2.9,4.7,1.4,versicolor
6.7,3.1,4.4,1.4,versicolor


You may be interested in computing new information from your data. 

For instance, you may want to calculate the ratio of sepal width to sepal length.

In [10]:
iris_mutate <- mutate(iris, Sepal.Ratio = Sepal.Width / Sepal.Length)
iris_mutate

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,Sepal.Ratio
<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>
5.1,3.5,1.4,0.2,setosa,0.6862745
4.9,3.0,1.4,0.2,setosa,0.6122449
4.7,3.2,1.3,0.2,setosa,0.6808511
4.6,3.1,1.5,0.2,setosa,0.6739130
5.0,3.6,1.4,0.2,setosa,0.7200000
5.4,3.9,1.7,0.4,setosa,0.7222222
4.6,3.4,1.4,0.3,setosa,0.7391304
5.0,3.4,1.5,0.2,setosa,0.6800000
4.4,2.9,1.4,0.2,setosa,0.6590909
4.9,3.1,1.5,0.1,setosa,0.6326531


You can use the summarize function to collapse the data into summary statistics. Here you are outputting a one-row table with the mean sepal length across the whole dataset.

In [11]:
avg_sepal_length <- summarize(iris, mean_val = mean(Sepal.Length))
avg_sepal_length

mean_val
<dbl>
5.843333
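
A single `summarize()` call can also return several statistics at once, one per named argument. The following is a minimal sketch; the column names `sd_val` and `n_rows` and the object name `sepal_summary` are illustrative choices, not names used elsewhere in this document.

```r
library(dplyr)

# Compute several summaries of Sepal.Length in one call.
# Each named argument becomes one column of the one-row result.
sepal_summary <- summarize(
  iris,
  mean_val = mean(Sepal.Length),  # average sepal length
  sd_val   = sd(Sepal.Length),    # standard deviation
  n_rows   = n()                  # number of rows summarized
)
sepal_summary
```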


You can also group the data based on a specified variable or group of variables.

In [12]:
group_by_species <- group_by(iris, Species)
group_by_species

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
<dbl>,<dbl>,<dbl>,<dbl>,<fct>
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa


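Interestingly, nothing visibly changes in the table: grouping is stored as metadata alongside the data rather than in the rows themselves. You can use the groups function to look at how the data is grouped. Calling it on the grouped data returns the grouping variable, while calling it on the ungrouped iris returns an empty list. You can always use `View(iris)` to look at the original data set.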

In [13]:
groups(group_by_species)
groups(iris)

[[1]]
Species
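
Besides `groups()`, dplyr has a few other helpers for inspecting grouping. The sketch below uses `is_grouped_df()`, `group_vars()`, `n_groups()`, and `ungroup()`, all of which are dplyr functions; the object name `ungrouped` is only illustrative.

```r
library(dplyr)

group_by_species <- group_by(iris, Species)

# TRUE when the object carries grouping metadata.
is_grouped_df(group_by_species)

# Names of the grouping columns as a character vector.
group_vars(group_by_species)

# Number of distinct groups (one per species here).
n_groups(group_by_species)

# ungroup() removes the grouping again without touching the rows.
ungrouped <- ungroup(group_by_species)
group_vars(ungrouped)
```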


Group by is particularly helpful when used in conjunction with other functions, such as the previously used summarize function. You can combine the group by and summarize functions to calculate the mean sepal length for each species. You could do this in two steps, but you don't need the intermediate data, so a pipe "%>%" can be used. A pipe is like saying "do this, then immediately follow with this next function": it passes the result on its left as the first argument of the function on its right.

In [14]:
iris_final <- iris %>% group_by(Species) %>% summarize(new_col = mean(Sepal.Length))
iris_final

Species,new_col
<fct>,<dbl>
setosa,5.006
versicolor,5.936
virginica,6.588
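
To make the equivalence concrete, here is a sketch of the two-step version next to the piped one. The names `iris_grouped`, `iris_two_step`, and `iris_piped` are illustrative and not used elsewhere in this document.

```r
library(dplyr)

# Two-step version: store the grouped data, then summarize it.
iris_grouped  <- group_by(iris, Species)
iris_two_step <- summarize(iris_grouped, new_col = mean(Sepal.Length))

# Piped version: the grouped data is passed straight to summarize().
iris_piped <- iris %>%
  group_by(Species) %>%
  summarize(new_col = mean(Sepal.Length))

# Check that both approaches produce the same table.
all.equal(iris_two_step, iris_piped)
```

The piped form avoids leaving an intermediate object such as `iris_grouped` in your workspace.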


This concludes the introduction to data manipulation!