<a href="https://colab.research.google.com/github/odu-cs800-research/public/blob/main/CS800_RTutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 800 R Tutorial

We'll be doing some of the exercises from [R for Data Science](https://r4ds.had.co.nz) to get an introduction to R. We'll do this in Google Colab, but the commands can also be run locally using RStudio.

There are a ton of references available for R and since it's popular, you can pretty much search for whatever you want and find something close.

There are some examples of more of the statistical functions at
https://www.cs.odu.edu/~mweigle/courses/cs795/mklein-IntroR/lecture/

## First Thing

We're going to use the [ggplot2](https://ggplot2.tidyverse.org/) library, which is included in the [tidyverse package](https://www.tidyverse.org/), so we install and load that first.

In [None]:
install.packages("tidyverse")

In [None]:
library(tidyverse)

## Basic Data Visualization in R

### Aesthetic mappings

We'll start with [Section 3.3, Aesthetic mappings](https://r4ds.had.co.nz/data-visualisation.html#aesthetic-mappings) from [R for Data Science](https://r4ds.had.co.nz).

First, let's look at the `mpg` dataset that we use for these examples.  This command will pop up a help window with a description of the dataset.

In [None]:
?ggplot2::mpg

Here we'll print just the first few lines of what's in the `mpg` dataset.  Note that this is a "tibble" instead of a regular R "dataframe".  

If you're not familiar with either of these terms, don't worry about it.  If you are familiar with dataframes, then here's a description of the differences: https://r4ds.had.co.nz/tibbles.html#tibbles-vs.data.frame

In [None]:
head(mpg)
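As a rough illustration of one difference between tibbles and data frames, the sketch below builds a tiny table both ways and compares what single-column subsetting returns. The column names `x` and `y` and their values are made up for this example.

```r
library(tibble)

# A small made-up table, first as a base R data frame, then as a tibble
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

class(df)   # "data.frame"
class(tb)   # "tbl_df" "tbl" "data.frame"

# Selecting one column with [ drops to a plain vector for a data frame...
is.vector(df[, "x"])
# ...but stays a tibble for a tibble
is_tibble(tb[, "x"])
```

A tibble is still a data frame underneath, so functions that expect a data frame generally accept one.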

First, let's create a simple scatterplot.  We're mapping the displacement attribute (`displ`) to the x-axis and the highway miles per gallon (`hwy`) to the y-axis.

In [None]:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

Map the colors of datapoints to the `class` variable, indicating the class of each vehicle.

In [None]:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Make all of the points blue.  Notice the difference in the placement of the color setting. It applies to all the dots and is not based on any data item.

In [None]:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
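To see why the placement matters, here is a small sketch of what happens if `color = "blue"` is moved inside `aes()`. In that position ggplot2 treats `"blue"` as a constant data value to map, not as a color name, so the points are drawn in the default palette color and a legend appears. The variable name `p` is just for illustration.

```r
library(ggplot2)

# "blue" inside aes() is treated as a data value, not as a color name
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
p

# Inspect the color(s) that ggplot2 actually assigned to the points
unique(layer_data(p)$colour)
```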

*This will generate an error.  Why?*

In [None]:
ggplot(data = mpg) 
+ geom_point(mapping = aes(x = displ, y = hwy))

### Facets
*Moving to [Section 3.5 Facets](https://r4ds.had.co.nz/data-visualisation.html#facets)*

We can create "small multiples" to show the data in separate charts. Note that both the x-axis range and the y-axis range are the same in all of the charts.

The only thing we've added here is the `facet_wrap` function.  It says to divide the charts by `class` and use 2 rows to display them.

In [None]:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)
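If shared axes hide the variation inside a small group, `facet_wrap()` also accepts a `scales` argument. The sketch below lets the y-axis vary per panel; whether that is appropriate depends on whether you want readers to compare across panels. The name `p_free` is just for illustration.

```r
library(ggplot2)

# Same chart as above, but each panel gets its own y-axis range
p_free <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2, scales = "free_y")
p_free
```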

### Statistical transformations

[Section 3.7](https://r4ds.had.co.nz/data-visualisation.html#statistical-transformations)

Time to look at bar charts and histograms.  We'll use a different dataset, describing diamonds.

In [None]:
?ggplot2::diamonds

The chart below uses `geom_bar` to generate a count of items with each type of cut.

In [None]:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))

We can also use the `stat_count` function to generate the same chart.

In [None]:
ggplot(data = diamonds) + 
  stat_count(mapping = aes(x = cut))
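The bar heights are just row counts per `cut`. As a quick check, the sketch below computes those counts directly with dplyr's `count()` and compares them with the values ggplot2 computed for the bars. The names `cut_counts`, `p`, and `bar_heights` are made up for this example.

```r
library(ggplot2)
library(dplyr)

# Count the rows for each cut directly
cut_counts <- count(diamonds, cut)
cut_counts

# Pull out the bar heights that geom_bar() computed
p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
bar_heights <- layer_data(p)$count

# The two sets of counts should agree
all(bar_heights == cut_counts$n)
```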

We can use `stat_summary` to generate other summary statistics about the dataset. This is setting the min value of the line to the min value of `depth`, max to the max value, and the dot to the median value.


In [None]:
ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )
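The same three numbers per cut can also be computed as a table. This is a minimal sketch using `group_by()`, `summarize()`, and the `%>%` pipe, all of which are covered later in this tutorial. The column names `depth_min`, `depth_median`, and `depth_max` are made up for this example.

```r
library(ggplot2)
library(dplyr)

# Min, median, and max depth for each cut, matching what the chart draws
depth_summary <- diamonds %>%
  group_by(cut) %>%
  summarize(
    depth_min = min(depth),
    depth_median = median(depth),
    depth_max = max(depth)
  )
depth_summary
```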

*For more information on the stat functions available, see the [ggplot2 cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf).*

## Workflow: basics

[Section 4](https://r4ds.had.co.nz/workflow-basics.html)

*Now that we've made some charts, let's go back to basics.*

Using R as a calculator:

In [None]:
1 / 200 * 30
(59 + 73 + 2) / 3
sin(pi/2)

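**Important:** By convention, variable assignment is done with `<-`. Using `=` also works for top-level assignment, but `<-` is the preferred style, and inside a function call `=` names an argument instead of assigning.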

In [None]:
x <- 3*4
x
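One place where `<-` and `=` behave differently is inside a function call. The sketch below uses a made-up name, `demo_vals`, to show that `=` only labels an argument while `<-` creates a variable as a side effect.

```r
# Inside a function call, `=` names an argument and creates no variable
sum(demo_vals = c(1, 2, 3))
exists("demo_vals")   # FALSE, unless demo_vals was already defined earlier

# `<-` inside a call assigns in the calling environment as a side effect
sum(demo_vals <- c(1, 2, 3))
exists("demo_vals")
demo_vals
```

Assigning inside a call is legal but easy to misread, so it is usually clearer to assign on its own line.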

## Working with Data

[Section 5](https://r4ds.had.co.nz/transform.html) - uses the `flights` dataset and introduces `filter()`, `arrange()`, and `select()`

In [None]:
install.packages("nycflights13")
library(nycflights13)
?nycflights13::flights

`filter()`  allows you to subset observations based on their values. 

In [None]:
filter(flights, month==1, day==1)
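Conditions separated by commas are combined with "and". For "or", use `|` or `%in%`. The sketch below shows both forms; the variable names are made up for this example.

```r
library(dplyr)
library(nycflights13)

# Flights in November or December, written two equivalent ways
nov_dec <- filter(flights, month == 11 | month == 12)
nov_dec_in <- filter(flights, month %in% c(11, 12))
nrow(nov_dec) == nrow(nov_dec_in)

# Comma-separated conditions behave like a single condition joined with &
jan1 <- filter(flights, month == 1, day == 1)
jan1_and <- filter(flights, month == 1 & day == 1)
nrow(jan1) == nrow(jan1_and)
```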

`arrange()` lets you sort rows. Rather than filtering rows out, it just rearranges them.

In [None]:
arrange(flights, year, month, day)
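To sort in descending order, wrap the column in `desc()`. A small sketch that puts the most delayed departures first (the name `most_delayed` is made up for this example):

```r
library(dplyr)
library(nycflights13)

# Largest departure delays first; rows with NA delays are sorted to the end
most_delayed <- arrange(flights, desc(dep_delay))
head(most_delayed)
```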

`select()` lets you pick only certain columns.

In [None]:
select(flights, year, month, day, tailnum)
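`select()` also understands column ranges, negation, and helper functions such as `starts_with()`. A short sketch, with variable names made up for this example:

```r
library(dplyr)
library(nycflights13)

# All columns from year through day (inclusive)
date_cols <- select(flights, year:day)
names(date_cols)

# Columns whose names begin with "dep"
dep_cols <- select(flights, starts_with("dep"))
names(dep_cols)

# Everything except the year-through-day columns
no_date <- select(flights, -(year:day))
ncol(no_date)
```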

Finally, we'll use `summarize()` and `group_by()` to perform summary operations on selected data.

This will compute the average departure delay by month.  (`na.rm = TRUE` just means to ignore any rows that have `NA` values)

In [None]:
by_month <- group_by(flights, year, month)
summarize(by_month, delay = mean(dep_delay, na.rm = TRUE))

Here are a couple of shortcut notations:
* `$` - lets you reference a particular column directly (without having to use `select()`)
* `%>%` - like a pipe (`|`) in Unix

Here we're going to compute the average departure delay for all Delta flights.

In [None]:
delta <- filter(flights, carrier == "DL")
mean(delta$dep_delay, na.rm=TRUE)

The pipe is just a shortcut.  This is the same result as above to compute the average departure delay by month.

You can omit the first parameter (dataset) and it's assumed that the data is coming from the pipe input.

In [None]:
group_by(flights, year, month) %>% summarize(delay = mean(dep_delay, na.rm = TRUE))
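The pipe becomes more useful as steps accumulate, because each line reads as "then do this". The sketch below chains the functions covered so far to rank months by average Delta departure delay; the name `delta_by_month` is made up for this example.

```r
library(dplyr)
library(nycflights13)

# Keep Delta flights, average the departure delay per month,
# then sort so the worst month comes first
delta_by_month <- flights %>%
  filter(carrier == "DL") %>%
  group_by(month) %>%
  summarize(delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(delay))
delta_by_month
```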

## Data Import

[Section 11](https://r4ds.had.co.nz/data-import.html) - getting data into R, reading CSV

The main function we'll look at is `read_csv()` to read in comma-separated files, but there are several others described in this section.

There are two ways that we can load data into the notebook.  First, if the data is available online, we can provide a URL:

In [None]:
stars <- read_csv("https://raw.githubusercontent.com/cs625-datavis-fall19/assignments/master/stars.csv")


Or we can load the datafile temporarily in the notebook and read it in using the filename.  Click the folder icon in the left sidebar and then click the upload button (page with an up arrow).  Once the file is uploaded, you can just use the filename directly:

In [None]:
stars <- read_csv("stars.csv")

Once this is read in, you can use all of the other functions that we've covered.

In [None]:
hot_stars <- filter(stars, temp > 5000)
head(hot_stars)
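If you want to try `read_csv()` without downloading or uploading anything, here is a self-contained sketch that writes a tiny CSV to a temporary file and reads it back. The file contents, the column names `name` and `temp`, and the values are all made up for illustration and are not taken from `stars.csv`.

```r
library(readr)

# Write a tiny made-up CSV to a temporary file (hypothetical values)
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,temp", "star_a,5800", "star_b,3200"), tmp)

# Read it back, stating the expected column types explicitly
toy <- read_csv(
  tmp,
  col_types = cols(name = col_character(), temp = col_double())
)
toy
```

Stating `col_types` is optional; without it, `read_csv()` guesses the type of each column from the data.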

## Chart Labels

[Section 28](https://r4ds.had.co.nz/graphics-for-communication.html#label) Graphics for Communication

It's important to have good chart titles and labels.

In [None]:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Let's add a title and caption.

In [None]:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  labs(
    title="Fuel efficiency decreases with engine size",
    caption = "Data from fueleconomy.gov"
  )

Now let's change the axis labels.

In [None]:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  labs(
    title="Fuel efficiency decreases with engine size",
    caption = "Data from fueleconomy.gov",
    x = "Engine displacement (L)",
    y = "Highway fuel economy (mpg)",
    color = "Car type"
  )

## Data Analysis Walkthrough

Now we're going to take a dataset and walk through generating some basic statistics and charts.

Stats:
* mean
* standard deviation
* median
* mode

Charts:
* histogram visualizing the distribution of values for a single variable
* box plot visualizing the distribution of values for a single variable
* scatterplot visualizing the distribution of values for one variable vs. a second variable
* line chart showing the values of data vs. time


In [None]:
install.packages("dslabs")

In [None]:
library(dslabs)

We're going to use the `murders` dataset for most of these. This is the FBI dataset for gun murders in the US in 2010. It is broken down by each state and includes the state population.

In [None]:
head(murders)

### Stats

Here we'll compute mean, standard deviation, median, and mode (most common value).

In [None]:
mean(murders$total)

In [None]:
sd(murders$total)

In [None]:
median(murders$total)

R has no built-in function for the statistical mode (the built-in `mode()` returns an object's storage type instead), so we write our own.

In [None]:
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(murders$total)
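Note that `getmode()` returns only one value, the first one it encounters, when several values are tied for most common. The sketch below repeats the function so it runs on its own, and adds a hypothetical variant, `getmodes()`, that returns all tied values. The vector `ties` is made-up test data.

```r
# Same function as above, repeated so this example is self-contained
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Hypothetical variant: return every value tied for the highest count
getmodes <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))
  uniqv[counts == max(counts)]
}

ties <- c(1, 2, 2, 3, 3)
getmode(ties)    # 2
getmodes(ties)   # 2 3
```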

In [None]:
max(murders$total)

In [None]:
min(murders$total)

In [None]:
arrange(murders, desc(total))

In [None]:
arrange(murders, desc(total/population))
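Raw ratios such as `total/population` are tiny numbers that are hard to read. One common way to present them is as a rate per 100,000 people. The sketch below uses dplyr's `mutate()`, which has not been covered above, to add such a column; the column name `rate_per_100k` is made up for this example.

```r
library(dplyr)
library(dslabs)

# Add a murders-per-100,000-people column, then sort by it
murders_rate <- murders %>%
  mutate(rate_per_100k = total / population * 100000) %>%
  arrange(desc(rate_per_100k))
head(murders_rate)
```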

### Histogram

In [None]:
ggplot(murders, aes(total)) + 
  geom_histogram(binwidth=200) + 
  labs(
    title="Distribution of Gun Murders in 2010",
    x = "Gun Murders (bins of 200)",
    y = "Number of States"
  )
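To check what the histogram is counting, you can pull out the bins that ggplot2 computed. This sketch rebuilds the same histogram without labels and inspects the bin edges and counts; `p_hist` and `bins` are names made up for this example.

```r
library(ggplot2)
library(dslabs)

# Same histogram as above, without labels
p_hist <- ggplot(murders, aes(total)) +
  geom_histogram(binwidth = 200)

# Each row is one bar: its left edge, right edge, and number of states
bins <- layer_data(p_hist)
bins[, c("xmin", "xmax", "count")]

# Every state falls into exactly one bin
sum(bins$count) == nrow(murders)
```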

### Box plot

In [None]:
ggplot(murders, aes(y=total)) + 
  geom_boxplot() + 
  labs(
    y = "Gun Murders in 2010"
  )

In [None]:
ggplot(murders, aes(x=region, y=total)) + 
  geom_boxplot() + 
  labs(
    y = "Gun Murders in 2010"
  )
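The box in a box plot spans the 25th to 75th percentiles, with a line at the median. As a rough cross-check, the sketch below computes those values directly for the overall `total` column; `q` and `iqr` are names made up for this example.

```r
library(dslabs)

# Minimum, quartiles, and maximum of the murder totals
q <- quantile(murders$total)
q

# Interquartile range: the height of the box
iqr <- IQR(murders$total)
iqr
```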

### Scatterplot

In [None]:
ggplot(murders) + 
  geom_point(mapping = aes(x = population, y = total, color=region)) + 
  labs(
    title="Gun Murders in 2010",
    x = "State Population",
    y = "Gun Murders",
    color = "Region"
  )

In [None]:
ggplot(murders) + 
  geom_point(mapping = aes(x = population, y = total/population, color=region)) + 
  labs(
    title="Gun Murder Rate in 2010",
    x = "State Population",
    y = "Murder Rate (murders / population)",
    color = "Region"
  )
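Because a few large states dominate the first scatterplot, the smaller states end up bunched in one corner. One option, shown here as a sketch and not as the only reasonable choice, is to put both axes on a log scale with `scale_x_log10()` and `scale_y_log10()`. The name `p_log` and the axis label wording are made up for this example.

```r
library(ggplot2)
library(dslabs)

# Total murders vs. population, with both axes on a log10 scale
p_log <- ggplot(murders) +
  geom_point(mapping = aes(x = population, y = total, color = region)) +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    title = "Gun Murders in 2010",
    x = "State Population (log scale)",
    y = "Gun Murders (log scale)",
    color = "Region"
  )
p_log
```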

### Line Chart 

We want to do a line chart, but this isn't the right kind of data for that, so we'll load in a different dataset.  

The `polls_2008` dataset gives the number of days until the 2008 US Presidential Election Day (as negative numbers) and the average poll margin between Obama and McCain on each of those days.

In [None]:
head(polls_2008)

In [None]:
ggplot(polls_2008, aes(x=day, y=margin)) +
  geom_line()+
  labs (
    x = "Days before Election",
    y = "Poll difference between Obama and McCain"
  )

In [None]:
ggplot(polls_2008, aes(x=day, y=margin)) +
  geom_line()+
  geom_point() + 
  labs (
    x = "Days before Election",
    y = "Poll difference between Obama and McCain"
  )