# Introduction to `R`

| Start Time | End Time | Length | Agenda                                       |
|------------|----------|--------|:---------------------------------------------|
| 10:00am    | 10:40am  | 40 min | **Unit 1 - Introduction and Administration** |
| 11:00am    | 11:40am  | 40 min | Unit 2 - EDA and DataViz I                   |
| 11:40am    | 12:20pm  | 40 min | Lunch                                        |
| 12:20pm    | 12:50pm  | 30 min | Unit 3 - EDA and DataViz II                  |
| 1:10pm     | 1:50pm   | 40 min | Unit 4 - Ingest and Basic Stats              |
| 2:10pm     | 2:50pm   | 40 min | Unit 5 - Finance in R                        |
| 2:50pm     | 3:00pm   | 10 min | Wrap-Up                                      |

# Organization and Introduction

## Instructor: Michael Weylandt

- Ph.D. student in the department of statistics
- Formerly quantitative analyst at Morgan Stanley in NYC
- Author and contributor to several `R` packages
- `R` mentor for Google Summer of Code
- I study machine learning, optimization, finance, and neuroscience

## Teaching Assistant: Kate Shoemaker

- Ph.D. student in the department of statistics
- Works on Bayesian methods in radiology and image processing

## Getting to know you

How much previous programming experience do you have?
- Never
- Some
- Advanced
- Some with R
- Advanced with R

Today: pair programming

# What is `R`? 

From http://r-project.org: 

> `R` is a language and environment for statistical computing and graphics. 

Let's unpack this: 

- `language`: 
    - `R` is *command based* (not point-and-click)
    - reproducible
    - automatable
    - composable
    - extensible
- `environment`
    - `R` comes with a rich built-in library (much more so than most general-purpose languages)
    - optimized for interactive use
    - large user community and base of extensions
- `computing` 
- `graphics`

## `R`: Popularity

![R is popular](http://sogrady-media.redmonk.com/sogrady/files/2018/03/lang.rank_.118.png)

By most measures a top 15 programming language -- most popular "specialty" language

Extremely popular for statistics -- especially data science, finance, and academia

## `R`: History

![K and R](http://cs.mcgill.ca/~rwest/wikispeedia/wpcd/images/39/3956.jpg)

![C](https://upload.wikimedia.org/wikipedia/en/5/5e/The_C_Programming_Language_cover.svg)

![S](https://d1w7fb2mkkr3kw.cloudfront.net/assets/images/book/lrg/9780/4128/9780412830402.jpg)

![R&R](https://snipcademy.com/binf/img/tutorials/r/introduction/creators.jpg)

![Logo](https://www.r-project.org/logo/Rlogo.png)

## `R`: Functionality

## `R`: Installation

## `R`: Getting Started

There are two main ways to use `R`: 

- Interactively
   - Enter commands at the console and get immediate feedback
   - Good for exploratory work / figuring things out
- Scripted (batch)
   - Write commands in a text file to be executed in bulk
   - Good for standard reports and reproducibility
   
Today we will focus (exclusively) on the first mode

When you start `R`, you will be presented with a prompt, which typically looks like this: 

    > 
    
From here, you type commands and hit `ENTER` after the command is completed

`R` will try to interpret your command

If your command is not "complete," you will get a "continuation" prompt: 

    + 
    
where you can continue your command. 

Example: 

    > 3 + 5
    [1] 8

Here my command was `3 + 5` and the output was `8`. 

The `[1]` is the index of the first element printed on that line; it helps readability when the output spans several lines (which we will discuss later)
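
As a small sketch of the two console behaviours described above (the continuation prompt and the bracketed index), try the commands below. The range `101:130` is only an illustrative choice; it creates the integers from 101 to 130.

```r
# An incomplete command: the first line ends with a dangling `+`,
# so at the console R shows the `+` continuation prompt until the
# command is finished on the next line
3 +
  5

# A longer vector wraps across several output lines; each line starts
# with the index (in brackets) of the first element shown on that line
print(101:130)
```

How many elements fit on a line depends on the width of your console, so the bracketed indices you see may differ from your neighbour's.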

## `R`: Using Locally

By default, `R` is used from the command line: 

![shell](https://snipcademy.com/binf/img/tutorials/r/introduction/r-startup.png)


Many people prefer to use an IDE instead. RStudio is among the most popular: 

![RStudio](https://snipcademy.com/binf/img/tutorials/r/introduction/rstudio.png)

Today we will be using *Jupyter notebooks*!

## Jupyter Notebooks

- Web-based interface to programming languages
- Can support multiple languages -- today we are using `R`
- Oriented around *cells*
  - Code in a cell is executed as a unit
  - To edit a cell, click on it and start typing
  - To run a cell, hit `<Shift>+<Enter>`
  - Two types of cells: Markdown (text) + Code
    - Code cells have a little "In [ ]:" next to them to identify them
    
Try opening `Binder` now

# Getting Started

Click the cell below and press `<Shift>+<Enter>` to run it

In [None]:
library(ggplot2)
library(nycflights13)
library(dplyr)
library(maps)
library(mapdata)

states <- map_data("state")

delays <- flights %>% group_by(year, month, dest, origin) %>%
                      filter(arr_delay > 0) %>%
                      summarize(mean_delay = median(arr_delay),
                                n=n()) %>%
                      filter(!is.na(mean_delay)) %>%
          inner_join(airports %>% select(faa, dest_lat=lat, dest_lon=lon),
                    by=c("dest"="faa")) %>%
          inner_join(airports %>% select(faa, origin_lat=lat, origin_lon=lon),
                    by=c("origin"="faa"))


ggplot(delays) +
    geom_segment(aes(x=origin_lon,
                     y=origin_lat,
                     xend=dest_lon,
                     yend=dest_lat,
                     lwd=n,
                     color=mean_delay)) +
    scale_color_gradient(low="white", high="red") +
    geom_polygon(data=states,
                 aes(x=long, y=lat, group=group),
                 fill="green4",
                 alpha=0.2,
                 color = "white") +
    guides(fill="none",
           alpha="none",
           lwd="none",
           color=guide_legend(title="Median Delay (min)",
                              breaks=c(15, 30, 60, 120))) +
    xlab("Longitude") + ylab("Latitude") +
    coord_fixed(xlim=range(states$long),
                ylim=range(states$lat), ratio=1.3) +
    ggtitle("Median Arrival Delay Leaving NYC by Route") +
    theme_bw() + theme(legend.position="bottom") 

By lunch, you'll know all the elements of this sequence of commands

## `R`: Fundamentals

### Data Types

`R` is organized around *vectors* -- collections of objects of the same type

The main types you should be aware of are: 

- numeric (real numbers) and integer -- used for numerical data (essentially interchangeable)
- character 
- factor -- categorical data with a fixed set of categories ("Low", "Medium", "High")
- dates (`Date`) and date-times (`POSIXct`)
  [times are tricky in any programming language, so we won't use them today]

There are also container types: `list`s and `data.frame`s. We'll talk more about DFs below. 
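
A quick way to see these types is the `class` function. The sketch below uses made-up values (the names `sizes` and `mixed` are illustrative only) and the `<-` assignment operator, which is introduced a little further down.

```r
# class() reports the type of an object
class(1.5)                      # "numeric"
class(2L)                       # "integer" (the L suffix marks an integer)
class("a")                      # "character"

# A factor stores categorical data with a fixed set of levels
sizes <- factor(c("Low", "High", "Low"),
                levels = c("Low", "Medium", "High"))
class(sizes)                    # "factor"
levels(sizes)                   # "Low" "Medium" "High"

class(as.Date("2018-03-01"))    # "Date"

# A vector holds a single type, so mixing types coerces every element
mixed <- c(1, "a")
class(mixed)                    # "character"

# Containers can hold elements of different types
class(list(1, "a"))                               # "list"
class(data.frame(id = 1:2, name = c("a", "b")))   # "data.frame"
```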

For each type, there is an associated missing data (NA) element. 

`R` is one of the few major languages with built-in NA handling (distinct from NaN) -- generally "does the right thing."

A scalar is a vector of length 1

In [1]:
## Vector of length 1
1.5

In [2]:
## Vector of length 5
c(1, 2, 3, 4, 5)

In [3]:
## Vector of characters
c("a", "b", "c")

In [None]:
## Arithmetic is "vectorized" 
1 + c(2, 3, 4)

In [4]:
c(1, 2, 3) + c(2, 3, 4)
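
Vectorization is not limited to `+`. As a minimal sketch with arbitrary numbers, most built-in functions work element by element, and a shorter vector is reused ("recycled") to match a longer one:

```r
# Functions are applied to each element in turn
sqrt(c(1, 4, 9))             # 1 2 3

# The shorter vector is recycled: 1*10, 2*100, 3*10, 4*100
c(1, 2, 3, 4) * c(10, 100)   # 10 200 30 400
```

Adding a single number to a vector, as in `1 + c(2, 3, 4)`, is the simplest case of recycling.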

In [None]:
## NA works as expected
c(1, NA, 2, 3) - 5
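
To make the difference between NA (missing data) and NaN (an undefined numeric result) concrete, here is a short sketch; the vector `v` is made up for illustration.

```r
v <- c(1, NA, 2, 3)

is.na(v)                # FALSE TRUE FALSE FALSE
mean(v)                 # NA: a missing input gives a missing result
mean(v, na.rm = TRUE)   # 2: drop the missing value first

0 / 0                   # NaN: undefined arithmetic, not missing data
is.na(c(NA, NaN))       # TRUE TRUE   (is.na counts NaN as missing too)
is.nan(c(NA, NaN))      # FALSE TRUE  (is.nan only flags NaN)
```

Many summary functions accept `na.rm = TRUE`, so you choose explicitly when missing values should be ignored.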

### Assignment
To work with data, we assign it a name using the `<-` operator (`=` also works)

In [5]:
x <- c(1, 2, 3)
y <- 5
x + y
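
A few more things worth knowing about names, sketched with throwaway values (`X` is an illustrative extra name):

```r
x <- c(1, 2, 3)
x             # typing a name on its own prints its value

x <- x * 2    # assigning to an existing name replaces the old value
x             # 2 4 6

y = 5         # `=` also assigns at the top level
X <- 100      # names are case-sensitive: X and x are different objects

x + y         # 7 9 11
```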

### Extraction
To pull elements out of a vector, use `[ ]`

In [6]:
x <- c(4, 5, 6)
x[2]

In [7]:
x[4]

In [8]:
## Negative indexing drops elements
x[-2]
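
The index inside `[ ]` can itself be a vector. A minimal sketch using the same `x` as above:

```r
x <- c(4, 5, 6)

x[c(1, 3)]    # 4 6  : several positions at once
x[2:3]        # 5 6  : a range of positions
x[x > 4]      # 5 6  : a logical condition keeps the TRUE elements
x[4]          # NA   : asking past the end gives a missing value

x[2] <- 50    # the same syntax replaces elements
x             # 4 50 6
```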

If you have a matrix or data frame, you need to give both a _row_ and a _column_ index: `x[row, column]`

Leaving one of them out means "take all" along that dimension

In [13]:
x <- matrix(1:9, ncol=3, byrow=TRUE)
x

         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    4    5    6
    [3,]    7    8    9


In [10]:
x[,2]

In [12]:
x[1, 2]
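
The same row-and-column indexing works on data frames. The sketch below uses a small made-up data frame; `df`, `id`, and `score` are hypothetical names chosen for illustration.

```r
m <- matrix(1:9, ncol = 3, byrow = TRUE)
m[2, ]       # 4 5 6 : second row, all columns
m[1:2, 3]    # 3 6   : rows 1 and 2 of the third column

# A data frame can also be indexed by column name
df <- data.frame(id = 1:3, score = c(90, 85, 70))
df[2, ]               # second row, all columns
df[, "score"]         # 90 85 70
df$score              # the same column, using the $ shortcut
df[df$score > 80, ]   # only the rows where score is above 80
```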

## `R`: Functions

Most of what we want to do in `R` is already implemented with `functions`

(You can write your own functions, but we won't do that today)

Functions are called with their name followed by parentheses

- If the function needs input, put it in the parentheses
- Arguments can be matched by position or by name: the matching rules are a bit intricate, but generally work as you'd expect

In [21]:
## The runif function creates a random vector
print(runif)

function (n, min = 0, max = 1) 
.Call(C_runif, n, min, max)
<bytecode: 0x7f8bda063e98>
<environment: namespace:stats>


In [16]:
x <- runif(10)
x

The arguments `min` and `max` have default values (`0` and `1`) so you don't have to provide them

You can provide them by position or by name (I think by name is clearer)

In [None]:
runif(10, min=0, max=25)
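
To check that positional and named arguments really mean the same thing, one option is to reset the random number generator before each call. This is only a sketch: the seed `42` is arbitrary and the names `a`, `b`, and `d` are illustrative.

```r
set.seed(42)                       # fix the random numbers so calls are comparable
a <- runif(3, 0, 25)               # all arguments by position: n, min, max

set.seed(42)
b <- runif(3, min = 0, max = 25)   # first by position, the rest by name

set.seed(42)
d <- runif(max = 25, n = 3)        # all by name, in any order; min keeps its default of 0

identical(a, b)   # TRUE
identical(a, d)   # TRUE
```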

## `R`: How to get help

Easiest approach -- built in help system: 

If you want to know about a function, type

`?fun_name`

In [17]:
?runif

`example(fun_name)` is sometimes useful as well
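
A few other base `R` helpers are handy when you only half-remember a function. The commented-out lines are meant for interactive use, since they open help pages or run examples.

```r
args(runif)             # prints the argument list of a function
names(formals(runif))   # "n" "min" "max"

apropos("unif")         # names of available objects that contain "unif"

# help("runif")         # the long form of ?runif
# example(mean)         # runs the examples from the help page for mean
# ??"uniform"           # searches all installed help pages for a phrase
```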

## `R`: Packages and Ecosystem

`R` comes with a range of built-in functions (covering most basic statistics), but its power comes from a huge range of **add-ons** (called "packages") 

The biggest source -- CRAN = Comprehensive R Archive Network -- has over 12,000 packages freely available for download 

Example: [TTR](https://cran.r-project.org/web/packages/TTR/index.html)

Packages exist for basically everything you can think of -- when publishing a statistics paper, publishing an associated `R` package is *de rigueur* 


To install a package, use the `install.packages` function -- everything we'll use today is pre-installed

To *load* an *installed* package, use the confusingly-named `library` function

In [18]:
library(TTR)

In [20]:
print(runSD)

function (x, n = 10, sample = TRUE, cumulative = FALSE) 
{
    result <- sqrt(runCov(x, x, n, use = "all.obs", sample = sample, 
        cumulative))
    return(result)
}
<environment: namespace:TTR>


The `namespace` tells you what package a function comes from
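
You can also ask for the namespace directly, or use the `package::function` form to call a function without loading its package. The sketch below assumes nothing about your setup beyond base `R`: it only touches `TTR` if it is already installed, and the `install.packages` line is left commented out.

```r
# Which package does a function come from?
environmentName(environment(runif))     # "stats"

# pkg::fun refers to a function explicitly, without library()
stats::runif(2)
identical(stats::runif, runif)          # TRUE

# install.packages("TTR")   # one-time download from CRAN (not run here)

# Load TTR only if it is available on this machine
if (requireNamespace("TTR", quietly = TRUE)) {
  library(TTR)
  environmentName(environment(runSD))   # "TTR"
} else {
  message("TTR is not installed")
}
```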

# Exercises for the break

- Generate a vector of 25 standard normal random variables using the `rnorm` function
- Generate a vector of 25 normal random variables with mean 5
- Plot a histogram of your random variables with the `hist` function
- Plot a histogram of 25,000 normal random variables