<a href="https://colab.research.google.com/github/henrikalbihn/henrikalbihn/blob/main/intro2R.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Crash Course in R 🇷

## Welcome to Google Colab! 🎉

Colab (short for Colaboratory) is a hosted Python 3/R environment that runs on GCE (Google Compute Engine). It supports Jupyter notebooks (.ipynb files), which are stored in Google Drive.

> Why Colab over another R environment?

I like Colab because it requires basically no setup and Google makes it super easy to share code using Google Drive. That being said, you should still install [R](https://ftp.osuosl.org/pub/cran/) and [RStudio](https://www.rstudio.com/products/rstudio/download/#download) on your local machine. I still reach for RStudio as my IDE of choice for developing in R; Colab just makes sharing much easier.

> Why Jupyter notebooks over .R files?

I like using Jupyter notebooks because it's easy to teach other people what your code does using ***markdown*** and you can embed images, code cells, ***LaTeX***, and more. That being said, .R files have their use cases (like for automation) and we use both here at Westcliff.

> See links for more info on [Google Colab](https://research.google.com/colaboratory/faq.html) and [project Jupyter](https://jupyter.org/).


### Note: Colab for R
> Google Colab uses a *Python 3* runtime by default.

To ensure you are using R, open the ***Runtime*** menu in the top ribbon, click ***Change runtime type***, select R, and click ***Save***.

Your notebook will now run R code.

Another note: Colab does ***not*** support mounting your Google Drive directly into R notebooks. There are ways around it and we'll get to that in the Google Drive section.

## Hello, World! 👋 🌎

As with any intro to a programming language, we will start by calling one of the most basic functions: `print()`.

If it's not already obvious, `print()` prints a message to the console. We will begin with the tradition used by computer scientists since [1974](https://en.wikipedia.org/wiki/%22Hello,_World!%22_program)!

> **"Hello, world!"**

To run the code cell below, click the play button on the left side of the code cell or click inside the cell and press (CMD/CTRL + ENTER).

In a Jupyter notebook, instead of printing the output to the console, it prints just below the code cell.

In [None]:
print("Hello World!")

# You can also set the quote argument to FALSE
# to suppress quotes in the output.
print("Hello World!", quote = FALSE)

This code cell has ***comments*** (the lines starting with `#`). Comments do not run; they are used to explain what your code does.
> ***Note on commenting:***
Writing readable and easily understandable comments whenever possible is good practice. If someone else uses your code, comments may be the only way for them to make sense of your operations. The same goes for looking at your own code months or years later, so please ***write good comments***!

### Variable Assignment
> In R, there are two common ways to assign a variable:
1. a single equals sign `=` (like in Python)
2. a less-than sign followed by a dash `<-` (it looks like a left-facing arrow).

Despite requiring a little more typing, the `<-` sign is the generally accepted symbol for variable assignment and ***I suggest you use this every time***.

Notice that the function prints characters *within* the `string` in double quotes `" "`. You can also pass a ***variable*** to the `print()` function. Here we will create a variable called `message` and assign it the `string` value `"Hello, world!"`. This is called ***variable assignment***.

In [None]:
message <- "Hello, world!"

print(message)

Notice that the output is the ***same*** whether we assign the string to a variable or pass the string directly!

> In R, there are many different ways to code the same thing!

In [None]:
# an even simpler way to print the value of message
# without calling the print() function

message <- "Hello, world!"

message

We can also assign a ***vector*** of strings to a ***variable***.

In [None]:
l <- c("Hello", ",", " ", "world", "!")

# paste is a useful function to concatenate strings
print(paste(l, collapse=""))
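As a minimal sketch of how `paste()` behaves, the example below contrasts its `sep` and `collapse` arguments. The variable names (`words`, `joined`, `pairs`) are made up for illustration and are not part of the notebook.

```r
# Hypothetical example data
words <- c("Hello", "world")

# collapse joins the elements of ONE vector into a single string
joined <- paste(words, collapse = ", ")
print(joined)

# sep sets the separator between arguments that are pasted element by element
pairs <- paste(words, c("A", "B"), sep = "-")
print(pairs)

# paste0() is shorthand for paste(..., sep = "")
print(paste0("file", 1:3, ".csv"))
```

Use `collapse` when you want one string back, and `sep` when you want one result per element.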

## Data Types/Structures

R supports many data types; we will cover just the basics.

### Vectors

Vectors are the most basic R data objects and there are six types of atomic vectors.

They are *logical*, *integer*, *double*, *character*, *complex*, and *raw*.

1. ***Logical*** vectors hold boolean values: `TRUE` / `FALSE`
2. ***Integer*** vectors hold whole number values: `1L` / `0L` / `590L` (the `L` suffix marks an integer; a plain `1` is stored as a double)
3. ***Double*** vectors hold floating point numbers: `3.1` / `9.99` / `0.5`
4. ***Character*** vectors hold character *strings*: `"Hello"` / `'Sam'` / `"5"`
5. ***Complex*** vectors hold *complex* and *imaginary* numbers: `2+3i` / `5+0i` / `3+5i`
6. ***Raw*** vectors hold *raw bytes*, they print as hexadecimals: `01` / `06` / `ff` / `0a`

> We will mainly focus on the first four. ***Complex*** and ***Raw*** are rarely used in Data Analysis.
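To see the six atomic types side by side, here is a small sketch using `typeof()`, which reports how a value is stored. The names `examples` and `atomic_types` are hypothetical.

```r
# One example value per atomic type (hypothetical names)
examples <- list(TRUE, 1L, 1, "Hello", 2+3i, as.raw(255))

# typeof() reports the underlying storage type of each value
atomic_types <- sapply(examples, typeof)
print(atomic_types)

# A plain number is a double; the L suffix is needed for an integer
is.integer(590)   # FALSE
is.integer(590L)  # TRUE
```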

In [None]:
a <- c(1, 2, 5.3, 6, -2, 4) # numeric vector
b <- c("one", "two", "three") # character vector
c <- c(TRUE, T, TRUE, F, TRUE, FALSE) # logical vector (note: T and F are shorthand for TRUE and FALSE)
d <- c(1, 'two', TRUE, 2+3i) # mixing in one string coerces everything to character

class(a)
class(b)
class(c)
class(d) # notice that this evaluates as a character vector

> The `class()` function

You can use the `class()` function to check the data type of whatever you pass as the argument.

### Matrices/Arrays (Optional)

A Matrix is a two-dimensional data structure (rows x columns). All elements must be the same atomic type.

In [None]:
# generates 5 x 4 integer matrix
y <- matrix(1:20, nrow = 5, ncol = 4)
y
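The sketch below shows how elements of such a matrix can be read back with `[row, column]` indexing, and how the `byrow` argument changes the fill order. The name `y_byrow` is made up for this example.

```r
# Same 5 x 4 matrix as above; matrix() fills column by column by default
y <- matrix(1:20, nrow = 5, ncol = 4)

dim(y)    # 5 rows, 4 columns
y[2, 3]   # single element in row 2, column 3: 12
y[, 1]    # the entire first column
y[5, ]    # the entire fifth row

# byrow = TRUE fills the matrix row by row instead
y_byrow <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE)
y_byrow[2, 3]  # 7
```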

Arrays are similar to matrices, but can have more than two dimensions.

We do not use arrays very often (or matrices for that matter), so we will skip over these, but here's how you make one.

In [None]:
# Create two vectors of different lengths.
vector1 <- c(5, 9, 3)
vector2 <- c(10, 11, 12, 13, 14, 15)
column.names <- c("COL1", "COL2", "COL3")
row.names <- c("ROW1", "ROW2", "ROW3")
matrix.names <- c("Matrix1", "Matrix2")

# Take these vectors as input to the array.
result <- array(c(vector1, vector2),
                  dim = c(3, 3, 2),
                  dimnames = list(row.names, column.names, matrix.names))
print(result)

### Data Frames

Data Frames generalize matrices: the columns can be of different atomic types.
> Data Frames are used in ***Excel***, ***Google Sheets***, ***SAS***, ***SPSS***, ***Python*** (`pandas` module), and more so you should get very familiar with them!

It is common practice to name our Data Frames as `<name>.df` so they are easy to identify as such.

To combine vectors into a Data Frame, we use the `data.frame` function and pass the names of the vectors we would like to combine.

In [None]:
d <- c(1, 2, 3, 4)
e <- c("red", "white", "red", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)

# to combine vectors into a Data Frame we use the data.frame function
mydata.df <- data.frame(d, e, f)
mydata.names <- c("ID", "Color", "Passed") # char vector of var names
names(mydata.df) <- mydata.names # assign char vector to names()

mydata.df
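Once a Data Frame exists, you will usually want to inspect it and pull out columns or rows. The following is a minimal sketch that rebuilds the same small Data Frame so it runs on its own; `passed.df` is a hypothetical name.

```r
# Rebuild the small Data Frame with named columns
mydata.df <- data.frame(
  ID     = c(1, 2, 3, 4),
  Color  = c("red", "white", "red", NA),
  Passed = c(TRUE, TRUE, TRUE, FALSE)
)

str(mydata.df)         # compact overview: column names, types, first values
nrow(mydata.df)        # number of rows
mydata.df$Color        # one column as a vector
mydata.df[2, "Color"]  # row 2 of the Color column

# Keep only the rows where Passed is TRUE
passed.df <- mydata.df[mydata.df$Passed, ]
passed.df
```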

### Lists

Lists are like vectors, but the elements can be of different types. They can even be other lists, matrices, arrays, Data Frames, or any combination of them.

In [None]:
# example of a list with 4 components -
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred", mynumbers=c(1,2,3), mymatrix=y, age=5.3)
w
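Here is a short sketch of the common ways to get elements back out of a list. It uses a smaller version of `w` (without the matrix) so that it runs on its own.

```r
# A smaller version of the list above
w <- list(name = "Fred", mynumbers = c(1, 2, 3), age = 5.3)

w$name            # access an element by name with $
w[["mynumbers"]]  # [[ ]] returns the element itself (a numeric vector)
w["mynumbers"]    # [ ] returns a list of length one
w[[2]][3]         # third value of the second element
```

The difference between `[[ ]]` and `[ ]` matters when you pass the result to a function that expects a vector rather than a list.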

### Factors (Nominal & Ordinal)

Factors are R's name for ***categorical variables***. The categories may have no inherent order, as with male & female.
> We call these ***nominal*** factors.

In [None]:
# variable gender with 20 "male" entries and
# 30 "female" entries
gender <- c(rep("male",20), rep("female", 30))
gender <- factor(gender)
# stores gender as 20 1s and 30 2s and associates
# 1=female, 2=male internally (alphabetically)
# R now treats gender as a nominal variable
gender

Factors can also be ***ordinal***, meaning their levels follow an order, as in `small` < `medium` < `large`. For ordinal factors, we must assign `levels` and set `ordered = TRUE`.

> Factors are stored internally as integers. So if your factor has 3 levels, then the character values will be stored as `1`, `2`, and `3`, where `1` is the first (lowest) level. They are converted back to labels for printing.

In [None]:
foodOrders <- c(rep("large", 10), rep("medium", 10), rep("small", 10))
sizeLevels <- c("small", "medium", "large") # listed from lowest to highest

# If we left the levels argument blank, the levels would be ordered
# alphabetically by default; ordered = TRUE makes the factor ordinal
foodOrders <- factor(foodOrders, levels = sizeLevels, ordered = TRUE)

foodOrders
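A short sketch of how an ordered factor stores and compares its values. The `sizes` data and the name `sizes.f` are made up for this example.

```r
# Hypothetical example data
sizes <- c("small", "large", "medium", "small")

# Levels are listed from lowest to highest; ordered = TRUE makes the factor ordinal
sizes.f <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)

levels(sizes.f)      # the level labels, in order
as.integer(sizes.f)  # the integer codes stored internally: 1 3 2 1
table(sizes.f)       # counts per level
sizes.f < "large"    # comparisons work because the factor is ordered
```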

Factors are necessary for statistical tasks like ***classification*** and ***regression*** so I recommend you learn how to use them.

## Dependencies: Libraries/Modules/Packages 📚

R comes with built-in libraries (also called "packages" or "modules", I will use the terms interchangeably).

They are essentially stores of pre-written functions, data, and more that you can access and use in your code. This allows you to build off of very complicated implementations instead of having to build them from scratch.

> *P.S.* The one we will focus on for data analysis is called `tidyverse` 🌌. The Tidyverse is a collection of several useful packages put together so you only need to call one instead of many.

### The `library()` & `install.packages()` functions

To import a package, start with:

>`library(<packagename>)`

`library` is used to make functions, data, classes, and more available in the current environment.

The package **must** first be installed in your environment. You cannot import a package without installing it first.

To install a package, run:

>`install.packages("<packagename>")`


In Google Colab, packages are installed only for the current runtime session, so to use a package that is not installed by default, you must run `install.packages("<packagename>")` each time you start a new session. Try it out:

In [None]:
# The startup message can be ignored here;
# it shows which packages are in the tidyverse.
library(tidyverse)

# We can also suppress it using
# suppressMessages(library(tidyverse))
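Because installing a package can take a while, one common pattern is to install it only when it is missing. The helper below is a sketch of that idea; `ensure_package` is a hypothetical name, not a function from R or the tidyverse.

```r
# Hypothetical helper: install a package only when it is not already available
ensure_package <- function(pkg) {
  # requireNamespace() returns FALSE (instead of stopping) when pkg is missing
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package is available now
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so nothing is downloaded here
ensure_package("stats")
```

After the helper returns `TRUE`, you still need `library(<packagename>)` to import the package.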

Here's a useful snippet to show what packages are currently installed with their version number.

In [None]:
# show currently installed packages with version
ip <- as.data.frame(installed.packages()[, c(1, 3:4)])
ip <- ip[is.na(ip$Priority), 1:2, drop = FALSE]
ip

Here are a few useful packages that `library(tidyverse)` does not load. `googlesheets4` & `googledrive` are installed with the `tidyverse` but are not among its core packages, so `library(tidyverse)` does not attach them; import them explicitly with `library()`, and install them first if R cannot find them. `RMariaDB` is not part of the tidyverse and must be installed separately.

In [None]:
# install.packages('googlesheets4')
# install.packages('googledrive')
install.packages('RMariaDB')

You can also call a specific function from a library without importing the whole library, using the `::` operator:

>`<packagename>::<function>()`
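A minimal sketch of calling functions through their package namespace with the `::` operator. It uses `stats` and `utils`, which ship with R, so nothing needs to be installed; the variable names are made up.

```r
# Hypothetical example data
values <- c(2, 4, 4, 4, 5, 5, 7, 9)

# <packagename>::<function>() calls the function without library()
middle    <- stats::median(values)
spread    <- stats::sd(values)
first_two <- utils::head(values, 2)

middle     # 4.5
first_two  # 2 4
```

This form is also handy when two packages define functions with the same name, because it states which one you mean.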

## Welcome to the Tidyverse 🌌

The `tidyverse` is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. This makes them very easy to use together.

### Working with Pipes `%>%`

Pipes `%>%` are a cool feature of the Tidyverse. They come from the `magrittr` package but are re-exported by `dplyr` as well. Pipes allow you to chain multiple operations, passing a *tibble* or *data frame* from one function to the next.
> `f(x, y)` becomes `x %>% f(y)`.

You can also think of a pipe as a `then` operation. Here's the pseudocode:

```
data %>%
  firstFunction(y) %>%
  differentFunction(z) %>%
  anotherFunction(z)
```

Above we are passing our data into one function, `then` another, `then` another. We can do this as many times as we like.
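To make the idea concrete, here is a small runnable sketch that compares a nested call with the equivalent piped call. It assumes the `magrittr` package is installed (it comes with the tidyverse); the names `x`, `nested`, and `piped` are made up.

```r
library(magrittr)  # provides %>%; library(dplyr) or library(tidyverse) works too

x <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- round(sum(sqrt(x)), 1)

# Piped call: take x, then sqrt, then sum, then round
piped <- x %>%
  sqrt() %>%
  sum() %>%
  round(1)

identical(nested, piped)  # TRUE
```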

> Pipes are awesome when coupled with the `dplyr` package to make your code super readable.

### Data manipulation with `dplyr` 🧰

`dplyr` (pronounced dee-plier) is the main workhorse for data manipulation. It provides a small set of verbs, such as `select()`, `mutate()`, `group_by()`, and `summarise()`, that cover most everyday data manipulation tasks.

As you may know, R comes with some built-in datasets and the Tidyverse gives you access to even more.

Let's try the `starwars` dataset, which ships with `dplyr`. If you've already run `library(tidyverse)`, you can just type `starwars` to return the data.

In [None]:
library(tidyverse)
starwars

Note the `class` of `starwars`. It's a ***tibble*** (*tbl* short for *table* hence ***tibble***).

In `tidyverse` speak, you can think of a ***tibble*** as a "*better*" version of a data frame, but for our purposes they are virtually the same thing. See [here](https://tibble.tidyverse.org/) for more info on tibbles.

We can select just certain columns with `select()`

In [None]:
starwars %>%
  select(name, skin_color, eye_color)

There are also helper functions for select like `starts_with()`, `ends_with()`, `matches()`, and `contains()`.

In [None]:
# selects only columns whose name ends with "color"
starwars %>%
  select(ends_with("color"))

We can also use `mutate()` to create new columns and change existing ones.

In [None]:
# creates new column height_m for height in meters
starwars %>%
  mutate(height_m = height / 100)

Let's string some operations together and reorder our columns.

In [None]:
# creates height_m
# then selects new column, old height followed by everything else
# this reorders the output table without changing the data (reassigning)
starwars %>%
  mutate(height_m = height / 100) %>%
  select(height_m, height, everything())

Another method for reordering columns is `relocate()`

In [None]:
starwars %>%
  relocate(sex:homeworld, .before = height)

Finally, let's use `group_by()` and `summarise()` to build a summary table with the average height (`avg_height`) and average mass (`avg_mass`) for each combination of `species` and `sex`.

In [None]:
starwars %>%
  group_by(species, sex) %>%
  summarise(
    avg_height = mean(height, na.rm = TRUE),
    avg_mass = mean(mass, na.rm = TRUE),
    .groups = "drop" # return an ungrouped tibble
  )
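To check what `group_by()` and `summarise()` compute, it can help to try them on data small enough to verify by hand. The sketch below uses a made-up `pets` tibble, not a dataset from the notebook, and assumes `dplyr` is installed.

```r
library(dplyr)

# Hypothetical toy data, small enough to check the averages by hand
pets <- tibble(
  species = c("cat", "cat", "dog", "dog"),
  weight  = c(4, 6, 10, NA)
)

pet_summary <- pets %>%
  group_by(species) %>%
  summarise(
    n          = n(),                        # rows per group
    avg_weight = mean(weight, na.rm = TRUE)  # na.rm = TRUE skips the missing value
  )

pet_summary
```

The cats average (4 + 6) / 2 = 5, and the dogs average 10 because the `NA` is dropped before the mean is taken.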

### Importing data with `readr` and `readxl` 👓

The `readr` package provides `read_csv()` (note the underscore), a slightly improved version of the base-R `read.csv()`. It is typically faster, returns a tibble, and fits neatly into a pipeline.

Example in pseudocode:

```
df <- read_csv("data.csv") %>%
  select(just, these, columns)
```

Run the cell below to read in the example dataset and assign it to the `df` variable.

In [None]:
df <- read_csv(readr_example("challenge.csv"))
df

The `readxl` package provides the `read_excel()` function for reading .xlsx files.

```
library(readxl)
df1 <- read_excel('~/file.xlsx')
```

### Plotting with `ggplot2` 📈

Need a graph?
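As a minimal sketch of what a `ggplot2` plot looks like, the example below draws a scatter plot from `mtcars`, a dataset built into R that is used here only as stand-in data. It assumes `ggplot2` is installed; the plot title and the name `p` are made up.

```r
library(ggplot2)

# Map columns to axes with aes(), then add a layer of points
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title = "Fuel economy vs. weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )

p  # in a notebook, printing the object draws the plot
```

Plots are built in layers joined with `+`: the data and axis mapping come first, then one or more `geom_*()` layers, then labels.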

## Working with `googledrive` and `googlesheets4` ⚡

In [None]:
library(googledrive)
library(googlesheets4)

# get Google Sheet dribble metadata
# (when run interactively, you may be prompted to choose a Google account)
mysheet.dribble <- drive_get("210928 IR Reg Drive Import")
as_sheets_id(mysheet.dribble)
mysheet.dribble

# make tibble with main sheet data (sheet 1)
reg.drive.tib <- read_sheet(mysheet.dribble, sheet = 1)

## The Current Environment 🧹

To print all variables in the current environment, we can use the `ls()` function.

In [None]:
ls()

The snippet below cleans the current R environment by deleting all variables in memory.

> This is useful for debugging your code and keeping your environment tidy.

The `rm()` function removes objects from the current environment. Its `list` argument takes a character vector of object names; in this case, we pass the result of `ls()` with the `all` argument (short for `all.names`) set to `T` or `TRUE`, so that hidden objects whose names start with a dot are included too.

In [None]:
# This function cleans your environment
# don't run this cell in Colab, it usually crashes your runtime
# you can copy and paste into RStudio and try it out
rm(list=ls(all=T))
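If wiping everything is more than you need, `rm()` can also remove selected objects. The following is a sketch with made-up variable names; the `pattern` argument of `ls()` takes a regular expression.

```r
# Hypothetical scratch variables
tmp_a   <- 1
tmp_b   <- 2
keep_me <- "important"

rm(tmp_a)                         # remove one object by name
rm(list = ls(pattern = "^tmp_"))  # remove every object whose name starts with "tmp_"

exists("tmp_b")    # FALSE
exists("keep_me")  # TRUE
```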

# Additional Resources

Below are just a few additional resources you may find useful when you have questions or need practice.

*   [Stack Overflow](https://stackoverflow.com/) is a Q&A board where you can ask coding questions and get answers from experts. Most questions I've asked are answered the same day. Make sure to tag your questions with `R` and the packages in question.
> Stack Overflow is public, so be sure not to post any sensitive data.
*   [GitHub](https://github.com/) is a hosting service for `git` repositories. You can back up your code and enforce version control. I recommend you set up an account and learn how to use it. We do not currently use GitHub as a department, but would like to in the future.
> The same goes for GitHub: make your repositories private and don't push anything sensitive.
*   [DataCamp](https://www.datacamp.com/) is an interactive learning platform that focuses on coding, Data Analytics, and Data Science. You can try it for free, but the membership is $25/month. I recommend it because I've used it extensively. There are free alternatives like [Free Code Camp](https://www.freecodecamp.org/learn/data-analysis-with-python/), but I haven't used it much.

## Package Documentation

Various documentation by package:
* [tidyverse](https://www.tidyverse.org/packages/#import): package overview
* [dplyr](https://dplyr.tidyverse.org/reference/index.html): API reference
* [ggplot2](https://ggplot2.tidyverse.org/reference/index.html): API reference
* [readr](https://readr.tidyverse.org/reference/index.html): API reference

## More useful functions
```
length(object) # number of elements or components
str(object)    # structure of an object
class(object)  # class or type of an object
names(object)  # names

c(object,object,...)       # combine objects into a vector
cbind(object, object, ...) # combine objects as columns
rbind(object, object, ...) # combine objects as rows

object     # prints the object

ls()       # list current objects
rm(object) # delete an object

newobject <- edit(object) # edit copy and save as newobject
fix(object)               # edit in place

```
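A few of these functions in action, as a small sketch with made-up vectors `a` and `b`:

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

length(a)              # 3 elements
combined <- c(a, b)    # one vector of length 6
by_col <- cbind(a, b)  # 3 x 2 matrix, one column per vector
by_row <- rbind(a, b)  # 2 x 3 matrix, one row per vector

str(by_col)  # structure of the combined matrix
dim(by_row)  # 2 3
```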