# Introduction to R

The general goal of this workshop is to get you comfortable with [R](https://cloud.r-project.org/) so that it seems less scary and mystifying than it perhaps does now. Note that this is by no means a complete or thorough introduction to R! It’s just enough to get you started.

This workshop is relatively informal, example-oriented, and hands-on. We won’t spend much time examining language features in detail. Instead, we will work through an example and learn some things about R along the way.

As an example project we will analyze the popularity of baby names in the US from 1880 through 2008. Among the questions we will use R to answer are:

 * In which year did your name achieve peak popularity?
 * How many children were born each year?
 * What are the most popular names overall? For girls? For boys?

## Exercise 0

The purpose of this exercise is to give you an opportunity to explore the interface provided by R. You may not know how to do these things; that’s fine! This is an opportunity to figure it out.

Try to get R to add 2 plus 2.

In [None]:
2 + 2

In [None]:
sum(2, 2)

Now try to get R to calculate the square root of 10.

In [None]:
10^(1/2)

In [None]:
sqrt(10)

## R basics

### Function calls

The general form for calling R functions is

```r
FunctionName(arg.1 = value.1, arg.2 = value.2, ..., arg.n = value.n)
```

Arguments can be matched by name; unnamed arguments will be matched by position.
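
As a small illustration of these matching rules, the sketch below calls the base R function `seq()` in three ways. The specific values are arbitrary.

```r
# Arguments matched by position: from = 1, to = 10
seq(1, 10)

# Arguments matched by name, so the order does not matter
seq(to = 10, from = 1)

# Mixed: the first two are matched by position, `by` is matched by name
evens <- seq(2, 10, by = 2)
evens
```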

### Assignment

Values can be assigned names and used in subsequent operations:

 * The `<-` operator (less than followed by a dash) is used to save values
 * The name on the left gets the value on the right.


In [None]:
sqrt(10) # calculate square root of 10; result is not stored anywhere

In [None]:
x <- sqrt(10) # assign result to a variable named x

In [None]:
x

Names should start with a letter, and contain only letters, numbers, underscores, and periods.
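
One way to check these rules is the base R function `make.names()`, which returns its input unchanged when it is already a syntactically valid name. The candidate names below are made up for illustration.

```r
candidates <- c("baby.names", "x2", "my_var", "2x", "my var")

# make.names() repairs invalid names; valid names come back unchanged,
# so comparing input and output flags which candidates are valid
is.valid <- candidates == make.names(candidates)
data.frame(candidates, is.valid)
```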

### Getting data into R

R has data reading functionality built-in – see e.g., `?read.table`. However, faster and more robust tools are available, and so to make things easier on ourselves we will use a contributed package called `readr` instead. This requires that we learn a little bit about packages in R.

#### Installing and using R packages

A large number of contributed packages are available. If you are looking for a package for a specific task, https://cran.r-project.org/web/views/ and https://r-pkg.org are good places to start.

You can install a package in R using the `install.packages()` function. Once a package is installed you may use the `library` function to attach it so that it can be used.

In [None]:
install.packages('readr')

In [None]:
library(readr)

#### Readers for common file types

In order to read data from a file, you have to know what kind of file it is. The table below lists functions that can import data from common plain-text formats.

Data Type | Function
--- | ---
comma separated | `read_csv()`
tab separated | `read_tsv()`
other delimited formats | `read_delim()`
fixed width | `read_fwf()`
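
The sketch below shows how the choice of reader follows from the file format. It writes two tiny made-up files to a temporary location (the file contents and the names `csv.file` and `semi.file` are assumptions for illustration) and reads each one with a `readr` function.

```r
library(readr)

# Create two small example files with made-up contents
csv.file <- tempfile(fileext = ".csv")
writeLines(c("name,count", "Ann,3", "Bob,5"), csv.file)

semi.file <- tempfile(fileext = ".txt")
writeLines(c("name;count", "Ann;3", "Bob;5"), semi.file)

# Comma separated values: read_csv()
from.csv <- read_csv(csv.file)

# Semicolon separated values: read_delim() with an explicit `delim`
from.semi <- read_delim(semi.file, delim = ";")

from.csv
from.semi
```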

### Baby names data

The examples in this workshop use baby names data retrieved from https://www.nrscotland.gov.uk/statistics-and-data/statistics/statistics-by-theme/vital-events/names/babies-first-names or https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/livebirths. A cleaned and merged version of these data is available at https://raw.githubusercontent.com/hadley/data-baby-names/master/baby-names.csv.

## Exercise 1: Reading the baby names data

Make sure you have installed the `readr` package and attached it with `library(readr)`.

Baby names data are available at `https://raw.githubusercontent.com/hadley/data-baby-names/master/baby-names.csv`.

 1. Open the `read_csv` help page to determine how to use it to read in data.
 2. Read the baby names data using the `read_csv` function and assign the result with the name `baby.names`.


## Exercise 1 - solution

In [None]:
?read_csv

In [None]:
baby.names <- read_csv('https://raw.githubusercontent.com/hadley/data-baby-names/master/baby-names.csv')

## Popularity of your name

In this section we will pull out specific names and examine changes in their popularity over time.

The `baby.names` object we created in the last exercise is a `data.frame`, R's structure for tabular data. There are many other data structures in R, but for now we'll focus on working with data frames.
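
To make the idea of a data frame concrete, here is a minimal sketch that builds a small one by hand. The column names mirror those used with `baby.names` later in this workshop, but the values are made up for illustration and are not taken from the real data.

```r
# A tiny made-up table: each column is a vector, each row is one record
toy.names <- data.frame(
  year    = c(1990, 1990, 1991),
  name    = c("Alex", "Mark", "Alex"),
  percent = c(0.002, 0.006, 0.003),
  sex     = c("boy", "boy", "boy")
)

class(toy.names)    # the type of the object
dim(toy.names)      # number of rows and columns
head(toy.names, 2)  # the first two rows
toy.names$name      # a single column, extracted with $
```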

R has decent data manipulation tools built-in, but to make things easier on ourselves we will use a contributed package called `dplyr` instead.

In [None]:
install.packages('dplyr')

In [None]:
library(dplyr)

## Filtering and arranging data

One way to find the year in which your name was the most popular is to filter out just the rows corresponding to your name, and then arrange (sort) by `percent`.

To demonstrate these techniques we’ll try to determine whether “Alex” or “Mark” was more popular in 1992. We start by filtering the data so that we keep only rows where `year` is equal to 1992 and `name` is either “Alex” or “Mark”.

In [None]:
am <- filter(baby.names, year == 1992 & (name == 'Alex' | name == 'Mark'))

In [None]:
am

In the `filter()` call above, `&` means “and” and `|` means “or”.

In this case it's pretty easy to see that “Mark” is more popular, but to make it even easier we can arrange the data so that the most popular name is listed first.

In [None]:
arrange(am, percent)

In [None]:
arrange(am, desc(percent))

## Other logical operators

In the previous example we used `==` to filter rows. Other relational and logical operators are listed below.

Operator | Meaning
--- | ---
`==` | equal to
`!=` | not equal to
`>` | greater than
`>=` | greater than or equal to
`<` | less than
`<=` | less than or equal to
`%in%` | contained in

These operators may be combined with `&` (and) or `|` (or).
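
Each of these operators returns a vector of `TRUE`/`FALSE` values, one per element, which is what `filter()` uses to decide which rows to keep. A minimal sketch on a made-up vector of years:

```r
years <- c(1880, 1950, 1992, 2008)  # made-up example values

years == 1992             # FALSE FALSE  TRUE FALSE
years != 1992             #  TRUE  TRUE FALSE  TRUE
years >= 1950             # FALSE  TRUE  TRUE  TRUE
years %in% c(1880, 2008)  #  TRUE FALSE FALSE  TRUE

# Combining conditions with & (and) and | (or)
recent <- years > 1900 & years < 2000
either <- years < 1900 | years > 2000
recent                    # FALSE  TRUE  TRUE FALSE
either                    #  TRUE FALSE FALSE  TRUE
```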

## Exercise 2: Peak popularity of your name

In this exercise you will discover the year your name reached its maximum popularity.

 1. Use filter to extract data for your name (or another name of your choice).
 2. Arrange the data you produced in step 1 above by `percent`. In which year was the name most popular?
 3. BONUS (optional): Filter the data to extract only the row containing the most popular boy's name in 1999.

## Exercise 2 - solution

In [None]:
george <- filter(baby.names, name == 'George')

In [None]:
arrange(george, desc(percent))

In [None]:
boys.1999 <- filter(baby.names, year == 1999 & sex == 'boy')
boys.1999

In [None]:
filter(boys.1999, percent == max(percent))

## Plotting baby name trends over time

It can be difficult to spot trends when looking at summary tables. Plotting the data makes it easier to identify interesting patterns.

R has decent plotting tools built-in. However, to make things easier on ourselves we will use a package called `ggplot2` instead.

In [None]:
install.packages('ggplot2')

In [None]:
library(ggplot2)

For quick and simple plots we can use the `qplot` function. For example, we can plot the popularity of the name “Diana” over time like this:

In [None]:
diana <- filter(baby.names, name == 'Diana')

In [None]:
qplot(x = year, y = percent, data = diana)

In [None]:
qplot(x = year, y = percent, color = sex, data = diana)
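
Note that `qplot()` is deprecated as of ggplot2 3.4.0, so the calls above may emit a warning on current installations. A rough equivalent using `ggplot()` is sketched below. To keep it self-contained it uses a small made-up data frame (`toy.diana`, hypothetical values) in place of the real `diana` data.

```r
library(ggplot2)

# Made-up values standing in for the filtered `diana` data
toy.diana <- data.frame(
  year    = c(1960, 1970, 1980, 1990),
  percent = c(0.004, 0.003, 0.002, 0.001),
  sex     = c("girl", "girl", "girl", "girl")
)

# Same mapping as qplot(x = year, y = percent, color = sex, data = ...)
p <- ggplot(toy.diana, aes(x = year, y = percent, color = sex)) +
  geom_point()
p
```

With the real data, replace `toy.diana` with `diana`.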

Done!

There are hundreds of tutorials for R. If you are interested, try the step-by-step tutorials from [swirl](https://swirlstats.com/students.html).