# Lecture 3.1:  Data transformation

<div style="border: 1px double black; padding: 10px; margin: 10px">

**Goals for today's lecture:**
* Learn [how to manipulate data](#Data-manipulation), including:
    * [Filtering data](#Filtering-data)
    * [Arranging (sorting) rows](#Arranging-rows)
    * [Selecting columns](#Selecting-columns)
    
This lecture note corresponds to Chapter 5.1--5.4 of your book.
</div>


## Data manipulation
Manipulating data is an important part of data science, and there are a lot of built-in commands for doing it in R:
```{r}
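# Traditional data manipulation commands in R (listed for reference, not run)
# subset()     keep rows and columns that match a condition
# aggregate()  compute summary statistics by group
# merge()      join two data frames
# reshape()    convert between wide and long formats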
```
These commands are old and somewhat difficult to use. Instead of the traditional commands, we are going to focus on the `dplyr` package for manipulating data. It provides a nice suite of replacements for the traditional commands, which have a consistent, unified interface and interoperate nicely with each other.

The `dplyr` package is part of `tidyverse`, so we can just load the `tidyverse` package to use the tools in `dplyr`.

We will be using the `nycflights13` data set for this lecture. It does not come with tidyverse. If you are running Jupyter on your own computer you will first need to `install.packages("nycflights13")`.  This data set is about flights departing from the NYC area in 2013.  You have worked with part of this data set in Homework 2.

```{r}
# install.packages('nycflights13')
library(tidyverse)
library(nycflights13)
```

```
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.1     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
```



```{r}
print(flights)
```

```
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1
```

A tibble is similar to a data frame, and we will learn more about it later in the course. For now, you can treat it as a data frame.

Notice the types of the variables above. They include:

* **int** integers
* **dbl** double precision floating point numbers
* **chr** character vectors, or strings
* **dttm** date-time (a date along with a time)

Other types available in R but not represented above include:

* **lgl** logical (either `TRUE` or `FALSE`)
* **fct** factor (categorical variable with a fixed number of possible values)
* **date** date

### Filtering data
The first operation we'll learn about is filtering. Filtering means "keep only the rows which match these criteria". The syntax for the `filter` command is
```{r}
filter(<TIBBLE>, <LOGICAL CRITERIA>)
```
This command returns a new tibble whose rows all match the specified criteria.

#### Types of logical criteria
For those who are new to programming, we now briefly review the sorts of logical criteria that you can specify for commands like `filter()`. The basic comparison operators in R are `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal). The first four compare numbers and work exactly as in mathematics:
```{r}
> 1 > 1
[1] FALSE
> 1 >= 1
[1] TRUE
> 2.5 < 3
[1] TRUE
> 2.5 <= 3
[1] TRUE
```

#### Assignment vs. equality
An extremely common mistake for beginner programmers is to confuse `=` and `==` ("double equals") when writing code. As we have seen,
- `=` is used for
    - assigning a value to a variable, and
    - passing a named parameter into a function. 
- `==` is used for testing equality. 


```{r}
> a = 1  # assigns the number 1 to a
> b = 2  # assigns the number 2 to b
> a == 1 # tests that a equals 1
[1] TRUE
> b == 1 # tests that b equals 1
[1] FALSE
```

#### Boolean operations
Logical expressions are combined using *boolean operations*. The basic boolean operations are `and`, `or`, and `not`, denoted `&`, `|` and `!` respectively.

There are also doubled versions of `&` and `|` denoted `&&` and `||`. Do not use them here. We will return to these later in the course.
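
As a minimal sketch, here are the three operators applied to a small vector. The vector `x` and the result names are made up for illustration. Each operator works element by element:

```r
# Hypothetical example vector
x = c(1, 5, 10)

both    = x > 2 & x < 8   # AND: TRUE only where both conditions hold
either  = x < 2 | x > 8   # OR: TRUE where at least one condition holds
negated = !(x > 2)        # NOT: flips TRUE and FALSE

both     # FALSE  TRUE FALSE
either   #  TRUE FALSE  TRUE
negated  #  TRUE FALSE FALSE
```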

Another useful operator is `%in%`:
```r
x %in% y
```

This returns `TRUE` for each element of `x` that is found in the vector `y`.
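
A short illustration with made-up values (the vector `y` and the result names are not from the lecture data):

```r
# Hypothetical lookup vector
y = c(10, 11, 12)

single = 12 %in% y            # one value on the left gives one answer
many   = c(1, 11, 12) %in% y  # a vector on the left gives one answer per element

single  # TRUE
many    # FALSE  TRUE  TRUE
```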

### Missing data
When working with real data, you will often encounter missing observations. R has a special value, `NA`, for representing missing data. You can think of the value of `NA` as "I don't know". Thus, most logical and mathematical operations involving `NA` will again return `NA`, so that `NA`s "propagate through" the computation.
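
The following sketch shows this propagation on a few simple expressions. The variable names are chosen only for illustration:

```r
na_compare = NA > 5              # is an unknown value greater than 5? Unknown.
na_sum     = NA + 10             # unknown plus 10 is still unknown
na_equal   = NA == NA            # two unknowns may or may not be equal
na_mean    = mean(c(1, NA, 3))   # one missing entry makes the mean unknown

na_compare  # NA
na_sum      # NA
na_equal    # NA
na_mean     # NA
```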

Since you cannot test `NA`s for equality, R has a special function, `is.na()`, for determining whether a value is `NA`.
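
A small example comparing the two approaches, using a made-up vector `x`:

```r
# Hypothetical vector with one missing value
x = c(3, NA, 7)

wrong_test = x == NA         # comparing with NA never gives TRUE or FALSE
right_test = is.na(x)        # is.na() reports which entries are missing
n_missing  = sum(is.na(x))   # TRUE counts as 1, so this counts the NAs

wrong_test  # NA NA NA
right_test  # FALSE  TRUE FALSE
n_missing   # 1
```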

### Examples of filtering
Let's use what we have just learned to evaluate some simple queries on the `flights` dataset. Let's first narrow down to all flights that departed on December 31.
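
One way to write this query, sketched with the `&` operator, is to combine a condition on `month` with a condition on `day`:

```r
library(tidyverse)
library(nycflights13)

# Keep rows where month is 12 AND day is 31; the result is printed, not stored
filter(flights, month == 12 & day == 31)
```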

An alternative way is to pass multiple arguments to `filter`. `R` will interpret multiple arguments as `AND` in the `filter` function.
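
As a sketch, the December 31 query can be written with comma-separated conditions, which should select the same rows as joining them with `&`:

```r
library(tidyverse)
library(nycflights13)

# Each comma-separated condition must hold for a row to be kept
filter(flights, month == 12, day == 31)
```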

#### Remark
Using `==` for testing equality is very important in `R`. Inside `filter()`, writing `=` instead of `==` produces an error, because `R` reads it as an attempt to pass a named argument.
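
The sketch below triggers this mistake on purpose. It wraps the call in base R's `tryCatch()` so that the error is caught instead of stopping the session. The replacement message string is made up for illustration and is not the text that `dplyr` prints:

```r
library(tidyverse)
library(nycflights13)

# tryCatch() intercepts the error so the rest of the code keeps running
outcome = tryCatch(
  filter(flights, month = 12),   # wrong: `=` instead of `==`
  error = function(e) "filter() raised an error"
)
outcome
```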

The above code just displayed the filtered rows. To store the results for later use, assign the output of `filter()` to a variable.

If you want to assign as well as print, enclose the whole assignment in parentheses.
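
A minimal sketch of both forms, using the hypothetical variable name `dec31`:

```r
library(tidyverse)
library(nycflights13)

# Assignment only: the result is stored and nothing is printed
dec31 = filter(flights, month == 12, day == 31)

# Assignment wrapped in parentheses: the result is stored and also printed
(dec31 = filter(flights, month == 12, day == 31))
```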

### Question:
Let's filter down to all flights which were in the last quarter of the year (October through December). That is, we want flights whose `month` is 10, 11, or 12.
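
One direct way to answer this, sketched here, is to chain three equality tests with `|`:

```r
library(tidyverse)
library(nycflights13)

# A row is kept if its month equals 10, or 11, or 12
filter(flights, month == 10 | month == 11 | month == 12)
```

Note that `month == 10 | 11 | 12` would not work as intended, because each side of `|` has to be a complete logical expression.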

We can save some typing by using the `%in%` operator together with the `:` (colon) operator. The colon operator takes two integers and returns a vector of all the integers between them, so `a:b` is the same as `c(a, a+1, ..., b-1, b)`.
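
Putting the two operators together, the last-quarter query could be sketched as follows:

```r
library(tidyverse)
library(nycflights13)

10:12   # 10 11 12

# Keep rows whose month appears in the vector 10:12
filter(flights, month %in% 10:12)
```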

### `between` function
The above pattern occurs so often that there is a special `between()` function.
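
A sketch of the last-quarter query using `between()`, which includes both endpoints:

```r
library(tidyverse)
library(nycflights13)

# between(x, left, right) is TRUE where left <= x and x <= right
filter(flights, between(month, 10, 12))
```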

### Counting matches
Sometimes we just want to know how many observations match a given filter. The `nrow()` command can be used to count the number of rows in a data table.

Let us count how many flights in our data have a missing departure time.
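
One possible approach is to filter on `is.na(dep_time)` and count the remaining rows. The variable name `n_missing_dep` is made up for illustration:

```r
library(tidyverse)
library(nycflights13)

# Rows where dep_time is NA, then count them
n_missing_dep = nrow(filter(flights, is.na(dep_time)))
n_missing_dep
```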

How about the number of flights departing between January and March?
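
Reading "between January and March" as months 1 through 3 inclusive, which is an assumption about the intended range, a sketch of the count is:

```r
library(tidyverse)
library(nycflights13)

# Flights whose month is 1, 2, or 3, then count the rows
n_first_quarter = nrow(filter(flights, between(month, 1, 3)))
n_first_quarter
```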

### Arranging rows

`arrange` can order rows of a data frame using a variable name (or a more complicated expression). If you provide multiple expressions to order by, it uses the second one to break ties in the first one, third one to break ties in the second one, and so on.

For example, `arrange(flights, month, day)` sorts the data by month and then by day, so the top-most rows have the earliest month, followed by the earliest day.
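
As runnable code, this sort looks like the sketch below. The name `by_date` is hypothetical:

```r
library(tidyverse)
library(nycflights13)

# Sort by month first; ties within a month are broken by day
by_date = arrange(flights, month, day)
head(by_date)
```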

`desc()` sorts a column in descending order.
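
For instance, to see the most delayed departures first, one could sort on `dep_delay` in descending order. The choice of column and the name `most_delayed` are just for illustration:

```r
library(tidyverse)
library(nycflights13)

# Largest departure delays at the top
most_delayed = arrange(flights, desc(dep_delay))
head(most_delayed)
```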

Missing values are always placed at the end by `arrange`, even when you use `desc()`. In contrast, `filter` drops rows for which the condition evaluates to `NA` unless you explicitly ask for them using `is.na()`.
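
The toy tibble below (made up for illustration, not part of `flights`) sketches both behaviors:

```r
library(tidyverse)

# Hypothetical tibble with one missing value
df = tibble(x = c(5, 2, NA))

ascending  = arrange(df, x)$x          # NA goes last
descending = arrange(df, desc(x))$x    # NA still goes last
dropped    = filter(df, x > 1)$x       # the NA row is dropped
kept       = filter(df, is.na(x) | x > 1)$x   # the NA row is requested explicitly

ascending   #  2  5 NA
descending  #  5  2 NA
dropped     #  5  2
kept        #  5  2 NA
```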

### Selecting columns

`select` is used to keep only a few variables of interest to the current analysis. It is most useful when working with data frames involving a large number of variables.
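
A minimal sketch that keeps only the three date columns (the choice of columns is illustrative):

```r
library(tidyverse)
library(nycflights13)

# Keep only year, month, and day; all other columns are dropped
dates_only = select(flights, year, month, day)
names(dates_only)   # "year" "month" "day"
```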

### What is `%>%`?
Under the hood, `x %>% f(y)` turns into `f(x, y)`, so the object on the left of the pipe becomes the first argument of the function on the right.
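
To illustrate, the two calls in this sketch should produce the same columns. The variable names are hypothetical:

```r
library(tidyverse)
library(nycflights13)

piped  = flights %>% select(year, month, day)   # pipe form
direct = select(flights, year, month, day)      # ordinary function call

names(piped)    # "year" "month" "day"
names(direct)   # "year" "month" "day"
```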

You can change the names of variables when selecting them, using the form `new_name = old_name`.
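
For example, this sketch selects the `tailnum` column under the made-up new name `tail_num`:

```r
library(tidyverse)
library(nycflights13)

# Only the renamed column is kept
tails = select(flights, tail_num = tailnum)
names(tails)   # "tail_num"
```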

Note that `select` drops any variables not explicitly mentioned. To just rename some variables while keeping all others, use `rename`.
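
A sketch of `rename` with the same illustrative new name `tail_num`. Unlike `select`, all 19 columns remain:

```r
library(tidyverse)
library(nycflights13)

# Rename tailnum to tail_num and keep every other column unchanged
renamed = rename(flights, tail_num = tailnum)
ncol(renamed)   # 19
```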

If there are a lot of variables, you can save yourself some typing by using `:` and `-` in combination with `select`. The colon operator selects a range of variables.
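
As a sketch, `year:day` picks every column from `year` through `day` in the order they appear in the tibble:

```r
library(tidyverse)
library(nycflights13)

# All columns positioned from year up to and including day
date_range = select(flights, year:day)
names(date_range)   # "year" "month" "day"
```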

The negative sign lets you select everything but certain columns.
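
For instance, the following sketch drops a single column (`tailnum`, chosen arbitrarily) and keeps the rest:

```r
library(tidyverse)
library(nycflights13)

# Every column except tailnum
no_tailnum = select(flights, -tailnum)
ncol(no_tailnum)   # 18
```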

You can use `-` and `:` together to drop a whole range of columns.
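
A sketch that removes the three date columns at once. The parentheses make the minus sign apply to the whole range:

```r
library(tidyverse)
library(nycflights13)

# Drop every column from year through day
no_dates = select(flights, -(year:day))
ncol(no_dates)   # 16
```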

If you want to move a few variables to the beginning, you can use `everything()` to refer to the remaining variables.
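
In this sketch, `time_hour` and `air_time` (an arbitrary choice) are moved to the front while all other columns follow in their original order:

```r
library(tidyverse)
library(nycflights13)

# Named columns come first; everything() fills in the rest
reordered = select(flights, time_hour, air_time, everything())
names(reordered)[1:2]   # "time_hour" "air_time"
```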

In addition, there are some helper functions that only work inside `select()`.

* `starts_with()`, `ends_with()`, `contains()`
* `matches()`
* `num_range()`

You can consult the documentation or type `?select` at the prompt to learn more about these. Here's just one example of their use.

The call `select(flights, contains("time"))` selects all the columns whose names contain the string "time".
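
Run as code, the helper looks like the sketch below. The variable name `time_cols` is made up:

```r
library(tidyverse)
library(nycflights13)

# Columns whose names contain "time" anywhere
time_cols = select(flights, contains("time"))
names(time_cols)
```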

### Slicing rows
The `slice` function selects specific rows of your data set by position. Type `?slice` to learn more about the function.
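
A brief sketch of `slice` with illustrative row positions:

```r
library(tidyverse)
library(nycflights13)

first_five = slice(flights, 1:5)          # rows 1 through 5
odd_rows   = slice(flights, c(1, 3, 5))   # rows 1, 3, and 5 only

nrow(first_five)   # 5
nrow(odd_rows)     # 3
```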