In [1]:
# load tidyverse and the nycflights13 dataset
library(nycflights13)
library(tidyverse)

Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
Registered S3 method overwritten by 'rvest':
  method            from
  read_xml.response xml2
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.1       ✔ purrr   0.3.2  
✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
✔ tidyr   0.8.3       ✔ stringr 1.4.0  
✔ readr   1.3.1       ✔ forcats 0.4.0  
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


 Get nycflights13
- Name: r-nycflights13
- Run:
 * conda activate "your R environment"
 * conda install r-nycflights13
 * launch jupyter notebook
- Note: To install other R packages, follow the same steps (install with conda from the terminal) rather than installing from inside Jupyter Notebook.

In [2]:
#flights
#print(nycflights13::flights, n = 5, width = 500) # indicating to run flights from the nycflights13 library
print(flights, n = 5, width = 500)

# A tibble: 336,776 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
  arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
      <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
1        11 UA        1545 N14228  EWR    IAH        227     1400     5     15
2        20 UA        1714 N24211  LGA    IAH        227     1416     5     29
3        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40
4       -18 B6         725 N804JB 

### Some types of variables in R
- ```int```: integers
- ```dbl```: "double" or real numbers in double precision
- ```chr```: "character" vectors or strings
- ```dttm```: "date + time"
- ```lgl```: "logical" or vectors of Boolean values (i.e. ```TRUE``` or ```FALSE```)
- ```date```: dates
- ```fct```: "factor" or a categorical variable with a fixed possible set of unique values
- ```ord```: "ordered" or a categorical variable with an ordering on the possible values
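
As a rough illustration of how the types listed above correspond to ordinary R objects, the sketch below builds one value of each kind and inspects it with ```typeof()``` and ```class()```. The variable names and values are made up for this example.

```r
# One value of each type from the list above (names are illustrative only)
an_int  <- 5L                                              # integer
a_dbl   <- 2.5                                             # double
a_chr   <- "JFK"                                           # character
a_lgl   <- TRUE                                            # logical
a_date  <- as.Date("2013-01-01")                           # date
a_dttm  <- as.POSIXct("2013-01-01 05:15:00", tz = "UTC")   # date + time
a_fct   <- factor(c("carni", "herbi", "omni"))             # factor
an_ord  <- factor(c("low", "high"),
                  levels = c("low", "high"), ordered = TRUE)  # ordered factor

typeof(an_int)   # "integer"
typeof(a_dbl)    # "double"
class(a_chr)     # "character"
class(a_lgl)     # "logical"
class(a_date)    # "Date"
class(a_dttm)    # "POSIXct" "POSIXt"
class(a_fct)     # "factor"
class(an_ord)    # "ordered" "factor"
```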

We will use the ```dplyr``` package to select subsets of, and modify, datasets stored as tibbles.
- See [here](https://github.com/tidyverse/dplyr/issues/1857) for a discussion of why it is called ```dplyr```.

In [11]:
# only look at the flights on January 1st
print(filter(flights, month == 1, day == 1))

# A tibble: 842 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# … with 832 more rows, and 11 

Functions in ```dplyr``` never modify the original data frame. They return a new one, so assign the result to a variable if you want to keep it.

In [14]:
jan1_flights <- filter(flights, month == 1, day == 1)

print(flights, n = 5)
print(jan1_flights, n = 5)

# A tibble: 336,776 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
# … with 3.368e+05 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 842 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517 

### Comparisons
- ```==```: equal to
- ```!=```: not equal to
- ```<```: less than
- ```>```: greater than
- ```<=```: less than or equal to
- ```>=```: greater than or equal to

In [15]:
2 == 2
2 != 2
2 > 3
2 <= 2

In [16]:
(1 / 49) * 49 == 1
sqrt(2)^2 == 2

In [None]:
# use near to compare floating point numbers
near((1 / 49) * 49,  1)
near(sqrt(2)^2,  2)
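
The ```==``` comparisons in the earlier cell come back ```FALSE``` because floating-point arithmetic is only approximate. As a rough sketch of the idea behind comparing "nearly equal" numbers, the base R snippet below checks whether two values differ by less than a small tolerance. The helper name ```roughly_equal``` and the tolerance ```1e-8``` are assumptions made for this illustration.

```r
# Hypothetical helper: two numbers count as equal if they differ by less than tol
roughly_equal <- function(x, y, tol = 1e-8) {
  abs(x - y) < tol
}

(1 / 49) * 49 == 1                # FALSE: rounding error in the last digits
roughly_equal((1 / 49) * 49, 1)   # TRUE
sqrt(2)^2 == 2                    # FALSE
roughly_equal(sqrt(2)^2, 2)       # TRUE
```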

Attempt: Create a new dataset consisting of all flights on the third day of every month after June.

In [18]:
# all flights on the third day of every month after June, excluding June
print(filter(flights, month > 6, day == 3))
#print(filter(flights, month >= 7, day == 3))

# all flights on the third day of every month after June, including June
print(filter(flights, month >= 6, day == 3))

# A tibble: 5,618 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    10     3      453            500        -7      636            648
 2  2013    10     3      512            517        -5      739            757
 3  2013    10     3      541            545        -4      826            855
 4  2013    10     3      541            545        -4      920            933
 5  2013    10     3      546            545         1      822            827
 6  2013    10     3      546            550        -4      917            932
 7  2013    10     3      550            600       -10      646            708
 8  2013    10     3      550            600       -10      844            858
 9  2013    10     3      552            600        -8      651            659
10  2013    10     3      552            600        -8      656            711
# … with 5,608 more rows, and

### Logical Operators

```filter()``` combines comma-separated comparisons with the ```&``` ("and") logical operator. The logical operators are:
- ```&```: and
- ```|```: or
- ```!```: not

Examples: 

```x & y``` -> "x and y"

```x | !y``` -> "x or not y"

![logical_operators](https://d33wubrfki0l68.cloudfront.net/01f4b6d39d2be8269740a3ad7946faa79f7243cf/8369a/diagrams/transform-logical.png)
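
To make the operators concrete, here is a small base R sketch that applies them element by element to two logical vectors. The vectors ```x``` and ```y``` are made up so that all four TRUE/FALSE combinations appear.

```r
# Two logical vectors covering every TRUE/FALSE combination
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

x & y    # TRUE FALSE FALSE FALSE  (both must be TRUE)
x | y    # TRUE  TRUE  TRUE FALSE  (at least one is TRUE)
!x       # FALSE FALSE  TRUE  TRUE (flips each value)
x | !y   # TRUE  TRUE FALSE  TRUE  ("x or not y")
```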

In [19]:
# all flights on the third day of every month after June, excluding June
print(filter(flights, (month > 6) & (day == 3)))

# A tibble: 5,618 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    10     3      453            500        -7      636            648
 2  2013    10     3      512            517        -5      739            757
 3  2013    10     3      541            545        -4      826            855
 4  2013    10     3      541            545        -4      920            933
 5  2013    10     3      546            545         1      822            827
 6  2013    10     3      546            550        -4      917            932
 7  2013    10     3      550            600       -10      646            708
 8  2013    10     3      550            600       -10      844            858
 9  2013    10     3      552            600        -8      651            659
10  2013    10     3      552            600        -8      656            711
# … with 5,608 more rows,

In [20]:
# all flights on the third day of a month OR in a month after June, excluding June
print(filter(flights, (month > 6) | (day == 3)))

# A tibble: 176,211 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     3       32           2359        33      504            442
 2  2013     1     3       50           2145       185      203           2311
 3  2013     1     3      235           2359       156      700            437
 4  2013     1     3      458            500        -2      650            650
 5  2013     1     3      520            525        -5      830            820
 6  2013     1     3      532            530         2      851            831
 7  2013     1     3      535            540        -5      835            850
 8  2013     1     3      543            545        -2     1009           1022
 9  2013     1     3      550            600       -10      843            846
10  2013     1     3      552            600        -8      759            801
# … with 176,201 more rows,

In [None]:
# all flights (NOT on the third day of a month) OR in a month after June, excluding June
print(filter(flights, (month > 6) | !(day == 3)))

In [None]:
# all flights NOT (on the third day of a month OR in a month after June, excluding June)
print(filter(flights, !((month > 6) | (day == 3))))
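
Negating an "or" gives the same result as "and"-ing the two negations (De Morgan's law), so the last filter can also be written without the outer ```!```. The sketch below checks this on a small made-up tibble rather than on ```flights```; the name ```toy``` and its values are invented for the example.

```r
library(dplyr)

# Toy data standing in for flights (values are invented)
toy <- tibble(month = c(1, 7, 7, 3, 12), day = c(3, 3, 10, 5, 3))

# NOT (after June OR on the third day)
a <- filter(toy, !((month > 6) | (day == 3)))
# (NOT after June) AND (NOT on the third day)
b <- filter(toy, !(month > 6) & !(day == 3))
# the same condition written with the opposite comparisons
d <- filter(toy, month <= 6, day != 3)

identical(a, b)   # TRUE
identical(b, d)   # TRUE
a                 # only the row with month 3, day 5 remains
```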

In [21]:
print(flights, width = 500)

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
   arr_delay carrier flight

Attempt: Create the following datasets (hint: use parentheses)
- All flights on the third day or fifth day of each month.
- All flights departing after 8:00am, excluding flights from United Airlines (flights w/ carrier UA)
- All flights with destination Miami (MIA) or with an origin other than JFK

In [5]:
print(filter(flights, dest == "MIA" | (origin != "JFK")))

# A tibble: 228,811 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      554            600        -6      812            837
 5  2013     1     1      554            558        -4      740            728
 6  2013     1     1      555            600        -5      913            854
 7  2013     1     1      557            600        -3      709            723
 8  2013     1     1      558            600        -2      753            745
 9  2013     1     1      558            600        -2      923            937
10  2013     1     1      559            600        -1      941            910
# … with 228,801 more rows,

In [4]:
# All flights on the third day or fifth day of each month.
print(filter(flights, day == 3 | day == 5), n = 10)

# A tibble: 22,069 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     3       32           2359        33      504            442
 2  2013     1     3       50           2145       185      203           2311
 3  2013     1     3      235           2359       156      700            437
 4  2013     1     3      458            500        -2      650            650
 5  2013     1     3      520            525        -5      830            820
 6  2013     1     3      532            530         2      851            831
 7  2013     1     3      535            540        -5      835            850
 8  2013     1     3      543            545        -2     1009           1022
 9  2013     1     3      550            600       -10      843            846
10  2013     1     3      552            600        -8      759            801
# … with 2.206e+04 more rows

In [29]:
# all flights departing after 8:00am, excluding United Airlines (UA) flights
# (dep_time is stored as a number in HHMM form, so 8:00am is 800; R reads 0800 as 800)
print(filter(flights, dep_time > 0800, carrier != "UA"))

# A tibble: 227,794 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      801            805        -4      900            919
 2  2013     1     1      803            810        -7      903            925
 3  2013     1     1      804            810        -6     1103           1116
 4  2013     1     1      805            805         0     1015           1005
 5  2013     1     1      805            800         5     1118           1106
 6  2013     1     1      805            815       -10     1006           1010
 7  2013     1     1      807            810        -3     1043           1043
 8  2013     1     1      809            815        -6     1043           1050
 9  2013     1     1      810            810         0     1048           1037
10  2013     1     1      810            815        -5     1100           1128
# … with 227,784 more rows,

In [None]:
# all flights departing after 8:00am OR not operated by United Airlines
print(filter(flights, dep_time > 0800 | carrier != "UA"))

Instead of stringing together many "or"s we can use ```%in%```.

The function ```c( , , ... , )``` creates a *vector*.
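
As a quick base R sketch of both ideas, the snippet below builds a vector with ```c()``` and shows that ```%in%``` gives the same answer as a chain of "or"s. The vector names and the destination codes in ```dest``` are made up for illustration.

```r
# c() combines its arguments into a single vector
targets <- c("MIA", "ATL")
dest    <- c("IAH", "MIA", "ATL", "ORD")   # made-up destinations

dest %in% targets                 # FALSE  TRUE  TRUE FALSE
dest == "MIA" | dest == "ATL"     # FALSE  TRUE  TRUE FALSE (same result)
```

With only two values the two forms are equally short, but ```%in%``` stays readable as the list of allowed values grows.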

In [30]:
# all flights going to Miami or Atlanta
# literally: all flights whose destination is in the vector c("MIA", "ATL")
print(filter(flights, dest %in% c("MIA", "ATL")), width = 100)

# A tibble: 28,943 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      542            540         2      923            850
 2  2013     1     1      554            600        -6      812            837
 3  2013     1     1      600            600         0      837            825
 4  2013     1     1      606            610        -4      858            910
 5  2013     1     1      606            610        -4      837            845
 6  2013     1     1      607            607         0      858            915
 7  2013     1     1      615            615         0      833            842
 8  2013     1     1      623            610        13      920            915
 9  2013     1     1      655            700        -5     1002           1020
10  2013     1     1      658            700        -2      944            939
   arr_delay carrier flight 

In [31]:
# all flights excluding those going to Miami or Atlanta
# literally: all flights where destination is NOT in the vector c("MIA", "ATL")
print(filter(flights, !(dest %in% c("MIA", "ATL"))), width = 100)

# A tibble: 307,833 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      544            545        -1     1004           1022
 4  2013     1     1      554            558        -4      740            728
 5  2013     1     1      555            600        -5      913            854
 6  2013     1     1      557            600        -3      709            723
 7  2013     1     1      557            600        -3      838            846
 8  2013     1     1      558            600        -2      753            745
 9  2013     1     1      558            600        -2      849            851
10  2013     1     1      558            600        -2      853            856
   arr_delay carrier flight

In [7]:
# dataset describing mammalian sleep times
msleep

name,genus,vore,order,conservation,sleep_total,sleep_rem,sleep_cycle,awake,brainwt,bodywt
Cheetah,Acinonyx,carni,Carnivora,lc,12.1,,,11.9,,50.000
Owl monkey,Aotus,omni,Primates,,17.0,1.8,,7.0,0.01550,0.480
Mountain beaver,Aplodontia,herbi,Rodentia,nt,14.4,2.4,,9.6,,1.350
Greater short-tailed shrew,Blarina,omni,Soricomorpha,lc,14.9,2.3,0.1333333,9.1,0.00029,0.019
Cow,Bos,herbi,Artiodactyla,domesticated,4.0,0.7,0.6666667,20.0,0.42300,600.000
Three-toed sloth,Bradypus,herbi,Pilosa,,14.4,2.2,0.7666667,9.6,,3.850
Northern fur seal,Callorhinus,carni,Carnivora,vu,8.7,1.4,0.3833333,15.3,,20.490
Vesper mouse,Calomys,,Rodentia,,7.0,,,17.0,,0.045
Dog,Canis,carni,Carnivora,domesticated,10.1,2.9,0.3333333,13.9,0.07000,14.000
Roe deer,Capreolus,herbi,Artiodactyla,lc,3.0,,,21.0,0.09820,14.800


```NA``` represents a missing value. Missing values are contagious: almost any operation involving an ```NA``` returns ```NA```.

In [8]:
2 + NA
3 > NA
"NA" == NA
NA == NA

In [9]:
# output whether a value is missing or not
x <- NA
is.na(x)
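
Because ```NA``` means "unknown", a logical operation only returns ```NA``` when the answer really depends on the unknown value. The base R sketch below shows this, along with ```is.na()``` applied to a whole vector; the vector ```rem``` is made up for the example.

```r
NA > 5        # NA: we cannot tell without knowing the value
NA & FALSE    # FALSE: the result is FALSE whatever the missing value is
NA | TRUE     # TRUE: the result is TRUE whatever the missing value is
NA & TRUE     # NA: depends on the missing value

# is.na() works element by element on vectors
rem <- c(1.8, NA, 2.4, NA)     # made-up values with two gaps
is.na(rem)                     # FALSE  TRUE FALSE  TRUE
sum(is.na(rem))                # 2 missing values
mean(rem)                      # NA, because the missing values are contagious
mean(rem, na.rm = TRUE)        # 2.1, after dropping the missing values
```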

In [11]:
(filter(msleep, sleep_rem != -10000))

name,genus,vore,order,conservation,sleep_total,sleep_rem,sleep_cycle,awake,brainwt,bodywt
Owl monkey,Aotus,omni,Primates,,17.0,1.8,,7.00,0.01550,0.480
Mountain beaver,Aplodontia,herbi,Rodentia,nt,14.4,2.4,,9.60,,1.350
Greater short-tailed shrew,Blarina,omni,Soricomorpha,lc,14.9,2.3,0.1333333,9.10,0.00029,0.019
Cow,Bos,herbi,Artiodactyla,domesticated,4.0,0.7,0.6666667,20.00,0.42300,600.000
Three-toed sloth,Bradypus,herbi,Pilosa,,14.4,2.2,0.7666667,9.60,,3.850
Northern fur seal,Callorhinus,carni,Carnivora,vu,8.7,1.4,0.3833333,15.30,,20.490
Dog,Canis,carni,Carnivora,domesticated,10.1,2.9,0.3333333,13.90,0.07000,14.000
Goat,Capri,herbi,Artiodactyla,lc,5.3,0.6,,18.70,0.11500,33.500
Guinea pig,Cavis,herbi,Rodentia,domesticated,9.4,0.8,0.2166667,14.60,0.00550,0.728
Grivet,Cercopithecus,omni,Primates,lc,10.0,0.7,,14.00,,4.750


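Even though there are no data points with ```sleep_rem == -10000```, ```filter``` still removed rows. ```filter``` keeps only the rows where the condition is ```TRUE```, so rows where it evaluates to ```NA``` (here, rows with a missing ```sleep_rem```) are dropped along with the ```FALSE``` ones. You can explicitly ask to keep those rows with ```is.na()```.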

In [13]:
(filter(msleep, is.na(sleep_rem) | sleep_rem != -10000))

name,genus,vore,order,conservation,sleep_total,sleep_rem,sleep_cycle,awake,brainwt,bodywt
Cheetah,Acinonyx,carni,Carnivora,lc,12.1,,,11.9,,50.000
Owl monkey,Aotus,omni,Primates,,17.0,1.8,,7.0,0.01550,0.480
Mountain beaver,Aplodontia,herbi,Rodentia,nt,14.4,2.4,,9.6,,1.350
Greater short-tailed shrew,Blarina,omni,Soricomorpha,lc,14.9,2.3,0.1333333,9.1,0.00029,0.019
Cow,Bos,herbi,Artiodactyla,domesticated,4.0,0.7,0.6666667,20.0,0.42300,600.000
Three-toed sloth,Bradypus,herbi,Pilosa,,14.4,2.2,0.7666667,9.6,,3.850
Northern fur seal,Callorhinus,carni,Carnivora,vu,8.7,1.4,0.3833333,15.3,,20.490
Vesper mouse,Calomys,,Rodentia,,7.0,,,17.0,,0.045
Dog,Canis,carni,Carnivora,domesticated,10.1,2.9,0.3333333,13.9,0.07000,14.000
Roe deer,Capreolus,herbi,Artiodactyla,lc,3.0,,,21.0,0.09820,14.800


Similarly we can use ```select``` to choose a subset of the columns of a dataframe. This is useful when dealing with datasets with many, many variables.

In [34]:
print(select(flights, year, month, day, origin))

# A tibble: 336,776 x 4
    year month   day origin
   <int> <int> <int> <chr> 
 1  2013     1     1 EWR   
 2  2013     1     1 LGA   
 3  2013     1     1 JFK   
 4  2013     1     1 JFK   
 5  2013     1     1 LGA   
 6  2013     1     1 EWR   
 7  2013     1     1 EWR   
 8  2013     1     1 LGA   
 9  2013     1     1 JFK   
10  2013     1     1 LGA   
# … with 336,766 more rows


In [35]:
# select all columns between year and day (including year and day)
print(select(flights, year:day))

# A tibble: 336,776 x 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# … with 336,766 more rows


In [38]:
# select all columns *excluding* those between year and day (including year and day)
print(select(flights, -(year:day)))

# A tibble: 336,776 x 16
   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
      <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
 1      517            515         2      830            819        11 UA     
 2      533            529         4      850            830        20 UA     
 3      542            540         2      923            850        33 AA     
 4      544            545        -1     1004           1022       -18 B6     
 5      554            600        -6      812            837       -25 DL     
 6      554            558        -4      740            728        12 UA     
 7      555            600        -5      913            854        19 B6     
 8      557            600        -3      709            723       -14 EV     
 9      557            600        -3      838            846        -8 B6     
10      558            600        -2      753            745         8 AA     
# … with 336,766 more rows,

Run ```?select``` to see options for helper functions to use with ```select```.

In [39]:
?select

In [40]:
print(select(flights, starts_with("d")))

# A tibble: 336,776 x 5
     day dep_time dep_delay dest  distance
   <int>    <int>     <dbl> <chr>    <dbl>
 1     1      517         2 IAH       1400
 2     1      533         4 IAH       1416
 3     1      542         2 MIA       1089
 4     1      544        -1 BQN       1576
 5     1      554        -6 ATL        762
 6     1      554        -4 ORD        719
 7     1      555        -5 FLL       1065
 8     1      557        -3 IAD        229
 9     1      557        -3 MCO        944
10     1      558        -2 ORD        733
# … with 336,766 more rows
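
Other helpers work the same way as ```starts_with()```. The sketch below tries ```ends_with()```, ```contains()``` and ```everything()``` on a one-row toy tibble whose column names mimic a few of those in ```flights```; the tibble itself is made up for this example.

```r
library(dplyr)

# One-row toy tibble with flights-like column names (values are invented)
toy <- tibble(
  dep_time = 517, dep_delay = 2,
  arr_time = 830, arr_delay = 11,
  carrier = "UA"
)

names(select(toy, ends_with("delay")))      # "dep_delay" "arr_delay"
names(select(toy, contains("time")))        # "dep_time" "arr_time"
names(select(toy, carrier, everything()))   # "carrier" first, then the rest
```

Putting ```everything()``` last is a handy way to move a few columns to the front without dropping the others.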
