### Additional Data Structures in R.

* A data structure is a simple scheme for organizing data according to a specific model
  * Data structures facilitate access to and manipulation of data
* So far, we've covered vectors.
* We are going to cover three new data structures:
  * Matrices, lists, and data frames
* Factors, another common data structure, will be covered when we discuss statistical analysis


### Factors

* A numerical value that maps to a label
  * Saves space when encoding data.
  * We can encode male as 1 and female as 2, thus saving 8 characters.
* We will cover factors later
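
As a brief preview ahead of the fuller treatment later, here is a minimal sketch of how a factor stores integer codes alongside labels. The `sex` vector is made up for illustration. Note that R assigns the codes in alphabetical order of the labels unless `levels` is given explicitly.

```r
# Hypothetical data, for illustration only
sex <- c("male", "female", "female", "male")

# By default, levels are sorted alphabetically: female = 1, male = 2
sex_default <- factor(sex)
levels(sex_default)
as.integer(sex_default)

# Passing `levels` fixes the mapping: male = 1, female = 2
sex_factor <- factor(sex, levels = c("male", "female"))
levels(sex_factor)
as.integer(sex_factor)
```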


### Matrices

* A matrix in R is a 2-dimensional collection of elements of the same type.
  * The two dimensions are the rows and the columns
  * Think of it as the mathematical matrix object
  
* The `matrix()` function is used to create a matrix
  * The easiest way to instantiate a matrix is to pass a vector of values
  * The matrix shape can be specified using `nrow` and/or `ncol`

 `matrix(some_vector_of_values, nrow=...)`

* We use the `dim()` function to get the dimensions of a matrix
  * The dimensions are the number of rows and the number of columns.
      * `dim()` returns a vector of length 2.
  * In other languages and in linear algebra, the dimensions are often referred to as the shape
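
A minimal sketch of inspecting a matrix's dimensions. The variable name `m` is chosen for illustration; `dim()`, `nrow()`, and `ncol()` are base R functions.

```r
m <- matrix(1:6, nrow = 3)

dim(m)      # vector of length 2: number of rows, then number of columns
nrow(m)     # number of rows only
ncol(m)     # number of columns only
dim(m)[2]   # same value as ncol(m)
```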



In [3]:
vec_of_vals = c(1,2,3,4,5,6)
matrix(vec_of_vals, nrow = 3)

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6


In [7]:
matrix(c(1,2,3,4,5,6) , ncol = 3)

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


In [5]:
# No need to provide both, unless you want to 
# truncate the data.
matrix(c(1,2,3,4,5,6) , ncol = 3, nrow=1)

     [,1] [,2] [,3]
[1,]    1    2    3


In [25]:
test_matrix = matrix(c(1,2,3,4,5,6) , ncol = 3)
print(test_matrix)

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


### Indexing Elements in a Matrix 

* Elements can be indexed using `row_id` and `col_id`
  * Indices need to be valid or an error is generated
* The index structure for a single element is:

    `matrix_name[row_index, col_index]`

* The index structure for a sub-matrix is:

    `matrix_name[vector_of_rows, vector_of_columns]`


In [7]:
test_matrix = matrix(1:30, nrow = 5)
print(test_matrix)

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    6   11   16   21   26
[2,]    2    7   12   17   22   27
[3,]    3    8   13   18   23   28
[4,]    4    9   14   19   24   29
[5,]    5   10   15   20   25   30


In [8]:
test_matrix[1,3]

In [10]:
# The following will generate an error

test_matrix[7,1]

ERROR: Error in test_matrix[7, 1]: subscript out of bounds


In [12]:
print(test_matrix[3,])

[1]  3  8 13 18 23 28


In [14]:
print(test_matrix[c(1,3),])

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    6   11   16   21   26
[2,]    3    8   13   18   23   28


In [15]:
test_matrix[c(1,3),c(1, 3, 4)]

     [,1] [,2] [,3]
[1,]    1   11   16
[2,]    3   13   18


### Using the colon operator to index rows and columns

* Colon (":") is an operator used to generate regular sequences
    *  e.g.: `1:6` generates the vector 1 2 3 4 5 6 
  
* Instead of using an explicit vector of positions to index on, we can use the colon operator
  
`matrix_name[row_index_start:row_index_end, col_index_start:col_index_end]`

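* Rules:
    * Both the start and the end of a range must be given; R does not accept `1:` or `:3`
    * To end at the last valid index, use `nrow(matrix_name)` or `ncol(matrix_name)` as the end of the range
    * If an index is left empty, as in `matrix_name[, 3:4]`, all the rows (or all the columns) are selected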
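
A minimal sketch of range indexing, using a 5 x 6 matrix built the same way as `test_matrix` (the name `m` is chosen for illustration). In R, both endpoints of `:` have to be written out, so `nrow()` and `ncol()` stand in for "the last valid index", and an empty index selects everything along that dimension.

```r
m <- matrix(1:30, nrow = 5)

m[2:4, 1:2]          # rows 2 to 4, columns 1 to 2
m[3:nrow(m), ]       # row 3 to the last row, all columns
m[, 4:ncol(m)]       # all rows, column 4 to the last column
m[, ]                # both indices empty: the whole matrix
```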


In [16]:
test_matrix[1:2,3]

In [18]:
test_matrix[,3:4]

     [,1] [,2]
[1,]   11   16
[2,]   12   17
[3,]   13   18
[4,]   14   19
[5,]   15   20


### Col and Row names

* Labeling the columns and rows is useful to:
    * Facilitate reading. 
      * E.g.: what does the first row represent
    * Facilitate access to the data: we can index the data by a readable label
    
    
* It can be easily done using:
  * `rownames()` and `colnames()` 
  * Passing a `list` of name vectors (row names first, then column names) to `dimnames()`.
    * We'll see this when we cover lists
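
As a preview, here is a minimal sketch of naming both dimensions in one step with `dimnames()`. The matrix `m` and the labels are chosen for illustration.

```r
m <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 3)

# dimnames() takes a list: row names first, then column names
dimnames(m) <- list(c("row_1", "row_2"),
                    c("col_1", "col_2", "col_3"))
print(m)

m["row_2", "col_3"]   # index by label
```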
    

In [19]:
test_matrix = matrix(c(1,2,3,4,5,6) , ncol = 3)    
print(test_matrix)

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


In [28]:
rownames(test_matrix) = c("row_1", "row_2")
print(test_matrix)

      [,1] [,2] [,3]
row_1    1    3    5
row_2    2    4    6


In [29]:
colnames(test_matrix) = c("col_1", "col_2", "col_3")
print(test_matrix)

      col_1 col_2 col_3
row_1     1     3     5
row_2     2     4     6


In [30]:
test_matrix["row_1", "col_2"]

In [32]:
test_matrix[1, 2]

In [36]:

test_matrix["row_1", 3]

### Combining Objects by Column or Row

* Another way to generate matrices is by combining vectors 
  * We can use `cbind()` to bind a collection of vectors as columns of a new matrix
  * We can use `rbind()` to bind a collection of vectors as rows of a new matrix

* Both `cbind()` and `rbind()` can also be used to add new columns or rows to an existing matrix
  * When vectors are bound, the columns or rows are automatically named after the variables

In [39]:
col_1 = c(1, 2)
col_2 = c(3, 4)
col_3 = c(5, 6)

my_col_matrix <- cbind(col_1, col_2, col_3)

print(my_col_matrix)

     col_1 col_2 col_3
[1,]     1     3     5
[2,]     2     4     6


In [40]:
print(my_col_matrix[, "col_1"])

[1] 1 2


In [43]:
row_1 = c(1, 2, 3)
row_2 = c(4, 5, 6)

my_row_matrix <- rbind(row_1, row_2)
print(my_row_matrix)

      [,1] [,2] [,3]
row_1    1    2    3
row_2    4    5    6


In [44]:
some_matrix = matrix(c(1,2,3,4,5,6), nrow=3)
print(some_matrix)

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6


In [45]:
print(cbind(some_matrix, c(7,8,9)))

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9


In [47]:
# can you explain the result below?
print(rbind(some_matrix, c(10,10)))

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
[4,]   10   10


In [49]:
# We can even use rbind or cbind to stack or concatenate matrices

print(rbind(some_matrix, some_matrix))

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
[4,]    1    4
[5,]    2    5
[6,]    3    6


In [71]:
print(cbind(some_matrix, some_matrix))

     [,1] [,2] [,3] [,4]
[1,]    1    4    1    4
[2,]    2    5    2    5
[3,]    3    6    3    6


## Data Frames

* A data frame (`df`) is an Excel-like table 
   
 * Instances (observations) are represented as rows
 * Variables are represented as columns  
   * All the observations of the same variable have the same data type

* While this type is conceptually similar to a matrix, the main difference is that `df` columns can have different types, while all the elements of a matrix must share a single data type.

* Generate a data frame using `data.frame()`
  * Takes vectors of the same length as input
  
* Column names are the variable names by default


In [51]:
nb_trna_genes <- c(20, 20, 15, 11)
str(as.integer(nb_trna_genes))

 int [1:4] 20 20 15 11


In [52]:
# genome length and tRNA gene counts are fictitious
species <- c("A. damnosum","A. adventoris","A. soli","A. lekithochrous")
genus <- c("Acidipro...","Apibacter","Aquaspirillum","Arcobacter")
genome_length <- c(3.2, 4.1, 4.23, 4.6)
nb_trna_genes <- as.integer(c(20, 20, 15, 11))
taxonomy <- data.frame(species, genus, genome_length, nb_trna_genes)
print(taxonomy)

           species         genus genome_length nb_trna_genes
1      A. damnosum   Acidipro...          3.20            20
2    A. adventoris     Apibacter          4.10            20
3          A. soli Aquaspirillum          4.23            15
4 A. lekithochrous    Arcobacter          4.60            11


In [53]:
colnames(taxonomy) = c("Species Name", "Genus", "Genome Length", "Nb. tRNA Genes")
print(taxonomy)

      Species Name         Genus Genome Length Nb. tRNA Genes
1      A. damnosum   Acidipro...          3.20             20
2    A. adventoris     Apibacter          4.10             20
3          A. soli Aquaspirillum          4.23             15
4 A. lekithochrous    Arcobacter          4.60             11


### Accessing datasets distributed with R

* We will use the `data()` function to access one of the pre-built datasets.
  * The `airquality` dataset represents New York air quality measurements from May to September

In [64]:
print(data())

In [65]:
data("airquality")
airquality

Ozone,Solar.R,Wind,Temp,Month,Day
<int>,<int>,<dbl>,<int>,<int>,<int>
41,190,7.4,67,5,1
36,118,8.0,72,5,2
12,149,12.6,74,5,3
18,313,11.5,62,5,4
,,14.3,56,5,5
28,,14.9,66,5,6
23,299,8.6,65,5,7
19,99,13.8,59,5,8
8,19,20.1,61,5,9
,194,8.6,69,5,10


### Some Common Functions for Data Frames

* `str()` to see a text description of the `str`ucture of the `df` 
  * Useful to inspect variables you're unfamiliar with.
  
* Display a subset of the data frame 
  * `head()`: top of the `df` 
  * `tail()`: bottom of the `df`
  

In [66]:
str(airquality)

'data.frame':	153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...


In [67]:
head(airquality)

,Ozone,Solar.R,Wind,Temp,Month,Day
,<int>,<int>,<dbl>,<int>,<int>,<int>
1,41,190,7.4,67,5,1
2,36,118,8.0,72,5,2
3,12,149,12.6,74,5,3
4,18,313,11.5,62,5,4
5,,,14.3,56,5,5
6,28,,14.9,66,5,6


In [68]:
# show n lines. Works with both head and tail
tail(airquality, n = 3)

,Ozone,Solar.R,Wind,Temp,Month,Day
,<int>,<int>,<dbl>,<int>,<int>,<int>
151,14,191,14.3,75,9,28
152,18,131,8.0,76,9,29
153,20,223,11.5,68,9,30


### Indexing in a Data Frame

* Data frames behave like matrices when it comes to indexing
    * simple format: `df_name[row_index, col_index]`
    * advanced format: `df_name[row_index_start:row_index_end, col_index_start:col_index_end]`
    
* To index a column, you can use the special shortcut `$`

```df_name$col_name```

* The indexing type determines whether a value, vector, or another dataframe will be generated
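
A minimal sketch of how the form of the index changes what comes back. The variable names are chosen for illustration; `drop = FALSE` is the base R argument that keeps a single column as a data frame.

```r
data("airquality")

one_value   <- airquality[1, "Ozone"]                   # single cell: a length-1 vector
one_column  <- airquality[1:5, "Ozone"]                 # one column: a vector
two_columns <- airquality[1:5, c("Ozone", "Wind")]      # several columns: a data frame
kept_as_df  <- airquality[1:5, "Ozone", drop = FALSE]   # one column, kept as a data frame

class(one_value)
class(one_column)
class(two_columns)
class(kept_as_df)
```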

In [156]:
head(airquality)

,Ozone,Solar.R,Wind,Temp,Month,Day
,<int>,<int>,<dbl>,<int>,<int>,<int>
1,41,190,7.4,67,5,1
2,36,118,8.0,72,5,2
3,12,149,12.6,74,5,3
4,18,313,11.5,62,5,4
5,,,14.3,56,5,5
6,28,,14.9,66,5,6


In [69]:
airquality[c(1,2,6), "Ozone"]

In [158]:
airquality[1:5, "Ozone"]

In [72]:
airquality[ , c("Ozone", "Wind")]

Ozone,Wind
<int>,<dbl>
41,7.4
36,8.0
12,12.6
18,11.5
,14.3
28,14.9
23,8.6
19,13.8
8,20.1
,8.6


In [73]:
airquality$Ozone

In [74]:
airquality$Ozone[1:5]

### Using Conditions to Filter `df` Rows

* A common way to subset the rows of a `df` is to use the `subset()` function

```
subset(my_df, some_condition)
```

* Logical operators used to describe the condition

| Operator | Description |
|---|---|
| < | less than |
| <= | less than or equal to |
| > | greater than |
| >= | greater than or equal to |
| == | exactly equal to |
| != | not equal to |
| ! `condition_x` | Not condition_x |
| `cond_x` \| `cond_y` | cond_x or cond_y |
| `cond_x` & `cond_y` | cond_x and cond_y |
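
A minimal sketch of how these operators combine. A condition is evaluated element by element and yields a logical vector, which `subset()` then uses to keep rows. The thresholds and variable names below are arbitrary, chosen for illustration.

```r
data("airquality")

# A condition is evaluated row by row and yields a logical vector
hot_and_windy <- (airquality$Temp > 90) & (airquality$Wind > 9)
sum(hot_and_windy)   # number of rows that satisfy the condition

# subset() keeps the rows where the condition is TRUE
subset(airquality, (Temp > 90) & (Wind > 9))

# Negating an "or" is the same as requiring both negations
a <- airquality$Month == 5
b <- airquality$Month == 6
identical(!(a | b), (!a) & (!b))
```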


In [11]:
# the first records in the airquality df have Month == 5
head(airquality)

,Ozone,Solar.R,Wind,Temp,Month,Day
,<int>,<int>,<dbl>,<int>,<int>,<int>
1,41,190,7.4,67,5,1
2,36,118,8.0,72,5,2
3,12,149,12.6,74,5,3
4,18,313,11.5,62,5,4
5,,,14.3,56,5,5
6,28,,14.9,66,5,6


In [77]:
subset(airquality, Month != 5)

,Ozone,Solar.R,Wind,Temp,Month,Day
,<int>,<int>,<dbl>,<int>,<int>,<int>
32,,286,8.6,78,6,1
33,,287,9.7,74,6,2
34,,242,16.1,67,6,3
35,,186,9.2,84,6,4
36,,220,8.6,85,6,5
37,,264,14.3,79,6,6
38,29,127,9.7,82,6,7
39,,273,6.9,87,6,8
40,71,291,13.8,90,6,9
41,39,323,11.5,87,6,10


In [79]:
nrow(airquality)
not_may = subset(airquality, Month != 5)
nrow(not_may)

In [80]:
subset(airquality, (Month != 5) & (Month != 6))

,Ozone,Solar.R,Wind,Temp,Month,Day
,<int>,<int>,<dbl>,<int>,<int>,<int>
62,135,269,4.1,84,7,1
63,49,248,9.2,85,7,2
64,32,236,9.2,81,7,3
65,,101,10.9,84,7,4
66,64,175,4.6,83,7,5
67,40,314,10.9,83,7,6
68,77,276,5.1,88,7,7
69,97,267,6.3,92,7,8
70,97,272,5.7,92,7,9
71,85,175,7.4,89,7,10


In [81]:
# A note about defensive programming
not_may_june = subset(airquality, (Month != 5) & (Month != 6))
nrow(airquality)
nrow(not_may_june)


In [82]:
# Useful to include safeguards
nrow(not_may_june) == nrow(airquality) - 61 

In [90]:
stopifnot(nrow(not_may_june) == nrow(airquality) - 61)


In [91]:
stopifnot(nrow(not_may_june) != nrow(airquality) - 61)

ERROR: Error: nrow(not_may_june) != nrow(airquality) - 61 is not TRUE


In [94]:

library(assertthat)
error_message <- "The filter did not return expected number of rows nrow(not_may_june) != nrow(airquality) - 61"
assert_that( nrow(not_may_june) == nrow(airquality) - 61, msg=error_message)

In [95]:
error_message <- "The filter did not return expected number of rows nrow(not_may_june) != nrow(airquality) - 61"
assert_that( nrow(not_may_june) != nrow(airquality) - 61, msg=error_message)

ERROR: Error: The filter did not return expected number of rows nrow(not_may_june) != nrow(airquality) - 61


In [97]:
subset(airquality, Month > 6)

,Ozone,Solar.R,Wind,Temp,Month,Day
,<int>,<int>,<dbl>,<int>,<int>,<int>
62,135,269,4.1,84,7,1
63,49,248,9.2,85,7,2
64,32,236,9.2,81,7,3
65,,101,10.9,84,7,4
66,64,175,4.6,83,7,5
67,40,314,10.9,83,7,6
68,77,276,5.1,88,7,7
69,97,267,6.3,92,7,8
70,97,272,5.7,92,7,9
71,85,175,7.4,89,7,10


In [105]:
# Should return an error
nrows_may_june = nrow(subset(airquality, (Month == 5) | (Month == 6)))
assert_that(nrows_may_june == 62, 
            msg= paste("Should return 62, returned ", nrows_may_june) )

ERROR: Error: Should return 62, returned  61


In [106]:
# Note that we use parentheses to negate the whole condition!
subset(airquality, !((Month == 5) | (Month == 6)))

,Ozone,Solar.R,Wind,Temp,Month,Day
,<int>,<int>,<dbl>,<int>,<int>,<int>
62,135,269,4.1,84,7,1
63,49,248,9.2,85,7,2
64,32,236,9.2,81,7,3
65,,101,10.9,84,7,4
66,64,175,4.6,83,7,5
67,40,314,10.9,83,7,6
68,77,276,5.1,88,7,7
69,97,267,6.3,92,7,8
70,97,272,5.7,92,7,9
71,85,175,7.4,89,7,10


### Sorting a Data Frame

* Another common operation on data frames is sorting:

    * Sorting: `order()`
        * Note that `order()` does not return the sorted values; it returns the positions (row indices) that put them in sorted order
    * Ordering should be carried out on a specific column, otherwise the ordering is ambiguous
        * Additional columns can be passed to `order()` to break ties
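
A minimal sketch of `order()` on a small, made-up data frame (the `scores` table and its column names are hypothetical). It shows a second column breaking ties and the `decreasing` argument reversing the order.

```r
# Hypothetical scores, for illustration only
scores <- data.frame(name  = c("a", "b", "c", "d"),
                     round = c(2, 1, 2, 1),
                     score = c(10, 30, 20, 25))

# Order by round, breaking ties within a round by score
by_round_score <- order(scores$round, scores$score)
by_round_score
scores[by_round_score, ]

# Largest score first
by_score_desc <- order(scores$score, decreasing = TRUE)
scores[by_score_desc, ]
```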



In [108]:
some_vector = c(10,2, 0, 42)
order(some_vector)

In [109]:
some_vector_order = order(some_vector)
some_vector[some_vector_order]

In [110]:
airquality_temp_order = order(airquality$Temp)

In [111]:
airquality[airquality_temp_order, ]

,Ozone,Solar.R,Wind,Temp,Month,Day
,<int>,<int>,<dbl>,<int>,<int>,<int>
5,,,14.3,56,5,5
18,6,78,18.4,57,5,18
25,,66,16.6,57,5,25
27,,,8.0,57,5,27
15,18,65,13.2,58,5,15
26,,266,14.9,58,5,26
8,19,99,13.8,59,5,8
21,1,8,9.7,59,5,21
9,8,19,20.1,61,5,9
23,4,25,9.7,61,5,23


In [112]:
airquality_temp_solarR_order = order(airquality$Temp, airquality$Solar.R)
airquality[airquality_temp_solarR_order, ]

,Ozone,Solar.R,Wind,Temp,Month,Day
,<int>,<int>,<dbl>,<int>,<int>,<int>
5,,,14.3,56,5,5
25,,66,16.6,57,5,25
18,6,78,18.4,57,5,18
27,,,8.0,57,5,27
15,18,65,13.2,58,5,15
26,,266,14.9,58,5,26
21,1,8,9.7,59,5,21
8,19,99,13.8,59,5,8
9,8,19,20.1,61,5,9
23,4,25,9.7,61,5,23


## Lists 

* Lists are data structures used when you need to store collections of items that differ in length

    * Cannot use a `df` since its columns need to be of the same length
    * Cannot use a `matrix` since its columns need to be of the same length and hold numeric, character, or logical values

* You can think of a list as a way to group a "bunch" of other objects
  * Lists can include matrices, vectors, data frames, or even other lists
  * The objects are conceptually related 
    * You determine whether grouping the objects makes sense


* Use `list()` to group the objects

In [114]:
a_vector = c(1, 2, 3, 4, 5, 6)
a_matrix = matrix(c(1,2,3,4), nrow=2, byrow = TRUE)

first_names <- c("John", "Jane")
last_names <- c("Smith", "Doe")
a_df <- data.frame(first_names, last_names)
print(a_df)

  first_names last_names
1        John      Smith
2        Jane        Doe


In [115]:
my_list <- list(a_vector, a_matrix, a_df)
print(my_list)

[[1]]
[1] 1 2 3 4 5 6

[[2]]
     [,1] [,2]
[1,]    1    2
[2,]    3    4

[[3]]
  first_names last_names
1        John      Smith
2        Jane        Doe



In [116]:
print(my_list[[1]])

[1] 1 2 3 4 5 6


In [117]:
print(my_list[[2]])

     [,1] [,2]
[1,]    1    2
[2,]    3    4


In [118]:
print(my_list[[3]])

  first_names last_names
1        John      Smith
2        Jane        Doe


## Lists - Cont'd

* It is common to assign names to the objects included in a list
    * Allows us to refer to an object by name instead of using `[[i]]`
    * Names can be added at instantiation or afterwards using the `names()` function
* Once indexed and retrieved, objects can be used the same way as before
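
A minimal sketch contrasting the ways of reaching into a list. The `info` list is made up for illustration. Double brackets and `$` return the stored object itself, whereas single brackets return a smaller list that still wraps it.

```r
# Hypothetical list, for illustration only
info <- list(ids = c(1, 2, 3), label = "demo")

info[["ids"]]        # double brackets with a name: the stored vector
info$ids             # `$` is a shortcut for the same thing
info[[1]]            # double brackets with a position: the same vector

info["ids"]          # single brackets: a list of length 1 that holds the vector
class(info["ids"])
class(info[["ids"]])

# A retrieved object behaves as usual
mean(info$ids)
```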
    

In [122]:
my_list <- list("my_vector" = a_vector, "my_matrix" = a_matrix, "my_df" = a_df)
print(my_list)

$my_vector
[1] 1 2 3 4 5 6

$my_matrix
     [,1] [,2]
[1,]    1    2
[2,]    3    4

$my_df
  first_names last_names
1        John      Smith
2        Jane        Doe



In [77]:
my_list$my_vector

In [123]:
my_list$my_matrix

     [,1] [,2]
[1,]    1    2
[2,]    3    4


In [124]:
my_list$my_df

first_names,last_names
<chr>,<chr>
John,Smith
Jane,Doe


In [122]:
# Works without double quoting the names.
my_list <- list(my_vector = a_vector, my_matrix = a_matrix, my_df = a_df)
print(my_list)

$my_vector
[1] 1 2 3 4 5 6

$my_matrix
     [,1] [,2]
[1,]    1    2
[2,]    3    4

$my_df
  first_names last_names
1        John      Smith
2        Jane        Doe



In [129]:
# can also assign names to the objects post creation

names(my_list) = c("A", "B", "C")
print(my_list)

$A
[1] 1 2 3 4 5 6

$B
     [,1] [,2]
[1,]    1    2
[2,]    3    4

$C
  first_names last_names
1        John      Smith
2        Jane        Doe



In [133]:
print(my_list$A)

[1] 1 2 3 4 5 6


In [132]:
print(my_list$C)

  first_names last_names
1        John      Smith
2        Jane        Doe


### Review

* Vectors can hold numeric, character, or logical values.
    * The elements in a vector all have the same data type.
    * Indexed using `[position(s)]`

### Review - Cont'd

* Matrices (two-dimensional arrays)
    * All the cells are either numeric, character, or logical values 
    * The elements in a matrix all have the same data type.
    * Indexed using `[<row(s)>, <col(s)>]`

### Review - Cont'd

* Data frames: the Excel-sheet equivalent in `R`
    * A column contains either numeric, character, or logical values 
    * All elements of a column are of the same data type
    * Different columns can have different data types
    * Indexed using `[<row(s)>, <col(s)>]`


### Practical Session 


`matrix(c(1,2,3,4,5,6) , nrow = 3)`

* Running the expression produces the following matrix

|  |  |
|---|---|
| 1 | 4 |
| 2 | 5 |
| 3 | 6 |

* How can you modify the call to `matrix()` to produce the following matrix instead?

|   | col_1 |  col_2 |
| --- |---|---|
| row_1 | 1 | 2 |
| row_2 | 3 | 4 |
| row_3 | 5 | 6 |

* Note that you need to name the columns (col_1 and col_2) and name the rows (row_1, row_2, row_3)

Hint: Use the `?` symbol to invoke the matrix documentation

In [137]:
?matrix

In [141]:
x = matrix(c(1,2,3,4,5,6) , 
           byrow = TRUE, 
           nrow = 3, 
           dimnames=list(c("row_1", "row_2", "row_3"), c("col_1", "col_2")))
# I asked that we modify the call to matrix, so while the following will work,
# it is not in the call to matrix
# colnames(x) <- c("col_1", "col_2")
# rownames(x) <- c("row_1", "row_2", "row_3")
x

      col_1 col_2
row_1     1     2
row_2     3     4
row_3     5     6


* Sort the airquality data frame on its Temp and Solar.R columns in reverse order (largest to smallest values)
* Display only the first 15 lines of your table

In [145]:
airquality_temp_solarR_order = order(airquality$Temp, airquality$Solar.R, decreasing = TRUE)
head(airquality[airquality_temp_solarR_order, ], n=15)

,Ozone,Solar.R,Wind,Temp,Month,Day
,<int>,<int>,<dbl>,<int>,<int>,<int>
120,76,203,9.7,97,8,28
122,84,237,6.3,96,8,30
121,118,225,2.3,94,8,29
123,85,188,6.3,94,8,31
42,,259,10.9,93,6,11
127,91,189,4.6,93,9,4
126,73,183,2.8,93,9,3
70,97,272,5.7,92,7,9
69,97,267,6.3,92,7,8
43,,250,9.2,92,6,12


* Sort the airquality data frame on its Temp column in decreasing order and its Solar.R column in increasing order
* Display only the first 15 lines of your table

In [146]:
airquality_temp_solarR_order = order(airquality$Temp, airquality$Solar.R, decreasing = c(TRUE, FALSE))
head(airquality[airquality_temp_solarR_order, ], 15)

,Ozone,Solar.R,Wind,Temp,Month,Day
,<int>,<int>,<dbl>,<int>,<int>,<int>
120,76,203,9.7,97,8,28
122,84,237,6.3,96,8,30
123,85,188,6.3,94,8,31
121,118,225,2.3,94,8,29
126,73,183,2.8,93,9,3
127,91,189,4.6,93,9,4
42,,259,10.9,93,6,11
125,78,197,5.1,92,9,2
102,,222,8.6,92,8,10
43,,250,9.2,92,6,12


* There are other ways to select values in data frames. 

* Consult your [R Reference Card](https://cran.r-project.org/doc/contrib/Baggott-refcard-v2.pdf), see the `Data Selection and Manipulation` section.
  * Which other operation can you use to select all the rows where the temperature is 72?
  
* Hint: you want an operation that returns the indices of the rows where the temperature is 72

In [59]:
which(airquality$Temp == 72)

In [60]:
temp_72_idx <- which(airquality$Temp == 72)
airquality[temp_72_idx,]

# or simply
airquality[which(airquality$Temp == 72),]

# The second option saves us from creating an unnecessary variable, 
# but may be less legible for novices 

,Ozone,Solar.R,Wind,Temp,Month,Day
,<int>,<int>,<dbl>,<int>,<int>,<int>
2,36,118,8.0,72,5,2
48,37,284,20.7,72,6,17
114,9,36,14.3,72,8,22


Given the following vectors of sites and abundances for a fish species `x`, create a list called `x_info` that contains

1. vector of sites
2. vector of abundances
3. numerical average abundance value 
4. The site with the highest abundance

abundances_for_x <- c(10, 21, 32, 6, 22, 5)
associated_site_for_x <- c("site_1", "site_2", "site_3", "site_4", "site_5", "site_6")


Ensure that each of the expressions returns the value described below it

* `print(x_info$abundances)` returns the vector
`[1] 10 21 32  6 22  5`

* `print(x_info$sites)` returns the vector
`[1] "site_1" "site_2" "site_3" "site_4" "site_5" "site_6"`

* `x_info$mean_abundance` returns the mean abundance 
`16` 

* `x_info$site_max_abundance` returns the site with the maximum abundance
`"site_3"`

Note: see R Reference Card (Data selection and Manipulation)

In [151]:
abundances_for_x <- c(10, 21, 32, 6, 22, 5)
associated_site_for_x <- c("site_1", "site_2", "site_3", "site_4", "site_5", "site_6")

x_info <- list(abundances= abundances_for_x, 
               sites= associated_site_for_x, 
               mean_abundance = mean(abundances_for_x),
               site_max_abundance = associated_site_for_x[which.max(abundances_for_x)]
              )


In [152]:
print(x_info$abundances)

[1] 10 21 32  6 22  5


In [153]:
print(x_info$sites)

[1] "site_1" "site_2" "site_3" "site_4" "site_5" "site_6"


In [138]:
x_info$mean_abundance

In [144]:
x_info$site_max_abundance

* You may have noticed when working with the `airquality` data that some values show as `NA`
  * `NA` stands for "not available", i.e., a missing value.
* A major part of data wrangling consists of handling missing values by either:
  * Dropping the rows that have missing values
    * Sometimes we can instead drop a column if it is made up predominantly of missing values
  * Imputing the missing values, i.e., using educated guesses (or more complex algorithms) to fill them in
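
A minimal sketch of detecting missing values before deciding how to handle them. It uses only base R functions (`is.na()`, `complete.cases()`, and the `na.rm` argument of `mean()`); the variable name `keep` is chosen for illustration.

```r
data("airquality")

# is.na() flags missing entries; sum() counts them
sum(is.na(airquality$Ozone))
sum(is.na(airquality$Solar.R))

# Most summary functions return NA unless told to skip missing values
mean(airquality$Ozone)
mean(airquality$Ozone, na.rm = TRUE)

# complete.cases() is TRUE for rows without any NA in the chosen columns
keep <- complete.cases(airquality[, c("Ozone", "Solar.R")])
nrow(airquality[keep, ])
```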
  
  


* Find and remove all rows that are missing values for the `Solar.R` or `Ozone` variables
  * Save the cleaned data to a new data frame called `airquality_no_na`
  * How many rows have been removed?
* Use defensive programming to make sure that the rows have been properly removed


In [8]:
missing_solar_r <- which(is.na(airquality$Solar.R) | is.na(airquality$Ozone) )
airquality_no_na <- airquality[-missing_solar_r, ] 
airquality_no_na

,Ozone,Solar.R,Wind,Temp,Month,Day
,<int>,<int>,<dbl>,<int>,<int>,<int>
1,41,190,7.4,67,5,1
2,36,118,8.0,72,5,2
3,12,149,12.6,74,5,3
4,18,313,11.5,62,5,4
7,23,299,8.6,65,5,7
8,19,99,13.8,59,5,8
9,8,19,20.1,61,5,9
12,16,256,9.7,69,5,12
13,11,290,9.2,66,5,13
14,14,274,10.9,68,5,14


In [4]:
nrow(airquality[missing_solar_r, ])

In [10]:
print(airquality)

    Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
2      36     118  8.0   72     5   2
3      12     149 12.6   74     5   3
4      18     313 11.5   62     5   4
5      NA      NA 14.3   56     5   5
6      28      NA 14.9   66     5   6
7      23     299  8.6   65     5   7
8      19      99 13.8   59     5   8
9       8      19 20.1   61     5   9
10     NA     194  8.6   69     5  10
11      7      NA  6.9   74     5  11
12     16     256  9.7   69     5  12
13     11     290  9.2   66     5  13
14     14     274 10.9   68     5  14
15     18      65 13.2   58     5  15
16     14     334 11.5   64     5  16
17     34     307 12.0   66     5  17
18      6      78 18.4   57     5  18
19     30     322 11.5   68     5  19
20     11      44  9.7   62     5  20
21      1       8  9.7   59     5  21
22     11     320 16.6   73     5  22
23      4      25  9.7   61     5  23
24     32      92 12.0   61     5  24
25     NA      66 16.6   57     5  25
26     NA     266 14.9   58     5  26

* Let's use a different strategy and impute the missing values.
  * Replace the missing values for `Solar.R` using that month's average.
  * Example:
    * The missing value in row 6 should be replaced with the average for month 5.
    * The missing value in row 97 should be replaced with the average for month 8.
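
One possible way to approach this, sketched under the assumption that the per-month average is computed over the non-missing `Solar.R` values only. It relies on base R's `ave()`, which applies a function within groups and returns one value per row; the names `airquality_imputed`, `monthly_mean`, and `missing_idx` are chosen for illustration.

```r
data("airquality")

# Work on a copy so the original data set is left untouched
airquality_imputed <- airquality

# ave() returns, for every row, the mean Solar.R of that row's month
monthly_mean <- ave(airquality$Solar.R, airquality$Month,
                    FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing entries with their month's average
missing_idx <- which(is.na(airquality$Solar.R))
airquality_imputed$Solar.R[missing_idx] <- monthly_mean[missing_idx]

# Defensive checks: no NA left, and non-missing values are unchanged
stopifnot(!anyNA(airquality_imputed$Solar.R))
stopifnot(all(airquality_imputed$Solar.R[-missing_idx] ==
              airquality$Solar.R[-missing_idx]))

# Rows 6 and 97 now hold the averages for months 5 and 8
airquality_imputed[c(6, 97), ]
```

Note that the imputed averages are generally not whole numbers, so `Solar.R` becomes a numeric (double) column in the copy rather than an integer one.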