## Data reshaping, summarization and merging  

Data reshaping, summarization and merging are fundamental data analysis operations. R offers many ways to do each of them, which can make the language confusing. For this reason we will stick to the `dplyr`/`tidyr` syntax.   
To explain these concepts and demonstrate use cases, we will work through a small case study.  
Run the lines carefully and make sure you understand each part.  

Observe the `iris` data set:

In [None]:
head(iris)

Our goal is to calculate the average sepal and petal lengths and widths for each species and add them to the existing `iris` table. We first define the groups, then calculate the means, and finally join the result to the `iris` table: 

In [None]:
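library(dplyr)  # %>%, group_by, summarise_all, left_join (skip if already loaded)
library(tidyr)  # gather and spread, used further below

# one row per species, holding the mean of every measurement column
mean_iris <- iris %>% group_by(Species) %>%
  summarise_all(mean)

# join the means onto every row; the mean columns get the ".mean" suffix
iris <- left_join(iris, mean_iris, by = "Species", suffix = c("", ".mean"))
head(iris)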
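
The same three steps (group, summarise, join) can be easier to follow on a tiny table. The sketch below uses a hypothetical data frame `scores` with made-up column names `group` and `x`; it is only an illustration and not part of the case study.

```r
library(dplyr)

# Hypothetical toy table: two groups, one measurement column
scores <- data.frame(
  group = c("a", "a", "b", "b"),
  x     = c(1, 3, 10, 20)
)

# Steps 1 and 2: define the groups, then compute one mean per group
group_means <- scores %>%
  group_by(group) %>%
  summarise(x = mean(x))

# Step 3: attach the group mean to every original row.
# Both tables have a column `x`; the suffix keeps the original name
# and renames the joined one to `x.mean`.
scores_with_mean <- left_join(scores, group_means,
                              by = "group", suffix = c("", ".mean"))
scores_with_mean
```

Every row of group `a` receives the same `x.mean`, and likewise for group `b`, which is exactly what happens per species in the `iris` example.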

So for each measurement column we have to create an extra column to store the mean. A better way is to first `gather` the measurement columns into a single column: 

In [None]:
data(iris)
head(iris)
iris %>% gather(key = variable, value = value , -Species)

Basically: `gather(data, key = "name of the new key column", value = "name of the new value column", -columns not to gather)`  
Alternatively, you can name the columns to gather: 

In [None]:
iris %>% gather(key = variable, value = value,
       Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
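
To see what `gather` does row by row, here is a minimal sketch on a hypothetical two-row table (`wide_toy` and its columns `name`, `height` and `width` are made up for illustration).

```r
library(tidyr)

# Hypothetical wide table: one row per item, one column per measurement
wide_toy <- data.frame(
  name   = c("p1", "p2"),
  height = c(10, 20),
  width  = c(1, 2)
)

# Gather every column except `name`:
# the old column names go into `variable`, their contents into `value`
long_toy <- gather(wide_toy, key = variable, value = value, -name)
long_toy
```

Two rows with two measurement columns become four rows: each original cell now has its own row, labelled by `name` and `variable`.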

Now let's use what we have learned to streamline our case study objective:  

In [None]:
iris %>% gather(key = variable, value = value , -Species) %>% 
  group_by(Species, variable) %>%  # one group per species and measurement
  mutate(mean_value = mean(value))
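
What `group_by` receives determines which values are averaged together. The sketch below compares grouping by species alone with grouping by species and variable, on a hypothetical long table (`long_toy2` and its contents are made up for illustration).

```r
library(dplyr)

# Hypothetical long table: one species, two measurements, two plants
long_toy2 <- data.frame(
  species  = c("s1", "s1", "s1", "s1"),
  variable = c("length", "length", "width", "width"),
  value    = c(10, 20, 1, 3)
)

# Grouping by species only pools lengths and widths into one mean
pooled <- long_toy2 %>%
  group_by(species) %>%
  mutate(mean_value = mean(value)) %>%
  ungroup()

# Grouping by species and variable gives one mean per measurement
per_variable <- long_toy2 %>%
  group_by(species, variable) %>%
  mutate(mean_value = mean(value)) %>%
  ungroup()

pooled$mean_value        # 8.5 8.5 8.5 8.5
per_variable$mean_value  # 15 15 2 2
```

For the case study objective, a mean per measurement, the grouping needs to include the `variable` column.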

Then there is also `spread`, which creates the wider format of data:  

In [None]:
iris <- iris %>% mutate(id = 1:nrow(iris)) # add id so we can go from long format back to wide format
iris_long <- iris %>% gather(key = variable, value = value , -(Species:id)) 
iris_wide <- iris_long %>% spread( key = variable, value = value )

Basically: `spread(long_data, key = name of column to spread, value = name of value column)` 
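
As a minimal sketch of `spread`, the hypothetical long table below (`long_toy3`, with made-up columns `id`, `variable` and `value`) is turned back into one row per `id`.

```r
library(tidyr)

# Hypothetical long table: two items, identified by `id`,
# each with a height and a width
long_toy3 <- data.frame(
  id       = c(1, 1, 2, 2),
  variable = c("height", "width", "height", "width"),
  value    = c(10, 1, 20, 2)
)

# Each distinct entry of `variable` becomes a column,
# filled with the matching entries of `value`
wide_toy3 <- spread(long_toy3, key = variable, value = value)
wide_toy3
```

The result has the columns `id`, `height` and `width`, with one row for each `id`.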

* exercise: Start with the `iris` data, go to long format with `gather`, then go back to the original format, but this time without using the `id` column. You can reload a clean copy of the iris data with `data(iris)`. Did this work? Why (not)? What does the `id` represent? 

* exercise: Create a table that contains the max value of each measurement for each species from the `iris` data. Which species has the plant with the largest petals and sepals? 