# Summarising data
Once we've cleaned up our data it's time to start summarising it!

This can get pretty complicated, so we use something called a data pipeline to make it really clear how we're doing our transformations and summaries. In a pipeline, the result of the previous line is passed in as the first argument of the current line, which means we don't have to nest function calls inside each other.

In [3]:
install.packages("tidyverse")  # only needed the first time
library(tidyverse)
heroes <- read_csv("clean/heroes.csv")
powers <- read_csv("clean/powers_long.csv")

Installing package into ‘/home/nbuser/R’
(as ‘lib’ is unspecified)
Parsed with column specification:
cols(
  X1 = col_integer(),
  name = col_character(),
  Gender = col_character(),
  `Eye color` = col_character(),
  Race = col_character(),
  `Hair color` = col_character(),
  Height = col_double(),
  Publisher = col_character(),
  `Skin color` = col_character(),
  Alignment = col_character(),
  Weight = col_integer()
)
Parsed with column specification:
cols(
  hero_names = col_character(),
  power = col_character(),
  present = col_logical()
)


A pipeline is written by putting the `%>%` operator at the end of each line, which passes that line's result on to the next one. So if I want to use `summarise_all()` with the `n_distinct()` function to return the number of unique values in each column of my `heroes` data, I could structure it this way:

In [17]:
heroes %>%
  summarise_all(n_distinct)

X1,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
734,715,3,23,62,30,54,25,17,4,135
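
To see what `%>%` is doing, here is a minimal sketch comparing the piped call with the ordinary call. The `toy_heroes` table is made up for illustration and is not the real `heroes` data.

```r
library(dplyr)

# Made-up data for illustration only; not the real heroes dataset.
toy_heroes <- tibble(
  name      = c("A", "B", "C", "D"),
  Alignment = c("good", "bad", "good", "good"),
  Weight    = c(80, 120, 80, 100)
)

# Piped: toy_heroes becomes the first argument of summarise_all().
piped <- toy_heroes %>%
  summarise_all(n_distinct)

# Direct: the same call written without the pipe.
direct <- summarise_all(toy_heroes, n_distinct)

# TRUE when both approaches return exactly the same table.
identical(piped, direct)
```

With a single step the pipe doesn't save much, but it keeps the code readable once several steps are chained together.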


This is useful as we get more complicated summaries. Let's say we wanted to know the average weight of all "good" heroes. We would need a `filter()` to pick out the good heroes, and a `summarise()` to calculate the average weight.

In [11]:
heroes %>%
  filter(Alignment == "good") %>%
  summarise(avgWeight = mean(Weight, na.rm=TRUE))
# na.rm stops missing values from mucking up the calculation!

avgWeight
95.54655
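
As a rough sketch of why the pipe and `na.rm` both help here, the example below writes the same filter-then-summarise step in nested and piped form, and shows what happens when `na.rm` is left out. The `toy_heroes` table is made up for illustration.

```r
library(dplyr)

# Made-up data for illustration only; one "good" hero has a missing weight.
toy_heroes <- tibble(
  name      = c("A", "B", "C", "D"),
  Alignment = c("good", "bad", "good", "good"),
  Weight    = c(80, 120, NA, 100)
)

# Nested version: has to be read from the inside out.
nested <- summarise(
  filter(toy_heroes, Alignment == "good"),
  avgWeight = mean(Weight, na.rm = TRUE)
)

# Piped version: reads top to bottom, one step per line.
piped <- toy_heroes %>%
  filter(Alignment == "good") %>%
  summarise(avgWeight = mean(Weight, na.rm = TRUE))

# Without na.rm = TRUE the single missing weight makes the mean NA.
with_na <- toy_heroes %>%
  filter(Alignment == "good") %>%
  summarise(avgWeight = mean(Weight))

piped$avgWeight    # 90, the mean of 80 and 100
with_na$avgWeight  # NA
```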


What if we wanted to summarise multiple values by alignment? We can use the `group_by()` function to perform `summarise()`-type actions within groups. Let's see how we could get some key statistics for every numeric column.

In [15]:
heroes %>%
  group_by(Alignment) %>%
  summarise_if(is.numeric, c("min", "max", "mean"), na.rm = TRUE)

Alignment,X1_min,Height_min,Weight_min,X1_max,Height_max,Weight_max,X1_mean,Height_mean,Weight_mean
-,33,183.0,81,692,229,358,372.8571,203.8,175.66667
bad,3,15.2,2,733,366,817,368.8068,187.0824,139.80986
good,0,30.5,4,732,975,900,363.9819,183.8452,95.54655
neutral,92,165.0,16,672,876,855,396.7917,237.4118,198.11765
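
If you're using dplyr 1.0.0 or later, `summarise_if()` still works but is marked as superseded in favour of `across()`. Here is a sketch of the same kind of grouped summary written that way, assuming a made-up `toy_heroes` table rather than the real data. Note that `across()` names the columns the same way but orders them by column rather than by statistic.

```r
library(dplyr)

# Made-up data for illustration only; not the real heroes dataset.
toy_heroes <- tibble(
  Alignment = c("good", "bad", "good", "bad"),
  Height    = c(180, 190, NA, 170),
  Weight    = c(80, 120, 100, NA)
)

stats <- toy_heroes %>%
  group_by(Alignment) %>%
  summarise(across(
    where(is.numeric),                 # pick every numeric column
    list(
      min  = ~ min(.x, na.rm = TRUE),  # .x stands for the current column
      max  = ~ max(.x, na.rm = TRUE),
      mean = ~ mean(.x, na.rm = TRUE)
    )
  ))

# Columns are named <column>_<statistic>, e.g. Height_min, Weight_mean.
names(stats)
```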


> The `count()` function can very quickly build a frequency table showing how often each unique value in a column appears. Can you use this function to work out how many heroes have each power based on the `powers` dataset?
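
As a hint, here is a small sketch of `count()` on a made-up table (the publisher names are invented, and this is not the answer for the `powers` data). It returns one row per unique value, with the frequency in a column called `n`.

```r
library(dplyr)

# Made-up data for illustration only.
toy_heroes <- tibble(
  name      = c("A", "B", "C", "D", "E"),
  Publisher = c("Alpha Comics", "Beta Comics", "Alpha Comics",
                "Alpha Comics", "Beta Comics")
)

# One row per publisher, with the number of heroes in column n.
publisher_counts <- toy_heroes %>%
  count(Publisher)

# sort = TRUE puts the most common value first.
publisher_sorted <- toy_heroes %>%
  count(Publisher, sort = TRUE)

publisher_sorted
```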