## dplyr
- summarize() from the dplyr/tidyverse package computes summary statistics from the data frame. It returns a data frame whose column names are defined within the function call.

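- summarize() can compute any summary function that operates on vectors and returns a single value. With functions that return multiple values, the behavior depends on the dplyr version: versions before 1.0 raise an error, while 1.0 and later return one row per value (1.1.0 and later also warn and recommend reframe() for this case).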
- Like most dplyr functions, summarize() is aware of variable names within data frames and can use them directly.

In [2]:
install.packages("dslabs")
library(tidyverse)
library(dslabs)
data(heights)

# compute average and standard deviation for males
s <- heights %>%
    filter(sex == "Male") %>%
    summarize(average = mean(height), standard_deviation = sd(height))
  

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [3]:
# access average and standard deviation from summary table
s$average
s$standard_deviation

# compute median, min and max
heights %>%
    filter(sex == "Male") %>%
    summarize(median = median(height),
              minimum = min(height),
              maximum = max(height))


median,minimum,maximum
<dbl>,<dbl>,<dbl>
69,50,82.67717


In [4]:
# alternative way to get min, median, max in base R
# (computed here over all heights, not only males)
quantile(heights$height, c(0, 0.5, 1))

# NOTE: before dplyr 1.0 the following code generates an error, because
# summarize() accepted only functions that return a single value.
# With dplyr 1.0 or later it runs and returns one row per value, as in the output below.
heights %>%
    filter(sex == "Male") %>%
    summarize(range = quantile(height, c(0, 0.5, 1)))

range
<dbl>
50.0
69.0
82.67717
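
One way to avoid the version-dependent behavior above is to request each quantile as its own single-value column. The following is a minimal sketch under stated assumptions: the `toy` data frame is made up for illustration and is not the dslabs heights data, and `names = FALSE` is the base R `quantile()` argument that drops the percentage labels.

```r
library(dplyr)

# toy data (hypothetical values, not the dslabs heights data)
toy <- data.frame(sex = c("Male", "Male", "Male", "Female"),
                  height = c(60, 70, 80, 65),
                  stringsAsFactors = FALSE)

# one quantile() call per column, so every summary returns a single value
male_summary <- toy %>%
    filter(sex == "Male") %>%
    summarize(minimum = quantile(height, 0, names = FALSE),
              median = quantile(height, 0.5, names = FALSE),
              maximum = quantile(height, 1, names = FALSE))
male_summary
```

Because every column is a single value, this should produce a one-row data frame whether or not the installed dplyr accepts multi-value summaries.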


## The Dot Placeholder
- The dot (.) is a placeholder for the data being passed in through the %>% pipe operator, so it lets you access values stored in the piped data.
- Because .$column extracts a column from the piped data frame, a pipeline can return a single vector or number instead of a data frame.
- us_murder_rate %>% .$rate is equivalent to us_murder_rate$rate.
- An equivalent way to extract a single column using the pipe is us_murder_rate %>% pull(rate). The pull() function will be used in later course material.
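
The equivalence between the dot placeholder and pull() can be checked on a small example. This is a sketch only: the `toy` data frame and its numbers are hypothetical and stand in for the murders data.

```r
library(dplyr)

# toy data (hypothetical counts, not the dslabs murders data)
toy <- data.frame(total = c(10, 30),
                  population = c(1e6, 3e6))

# summarize() returns a data frame, even for a single number
rate_df <- toy %>%
    summarize(rate = sum(total) / sum(population) * 100000)

# two ways to extract the number itself from the piped data frame
via_dot <- rate_df %>% .$rate
via_pull <- rate_df %>% pull(rate)

class(rate_df)                  # a data frame
class(via_dot)                  # a plain numeric vector
identical(via_dot, via_pull)    # both approaches give the same value
```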

In [6]:
murders <- murders %>% mutate(murder_rate = total/population*100000)
summarize(murders, mean(murder_rate))



mean(murder_rate)
<dbl>
2.779125


In [8]:
# calculate US murder rate, generating a data frame
us_murder_rate <- murders %>%
    summarize(rate = sum(total) / sum(population) * 100000)
us_murder_rate

# extract the numeric US murder rate with the dot operator
us_murder_rate %>% .$rate



rate
<dbl>
3.034555


In [9]:
# calculate and extract the murder rate with one pipe
us_murder_rate <- murders %>%
    summarize(rate = sum(total) / sum(population) * 100000) %>%
    .$rate

## Group By
- The group_by() function from dplyr converts a data frame to a grouped data frame, creating groups using one or more variables.
- summarize() and some other dplyr functions will behave differently on grouped data frames.
- Using summarize() on a grouped data frame computes the summary statistics for each of the separate groups.
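
The difference between a plain and a grouped data frame can be seen on a small example. This is a sketch with a made-up `toy` data frame (not the dslabs data); `inherits()` is base R and is used here only to inspect the class that group_by() adds.

```r
library(dplyr)

# toy data (hypothetical values)
toy <- data.frame(sex = c("F", "F", "M", "M"),
                  height = c(60, 64, 68, 72),
                  stringsAsFactors = FALSE)

# group_by() keeps the same rows but records the grouping
grouped <- toy %>% group_by(sex)
inherits(grouped, "grouped_df")

# without groups: one row summarizing all observations
overall <- toy %>% summarize(average = mean(height))

# with groups: one row per group
by_sex <- grouped %>% summarize(average = mean(height))
by_sex
```

Here `overall` has a single row, while `by_sex` has one row for each value of `sex`.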

In [10]:
# libraries and data
library(tidyverse)
library(dslabs)
data(heights)
data(murders)


In [11]:
# compute separate average and standard deviation for male/female heights
heights %>%
    group_by(sex) %>%
    summarize(average = mean(height), standard_deviation = sd(height))


`summarise()` ungrouping output (override with `.groups` argument)



sex,average,standard_deviation
<fct>,<dbl>,<dbl>
Female,64.93942,3.760656
Male,69.31475,3.611024


In [12]:
# compute median murder rate in each of the 4 regions of the country
murders <- murders %>%
    mutate(murder_rate = total/population * 100000)
murders %>%
    group_by(region) %>%
    summarize(median_rate = median(murder_rate))

`summarise()` ungrouping output (override with `.groups` argument)



region,median_rate
<fct>,<dbl>
Northeast,1.802179
South,3.398069
North Central,1.971105
West,1.292453


## Sorting Data Tables
- The arrange() function from dplyr sorts a data frame by a given column.
- By default, arrange() sorts in ascending order (lowest to highest). To instead sort in descending order, use the function desc() inside of arrange().
- You can arrange() by multiple levels: within equivalent values of the first level, observations are sorted by the second level, and so on.
- The top_n() function shows the top results ranked by a given variable, but the results are not ordered. You can combine top_n() with arrange() to return the top results in order.
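
The sorting rules above can be sketched on a four-row example where the expected order is easy to verify by eye. The `toy` data frame and its values are hypothetical, chosen only for illustration.

```r
library(dplyr)

# toy data (hypothetical values, not the dslabs murders data)
toy <- data.frame(state = c("A", "B", "C", "D"),
                  region = c("West", "South", "West", "South"),
                  rate = c(3, 1, 2, 4),
                  stringsAsFactors = FALSE)

# ascending order is the default
ascending <- toy %>% arrange(rate)

# desc() reverses the order
descending <- toy %>% arrange(desc(rate))

# sort by region first, then by rate within each region
nested <- toy %>% arrange(region, rate)

# top_n() keeps the 2 highest rates; arrange() then puts them in order
top2 <- toy %>% top_n(2, rate) %>% arrange(desc(rate))

ascending$state
descending$state
nested$state
top2$state
```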

In [13]:
# libraries and data
library(tidyverse)
library(dslabs)
data(murders)

In [14]:

# set up murders object
murders <- murders %>%
    mutate(murder_rate = total/population * 100000)


In [15]:
    
# arrange by population column, smallest to largest
murders %>% arrange(population) %>% head()


state,abb,region,population,total,murder_rate
<chr>,<chr>,<fct>,<dbl>,<dbl>,<dbl>
Wyoming,WY,West,563626,5,0.8871131
District of Columbia,DC,South,601723,99,16.4527532
Vermont,VT,Northeast,625741,2,0.3196211
North Dakota,ND,North Central,672591,4,0.5947151
Alaska,AK,West,710231,19,2.675186
South Dakota,SD,North Central,814180,8,0.9825837


In [16]:

# arrange by murder rate, smallest to largest
murders %>% arrange(murder_rate) %>% head()


state,abb,region,population,total,murder_rate
<chr>,<chr>,<fct>,<dbl>,<dbl>,<dbl>
Vermont,VT,Northeast,625741,2,0.3196211
New Hampshire,NH,Northeast,1316470,5,0.3798036
Hawaii,HI,West,1360301,7,0.514592
North Dakota,ND,North Central,672591,4,0.5947151
Iowa,IA,North Central,3046355,21,0.6893484
Idaho,ID,West,1567582,12,0.7655102


In [17]:

# arrange by murder rate in descending order
murders %>% arrange(desc(murder_rate)) %>% head()


state,abb,region,population,total,murder_rate
<chr>,<chr>,<fct>,<dbl>,<dbl>,<dbl>
District of Columbia,DC,South,601723,99,16.452753
Louisiana,LA,South,4533372,351,7.742581
Missouri,MO,North Central,5988927,321,5.359892
Maryland,MD,South,5773552,293,5.074866
South Carolina,SC,South,4625364,207,4.475323
Delaware,DE,South,897934,38,4.231937


In [18]:

# arrange by region alphabetically, then by murder rate within each region
murders %>% arrange(region, murder_rate) %>% head()


state,abb,region,population,total,murder_rate
<chr>,<chr>,<fct>,<dbl>,<dbl>,<dbl>
Vermont,VT,Northeast,625741,2,0.3196211
New Hampshire,NH,Northeast,1316470,5,0.3798036
Maine,ME,Northeast,1328361,11,0.8280881
Rhode Island,RI,Northeast,1052567,16,1.5200933
Massachusetts,MA,Northeast,6547629,118,1.8021791
New York,NY,Northeast,19378102,517,2.6679599


In [19]:

# show the top 10 states with highest murder rate, not ordered by rate
murders %>% top_n(10, murder_rate)


state,abb,region,population,total,murder_rate
<chr>,<chr>,<fct>,<dbl>,<dbl>,<dbl>
Arizona,AZ,West,6392017,232,3.629527
Delaware,DE,South,897934,38,4.231937
District of Columbia,DC,South,601723,99,16.452753
Georgia,GA,South,9920000,376,3.790323
Louisiana,LA,South,4533372,351,7.742581
Maryland,MD,South,5773552,293,5.074866
Michigan,MI,North Central,9883640,413,4.178622
Mississippi,MS,South,2967297,120,4.044085
Missouri,MO,North Central,5988927,321,5.359892
South Carolina,SC,South,4625364,207,4.475323


In [20]:

# show the top 10 states with highest murder rate, ordered by rate
murders %>% arrange(desc(murder_rate)) %>% top_n(10)

Selecting by murder_rate



state,abb,region,population,total,murder_rate
<chr>,<chr>,<fct>,<dbl>,<dbl>,<dbl>
District of Columbia,DC,South,601723,99,16.452753
Louisiana,LA,South,4533372,351,7.742581
Missouri,MO,North Central,5988927,321,5.359892
Maryland,MD,South,5773552,293,5.074866
South Carolina,SC,South,4625364,207,4.475323
Delaware,DE,South,897934,38,4.231937
Michigan,MI,North Central,9883640,413,4.178622
Mississippi,MS,South,2967297,120,4.044085
Georgia,GA,South,9920000,376,3.790323
Arizona,AZ,West,6392017,232,3.629527
