## Aggregating Data

Now that you know how to transform your data, the next step is to aggregate it so that it is easier to interpret. This chapter covers verbs that take many observations and summarize them: count(), group_by(), summarize(), ungroup(), and top_n().

### Counting by region  
The counties dataset contains columns for region, 
state, population, and the number of citizens, which we 
selected and saved as the counties_selected table. 
In this exercise, you'll focus on the region column.

In [5]:
counties_selected <- counties %>%
  select(region, state, population, citizens)

# Use count() to find the number of counties in each region, 
# using a second argument to sort in descending order.
counties_selected %>%
  count(region, sort = TRUE)

# Counting citizens by state
# You can weight your count by a particular variable rather 
# than finding the number of counties. In this case, you'll 
# find the number of citizens in each state.

# Count the number of counties in each state, weighted based 
# on the citizens column, and sorted in descending order.

counties_selected %>%
  count(state, wt = citizens, sort = TRUE)

region,n
South,1420
North Central,1054
West,447
Northeast,217


state,n
California,24280349
Texas,16864864
Florida,13933052
New York,13531404
Pennsylvania,9710416
Illinois,8979999
Ohio,8709050
Michigan,7380136
North Carolina,7107998
Georgia,6978660
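
To make the difference between the two counts concrete, here is a minimal sketch on made-up data (the `toy` table and its values are hypothetical, not taken from the counties dataset). It suggests that a weighted count() gives the same totals as grouping by the column and summing the weight yourself.

```r
library(dplyr)

# Hypothetical toy data: three counties in two states
toy <- tibble(
  state    = c("A", "A", "B"),
  citizens = c(100, 50, 30)
)

# Unweighted: the number of rows (counties) per state
by_rows <- toy %>%
  count(state, sort = TRUE)

# Weighted: the sum of citizens per state
by_citizens <- toy %>%
  count(state, wt = citizens, sort = TRUE)

# The same totals, written out with group_by() and summarize()
by_hand <- toy %>%
  group_by(state) %>%
  summarize(n = sum(citizens)) %>%
  arrange(desc(n))
```

Here `by_rows` reports 2 and 1 counties for states A and B, while `by_citizens` and `by_hand` both report 150 and 30 citizens.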


### Mutating and counting
You can combine multiple verbs together to answer 
increasingly complicated questions about your data. 
For example: "What are the US states where the most people 
walk to work?"
You'll use the walk column, which gives the percentage of 
people in each county who walk to work, to add a new column 
and count based on it.
Use mutate() to calculate and add a column called 
population_walk, containing the total number of people 
who walk to work in a county.
Use a (weighted and sorted) count() to find the 
total number of people who walk to work in each state.

In [6]:
counties_selected <- counties %>%
  select(region, state, population, walk)

counties_selected %>%
  # Add population_walk containing the total number of people who walk to work
  mutate(population_walk = walk * population / 100) %>%
  # Count weighted by the new column
  count(state, wt = population_walk, sort = TRUE)

state,n
New York,1237938.17
California,1017963.68
Pennsylvania,505397.19
Texas,430783.43
Illinois,400345.6
Massachusetts,316765.03
Florida,284722.87
New Jersey,273047.19
Ohio,266910.98
Washington,239764.32
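
The percentage-to-headcount arithmetic can be checked by hand on a small example. The sketch below uses a hypothetical `toy` table (not the counties data) and also shows that the weight can be given as an expression, which avoids the intermediate column when you do not need it afterwards.

```r
library(dplyr)

# Hypothetical toy data: walk is a percentage, as in the counties data
toy <- tibble(
  state      = c("A", "A", "B"),
  population = c(1000, 500, 200),
  walk       = c(10, 4, 50)
)

# Two steps: add the column, then weight the count by it
two_step <- toy %>%
  mutate(population_walk = walk * population / 100) %>%
  count(state, wt = population_walk, sort = TRUE)

# One step: pass the expression directly as the weight
one_step <- toy %>%
  count(state, wt = walk * population / 100, sort = TRUE)
```

State A contributes 100 + 20 = 120 walkers and state B contributes 100, so both results list A first with 120 and B second with 100.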


### Summarizing
The summarize() verb is very useful for collapsing a 
large dataset into a single observation.
Summarize the counties dataset to find the following columns: 
min_population (with the smallest population), 
max_unemployment (with the maximum unemployment), 
and average_income (with the mean of the income variable).

In [7]:
counties_selected <- counties %>%
  select(county, population, income, unemployment)

# Summarize to find minimum population, maximum unemployment, and average income
counties_selected %>%
  summarize(min_population = min(population),
            max_unemployment = max(unemployment),
            average_income = mean(income))

min_population,max_unemployment,average_income
85,29.4,46832
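
One thing to watch for when summarizing: functions such as min(), max(), and mean() return NA if the column contains a missing value. The output above gives no sign that the counties data has this problem, so the sketch below uses a hypothetical `toy` table purely to illustrate the `na.rm = TRUE` argument.

```r
library(dplyr)

# Hypothetical toy data with one missing income value
toy <- tibble(
  county = c("a", "b", "c"),
  income = c(10, NA, 30)
)

toy_summary <- toy %>%
  summarize(
    mean_with_na    = mean(income),               # NA: the missing value propagates
    mean_without_na = mean(income, na.rm = TRUE)  # 20: the mean of 10 and 30
  )
```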


### Summarizing by state
Another interesting column is land_area, which shows 
the land area in square miles. Here, you'll summarize both 
population and land area by state, with the purpose of finding 
the density (in people per square mile).

In [9]:
counties_selected <- counties %>%
  select(state, county, population, land_area)

# Group the data by state, and summarize to create the columns 
# total_area (with total area in square miles) and 
# total_population (with total population).

counties_selected %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area),
            total_population = sum(population))

# Add a density column with the people per square mile, 
# then arrange in descending order.
counties_selected %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area),
            total_population = sum(population)) %>%
  mutate(density = total_population / total_area) %>%
  arrange(desc(density))

`summarise()` ungrouping output (override with `.groups` argument)


state,total_area,total_population
Alabama,50645.39,4830620
Alaska,553559.51,725461
Arizona,113594.09,6641928
Arkansas,52035.48,2958208
California,155779.21,38421464
Colorado,103641.93,5278906
Connecticut,4842.36,3593222
Delaware,1948.55,926454
Florida,53624.78,19645772
Georgia,57513.54,10006693


`summarise()` ungrouping output (override with `.groups` argument)


state,total_area,total_population,density
New Jersey,7354.22,8904413,1210.789587
Rhode Island,1033.82,1053661,1019.191929
Massachusetts,7800.08,6705586,859.681696
Connecticut,4842.36,3593222,742.039419
Maryland,9707.23,5930538,610.940299
Delaware,1948.55,926454,475.458161
New York,47126.43,19673174,417.455216
Florida,53624.78,19645772,366.356226
Pennsylvania,44742.71,12779559,285.623267
Ohio,40860.73,11575977,283.303235
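
The density above is computed from state totals rather than by averaging each county's own density, and the two can differ a lot. A minimal sketch on a hypothetical two-county state (made-up numbers) shows why the order of operations matters:

```r
library(dplyr)

# Hypothetical state with one small, crowded county and one large, sparse county
toy <- tibble(
  state      = c("A", "A"),
  population = c(900, 100),
  land_area  = c(10, 90)
)

res <- toy %>%
  group_by(state) %>%
  summarize(
    # Total people over total area: 1000 / 100 = 10
    density_from_totals = sum(population) / sum(land_area),
    # Average of the county densities: mean(90, 1.11) is about 45.6
    mean_county_density = mean(population / land_area)
  )
```

Summing first gives the state's actual people per square mile. Averaging county densities gives each county equal weight regardless of its size, which answers a different question.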


### Summarizing by state and region
You can group by multiple columns instead of grouping by one.
Here, you'll practice aggregating by state and region, 
and notice how useful it is for performing multiple aggregations 
in a row.
Summarize to find the total population, as a column called 
total_pop, in each combination of region and state.

In [12]:
counties_selected <- counties %>%
  select(region, state, county, population)

counties_selected %>%
    group_by(region, state) %>%
    summarize(total_pop = sum(population))

`summarise()` regrouping output by 'region' (override with `.groups` argument)


region,state,total_pop
North Central,Illinois,12873761
North Central,Indiana,6568645
North Central,Iowa,3093526
North Central,Kansas,2892987
North Central,Michigan,9900571
North Central,Minnesota,5419171
North Central,Missouri,6045448
North Central,Nebraska,1869365
North Central,North Dakota,721640
North Central,Ohio,11575977


In [13]:
# Notice the tibble is still grouped by region; 
# use another summarize() step to calculate two new columns: 
# the average state population in each region (average_pop) and 
# the median state population in each region (median_pop).

counties_selected %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population)) %>%
  summarize(average_pop = mean(total_pop),
            median_pop = median(total_pop))

`summarise()` regrouping output by 'region' (override with `.groups` argument)
`summarise()` ungrouping output (override with `.groups` argument)


region,average_pop,median_pop
North Central,5627687,5580644
Northeast,6221058,3593222
South,7370486,4804098
West,5722755,2798636
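
You can inspect the grouping at each step with group_vars(). The sketch below uses a hypothetical `toy` table and assumes dplyr 1.0.0 or later for the `.groups` argument mentioned in the messages above. It shows one grouping level being removed by each summarize() call.

```r
library(dplyr)

# Hypothetical toy data: two regions, three states
toy <- tibble(
  region     = c("East", "East", "East", "West"),
  state      = c("A", "A", "B", "C"),
  population = c(10, 20, 30, 40)
)

# First summarize(): one row per state, still grouped by region
by_state <- toy %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population))

group_vars(by_state)   # "region"

# Second summarize(): one row per region, no grouping left
by_region <- by_state %>%
  summarize(average_pop = mean(total_pop))

group_vars(by_region)  # character(0)

# Assumes dplyr >= 1.0.0: drop all grouping in a single step
by_state_flat <- toy %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population), .groups = "drop")
```

Stating `.groups` explicitly also silences the regrouping message, which can make a longer pipeline easier to read.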


### Selecting a county from each region
Previously, you used the walk column, which gives 
the percentage of people in each county who walk to work, 
to add a new column and count the total number
of people who walk to work in each state.
Now, you're interested in finding the county within each region
with the highest percentage of citizens who walk to work.

In [15]:
counties_selected <- counties %>%
  select(region, state, county, metro, population, walk)
# Group by region and find the county with the highest percentage of citizens who walk to work
counties_selected %>%
  group_by(region) %>%
  top_n(1, walk)

region,state,county,metro,population,walk
West,Alaska,Aleutians East Borough,Nonmetro,3304,71.2
Northeast,New York,New York,Metro,1629507,20.7
North Central,North Dakota,McIntosh,Nonmetro,2759,17.5
South,Virginia,Lexington city,Nonmetro,7071,31.7
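
Note that top_n() keeps every row that ties for the top value, so a group can return more than one row. The result above happens to have exactly one county per region. The sketch below uses a hypothetical `toy` table with a deliberate tie to show the behavior.

```r
library(dplyr)

# Hypothetical toy data: counties "b" and "c" tie for the highest walk value in East
toy <- tibble(
  region = c("East", "East", "East", "West", "West"),
  county = c("a", "b", "c", "d", "e"),
  walk   = c(5, 9, 9, 2, 4)
)

top_walkers <- toy %>%
  group_by(region) %>%
  top_n(1, walk)
```

Here `top_walkers` has three rows: both tied counties from East and one county from West.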


### Finding the highest-income state in each region
You've been learning to combine multiple dplyr verbs together.
Here, you'll combine group_by(), summarize(), and top_n() to 
find the state in each region with the highest income.
When you group by multiple columns and then summarize, 
it's important to remember that summarize() "peels off" 
the last grouping variable but leaves the rest in place. 
For example, if you group_by(X, Y) then summarize, 
the result will still be grouped by X.

In [16]:
counties_selected <- counties %>%
  select(region, state, county, population, income)

# Find the highest income state in each region.
counties_selected %>%
  group_by(region, state) %>%
  # Calculate average income
  summarize(average_income = mean(income)) %>%
  # Find the highest income state in each region
  top_n(1, average_income)

`summarise()` regrouping output by 'region' (override with `.groups` argument)


region,state,average_income
North Central,North Dakota,55574.87
Northeast,New Jersey,73014.1
South,Maryland,69200.38
West,Alaska,65124.54


### Using summarize, top_n, and count together
In this chapter, you've learned to use five dplyr verbs 
related to aggregation: count(), group_by(), summarize(), 
ungroup(), and top_n(). In this exercise, you'll use all 
of them to answer a question: In how many states do more 
people live in metro areas than non-metro areas?

In [19]:
counties_selected <- counties %>%
  select(state, metro, population)
# Find the total population for each combination of state 
# and metro
counties_selected %>%
  group_by(state, metro) %>%
  summarize(total_pop = sum(population)) %>%
  # Extract the most populated row for each state
  top_n(1, total_pop)

`summarise()` regrouping output by 'state' (override with `.groups` argument)


state,metro,total_pop
Alabama,Metro,3671377
Alaska,Metro,494990
Arizona,Metro,6295145
Arkansas,Metro,1806867
California,Metro,37587429
Colorado,Metro,4590896
Connecticut,Metro,3406918
Delaware,Metro,926454
Florida,Metro,18941821
Georgia,Metro,8233886


In [20]:
# Ungroup, then count how often Metro or Nonmetro appears 
# to see how many states have more people living in each type of area.
counties_selected %>%
  group_by(state, metro) %>%
  summarize(total_pop = sum(population)) %>%
  top_n(1, total_pop) %>%
  ungroup() %>% 
  count(metro)

`summarise()` regrouping output by 'state' (override with `.groups` argument)


metro,n
Metro,44
Nonmetro,6
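
The ungroup() step matters here because the data is still grouped by state after top_n(). A minimal sketch on a hypothetical `toy` table (made-up totals for three states) compares the count with and without it:

```r
library(dplyr)

# Hypothetical toy data: metro wins in A and C, non-metro wins in B
toy <- tibble(
  state     = c("A", "A", "B", "B", "C", "C"),
  metro     = c("Metro", "Nonmetro", "Metro", "Nonmetro", "Metro", "Nonmetro"),
  total_pop = c(90, 10, 20, 80, 60, 40)
)

winners <- toy %>%
  group_by(state) %>%
  top_n(1, total_pop)

# Still grouped by state: count() works within each state, giving one row per state
grouped_counts <- winners %>%
  count(metro)

# Ungrouped: count() works across all states, giving one row per metro value
overall_counts <- winners %>%
  ungroup() %>%
  count(metro)
```

`grouped_counts` has three rows, each with n equal to 1, which says nothing about the country as a whole. `overall_counts` has the two rows you want: Metro with 2 and Nonmetro with 1.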
