## Selecting and Transforming Data

This chapter covers more advanced ways to select and transform columns. It introduces select helpers, which are functions that specify criteria for the columns you want to choose, as well as the rename and transmute verbs.

### Selecting columns
Using the select verb, we can answer interesting questions 
about our dataset by focusing on related groups of variables. 
The colon (:) selects a range of consecutive columns, which is 
useful for getting many columns at a time.

In [2]:
# libraries
library(dplyr)
# read data
counties <- readRDS(file = "counties.rds")

# Use glimpse() to examine all the variables in the counties table.
glimpse(counties)

# The table has five consecutive variables related to the industry of
# people's work: professional, service, office, construction, and production.
# Select the columns for state, county, population, and (using a colon)
# all five of those industry-related variables.

# Then arrange the table in descending order of service to find which
# counties have the highest rates of working in the service industry.

counties %>%
  # Select state, county, population, and industry-related columns
  select(state, county, population, professional:production) %>%
  # Arrange service in descending order 
  arrange(desc(service))


Rows: 3,138
Columns: 40
$ census_id          <chr> "1001", "1003", "1005", "1007", "1009", "1011", ...
$ state              <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Ala...
$ county             <chr> "Autauga", "Baldwin", "Barbour", "Bibb", "Blount...
$ region             <chr> "South", "South", "South", "South", "South", "So...
$ metro              <chr> "Metro", "Metro", "Nonmetro", "Metro", "Metro", ...
$ population         <dbl> 55221, 195121, 26932, 22604, 57710, 10678, 20354...
$ men                <dbl> 26745, 95314, 14497, 12073, 28512, 5660, 9502, 5...
$ women              <dbl> 28476, 99807, 12435, 10531, 29198, 5018, 10852, ...
$ hispanic           <dbl> 2.6, 4.5, 4.6, 2.2, 8.6, 4.4, 1.2, 3.5, 0.4, 1.5...
$ white              <dbl> 75.8, 83.1, 46.2, 74.5, 87.9, 22.2, 53.3, 73.0, ...
$ black              <dbl> 18.5, 9.5, 46.7, 21.4, 1.5, 70.7, 43.8, 20.3, 40...
$ native             <dbl> 0.4, 0.6, 0.2, 0.4, 0.3, 1.2, 0.1, 0.2, 0.2, 0.6...
$ asian              <dbl> 1

state,county,population,professional,service,office,construction,production
Mississippi,Tunica,10477,23.9,36.6,21.5,3.5,14.5
Texas,Kinney,3577,30.0,36.5,11.6,20.5,1.3
Texas,Kenedy,565,24.9,34.1,20.5,20.5,0.0
New York,Bronx,1428357,24.3,33.3,24.2,7.1,11.0
Texas,Brooks,7221,19.6,32.4,25.3,11.1,11.5
Colorado,Fremont,46809,26.6,32.2,22.8,10.7,7.6
Texas,Culberson,2296,20.1,32.2,24.2,15.7,7.8
California,Del Norte,27788,33.9,31.5,18.8,8.9,6.8
Minnesota,Mahnomen,5496,26.8,31.5,18.7,13.1,9.9
Virginia,Lancaster,11129,30.3,31.2,22.8,8.1,7.6
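
The colon range depends on column order: `professional:production` means "every column from professional through production as they appear in the table". Here is a minimal sketch of that behavior. It assumes a small made-up table named `toy` (hypothetical values, not the real counties data) so it can run without `counties.rds`.

```r
library(dplyr)

# Hypothetical stand-in for the counties table; the values are made up
toy <- tibble(
  state        = c("A", "B"),
  county       = c("X", "Y"),
  population   = c(1000, 2000),
  professional = c(30, 25),
  service      = c(20, 35),
  office       = c(25, 20),
  construction = c(10, 10),
  production   = c(15, 10)
)

# professional:production expands to all columns between the two
# endpoints (inclusive), in the order they are stored in the table
industry <- toy %>%
  select(state, county, population, professional:production) %>%
  arrange(desc(service))

names(industry)
```

If the columns were stored in a different order, the same range could pick up a different set of variables, so it is worth checking with glimpse() first.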


### Select helpers
The select helper starts_with() finds the columns whose names 
begin with a particular string. Another select helper is ends_with(), 
which finds the columns that end with a particular string.

In [3]:
counties %>%
  # Select the state, county, population, and those ending with "work"
  select(state, county, population, ends_with("work")) %>%
  # Filter for counties that have at least 50% of people engaged in public work
  filter(public_work >= 50)

state,county,population,private_work,public_work,family_work
Alaska,Lake and Peninsula Borough,1474,42.2,51.6,0.2
Alaska,Yukon-Koyukuk Census Area,5644,33.3,61.7,0.0
California,Lassen,32645,42.6,50.5,0.1
Hawaii,Kalawao,85,25.0,64.1,0.0
North Dakota,Sioux,4380,32.9,56.8,0.1
South Dakota,Todd,9942,34.4,55.0,0.8
Wisconsin,Menominee,4451,36.8,59.1,0.4
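
To compare the helpers side by side, the sketch below applies ends_with(), starts_with(), and contains() to a small made-up table. The `toy` table and its values are assumptions for illustration only; the column names mirror a few of those in counties.

```r
library(dplyr)

# Hypothetical table with a few counties-style column names
toy <- tibble(
  state         = c("A", "B"),
  county        = c("X", "Y"),
  private_work  = c(70, 40),
  public_work   = c(25, 55),
  family_work   = c(0.2, 0.1),
  poverty       = c(12, 20),
  child_poverty = c(18, 30)
)

# Columns whose names end with "work"
work_cols <- toy %>% select(state, county, ends_with("work"))

# Columns whose names contain "poverty" anywhere
poverty_cols <- toy %>% select(contains("poverty"))

# Columns whose names start with "p"
p_cols <- toy %>% select(starts_with("p"))

names(work_cols)
names(poverty_cols)
names(p_cols)
```

Each helper matches on the column name only, never on the values stored in the column.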


### Renaming a column after count
The rename() verb is often useful for changing the name of 
a column that comes out of another verb, such as count(). 
In this exercise, you'll rename the n column from count() 
(which you learned about in Chapter 2) to something 
more descriptive.


In [4]:
# Use count() to determine how many counties are in each state,
# then rename the resulting n column to num_counties
counties %>%
  count(state) %>%
  rename(num_counties = n)

state,num_counties
Alabama,67
Alaska,28
Arizona,15
Arkansas,75
California,58
Colorado,64
Connecticut,8
Delaware,3
Florida,67
Georgia,159
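
As an aside, count() also accepts a `name` argument in reasonably recent dplyr versions, which sets the output column name directly. The sketch below compares the two approaches on a made-up table (`toy` is hypothetical). It assumes your installed dplyr supports `name`.

```r
library(dplyr)

# Hypothetical table: two counties in state A, one in state B
toy <- tibble(
  state  = c("A", "A", "B"),
  county = c("X", "Y", "Z")
)

# Approach used above: count, then rename the n column
via_rename <- toy %>%
  count(state) %>%
  rename(num_counties = n)

# Alternative: name the count column inside count() itself
via_name <- toy %>%
  count(state, name = "num_counties")

# Compare the two results
all.equal(via_rename, via_name)
```

The rename() form is still the more general tool, since it works on a column produced by any verb.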


### Renaming a column as part of a select
rename() isn't the only way you can choose a new name 
for a column: you can also choose a name as part of a select().
Select the columns state, county, and poverty from the 
counties dataset; in the same step, rename the poverty 
column to poverty_rate.

In [5]:
counties %>%
  select(state, county, poverty_rate = poverty)


state,county,poverty_rate
Alabama,Autauga,12.9
Alabama,Baldwin,13.4
Alabama,Barbour,26.7
Alabama,Bibb,16.8
Alabama,Blount,16.7
Alabama,Bullock,24.6
Alabama,Butler,25.4
Alabama,Calhoun,20.5
Alabama,Chambers,21.6
Alabama,Cherokee,19.2
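
The practical difference between the two approaches is what happens to the other columns: rename() keeps every column, while select() keeps only the ones you list. A minimal sketch on a made-up table (the `toy` table and its `income` column are hypothetical) makes this visible.

```r
library(dplyr)

# Hypothetical table with one extra column that is not selected below
toy <- tibble(
  state   = c("A", "B"),
  county  = c("X", "Y"),
  poverty = c(12.9, 26.7),
  income  = c(50000, 32000)
)

# rename() changes the name and leaves all other columns in place
renamed <- toy %>%
  rename(poverty_rate = poverty)

# select() renames too, but drops any column that is not listed
selected <- toy %>%
  select(state, county, poverty_rate = poverty)

names(renamed)
names(selected)
```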


### Using transmute
The transmute verb works like a combination of select and 
mutate: it allows you to control which variables you keep, 
which variables you calculate, and which variables you drop. 
Only the columns named inside transmute() appear in the result.

In [6]:
counties %>%
  # Keep the state, county, and population columns, and add a density column
  transmute(state, county, population, density = population / land_area) %>%
  # Filter for counties with a population greater than one million 
  filter(population > 1000000) %>%
  # Sort density in ascending order
  arrange(density) 

state,county,population,density
California,San Bernardino,2094769,104.4411
Nevada,Clark,2035572,257.9472
California,Riverside,2298032,318.8841
Arizona,Maricopa,4018143,436.748
Florida,Palm Beach,1378806,699.9868
California,San Diego,3223096,766.1943
Washington,King,2045756,966.9999
Texas,Travis,1121645,1132.7459
Florida,Hillsborough,1302884,1277.0743
Florida,Orange,1229039,1360.4142
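
One way to see what transmute() does is to write the same step as mutate() followed by select(). The sketch below does that on a made-up table (`toy` and its values are hypothetical). Newer dplyr documentation also describes transmute() as superseded in favor of mutate(.keep = "none"), so check the docs for your installed version.

```r
library(dplyr)

# Hypothetical table with one column (poverty) that should be dropped
toy <- tibble(
  state      = c("A", "B"),
  county     = c("X", "Y"),
  population = c(1000, 4000),
  land_area  = c(10, 20),
  poverty    = c(12.9, 26.7)
)

# transmute(): keep the listed columns and compute density in one step
via_transmute <- toy %>%
  transmute(state, county, population, density = population / land_area)

# Same result in two steps: compute with mutate(), then keep with select()
via_mutate <- toy %>%
  mutate(density = population / land_area) %>%
  select(state, county, population, density)

names(via_transmute)
via_transmute$density
```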


### Choosing among the four verbs
In this chapter you've learned about the four verbs: select, mutate, transmute, and rename. Here, you'll choose the appropriate verb for each situation. You won't need to change anything inside the parentheses.

In [7]:
# sum up
# Change the name of the unemployment column
counties %>%
  rename(unemployment_rate = unemployment)

# Keep the state and county columns, and the columns containing poverty
counties %>%
  select(state, county, contains("poverty"))

# Calculate the fraction_women column without dropping the other columns
counties %>%
  mutate(fraction_women = women / population)

# Keep only the state and county columns, plus a calculated employment_rate column
counties %>%
  transmute(state, county, employment_rate = employed / population)

census_id,state,county,region,metro,population,men,women,hispanic,white,...,other_transp,work_at_home,mean_commute,employed,private_work,public_work,self_employed,family_work,unemployment_rate,land_area
1001,Alabama,Autauga,South,Metro,55221,26745,28476,2.6,75.8,...,1.3,1.8,26.5,23986,73.6,20.9,5.5,0.0,7.6,594.44
1003,Alabama,Baldwin,South,Metro,195121,95314,99807,4.5,83.1,...,1.4,3.9,26.4,85953,81.5,12.3,5.8,0.4,7.5,1589.78
1005,Alabama,Barbour,South,Nonmetro,26932,14497,12435,4.6,46.2,...,1.5,1.6,24.1,8597,71.8,20.8,7.3,0.1,17.6,884.88
1007,Alabama,Bibb,South,Metro,22604,12073,10531,2.2,74.5,...,1.5,0.7,28.8,8294,76.8,16.1,6.7,0.4,8.3,622.58
1009,Alabama,Blount,South,Metro,57710,28512,29198,8.6,87.9,...,0.4,2.3,34.9,22189,82.0,13.5,4.2,0.4,7.7,644.78
1011,Alabama,Bullock,South,Nonmetro,10678,5660,5018,4.4,22.2,...,1.7,2.8,27.5,3865,79.5,15.1,5.4,0.0,18.0,622.81
1013,Alabama,Butler,South,Nonmetro,20354,9502,10852,1.2,53.3,...,0.6,1.7,24.6,7813,77.4,16.2,6.2,0.2,10.9,776.83
1015,Alabama,Calhoun,South,Metro,116648,56274,60374,3.5,73.0,...,1.2,2.7,24.1,47401,74.1,20.8,5.0,0.1,12.3,605.87
1017,Alabama,Chambers,South,Nonmetro,34079,16258,17821,0.4,57.3,...,0.4,2.1,25.1,13689,85.1,12.1,2.8,0.0,8.9,596.53
1019,Alabama,Cherokee,South,Nonmetro,26008,12975,13033,1.5,91.7,...,0.7,2.5,27.4,10155,73.1,18.5,7.9,0.5,7.9,553.70


state,county,poverty,child_poverty
Alabama,Autauga,12.9,18.6
Alabama,Baldwin,13.4,19.2
Alabama,Barbour,26.7,45.3
Alabama,Bibb,16.8,27.9
Alabama,Blount,16.7,27.2
Alabama,Bullock,24.6,38.4
Alabama,Butler,25.4,39.2
Alabama,Calhoun,20.5,31.6
Alabama,Chambers,21.6,37.2
Alabama,Cherokee,19.2,30.1


census_id,state,county,region,metro,population,men,women,hispanic,white,...,work_at_home,mean_commute,employed,private_work,public_work,self_employed,family_work,unemployment,land_area,fraction_women
1001,Alabama,Autauga,South,Metro,55221,26745,28476,2.6,75.8,...,1.8,26.5,23986,73.6,20.9,5.5,0.0,7.6,594.44,0.5156734
1003,Alabama,Baldwin,South,Metro,195121,95314,99807,4.5,83.1,...,3.9,26.4,85953,81.5,12.3,5.8,0.4,7.5,1589.78,0.5115134
1005,Alabama,Barbour,South,Nonmetro,26932,14497,12435,4.6,46.2,...,1.6,24.1,8597,71.8,20.8,7.3,0.1,17.6,884.88,0.4617184
1007,Alabama,Bibb,South,Metro,22604,12073,10531,2.2,74.5,...,0.7,28.8,8294,76.8,16.1,6.7,0.4,8.3,622.58,0.4658910
1009,Alabama,Blount,South,Metro,57710,28512,29198,8.6,87.9,...,2.3,34.9,22189,82.0,13.5,4.2,0.4,7.7,644.78,0.5059435
1011,Alabama,Bullock,South,Nonmetro,10678,5660,5018,4.4,22.2,...,2.8,27.5,3865,79.5,15.1,5.4,0.0,18.0,622.81,0.4699382
1013,Alabama,Butler,South,Nonmetro,20354,9502,10852,1.2,53.3,...,1.7,24.6,7813,77.4,16.2,6.2,0.2,10.9,776.83,0.5331630
1015,Alabama,Calhoun,South,Metro,116648,56274,60374,3.5,73.0,...,2.7,24.1,47401,74.1,20.8,5.0,0.1,12.3,605.87,0.5175742
1017,Alabama,Chambers,South,Nonmetro,34079,16258,17821,0.4,57.3,...,2.1,25.1,13689,85.1,12.1,2.8,0.0,8.9,596.53,0.5229320
1019,Alabama,Cherokee,South,Nonmetro,26008,12975,13033,1.5,91.7,...,2.5,27.4,10155,73.1,18.5,7.9,0.5,7.9,553.70,0.5011150


state,county,employment_rate
Alabama,Autauga,0.4343637
Alabama,Baldwin,0.4405113
Alabama,Barbour,0.3192113
Alabama,Bibb,0.3669262
Alabama,Blount,0.3844914
Alabama,Bullock,0.3619592
Alabama,Butler,0.3838558
Alabama,Calhoun,0.4063593
Alabama,Chambers,0.4016843
Alabama,Cherokee,0.3904568
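
A quick way to tell the four verbs apart is to count the columns each one leaves behind. The sketch below is only an illustration on a made-up five-column table (`toy`, the `female` column name, and all values are hypothetical).

```r
library(dplyr)

# Hypothetical five-column table
toy <- tibble(
  state      = c("A", "B"),
  county     = c("X", "Y"),
  population = c(1000, 2000),
  women      = c(510, 980),
  employed   = c(400, 900)
)

results <- list(
  # rename: same columns, one new name
  rename    = toy %>% rename(female = women),
  # select: only the listed columns
  select    = toy %>% select(state, county),
  # mutate: all original columns plus the new one
  mutate    = toy %>% mutate(fraction_women = women / population),
  # transmute: only the listed and calculated columns
  transmute = toy %>%
    transmute(state, county, employment_rate = employed / population)
)

# Number of columns in each result (the toy table starts with 5)
sapply(results, ncol)
```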
