# Data Manipulation with dplyr

Say you've found a great dataset and would like to learn more about it. How can you start to answer the questions you have about the data? You can use dplyr to answer those questions, and it can also help with basic transformations of your data. You'll learn to aggregate your data and to add, remove, or change variables. Along the way, you'll explore a dataset containing information about counties in the United States. You'll finish the course by applying these tools to the babynames dataset to explore trends in baby names in the United States.

## Transforming Data with dplyr

Learn verbs you can use to transform your data, including select, filter, arrange, and mutate. You'll use these functions to modify the counties dataset to view particular observations and answer questions about the data.

In [21]:
# The easiest way to get dplyr is to install the whole tidyverse:
# install.packages("tidyverse")
# Alternatively, install just dplyr:
# install.packages("dplyr")
library(dplyr)

# Load the counties dataset (readRDS() is base R, so no extra package is needed)
counties <- readRDS(file = "counties.rds")

# Select some variables
counties_selected <- counties %>%
    select(state, county, population, unemployment, income)
counties_selected

state,county,population,unemployment,income
Alabama,Autauga,55221,7.6,51281
Alabama,Baldwin,195121,7.5,50254
Alabama,Barbour,26932,17.6,32964
Alabama,Bibb,22604,8.3,38678
Alabama,Blount,57710,7.7,45813
Alabama,Bullock,10678,18.0,31938
Alabama,Butler,20354,10.9,32229
Alabama,Calhoun,116648,12.3,41703
Alabama,Chambers,34079,8.9,34177
Alabama,Cherokee,26008,7.9,36296
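
Besides listing column names one by one, select() also accepts a range of adjacent columns and a minus sign to drop a column. The following is a minimal sketch rather than part of the original exercise: `sample_counties` is a hypothetical two-row data frame copied from the output above, so that the code runs without counties.rds.

```r
library(dplyr)

# Hypothetical sample: the first two rows of the output above
sample_counties <- data.frame(
  state = c("Alabama", "Alabama"),
  county = c("Autauga", "Baldwin"),
  population = c(55221, 195121),
  unemployment = c(7.6, 7.5),
  income = c(51281, 50254),
  stringsAsFactors = FALSE
)

# Select a range of adjacent columns: state, county, population
first_three <- sample_counties %>%
  select(state:population)

# Drop a single column and keep everything else
without_income <- sample_counties %>%
  select(-income)

names(first_three)
names(without_income)
```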


### Arranging observations
Here you see the counties_selected dataset with a
few interesting variables selected. The variables
private_work, public_work, and self_employed describe
whether people work for private companies, for the government,
or for themselves.
In these exercises, you'll sort these observations to find
the most interesting cases.

In [22]:
counties_selected <- counties %>%
  select(state, county, population, private_work, public_work, self_employed)

# Add a verb to sort in descending order of public_work
counties_selected %>%
  arrange(desc(public_work))

state,county,population,private_work,public_work,self_employed
Hawaii,Kalawao,85,25.0,64.1,10.9
Alaska,Yukon-Koyukuk Census Area,5644,33.3,61.7,5.1
Wisconsin,Menominee,4451,36.8,59.1,3.7
North Dakota,Sioux,4380,32.9,56.8,10.2
South Dakota,Todd,9942,34.4,55.0,9.8
Alaska,Lake and Peninsula Borough,1474,42.2,51.6,6.1
California,Lassen,32645,42.6,50.5,6.8
South Dakota,Buffalo,2038,48.4,49.5,1.8
South Dakota,Dewey,5579,34.9,49.2,14.7
Texas,Kenedy,565,51.9,48.1,0.0
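
arrange() sorts in ascending order unless a column is wrapped in desc(), and additional columns break ties among rows that share the earlier values. The sketch below illustrates both behaviors; `sample_counties` is a hypothetical four-row data frame copied from the output above, not the full dataset.

```r
library(dplyr)

# Hypothetical sample: four rows copied from the output above
sample_counties <- data.frame(
  state = c("South Dakota", "Alaska", "South Dakota", "Alaska"),
  county = c("Todd", "Yukon-Koyukuk Census Area", "Buffalo",
             "Lake and Peninsula Borough"),
  public_work = c(55.0, 61.7, 49.5, 51.6),
  stringsAsFactors = FALSE
)

# Ascending order is the default
ascending <- sample_counties %>%
  arrange(public_work)

# Sort by state first, then by descending public_work within each state
by_state <- sample_counties %>%
  arrange(state, desc(public_work))

ascending
by_state
```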


### Filtering for conditions
The filter() verb keeps only the observations
that match a particular condition, or that match several
conditions at once.

In [23]:
counties_selected <- counties %>%
  select(state, county, population)

# Filter for counties with a population above 1000000
counties_selected %>%
  filter(population > 1000000)

# Filter for counties in the state of California that have a population above 1000000
counties_selected %>%
  filter(state == 'California') %>%
  filter(population > 1000000)

state,county,population
Arizona,Maricopa,4018143
California,Alameda,1584983
California,Contra Costa,1096068
California,Los Angeles,10038388
California,Orange,3116069
California,Riverside,2298032
California,Sacramento,1465832
California,San Bernardino,2094769
California,San Diego,3223096
California,Santa Clara,1868149


state,county,population
California,Alameda,1584983
California,Contra Costa,1096068
California,Los Angeles,10038388
California,Orange,3116069
California,Riverside,2298032
California,Sacramento,1465832
California,San Bernardino,2094769
California,San Diego,3223096
California,Santa Clara,1868149
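
The two chained filter() calls above can also be written as a single call, because filter() combines comma-separated conditions with a logical AND. The sketch below compares both forms on a hypothetical `sample_counties` data frame built from rows that appear in the outputs of this chapter.

```r
library(dplyr)

# Hypothetical sample: rows copied from outputs shown in this chapter
sample_counties <- data.frame(
  state = c("Arizona", "California", "California", "Alabama"),
  county = c("Maricopa", "Alameda", "Los Angeles", "Autauga"),
  population = c(4018143, 1584983, 10038388, 55221),
  stringsAsFactors = FALSE
)

# Two filter() calls, one condition each
chained <- sample_counties %>%
  filter(state == "California") %>%
  filter(population > 1000000)

# One filter() call; the comma means both conditions must hold
combined <- sample_counties %>%
  filter(state == "California", population > 1000000)

# Compare the two results
all.equal(chained, combined)
```

Either form is fine; the single call is a little shorter, while chaining can be easier to read when each condition deserves its own comment.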


### Filtering and arranging
We're often interested in both filtering and sorting a
dataset to focus on the observations of particular
interest. Here, you'll find counties that are extreme
examples of what fraction of the population works in the
private sector.

In [24]:
counties_selected <- counties %>%
  select(state, county, population, private_work, public_work, self_employed)

# Filter for Texas and more than 10000 people; sort in descending order of private_work
counties_selected %>%
  filter(state == 'Texas') %>%
  filter(population > 10000) %>%
  arrange(desc(private_work))

state,county,population,private_work,public_work,self_employed
Texas,Gregg,123178,84.7,9.8,5.4
Texas,Collin,862215,84.1,10.0,5.8
Texas,Dallas,2485003,83.9,9.5,6.4
Texas,Harris,4356362,83.4,10.1,6.3
Texas,Andrews,16775,83.1,9.6,6.8
Texas,Tarrant,1914526,83.1,11.4,5.4
Texas,Titus,32553,82.5,10.0,7.4
Texas,Denton,731851,82.2,11.9,5.7
Texas,Ector,149557,82.0,11.2,6.7
Texas,Moore,22281,82.0,11.7,5.9


### Calculating the number of government employees
Use mutate() to add a column called public_workers 
to the dataset, with the number of people employed 
in public (government) work.

In [25]:
counties_selected <- counties %>%
  select(state, county, population, public_work)

# Add a new column public_workers with the number of people employed in public work
counties_selected %>%
  mutate(public_workers = population * public_work / 100)

# Sort by the new column in descending order
counties_selected %>%
  mutate(public_workers = public_work * population / 100) %>%
  arrange(desc(public_workers))

state,county,population,public_work,public_workers
Alabama,Autauga,55221,20.9,11541.189
Alabama,Baldwin,195121,12.3,23999.883
Alabama,Barbour,26932,20.8,5601.856
Alabama,Bibb,22604,16.1,3639.244
Alabama,Blount,57710,13.5,7790.850
Alabama,Bullock,10678,15.1,1612.378
Alabama,Butler,20354,16.2,3297.348
Alabama,Calhoun,116648,20.8,24262.784
Alabama,Chambers,34079,12.1,4123.559
Alabama,Cherokee,26008,18.5,4811.480
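
The output above is in the original row order, so it corresponds to the first expression in the cell. To illustrate the second expression, the sketch below recomputes public_workers and sorts by it on a hypothetical `sample_counties` data frame holding three of the Alabama rows shown above. The resulting order applies only to these sample rows, not to the full dataset.

```r
library(dplyr)

# Hypothetical sample: three rows copied from the output above
sample_counties <- data.frame(
  state = c("Alabama", "Alabama", "Alabama"),
  county = c("Autauga", "Baldwin", "Calhoun"),
  population = c(55221, 195121, 116648),
  public_work = c(20.9, 12.3, 20.8),
  stringsAsFactors = FALSE
)

# public_work is a percentage, so divide by 100 to get a head count
sorted <- sample_counties %>%
  mutate(public_workers = public_work * population / 100) %>%
  arrange(desc(public_workers))

sorted
```

Within this sample, Calhoun sorts ahead of Baldwin even though Baldwin has the larger population, because a larger share of Calhoun's population is in public work.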


### Calculating the percentage of women in a county
The dataset includes columns for the total number
(not percentage) of men and women in each county.
You can use these, along with the population variable,
to compute the fraction of men (or women) within each county.

In [26]:
# Select the columns state, county, population, men, and women
counties_selected <- counties %>%
  select(state, county, population, men, women)

# Calculate proportion_women as the fraction of the population made up of women
counties_selected %>%
  mutate(proportion_women = women / population)

state,county,population,men,women,proportion_women
Alabama,Autauga,55221,26745,28476,0.5156734
Alabama,Baldwin,195121,95314,99807,0.5115134
Alabama,Barbour,26932,14497,12435,0.4617184
Alabama,Bibb,22604,12073,10531,0.4658910
Alabama,Blount,57710,28512,29198,0.5059435
Alabama,Bullock,10678,5660,5018,0.4699382
Alabama,Butler,20354,9502,10852,0.5331630
Alabama,Calhoun,116648,56274,60374,0.5175742
Alabama,Chambers,34079,16258,17821,0.5229320
Alabama,Cherokee,26008,12975,13033,0.5011150
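
The proportion_women column above is a fraction between 0 and 1. If a percentage is wanted instead, one option is to multiply by 100 in the same mutate() call, since mutate() lets a later expression refer to a column created earlier in that call. The sketch below assumes a hypothetical `sample_counties` data frame with two rows copied from the output above; `percent_women` is an illustrative column name, not one from the dataset.

```r
library(dplyr)

# Hypothetical sample: two rows copied from the output above
sample_counties <- data.frame(
  state = c("Alabama", "Alabama"),
  county = c("Autauga", "Barbour"),
  population = c(55221, 26932),
  men = c(26745, 14497),
  women = c(28476, 12435),
  stringsAsFactors = FALSE
)

with_percent <- sample_counties %>%
  mutate(
    # Fraction of the population made up of women
    proportion_women = women / population,
    # Same quantity as a percentage, reusing the column defined just above
    percent_women = 100 * proportion_women
  )

with_percent
```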


### Select, mutate, filter, and arrange
In this exercise, you'll put together
everything you've learned in this chapter (select(),
mutate(), filter(), and arrange()) to find the counties with
the highest proportion of men.

In [27]:
counties %>%
  # Select the five columns
  select(state, county, population, men, women) %>%
  # Add the proportion_men variable
  mutate(proportion_men = men / population) %>%
  # Filter for population of at least 10,000
  filter(population >= 10000) %>%
  # Arrange proportion of men in descending order
  arrange(desc(proportion_men))


state,county,population,men,women,proportion_men
Virginia,Sussex,11864,8130,3734,0.6852664
California,Lassen,32645,21818,10827,0.6683412
Georgia,Chattahoochee,11914,7940,3974,0.6664428
Louisiana,West Feliciana,15415,10228,5187,0.6635096
Florida,Union,15191,9830,5361,0.6470937
Texas,Jones,19978,12652,7326,0.6332966
Missouri,DeKalb,12782,8080,4702,0.6321389
Texas,Madison,13838,8648,5190,0.6249458
Virginia,Greensville,11760,7303,4457,0.6210034
Texas,Anderson,57915,35469,22446,0.6124320
