# Scatter Plot Matrices and Extensions {#ggplot-exts}
## Data
For this section, we'll look at data from the American Community Survey (ACS)
on median income by geographical mobility. To download the data:
1. Go to the [American FactFinder website](https://factfinder.census.gov/).
2. Click on the "Download Center" section, then click the "DOWNLOAD CENTER" button.
3. Click the "NEXT" button, since we know the table we want to download.
4. Select "American Community Survey" from the Program dropdown.
5. Select "2015 ACS 5-year estimates", click the "ADD TO YOUR SELECTIONS" button, then click "NEXT"
6. Select "County - 050" from the geographic type dropdown, then select "All Counties within United States", click the "ADD TO YOUR SELECTIONS" button, then click "NEXT"
7. Type `income mobility` in the "topic or table name" search box, then select the option that reads:
"B07011: MEDIAN INCOME IN THE PAST 12 MONTHS (IN 2015 INFLATION-ADJUSTED DOLLARS) BY GEOGRAPHICAL MOBILITY IN THE PAST YEAR FOR CURRENT RESIDENCE IN THE UNITED STATES"
Click "GO", then check the checkbox beside the table we found. Now click on the "Download" button, and
uncheck the option that says, "Include descriptive data element names." Click "Ok" to create your zip file. Once the file has been created, click "DOWNLOAD" to download the zip file.
The rest of this section assumes you moved the following files from the zip into your `data` folder:
* ACS_15_5YR_B07011_with_ann.csv
* ACS_15_5YR_B07011_metadata.csv
The file `ACS_15_5YR_B07011.txt` tells us how to interpret codes within our data. It is possible for median values to be followed by a `+` or `-` if they
are in the upper or lower open-ended interval. In our dataset we don't have any medians in the upper open-ended interval, but we do have entries in the
lower open-ended interval.
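To make the coding scheme concrete, here is a minimal sketch of how such suffixed values could be separated into a number and a pair of flags. The values in `raw` are made up for illustration and are not taken from the ACS file; the variable names are hypothetical as well.

```r
# Hypothetical values in the style described above (not real ACS data).
raw <- c("25,000", "2,500-", "250,000+", NA)

# A trailing "-" marks the lower open-ended interval,
# a trailing "+" marks the upper open-ended interval.
bottom_coded <- grepl("-$", raw)
top_coded <- grepl("\\+$", raw)

# Drop everything that is not a digit, then convert to integer.
# Missing values stay missing.
income <- as.integer(gsub("[^0-9]", "", raw))

data.frame(raw, income, bottom_coded, top_coded)
```

Note that `grepl()` returns `FALSE` for missing values, so a missing median is never flagged as bottom- or top-coded.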
```{r}
library(tidyverse)
```
```{r, echo=FALSE}
select <- dplyr::select
```
```{r exts-setup}
acs <- read_csv("data/ACS_15_5YR_B07011_with_ann.csv", col_types = strrep("c", 15), na = c("-", "(X)"))
meta <- read_csv("data/ACS_15_5YR_B07011_metadata.csv")
meta
```
Let's keep only the variables we care about, using more informative variable names.
```{r}
acs_mobility <- acs %>%
  transmute(
    geo = `GEO.display-label`,
    same_house = HD01_VD03,
    same_county = HD01_VD04,
    same_state = HD01_VD05,
    same_country = HD01_VD06,
    different_country = HD01_VD07
  )
acs_mobility
```
Now we can add an indicator for whether the median value is in the lowest available interval. This would mean that the median value presented has been bottom-coded.
```{r}
acs_mobility <- acs_mobility %>%
  mutate(
    same_country_bc = grepl("[0-9]*-", same_country),
    different_country_bc = grepl("[0-9]*-", different_country)
  )
acs_mobility
```
Let's see how many counties have observations that are bottom coded:
```{r}
acs_mobility %>%
  summarize(
    same_country_bc = sum(same_country_bc),
    different_country_bc = sum(different_country_bc),
    counties = n()
  )
```
Let's see what the typical bottom-coded values are:
```{r}
acs_mobility %>%
  filter(same_country_bc) %>%
  select(same_country) %>%
  table()
```
```{r}
acs_mobility %>%
  filter(different_country_bc) %>%
  select(different_country) %>%
  table()
```
In both cases the bottom-coded interval is the range from zero to 2,500.
Since only a small number of counties are affected, let's simply
set the bottom-coded values to the upper bound of their interval (i.e., 2,500).
```{r}
acs_mobility <- acs_mobility %>%
  transmute(
    geo = geo,
    same_house = if_else(grepl("[0-9]*-", same_house), 2500L, as.integer(same_house)),
    same_county = if_else(grepl("[0-9]*-", same_county), 2500L, as.integer(same_county)),
    same_state = if_else(grepl("[0-9]*-", same_state), 2500L, as.integer(same_state)),
    same_country = if_else(same_country_bc, 2500L, as.integer(same_country)),
    different_country = if_else(different_country_bc, 2500L, as.integer(different_country))
  )
acs_mobility
```
Let's rearrange the data into `tidy` format (one observation per row).
```{r}
tidy_acs <- acs_mobility %>%
  gather(location_last_year, median_income, -geo, factor_key = TRUE)
tidy_acs
```
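For readers on a recent version of `tidyr`, the same wide-to-long reshaping can be sketched with `pivot_longer()`, which the `tidyr` documentation recommends over `gather()` for new code. The example below uses a small made-up table rather than the ACS data, and the name `wide` is hypothetical. It assumes `tidyr` 1.0.0 or later is installed.

```r
library(tidyr)
library(tibble)

# Toy stand-in for acs_mobility (values are invented for illustration).
wide <- tibble(
  geo = c("County A", "County B"),
  same_house = c(30000L, 28000L),
  same_county = c(22000L, 2500L)
)

# One row per (geo, location_last_year) pair.
long <- pivot_longer(
  wide,
  cols = -geo,
  names_to = "location_last_year",
  values_to = "median_income"
)

# Keep the original column order as factor levels,
# similar in spirit to gather(..., factor_key = TRUE).
long$location_last_year <- factor(long$location_last_year, levels = names(wide)[-1])
long
```

One difference worth knowing: `pivot_longer()` keeps the rows for each county together, while `gather()` stacks the output one source column at a time. The contents are the same; only the row order differs.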
## ggplot2 extensions
The community has built many extensions on top of ggplot2.
The following link provides a gallery of many of these extensions:
[ggplot2 extensions](http://www.ggplot2-exts.org/gallery)
Some others that are useful are `ggjoy` and `GGally`.
## ggjoy
Make sure `ggjoy` is installed.
```
install.packages("ggjoy")
```
`ggjoy` gives us the ability to stack kernel density plots.
```{r}
ggplot(tidy_acs, aes(x = median_income, y = location_last_year, group = location_last_year)) +
  ggjoy::geom_joy() +
  ggjoy::theme_joy()
```
This plot shows us that, on average, the distance moved in the past year is inversely related to median income.
## scatterplot matrix (GGally::ggscatmat)
Make sure you have `GGally` installed.
```
install.packages("GGally")
```
One particular package, [GGally](http://ggobi.github.io/ggally/), has a great set of visualizations
to extend those that come prebuilt with `ggplot2`. One common visualization tool that is missing from
ggplot2 is the scatterplot matrix. While the `lattice` package, which ships with R, provides `splom()`,
`GGally::ggpairs` and `GGally::ggscatmat` provide an easy way to create a scatterplot matrix with
ggplot2.
```{r}
acs_mobility %>%
  as.data.frame() %>%
  GGally::ggscatmat(columns = 2:ncol(.), alpha = 0.1)
```
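`GGally::ggpairs`, mentioned above, can be tried in the same way. The sketch below uses the built-in `iris` data set instead of the ACS data so that it runs without any downloads; the choice of `iris` and the name `p` are only for illustration.

```r
library(GGally)

# Scatterplot matrix of the four numeric columns of iris.
# With the default settings for continuous variables, ggpairs should draw
# scatterplots below the diagonal, density curves on the diagonal,
# and correlations above it.
p <- ggpairs(iris, columns = 1:4)
p
```

Compared with `ggscatmat()`, which only handles numeric columns, `ggpairs()` also accepts categorical columns, at the cost of being slower on large data sets.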
## Assignment
Go to [American FactFinder](https://factfinder.census.gov/). Follow the steps above up until the point where
we typed out "income mobility". This time pick another keyword to search for and select a different table to
analyze (make sure this will give you more than two columns of data you would like to compare).
Use `ggscatmat()` to visualize the variables that interest you. Keep the filenames as they are provided by the
Census, so I can run your R Markdown file.