Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
306 lines (245 sloc) 13.3 KB
---
title: Multi-Census Tongfen
author: Jens von Bergmann
date: '2019-06-04'
slug: multi-census-tongfen
categories:
- cancensus
- CensusMapper
tags: []
description: 'Comparing multiple censuses across fine geographies.'
featured: ''
images: ["https://doodles.mountainmath.ca/posts/2019-06-04-multi-census-tongfen_files/figure-html/shelter_cost_burdened_change-1.png"]
featuredalt: ""
featuredpath: ""
linktitle: ''
type: "post"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = FALSE,
message = FALSE,
warning = FALSE,
fig.width = 9,
cache = TRUE
)
library(tidyverse)
library(tongfen)
library(sf)
library(cancensusHelpers)
library(multiscales)
census_transform <- function(data){
data %>%
mutate_at(intersect(names(.),c("v_CA01_1669","v_CA01_1673","v_CA06_2052","v_CA06_2057", "v_CA11N_2280", "v_CA16_4889")),
function(d) ifelse(.$Population==0,0,d)) %>%
mutate_at(intersect(names(.),c("v_CA01_1666","v_CA01_1670","v_CA06_2049","v_CA06_2053","v_CA11N_2277","v_CA16_4886")),
function(d) ifelse(.$Population==0,1,d))
}
compute_sc <- function(data) {
data %>%
mutate(sc_01=v_CA01_1669+v_CA01_1673,
sc_06=v_CA06_2052+v_CA06_2057,
sc_11=v_CA11N_2280,
sc_16=v_CA16_4889,
t_01=v_CA01_1666+v_CA01_1670,
t_06=v_CA06_2049+v_CA06_2053,
t_11=v_CA11N_2277,
t_16=v_CA16_4886) %>%
mutate(`2001`=sc_01/t_01,
`2006`=sc_06/t_06,
`2011`=sc_11/t_11,
`2016`=sc_16/t_16)
}
caption="MountainMath, StatCan Census 2001, 2006"
bar_theme <- list(
theme_light(),
scale_y_continuous(labels = scales::percent),
labs(caption=caption)
)
```
Two days ago I gave an example using the new (to CensusMapper) 2001 census data to mix with 2006 data on a common geography based on dissemination areas. A [question came up](https://twitter.com/Lfang17/status/1135931232049438720) if this works for several censuses, not just for two. Yes, the [TongFen package](https://github.com/mountainMath/tongfen) was built with exactly that in mind. Time for a quick demo.
For this we will look at the households spending between 30% and 100% of income on housing in the City of Toronto. To grab the relevant variables we query all available CensusMapper datasets and query each for all census variables with the term "30%" in the description and grab all "parent" and "child" variables (wrt the CensusMapper variable hierarchies) for good measure.
```{r echo=TRUE}
all_vectors <- list_census_datasets()$dataset %>%
lapply(function(ds) search_census_vectors(" 30%",ds) %>%
bind_rows(child_census_vectors(.),parent_census_vectors(.))) %>%
bind_rows %>%
unique
regions <- search_census_regions("^Toronto$","CA01") %>% filter(level=="CSD") %>% as_census_region_list()
```
To start off, we take a look at the summary statistics for the City of Toronto. No TongFen needed here, the geography has not changed since 2001.
```{r echo=TRUE}
total_data <- get_census("CA16",regions=regions,vectors = all_vectors$vector,labels = 'short') %>%
compute_sc %>%
gather(key="Year",value="Share",seq(2001,2016,5) %>% as.character)
ggplot(total_data,aes(x=Year,y=Share)) +
geom_bar(stat="identity",fill="brown") +
bar_theme +
labs(title="City of Toronto share of households spending 30% to 100% of income on shelter")
```
The high-level numbers suggest a slight worsening, that is an increasing share of shelter-cost burdened households. If we are interested in a finer breakdown of the geographic distribution, we need to TongFen the data to CTs or DAs.
## CT level data
To get CT level data for all four censues on a common tiling we just need to make the appropriate call, tongfen will take care of the rest.
```{r echo=TRUE}
data_ct <- get_tongfen_census_ct(regions=regions,vectors = all_vectors$vector,geo_format = 'sf')
```
```{r include=FALSE}
bbox=st_bbox(data_ct)
vector_tiles <- simpleCache(get_vector_tiles(bbox),"toronto_csd_vector_tiles")
roads <- rmapzen::as_sf(vector_tiles$roads) %>% filter(kind != "ferry")
water <- rmapzen::as_sf(vector_tiles$water)
map_end_theme <- list(
coord_sf(datum=NA,
xlim=c(bbox$xmin,bbox$xmax),
ylim=c(bbox$ymin,bbox$ymax)),
labs(caption=caption,fill="")
)
map_start_theme <- list(
geom_sf(data=roads,size=0.1,color="darkgrey",fill=NA),
geom_sf(data = water, fill = "lightblue", colour = NA)
)
map_theme <- c(map_start_theme,map_end_theme)
compute_shares <- function(data){
plot_data <- data %>%
compute_sc()%>%
gather(key="Year",value="Share",seq(2001,2016,5) %>% as.character,factor_key = TRUE)
sc_breaks <- seq(0,ceiling(max(plot_data$Share,na.rm=TRUE)*10)/10,0.1)
sc_labels <- seq(1,length(sc_breaks)-1) %>% map(function(i)paste0(scales::percent(sc_breaks[i],2)," to ",scales::percent(sc_breaks[i+1],2)))
sc_breaks[1]=-0.01
sc_labels[1]=paste0("Less than ",scales::percent(sc_breaks[2],1))
sc_labels[length(sc_labels)]=paste0("More than ",scales::percent(sc_breaks[length(sc_labels)],1))
plot_data %>%
mutate(share_d=cut(Share,breaks=sc_breaks,labels=sc_labels))
}
share_theme <- list(
scale_fill_brewer(palette="Oranges",na.value="grey"),
map_theme,
theme(legend.position = "bottom")
)
compute_changes <- function(data){
plot_data <- data %>%
compute_sc()%>%
mutate(`2001 - 2006`=`2006`-`2001`,
`2006 - 2011`=`2011`-`2006`,
`2011 - 2016`=`2016`-`2011`,
`2001 - 2016`=`2016`-`2001`) %>%
gather(key="Year",value="Change",c("2001 - 2006","2006 - 2011","2011 - 2016","2001 - 2016"),factor_key = TRUE)
change_breaks <- c(seq(-0.25,0.25,0.1),-0.02,0.02) %>% sort()
change_labels <- seq(1,length(change_breaks)-1) %>% map(function(i)paste0(scales::percent(change_breaks[i],1)," to ",scales::percent(change_breaks[i+1],1))) %>% unlist
change_labels[1]=paste0("More than ",scales::percent(change_breaks[2],1))
change_labels[length(change_labels)]=paste0("More than ",scales::percent(change_breaks[length(change_labels)],1))
#change_colors <- setNames(RColorBrewer::brewer.pal(length(change_labels),name="PiYG") %>% rev,change_labels)
plot_data %>%
mutate(change_d=cut(Change,breaks=change_breaks,labels=change_labels))
}
change_theme <- list(
scale_fill_brewer(palette="PiYG",direction = 1,na.value="grey"),
map_theme,
theme(legend.position = "bottom")
)
```
With the data at hand, we can easily take a look at the geographic distribution. Here we have encapsulated the computation of the shares in the `compute_shares` function call, details are [in the code](https://github.com/mountainMath/doodles/blob/master/content/posts/2019-06-04-multi-census-tongfen.Rmarkdown).
```{r echo=TRUE, fig.height=9}
plot_data <- data_ct %>%
compute_shares
ggplot(plot_data) +
geom_sf(aes(fill=share_d),size=0.1) +
facet_wrap("Year") +
share_theme +
labs(title="Spending 30% to 100% of income on shelter")
```
This gives us a good overview over where shelter-cost burdened households were located in Toronto in each year. We can also see some changes over time, but to really understand the changes it is easier to plot them directly.
```{r shelter_cost_burdened_change, fig.height=9}
plot_data <- data_ct %>%
compute_changes()
ggplot(plot_data) +
geom_sf(aes(fill=change_d),size=0.1) +
change_theme +
facet_wrap("Year") +
labs(title="Percentage point change in share spending 30% to 100% of income on shelter")
```
There is significant variation on how the shares changed during those years. Part of that volatility is possibly an indication that households cluster around the 30% cutoff and thus can easily slide across that line. Another issue is data quality for the 2011 NHS, one should probably view changes involving 2011 data with a bit of caution. We have observed before that while CT level income estimates were [better than their reputation](https://doodles.mountainmath.ca/blog/2017/09/29/a-retrospective-look-at-nhs-income-data/), this did not translate to good individual level income to shelter cost estimates. Imputation of aggregate income numbers using CRA taxfiler data is reasonably easy, but imputation of individual level incomes to join with shelter cost data is hard. The long-term 2001 to 2016 estimate in the bottom right is probably the one that should receive most of the attention.
## DA level data
If we want even finer resolution we can take this down to dissemination area level. Here we pass an optional "transform" function to transform the initial census data before the TongFen process. This allows us to convert `NA` values for areas without households or population to zeros, eliminating areas for which we can't report data just because they contain subareas without population.
```{r echo=TRUE, fig.height=9}
data_da <- get_tongfen_census_da(regions=regions,vectors = all_vectors$vector,
geo_format = 'sf',na.rm=FALSE,
census_data_transform=census_transform)
plot_data <- data_da %>%
compute_shares
ggplot(plot_data) +
geom_sf(aes(fill=share_d),size=0.01) +
share_theme +
facet_wrap("Year") +
labs(title="Spending 30% to 100% of income on shelter")
```
And of course, we can also compute the changes at the DA level.
```{r fig.height=9}
plot_data <- data_da %>%
compute_changes()
ggplot(plot_data) +
geom_sf(aes(fill=change_d),size=0.1) +
change_theme +
facet_wrap("Year") +
labs(title="Percentage point change in share spending 30% to 100% of income on shelter")
```
## Uncertainty
However, here we start to run into problems. Changes across small geographic areas are prone to errors. Statistical rounding and biases in the census data really start to matter for some of the areas. In the past we have dealt with this using ["Surprise maps"](https://doodles.mountainmath.ca/blog/2017/04/10/surprise/) that make a model assumption, for example "no change in share of shelter-cost burdened households", and colour regions based on their (signed) statistical evidence against our model assumption. Areas with low household counts would provide little statistical evidence against our model assumption, similar to areas with little change but higher numbers.
But it can also be instructional to separate these two factors with a bivariate scale, where we fade out the colours as we lose confidence in our estimates.
```{r fig.height=10}
standard_error=15
plot_data <- data_da %>%
compute_sc() %>%
mutate(`2001 - 2006`=`2006`-`2001`,
`2006 - 2011`=`2011`-`2006`,
`2011 - 2016`=`2016`-`2011`,
`2001 - 2016`=`2016`-`2001`) %>%
gather(key="Year",value="Change",c("2001 - 2006","2006 - 2011","2011 - 2016","2001 - 2016"),factor_key = TRUE) %>%
left_join(data_da %>%
compute_sc() %>%
st_set_geometry(NULL) %>%
group_by(TongfenID) %>%
select_at(vars(starts_with("t_"))) %>%
rename_at(vars(starts_with("t_")),function(d)gsub("t_","20",d)) %>%
mutate(`2001 - 2006`=(standard_error/`2006`+standard_error/`2001`)/2,
`2006 - 2011`=(standard_error/`2011`+standard_error/`2006`)/2,
`2011 - 2016`=(standard_error/`2016`+standard_error/`2011`)/2,
`2001 - 2016`=(standard_error/`2016`+standard_error/`2001`)/2) %>%
select(-one_of(seq(2001,2016,5) %>% as.character)) %>%
gather(key="Year",value="Error",c("2001 - 2006","2006 - 2011","2011 - 2016","2001 - 2016"),factor_key = TRUE)
) %>%
select(TongfenID,Year,Change,Error) %>%
mutate(uncertainty=pmin(1,abs(Error)))
colors <- scales::colour_ramp(
colors=RColorBrewer::brewer.pal(3,"RdYlGn") %>% rev
#colors = c(red = "#AC202F", purple = "#740280", blue = "#2265A3")
)((0:7)/7)
clamp <- function(x,range){
pmin(x,range[2],pmax(x,range[1]))
}
ggplot(plot_data) +# %>% filter(Year=="2001 - 2016")) +
geom_sf(aes(fill=zip(Change,uncertainty)),size=0.01) +
bivariate_scale("fill",
pal_vsup(values = colors, max_desat = 0.8, pow_desat = 0.2, max_light = 0.7, pow_light = 1),
name = c("Percentage point change in shelter cost burdened", "expected error"),
limits = list(c(-0.3, 0.3), c(0,0.2)),
breaks = list(c(-0.25,-0.15,0,0.15,0.25), waiver()),
oob=clamp,
labels = list(scales::percent, scales::percent),
guide = "colourfan",
na.value = "grey"
) +
map_theme +
#theme(legend.position = "bottom") +
facet_wrap("Year") +
labs(title="Percentage point change in share spending 30% to 100% of income on shelter") +
theme(legend.position = 'bottom',
legend.key.size = grid::unit(1, "cm"),
legend.title.align = 0.5,
plot.margin = margin(5.5, 5.5, 10.5, 5.5)
)
```
The *expected error* is the expected percentage point error of the estimate due to statistical rounding. One can refine this by e.g. folding in non-return rates to estimate the uncertainty due to this, and one can tune the colour scale to the specific purpose of the map.
## Summary
In summary, with TongFen it's straight forward to compare census data at the CT or DA level across the four censuses 2001 through 2016 that we have made available via CensusMapper. Especially at the DA level we need to be mindful of limitations to census data as quirks in the data start to distort our results.
As usual, the full code for this post is [available on GitHub](https://github.com/mountainMath/doodles/blob/master/content/posts/2019-06-04-multi-census-tongfen.Rmarkdown) for those interested in replicating or adapt this for their own purposes.
You can’t perform that action at this time.