Skip to content

ianmoran11/tidyABS

Repository files navigation

Warning: tidyABS is in early development. Please use for experimentation only.

Overview

The tidyABS package converts ABS excel tables to tidy data frames. It uses information on cells’ formatting (indenting, bolding, italics) and some rules-of-thumb to detect the structure of excel tables. It sometimes needs pointers from the user.

Note: tidyABS works with .xlsx files only.

Installation

The tidyABS package is not available on CRAN. It can be installed from github with the following script:

# install.packages("devtools")
devtools::install_github("ianmoran11/tidyABS")

Usage

library(tidyABS)
library(tidyverse)

Below is a short demonstration that tidies a table from the Australian Industry publication (Cat No. 8155.0).

tidy_aus_industry_df <-
  tidyABS_example("australian-industry.xlsx") %>%
  process_sheet(sheets = "Table_1") %>%
  assemble_table_components()

tidy_aus_industry_df %>% str()
#> Classes 'tbl_df', 'tbl' and 'data.frame':    1668 obs. of  8 variables:
#>  $ row         : int  8 8 8 8 8 8 8 8 9 9 ...
#>  $ col         : int  2 3 4 5 6 7 8 9 2 3 ...
#>  $ comment     : chr  NA NA NA NA ...
#>  $ value       : chr  "485" "5843" "54410" "57577" ...
#>  $ col_group_01: chr  "Employment at end of June" "Wages and salaries" "Sales and service income" "Total income" ...
#>  $ col_group_02: chr  "'000" "$m" "$m" "$m" ...
#>  $ row_group_01: chr  "2006–07" "2006–07" "2006–07" "2006–07" ...
#>  $ row_group_02: chr  "AGRICULTURE, FORESTRY AND FISHING" "AGRICULTURE, FORESTRY AND FISHING" "AGRICULTURE, FORESTRY AND FISHING" "AGRICULTURE, FORESTRY AND FISHING" ...

Examples

The tidyABS package contains several example files. Use the helper tidyABS_example() function with no arguments to list these files:

tidyABS_example()
#> [1] "~$australian-industry.xlsx"                          
#> [2] "australian-industry.xlsx"                            
#> [3] "consumer-price-index.xlsx"                           
#> [4] "environmental-economic-accounts.xlsx"                
#> [5] "PhD_ subfield-citizenship-status-ethnicity-race.xlsx"
#> [6] "PhD_major-field.xlsx"

Example 1: Australian Industry

Above is the first sheet of an excel workbook in the Australian Industry publication. We can retrieve the path of this file using the tidyABS_example function:

ai_path <- tidyABS_example("australian-industry.xlsx")

To process the sheet above, we pass the workbook file path to the process_sheet function and identify the sheet we’d like to tidy.

ai_processed <- process_sheet(path = ai_path, sheets = "Table_1")

ai_processed %>% str(1)
#> List of 3
#>  $ col_groups:Classes 'tbl_df', 'tbl' and 'data.frame':  2 obs. of  10 variables:
#>  $ row_groups:Classes 'tbl_df', 'tbl' and 'data.frame':  2 obs. of  11 variables:
#>  $ tabledata :Classes 'tbl_df', 'tbl' and 'data.frame':  1668 obs. of  4 variables:

This produces a list of three data frames. They store the location and format information of row names (row_groups), column names (col_groups) and table data (tabledata).

We can inspect the row and column names in ai_processed using the inspect_table_components function.

inspect_table_components(ai_processed)
#> $col_groups
#> $col_groups[[1]]
#> [1] "Employment at end of June"                                 
#> [2] "Wages and salaries"                                        
#> [3] "Sales and service income"                                  
#> [4] "Total income"                                              
#> [5] "Total expenses"                                            
#> [6] "Operating profit before tax"                               
#> [7] "Earnings before interest tax depreciation and amortisation"
#> [8] "Industry value added"                                      
#> 
#> $col_groups[[2]]
#> [1] "'000" "$m"  
#> 
#> 
#> $row_groups
#> $row_groups[[1]]
#>  [1] "2006–07" "2007–08" "2008–09" "2009–10" "2010–11" "2011–12" "2012–13"
#>  [8] "2013–14" "2014–15" "2015–16" "2016–17"
#> 
#> $row_groups[[2]]
#>  [1] "AGRICULTURE, FORESTRY AND FISHING"              
#>  [2] "MINING"                                         
#>  [3] "MANUFACTURING"                                  
#>  [4] "ELECTRICITY, GAS, WATER AND WASTE SERVICES"     
#>  [5] "CONSTRUCTION"                                   
#>  [6] "WHOLESALE TRADE"                                
#>  [7] "RETAIL TRADE"                                   
#>  [8] "ACCOMMODATION AND FOOD SERVICES"                
#>  [9] "TRANSPORT, POSTAL AND WAREHOUSING"              
#> [10] "INFORMATION MEDIA AND TELECOMMUNICATIONS"       
#> [11] "RENTAL, HIRING AND REAL ESTATE SERVICES"        
#> [12] "PROFESSIONAL, SCIENTIFIC AND TECHNICAL SERVICES"
#> [13] "ADMINISTRATIVE AND SUPPORT SERVICES"            
#> [14] "PUBLIC ADMINISTRATION AND SAFETY (PRIVATE)"     
#> [15] "EDUCATION AND TRAINING (PRIVATE)"               
#> [16] "HEALTH CARE AND SOCIAL ASSISTANCE (PRIVATE)"    
#> [17] "ARTS AND RECREATION SERVICES"                   
#> [18] "OTHER SERVICES"                                 
#> [19] "TOTAL SELECTED INDUSTRIES"                      
#> 
#> 
#> $tabledata
#> list()

We can use plot_table_components to visualize how these groups are spatially layed out in the spreadsheet and see how tidyABS will relate them to table values. Row names directly to the left of their data points should be labelled “W”, and column names directly above should be labelled “N”. (See the unpivotr package for more information.)

plot_table_components(ai_processed) +
  ylim(-30, 0)

Finally, we can assemble the components into a tidy data frame using assemble_table_components.

assemble_table_components(ai_processed) %>%
  glimpse()
#> Observations: 1,668
#> Variables: 8
#> $ row          <int> 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 1...
#> $ col          <int> 2, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4, 5, 6, 7, 8, 9, 2...
#> $ comment      <chr> NA, NA, NA, NA, NA, "estimate has a relative stan...
#> $ value        <chr> "485", "5843", "54410", "57577", "52046", "5461",...
#> $ col_group_01 <chr> "Employment at end of June", "Wages and salaries"...
#> $ col_group_02 <chr> "'000", "$m", "$m", "$m", "$m", "$m", "$m", "$m",...
#> $ row_group_01 <chr> "2006–07", "2006–07", "2006–07", "2006–07", "2006...
#> $ row_group_02 <chr> "AGRICULTURE, FORESTRY AND FISHING", "AGRICULTURE...

Example 2: Environmental-Economic Accounts

Here’s an example that requires some manual work, the Environmental-Economic Accounts. Let’s retrieve the path of our example workbook and proces Table 6.1:

eea_path <- tidyABS_example("environmental-economic-accounts.xlsx")

eea_processed <- process_sheet(path = eea_path, sheets = "Table 6.1")

On visual inspection, we can see row_group_01 has been given a “W” orientation to the data, not “WNW”.

plot_table_components(eea_processed)

Luckily, we can fix this with the change_direction function.

eea_processed <-
  eea_processed %>%
  change_direction("row_group_01", "WNW")

Plotting the table confirms the direction has been corrected.

plot_table_components(eea_processed)

Finally, we can assemble the components into a tidy dataframe using assemble_table_components.

assemble_table_components(eea_processed) %>%
  glimpse()
#> Observations: 156
#> Variables: 8
#> $ row          <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8...
#> $ col          <int> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 2, 3,...
#> $ comment      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
#> $ value        <chr> "13532", "14352", "14075", "14658", "15085", "155...
#> $ col_group_01 <chr> "2003–04", "2004–05", "2005–06", "2006–07", "2007...
#> $ row_group_01 <chr> "Energy Taxes ", "Energy Taxes ", "Energy Taxes "...
#> $ row_group_02 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
#> $ row_group_03 <chr> "Crude oil and LPG", "Crude oil and LPG", "Crude ...

Example 3: Consumer Price Index (time series)

Time series data require the user to manually identify the inner table cells. This is because some of the column names are numeric — for example, collection month.

I recommend using the readABS package for this. It was created for importing ABS time series data and does not require manual identifcation of table cells.

That said, here’s how you would process this table with tidyABS.

cpi_path <- tidyABS_example("consumer-price-index.xlsx")

We need to identify the inner table cells using the manual_value_references argument.This argument takes a vector of addresses, identifying the inner corners of the table.

cpi_processed <-
  process_sheet(
    path = cpi_path, sheets = "Data1",
    manual_value_references = "B11:AB292"
  )

Here is the resulting data frame.

assemble_table_components(cpi_processed) %>%
  glimpse()
#> Observations: 7,185
#> Variables: 15
#> $ row          <int> 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 1...
#> $ col          <int> 2, 3, 4, 5, 6, 7, 9, 10, 2, 3, 4, 5, 6, 7, 9, 10,...
#> $ comment      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
#> $ value        <chr> "3.7", "3.8", "3.7", "3.8", "3.7", "3.8", "3.9", ...
#> $ col_group_01 <chr> "Index Numbers ;  All groups CPI ;  Sydney ;", "I...
#> $ col_group_02 <chr> "Index Numbers", "Index Numbers", "Index Numbers"...
#> $ col_group_03 <chr> "Original", "Original", "Original", "Original", "...
#> $ col_group_04 <chr> "INDEX", "INDEX", "INDEX", "INDEX", "INDEX", "IND...
#> $ col_group_05 <chr> "Quarter", "Quarter", "Quarter", "Quarter", "Quar...
#> $ col_group_06 <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3",...
#> $ col_group_07 <chr> "1948-09-01", "1948-09-01", "1948-09-01", "1948-0...
#> $ col_group_08 <chr> "2018-12-01", "2018-12-01", "2018-12-01", "2018-1...
#> $ col_group_09 <chr> "282", "282", "282", "282", "282", "282", "282", ...
#> $ col_group_10 <chr> "A2325806K", "A2325811C", "A2325816R", "A2325821J...
#> $ row_group_01 <chr> "1948-09-01", "1948-09-01", "1948-09-01", "1948-0...

About

No description, website, or topics provided.

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published