Repository for the FAO-OECD fishery and aquaculture employment data imputation tool.

pamdx/FM_imputation

Getting started

Prerequisites

To run the tool, you need the following installed on your computer:

  • A recent R installation (version 4.1.0 was used to build the tool)
  • A recent RStudio installation (version 1.4.1103 was used to build the tool)
  • The following R packages installed: dplyr, ggplot2, readr, tidyr, tibble, compareDF, stargazer, gridExtra, rmarkdown, Rilostat, OECD. You can install these packages by running the R code below:

```r
install.packages(
  c("dplyr", "ggplot2", "readr", "tidyr", "tibble", "compareDF", "stargazer", "gridExtra", "rmarkdown", "Rilostat", "OECD")
)
```
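If some of these packages are already installed, it may be quicker to install only the missing ones. The sketch below is one way to list them; `missing_packages` is a hypothetical helper written for this illustration and is not part of the tool.

```r
# Hypothetical helper (not part of the tool): returns the packages in `pkgs`
# that cannot be loaded from the current R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

pkgs <- c("dplyr", "ggplot2", "readr", "tidyr", "tibble", "compareDF",
          "stargazer", "gridExtra", "rmarkdown", "Rilostat", "OECD")
to_install <- missing_packages(pkgs)
print(to_install)
# Then, if the vector is not empty, run: install.packages(to_install)
```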

Installation

On FAO computers:

  1. Extract the contents of the compressed folder (named "emputator") to the destination of your choice (e.g. on your Desktop).
  2. Double-click the "emputator.Rproj" file in the main folder.
  3. The tool will open in RStudio.

On OECD computers:

  1. Extract the contents of the compressed folder (named "emputator") to the destination of your choice (e.g. on your Desktop).
  2. Open RStudio using the link \main.oecd.org\EM_Apps\R\RStudio.cmd
  3. Open the “emputator.Rproj” file from the main folder.
  4. The tool will open in RStudio.

Contents of the extracted folder

The decompressed folder contains the following files and sub-folders:

[Image: files]

The root folder includes the files that run the imputation tool:

| File | Type | Description |
|---|---|---|
| .gitignore | GITIGNORE file | Tells Git (a version control system) which files or folders to ignore in a project. |
| .RData | R workspace file | Used by R to save a session’s environment (the objects in memory). |
| .RHistory | RHISTORY file | Contains the history of commands entered by the user during an open R session. |
| emputator.Rproj | R project file | Opens the tool in RStudio. |
| main.R | R script | Main script from which the tool is run. |
| README.md | Markdown file | Manual for the companion GitHub site. |

The inputs, modules and outputs sub-folders and their contents are described throughout the rest of this manual.

How to use the imputation tool

With the emputator project open in RStudio, open the main.R script by clicking on the file in the "Files" tab of the lower-right panel in RStudio:

[Image: openmainscript]

Loading basic packages, functions and data

First, in the main.R script, run the code block below by selecting it and pressing CTRL+ENTER. It will load the packages, functions and data necessary for the tool to run properly.

```r
rm(list=ls()) # clear R environment

# Load packages

pkgs <- c("dplyr", "ggplot2", "readr", "tidyr", "tibble", "compareDF", "stargazer", "gridExtra", "rmarkdown")
lapply(pkgs, require, character.only = TRUE) # One-liner to load all packages

# Load app functions

source("./modules/functions.R")

# Load data from inputs folder

source("./modules/data_import.R")
```

Contents of the Inputs sub-folder

This folder contains the data necessary for the imputation tool. Its contents are imported by running the source("./modules/data_import.R") command above.

| File | Type | Description |
|---|---|---|
| FM_DB.rds | R data file | Contains the up-to-date FAO-OECD employment database on which to perform the imputation. Can be converted from a CSV file stored in the same subfolder by running the inputs_update.R module, or can be the output of another R project (e.g. consolidation of data reported by countries). |
| ILO_labor.rds | R data file | Contains the ILO labor force database to be used in linear models. Retrieved from the ILO's servers with the Rilostat package by running the inputs_update.R module. |
| OECD_fleet.rds | R data file | Contains the OECD fleet database to be used in linear models. Retrieved from the OECD's servers with the OECD package by running the inputs_update.R module. |
| PROD.rds | R data file | Contains the FAO capture and aquaculture production database to be used in linear models and productivity computations. Retrieved from FAO's servers with a custom function by running the inputs_update.R module. |

Please note that the FM_DB.rds file should have the following structure:

| Column | Type | Accepted values |
|---|---|---|
| geographic_area | character | Values listed in the "Name_En" column from this FAO country reference |
| OC2 | character | "Aquaculture"<br>"Inland fishing"<br>"Marine fishing"<br>"Subsistence"<br>"Unspecified"<br>"Processing" |
| OC3 | character | "Aquaculture"<br>"Inland Waters Fishing"<br>"Marine Coastal Fishing"<br>"Marine Deep-Sea Fishing"<br>"Marine Fishing, nei"<br>"Subsistence"<br>"Unspecified"<br>"Processing" |
| working_time | character | "Full time"<br>"Part time"<br>"Occasional"<br>"Status Unspecified" |
| sex | character | "M"<br>"F"<br>"U" |
| year | integer | Any year between 1950 and the current year |
| value | integer | Any positive integer, or blank if accompanied by an "M" or "Q" flag |
| flag | character | (Blank) = Official figure<br>"B" = Break in time series<br>"E" = FAO estimate<br>"I" = Estimate from the reporting country<br>"M" = Missing value (data cannot exist, not applicable)<br>"P" = Provisional data<br>"Q" = Confidential data<br>"T" = Data reported by non-official or semi-official sources |
| comment | character | Blank or any text providing background on the entry |
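Before replacing FM_DB.rds with a new extract, it can help to check the data against the structure described above. The function below is a minimal sketch of such a check. `check_fm_db` is a hypothetical helper that does not exist in the tool, and it does not verify geographic_area because the FAO country reference is not reproduced here.

```r
# Hypothetical validator (not part of the tool): returns a character vector
# describing every departure from the expected FM_DB structure.
check_fm_db <- function(df) {
  expected <- c("geographic_area", "OC2", "OC3", "working_time", "sex",
                "year", "value", "flag", "comment")
  missing_cols <- setdiff(expected, names(df))
  if (length(missing_cols) > 0) {
    return(paste("Missing column:", missing_cols))
  }
  allowed <- list(
    OC2 = c("Aquaculture", "Inland fishing", "Marine fishing",
            "Subsistence", "Unspecified", "Processing"),
    OC3 = c("Aquaculture", "Inland Waters Fishing", "Marine Coastal Fishing",
            "Marine Deep-Sea Fishing", "Marine Fishing, nei",
            "Subsistence", "Unspecified", "Processing"),
    working_time = c("Full time", "Part time", "Occasional", "Status Unspecified"),
    sex = c("M", "F", "U"),
    flag = c(NA, "", "B", "E", "I", "M", "P", "Q", "T") # NA or "" = official figure
  )
  problems <- character(0)
  for (col in names(allowed)) {
    bad <- setdiff(unique(df[[col]]), allowed[[col]])
    if (length(bad) > 0) {
      problems <- c(problems, paste0("Unexpected value in ", col, ": ", bad))
    }
  }
  this_year <- as.integer(format(Sys.Date(), "%Y"))
  if (any(is.na(df$year) | df$year < 1950 | df$year > this_year)) {
    problems <- c(problems, "year must lie between 1950 and the current year")
  }
  if (any(df$value <= 0, na.rm = TRUE)) {
    problems <- c(problems, "value must be a positive integer")
  }
  if (any(is.na(df$value) & !(df$flag %in% c("M", "Q")))) {
    problems <- c(problems, "blank value without an \"M\" or \"Q\" flag")
  }
  problems
}
```

An empty result means no problem was found. Assuming FM_raw holds the imported database, a call would look like `check_fm_db(FM_raw)`.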

Setting filtering parameters and generating a chart of existing data

The next step is to tell the tool which country, sector and period need to be imputed. Running the code below opens two popup windows in succession: the first asks the user to choose a country, the second a sector. To change the target period, the user can directly modify the start_year and end_year values in the code below.

```r
# Main filtering parameters

country_input <- select_country(FM_raw, country_input)
OC2_input <- select_sector(FM_raw, country_input)
start_year <- 1995
end_year <- 2020

# Subseries-related analyses and visualization of existing data

source("./modules/subseries_analysis.R")
```

A "rainbow" bar chart like the one below will be displayed in the "Plots" tab of the lower-right panel of RStudio.

[Image: current_estimates]

Overview of the available imputation methods

At this stage, it is important for the user to be familiar with the imputation methods offered by the application. Below is a short description of each of the available methods.

Linear regression

Linear regression in this tool is implemented with the lm() function from base R.

In automatic regression mode, the following linear models are fitted to the data, and the model with the highest adjusted R-squared is selected to generate the estimates.

| Model | Specification 1 | Specification 2 |
|---|---|---|
| 1 | `emp_value ~ trend + prod_value + labor_value` | `emp_value ~ trend + prod_value + labor_value + fleet_value` |
| 2 | `emp_value ~ prod_value + labor_value` | `emp_value ~ prod_value + labor_value + fleet_value` |
| 3 | `emp_value ~ trend + prod_value` | `emp_value ~ trend + prod_value` |
| 4 | `emp_value ~ prod_value` | `emp_value ~ prod_value` |

where

  • Specification 2 is used for imputation of fishery employment in OECD countries only (the fleet data is currently only available for OECD countries).
  • Specification 1 is used in all other cases.
  • "trend" is the sequence of the years composing the time series being imputed. Including a trend variable may increase the linear fit of time series that exhibit a clear upward or downward trend over time.
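The automatic selection described above can be sketched as follows for Specification 1. This is an illustration on simulated data, not the tool's own code: the data frame `d`, its values and the object names are assumptions made for the example.

```r
set.seed(1)

# Simulated inputs (hypothetical values); NA marks the years to impute
n <- 26
d <- data.frame(
  trend       = seq_len(n),
  prod_value  = runif(n, 1000, 5000),
  labor_value = runif(n, 1e5, 2e5)
)
d$emp_value <- 500 + 3 * d$trend + 0.02 * d$prod_value +
  0.001 * d$labor_value + rnorm(n, sd = 5)
d$emp_value[c(3, 10, 20)] <- NA

# The four candidate models of Specification 1
models <- list(
  emp_value ~ trend + prod_value + labor_value,
  emp_value ~ prod_value + labor_value,
  emp_value ~ trend + prod_value,
  emp_value ~ prod_value
)

# Fit each model (lm() drops rows with a missing emp_value by default)
fits <- lapply(models, lm, data = d)

# Keep the model with the highest adjusted R-squared
adj_r2 <- vapply(fits, function(f) summary(f)$adj.r.squared, numeric(1))
best <- fits[[which.max(adj_r2)]]

# Estimates for the years without employment data
estimates <- predict(best, newdata = d[is.na(d$emp_value), ])
print(round(adj_r2, 3))
print(estimates)
```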

In manual regression mode, the user specifies the linear model to be fitted to the data. A tilde (~) should be used to separate the dependent variable from the independent variable(s), and multiple independent variables should be separated by a plus (+) sign. The dependent variable should always be "emp_value", while the independent variable(s) can be chosen among "trend", "prod_value", "labor_value" and "fleet_value".
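As an illustration, a manual specification can be checked against these rules before it is used. The check below is only a sketch: `allowed_predictors`, `response` and `predictors` are names introduced for this example, and the tool is not known to perform such a check itself.

```r
# Example manual specification: employment explained by trend and production
reg_dynamic <- emp_value ~ trend + prod_value

# Variables the manual mode accepts on the right-hand side of the tilde
allowed_predictors <- c("trend", "prod_value", "labor_value", "fleet_value")

response   <- all.vars(reg_dynamic[[2]]) # left-hand side of the tilde
predictors <- all.vars(reg_dynamic[[3]]) # right-hand side of the tilde

# Stops with an error if the formula breaks the rules above
stopifnot(
  identical(response, "emp_value"),
  all(predictors %in% allowed_predictors)
)
```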

Please note that the linear regression method is not available in subseries imputation mode. This is to avoid fitting a linear model with only a subset of employment as the dependent variable, whereas the independent variables represent the entirety of the production, fleet or labour force.

Polynomial trends

The polynomial trend estimates are generated by polynomial regression, implemented with the lm() and poly() functions from the stats package that ships with R. In these regressions, the employment value is the dependent variable and the year is the independent variable. Regressions are automatically fitted for first, second, third and fourth degree polynomials, which amounts to fitting a linear, quadratic, cubic and quartic trend to the employment data.

The trend with the highest adjusted R-squared (the one that fits the data most closely) is automatically selected to generate the imputed data.
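The sketch below illustrates this selection on simulated data. The series `emp`, the position of the missing years and the object names are assumptions made for the example and do not come from the tool.

```r
set.seed(42)

# Simulated employment series (hypothetical values); NA marks years to impute
years <- 1995:2020
emp <- 1000 + 15 * (years - 1995) - 0.8 * (years - 1995)^2 +
  rnorm(length(years), sd = 10)
emp[c(5, 6, 18)] <- NA
d <- data.frame(year = years, emp_value = emp)

# Fit polynomial trends of degree 1 to 4
fits <- lapply(1:4, function(deg) lm(emp_value ~ poly(year, deg), data = d))

# Select the degree with the highest adjusted R-squared
adj_r2 <- vapply(fits, function(f) summary(f)$adj.r.squared, numeric(1))
best_degree <- which.max(adj_r2)

# Impute the missing years from the selected trend
estimates <- predict(fits[[best_degree]], newdata = d[is.na(d$emp_value), ])
print(best_degree)
print(estimates)
```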

Linear interpolation

Linear interpolation estimates are generated with the following function:

$$y = y_0 + (x - x_0)\,\frac{y_1 - y_0}{x_1 - x_0}$$

where

  • y is the employment value to estimate
  • x is the year associated with the value to estimate
  • x0 is the reference year (i.e. with official data) on the left side of x
  • x1 is the reference year on the right side of x
  • y0 is the official employment value associated with x0
  • y1 is the official employment value associated with x1
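For illustration, the standard linear interpolation formula can be written directly in R with the symbols defined above. `interpolate_linear` is a hypothetical name and the employment figures are invented. Base R's approx() function gives the same result for several years at once.

```r
# Hypothetical helper: linear interpolation between two reference years
interpolate_linear <- function(x, x0, x1, y0, y1) {
  y0 + (x - x0) * (y1 - y0) / (x1 - x0)
}

# Official values of 1000 in 2010 and 1500 in 2015; estimate for 2012
interpolate_linear(x = 2012, x0 = 2010, x1 = 2015, y0 = 1000, y1 = 1500)

# Same calculation for every year in the gap, using approx() from base R
gap <- approx(x = c(2010, 2015), y = c(1000, 1500), xout = 2011:2014)$y
print(gap)
```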

Historical average

Historical average estimates of employment are the mean of the x previous consecutive years of official data, where x is a positive integer defined by the user (by default, 5). Note that these estimates will not be generated if fewer than x previous consecutive years of official data are available.
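A minimal sketch of this rule is given below. It assumes that the x years must immediately precede the year being estimated; `hist_average` is a hypothetical function and the figures are invented.

```r
# Hypothetical helper: mean of the x values immediately before position
# `target`, or NA when fewer than x consecutive official values are available.
hist_average <- function(values, target, x = 5) {
  if (target <= x) return(NA_real_)
  window <- values[(target - x):(target - 1)]
  if (anyNA(window)) return(NA_real_)
  mean(window)
}

emp <- c(100, 110, 120, 130, 140, NA, NA) # NA = no official data

hist_average(emp, target = 6) # mean of the five official values
hist_average(emp, target = 7) # NA: the window includes a missing year
```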

Historical growth

Historical growth estimates of employment are an extrapolation based on the compound annual growth rate (CAGR) of the x previous consecutive years of official data, where x is a positive integer defined by the user (by default, 5). Note that these estimates will not be generated if fewer than x previous consecutive years of official data are available.
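The sketch below shows one way to compute such an estimate. It assumes the usual CAGR definition, (last / first)^(1 / (x - 1)) - 1 over a window of x years, and applies that rate once to the last official value. The tool's exact formula may differ, and `hist_growth` is a hypothetical function.

```r
# Hypothetical helper: extrapolates the value at position `target` from the
# CAGR of the x values immediately before it (NA if the window is incomplete).
hist_growth <- function(values, target, x = 5) {
  stopifnot(x >= 2)
  if (target <= x) return(NA_real_)
  window <- values[(target - x):(target - 1)]
  if (anyNA(window)) return(NA_real_)
  cagr <- (window[x] / window[1])^(1 / (x - 1)) - 1
  window[x] * (1 + cagr)
}

# Invented series growing by 10% per year
emp <- c(100, 110, 121, 133.1, 146.41, NA)
hist_growth(emp, target = 6)
```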

Backward dragging

As their name suggests, backward dragged estimates impute missing employment data with the closest official value from a later year. This method is most commonly applied to the years at the beginning of the time series.

Forward dragging

As their name suggests, forward dragged estimates impute missing employment data with the closest official value from an earlier year. This method is most commonly applied to the years at the end of the time series.
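Both dragging methods can be sketched in a few lines of base R. `drag_forward` and `drag_backward` are hypothetical names used for illustration, and the series is invented.

```r
# Hypothetical helper: fill each missing value with the closest earlier value
drag_forward <- function(values) {
  for (i in seq_along(values)[-1]) {
    if (is.na(values[i])) values[i] <- values[i - 1]
  }
  values
}

# Hypothetical helper: fill each missing value with the closest later value
drag_backward <- function(values) {
  rev(drag_forward(rev(values)))
}

emp <- c(NA, NA, 200, 210, NA, 230, NA) # NA = no official data

drag_backward(emp) # fills the start of the series from later years
drag_forward(emp)  # fills the end of the series from earlier years
```

Values with no official figure on the relevant side stay missing, which is why each method leaves one end of the example series unfilled.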

Setting imputation parameters

Next, the user can change the code below to modify how estimates are calculated. Regardless of whether the code is modified, it needs to be run, otherwise the imputation scripts will fail.

```r
#### Estimation parameters ####

  # Linear regression

share_valid_reg <- 0.5 # Proportion of years with official data necessary to run regression (to avoid generating estimates from too little information)
obs_threshold_linreg <- round(length(years_all) * share_valid_reg) # Do not modify
reg_type <- 1 # Regression type 1 = automatic (runs predetermined models and selects the one with best fit), 2 = manual (see below)
trend <- seq(start_year:end_year) # Do not modify
reg_dynamic <- emp_value ~ prod_value + labor_value # Specify manual regression by choosing independent variables from: trend, prod_value, labor_value, fleet_value (separated by "+")
fit_threshold_reg <- 0.8 # R-squared threshold for the regression to be taken into consideration

  # Polynomial trends

share_valid_trend <- 0.5 # Proportion of years with official data necessary to run regression (to avoid generating estimates from too little information)
obs_threshold_trend <- round(length(years_all) * share_valid_trend) # Do not modify
fit_threshold_trend <- 0.8 # R-squared threshold for the regression to be taken into consideration

  # Historical growth/average

histavg_threshold <- 5 # Number of previous years on which to base historical average estimates
histgrowth_threshold <- 5 # Number of previous years on which to base historical growth estimates
```
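To see what the derived values look like, the short example below evaluates them for the default period. It assumes that years_all, which is created elsewhere in the tool, holds the years from start_year to end_year.

```r
start_year <- 1995
end_year <- 2020
years_all <- start_year:end_year # assumption: defined this way by the tool

share_valid_reg <- 0.5
obs_threshold_linreg <- round(length(years_all) * share_valid_reg)
obs_threshold_linreg # 13: at least 13 of the 26 years need official data

trend <- seq(start_year:end_year)
range(trend) # 1 26: the trend is a position index, not the calendar year
```

Raising share_valid_reg makes the regression methods stricter. With 0.75, for example, the threshold for the same period becomes round(26 * 0.75) = 20 years.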

Choosing the imputation mode

There are two ways to perform the imputation of missing values:

  • by generating aggregated imputed values for years with no official data, which are then disaggregated based on the weights of subseries for years with official data ("aggregated imputation", most convenient and suitable for most cases)
  • by imputing one subseries at a time ("subseries imputation", more suitable in cases where only some subseries need to be estimated for a given year)
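The disaggregation step of the aggregated mode can be illustrated with a small calculation. The sketch assumes that the weights are the shares of each subseries in a single reference year with official data. How the tool picks that year is not described here, and all names and figures below are invented.

```r
# Official subseries values in a reference year (hypothetical figures)
official <- c(
  "Full time / M" = 600,
  "Full time / F" = 150,
  "Part time / M" = 200,
  "Part time / F" = 50
)

# Weight of each subseries in the reference-year total
weights <- official / sum(official)

# Aggregated estimate for a year without official data (hypothetical)
aggregate_estimate <- 1100

# Split the aggregate across subseries according to the weights
subseries_estimates <- aggregate_estimate * weights
print(subseries_estimates)
```

By construction, the subseries estimates add up to the aggregated estimate.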

In both imputation modes, it is possible to replace existing estimates with new, different estimates. Importantly, it is impossible to modify official data with the application.

Aggregated imputation

To run the aggregated imputation, first execute the code below.

```r
# Aggregated imputation

source("./modules/processing_aggregated.R")
```

This will create visualizations of the results of each imputation method available for the time series at hand:

  • charts of the covariates available for the linear regression: [Image: covariates]
  • a chart of the best-fitting linear model: [Image: linear_fit]
  • a chart of the best-fitting polynomial trend: [Image: polynomial_fit]
  • a "rainbow" chart for each available imputation method (below is an example for the linear interpolation results): [Image: linearint_results]

Then, to launch the imputation prompt, run the code below.

```r
agg_imputation_type <- 1 # Select 1 to apply estimation to all missing consecutive years, select 2 to apply estimation separately to each year in the period with consecutive missing years.

source("./modules/imputation_aggregated.R")
```

Note that you can choose between two types of aggregated imputations by setting the agg_imputation_type variable:

  1. The user can apply the same estimation method to consecutive years without official data by using agg_imputation_type <- 1 (most convenient and suitable for most cases). In this case, the imputation prompt will ask which method to apply to all consecutive years, as shown below.

[Image: agg_imp_cons]

  2. The user can decide to treat consecutive years without official data differently and apply a different estimation method to each of them by using agg_imputation_type <- 2 (more rarely used). In this case, the imputation prompt will ask which imputation method to use for each consecutive year, as shown below.

[Image: agg_imp_yby]

Subseries imputation

To run the subseries imputation, execute the code below. Note that if you run the subseries imputation after the aggregated imputation, the results of the aggregated imputation will be replaced by those of the subseries imputation.

```r
# Subseries imputation

source("./modules/imputation_subseries.R")
```

A first prompt will ask you to select the subseries you want to impute. Select one subseries and click OK.

[Image: subs_imp_1]

A second prompt will ask you which imputation method should be applied to the selected subseries. Note that you can also choose to remove existing estimates. Select one method and click OK.

[Image: subs_imp_2]

Finally, a third prompt will ask you which years should be imputed. You can select multiple years by holding CTRL or SHIFT. Click OK to confirm your selection.

[Image: subs_imp_3]

A "rainbow" chart of the current state of subseries imputation will be displayed in RStudio. The first prompt will reappear in case you want to continue the process with other subseries. If you are done with the imputation, select "Stop imputation" and click OK.

Exporting the imputed data and imputation report

To export the imputed data (CSV) and the imputation report (HTML) to the outputs folder, run the following code block in the main.R script.

```r
# Final data export and report generation

source("./modules/final_data_export_viz.R")
```

Contents of the Outputs sub-folder

The folder contains the imputation results by country and sector. For each country/sector processed by the tool, an HTML report and a CSV file of the imputed data are saved here.

| File | Type | Description |
|---|---|---|
| [country]_[sector]_report.html | HTML document | Summary of the imputation process and results. |
| [country]_[sector]_imputed.csv | CSV file | Imputed time series. |

Appendix: contents of the Modules sub-folder

This folder contains the R scripts that are necessary for the imputation tool to perform its computations and produce the desired outputs.

| File | Type | Description |
|---|---|---|
| data_import.R | R script | Imports data from the inputs folder into the R environment. |
| final_data_export_viz.R | R script | Creates the HTML imputation report and CSV file of the imputed data and saves them in the outputs folder. |
| functions.R | R script | Includes all the functions that the tool uses to generate imputed data, charts, tables, etc. |
| imputation_aggregated.R | R script | Runs the imputation in aggregated mode. |
| imputation_subseries.R | R script | Runs the imputation by subseries. |
| inputs_update.R | R script | Updates the data located in the inputs folder: online databases from FAO, the OECD and ILO are queried to retrieve the latest production, fleet and labour force data. |
| processing_aggregated.R | R script | Generates all the objects necessary to run the imputation in aggregated mode: tables with imputed data and their associated "rainbow" bar charts. |
| report.Rmd | R Markdown file | Generates the HTML imputation report. |
| subseries_analysis.R | R script | Performs a series of basic computations on the data: visualizes the existing estimates, identifies years with missing data, computes the weight of each subseries for each year, etc. |
