vignettes/input_data.Rmd

---
title: "Input data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Input data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
resource_files:
  bib/mapspamc.bib
bibliography: bib/mapspamc.bib
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

To create crop distribution maps with `mapspamc`, various types of data are required, including national and subnational agricultural statistics, cropland and irrigated area extent, and spatially explicit information on economic and bio-physical suitability. Users of `mapspamc` can use country-specific sources for each of these categories, such as national irrigation, cropland and crop suitability maps. However, apart from the subnational statistics, most of the data can be taken from global data products. To support easy implementation of the package, `mapspamc` is accompanied by a [public data repository](https://doi.org/10.5281/zenodo.7031917) (referred to as the mapspamc_db) that contains a large number of global maps that are publicly available. The table below shows the global data products that are included in the database. They are briefly discussed below.

```{r tab-data-sources, echo=FALSE, warning=FALSE, message=FALSE}
library(readxl)
library(kableExtra)
sources <- read_excel("tables_chart/tables.xlsx", sheet = "tab_data_sources")

sources %>%
  kable(., booktabs = TRUE, linesep = "", caption = "Data products in the mapspamc database")
```


## Crop statistics

The aim of `mapspamc` is to spatially allocate national and subnational information on harvested and physical crop area for four different production systems (subsistence, low-input rainfed, high-input rainfed and high-input irrigated). The use of subnational statistics is of key importance and greatly improves the quality of the gridded maps [@Joglekar2019]. Four pieces of information need to be collected by the user:

- **Harvested area statistics.** Data on crop level harvested area at administrative unit level 1 and, if possible, level 2. `mapspamc` is designed to handle missing information at the subnational level so any data is useful.

- **Subnational administrative unit map.**. It is essential to have a map (e.g. shapefile or any other polygon format) with the location of the subnational administrative units that corresponds with the subnational statistics. These two sources of information need to be perfectly consistent or made consistent.

- **Cropping intensity statistics.** Information on the cropping intensity (e.g. the number of crop rotations in case of multicropping) per crop at the national level and, if available, at the subnational 1 level. This information will be combined with the harvested area statistics to estimate the physical crop area at the national and subnational level. 

- **production system area shares.** Data on the area share for each crop and all four production systems, preferably at the national and, if available, subnational 1 level. This information will be combined with physical area estimation to calculate the physical crop area for each of the four production systems times crop combinations at the national and subnational level.

Finding subnational information on harvested area is not easy as they are not always collected and/or published by national statistical agencies and if they are available they often cover a selection of crops and might have many missing values. Cropping intensity and production system shares are probably even much more difficult to find and often requires making a lot of assumptions to fill data gaps (e.g. assuming the same cropping intensity for similar crops).

Some places where you might look for subnational statistics:

- National statistical agencies and agricultural research institutes, are the best places to find subnational agricultural statistics.

- [CountryStat](http://www.fao.org/economic/ess/countrystat) is database with Food and agriculture statistics at the subnational level. Its predecessor [Agro-Maps](http://kids.fao.org/agromaps/) with older data can also still be accessed. Coverage of both databases is, however, limited.

- Knowing in which regions crops are not produced is also very useful. By setting the harvested area to 0 for some regions the model will be forced to allocate the (national) statistics in other subnational units. 

- Agricultural trade statistics in FAOSTAT might give you an idea about the production system shares. If most crop production is exported, a large share of the farmers can probably be categorized as high-input (or irrigated) farming. 

- As mentioned above, the [AQUASTAT](http://www.fao.org/nr/water/aquastat/data/query/index.html?lang=en) database provides information on irrigated area shares.

Information on harvested area at the national level can be obtained from [FAOSTAT](http://www.fao.org/faostat/en/#home), which is also included in mapspamc_db. Data on irrigated crop area can be taken from [AQUASTAT](http://www.fao.org/nr/water/aquastat/data/query/index.html?lang=en).


## Cropland extent

To allocate the physical area statistics `mapspamc` requires a cropland extent, which shows the location of cropland in a country for the target year. There are several global cropland products and often countries produce a national land cover map, which shows the location of cropland. To account for the uncertainty in the location of cropland, @Fritz2011 combined several different products into one so-called synergy cropland map. 

The synergy cropland approach combines all available (global) cropland maps and creates a ranking for each grid cell that measures the level of agreement between the various input maps. In the `mapspamc` the grid cells with the least uncertainty are selected first when allocating physical area of individual crops. Apart from the ranking, the synergy cropland approach also prepares maps with the mean and maximum cropland area per grid cell. The mean area product is used as the base layer by `mapspamc` but in case this is not sufficient to allocate all the physical area, grid cells from the maximum cropland map can be used (see [Model preparation](model_preparation.html) for more information on how this is done).  

There are two options to obtain a synergy cropland map. The first is to take an existing product. If `mapspamc` is used to produce crop distribution maps for around 2010, the user can use an existing global product [@Lu2020] that was also used as input for SPAM2010. A second option is to construct a country specific synergy cropland extent. mapspamc_db includes several recent cropland products that can be combined using scripts in `mapspamc` to create  a country-specific synergy cropland map.  


## Irrigation

To allocate the irrigated crops, `mapspamc` needs an irrigated area extent. Similar to the synergy cropland map, we create a synergy irrigated area map that takes into account the uncertainties related to the location of irrigated areas. At present, there are only two global products that provide this information. the Global Map of Irrigated Areas (GMIA) [@Siebert2013], shows the areas that are equipped for irrigation based on national survey data and maps for 2005 at 5 arc minutes. The Global Irrigated Areas (GIA) map [@Meier2018] depicts actual irrigated areas around the period 2005. It combines normalized difference vegetation index (NDVI) maps, crop suitability data and information on the location of areas equipped for irrigation to create an irrigated area map at a resolution of 30 arc seconds. In comparison to the GMIA, it shows 18% more irrigated area globally.


## Biophysical suitability and potential yield

Spatially explicit information on biophysical suitability and potential yield is taken from @IIASA2012. The International Institute for Applied Systems Analysis (IIASA) in collaboration with FAO, developed the global agro-ecological zones (GAEZ) methodology that assesses the biophysical potential for a large number of crops across three production systems: low-input rainfed, high-input rainfed and high-input irrigated systems. The first class is used for both the subsistence and low-input system in `mapspamc`. GAEZ presents spatially explicit information on the biophysical suitability (on a scale from 1 to 100) and potential yield (in t/ha) separate for each production system.  mapspamc_db includes data for GAEZv3.^[Recently GAEZv4 was released, which we aim to add in an update of mapspamc_db].


## Accessibility

As a proxy for access to markets and quality of road infrastructure we used a global map with travel time to high-density urban centers. @Weiss2018 depicts the global road infrastructure in 2015 by presenting travel time to urban areas at a resolution of 1×1 kilometer.


## Rural population density

We used the WorldPop [@Tatem2017] database as a our primary source of information for national population density. WorldPop combines a random forest model with census data to generate a gridded prediction of population density at ~100 meter spatial resolution [@Stevens2015a]. Data is available for at a number of spatial resolutions and various years. We used the global maps at 1x1 kilometer as input for our analysis. To identify the rural population, we used the urban extent from @Schiavina2022. Rural population was selected by removing grid cells that are located within the urban extent. 


## Crop price

To calculate the potential revenue of a crop at the grid cell level, we followed the SPAM approach [@Wood-Sichra2016] and multiplied the potential yield from @IIASA2012 with crop prices from @FAO2019.  


## References