# Data acquisition
This *Jupyter* notebook comprises the core and auxiliary data downloads required for my master's thesis. All downloads are stored locally in a directory tree called ***data*** within the notebook's root directory. Make sure that you have enough free disk space, because the datasets are large, and a fast internet connection so that the files download in reasonable time. The notebook is organized top-down, so execute the code cells in that order. Executing cells in an arbitrary order leads to exceptions, because later cells depend on names defined in earlier ones. Moreover, the notebook does not check whether the data has already been downloaded. Hence, executing the download cells a second time re-downloads the entire datasets.

<font style="color:red;">UPDATE TOPIC</font>
My master's thesis focuses on deforestation and its drivers at global, continental and local scale in the tropical zone between 2001 and 2010. Therefore, the downloads are filtered to the extent of this zone, which covers the latitudes between approximately 23.43&deg; N and 23.43&deg; S (WGS84). Furthermore, the data acquisition focuses on datasets from this time period. This notebook comprises the following sections:

[**Preparation**](#Preparation) contains all required initial steps, such as importing standard library modules and constructing the directory tree for storing the downloaded datasets. The code cells of this section are fundamental: if you skip them, the subsequent code cells will fail.

[**Core data**](#Core-data) encompasses the code cells for downloading and filtering the core datasets used to determine the drivers of deforestation in the tropics. These datasets are raster files of type GeoTIFF (file extension \*.tif), which are stored in sub-folders of the directory **core** as described in [Section Preparation](#Preparation). As mentioned in the first paragraph, you must execute the code cells in the provided top-down order; otherwise the execution fails and crucial data will be missing in the subsequent chapters.

[**Auxiliary data**](#Auxiliary-data) <font color='red'>**Not done yet**</font> This section should contain extra data downloads such as satellite images for NDVI computation, country masks, tile masks etc.

## Preparation
As the first step we must import all necessary *Python* standard library modules as well as the other packages needed to accomplish the data download and filtering successfully. In detail, the following are required:
- ***matplotlib*** an IPython built-in magic command that enables the inline matplotlib backend for notebooks
- ***re*** a module for applying regular expressions
- ***os*** a module for using operating system dependent functionality
- ***zipfile*** a module that provides tools for ZIP archives
- ***threading*** a module that provides a multi-threading API
- ***urllib.request*** a module for opening and downloading URLs
- ***geopandas*** a powerful and feature-rich package for manipulating geo-vector files like shp, geojson etc., wrapped around pandas and GDAL
- ***IPython.display.clear_output*** a function that clears the output of a code cell
- ***collections.namedtuple*** a factory function for creating tuple-like objects whose fields are accessible by attribute lookup as well as being indexable and iterable

In [3]:
%matplotlib inline
import os
import re
import zipfile
import threading
import urllib.request
import geopandas as gpd
from IPython.display import clear_output
from collections import namedtuple

<font style="color:red;">Short explanation of folder content</font>
Finally, the following code cell creates the ***data*** directory tree inside the root folder.
- **data**
    - **core** the entire data from [Section 1.2](#Core-data)
        - **gfc** data from [Section 1.2.1](#Global-Forest-Change)
        - **gl30** data from [Section 1.2.2](#GlobalLand30)
            - **gl_00** gl30 from 2000
            - **gl_10** gl30 from 2010
        - **gc** data from [Section 1.2.3](#GlobCover)
    - **auxiliary** the entire data from [Section 1.3](#Auxiliary-data)
        - **masks** data from [Section 1.2.2](#GlobalLand30)

In [4]:
Directories = namedtuple('Directories', 'root core gfc gl30 esvd auxiliary masks proc')

# directories to create
directories_data = """data data.core data.core.gfc data.core.gl30 data.core.esvd data.auxiliary 
                      data.auxiliary.masks data.proc"""

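# OS compatibility: replace "." with the OS-dependent path separator and store the paths in the namedtuple Directories
# (str.replace is used because a backslash separator is not a valid re.sub replacement string on Windows)
dirs = Directories(*directories_data.replace('.', os.sep).split())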

# make directories according to values assigned to namedtuple Directories
for directory in dirs:
    try:
        os.mkdir(directory)
        print('Created:\t{}'.format(directory))
    except OSError as e:
        print('Error:\t{} {}'.format(directory, e.strerror))

Error:	data File exists
Error:	data/core File exists
Error:	data/core/gfc File exists
Error:	data/core/gl30 File exists
Error:	data/core/gl30/gl_00 File exists
Error:	data/core/gl30/gl_10 File exists
Error:	data/core/esvd File exists
Error:	data/auxiliary File exists
Error:	data/auxiliary/masks File exists
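The `File exists` messages above appear whenever the cell is re-run on an existing tree. A minimal sketch of an idempotent alternative, assuming it is acceptable to create missing parent directories implicitly; the helper name `ensure_directories` is hypothetical and not part of the notebook:

```python
import os


def ensure_directories(paths):
    """Create every directory in paths and return the ones that were new."""
    created = []
    for path in paths:
        if not os.path.isdir(path):
            created.append(path)
        # exist_ok=True turns a re-run into a no-op instead of an OSError
        os.makedirs(path, exist_ok=True)
    return created
```

With the namedtuple from the cell above this could be called as `ensure_directories(dirs)`; a second call should then return an empty list.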


<font style="color:red;">Describe the functions</font>

In [5]:
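def download(url: str) -> bytes:
    """Fetch the resource at url and return its raw content."""
    request = urllib.request.Request(url)
    # the context manager closes the connection after reading
    with urllib.request.urlopen(request) as response:
        return response.read()


def write_binary(content: bytes, to_path: str) -> None:
    """Write raw content to the file at to_path."""
    with open(to_path, 'wb') as dst:
        dst.write(content)


def worker(url: str, to_path: str) -> None:
    """Thread target: download url and store it at to_path."""
    content = download(url)
    write_binary(content, to_path)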
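As noted in the introduction, the notebook does not check whether a file has already been downloaded. The following is a minimal sketch of such a check, not part of the original pipeline; the name `download_if_missing` is hypothetical, and the test only looks for a non-empty file, so it cannot detect a truncated download.

```python
import os
import urllib.request


def download_if_missing(url: str, to_path: str) -> bool:
    """Download url to to_path unless a non-empty file already exists there.

    Returns True if a download happened and False if it was skipped.
    """
    if os.path.isfile(to_path) and os.path.getsize(to_path) > 0:
        return False
    with urllib.request.urlopen(url) as response:
        content = response.read()
    with open(to_path, 'wb') as dst:
        dst.write(content)
    return True
```

Such a function could be passed to `threading.Thread` in place of `worker`, since it takes the same two arguments.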

## Core data
Explain the phrase core data
- pivotal data for the processing pipeline
- list the different datasets
- explain the filter properties

### Global Forest Change
[**Global Forest Change 2000-2012 Version 1.0**](https://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.0.html) (GFC) is the first high-resolution dataset that provides a comprehensive view of the annual global forest cover change between 2000 and 2012 \cite{Hansen2013, Li2017}. The initial GFC dataset released by Hansen et al. has been extended by more recent releases, which encompass the annual forest cover changes between [2000-2013 (Version 1.1)](https://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.1.html), [2000-2014 (Version 1.2)](https://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.2.html), [2000-2015 (Version 1.3)](https://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.3.html) and [2000-2016 (Version 1.4)](https://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.4.html) respectively. All versions of this dataset have in common that they are derived from growing-season imagery captured by the remote sensing satellite Landsat 7 Enhanced Thematic Mapper Plus (ETM+) at a spatial resolution of 30 meters per pixel \cite{Hansen2013a}. A time-series analysis of spectral metrics is applied to the satellite imagery to derive the global forest extent in 2000 as well as the annual forest loss and gain. Hence, GFC comprises three independent data layers (tree cover, annual forest loss and forest gain), each divided into 10x10 degree tiles in the geodetic coordinate system *World Geodetic System 1984* (EPSG:4326). Furthermore, across the provided layers the pixel data is coded as unsigned 8-bit integers. Hansen et al. defined trees as all vegetation taller than 5 meters for their study. Forest loss is defined as a stand-replacement disturbance leading from a forest state to a non-forest state. To compute these losses 

[Global Forest Watch](http://www.globalforestwatch.org/) interactive map

- Flow: general description of GFC first, then detailed information on the monitoring method, the individual layers and the certainty of the information
- trees are defined as all vegetation taller than 5 meters Hansen2013, Hansen2013a
- forest loss is defined as a stand-replacement disturbance (> x% crown cover to 0% crown cover) Hansen2013, Hansen2013a
- monitored by a reference percent tree cover stratum Hansen2013, Hansen2013a
- forest degradation, for example selective removals, i.e. all impacts on forest that do not lead to a non-forest state, is not considered Hansen2013a
- the term forest refers to tree cover Hansen2013a
- gain is the inverse of loss, i.e. the change from a non-forest state to forest (crown cover densities >50%)
- forest loss detection is less uncertain than gain detection (loss is more reliable) Li2017
- gain is a more gradual and ecologically complex process; its signal is more difficult to detect Li2017
- Li2017 compares 4 different forest cover change products on their performance in estimating loss and gain patterns in China
- at the end show an example picture of the data


\cite{Hansen2013}
\cite{Hansen2013a}
\cite{Tropek2014}
\cite{Bellot2014}
\cite{Li2017}
\cite{Li2017a}

<font color='red'>**Code explanation here**</font>
- [**Treecover2000**](http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/treecover2000.txt)
- [**Gain**](http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/gain.txt)
- [**Lossyear**](http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/lossyear.txt)
- a URL is needed for each tile of each layer
- the researchers provide a URL index file for each layer
- download the relevant index files
- store their content (the tile URLs) in a variable

In [6]:
# data source URL
head = 'http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/'
# files to download from source url
tails = 'treecover2000.txt gain.txt lossyear.txt'.split()

data_urls = []
for tail in tails:
    content = download(head + tail)
    data_urls += content.decode('utf-8').splitlines()

'GFC dataset comprises {} files'.format(len(data_urls))

'GFC dataset comprises 1512 files'

<font color='red'>**Code explanation here**</font>

In [7]:
def orientation_to_int(orient: str) -> int:
    coor, orient = re.match(r'(?P<coor>\d+)(?P<orient>N|S|W|E)', orient, re.I).groups()
    if orient.lower() in ('n', 'e'):
        return int(coor)
    else:
        return -1 * int(coor)


to_download = []
for url in data_urls:
    lat_lon = re.search(r'(\d{2}\w_\d{3}\w)(?=\.tif)', url).groups()[0]
    lat = orientation_to_int(lat_lon.split('_')[0])
    if -20 <= lat <= 30:
        to_download.append(url)

'{} files are in bounds between 30N and 20S'.format(len(to_download))

'648 files are in bounds between 30N and 20S'
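The latitude filter above works on the coordinates encoded in the tile file names. The sketch below makes the covered extent explicit and tests each tile against the tropical zone. It assumes that a tile name denotes the tile's top-left (north-west) corner and that tiles span 10 degrees; the function names and the example file names are illustrative only.

```python
import re


def tile_bounds(url: str, size: int = 10):
    """Return (south, west, north, east) in degrees for a GFC tile URL.

    Assumption: the name encodes the top-left corner of the tile.
    """
    match = re.search(r'(\d{2})([NS])_(\d{3})([EW])(?=\.tif)', url, re.I)
    if match is None:
        raise ValueError('no tile coordinates found in {!r}'.format(url))
    lat, ns, lon, ew = match.groups()
    north = int(lat) if ns.upper() == 'N' else -int(lat)
    west = int(lon) if ew.upper() == 'E' else -int(lon)
    return (north - size, west, north, west + size)


def intersects_tropics(url: str, limit: float = 23.43) -> bool:
    """True if the tile overlaps the band between -limit and +limit latitude."""
    south, _, north, _ = tile_bounds(url)
    return south < limit and north > -limit
```

Under these assumptions a tile named `..._20S_070W.tif` spans 30&deg; S to 20&deg; S and still overlaps the tropics, which matches the six latitude bands (30N to 20S) selected by the cell above.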

<font color='red'>**Code explanation here**</font> 

In [9]:
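# download the tiles in batches of three parallel threads
batch_size = 3
for idx in range(0, len(to_download), batch_size):
    url_stack = to_download[idx:idx + batch_size]
    threads = []
    for url in url_stack:
        # store each tile under its original file name in the gfc directory
        path = os.path.join(dirs.gfc, url.split('/')[-1])
        thread = threading.Thread(target=worker, args=(url, path))
        thread.start()
        threads.append(thread)
    # wait until the whole batch has finished before starting the next one
    for thread in threads:
        thread.join()
    clear_output()
    # min() keeps the counter correct when the last batch is incomplete
    print('Downloaded {} of {}'.format(min(idx + batch_size, len(to_download)), len(to_download)))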

Downloaded 648 of 648
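The batch loop above waits for the slowest download of each batch before it starts the next one. As a hedged alternative, a thread pool from the standard library keeps a fixed number of downloads running and collects failures instead of aborting. This is a sketch, not the method used in the notebook; `download_all` and its parameters are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, to_dir, fetch, max_workers=3):
    """Run fetch(url, path) for every URL with at most max_workers threads.

    Returns a list of (url, exception) pairs for the downloads that failed.
    """
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {}
        for url in urls:
            path = os.path.join(to_dir, url.split('/')[-1])
            futures[pool.submit(fetch, url, path)] = url
        for done, future in enumerate(as_completed(futures), start=1):
            error = future.exception()
            if error is not None:
                failed.append((futures[future], error))
            print('Finished {} of {}'.format(done, len(futures)))
    return failed
```

A call such as `download_all(to_download, dirs.gfc, worker)` would reuse the `worker` function defined earlier; the returned list can then be used to retry only the failed tiles.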


![Hansen preview](img/hansen_preview.png)

### GlobalLand30
[GlobLand30](http://www.globallandcover.com/GLC30Download/index.aspx) (GL30)

![Chen preview](img/chen_preview.png)

### Global aboveground carbon dataset
- Baccini et al. 2012 http://data.globalforestwatch.org/datasets/d33587b6aee248faa2f388aaac96f92c_0, https://carbonmaps.ourecosystem.com/interface/
- Saatchi et al. 2011 https://carbonmaps.ourecosystem.com/interface/

In [119]:
url = 'http://data.globalforestwatch.org/datasets/d33587b6aee248faa2f388aaac96f92c_0.geojson'
# query variants with a bounding-box filter, kept for reference (commented out so that the cell remains valid Python):
# http://data.globalforestwatch.org/datasets/d33587b6aee248faa2f388aaac96f92c_0.geojson?where=&geometry={"xmin":-30819409.80457507,"ymin":-5866287.454882018,"xmax":30819409.80457507,"ymax":5874440.08971801,"spatialReference":{"wkt":"PROJCS[\"WGS_1984_Web_Mercator_Auxiliary_Sphere\",GEOGCS[\"GCS_WGS_1984\",DATUM[\"D_WGS_1984\",SPHEROID[\"WGS_1984\",6378137.0,298.257223563]],PRIMEM[\"Greenwich\",0.0],UNIT[\"Degree\",0.0174532925199433]],PROJECTION[\"Mercator_Auxiliary_Sphere\"],PARAMETER[\"False_Easting\",0.0],PARAMETER[\"False_Northing\",0.0],PARAMETER[\"Central_Meridian\",25.664062499993346],PARAMETER[\"Standard_Parallel_1\",0.0],PARAMETER[\"Auxiliary_Sphere_Type\",0.0],UNIT[\"Meter\",1.0]]"}}
# http://data.globalforestwatch.org/datasets/d33587b6aee248faa2f388aaac96f92c_0.geojson?where=&geometry={"xmin":-180,"ymin":-23,"xmax":180,"ymax":23,"spatialReference":{"wkt":"GEOGCS[\"GCS_WGS_1984\",DATUM[\"D_WGS_1984\",SPHEROID[\"WGS_1984\",6378137,298.257223563]],PRIMEM[\"Greenwich\",0],UNIT[\"Degree\",0.017453292519943295]]"}}
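The query URLs above append a `where` clause and a JSON `geometry` filter to the dataset URL. A minimal sketch of how such a URL could be assembled and percent-encoded with the standard library follows. The function name is hypothetical, and the default spatial reference `{'wkid': 4326}` is an assumption; the URLs above pass a `wkt` definition instead, and whether the service accepts either form has not been verified here.

```python
import json
import urllib.parse


def build_query_url(base_url, xmin, ymin, xmax, ymax, spatial_reference=None):
    """Append a bounding-box geometry filter to base_url."""
    if spatial_reference is None:
        # assumption: geographic WGS84 coordinates identified by their EPSG code
        spatial_reference = {'wkid': 4326}
    geometry = {
        'xmin': xmin, 'ymin': ymin, 'xmax': xmax, 'ymax': ymax,
        'spatialReference': spatial_reference,
    }
    query = urllib.parse.urlencode({
        'where': '',
        'geometry': json.dumps(geometry, separators=(',', ':')),
    })
    return base_url + '?' + query
```

For the tropical extent used in the second URL above, the call would be `build_query_url(url, -180, -23, 180, 23)`.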

### Global soil carbon dataset
- http://climate.globalforestwatch.org/map/3/20.19/-22.75/ALL/dark/none/857

### Ecosystem service valuation database
[Ecosystem Service Valuation Database](https://www.es-partnership.org/services/data-knowledge-sharing/ecosystem-service-valuation-database/) (ESVD)


In [117]:
url = 'https://www.es-partnership.org/wp-content/uploads/2016/06/ESVD-TEEB-database.xls'

content = download(url)
write_binary(content, os.path.join(dirs.esvd, url.split('/')[-1]))

## Auxiliary data
- [Global Administrative Areas](http://www.gadm.org/) extract lvl 0 and lvl 1
- [Natural Earth Data](http://www.naturalearthdata.com/) full extract
- intact forest landscapes 2000 from Global Forest Watch
- gl30 mask from dropbox or hp

In [118]:
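# GADM 2.8 administrative areas (separate names so that the second URL does not overwrite the first)
url_gadm = 'http://biogeo.ucdavis.edu/data/gadm2.8/gadm28_levels.shp.zip'
# Natural Earth populated places (1:10m cultural vectors)
url_natural_earth = 'http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_populated_places_simple.zip'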
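Both auxiliary sources are distributed as ZIP archives, and `zipfile` is already imported in the preparation section. The following is a minimal sketch of unpacking only the shapefile components of a downloaded archive; the function name and the default suffix list are assumptions for illustration.

```python
import os
import zipfile


def extract_archive(zip_path, to_dir, suffixes=('.shp', '.shx', '.dbf', '.prj')):
    """Extract the members with the given suffixes and return their paths."""
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(tuple(suffixes)):
                # ZipFile.extract creates to_dir if necessary and returns the new path
                extracted.append(archive.extract(name, to_dir))
    return extracted
```

The extracted `.shp` file could then be opened with `gpd.read_file(path)`, for example after storing the archive in `dirs.auxiliary`.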

##### References

[<a id="cit-Hansen2013" href="#call-Hansen2013">1</a>] Hansen M. C., Potapov P. V., Moore R. <em>et al.</em>, "_High-Resolution Global Maps of 21st-Century Forest Cover Change_", Science, vol. 342, number 6160, pp. 850-853, November 2013.

[<a id="cit-Li2017" href="#call-Li2017">2</a>] Li Yan, Sulla-Menashe Damien, Motesharrei Safa <em>et al.</em>, "_Inconsistent estimates of forest cover change in China between 2000 and 2013 from multiple datasets: differences in parameters, spatial resolution, and definitions_", Scientific Reports, vol. 7, number 8748, August 2017.

[<a id="cit-Hansen2013a" href="#call-Hansen2013a">3</a>] Hansen M. C., Potapov P. V., Moore R. <em>et al.</em>, "_Supplementary Materials for: High-Resolution Global Maps of 21st-Century Forest Cover Change_", Science, vol. 342, number 6160, pp. 1-32, November 2013.  [online](http://science.sciencemag.org/content/suppl/2013/11/14/342.6160.850.DC1)

[<a id="cit-Tropek2014" href="#call-Tropek2014">4</a>] Tropek Robert, Sedláček Ondřej, Beck Jan <em>et al.</em>, "_Comment on High-resolution global maps of 21st-century forest cover change_", Science, vol. 344, number 981, 2014.

[<a id="cit-Bellot2014" href="#call-Bellot2014">5</a>] Bellot Franz-Fabian, Bertram Mathias, Navratil Peter <em>et al.</em>, "_The high-resolution global map of 21st-century forest cover change from the University of Maryland (Hansen Map) is hugely overestimating deforestation in Indonesia_", FORCLIME Press release, 2014.  [online](http://www.forclime.org/documents/press_release/FORCLIME_Overestimation%20of%20Deforestation.pdf)

[<a id="cit-Li2017a" href="#call-Li2017a">6</a>] Li Yan, Sulla-Menashe Damien, Motesharrei Safa <em>et al.</em>, "_Supplementary Information for Inconsistent estimates of forest cover change in China between 2000 and 2013 from multiple datasets_", Scientific Reports, vol. 7, number 8748, August 2017.

