# Data acquisition
This *Jupyter* notebook comprises the required core and auxiliary data downloads for my master's thesis. All downloads are stored locally in a directory tree called ***data*** within the notebook's root directory. Make sure that you have enough free disk space, because the datasets are large. A fast internet connection is also recommended so that the required files download quickly. The notebook follows a top-down approach, therefore you should execute the code cells from top to bottom. Executing code cells in an arbitrary order leads to exceptions and errors, so please don't do it. Moreover, the notebook does not check whether the data has already been downloaded. Hence, executing the download cells twice re-downloads the entire datasets.

My master's thesis focuses on deforestation and its drivers at global, continental and local scale in the tropical zone between 2001 and 2010. Therefore, the downloads are filtered to the extent of this zone, which covers the latitudes between approximately 23.43&deg; North and 23.43&deg; South (WGS84). Furthermore, the data acquisition focuses on datasets of the mentioned time period. This notebook comprises the following sections:

[**Preparation**](#Preparation) contains all required initial steps like importing crucial standard library modules and constructing the directory tree for storing the downloaded datasets. The code snippets of this section are fundamental; if you skip them, the subsequent code cells will fail with fatal errors.

[**Core data**](#Core-data) encompasses the code cells required for downloading and filtering the core datasets used to determine the drivers of deforestation in the tropics. These datasets are raster files of the type GeoTIFF (file extension \*.tif), which are stored in one sub-folder per dataset within the directory **core** according to [Section Preparation](#Preparation). As mentioned in the introduction, you must execute the code cells in the provided top-down order, otherwise the execution fails and crucial data will be missing in the subsequent chapters.

[**Auxiliary data**](#Auxiliary-data) <font color='red'>**Not done yet**</font> This section should contain extra data downloads like satellite images for NDVI computation, country masks, tile masks etc.

## Preparation
As the first step we must import all necessary *Python* standard library modules as well as other mandatory packages to accomplish the data download and filtering successfully. In detail, the following modules are required:
- ***matplotlib*** an IPython built-in magic command (%matplotlib) for importing matplotlib and enabling the inline backend for notebooks
- ***re*** a module for applying regular expressions
- ***os*** a module to use operating system dependent functionality
- ***zipfile*** a module that provides tools for ZIP archives
- ***threading*** a module that provides a multi-threading API
- ***urllib.request*** a module to open and download URLs
- ***geopandas*** a powerful and feature-rich package for manipulating geo-vector files like shp, geojson etc., built on pandas and GDAL
- ***collections.namedtuple*** a factory function to create tuple-like objects that have fields accessible by attribute lookup as well as being indexable and iterable
- ***IPython.display.clear_output*** a function to clear the output of a code cell

In [2]:
%matplotlib inline
import os
import re
import zipfile
import threading
import urllib.request
import geopandas as gpd
from collections import namedtuple
from IPython.display import clear_output

Finally, the following code cell creates the ***data*** directory tree below the root folder.
- **data**
    - **core** the entire data from [Section 1.2](#Core-data)
        - **gfc** data from [Section 1.2.1](#Global-Forest-Change)
        - **gl30** data from [Section 1.2.2](#GlobalLand30)
        - **gc** data from [Section 1.2.3](#GlobCover)
    - **auxiliary** the entire data from [Section 1.3](#Auxiliary-data)
        - **masks** data from [Section 1.2.2](#GlobalLand30)

In [3]:
Directories = namedtuple('Directories', 'root core gfc gl30 gc auxiliary masks')

# directories to create
directories_data = 'data data.core data.core.gfc data.core.gl30 data.core.gc data.auxiliary data.auxiliary.masks'

# OS compatibility: replace "." with the OS-dependent path separator and store the paths in the namedtuple Directories
dirs = Directories(*re.sub(r'\.', os.sep, directories_data).split())

# make directories according to values assigned to namedtuple Directories
for directory in dirs:
    try:
        os.mkdir(directory)
        print('Created:\t{}'.format(directory))
    except OSError as e:
        print('Error:\t{} {}'.format(directory, e.strerror))

Error:	data File exists
Error:	data/core File exists
Error:	data/core/gfc File exists
Error:	data/core/gl30 File exists
Error:	data/core/gc File exists
Error:	data/auxiliary File exists
Error:	data/auxiliary/masks File exists
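The *File exists* messages above appear because *os.mkdir* raises an error for every directory that already exists. A minimal sketch of an idempotent alternative follows, assuming that *os.makedirs* with *exist_ok=True* is acceptable here. The helper name *make_tree* and the temporary demo directory are hypothetical and not part of this notebook.

```python
import os
import tempfile


def make_tree(root, dotted_names):
    """Create one directory per dotted name below root; existing directories are left untouched."""
    paths = []
    for name in dotted_names:
        # "data.core.gfc" becomes root/data/core/gfc with the OS-dependent separator
        path = os.path.join(root, *name.split('.'))
        os.makedirs(path, exist_ok=True)  # also creates missing parent directories
        paths.append(path)
    return paths


# demonstration in a temporary directory so that the real data tree is not touched
with tempfile.TemporaryDirectory() as tmp:
    names = 'data data.core data.core.gfc data.auxiliary.masks'.split()
    first = make_tree(tmp, names)
    second = make_tree(tmp, names)  # the second run raises no error
    all_exist = all(os.path.isdir(path) for path in first)

print(all_exist)  # True
```

With this variant a repeated execution of the cell would stay silent instead of reporting an error for each existing directory.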


## Core data
Explain the phrase core data
- pivotal data for the processing pipeline
- list the different datasets
- explain the filter properties

### Global Forest Change
[**Global Forest Change 2000-2012 Version 1.0**](https://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.0.html) (GFC) is the first high-resolution dataset that provides a comprehensive view of the annual global forest cover change between 2000 and 2012 \cite{Hansen2013, Li2017}. The initial GFC dataset released by Hansen et al. has been extended by more recent releases, which cover the annual forest cover changes for [2000-2013 (Version 1.1)](https://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.1.html), [2000-2014 (Version 1.2)](https://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.2.html), [2000-2015 (Version 1.3)](https://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.3.html) and [2000-2016 (Version 1.4)](https://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.4.html) respectively. All versions of this dataset have in common that they are derived from growing season imagery captured by the remote sensing satellite Landsat 7 Enhanced Thematic Mapper Plus (ETM+) at a spatial resolution of 30 meters per pixel \cite{Hansen2013a}. A time-series analysis of spectral metrics is applied to the satellite imagery to derive the global forest extent in 2000 as well as the annual forest loss and gain. Hence, GFC comprises three independent data layers (tree cover, annual forest loss and forest gain), divided into 10x10 degree tiles in the geodetic coordinate system *World Geodetic System 1984* (EPSG:4326).

[Global Forest Watch](http://www.globalforestwatch.org/) interactive map

- Flow: first describe in general what GFC is, then give detailed info on the monitoring method, the different layers and how certain the information is
- trees are defined as all vegetation higher than 5 meters Hansen2013, Hansen2013a
- forest loss is defined as a stand-replacement disturbance (> x% crown cover to 0% crown cover) Hansen2013, Hansen2013a
- monitored by a reference percent tree cover stratum Hansen2013, Hansen2013a
- forest degradation, for example selective removals, i.e. all impacts on forest that do not lead to a non-forest state, is not considered Hansen2013a
- the term forest refers to tree cover Hansen2013a
- gain is the inverse of loss, i.e. the change from a non-forest state to forest (crown cover densities >50%)
- forest loss detection is less uncertain than gain detection (loss is more reliable) Li2017
- gain is a more gradual and ecologically complex process, so its signal is more difficult to detect Li2017
- Li2017 compares 4 different forest cover change products on their performance in estimating loss and gain patterns in China
- at the end show an example picture of the data

\cite{Hansen2013}
\cite{Hansen2013a}
\cite{Tropek2014}
\cite{Bellot2014}
\cite{Li2017}
\cite{Li2017a}

<font color='red'>**Code explanation here**</font>
- [**Treecover2000**](http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/treecover2000.txt)
- [**Gain**](http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/gain.txt)
- [**Lossyear**](http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/lossyear.txt)
- a URL is needed for each tile of each layer
- the researchers provide a URL index file for each layer
- download the relevant index files
- store their content (tile URLs) in a variable

In [7]:
# data source URL
head_url = 'http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/'

# append the relevant layer URL index file to head url, download file content and assign it to a list variable
treecover_urls = urllib.request.urlopen(head_url + 'treecover2000.txt').read().decode().splitlines()
gain_urls = urllib.request.urlopen(head_url + 'gain.txt').read().decode().splitlines()
lossyear_urls = urllib.request.urlopen(head_url + 'lossyear.txt').read().decode().splitlines()

print('Treecover2000:\t{} URLs\nGain:\t\t{} URLs\nLossyear:\t{} URLs'
      .format(len(treecover_urls), len(gain_urls), len(lossyear_urls)))

Treecover2000:	504 URLs
Gain:		504 URLs
Lossyear:	504 URLs


<font color='red'>**Code explanation here**</font>

In [8]:
def is_in_extent(coord: int, orient: str, north_limit: int, south_limit: int):
    """Check whether the latitude label of a tile lies within the given limits.

    coord is the unsigned latitude taken from the tile name, orient its hemisphere (N or S).
    """
    if orient.lower() == 'n':
        return coord <= north_limit
    elif orient.lower() == 's':
        return coord <= south_limit
    return False

# example tile URL:
# http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/Hansen_GFC2013_lossyear_00N_030W.tif
regex = re.compile(r"""
                        (?:\w+_){3}  # non-capturing group: one or more word characters (greedy) followed by an underscore, repeated three times
                        (?P<coord>\d{2})  # named group: exactly two digits
                        (?P<orient>S|N)  # named group: S or N
                    """, re.VERBOSE)

# [(treecover, gain, lossyear), ...]
raw_urls = list(zip(treecover_urls, gain_urls, lossyear_urls))
filtered_urls = []

for urls in raw_urls:
    coord, orient = regex.search(urls[0]).groups()
    if is_in_extent(int(coord), orient, 30, 20):
        filtered_urls.append(urls)

print('{} URLs are not in bounds between {} North and {} South.'
      .format(len(raw_urls) - len(filtered_urls), 30, -20))
print('Filtered URLs contains 3 * {} elements.'.format(len(filtered_urls)))

288 URLs are not in bounds between 30 North and -20 South.
Filtered URLs contains 3 * 216 elements.
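The filter above compares only the latitude label of each tile name. The following sketch shows which latitude band such a label stands for, under the assumption that the label names the upper-left (north-west) corner of a 10x10 degree tile. The names *tile_pattern*, *latitude_band* and *intersects_tropics* are hypothetical helpers for illustration and not part of the GFC distribution.

```python
import re

TILE_SIZE = 10  # tile edge length in degrees, as described for GFC above

# matches the trailing "_00N_030W.tif" part of a tile file name
tile_pattern = re.compile(r'_(?P<lat>\d{2})(?P<ns>[NS])_(?P<lon>\d{3})(?P<ew>[EW])\.tif$')


def latitude_band(url):
    """Return (south, north) in degrees, assuming the label is the upper-left tile corner."""
    match = tile_pattern.search(url)
    sign = 1 if match.group('ns') == 'N' else -1
    north = sign * int(match.group('lat'))
    return north - TILE_SIZE, north


def intersects_tropics(url, limit=23.43):
    """True if the latitude band of the tile overlaps the zone between -limit and +limit."""
    south, north = latitude_band(url)
    return south < limit and north > -limit


print(latitude_band('Hansen_GFC2013_lossyear_00N_030W.tif'))       # (-10, 0)
print(intersects_tropics('Hansen_GFC2013_lossyear_30N_030W.tif'))  # True
print(intersects_tropics('Hansen_GFC2013_lossyear_30S_030W.tif'))  # False
```

Under this assumption the labels 30N down to 20S overlap the tropical zone, which are the same six tile rows that pass the limits 30 and 20 used above, and together they span 30&deg; North to 30&deg; South.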


<font color='red'>**Code explanation here**</font>

In [9]:
def open_urls(*args):
    response = [urllib.request.urlopen(url) for url in args]
    return response


def write_response(out_path: str, *args):
    for response in args:
        filename = response.url.split('/')[-1]
        with open(os.path.join(out_path, filename), 'wb') as dst:
            dst.write(response.read())
        print('DOWNLOADED {}\nTO {}'.format(response.url, os.path.join(out_path, filename)))


def worker(out_path: str, urls: list):
    response = open_urls(*urls)
    write_response(out_path, *response)


finished = 0
for idx in range(0, len(filtered_urls), 3):
    current_urls = filtered_urls[idx:idx + 3]
    threads = []
    for urls in current_urls:
        thread = threading.Thread(target=worker, args=(dirs.gfc, urls))
        thread.start()
        threads.append(thread)
    # wait until all threads of the current batch have finished
    for thread in threads:
        thread.join()
    clear_output()
    finished += len(current_urls) * 3
    print('DOWNLOADED {} OF {} FILES'.format(finished, len(filtered_urls) * 3))

DOWNLOADED 648 OF 648 FILES
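As noted in the introduction, the download cell does not check whether a file is already stored locally. A minimal sketch of such a check follows, assuming that an existing file with the same name counts as complete. An interrupted download would therefore be skipped as well and has to be deleted by hand. The helper names *target_path* and *pending_urls* and the example URLs are hypothetical.

```python
import os
import tempfile


def target_path(out_path, url):
    """Local file path for a URL: output directory plus the last path segment of the URL."""
    return os.path.join(out_path, url.split('/')[-1])


def pending_urls(out_path, urls):
    """Return only those URLs whose target file does not exist yet."""
    return [url for url in urls if not os.path.exists(target_path(out_path, url))]


# demonstration in a temporary directory with one file that is already present
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, 'tile_a.tif'), 'wb').close()
    remaining = pending_urls(tmp, ['http://example.com/tile_a.tif',
                                   'http://example.com/tile_b.tif'])

print(remaining)  # ['http://example.com/tile_b.tif']
```

In the download loop above, each thread could then be started with pending_urls(dirs.gfc, urls) instead of urls, so that a second run only fetches the missing tiles.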


![Hansen preview](img/hansen_preview.png)

### GlobalLand30

[GL30 mask](http://globallandcover.com/document/globemapsheet.zip)

In [8]:
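gl30_mask_url = 'http://globallandcover.com/document/globemapsheet.zip'
gl30_mask_path = os.path.join(dirs.masks, gl30_mask_url.split('/')[-1])

# download the archive first if it is not stored locally yet
if not os.path.exists(gl30_mask_path):
    urllib.request.urlretrieve(gl30_mask_url, gl30_mask_path)

with zipfile.ZipFile(gl30_mask_path, 'r') as src:
    to_extract = src.namelist()
    src.extractall(path=dirs.masks)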

print('Extracted the following files {}'.format(to_extract))
os.remove(gl30_mask_path)
print('Removed archive {}'.format(gl30_mask_path))

Extracted the following files ['GlobeMapSheet.dbf', 'GlobeMapSheet.prj', 'GlobeMapSheet.sbn', 'GlobeMapSheet.sbx', 'GlobeMapSheet.shp', 'GlobeMapSheet.shp.xml', 'GlobeMapSheet.shx']
Removed archive data/auxiliary/masks/globemapsheet.zip


In [15]:
gl30_mask_shp = [item for item in os.listdir(dirs.masks) if bool(re.match(r'.+\.shp$', item))][0]

gl30_mask = gpd.read_file(os.path.join(dirs.masks, gl30_mask_shp)).cx[:,-23.43:23.43]
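# tile identifiers in the file name style, e.g. S48_10 (hemisphere, zero-padded UTM zone, zero-padded row)
gl30_file_filter = [
    '_'.join([items[0] + str(items[1]).zfill(2), str(items[2]).zfill(2)])
    for items in zip(gl30_mask.NS, gl30_mask.UTMZONE, gl30_mask.ROW)
]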

print('{} Tiles are not in bounds between {} North and {} South.'
      .format(853 - len(gl30_mask), 23.43, -23.43))
print('File filter contains {} elements.'.format(len(gl30_file_filter)))
gl30_mask.head()

495 Tiles are not in bounds between 23.43 North and -23.43 South.
File filter contains 358 elements.


| | NS | UTMZONE | ROW | CONTINENT | REMARK | geometry |
|---|---|---|---|---|---|---|
| 37 | S | 48 | 10 | Asia | S48_10 | POLYGON ((108 -15.00000003, 101.99999988 -15.0... |
| 38 | S | 50 | 10 | Asia | S50_10 | POLYGON ((119.99999988 -15.00000003, 114.00000... |
| 39 | S | 43 | 5 | Asia | S43_5 | POLYGON ((78.00000012 -9.999999989999999, 72 -... |
| 40 | S | 48 | 5 | Asia | S48_5 | POLYGON ((108 -9.999999989999999, 101.99999988... |
| 41 | S | 49 | 5 | Asia | S49_5 | POLYGON ((114.00000012 -9.999999989999999, 108... |
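The file filter is built from the columns NS, UTMZONE and ROW rather than from REMARK, because the archive names pad both numbers to two digits (for example S04_00_2010LC030.zip) while REMARK does not (S43_5). A small sketch of this mapping follows; the function names *tile_id* and *archive_is_selected* are hypothetical and only illustrate the idea.

```python
import re


def tile_id(ns, utm_zone, row):
    """Build a tile identifier in the archive naming style, e.g. ('S', 43, 5) -> 'S43_05'."""
    return '{}{:02d}_{:02d}'.format(ns, int(utm_zone), int(row))


# hemisphere letter, two-digit UTM zone, underscore, two-digit row
id_pattern = re.compile(r'(?P<id>[NS]\d{2}_\d{2})')


def archive_is_selected(filename, selected_ids):
    """True if the file name carries a tile identifier that is part of the selection."""
    match = id_pattern.search(filename)
    return match is not None and match.group('id') in selected_ids


selected = {tile_id('S', 48, 10), tile_id('S', 43, 5)}
print(sorted(selected))                                       # ['S43_05', 'S48_10']
print(archive_is_selected('S43_05_2010LC030.zip', selected))  # True
print(archive_is_selected('S04_00_2010LC030.zip', selected))  # False
```

Returning False for names without an identifier, instead of raising an error, is a design choice of this sketch so that unrelated files in the folder can simply be ignored.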


In [16]:
regex_id = re.compile(r'(?P<id>(?:N|S)\d{2}_\d{2})', re.VERBOSE)
regex_ext = re.compile(r'.+\.tif', re.VERBOSE)

count = 0
for item in os.listdir(dirs.gl30):
    match = regex_id.search(item).group('id')
    if match not in gl30_file_filter:
        os.remove(os.path.join(dirs.gl30, item))
    else:
        with zipfile.ZipFile(os.path.join(dirs.gl30, item)) as src:
            to_extract = [ele for ele in src.namelist() if bool(regex_ext.match(ele))][0]
            src.extract(to_extract, path=dirs.gl30)
        # ZIP member names always use "/" as separator, independent of the operating system
        folder, filename = to_extract.split('/')[0], to_extract.split('/')[-1]
        os.rename(os.path.join(dirs.gl30, to_extract), os.path.join(dirs.gl30, filename))
        os.rmdir(os.path.join(dirs.gl30, folder))
        os.remove(os.path.join(dirs.gl30, item))
        clear_output()
        print('Unzip:\t{}'.format(os.path.join(dirs.gl30, to_extract)))
        print('Move:\t{}'.format(os.path.join(dirs.gl30, filename)))
    print('Removed:\t{}'.format(os.path.join(dirs.gl30, item)))

Unzip:	data/core/gl30/S04_00_2010LC030/s04_00_2010lc030.tif
Move:	data/core/gl30/s04_00_2010lc030.tif
Removed:	data/core/gl30/S04_00_2010LC030.zip


![Chen preview](img/chen_preview.png)

### GlobCover

In [17]:
globcover_url = 'http://due.esrin.esa.int/files/Globcover2009_V2.3_Global_.zip'

response = open_urls(globcover_url)
write_response(dirs.gc, *response)

DOWNLOADED http://due.esrin.esa.int/files/Globcover2009_V2.3_Global_.zip
TO data/core/gc/Globcover2009_V2.3_Global_.zip


In [19]:
path_to_zip = os.path.join(dirs.gc, os.listdir(dirs.gc)[0])
regex = re.compile(r'.+\.tif', re.IGNORECASE)

with zipfile.ZipFile(path_to_zip, 'r') as src:
    to_extract = [item for item in src.namelist() if bool(regex.match(item))]
    src.extractall(path=dirs.gc, members=to_extract)

print('Extracted the following files {}'.format(to_extract))
os.remove(path_to_zip)
print('Removed archive {}'.format(path_to_zip))

Extracted the following files ['GLOBCOVER_L4_200901_200912_V2.3.tif', 'GLOBCOVER_L4_200901_200912_V2.3_CLA_QL.tif']
Removed archive data/core/gc/Globcover2009_V2.3_Global_.zip
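The pattern used above is not anchored at the end of the name, so it would also accept members such as *.tif.aux.xml* or *.tiff* if the archive contained them. A minimal sketch of a stricter selection by file extension follows. It uses a small in-memory archive with made-up member names instead of the real GlobCover download, and the helper name *tif_members* is hypothetical.

```python
import io
import zipfile


def tif_members(archive):
    """Names of all archive members that end with .tif (case-insensitive)."""
    return [name for name in archive.namelist() if name.lower().endswith('.tif')]


# build a small in-memory archive for demonstration purposes only
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as dst:
    dst.writestr('demo_landcover.tif', b'')
    dst.writestr('demo_legend.xls', b'')
    dst.writestr('demo_quality.TIF', b'')
    dst.writestr('demo_landcover.tif.aux.xml', b'')

with zipfile.ZipFile(buffer) as src:
    selected_members = tif_members(src)

print(selected_members)  # ['demo_landcover.tif', 'demo_quality.TIF']
```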


### Ecosystem service valuation database
https://www.es-partnership.org/services/data-knowledge-sharing/ecosystem-service-valuation-database/

## Auxiliary data
<p><font color='red'>**Not done yet**</font></p>
- country mask http://gadm.org/
- http://biogeo.ucdavis.edu/data/gadm2.8/gadm28_levels.shp.zip

## Result
<p><font color='red'>**Not done yet**</font></p>

##### References

[<a id="cit-Hansen2013" href="#call-Hansen2013">1</a>] Hansen M. C., Potapov P. V., Moore R. <em>et al.</em>, "_High-Resolution Global Maps of 21st-Century Forest Cover Change_", Science, vol. 342, number 6160, pp. 850-853, November 2013.

[<a id="cit-Hansen2013a" href="#call-Hansen2013a">2</a>] Hansen M. C., Potapov P. V., Moore R. <em>et al.</em>, "_Supplementary Materials for: High-Resolution Global Maps of 21st-Century Forest Cover Change_", Science, vol. 342, number 6160, pp. 1-32, November 2013.  [online](http://science.sciencemag.org/content/suppl/2013/11/14/342.6160.850.DC1)

[<a id="cit-Tropek2014" href="#call-Tropek2014">3</a>] Tropek Robert, Sedláček Ondřej, Beck Jan <em>et al.</em>, "_Comment on High-resolution global maps of 21st-century forest cover change_", Science, vol. 344, p. 981, 2014.

[<a id="cit-Bellot2014" href="#call-Bellot2014">4</a>] Bellot Franz-Fabian, Bertram Mathias, Navratil Peter <em>et al.</em>, "_The high-resolution global map of 21st-century forest cover change from the University of Maryland (Hansen Map) is hugely overestimating deforestation in Indonesia_", FORCLIME Press release, 2014.  [online](http://www.forclime.org/documents/press_release/FORCLIME_Overestimation%20of%20Deforestation.pdf)

[<a id="cit-Li2017" href="#call-Li2017">5</a>] Li Yan, Sulla-Menashe Damien, Motesharrei Safa <em>et al.</em>, "_Inconsistent estimates of forest cover change in China between 2000 and 2013 from multiple datasets: differences in parameters, spatial resolution, and definitions_", Scientific Reports, vol. 7, article number 8748, August 2017.

[<a id="cit-Li2017a" href="#call-Li2017a">6</a>] Li Yan, Sulla-Menashe Damien, Motesharrei Safa <em>et al.</em>, "_Supplementary Information for Inconsistent estimates of forest cover change in China between 2000 and 2013 from multiple datasets_", Scientific Reports, vol. 7, article number 8748, August 2017.

