# 1. Data acquisition
This *Jupyter* notebook covers the core and auxiliary data downloads required for my master's thesis. All downloads are stored locally in a directory tree called ***data*** within the notebook's root directory. Make sure you have enough free disk space, because the datasets are large, and a fast internet connection to keep download times reasonable. The notebook is organised top-down, so execute the code cells in order from top to bottom; running them in an arbitrary order leads to exceptions, because later cells depend on names defined in earlier ones. The notebook does not check whether data has already been downloaded, so executing a download cell twice re-downloads the entire dataset.

My master's thesis focuses on deforestation and its drivers at global, continental and local scales in the tropical zone. The downloads are therefore filtered to the extent of this zone, which covers the latitudes between 23.43&deg; North and 23.43&deg; South (WGS84). This notebook comprises the following sections:

[**1.1. Preparation**](#1.1.-Preparation) contains all required initial steps, such as importing the necessary modules and constructing the directory tree for storing the downloaded datasets. The code cells in this section are fundamental: if you skip them, the subsequent code cells will fail.

[**1.2. Core data**](#1.2.-Core-data)

[**1.3. Auxiliary data**](#1.3.-Auxiliary-data)

## 1.1. Preparation
As a first step we import all *Python* modules needed for downloading and filtering the data. All of them belong to the standard library except *IPython*, which is available in any *Jupyter* Python kernel. In detail, the following are required:
- ***collections.namedtuple*** a factory function for tuple subclasses with named fields
- ***IPython.display.clear_output*** a function to clear the output of the current code cell
- ***urllib.request*** a module to open and download URLs
- ***os*** a module to use operating system dependent functionality
- ***re*** a module for applying regular expressions
- ***threading*** a module for running tasks in parallel threads

In [73]:
from collections import namedtuple
from IPython.display import clear_output
import urllib.request
import os
import re
import threading

Finally, we create the ***data*** directory tree in the notebook's root folder with the following layout:
- **data**
    - **core** the entire data from [Section 1.2.](#1.2.-Core-data)
        - **gfc** data from [Section 1.2.1.](#1.2.1.-Global-Forest-Change)
        - **gl30** data from [Section 1.2.2.](#1.2.2.-GlobalLand30)
        - **gc** data from [Section 1.2.3.](#1.2.3.-GlobCover)
    - **auxiliary** the entire data from [Section 1.3.](#1.3.-Auxiliary-data)
    - **urls** URLs from different sources for validation 
   

In [74]:
Directories = namedtuple('Directories', 'root urls core gfc gl30 gc auxiliary'.split())
directories_data = 'data data.urls data.core data.core.gfc data.core.gl30 data.core.gc data.auxiliary'

# OS compatibility: replace "." with the OS-dependent path separator
# (str.replace avoids re.sub treating a backslash separator as an escape)
dirs = Directories(*directories_data.replace('.', os.sep).split())

for directory in dirs:
    try:
        os.mkdir(directory)
        print('Created:\t{}'.format(directory))
    except OSError as e:
        print('Error:\t{} {}'.format(directory, e.strerror))

Created:	data
Created:	data/urls
Created:	data/core
Created:	data/core/gfc
Created:	data/core/gl30
Created:	data/core/gc
Created:	data/auxiliary
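
The cell above reports an error for every directory that already exists when it is run a second time. A minimal sketch of an idempotent variant follows, assuming the same dotted-name convention; the helper name `build_tree` and the constant `TREE` are hypothetical and not part of the notebook. It runs in a temporary directory so that nothing is written next to the notebook.

```python
import os
import tempfile

# Dotted names, where "." stands for the path separator (hypothetical constant)
TREE = 'data data.urls data.core data.core.gfc data.core.gl30 data.core.gc data.auxiliary'

def build_tree(root: str, dotted: str = TREE) -> list:
    """Create every directory below root and return the resulting paths."""
    paths = [os.path.join(root, *name.split('.')) for name in dotted.split()]
    for path in paths:
        # exist_ok=True turns a repeated run into a no-op instead of an error
        os.makedirs(path, exist_ok=True)
    return paths

# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    created = build_tree(tmp)
    build_tree(tmp)  # the second call does not raise
    all_exist = all(os.path.isdir(p) for p in created)
    print(len(created), all_exist)
```

`os.makedirs` also creates missing parent directories, so the order of the names no longer matters in this variant.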


## 1.2. Core data

### 1.2.1. Global Forest Change
[**Global Forest Change 2000-2012 (V1.0)**](https://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.0.html)
- [**Treecover2000**](http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/treecover2000.txt)
- [**Gain**](http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/gain.txt)
- [**Lossyear**](http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/lossyear.txt)

In [77]:
base_url = 'http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/'

treecover_urls = urllib.request.urlopen(base_url + 'treecover2000.txt').read().decode().splitlines()
gain_urls = urllib.request.urlopen(base_url + 'gain.txt').read().decode().splitlines()
lossyear_urls = urllib.request.urlopen(base_url + 'lossyear.txt').read().decode().splitlines()

print('Treecover2000:\t{} URLs\nGain:\t\t{} URLs\nLossyear:\t{} URLs'
      .format(len(treecover_urls), len(gain_urls), len(lossyear_urls)))

Treecover2000:	504 URLs
Gain:		504 URLs
Lossyear:	504 URLs


In [78]:
def is_in_extent(coord: int, orient: str, north_limit: int, south_limit: int) -> bool:
    # coord is the unsigned latitude from the tile name, orient its hemisphere (N or S)
    if orient.lower() == 'n':
        return coord <= north_limit
    elif orient.lower() == 's':
        return coord <= south_limit
    return False

# http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/Hansen_GFC2013_lossyear_00N_030W.tif
regex = re.compile(r"""
                        (?:\w+_){3}  # non-capturing group: one or more word characters followed by an underscore, repeated three times
                        (?P<coord>\d{2})  # named group: two digits
                        (?P<orient>S|N)  # named group: S or N
                    """, re.VERBOSE)

# [(treecover, gain, lossyear), ...]
raw_urls = list(zip(treecover_urls, gain_urls, lossyear_urls))
filtered_urls = []

for urls in raw_urls:
    coord, orient = regex.search(urls[0]).groups()
    if is_in_extent(int(coord), orient, 30, 20):
        filtered_urls.append(urls)

print('{} URLs are not in bounds between {} North and {} South.'.format(len(raw_urls) - len(filtered_urls), 30, 20))
print('Filtered URLs contains 3 * {} elements.'.format(len(filtered_urls)))

288 URLs are not in bounds between 30 North and 20 South.
Filtered URLs contains 3 * 216 elements.
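
The limits 30 and 20 passed to `is_in_extent` differ from the 23.43&deg; stated in the introduction. One plausible reading, sketched below, is that each tile spans 10&deg; and is labelled by the latitude of its upper edge; this naming rule is an assumption here and should be checked against the Global Forest Change download page. Under that assumption the row labelled 30N covers 20&deg; to 30&deg; North and the row labelled 20S covers 20&deg; to 30&deg; South, so both are needed to include the tropics. The names `TROPIC`, `TILE` and `tile_rows` are hypothetical.

```python
import math
import re

TROPIC = 23.43  # latitude of the tropics in degrees, from the introduction
TILE = 10       # assumed tile height and width in degrees

# Assumption: a tile label is the latitude of the tile's upper edge
north_limit = math.ceil(TROPIC / TILE) * TILE   # upper edge of the row containing 23.43 N
south_limit = math.floor(TROPIC / TILE) * TILE  # upper edge of the row containing 23.43 S

def tile_rows(north: int, south: int, step: int = TILE) -> list:
    """Return the latitude labels of all tile rows from north down to south."""
    rows = ['{:02d}N'.format(lat) for lat in range(north, -1, -step)]
    rows += ['{:02d}S'.format(lat) for lat in range(step, south + 1, step)]
    return rows

rows = tile_rows(north_limit, south_limit)
columns = 360 // TILE  # tile columns around the globe
print(north_limit, south_limit, rows, len(rows) * columns)

# The same pattern as in the cell above, written compactly, applied to a file name
pattern = re.compile(r'(?:\w+_){3}(?P<coord>\d{2})(?P<orient>[NS])')
match = pattern.search('Hansen_GFC2013_lossyear_00N_030W.tif')
print(match.group('coord'), match.group('orient'))
```

Six rows times 36 columns gives 216 tiles per layer, which agrees with the count reported by the filter cell.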


In [81]:
def open_urls(*args):
    # Open every URL and return the list of responses
    return [urllib.request.urlopen(url) for url in args]

def write_response(out_path: str, *args):
    # Save each response under the file name taken from the last URL segment
    for response in args:
        target = os.path.join(out_path, response.url.split('/')[-1])
        # Closing the response releases the connection once the file is written
        with response, open(target, 'wb') as dst:
            dst.write(response.read())
        print('DOWNLOADED {}\nTO {}'.format(response.url, target))

def worker(out_path: str, urls: tuple):
    # Download one (treecover, gain, lossyear) URL triple
    response = open_urls(*urls)
    write_response(out_path, *response)

finished = 0
# Process three URL triples at a time, one thread per triple
for idx in range(0, len(filtered_urls), 3):
    current_urls = filtered_urls[idx:idx + 3]
    threads = []
    for urls in current_urls:
        thread = threading.Thread(target=worker, args=(dirs.gfc, urls))
        thread.start()
        threads.append(thread)
    # Wait until the whole batch is written before starting the next one
    for thread in threads:
        thread.join()
    clear_output()
    finished += len(current_urls) * 3
    print('DOWNLOADED {} OF {} FILES'.format(finished, len(filtered_urls) * 3))

DOWNLOADED 648 OF 648 FILES
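
As noted in the introduction, the notebook re-downloads everything when a download cell is executed again. A minimal sketch of a guard that skips files already on disk could look as follows. The function `download_if_missing` and its `opener` parameter are hypothetical additions, not part of the notebook; the demonstration uses a stand-in opener and a placeholder URL, so it needs no network access.

```python
import io
import os
import tempfile
import urllib.request

def download_if_missing(url: str, out_dir: str, opener=urllib.request.urlopen) -> bool:
    """Download url into out_dir unless the target file already exists.

    Returns True if a download happened and False if it was skipped.
    """
    target = os.path.join(out_dir, url.split('/')[-1])
    if os.path.exists(target):
        return False
    with opener(url) as response, open(target, 'wb') as dst:
        dst.write(response.read())
    return True

# Stand-in for urllib.request.urlopen, used only for this offline demonstration
def fake_opener(url):
    return io.BytesIO(b'not a real GeoTIFF')

with tempfile.TemporaryDirectory() as tmp:
    url = 'http://example.com/tiles/Hansen_GFC2013_gain_00N_030W.tif'  # placeholder URL
    first = download_if_missing(url, tmp, opener=fake_opener)
    second = download_if_missing(url, tmp, opener=fake_opener)
    print(first, second)
```

The check only tests for existence, so a file left incomplete by an interrupted download would also be skipped. Comparing the file size with the `Content-Length` header, where the server provides one, would be a possible refinement.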


### 1.2.2. GlobalLand30

### 1.2.3. GlobCover

## 1.3. Auxiliary data