# Data acquisition
This *Jupyter* notebook covers the core and auxiliary data downloads required for my master's thesis. All downloads are stored locally in a directory tree called ***data*** within the notebook's root directory. Make sure you have enough free disk space, because the datasets are large, and a fast internet connection so that the downloads finish in reasonable time. The notebook is organized top-down, so execute the code cells in order from top to bottom; running them in an arbitrary order leads to exceptions, because later cells depend on names defined in earlier ones. Note that the notebook does not check whether the data has already been downloaded, so executing a download cell twice re-downloads the entire dataset.

My master's thesis focuses on deforestation and its drivers at global, continental and local scale in the tropical zone. The downloads are therefore filtered to the extent of this zone, which covers the latitudes between 23.43&deg; North and 23.43&deg; South (WGS84). This notebook comprises the following sections:

[**Preparation**](#Preparation) contains all required initial steps, such as importing the necessary modules and constructing the directory tree for storing the downloaded datasets. The code cells of this section are fundamental: if you skip them, the subsequent code cells will fail.

[**Core data**](#Core-data)

[**Auxiliary data**](#Auxiliary-data) <font color='red'>**Not done yet**</font>

[**Result**](#Result)

## Preparation
As a first step we import all *Python* modules needed for downloading and filtering the data. Most of them belong to the standard library; *IPython* and *geopandas* are third-party packages. In detail, the following are required:
- ***collections.namedtuple*** a factory function for tuple-like objects whose fields are accessible by attribute lookup as well as being indexable and iterable
- ***IPython.display.clear_output*** a function to clear the output of the current code cell
- ***urllib.request*** a module to open and download URLs
- ***os*** a module to use operating system dependent functionality
- ***re*** a module for applying regular expressions
- ***threading*** a module that provides a multi-threading API
- ***zipfile*** a module that provides tools for ZIP archives
- ***geopandas*** a feature-rich package for manipulating geospatial vector files such as Shapefile, GeoJSON etc.

In [1]:
from collections import namedtuple
from IPython.display import clear_output
import urllib.request
import os
import re
import threading
import zipfile
import geopandas as gpd

Finally, the following code cell creates the ***data*** directory tree inside the root folder.
- **data**
    - **core** the entire data from [Section 1.2](#Core-data)
        - **gfc** data from [Section 1.2.1](#Global-Forest-Change)
        - **gl30** data from [Section 1.2.2](#GlobalLand30)
        - **gc** data from [Section 1.2.3](#GlobCover)
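    - **auxiliary** the entire data from [Section 1.3](#Auxiliary-data)
    - **masks** tile index files used for filtering, e.g. the map sheet index from [Section 1.2.2](#GlobalLand30)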

In [2]:
Directories = namedtuple('Directories', 'root core gfc gl30 gc auxiliary masks'.split())
directories_data = 'data data.core data.core.gfc data.core.gl30 data.core.gc data.auxiliary data.masks'

# OS compatibility: replace "." with the OS-dependent path separator
# (str.replace is used because a lone backslash is not a valid re.sub replacement string)
dirs = Directories(*directories_data.replace('.', os.sep).split())

for directory in dirs:
    try:
        os.mkdir(directory)
        print('Created:\t{}'.format(directory))
    except OSError as e:
        print('Error:\t{} {}'.format(directory, e.strerror))

Error:	data File exists
Error:	data/core File exists
Error:	data/core/gfc File exists
Error:	data/core/gl30 File exists
Error:	data/core/gc File exists
Error:	data/auxiliary File exists
Error:	data/masks File exists
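
The "File exists" messages above are harmless: they appear whenever the cell is re-run against an existing tree. The following is a minimal sketch of an alternative that stays silent in that case, assuming `os.makedirs` with `exist_ok=True` is acceptable here. The helper name `build_tree` and the temporary base directory are illustrative and not part of the notebook.

```python
import os
import tempfile
from collections import namedtuple

Directories = namedtuple('Directories', 'root core gfc gl30 gc auxiliary masks')


def build_tree(base: str) -> Directories:
    """Create the data directory tree below `base` (hypothetical helper)."""
    relative = [
        ('data',),
        ('data', 'core'),
        ('data', 'core', 'gfc'),
        ('data', 'core', 'gl30'),
        ('data', 'core', 'gc'),
        ('data', 'auxiliary'),
        ('data', 'masks'),
    ]
    paths = [os.path.join(base, *parts) for parts in relative]
    for path in paths:
        # exist_ok=True makes the call idempotent: no error on a second run
        os.makedirs(path, exist_ok=True)
    return Directories(*paths)


# Demonstration in a throwaway directory so the real tree is left untouched
demo_base = tempfile.mkdtemp()
demo_dirs = build_tree(demo_base)
demo_dirs_again = build_tree(demo_base)  # second call succeeds silently
print(demo_dirs.gfc)
```

Building the paths with `os.path.join` also avoids any string replacement of separators.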


## Core data

### Global Forest Change
[**Global Forest Change 2000-2012 (V1.0)**](https://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.0.html)
- [**Treecover2000**](http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/treecover2000.txt)
- [**Gain**](http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/gain.txt)
- [**Lossyear**](http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/lossyear.txt)

In [3]:
base_url = 'http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/'

treecover_urls = urllib.request.urlopen(base_url + 'treecover2000.txt').read().decode().splitlines()
gain_urls = urllib.request.urlopen(base_url + 'gain.txt').read().decode().splitlines()
lossyear_urls = urllib.request.urlopen(base_url + 'lossyear.txt').read().decode().splitlines()

print('Treecover2000:\t{} URLs\nGain:\t\t{} URLs\nLossyear:\t{} URLs'
      .format(len(treecover_urls), len(gain_urls), len(lossyear_urls)))

Treecover2000:	504 URLs
Gain:		504 URLs
Lossyear:	504 URLs


In [132]:
def is_in_extent(coord: int, orient: str, north_limit: int, south_limit: int):
    """Check whether a tile's latitude label lies within the northern or southern limit."""
    if orient.lower() == 'n':
        return coord <= north_limit
    elif orient.lower() == 's':
        return coord <= south_limit
    return False

# Example URL:
# http://commondatastorage.googleapis.com/earthenginepartners-hansen/GFC2013/Hansen_GFC2013_lossyear_00N_030W.tif
regex = re.compile(r"""
                        (?:\w+_){3}       # non-capturing group: word characters followed by an underscore, three times
                        (?P<coord>\d{2})  # named group: two-digit latitude label
                        (?P<orient>S|N)   # named group: hemisphere, S or N
                    """, re.VERBOSE)

# [(treecover, gain, lossyear), ...]
raw_urls = list(zip(treecover_urls, gain_urls, lossyear_urls))
filtered_urls = []

for urls in raw_urls:
    coord, orient = regex.search(urls[0]).groups()
    if is_in_extent(int(coord), orient, 30, 20):
        filtered_urls.append(urls)

print('{} URLs are not in bounds between {} North and {} South.'
      .format(len(raw_urls) - len(filtered_urls), 30, 20))
print('Filtered URLs contain 3 * {} elements.'.format(len(filtered_urls)))

288 URLs are not in bounds between 30 North and 20 South.
Filtered URLs contain 3 * 216 elements.
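
The limits 30 and 20 passed to `is_in_extent` are tile labels, not the tropic latitudes themselves. The sketch below illustrates why these labels cover the zone up to 23.43&deg;, under the assumption that each GFC tile spans 10&deg; of latitude and is named after its upper (northern) edge. The tile size, the helper names and the URL template are assumptions made for illustration; only the file name pattern is taken from the example URL above.

```python
import re

TROPIC = 23.43  # latitude limit of the tropical zone in degrees, as used in this notebook
TILE_SIZE = 10  # assumed tile height in degrees

# matches the latitude label in names such as "..._lossyear_00N_030W.tif"
tile_pattern = re.compile(r'_(?P<coord>\d{2})(?P<orient>[NS])_\d{3}[EW]\.tif$')


def tile_lat_range(url: str):
    """Return (south_edge, north_edge), assuming the label is the tile's upper edge."""
    match = tile_pattern.search(url)
    north_edge = int(match.group('coord'))
    if match.group('orient') == 'S':
        north_edge = -north_edge
    return north_edge - TILE_SIZE, north_edge


def overlaps_tropics(url: str) -> bool:
    """True if the tile's latitude range intersects the tropical zone."""
    south_edge, north_edge = tile_lat_range(url)
    return north_edge > -TROPIC and south_edge < TROPIC


# All latitude labels from 80N down to 50S in steps of 10 degrees
labels = (['{:02d}N'.format(lat) for lat in range(80, -1, -10)]
          + ['{:02d}S'.format(lat) for lat in range(10, 51, 10)])
template = 'Hansen_GFC2013_lossyear_{}_030W.tif'  # illustrative file name template
selected = [label for label in labels if overlaps_tropics(template.format(label))]
print(selected)  # ['30N', '20N', '10N', '00N', '10S', '20S']
```

Under these assumptions the overlap test selects the same six latitude rows as `is_in_extent(coord, orient, 30, 20)`.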


In [5]:
def open_urls(*args):
    """Open every URL and return the list of response objects."""
    return [urllib.request.urlopen(url) for url in args]


def write_response(out_path: str, *args):
    """Write each response to out_path, named after the last URL segment."""
    for response in args:
        target = os.path.join(out_path, response.url.split('/')[-1])
        # the context manager closes the connection once the file is written
        with response, open(target, 'wb') as dst:
            dst.write(response.read())
        print('DOWNLOADED {}\nTO {}'.format(response.url, target))


def worker(out_path: str, urls: list):
    """Download one (treecover, gain, lossyear) URL triple."""
    response = open_urls(*urls)
    write_response(out_path, *response)


finished = 0
# process three URL triples at a time, one thread per triple
for idx in range(0, len(filtered_urls), 3):
    current_urls = filtered_urls[idx:idx + 3]
    threads = []
    for urls in current_urls:
        thread = threading.Thread(target=worker, args=(dirs.gfc, urls))
        thread.start()
        threads.append(thread)
    # wait until the whole batch has finished
    for thread in threads:
        thread.join()
    clear_output()
    finished += len(current_urls) * 3
    print('DOWNLOADED {} OF {} FILES'.format(finished, len(filtered_urls) * 3))

DOWNLOADED 648 OF 648 FILES
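
As noted in the introduction, the download cell does not check for existing files, so a second run fetches everything again. The following is a minimal sketch of how a skip-if-present check could be combined with a thread pool from `concurrent.futures`. The names `fetch`, `download_if_missing`, `download_all` and the `fetcher` parameter are hypothetical, and the demonstration uses a fake fetcher and made-up URLs so that it runs without network access.

```python
import os
import tempfile
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url: str) -> bytes:
    """Download the content behind `url`."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def download_if_missing(url: str, out_dir: str, fetcher=fetch) -> bool:
    """Store `url` in `out_dir` unless the file already exists.

    Returns True if a download happened, False if it was skipped.
    """
    target = os.path.join(out_dir, url.split('/')[-1])
    if os.path.exists(target):
        return False
    content = fetcher(url)
    with open(target, 'wb') as dst:
        dst.write(content)
    return True


def download_all(urls, out_dir: str, fetcher=fetch, max_workers: int = 3):
    """Download all `urls` with a small thread pool, skipping existing files."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda url: download_if_missing(url, out_dir, fetcher), urls))


def fake_fetch(url: str) -> bytes:
    """Stand-in for a real download, used only in this demonstration."""
    return url.encode()


demo_dir = tempfile.mkdtemp()
demo_urls = ['http://example.invalid/a.tif', 'http://example.invalid/b.tif']  # made-up URLs
first_run = download_all(demo_urls, demo_dir, fetcher=fake_fetch)
second_run = download_all(demo_urls, demo_dir, fetcher=fake_fetch)
print(first_run, second_run)  # [True, True] [False, False]
```

The check only tests whether the file exists, so a file left incomplete by an interrupted download would also be skipped; comparing file sizes would be one way to catch that.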


### GlobalLand30

[GL30 mask](http://globallandcover.com/document/globemapsheet.zip)

In [19]:
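gl30_mask_url = 'http://globallandcover.com/document/globemapsheet.zip'
gl30_mask_path = os.path.join(dirs.masks, gl30_mask_url.split('/')[-1])

# download the archive unless it is already in place
if not os.path.exists(gl30_mask_path):
    urllib.request.urlretrieve(gl30_mask_url, gl30_mask_path)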

with zipfile.ZipFile(gl30_mask_path, 'r') as src:
    to_extract = src.namelist()
    src.extractall(path=dirs.masks)
    
print('Extracted the following files {}'.format(to_extract))
os.remove(gl30_mask_path)
print('Removed archive {}'.format(gl30_mask_path))

Extracted the following files ['GlobeMapSheet.dbf', 'GlobeMapSheet.prj', 'GlobeMapSheet.sbn', 'GlobeMapSheet.sbx', 'GlobeMapSheet.shp', 'GlobeMapSheet.shp.xml', 'GlobeMapSheet.shx']
Removed archive data/masks/globemapsheet.zip


In [20]:
gl30_mask_shp = [item for item in os.listdir(dirs.masks) if re.match(r'.+\.shp$', item)][0]

# keep only the map sheets that intersect the tropical zone
gl30_tiles = gpd.read_file(os.path.join(dirs.masks, gl30_mask_shp))
gl30_mask = gl30_tiles.cx[:, -23.43:23.43]
# tile ids such as "S43_05": hemisphere, two-digit UTM zone and two-digit row
gl30_file_filter = [
    '{}{}_{}'.format(ns, str(zone).zfill(2), str(row).zfill(2))
    for ns, zone, row in zip(gl30_mask.NS, gl30_mask.UTMZONE, gl30_mask.ROW)
]

print('{} Tiles are not in bounds between {} North and {} South.'
      .format(len(gl30_tiles) - len(gl30_mask), 23.43, 23.43))
print('File filter contains {} elements.'.format(len(gl30_file_filter)))
gl30_mask.head()

495 Tiles are not in bounds between 23.43 North and 23.43 South.
File filter contains 358 elements.


| | NS | UTMZONE | ROW | CONTINENT | REMARK | geometry |
|---|---|---|---|---|---|---|
| 37 | S | 48 | 10 | Asia | S48_10 | POLYGON ((108 -15.00000003, 101.99999988 -15.0... |
| 38 | S | 50 | 10 | Asia | S50_10 | POLYGON ((119.99999988 -15.00000003, 114.00000... |
| 39 | S | 43 | 5 | Asia | S43_5 | POLYGON ((78.00000012 -9.999999989999999, 72 -... |
| 40 | S | 48 | 5 | Asia | S48_5 | POLYGON ((108 -9.999999989999999, 101.99999988... |
| 41 | S | 49 | 5 | Asia | S49_5 | POLYGON ((114.00000012 -9.999999989999999, 108... |
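
The file filter pads both the UTM zone and the row to two digits, so its entries differ from the REMARK column for single-digit rows (for example `S43_5` becomes `S43_05`). The following small sketch uses the five rows shown above; the helper name `tile_id` and the archive name are hypothetical.

```python
import re

# (NS, UTMZONE, ROW) of the five rows shown in the table above
rows = [
    ('S', 48, 10),
    ('S', 50, 10),
    ('S', 43, 5),
    ('S', 48, 5),
    ('S', 49, 5),
]


def tile_id(ns: str, utm_zone: int, row: int) -> str:
    """Build a zero-padded tile id such as 'S43_05' (hypothetical helper)."""
    return '{}{:02d}_{:02d}'.format(ns, utm_zone, row)


ids = [tile_id(*row) for row in rows]
print(ids)  # ['S48_10', 'S50_10', 'S43_05', 'S48_05', 'S49_05']

# The padded form is what a two-digit pattern extracts from a file name
pattern = re.compile(r'(?P<id>[NS]\d{2}_\d{2})')
example_name = 'S43_05_example.zip'  # made-up archive name for illustration
print(pattern.search(example_name).group('id') in ids)  # True
```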


In [21]:
regex_id = re.compile(r'(?P<id>[NS]\d{2}_\d{2})')
regex_ext = re.compile(r'.+\.tif$')

for item in os.listdir(dirs.gl30):
    item_path = os.path.join(dirs.gl30, item)
    match = regex_id.search(item)
    # skip anything that is not a tile archive, e.g. already extracted rasters
    if match is None or not zipfile.is_zipfile(item_path):
        continue
    if match.group('id') in gl30_file_filter:
        with zipfile.ZipFile(item_path) as src:
            # ZIP member names always use "/" as separator, independent of the OS
            member = [ele for ele in src.namelist() if regex_ext.match(ele)][0]
            src.extract(member, path=dirs.gl30)
        # move the raster up into gl30 and remove the now empty subfolder
        os.rename(os.path.join(dirs.gl30, *member.split('/')),
                  os.path.join(dirs.gl30, member.split('/')[-1]))
        os.rmdir(os.path.join(dirs.gl30, member.split('/')[0]))
    # the archive is no longer needed, whether or not the tile was kept
    os.remove(item_path)
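
The cell above extracts the raster into a subfolder, moves it up one level and then removes the subfolder. The following is a minimal sketch of an alternative that writes the member straight to its final location, assuming each archive holds one `.tif` inside a folder. The helper name `extract_flat` and all file names in the demonstration are made up.

```python
import os
import shutil
import tempfile
import zipfile


def extract_flat(zip_path: str, out_dir: str, suffix: str = '.tif') -> str:
    """Extract the first member ending in `suffix` directly into `out_dir`."""
    with zipfile.ZipFile(zip_path) as src:
        member = next(name for name in src.namelist() if name.lower().endswith(suffix))
        # ZIP member names use "/" regardless of the operating system
        target = os.path.join(out_dir, member.split('/')[-1])
        # copy the member's bytes to the target, ignoring its folder inside the archive
        with src.open(member) as member_file, open(target, 'wb') as dst:
            shutil.copyfileobj(member_file, dst)
    return target


# Demonstration with a throwaway archive
work_dir = tempfile.mkdtemp()
archive = os.path.join(work_dir, 'demo_tile.zip')
with zipfile.ZipFile(archive, 'w') as dst:
    dst.writestr('demo_tile/demo_tile.tif', b'raster bytes')
    dst.writestr('demo_tile/readme.txt', b'metadata')

extracted = extract_flat(archive, work_dir)
print(os.path.basename(extracted))  # demo_tile.tif
```

Because no subfolder is ever created, the rename and `rmdir` steps are not needed with this approach.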

### GlobCover

In [23]:
globcover_url = 'http://due.esrin.esa.int/files/Globcover2009_V2.3_Global_.zip'

response = open_urls(globcover_url)
write_response(dirs.gc, *response)

DOWNLOADED http://due.esrin.esa.int/files/Globcover2009_V2.3_Global_.zip
TO data/core/gc/Globcover2009_V2.3_Global_.zip


In [24]:
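# derive the archive name from the URL instead of relying on the directory listing order
path_to_zip = os.path.join(dirs.gc, globcover_url.split('/')[-1])
regex = re.compile(r'.+\.tif$', re.IGNORECASE)

with zipfile.ZipFile(path_to_zip, 'r') as src:
    # extract only the GeoTIFF members of the archive
    to_extract = [item for item in src.namelist() if regex.match(item)]
    src.extractall(path=dirs.gc, members=to_extract)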

print('Extracted the following files {}'.format(to_extract))
os.remove(path_to_zip)
print('Removed archive {}'.format(path_to_zip))

Extracted the following files ['GLOBCOVER_L4_200901_200912_V2.3.tif', 'GLOBCOVER_L4_200901_200912_V2.3_CLA_QL.tif']
Removed archive data/core/gc/Globcover2009_V2.3_Global_.zip


## Auxiliary data
<p><font color='red'>**Not done yet**</font></p>

## Result