# 3. Accessing MODIS Data products using `gdal`


[GDAL](https://gdal.org) is the workhorse of geospatial processing. Basically, GDAL offers a common library to access a vast number of formats (if you want to see how vast, [check this](https://gdal.org/formats_list.html)). In addition to letting you open and convert obscure formats to something more useful, a lot of functionality in terms of processing raster data is available (for example, working with projections, combining datasets, accessing remote datasets, etc).

For vector data, the counterpart to GDAL is OGR (which is now a part of the GDAL library anyway), which also supports [many vector formats](https://gdal.org/ogr_formats.html). The combination of both libraries is a very powerful tool to work with geospatial data, not only from Python, but from [many other popular computer languages](https://trac.osgeo.org/gdal/#GDALOGRInOtherLanguages).

In this session, we will introduce the `gdal` geospatial module which can read a wide range of raster scientific data formats. We will also introduce the related `ogr` vector package.

In particular, we will learn how to:

* access and download NASA geophysical datasets (specifically, the MODIS LAI/FPAR product)
* apply a vector mask to the dataset
* apply quality control flags to the data
* stack datasets into a 3D numpy dataset for further analysis, including interpolation of missing values
* visualise the data
* store the stacked dataset

**These are all tasks that you will be required to do for the [part 1 formal assessment](Formal_assessment_part1.ipynb) of this course. You will however be using a different NASA dataset.**

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#MODIS-LAI-product" data-toc-modified-id="MODIS-LAI-product-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>MODIS LAI product</a></span><ul class="toc-item"><li><span><a href="#NASA-MODIS-data-access" data-toc-modified-id="NASA-MODIS-data-access-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>NASA MODIS data access</a></span><ul class="toc-item"><li><span><a href="#Register-at-NASA-Earthdata" data-toc-modified-id="Register-at-NASA-Earthdata-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Register at NASA Earthdata</a></span></li><li><span><a href="#Accessing-NASA-MODIS-URLs" data-toc-modified-id="Accessing-NASA-MODIS-URLs-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Accessing NASA MODIS URLs</a></span></li></ul></li><li><span><a href="#MODIS-filename-format" data-toc-modified-id="MODIS-filename-format-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>MODIS filename format</a></span></li><li><span><a href="#downloading-the-data-file" data-toc-modified-id="downloading-the-data-file-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>downloading the data file</a></span></li></ul></li><li><span><a href="#GDAL" data-toc-modified-id="GDAL-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>GDAL</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Binary-data" data-toc-modified-id="Binary-data-2.0.1"><span class="toc-item-num">2.0.1&nbsp;&nbsp;</span>Binary data</a></span></li></ul></li></ul></li><li><span><a href="#Exercise-4.1" data-toc-modified-id="Exercise-4.1-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Exercise 4.1</a></span></li><li><span><a href="#Downloading-data" data-toc-modified-id="Downloading-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Downloading data</a></span><ul class="toc-item"><li><span><a href="#NASA-EarthData" data-toc-modified-id="NASA-EarthData-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>NASA 
EarthData</a></span></li><li><span><a href="#Direct-downloading" data-toc-modified-id="Direct-downloading-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Direct downloading</a></span></li></ul></li><li><span><a href="#Exercise-A-Different-Dataset" data-toc-modified-id="Exercise-A-Different-Dataset-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Exercise A Different Dataset</a></span><ul class="toc-item"><li><span><a href="#Download" data-toc-modified-id="Download-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Download</a></span></li><li><span><a href="#Explore" data-toc-modified-id="Explore-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Explore</a></span></li><li><span><a href="#Read-a-dataset" data-toc-modified-id="Read-a-dataset-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Read a dataset</a></span></li><li><span><a href="#Water-mask" data-toc-modified-id="Water-mask-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Water mask</a></span></li><li><span><a href="#Valid-pixel-mask" data-toc-modified-id="Valid-pixel-mask-5.5"><span class="toc-item-num">5.5&nbsp;&nbsp;</span>Valid pixel mask</a></span></li><li><span><a href="#3D-dataset" data-toc-modified-id="3D-dataset-5.6"><span class="toc-item-num">5.6&nbsp;&nbsp;</span>3D dataset</a></span></li></ul></li><li><span><a href="#4.3-Vector-masking" data-toc-modified-id="4.3-Vector-masking-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>4.3 Vector masking</a></span></li><li><span><a href="#Exercise-4.3" data-toc-modified-id="Exercise-4.3-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Exercise 4.3</a></span></li></ul></div>


## MODIS LAI product 
To introduce `gdal` we will use a dataset from the MODIS LAI product over the UK. 

You should note that the dataset you need to use for your assessed practical is a MODIS dataset with similar characteristics to the one in this example.

The data product [MOD15](https://modis.gsfc.nasa.gov/data/dataprod/mod15.php) LAI/FPAR has been generated from data from the NASA MODIS sensors on Terra and Aqua since 2002. We are now in dataset collection 6 (the data version to use).

    LAI is defined as the one-sided green leaf area per unit ground area in broadleaf canopies and as half the total needle surface area per unit ground area in coniferous canopies. FPAR is the fraction of photosynthetically active radiation (400-700 nm) absorbed by green vegetation. Both variables are used for calculating surface photosynthesis, evapotranspiration, and net primary production, which in turn are used to calculate terrestrial energy, carbon, water cycle processes, and biogeochemistry of vegetation. Algorithm refinements have improved quality of retrievals and consistency with field measurements over all biomes, with a focus on woody vegetation.
    
We use such data to map and understand the dynamics of terrestrial vegetation / carbon, for example, for climate studies.

The raster data are arranged in tiles, indexed by row and column, to cover the globe:


![MODIS tiles](https://www.researchgate.net/profile/J_Townshend/publication/220473201/figure/fig5/AS:277546596880390@1443183673583/The-global-MODIS-Sinusoidal-tile-grid.png)



**Exercise 3.1.1**

The pattern on the tile names is `hXXvYY` where `XX` is the horizontal coordinate and `YY` the vertical.


* use the map above to work out the names of the two tiles that we will need to access data over the UK
* set the variable `tiles` to contain these two names in a list

For example, for the two tiles covering Madagascar, we would set:

    tiles = ['h22v10','h22v11']
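
As a small illustration of the `hXXvYY` pattern, the sketch below builds tile names from integer coordinates and splits them back apart. The helper names `tile_name` and `parse_tile` are hypothetical, introduced only for this example.

```python
import re

def tile_name(h, v):
    """Build a MODIS tile name such as 'h22v10' from integer coordinates."""
    return f'h{h:02d}v{v:02d}'

def parse_tile(name):
    """Split a tile name such as 'h22v10' into the integers (h, v)."""
    match = re.fullmatch(r'h(\d{2})v(\d{2})', name)
    if match is None:
        raise ValueError(f'not a MODIS tile name: {name}')
    return int(match.group(1)), int(match.group(2))

# the two Madagascar tiles from the example above
tiles = [tile_name(22, v) for v in (10, 11)]
print(tiles)
print(parse_tile('h22v10'))
```

The `:02d` format keeps the leading zero for single-digit coordinates, which the tile names in the file listings rely on.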

In [1]:
# do exercise here

### NASA MODIS data access

#### Register at NASA Earthdata

Before you attempt to do this section, you will need to register at [NASA Earthdata](https://urs.earthdata.nasa.gov/home).


We have set up these notes so that you don't have to put your username and password in plain text. Instead, you need to enter your username and password when prompted by `cylog`. The password is stored in an encrypted file, although it can be accessed as plain text within your Python session.

**N.B. using `cylog().login()` is only intended to work with access to NASA Earthdata and to prevent you having to expose your username and password in these notes**

In [2]:
from cylog import cylog
import requests

url = 'https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2018.09.30/' 
        
# grab the HTML information
html = requests.get(url,auth=cylog().login()).text

# test a few lines of the html
if html[:20] == '<!DOCTYPE HTML PUBLI':
    print('this seems to be ok ... ')
    print('use cylog().login() anywhere you need to specify the tuple (username,password)')

this seems to be ok ... 
use cylog().login() anywhere you need to specify the tuple (username,password)



The NASA servers go down for weekly maintenance, usually on Wednesday afternoon (UK time), so you might not want to attempt this exercise then. If so, skip the rest of this section and do it another time.





#### Accessing NASA MODIS URLs


Before delving into the data, we will first learn something of how to *automatically* access such data. Since we might often want to work with a large number of files (e.g. for analysing LAI or other variables over space/time), we will want to write code that allows us to do this. 

If the data we want to use are accessible to us as a URL, we can simply use `requests` as in previous exercises.

Sometimes, we will be able to specify the parameters of the dataset we want, e.g. using [JSON](https://www.json.org). At other times (as in the case here) we might need to do a little work ourselves to construct the particular URL we want.

If you visit the site [https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006](https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006), you will see 'date' style links (e.g. `2018.09.30`) through to sub-directories. 

In these, e.g. [https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2018.09.30/](https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2018.09.30/) you will find URLs of a set of files. 

The files pointed to by the URLs are the MODIS MOD15 4-day composite 500 m LAI/FPAR product [MCD15A3H](https://lpdaac.usgs.gov/dataset_discovery/modis/modis_products_table/mcd15a3h_v006).
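
The directory URLs follow a regular pattern, so they can be constructed from the product name, version and date. The function below is a minimal sketch of this; `modis_directory_url` and its parameter names are hypothetical, and it assumes the server layout seen in the example URL above.

```python
def modis_directory_url(year, month, day,
                        product='MCD15A3H', version='006',
                        base='https://e4ftl01.cr.usgs.gov/MOTA'):
    """Return the URL of the directory listing for one product and date."""
    # month and day are zero-padded to two digits, as in '2018.09.30'
    return f'{base}/{product}.{version}/{year:04d}.{month:02d}.{day:02d}/'

url = modis_directory_url(2018, 9, 30)
print(url)
```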

There are links to several datasets on the page, including 'quicklook files' that are jpeg format images of the datasets, e.g.:

![MCD15A3H.A2018273.h17v03](https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2018.09.30/BROWSE.MCD15A3H.A2018273.h17v03.006.2018278143630.1.jpg)

as well as `xml` files and `hdf` datasets. 

When we access this 'listing' (directory links such as [https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2018.09.30/](https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2018.09.30/)) from Python, we will obtain the information in [HTML](https://www.w3schools.com/html/). We don't expect you to know this language, but knowing some of the basics is often useful.


In [21]:
import requests
from cylog import cylog


url = 'https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2018.09.30/' 
        
# grab the HTML information
# use auth=cylog().login() here instead of (username,password)

html = requests.get(url,auth=cylog().login()).text

# print a few lines of the html
print(html[:951])
# etc
print('\n','-'*30,'etc','-'*30)
# at the end
print(html[-964:])

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>Index of /MOTA/MCD15A3H.006/2018.09.30</title>
 </head>
 <body>
<pre>
********************************************************************************

                         U.S. GOVERNMENT COMPUTER

This US Government computer is for authorized users only.  By accessing this
system you are consenting to complete monitoring with no expectation of privacy.
Unauthorized access or use may subject you to disciplinary action and criminal
prosecution.

Attention user: You are downloading data from NASA's Land Processes Distributed
Active Archive Center (LP DAAC) located at the USGS Earth Resources Observation and
Science (EROS) Center.

Downloading these data requires a NASA Earthdata Login username and password.
To obtain a NASA Earthdata Login account, please visit
<a href="https://urs.earthdata.nasa.gov/users/new">https://urs.earthdata.nasa.gov/users/new/</a>

 ------------------------------ etc -----------

In HTML the code: 

    <a href="MCD15A3H.A2018273.h35v10.006.2018278143650.hdf">MCD15A3H.A2018273.h35v10.006.2018278143650.hdf</a>  


specifies an HTML link, that will appear as 

    MCD15A3H.A2018273.h35v10.006.2018278143650.hdf 2018-10-05 09:42  7.6K 
    
and link to the URL specified in the `href` field: `MCD15A3H.A2018273.h35v10.006.2018278143650.hdf`.

We could interpret this information by searching for strings etc., but the package `BeautifulSoup` can help us a lot in this.


 

In [22]:
import requests
from bs4 import BeautifulSoup
from cylog import cylog


url = 'https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2018.09.30/' 
html = requests.get(url,auth=cylog().login()).text

# use BeautifulSoup
# to get all urls referenced with
# html code <a href="some_url">
soup = BeautifulSoup(html,'lxml')
links = [mylink.attrs['href'] for mylink in soup.find_all('a')]

# print a few lines of the links
print(links[:10])
# etc
print('\n','-'*30,'etc','-'*30,'\n')
# at the end
print(links[-10:])

['https://urs.earthdata.nasa.gov/users/new', 'https://lpdaac.usgs.gov', '?C=N;O=D', '?C=M;O=A', '?C=S;O=A', '?C=D;O=A', '/MOTA/MCD15A3H.006/', 'BROWSE.MCD15A3H.A2018273.h00v08.006.2018278143557.1.jpg', 'BROWSE.MCD15A3H.A2018273.h00v08.006.2018278143557.2.jpg', 'BROWSE.MCD15A3H.A2018273.h00v09.006.2018278143556.1.jpg']

 ------------------------------ etc ------------------------------ 

['MCD15A3H.A2018273.h34v09.006.2018278143649.hdf', 'MCD15A3H.A2018273.h34v09.006.2018278143649.hdf.xml', 'MCD15A3H.A2018273.h34v10.006.2018278143649.hdf', 'MCD15A3H.A2018273.h34v10.006.2018278143649.hdf.xml', 'MCD15A3H.A2018273.h35v08.006.2018278143649.hdf', 'MCD15A3H.A2018273.h35v08.006.2018278143649.hdf.xml', 'MCD15A3H.A2018273.h35v09.006.2018278143649.hdf', 'MCD15A3H.A2018273.h35v09.006.2018278143649.hdf.xml', 'MCD15A3H.A2018273.h35v10.006.2018278143650.hdf', 'MCD15A3H.A2018273.h35v10.006.2018278143650.hdf.xml']


**Exercise E3.1.2**

* copy the code in the block above up until the `print` statements
* using a `for ... in ...:` loop and an `if ... :` statement (or better still, an implicit loop), make a list called `hdf_filenames` of only those filenames (links) that have `hdf` as their filename extension.

**Hint 1**: first select an example item from the `links` list: 

    item = links[-1]
    print('item is',item)
    
and print:

    item[-3:]
        
but maybe better (why would this be?) is:

    item.split('.')[-1]
    
**Hint 2**: An implicit loop is a construct of the form:

    [item for item in links]

In an implicit for loop, we can actually add a conditional statement if we like, e.g. try:

    hdf_filenames = [item for item in links if item[-5] == '4']
    
This will include `item` in the list `hdf_filenames` if the condition `item[-5] == '4'` is met. That's a bit of a pointless test, but illustrates the pattern required. Try this now with the condition you want to use to select `hdf` files.
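
As a small offline illustration of a conditional implicit loop, the sketch below selects the `xml` metadata files from a short list and compares two ways of testing the extension. The last entry (`list_of_xml`) is a made-up name, included only to show where the two tests differ.

```python
# a few entries of the kind found in `links`, plus one made-up name
links = [
    '?C=N;O=D',
    'BROWSE.MCD15A3H.A2018273.h00v08.006.2018278143557.1.jpg',
    'MCD15A3H.A2018273.h35v10.006.2018278143650.hdf',
    'MCD15A3H.A2018273.h35v10.006.2018278143650.hdf.xml',
    'list_of_xml',   # hypothetical: ends in 'xml' but has no extension
]

# test the last three characters of each item
by_slice = [item for item in links if item[-3:] == 'xml']

# test the text after the final '.' in each item
by_split = [item for item in links if item.split('.')[-1] == 'xml']

print(by_slice)
print(by_split)
```

The slice test accepts any string that happens to end in those three letters, whereas the split test only accepts items whose final dot-separated field matches.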

In [23]:
# do exercise here
[item for item in links if item[-5] == '4']

['MCD15A3H.A2018273.h01v07.006.2018278143554.hdf',
 'MCD15A3H.A2018273.h01v10.006.2018278143554.hdf',
 'MCD15A3H.A2018273.h09v03.006.2018278143554.hdf',
 'MCD15A3H.A2018273.h10v04.006.2018278143614.hdf',
 'MCD15A3H.A2018273.h11v10.006.2018278143614.hdf',
 'MCD15A3H.A2018273.h13v12.006.2018278143604.hdf',
 'MCD15A3H.A2018273.h13v13.006.2018278143604.hdf',
 'MCD15A3H.A2018273.h16v05.006.2018278143634.hdf',
 'MCD15A3H.A2018273.h16v08.006.2018278143634.hdf',
 'MCD15A3H.A2018273.h19v02.006.2018278143634.hdf',
 'MCD15A3H.A2018273.h19v10.006.2018278143644.hdf',
 'MCD15A3H.A2018273.h20v05.006.2018278143634.hdf',
 'MCD15A3H.A2018273.h21v07.006.2018278143644.hdf',
 'MCD15A3H.A2018273.h22v04.006.2018278143644.hdf',
 'MCD15A3H.A2018273.h23v08.006.2018278143634.hdf',
 'MCD15A3H.A2018273.h25v05.006.2018278143644.hdf',
 'MCD15A3H.A2018273.h26v03.006.2018278143634.hdf',
 'MCD15A3H.A2018273.h27v09.006.2018278143634.hdf',
 'MCD15A3H.A2018273.h28v09.006.2018278143634.hdf',
 'MCD15A3H.A2018273.h28v14.006.

### MODIS filename format

The `hdf` filenames are of the form:

    MCD15A3H.A2018273.h35v10.006.2018278143650.hdf
    
where:

* the first field (`MCD15A3H`) gives the product code
* the second (`A2018273`) gives the observation date: day of year `273`, `2018` here
* the third (`h35v10`) gives the 'MODIS tile' code for the data location
* the remaining fields specify the product version number (`006`) and a code representing the processing date.
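
The fields listed above can be pulled apart with `str.split`. This is a minimal sketch; the function name `parse_modis_filename` and the dictionary keys are hypothetical, and it assumes the filename has exactly the six dot-separated fields shown.

```python
def parse_modis_filename(filename):
    """Split a MODIS hdf filename into its dot-separated fields."""
    product, date, tile, version, processed, extension = filename.split('.')
    return {
        'product': product,
        'year': int(date[1:5]),   # 'A2018273' -> 2018
        'doy': int(date[5:8]),    # 'A2018273' -> 273
        'tile': tile,
        'version': version,
        'processed': processed,
        'extension': extension,
    }

info = parse_modis_filename('MCD15A3H.A2018273.h35v10.006.2018278143650.hdf')
print(info)
```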

If we want a particular dataset, we would assume then that we know the information to construct the first four fields.

We then have the task remaining of finding an address of the pattern:

    MCD15A3H.A2018273.h17v03.006.*.hdf
    
where `*` represents a wildcard (unknown element of the URL/filename).
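
One way to test filenames against a wildcard pattern of this kind is the standard library function `fnmatch.fnmatch`. The sketch below is an illustration of the idea rather than the approach used in these notes, which selects files by splitting the filename into fields. The filenames are copied from the listings in this section.

```python
from fnmatch import fnmatch

filenames = [
    'MCD15A3H.A2018273.h35v10.006.2018278143650.hdf',
    'MCD15A3H.A2018273.h35v10.006.2018278143650.hdf.xml',
    'MCD15A3H.A2018273.h17v03.006.2018278143630.hdf',
    'MCD15A3H.A2018273.h18v03.006.2018278143633.hdf',
]

# '*' stands for the unknown processing-date field
pattern = 'MCD15A3H.A2018273.h17v03.006.*.hdf'

matches = [name for name in filenames if fnmatch(name, pattern)]
print(matches)
```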

Putting together the code from above to get a list of the `hdf` files:

In [24]:
import requests
from bs4 import BeautifulSoup
from cylog import cylog

url = 'https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2018.09.30/' 
html = requests.get(url,auth=cylog().login()).text
links = [mylink.attrs['href'] for mylink in BeautifulSoup(html,'lxml').find_all('a')]

# get all files that end 'hdf' as in example above
hdf_filenames = [item for item in links if item.split('.')[-1] == 'hdf']
# print some out
hdf_filenames[:10]

['MCD15A3H.A2018273.h00v08.006.2018278143557.hdf',
 'MCD15A3H.A2018273.h00v09.006.2018278143556.hdf',
 'MCD15A3H.A2018273.h00v10.006.2018278143557.hdf',
 'MCD15A3H.A2018273.h01v07.006.2018278143554.hdf',
 'MCD15A3H.A2018273.h01v08.006.2018278143557.hdf',
 'MCD15A3H.A2018273.h01v09.006.2018278143558.hdf',
 'MCD15A3H.A2018273.h01v10.006.2018278143554.hdf',
 'MCD15A3H.A2018273.h01v11.006.2018278143555.hdf',
 'MCD15A3H.A2018273.h02v06.006.2018278143556.hdf',
 'MCD15A3H.A2018273.h02v08.006.2018278143557.hdf']

We now want to specify a particular tile or tiles to access.

In this case, we want to look at the field `item.split('.')[-4]` and check to see if it is in the list `tiles`.

**Exercise 3.1.3**

First, let's check what we get when we look at `item.split('.')[-4]`.

* set a variable called `tiles` containing the names of the UK tiles (as in Exercise 3.1.1)
* write a loop `for item in links:` to loop over each item in the list `links`
* inside this loop set the condition `if item.split('.')[-1] == 'hdf':` to select only `hdf` files, as above
* inside this conditional statement, print out `item.split('.')[-4]` to see if it looks like the tile names
* having confirmed that you are getting the right information, add another conditional statement to see if `item.split('.')[-4] in tiles`, and then print only those filenames that pass both of your tests
* see if you can combine the two tests (the two `if` statements) into a single one

**Hint 1**: if you print all of the tilenames, this will go on for quite some time. Instead it may be better to use `print(item.split('.')[-4],end=' ')`, which will put a space, rather than a newline between each item printed.

**Hint 2**: recall what the logical statement `(A and B)` gives when thinking about the combined `if` statement

In [25]:
# do exercise here

In [8]:
import requests
from bs4 import BeautifulSoup
from cylog import cylog

tiles = ['h17v03', 'h18v03']
destination_folder = 'data'
url = 'https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2018.09.30/' 

html = requests.get(url,auth=cylog().login()).text
links = [mylink.attrs['href'] for mylink in BeautifulSoup(html,'lxml').find_all('a')]

# get all files that end 'hdf' as in example above
hdf_filenames = [item for item in links if item.split('.')[-1] == 'hdf']
tile_filenames = [item for item in hdf_filenames if item.split('.')[-4] in tiles]

# loop over tile filenames
print(tile_filenames)

['MCD15A3H.A2018273.h17v03.006.2018278143630.hdf', 'MCD15A3H.A2018273.h18v03.006.2018278143633.hdf']


**Exercise E3.1.4**

Most of the code above is generic in nature and it would make sense at this point to write a *function* that includes this.

In the code box below, you are given the basic structure of such a function.

* take the code from above and use it to complete the function. At this point, it should only return the filenames, as in the code segment above.
* improve the description

In [12]:
#
# do exercise here
#

import requests
from pathlib import Path
from bs4 import BeautifulSoup
from cylog import cylog

def modis_tiles(tiles=['h17v03', 'h18v03'],\
                 destination_folder = 'data',\
                 url='https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2018.09.30/'):
    """Put in a description here
       
    Parameters
    ----------
    tiles: list
        List of strings of MODIS product tilenames
    destination_folder: str
        The destination folder
    url: str
        The required URL
    
    Returns
    --------
    A string with the location of the downloaded file.
    """
    output_fname = None
    ## code here !!!
    return output_fname

# test:
print(modis_tiles())

None


### downloading the data file

We can now form the full url of the dataset:

    for filename in tile_filenames:  
        full_url = url+filename
        
We suppose that we want to save the dataset to a local file on the system.

It makes sense here to give the file the full MODIS filename. We need only then specify a directory ('folder') that we want to put the dataset in.

We set this to be `data` here. Before we go any further we should check:

* that the directory exists (if not, create it)
* that the file we want to download doesn't already exist (else, don't bother)

We can conveniently use methods in [`pathlib.Path`](https://docs.python.org/3/library/pathlib.html) for this.
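
The two checks can be sketched with `pathlib.Path` as follows. This example works in a temporary directory and writes a few placeholder bytes where a real download would go, so it makes no network access; the variable names `existed_before` and `existed_after` are only for illustration.

```python
import tempfile
from pathlib import Path

# work inside a temporary directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    dest_path = Path(tmp) / 'data'

    # create the directory (and any parents) only if it is missing
    dest_path.mkdir(parents=True, exist_ok=True)

    output_fname = dest_path / 'MCD15A3H.A2018273.h17v03.006.2018278143630.hdf'
    existed_before = output_fname.exists()

    if not existed_before:
        # stand-in for the real download: write a few placeholder bytes
        output_fname.write_bytes(b'placeholder')

    existed_after = output_fname.exists()

print(existed_before, existed_after)
```

Using `mkdir(parents=True, exist_ok=True)` avoids a separate existence test for the directory, at the cost of being slightly less explicit.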

The following function uses the `requests` library to pull the data from the URL and save it to a local file.

You don't need to be able to write code of this complexity at the moment, though you might find it interesting to look at and work out what is going on.

In [26]:
import requests
from pathlib import Path
from bs4 import BeautifulSoup
from cylog import cylog

def save_data(url,filename,destination_folder):
    """Downloads hdf data from a NASA URL for the specified tiles
       and saves the files in destination_folder.
       
       Checks to see if destination_folder exists, if not,
       creates it.
       
       Could do with further error checking
       
    Parameters
    ----------
    filename: string
        MODIS product filename
    destination_folder: str
        The destination folder
    url: str
        The required URL
    
    Returns
    --------
    A string with the location of the downloaded file.
    """
    # make sure destination_folder exists
    dest_path = Path(destination_folder)
    if not dest_path.exists():
        dest_path.mkdir()
    
    # make a compound file name from folder and filename
    output_fname = dest_path.joinpath(filename)
    
    # does the file already exist?
    if not output_fname.exists():
        # put download and save code here
        with requests.Session() as session:
            # get password-authorised url
            session.auth = cylog().login()
            r1 = session.request('get',url+filename)
            r2 = session.get(r1.url)
            with open(output_fname, 'wb') as fp:
                r = fp.write(r2.content)
            
    return str(output_fname)


# set up an example
url      = 'https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2018.09.30/' 
filename = 'MCD15A3H.A2018273.h17v03.006.2018278143630.hdf'
destination_folder = 'data'

save_data(url,filename,destination_folder)

'data/MCD15A3H.A2018273.h17v03.006.2018278143630.hdf'

**Exercise E3.1.5**

* make use of the function `save_data` to complete your function `modis_tiles` so that it loops over the tiles you request and downloads the data. It should return a list of the output filenames.

In [27]:
# do exercise here

**Exercise 3.1.6 Homework**

* There's a lot more that could be done with `save_data()`. Spend some time thinking this through and try to come up with an improved function.

**Hint**: for example, when you have the filename e.g.`MCD15A3H.A2018273.h17v03.006.2018278143630.hdf`, you should no longer have to specify the 'full' URL `https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2018.09.30/`. Instead, you should be able to give it as `https://e4ftl01.cr.usgs.gov/MOTA` and work out the rest from the filename. That would involve converting the year and day of year from the filename (`2018`, `273` here) into a date string (`2018.09.30` here). It is probably easiest to use the [`datetime`](https://docs.python.org/3/library/datetime.html) module for this once you have pulled out `year` and `doy` as integers:

    import datetime

    d = datetime.datetime(year,1,1) + datetime.timedelta(doy-1)
    datestr = d.strftime('%Y.%m.%d')   # zero-padded, e.g. '2018.09.30'
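
An alternative route to the same date string, sketched here as one possible approach, is to let `datetime.strptime` read the year and day of year directly from the filename using the `%j` (day of year) format code. The directory names on the server use two-digit months and days, so the output needs to be zero-padded.

```python
import datetime

filename = 'MCD15A3H.A2018273.h17v03.006.2018278143630.hdf'

# the second field is 'A' + 4-digit year + 3-digit day of year
date_field = filename.split('.')[1]
d = datetime.datetime.strptime(date_field[1:], '%Y%j')

# strftime zero-pads the month and day
datestr = d.strftime('%Y.%m.%d')
print(datestr)
```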

## GDAL

The basic operation of `gdal` involves:

- load the gdal module
- open a spatial dataset (an hdf format file here)
- specify which subsets you want.

We can explore the subsets in the file with `GetSubDatasets()`

First, let's make sure we have some suitable datasets for this session.

In [28]:
from geog_data import procure_dataset

tile_filenames = ['MCD15A3H.A2018273.h17v03.006.2018278143630.hdf', 'MCD15A3H.A2018273.h18v03.006.2018278143633.hdf']
for f in tile_filenames:
    procure_dataset(f)

Running outside UCL Geography. Will need to download data. This might take a while!
trying geog server ...
trying NASA server ...
file data/MCD15A3H.A2018273.h17v03.006.2018278143630.hdf exists
Running outside UCL Geography. Will need to download data. This might take a while!
trying geog server ...
trying NASA server ...
file data/MCD15A3H.A2018273.h18v03.006.2018278143633.hdf exists


In [34]:
from osgeo import gdal
from pathlib import Path

destination_folder = 'data'
filename = 'MCD15A3H.A2018273.h17v03.006.2018278143630.hdf'

dest_path = Path(destination_folder)
print (filename)
print ('*'*len(filename))

fname = str(dest_path.joinpath(filename))
# open dataset
g = gdal.Open(fname)
if g is not None:
    subdatasets = g.GetSubDatasets()
    for fname, name in subdatasets:
        print (name)
        print ("\t", fname)



MCD15A3H.A2018273.h17v03.006.2018278143630.hdf
**********************************************
[2400x2400] Fpar_500m MOD_Grid_MCD15A3H (8-bit unsigned integer)
	 HDF4_EOS:EOS_GRID:"data/MCD15A3H.A2018273.h17v03.006.2018278143630.hdf":MOD_Grid_MCD15A3H:Fpar_500m
[2400x2400] Lai_500m MOD_Grid_MCD15A3H (8-bit unsigned integer)
	 HDF4_EOS:EOS_GRID:"data/MCD15A3H.A2018273.h17v03.006.2018278143630.hdf":MOD_Grid_MCD15A3H:Lai_500m
[2400x2400] FparLai_QC MOD_Grid_MCD15A3H (8-bit unsigned integer)
	 HDF4_EOS:EOS_GRID:"data/MCD15A3H.A2018273.h17v03.006.2018278143630.hdf":MOD_Grid_MCD15A3H:FparLai_QC
[2400x2400] FparExtra_QC MOD_Grid_MCD15A3H (8-bit unsigned integer)
	 HDF4_EOS:EOS_GRID:"data/MCD15A3H.A2018273.h17v03.006.2018278143630.hdf":MOD_Grid_MCD15A3H:FparExtra_QC
[2400x2400] FparStdDev_500m MOD_Grid_MCD15A3H (8-bit unsigned integer)
	 HDF4_EOS:EOS_GRID:"data/MCD15A3H.A2018273.h17v03.006.2018278143630.hdf":MOD_Grid_MCD15A3H:FparStdDev_500m
[2400x2400] LaiStdDev_500m MOD_Grid_MCD15A3H (8-bit u

In the previous code snippet we have done a number of different things:

1. Import the GDAL library
2. Open a file with GDAL, storing a handler to the file in `g`
3. Test that `g` is not `None` (as this indicates failure opening the file. Try changing `filename` above to something else)
4. We then use the `GetSubDatasets()` method to read out information on the different subdatasets available from this file 
5. Loop over the retrieved subdatasets to print the name (human-readable information) and the GDAL filename. This last item is the filename that you need to use to tell GDAL to open a particular data layer of the 6 layers present in this example
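
Since `GetSubDatasets()` returns pairs of (GDAL filename, description), it can be convenient to index the GDAL filenames by layer name. The sketch below does this in plain Python for two of the pairs printed above, so it runs without GDAL or the data file; the dictionary name `layers` is an assumption made for this example.

```python
# two (GDAL filename, description) pairs copied from the output above
subdatasets = [
    ('HDF4_EOS:EOS_GRID:"data/MCD15A3H.A2018273.h17v03.006.2018278143630.hdf":MOD_Grid_MCD15A3H:Fpar_500m',
     '[2400x2400] Fpar_500m MOD_Grid_MCD15A3H (8-bit unsigned integer)'),
    ('HDF4_EOS:EOS_GRID:"data/MCD15A3H.A2018273.h17v03.006.2018278143630.hdf":MOD_Grid_MCD15A3H:Lai_500m',
     '[2400x2400] Lai_500m MOD_Grid_MCD15A3H (8-bit unsigned integer)'),
]

# the layer name is the last colon-separated field of the GDAL filename
layers = {fname.split(':')[-1]: fname for fname, description in subdatasets}

print(list(layers.keys()))
print(layers['Lai_500m'])
```

With the real dataset, the same dictionary comprehension could be applied to the list returned by `g.GetSubDatasets()`.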

Let's say that we want to access the LAI information. By comparing the output of the above code (or `gdalinfo`) with the contents of the [LAI/fAPAR product information page](https://lpdaac.usgs.gov/dataset_discovery/modis/modis_products_table/mcd15a3h), we find out that we want the layers for `Lai_500m`, `FparLai_QC`, `FparExtra_QC` and `LaiStdDev_500m`. 

To read these individual datasets, we need to open each of them individually using GDAL, and the GDAL filenames used above:

In [49]:
from osgeo import gdal
from pathlib import Path

# How to access specific datasets in gdal
filename = 'MCD15A3H.A2018273.h17v03.006.2018278143630.hdf'
destination_folder = 'data'

fname = str(Path(destination_folder).joinpath(filename))

# Let's create a list with the selected layer names
selected_layers = ["Lai_500m", "FparLai_QC", "LaiStdDev_500m"]

# We will store the data in a dictionary
# Initialise an empty dictionary
data = {}

# for convenience, we will use string substitution to create a 
# template for GDAL filenames, which we'll substitute on the fly.
# The grid name (MOD_Grid_MCD15A3H) must match the one reported
# by GetSubDatasets() above
file_template = 'HDF4_EOS:EOS_GRID:"{}":MOD_Grid_MCD15A3H:{}'
# This has two substitutions (the {} parts) which will refer to:
# - the filename
# - the data layer

for i, layer in enumerate(selected_layers):
    this_file = file_template.format(fname, layer)
    print("Opening Layer %d: %s" % (i+1, this_file))
    g = gdal.Open(this_file)
    
    if g is None:
        raise IOError("Could not open %s" % this_file)
    data[layer] = g.ReadAsArray() 
    print("\t>>> Read %s!" % layer)

Opening Layer 1: HDF4_EOS:EOS_GRID:"data/MCD15A3H.A2018273.h17v03.006.2018278143630.hdf":MOD_Grid_MCD15A3H:Lai_500m
	>>> Read Lai_500m!
Opening Layer 2: HDF4_EOS:EOS_GRID:"data/MCD15A3H.A2018273.h17v03.006.2018278143630.hdf":MOD_Grid_MCD15A3H:FparLai_QC
	>>> Read FparLai_QC!
Opening Layer 3: HDF4_EOS:EOS_GRID:"data/MCD15A3H.A2018273.h17v03.006.2018278143630.hdf":MOD_Grid_MCD15A3H:LaiStdDev_500m
	>>> Read LaiStdDev_500m!


In the previous code, we have seen a way of neatly creating the filenames required by GDAL to access the independent datasets: a template string that gets substituted with the `fname` and the `layer` name. Note that the presence of double quotes in the template requires us to use single quotes around it. Note also that the grid name in the template must match the one listed by `GetSubDatasets()` exactly: with a mistyped grid name, `gdal.Open()` can still return a dataset object, but it has zero size and `ReadAsArray()` returns `None`. The data is now stored in a dictionary, and can be accessed as e.g. `data['Lai_500m']` which is a numpy array:

In [50]:
type(data['Lai_500m'])

numpy.ndarray

In [56]:
print(this_file)
g = gdal.Open ( this_file )
g.ReadAsArray() 
print(g)

HDF4_EOS:EOS_GRID:"data/MCD15A3H.A2018273.h17v03.006.2018278143630.hdf":MOD_Grid_MCD15A3H:LaiStdDev_500m
<osgeo.gdal.Dataset; proxy of <Swig Object of type 'GDALDatasetShadow *' at 0x12406ee10> >


In [60]:
g.RasterXSize

2400

Now we have to translate the LAI values into meaningful quantities. According to the [LAI](https://lpdaac.usgs.gov/products/modis_products_table/leaf_area_index_fraction_of_photosynthetically_active_radiation/8_day_l4_global_1km/mod15a2) webpage, there is a scale factor of 0.1 involved for LAI and SD LAI:

In [None]:
lai = data['Lai_500m'] * 0.1
lai_sd = data['LaiStdDev_500m'] * 0.1

In [None]:
print("LAI")
print(lai)
print("SD")
print(lai_sd)

In [None]:
# plot the LAI

import matplotlib.pyplot as plt
%matplotlib inline

# colormap
cmap = plt.cm.Greens

plt.imshow(lai,interpolation='none',vmin=0.1,vmax=4.,cmap=cmap)
plt.title('MODIS LAI data: DOY 273 2018')
plt.colorbar()

In [None]:
# plot the LAI std

import matplotlib.pyplot as plt

# colormap ('spectral' is called 'nipy_spectral' in current matplotlib)
cmap = plt.cm.nipy_spectral

plt.imshow(lai_sd,interpolation='none',vmax=1.,cmap=cmap)
plt.title('MODIS LAI STD data: DOY 273 2018')
plt.colorbar()

It is not possible to produce LAI estimates if it is persistently cloudy, so the dataset may contain some gaps.

These are identified in the dataset using the QC (Quality Control) information.

We should then examine this. 

The codes for this are also given on the LAI product page. They are described as bit combinations:

<table>
<tr>
<th>Bit No.</th>	<th>Parameter Name</th><th>	Bit Combination</th><th>Explanation</th>
</tr>
<tr>
<td>0	</td><td>MODLAND_QC bits</td><td>	0</td><td>	Good quality (main algorithm with or without saturation)	 	 </td>
</tr>
<tr>
<td>&nbsp;</td><td>&nbsp;</td><td>	1	</td><td>Other Quality (back-up algorithm or fill values)	 	 </td>
</tr>

<tr>
<td>1	</td><td>Sensor</td><td>	0</td><td>	TERRA</td>
</tr>
<tr>
<td>&nbsp;</td><td>&nbsp;</td><td>	1	</td><td>AQUA</td>
</tr>

<tr>
<td>2</td><td>DeadDetector</td><td>0</td><td>Detectors apparently fine for up to 50% of channels 1, 2</td>
</tr>
<tr>
<td>&nbsp;</td><td>&nbsp;</td><td>	1	</td><td>Dead detectors caused >50% adjacent detector retrieval</td>
</tr>

<tr>
<td>3-4</td><td>CloudState</td><td>	00</td><td>	Significant clouds NOT present (clear)	 	 </td>
</tr>
<tr>
<td>&nbsp;</td><td>&nbsp;</td><td>	01	</td><td>Significant clouds WERE present</td>
</tr>
<tr>
<td>&nbsp;</td><td>&nbsp;</td><td>	10	</td><td>Mixed clouds present on pixel</td>
</tr>
<tr>
<td>&nbsp;</td><td>&nbsp;</td><td>	11	</td><td>Cloud state not defined assumed clear</td>
</tr>

<tr>
<td>5-7</td><td>CF_QC</td><td>000</td><td>Main (RT) method used, best result possible (no saturation)</td>
</tr>
<tr>
<td>&nbsp;</td><td>&nbsp;</td><td>001</td><td>Main (RT) method used with saturation. Good, very usable</td>
</tr>
<tr>
<td>&nbsp;</td><td>&nbsp;</td><td>010</td><td>Main (RT) method failed due to bad geometry, empirical algorithm used</td>
</tr>
<tr>
<td>&nbsp;</td><td>&nbsp;</td><td>011</td><td>Main (RT) method failed due to problems other than geometry, empirical algorithm used</td>
</tr>
<tr>
<td>&nbsp;</td><td>&nbsp;</td><td>100</td><td>Pixel not produced at all, value couldn't be retrieved (possible reasons: bad L1B data, unusable MODAGAGG data)</td>
</tr>
</table>

In using this information, it is up to the user to decide which data to pass through to any further processing. There are clearly trade-offs: if you look for only the highest quality data, then the number of samples is likely to be lower than if you were more tolerant. But if you are too tolerant, you will get spurious results. 
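
To make this trade-off concrete, here is a minimal sketch that counts how many pixels survive a tolerant and a strict filter. The `qc` values are made up for illustration, and both selection rules are only assumptions about what a user might choose: the tolerant rule keeps every pixel with bit 0 equal to 0, while the strict rule additionally requires a clear sky (bits 3-4 equal to `00`) and no saturation (bits 5-7 equal to `000`). The bit operations it relies on are explained in the *Binary data* section further down.

```python
import numpy as np

# Made-up QC bytes, for illustration only
qc = np.array([0, 32, 8, 2, 97, 157, 16, 0], dtype=np.uint8)

modland = qc & 0b1           # bit 0
cloud = (qc >> 3) & 0b11     # bits 3-4
cf_qc = (qc >> 5) & 0b111    # bits 5-7

# Tolerant: anything produced by the main algorithm
tolerant = modland == 0
# Strict: main algorithm, clear sky and no saturation
strict = tolerant & (cloud == 0) & (cf_qc == 0)

print("tolerant filter keeps", tolerant.sum(), "of", qc.size, "pixels")
print("strict filter keeps", strict.sum(), "of", qc.size, "pixels")
```

With these made-up values the strict filter keeps half as many pixels as the tolerant one, which is the kind of loss of samples you have to weigh against the gain in quality.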



But let's just say that we want to use only the highest quality data. 

This means we want bit 0 to be 0 ...

Let's have a look at the QC data:

In [None]:
qc = data['FparLai_QC'] # Get the QC data which is an unsigned 8 bit byte
print(qc, qc.dtype)

We see various byte values:

In [None]:
np.unique(qc)

In [None]:
# translated into binary using bin()
for i in np.unique(qc):
    print(i, bin(i))

#### Binary data

Computers store data in base 2, rather than the more usual base 10. A byte contains 8 bits, each of which can be either 0 or 1. A number is made up of the sum of the powers of two whose bits are set to 1. The diagram below shows how the number 53 in base 10 is written as 00110101 in base 2 (using 8 bits):

![Taken from[http://dustlayer.com/cpu-6510-articles/2013/4/18/math-basics-converting-numbering-systems](http://dustlayer.com/cpu-6510-articles/2013/4/18/math-basics-converting-numbering-systems)](https://static1.squarespace.com/static/511651d6e4b0a31c035e30aa/t/51713d9ce4b02974eba2db13/1366375836670/dustlayer.com-binary-to-decimal.png)
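
The same decomposition can be checked in a few lines of Python. This is just an illustrative sketch (the variable names are ours): it finds which bits of 53 are set and confirms that the corresponding powers of two add up to the original number.

```python
value = 53

# Bit positions (counted from the right, starting at 0) that are set to 1
set_bits = [i for i in range(8) if (value // 2 ** i) % 2 == 1]
# The power of two contributed by each of those bits
powers = [2 ** i for i in set_bits]

binary_string = format(value, "08b")  # 8 binary digits, zero padded

print(binary_string)         # 00110101
print(set_bits)              # [0, 2, 4, 5]
print(powers, sum(powers))   # [1, 4, 16, 32] 53
```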

In the case of QA flags, the ability to indicate a yes/no or a small set of options (e.g. "Very good", "Good", "Bad" "Terrible") by position results in an efficient way to convey a lot of information. In the case of the LAI QA product, we have that number 53 (00110101) can be explained as

<dl>
    <dt>Bit 0 (rightmost bit) = 1</dt><dd> "Other" quality</dd>
    <dt>Bit 1 = 0 </dt><dd> Sensor is "TERRA"</dd>    
    <dt>Bit 2 = 1 </dt><dd> Dead detectors</dd>    
    <dt>Bits 3 and 4 = 10 </dt><dd>Mixed clouds present on pixel</dd>    
    <dt>Bit 5, 6 and 7 = 001 </dt><dd>Main (RT) method used with saturation. Good very usable</dd>    
</dl>
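
The decoding of 53 can be reproduced with a few lines of Python. This is only a sketch: `decode_lai_qc` is a hypothetical helper written for this illustration (it is not part of any library), the field names are taken from the table above, and it uses the `&` and `>>` operators that are described in the following paragraphs.

```python
def decode_lai_qc(value):
    """Split an 8-bit LAI QC value into its five fields (hypothetical helper)."""
    return {
        "MODLAND_QC": value & 0b1,            # bit 0
        "Sensor": (value >> 1) & 0b1,         # bit 1
        "DeadDetector": (value >> 2) & 0b1,   # bit 2
        "CloudState": (value >> 3) & 0b11,    # bits 3-4
        "CF_QC": (value >> 5) & 0b111,        # bits 5-7
    }


fields = decode_lai_qc(53)
for name, field in fields.items():
    print("{:<12s} {:d}".format(name, field))
```

Each field comes back as a small integer that can be looked up in the table: for 53, `CloudState` is 2 (binary `10`, mixed clouds) and `CF_QC` is 1 (binary `001`, main method with saturation).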

So we have encoded 5 very different and qualitative pieces of information in a very small amount of storage. In some cases, we might just be interested in checking one or a few of the flags. A simple way to achieve this is to use a bitwise AND (`&`) operation, where one would `AND` a template binary mask with the number. For example, if we wanted the third bit (bit 2), we could use the mask 00000100 (i.e. decimal 4):

    00110101 & 00000100 = 00000100
    53 & 4 = 4
    
This gives us the wanted bit in its original position (so we can either get 4 or 0). We can then *shift* this number to the right by 2 places to have a 0 or a 1. In Python, we can do this with the binary right shift operator, `>>`:

    00000100 >> 2 = 00000001
    (53 & 4) >> 2 = 1
    
When operating on NumPy arrays, one can use the [special functions](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.bitwise.html) ``np.bitwise_and`` and ``np.right_shift``, which operate on an element-by-element basis (for integer arrays, the `&` and `>>` operators do the same). 
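
As a small sketch of the element-by-element behaviour, the example below applies both functions to a made-up 2 x 2 array of QC bytes (the values are chosen for illustration and do not come from a real file):

```python
import numpy as np

# Made-up QC bytes, for illustration only
qc = np.array([[0, 53], [105, 157]], dtype=np.uint8)

# Bit 0 of every element
bit0 = np.bitwise_and(qc, 1)

# Bits 5-7 of every element: mask first, then shift right by 5 places
cf_qc = np.right_shift(np.bitwise_and(qc, 0b11100000), 5)

print(bit0)
print(cf_qc)

# The operators give the same result on integer arrays
same = np.array_equal(cf_qc, (qc & 0b11100000) >> 5)
print(same)
```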

Let's see what the different values we found in the LAI product QA flags mean by 

1. Finding the unique values in the `qc` array
2. Looping over them
3. Calculating the `AND` of the flag and a mask (in this case, we'll check bits 5-7 and bit 0)
4. Shifting the result of the `AND` operation right (if required)

We will be printing the results. To print the binary representation of a number, we can use the format string `#010b` (which uses 8 bits, plus two extra characters for the `0b` at the beginning, and additionally does zero padding). You can read more about [Python string formatters here](https://docs.python.org/2/library/string.html#format-specification-mini-language).
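
A quick sketch of what these format strings produce for the number 53 used earlier (the variable names are only for illustration):

```python
value = 53

with_prefix = "{:#010b}".format(value)   # 10 characters: '0b' plus 8 bits
without_prefix = "{:08b}".format(value)  # just the 8 zero-padded bits
decimal = "{:03d}".format(value)         # decimal, zero padded to 3 digits

print(with_prefix)     # 0b00110101
print(without_prefix)  # 00110101
print(decimal)         # 053
```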



In [None]:
mask_bits_57 = 0b11100000 # Or decimal 2**7 + 2**6 + 2**5 = 224
mask_bit_0 = 0b00000001 # Or decimal 1

for value in np.unique(qc):
    value = int(value)  # plain Python integer, for formatting
    # Note the parentheses: `>>` binds more tightly than `&`
    flags_57 = (value & mask_bits_57) >> 5
    flag_0 = value & mask_bit_0
    print("Value: {:03d}(->{:#010b}), flags 5-7: {:d}(->{:#010b}),flag 0: {:d}(->{:#010b})".format(
            value, value, flags_57, flags_57, flag_0, flag_0))
    

So, for example (examining the table above) `105` is interpreted as `0b011` in fields 5 to 7 (which is 3 in decimal). This indicates that 'Main (RT) method failed due to problems other than geometry, empirical algorithm used'. Here, bit zero is set to `1`, so this is a 'bad' pixel.

In this case, we are only interested in bit 0, which is an easier task than interpreting all of the bits.

In [None]:
# the good data are where qc bit 0 is 0

qc = data['FparLai_QC'] # Get the QC data
# find bit 0
qc = np.bitwise_and(qc, 1)

plt.imshow(qc)
plt.title('QC bit 0')
plt.colorbar()

We can use this mask to generate a masked array. Masked arrays, as we have seen before, are like normal arrays, but they have an associated mask. 

Remember that the mask in a masked array should be `False` for good data, so we can directly use `qc` as defined above. 
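
As a reminder of how this works, here is a minimal sketch with made-up numbers (not real LAI values): the third sample is flagged as bad by bit 0, so it is excluded from any statistics computed on the masked array.

```python
import numpy as np

# Made-up values: the third sample is a fill value flagged by QC bit 0
lai = np.array([1.0, 2.0, 25.5, 3.0])
qc_bit0 = np.array([0, 0, 1, 0], dtype=np.uint8)

# Non-zero mask entries are treated as True, i.e. "masked out"
laim = np.ma.array(lai, mask=qc_bit0)

print(laim)
print(lai.mean())    # 7.875, contaminated by the bad sample
print(laim.mean())   # 2.0, computed from the three good samples only
```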

We shall also choose a colormap (there are [lots to choose from](http://wiki.scipy.org/Cookbook/Matplotlib/Show_colormaps)), and set the masked (no data) pixels to be shown in black, with the colour scale running from 0.1 to 4.

In [None]:
# colormap
cmap = plt.cm.Greens
cmap.set_bad ( 'k' )
# this sets the no data colour. 'k' is black

# generate the masked array
laim = np.ma.array ( lai, mask=qc )

# and plot it
plt.imshow ( laim, cmap=cmap, interpolation='none', vmin=0.1, vmax=4)
plt.colorbar()

We can do the same for the standard deviation:

In [None]:
cmap = plt.cm.nipy_spectral  # 'spectral' was renamed in newer matplotlib versions
cmap.set_bad ( 'k' )
stdm = np.ma.array ( lai_sd, mask=qc )
plt.imshow ( stdm, cmap=cmap, interpolation='none', vmin=0.001, vmax=0.5)
plt.colorbar()

For convenience, we might wrap all of this up into a function:

In [None]:
from osgeo import gdal
import numpy as np
import numpy.ma as ma


def getLAI(filename,
           qc_layer='FparLai_QC',
           scale=(0.1, 0.1),
           selected_layers=("Lai_1km", "LaiStdDev_1km")):
    """Read scaled LAI layers from a MODIS HDF file, masked using QC bit 0."""
    # Build new lists rather than appending to the arguments: appending
    # to a default list would make it grow on every call.
    # The QC layer is read too, with a scale factor of 1.
    layers = list(selected_layers) + [qc_layer]
    scales = list(scale) + [1]
    # We will store the data in a dictionary
    data = {}
    # for convenience, we will use string substitution to create a 
    # template for GDAL filenames, which we'll substitute on the fly:
    file_template = 'HDF4_EOS:EOS_GRID:"%s":MOD_Grid_MOD15A2:%s'
    # This has two substitutions (the %s parts) which will refer to:
    # - the filename
    # - the data layer
    for layer, layer_scale in zip(layers, scales):
        this_file = file_template % (filename, layer)
        g = gdal.Open(this_file)

        if g is None:
            raise IOError("Could not open %s" % this_file)

        data[layer] = g.ReadAsArray() * layer_scale

    qc = data[qc_layer]  # Get the QC data
    # find bit 0
    qc = np.bitwise_and(qc, 1)

    odata = {}
    for layer in layers[:-1]:
        odata[layer] = ma.array(data[layer], mask=qc)

    return odata
    

In [None]:
filename = 'data/MCD15A2.A2011185.h09v05.005.2011213154534.hdf'

lai_data = getLAI(filename)

# colormap
cmap = plt.cm.Greens
cmap.set_bad ( 'k' )
# this sets the no data colour. 'k' is black

# and plot it
plt.imshow ( lai_data['Lai_1km'], cmap=cmap, interpolation='nearest', vmin=0.1, vmax=4)
plt.colorbar()

## Exercise 4.1

You are given the MODIS LAI data files for the year 2012 in the directory `data` for the UK (MODIS tile h17v03).

Read these LAI datasets into a masked array, using QA bit 0 to mask the data (i.e. good quality data only) and generate a movie of LAI.

You should end up with something like:

![](files/images/lai_uk02.gif)

## Downloading data

For the exercise above, you were supplied with the datasets that were previously downloaded. But how would you go about downloading your own data?

### NASA EarthData

The easiest option would be to use NASA's own system for discovering and accessing data, [EarthData](https://search.earthdata.nasa.gov/search).
<div class="alert alert-block alert-danger">
Go to [EarthData](https://search.earthdata.nasa.gov/search) **NOW** and create a username/password combination
**YOU WILL NEED THESE TO DO THE COURSE ASSIGNMENT!**
</div>

We will be looking for data about snow cover, in particular the MODIS/TERRA snow cover product (product code MOD10A1) covering the UK. You would basically search for MOD10A1, select start and end times, and maybe filter by tile (select `h17v03` in the *Granule Search* box). Then you can click on *Download data*. In the next screen, you would select *Direct Download*. You can then get a download script to run (it will ask you for your username and password, and proceed to download the data), but you can also see that the URLs are of the form

```
https://n5eil01u.ecs.nsidc.org/DP5/MOST/MOD10A1.006/2016.12.15/MOD10A1.A2016350.h17v03.006.2016352154432.hdf
```
So maybe we can try downloading them with a script?

### Direct downloading

We have ascertained that the URLs have the following format:
```
https://n5eil01u.ecs.nsidc.org/DP5/MOST/MOD10A1.006/<date>/<hdf_file>
```

You can visit any date folder, and see the list of files. We can use the `requests` module for this. We will also use the `datetime` module, which makes it easy to work with dates and times. Here, we'll use it to create a date object with `datetime.datetime(year, month, day)`, to define a time increment (`datetime.timedelta(days=1)`), and to convert a `datetime` object to a string with a particular format using its `strftime("%Y.%m.%d")` method, which gives strings such as `2001.12.31`. 
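
As a short sketch of these three `datetime` operations, the example below rebuilds the date folder of the example URL shown earlier. It also uses the `%j` (day of year) format code to show how the same date appears inside the granule filename as `A2016350`. The variable names are only for illustration.

```python
import datetime

# The date of the example granule shown earlier
date = datetime.datetime(2016, 12, 15)

# The name of the date folder on the server
folder = date.strftime("%Y.%m.%d")

# Year and day of year, as used inside MODIS filenames
granule_date = date.strftime("A%Y%j")

# Move on by one day
next_day = date + datetime.timedelta(days=1)

print(folder)                          # 2016.12.15
print(granule_date)                    # A2016350
print(next_day.strftime("%Y.%m.%d"))   # 2016.12.16
```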

We can then use requests to loop over each date, get the file listing, and filter it by MODIS tile (e.g. `h17v03` in our case), and select the files that have a `.hdf` extension, ignoring those that also have metadata (`.hdf.xml`).

The code to connect to the server and get the data is quite complicated, so we'll use a function provided below called `url_downloader`. You don't need to understand the details of the authentication, but you should be able to figure out what's happening in this bit of the code

```python
            if r.ok:
                for line in r.text.split("\n"):
                    if line.find("h17v03") >= 0 and line.find(".hdf") >= 0 and line.find(".xml") < 0:
                        fname = line.split("href")[1][1:].split('"')[1]
```
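
If the chain of `split` calls looks opaque, it can help to run it step by step on a single line. The sketch below does this for a made-up line of HTML: the real directory listing may be laid out differently, and the timestamp in the filename is a placeholder. The helper `is_wanted` is hypothetical and simply repeats the test used in the `if` statement.

```python
# Made-up listing lines, for illustration only
line = ('<tr><td><a href="MOD10A1.A2016001.h17v03.006.0000000000000.hdf">'
        'MOD10A1.A2016001.h17v03.006.0000000000000.hdf</a></td></tr>')
xml_line = line.replace(".hdf", ".hdf.xml")


def is_wanted(text, tile="h17v03"):
    """Same test as in the snippet above (hypothetical helper)."""
    return text.find(tile) >= 0 and text.find(".hdf") >= 0 and text.find(".xml") < 0


# Step 1: keep everything after the word 'href'
after_href = line.split("href")[1]
# Step 2: drop the '=' sign, so the text starts with the opening quote
from_quote = after_href[1:]
# Step 3: split on double quotes; element 1 is the text between the first pair
fname = from_quote.split('"')[1]

print(is_wanted(line), is_wanted(xml_line))
print(fname)
```

The metadata line is rejected because it contains `.xml`, and the filename is whatever sits between the first pair of double quotes after `href`.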




In [None]:
import os
import requests
import datetime
print ("""
########################################################################################
###  CHANGE ME! CHANGE ME! CHANGE ME! CHANGE ME! CHANGE ME! CHANGE ME! CHANGE ME!    ###
########################################################################################
""")
username = "profLewis"
password = "GeogG1222016"

def url_downloader (username, password, destination_folder="data/",
                    url="https://n5eil01u.ecs.nsidc.org/DP5/MOST/MOD10A1.006/2016.01.01/"):
    
    """Downloads data from a NASA URL, provided that a username/password pair exist.
    Parameters
    ----------
    username: str
        The NASA EarthData username
    password: str
        The NASA EarthData password
    destination_folder: str
        The destination folder
    url: str
        The required URL
    
    Returns
    --------
    A string with the location of the downloaded file.
    """
    with requests.Session() as session:
            session.auth = (username, password)
            r1 = session.request('get', url)
            r = session.get(r1.url, auth=(username, password))
            fname = None
            if r.ok:
                for line in r.text.split("\n"):
                    if line.find("h17v03") >= 0 and line.find(".hdf") >= 0 and line.find(".xml") < 0:
                        fname = line.split("href")[1][1:].split('"')[1]
            if fname is None:
                raise IOError("No h17v03 granule found at {}".format(url))
            url_granule = url + fname
            r1 = session.request('get', url_granule)
            r = session.get(r1.url, auth=(username, password))
            output_fname = os.path.join(destination_folder, fname)
            if not r.ok:
                raise IOError("Could not download {}".format(url_granule))
            # HDF files are binary: write the raw bytes, not the decoded text
            with open(output_fname, 'wb') as fp:
                fp.write(r.content)
            print ("Saved file {}".format(output_fname))
            return output_fname

filename = url_downloader (username, password)


Then we just need a loop to get different dates:

In [None]:
import datetime


# Set the starting date
this_date = datetime.datetime(2016,1,1)
# Set the end date
end_date = datetime.datetime(2016,1,5)
# Increase the date by one day
time_delta = datetime.timedelta(days=1)

# Main loop
while this_date <= end_date:
    url = "https://n5eil01u.ecs.nsidc.org/DP5/MOST/MOD10A1.006/{:s}/".format(this_date.strftime("%Y.%m.%d"))
    url_downloader(username, password, url=url)
    this_date = this_date + time_delta

## Exercise: A Different Dataset


            
We have now downloaded a different dataset, the [MOD10A1 product](http://www.icess.ucsb.edu/modis/SnowUsrGuide/usrguide_1dtil.html), which is the 500 m MODIS daily snow cover product, over the UK.

This is a good opportunity to see if you can apply what was learned above about interpreting QC information and using `gdal` to examine a dataset.

If you examine the [data description page](http://nsidc.org/data/MOD10A1), you will see that the data are in HDF EOS format (the same as the LAI product). 

### Download
<div class="alert alert-block alert-success">
Download the MODIS Terra daily snow product for the UK for February 2012 and put the files in the directory `data`.
</div>

###  Explore

<div class="alert alert-block alert-success">
Show all of the subset data layers in this dataset. 
</div>

### Read a dataset

Suppose we are interested in the dataset `NDSI_Snow_Cover` over the land surface.
<div class="alert alert-block alert-success">
Read this dataset for one of the files into a numpy array and show a plot of the dataset.
</div>

### Water mask


The [data description page](http://nsidc.org/data/MOD10A1) tells us that a value of `239` indicates ocean. You can use this information to build the water mask.
<div class="alert alert-block alert-success">
Demonstrate how to build a water mask from one of these files, setting the mask `False` for land and `True` for water. 

Produce a plot of this.
</div>

### Valid pixel mask

<div class="alert alert-block alert-success">
As well as having a land/water mask, we should generate a mask for valid pixels. For the snow dataset, values between 0 and 100 (inclusive) represent valid snow cover data values. Other values are not valid for some reason. Set the mask to `False` for valid pixels and `True` for others. Produce a plot of the mask. 
</div>

### 3D dataset
<div class="alert alert-block alert-success">
Generate a 3D masked numpy array using the valid pixel mask for masking, of `Fractional_Snow_Cover` for each day of February 2012. 

You might like to produce a movie of the result.
</div>


## 4.3 Vector masking

In this section, we will use a pre-defined function to generate a mask from some vector boundary data.

In this case, we will generate a mask for Ireland, projected into the coordinate system of the MODIS LAI dataset, and use that to generate a new LAI dataset for Ireland only.

Sometimes, geospatial data is acquired and recorded for particular geometric objects such as polygons or lines. An example is a road layout, where each road is represented as a geometric object (a line, with points given in a geographical projection), with a number of added *features* associated with it, such as the road name, whether it is a toll road, or whether it is dual-carriageway, etc. This data is quite different to a raster, where the entire scene is tessellated into pixels, and each pixel holds a value (or an array of values in the case of multiband raster files). 

If you are familiar with databases, vector files are effectively a database, where one of the fields is a geometry object (a line in our previous road example, or a polygon if you consider a cadastral system). We can thus select different records by writing queries on the features. Some of these queries might be spatial (e.g. check whether a point is inside a particular country polygon).

The most common format for vector data is the **ESRI Shapefile**, which is a multifile format (i.e., several files are needed in order to access the data). We'll start by getting hold of a shapefile that contains the countries of the world as polygons, together with information on country name, capital name, population, etc. The file is available [here](http://www.naturalearthdata.com/features/).



We will download the file using `requests`, and we will use the Python [`zipfile`](https://pymotw.com/2/zipfile/) module to extract the contents of the zip file in the `./data/` folder:

In [None]:
import requests
import zipfile


zip_file_name = "data/ne_50m_admin_0_countries.zip"
# Download file
r = requests.get ("http://www.naturalearthdata.com/http//www.naturalearthdata.com/" +
                  "download/50m/cultural/ne_50m_admin_0_countries.zip")
if r.ok:
    # The zip file is binary, so open the output file in 'wb' mode
    with open ( zip_file_name, 'wb') as fp:
        fp.write(r.content)
# Unzip file
zip_ref = zipfile.ZipFile(zip_file_name, 'r')
zip_ref.extractall("./data/")
zip_ref.close()


We can check on the UNIX shell that the zip file has been both downloaded, and its contents extracted. We can see that we have a bunch of new files with extensions like `.dbf, .prj, .shx` and `.shp`. The latter is the main file, the other files are auxiliary (but they are all needed):

In [None]:
!ls -lh data/ne*


We need to import `ogr`, and then open the file. As with GDAL, we get a handler to the file, (`g` in this case). OGR files can have different layers, although Shapefiles only have one. We need to select the layer using `GetLayer(0)` (selecting the first layer).

In [None]:
from osgeo import ogr

g = ogr.Open( "data/ne_50m_admin_0_countries.shp" )
layer = g.GetLayer( 0 )


In order to see a field (the field `NAME`) we can loop over the features in the layer, and use the `GetField('NAME')` method. We'll only do ten features here:

In [None]:
for n_feat, feat in enumerate(layer):
    print("Feature #{:d}: {:s}".format(n_feat+1, feat.GetField('NAME')))
    if n_feat == 9:
        break

If you wanted to see the different fields, you could do this using:

In [None]:
layerDefinition = layer.GetLayerDefn()

for i in range(layerDefinition.GetFieldCount()):
    print("Field %d: %s" % ( i+1, layerDefinition.GetFieldDefn(i).GetName() ))

There is much more information on using `ogr` on the associated [notebook OGR_Python](OGR_Python.ipynb) that you should explore at some point.

One thing we may often wish to do with such vector datasets is produce a mask, e.g. for national boundaries. One of the complexities of this is changing the projection of the vector data to that of the raster dataset.  

This is too involved to go over in this session, so we will simply present you with a function to achieve this. The function is available in [`python/raster_mask.py`](./python/raster_mask.py), which we'll import into the system by first adding the `python` folder to the Python path. The function is called `rasterise_vector`. Let's see the help...

In [None]:
import sys
sys.path.insert(0,"./python/")

from raster_mask import rasterise_vector
help(rasterise_vector)

So, at the very least, we need to 

1. Give the raster set that we want to use as a reference. The mask will have the same properties as this raster (e.g. projection, extent, number of pixels, ...).
2. Give a vector file (in this case, the world borders file).
3. Select one country using the slightly awkward SQL notation of `"field_name='Ireland'"` (e.g.)



In [None]:
filename = 'data/MCD15A2.A2012273.h17v03.005.2012297134400.hdf'

# a layer (doesn't matter so much which: use for geometry info)
layer = 'Lai_1km'
# the full dataset specification
file_template = 'HDF4_EOS:EOS_GRID:"%s":MOD_Grid_MOD15A2:%s'
file_spec = file_template % ( filename, layer)

M = rasterise_vector ( file_spec, "data/ne_50m_admin_0_countries.shp", 
                  "NAME='Ireland'", verbose=False)
plt.imshow(M, interpolation="nearest")


This is available as [python/raster_mask.py](python/raster_mask.py).

Most of the code below should be familiar from above (we make use of the `getLAI()` function we developed).

In [None]:
from raster_mask import getLAI


# the data file name
filename = 'data/MCD15A2.A2012273.h17v03.005.2012297134400.hdf'

# a layer (doesn't matter so much which: use for geometry info)
layer = 'Lai_1km'
# the full dataset specification
file_template = 'HDF4_EOS:EOS_GRID:"%s":MOD_Grid_MOD15A2:%s'
file_spec = file_template%(filename,layer)

mask = rasterise_vector ( file_spec, "data/ne_50m_admin_0_countries.shp", 
                  "NAME='Ireland'", verbose=False)
plt.imshow(mask)
# get the LAI data
data = getLAI(filename)

# reset the data mask
# 'mask' is True for Ireland
# so take the opposite 
data['Lai_1km'] = ma.array(data['Lai_1km'], mask=np.logical_not(mask))
data['LaiStdDev_1km'] = ma.array(data['LaiStdDev_1km'], mask=np.logical_not(mask))

plt.title('LAI for Eire: 2012273')
plt.imshow(data['Lai_1km'],vmax=6)
plt.colorbar()

## Exercise 4.3

Apply the concepts above to generate a 3D masked numpy data array of LAI and std LAI for Eire for the year 2012.

Plot your results and make a movie of LAI.

Plot average LAI for Eire as a function of day of year for 2012.

# Summary

In this session, we have learned to use some geospatial tools using GDAL in Python. A good set of [working notes on how to use GDAL](http://jgomezdans.github.io/gdal_notes/) has been developed that you will find useful for further reading, as well as looking at the [advanced](advanced.ipynb) section.

We have also very briefly introduced dealing with vector datasets in `ogr`, but this was mainly through the use of a pre-defined function that will take an ESRI shapefile (vector dataset), warp this to the projection of a raster dataset, and produce a mask for a given layer in the vector file.

If there is time in the class, we will develop some exercises to examine the datasets we have generated and/or to explore some different datasets or different locations.
