# Scrape AVIRIS CH$_4$ Plumes Geotiffs from ORNL-DAAC
## ER 131 Project | Group 4
**Author:** Marshall Worsham <br>
**Date:** 11-03-2020

This notebook contains code for scraping the 600+ CH$_4$ plume geotiffs from the Thorpe et al. 2019 ORNL-DAAC repository and downloading them to a local directory. It combines Python and bash commands, in particular `BeautifulSoup` and `wget`.

### Citation
https://doi.org/10.3334/ORNLDAAC/1727

## Libraries and dependencies

In [4]:
# install libraries if needed (urllib and codecs ship with Python and need no install)
#!pip3 install requests beautifulsoup4

import os
import numpy as np
import urllib
import requests
import codecs
from urllib.request import Request, urlopen, urlretrieve
from bs4 import BeautifulSoup

## Request a single file
If you just want one file from the repository, it's easy to get.

In [6]:
url = 'https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/ang20160910t185702_cmf_pub_ch4.tif' # specify the path to the file you want
r = requests.get(url) # get the file
r.raise_for_status() # stop here if the server returned an error status instead of the file

# save the file's content locally in the filetype you specify
# 'wb' indicates 'open file in binary mode for writing, truncating the file first'
with open('./ch4x.tif', 'wb') as f: 
    f.write(r.content)
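
`r.content` holds the entire response in memory before anything is written to disk, which can be uncomfortable for large rasters. Below is a minimal sketch of a chunked alternative that uses only the standard library. The helper name `download_file` and the chunk size are illustrative choices, and the sketch does not handle the Earthdata login, so it assumes the URL is readable without authentication.

```python
import shutil
from urllib.request import urlopen


def download_file(url, dest, chunk_size=1024 * 1024):
    """Copy the resource at `url` to the local path `dest` in fixed-size chunks.

    `download_file` is a hypothetical helper, not part of any library.
    """
    # urlopen returns a file-like response; copyfileobj reads and writes it
    # chunk_size bytes at a time, so the whole file is never held in memory
    with urlopen(url) as response, open(dest, 'wb') as out:
        shutil.copyfileobj(response, out, chunk_size)
    return dest
```

With the `url` defined above, a call would look like `download_file(url, './ch4x.tif')`.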

## Scrape list of downloadable geotiffs from ORNL-DAAC as index.html
Getting a whole batch of files is a bit more complicated. This workflow requires a NASA Earthdata (EOSDIS) account. You can [register a username and password here](https://urs.earthdata.nasa.gov/oauth/authorize?client_id=YQOhivHfMTau88rjbMOVyg&response_type=code&redirect_uri=https://daac.ornl.gov/cgi-bin/urs/urs_logon_proc.pl&state=https%3A%2F%2Fdaac.ornl.gov%2Fcgi-bin%2Fdsviewer.pl%3Fds_id%3D1727).

In [5]:
# define username and password
username = '' # set username
password = '' # set password

# this should be the base url you want to access
baseurl = 'https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/'

# a file within this https directory will have a path like this: 'https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/ang20160910t185702_cmf_pub_ch4.tif'
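
Typing credentials directly into a cell makes it easy to share them by accident along with the notebook. One possible alternative, sketched here under assumptions, reads them from environment variables and falls back to an interactive prompt. The helper `get_credentials` and the variable names `EARTHDATA_USERNAME` and `EARTHDATA_PASSWORD` are invented for this example.

```python
import os
from getpass import getpass


def get_credentials(env=os.environ):
    """Return (username, password) without hard-coding either in the notebook.

    EARTHDATA_USERNAME and EARTHDATA_PASSWORD are hypothetical variable names
    chosen for this sketch; use whatever names you set in your own shell.
    """
    user = env.get('EARTHDATA_USERNAME')
    pwd = env.get('EARTHDATA_PASSWORD')
    # fall back to interactive prompts when a variable is not set
    if not user:
        user = input('Earthdata username: ')
    if not pwd:
        pwd = getpass('Earthdata password: ')  # typed characters are not echoed
    return user, pwd
```

Calling `username, password = get_credentials()` then defines the same two names used in the rest of the notebook.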

### Bash

In [18]:
# bash commands to set up a new directory and pull the index.html that lists all download links
!mkdir ER131_Project # make a directory to store the downloads
# cd into that directory; use the %cd magic, because `!cd` runs in a temporary subshell and has no lasting effect
%cd ER131_Project

# download the index.html that outlines the structure of the https download page
# with a little bit of cleaning (done below), this file will give you a list of individual links for each geotiff in the repository, which you can call to download the files
# {username}, {password} and {baseurl} are filled in from the Python variables defined above; never paste real credentials into the notebook
!wget --mirror --convert-links --no-parent --wait=5 --user='{username}' --password='{password}' {baseurl}

mkdir: ER131_Project: File exists
--2020-11-03 10:20:45--  https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/
Resolving daac.ornl.gov (daac.ornl.gov)... 160.91.19.24
Connecting to daac.ornl.gov (daac.ornl.gov)|160.91.19.24|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://urs.earthdata.nasa.gov/oauth/authorize?app_type=401&client_id=QyeRbBJg8YuY_WBh-KBztA&response_type=code&redirect_uri=https%3A%2F%2Fdaac.ornl.gov%2Fdaacdata%2Fdoesntmater&state=aHR0cHM6Ly9kYWFjLm9ybmwuZ292L2RhYWNkYXRhL2Ntcy9DSDRfUGx1bWVfQVZJUklTLU5HL2RhdGEvdGlmZi8 [following]
--2020-11-03 10:20:51--  https://urs.earthdata.nasa.gov/oauth/authorize?app_type=401&client_id=QyeRbBJg8YuY_WBh-KBztA&response_type=code&redirect_uri=https%3A%2F%2Fdaac.ornl.gov%2Fdaacdata%2Fdoesntmater&state=aHR0cHM6Ly9kYWFjLm9ybmwuZ292L2RhYWNkYXRhL2Ntcy9DSDRfUGx1bWVfQVZJUklTLU5HL2RhdGEvdGlmZi8
Resolving urs.earthdata.nasa.gov (urs.earthdata.nasa.gov)... 2001:4d0:241a:4081::89, 198.118.243.33
Co

### Python

In [None]:
# define the directory we'll be working in
tifdir = '/Volumes/Brain/GIS/ER131_Project/daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/'

# define result as the index.html file you just downloaded
result = os.path.join(tifdir, 'index.html')

In [8]:
# use codecs to open the html result as utf-8 text; the with-block closes the file when done
with codecs.open(result, 'r', 'utf-8') as f:
    # parse the file contents into a machine- and human-readable html structure
    document = BeautifulSoup(f.read(), 'html.parser')

# print if you want to check the html structure
#print(document.prettify())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>
   Index of /daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff
  </title>
 </head>
 <body>
  <h1>
   Index of /daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff
  </h1>
  <table>
   <tr>
    <th valign="top">
    </th>
    <th>
     <a href="https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/?C=N;O=D">
      Name
     </a>
    </th>
    <th>
     <a href="https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/?C=M;O=A">
      Last modified
     </a>
    </th>
    <th>
     <a href="https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/?C=S;O=A">
      Size
     </a>
    </th>
    <th>
     <a href="https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/?C=D;O=A">
      Description
     </a>
    </th>
   </tr>
   <tr>
    <th colspan="5">
     <hr/>
    </th>
   </tr>
   <tr>
    <td valign="top">
    </td>
    <td>
     <a href="https://daac.ornl.gov/daacdata/cms/CH4_Plume_AV

In [19]:
# scrape the html file for geotiff download links

# initialize an empty list
daac_links = [] 

# find all of the links and store them in the daac_links list
for link in document.find_all('a'):
    daac_links.append(link.get('href'))

https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/?C=N;O=D
https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/?C=M;O=A
https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/?C=S;O=A
https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/?C=D;O=A
https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/
https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/ang20160910t185702_cmf_pub_ch4.tif
https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/ang20160910t185702_cmf_pub_rgb.tif
https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/ang20160910t191651_cmf_pub_ch4.tif
https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/ang20160910t191651_cmf_pub_rgb.tif
https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/ang20160910t192242_cmf_pub_ch4.tif
https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/ang20160910t192242_cmf_pub_rgb.tif
https://daac.ornl.gov/daacdata/cms/CH4_Plum
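
The link extraction above can also be reproduced without third-party packages. The sketch below swaps `BeautifulSoup` for the standard library's `html.parser.HTMLParser` and runs on a tiny hand-written HTML string rather than the real index page. The class name `LinkCollector`, the sample markup, and the `example.org` base URL are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href:  # skip anchors that have no href attribute
            # urljoin leaves absolute URLs alone and resolves relative ones
            self.links.append(urljoin(self.base_url, href))


# made-up sample markup standing in for index.html
sample = ('<table><tr><td><a href="a_ch4.tif">a</a></td>'
          '<td><a name="x">no link</a></td></tr></table>')
collector = LinkCollector('https://example.org/data/tiff/')
collector.feed(sample)
print(collector.links)
```

Resolving each `href` against a base URL means the same code works whether the index page lists absolute or relative links.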

In [11]:
# there are some links higher up in the hierarchy that don't point to geotiffs
# there are others that point to rgb rasters of the earth surface beneath the AVIRIS flight paths
# we don't need these

daac_tif_links = daac_links[5:] # remove the four column-sort links and the parent-directory link
daac_ch4_links = [i for i in daac_tif_links if 'ch4' in i] # keep the CH4 rasters and drop the rgb rasters
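
Slicing by position depends on how many header links the index page happens to contain. A position-independent variant, sketched here as one possible approach, keeps only links that end in the CH$_4$ file suffix. The helper name `select_ch4_links` is hypothetical, and the example list is a hand-picked subset of the links printed above plus a `None` entry.

```python
def select_ch4_links(links, suffix='_ch4.tif'):
    """Return the links that end with `suffix`, preserving their order.

    `select_ch4_links` is a hypothetical helper; the default suffix matches
    the CH4 file names listed in the index above.
    """
    # `link and ...` skips None, which link.get('href') returns for <a> tags without an href
    return [link for link in links if link and link.endswith(suffix)]


base = 'https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/'
example_links = [
    base + '?C=N;O=D',  # column-sort link
    'https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/',  # parent directory
    base + 'ang20160910t185702_cmf_pub_ch4.tif',
    base + 'ang20160910t185702_cmf_pub_rgb.tif',
    None,
]
print(select_ch4_links(example_links))
```

Matching on the suffix rather than on the substring `'ch4'` also avoids relying on letter case to exclude the `CH4_Plume_AVIRIS-NG` directory links.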

In [23]:
# check that the list is populated and check length
print(daac_ch4_links[0:4])

assert len(daac_ch4_links) == 612

['https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/ang20160910t185702_cmf_pub_ch4.tif', 'https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/ang20160910t191651_cmf_pub_ch4.tif', 'https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/ang20160910t192242_cmf_pub_ch4.tif', 'https://daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/ang20160910t193531_cmf_pub_ch4.tif']


In [147]:
# set up the list of links so that bash can read it

# make sure you're in the directory you want to use to store files and that it's large enough to hold ~150 GB of data
os.chdir('/Volumes/Brain/GIS/ER131_Project/daac.ornl.gov/daacdata/cms/CH4_Plume_AVIRIS-NG/data/tiff/') # !!update this to a local directory!!

# save out the list of links as a text file in the target directory
with open('ch4_tifs.txt', 'w') as f:
    for item in daac_ch4_links:
        f.write("%s\n" % item)

### Bash

In [132]:
# shell commands launched with ! start in the notebook's working directory, which os.chdir() set above
# confirm that bash sees your target directory as the working directory
!pwd



In [None]:
# run to iterate over the links saved in 'ch4_tifs.txt' and download the files one by one
# {username} and {password} are filled in from the Python variables holding your EOSDIS credentials
!wget -r --convert-links --no-parent --wait=8 --execute='robots = off' --span-hosts --user='{username}' --password='{password}' -i ch4_tifs.txt
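
A download of this size may be interrupted partway through. The sketch below, written under assumptions, lists the links whose files are not yet present so that only those need to be requested again. The helper `missing_files` is hypothetical. `directory` must be the folder where the files actually landed, which may be a nested host/path folder when `wget -r` is used. The check looks only at file names, so a partially downloaded file still counts as present.

```python
import os
from urllib.parse import urlparse


def missing_files(links, directory):
    """Return the links whose file name does not yet exist in `directory`."""
    missing = []
    for link in links:
        # the file name is the last path component of the URL
        name = os.path.basename(urlparse(link).path)
        if not os.path.isfile(os.path.join(directory, name)):
            missing.append(link)
    return missing
```

The returned list can be written to a new text file in the same one-link-per-line format as `ch4_tifs.txt` and passed to `wget -i`.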