In [41]:
BUCKET_URL = 'http://micropasts-palstaves2.s3.amazonaws.com/'
TO_WHERE = 'palstaves2' # directory we'll end up downloading things to

At the URL above is an XML document which lists all the rest of the data. We first need to download it. Python comes with an HTTP client built in. In Python 3 this is part of the ``urllib.request`` module. In Python 2 it's part of the ``urllib2`` module.
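
If the same notebook has to run on both Python versions, one common pattern is to try the Python 3 import first and fall back to the Python 2 one. This is only a sketch of that pattern; the rest of this walkthrough assumes Python 3.

```python
# Import urlopen from wherever this Python version keeps it.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

# Either way we end up with a callable named urlopen.
print(callable(urlopen))
```

One caveat: under Python 2 the object returned by ``urlopen`` cannot be used directly in a ``with`` statement, so it would need wrapping in ``contextlib.closing``.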

In [7]:
from urllib.request import urlopen

We can use ``urlopen`` to fetch a document from the Intertrons.

In [8]:
with urlopen(BUCKET_URL) as f:
    document_text = f.read()

Let's test this by printing the first 200 bytes (``read`` returns a ``bytes`` object, hence the ``b'...'`` in the output):

In [9]:
print(document_text[:200])

b'<?xml version="1.0" encoding="UTF-8"?>\n<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Name>micropasts-palstaves2</Name><Prefix></Prefix><Marker></Marker><MaxKeys>1000</MaxKeys><IsT'


We now need to parse this as XML. Python comes with its own XML parser. The parser takes a file-like object. Fortunately the ``urlopen`` function returns one. We can thus download and parse all in one go!

In [15]:
import xml.etree.ElementTree as ET
with urlopen(BUCKET_URL) as f:
    document = ET.parse(f)
document_root = document.getroot()
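
A file-like object doesn't have to come from the network. As a rough sketch, the bytes we read earlier could be parsed too, either by wrapping them in ``io.BytesIO`` or by handing them to ``ET.fromstring``. The ``sample_bytes`` below are a made-up stand-in, not the real bucket listing.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the bytes returned by f.read(); not the real listing.
sample_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">'
    b'<Name>example-bucket</Name>'
    b'<Contents><Key>dir/</Key></Contents>'
    b'<Contents><Key>dir/a.jpg</Key></Contents>'
    b'</ListBucketResult>'
)

# Option 1: wrap the bytes in a file-like object and use ET.parse.
root_from_file = ET.parse(io.BytesIO(sample_bytes)).getroot()

# Option 2: parse the bytes directly; fromstring returns the root element.
root_from_bytes = ET.fromstring(sample_bytes)

print(root_from_file.tag)
print(root_from_file.tag == root_from_bytes.tag)
```

Note that ``ET.parse`` returns an ``ElementTree`` (hence the ``getroot()`` call), whereas ``ET.fromstring`` returns the root element itself.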

The document root is a ``<ListBucketResult>`` element which lists each file in a separate ``<Contents>`` element. Every element lives in the Amazon S3 XML namespace, and typing out the full namespace URL each time would be a pain. Hence we'll define a custom prefix, ``s3``, for Amazon S3 elements.

In [23]:
ns = { 's3': 'http://s3.amazonaws.com/doc/2006-03-01/' }

We can now use the namespace prefix to match elements without having to spell out the full URL each time. Test this by counting the number of ``Contents`` tags:

In [26]:
print('Number of files: {}'.format(len(document_root.findall('s3:Contents', ns))))

Number of files: 609
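
To see what the prefix is saving us from, here is a small sketch on a made-up two-entry listing (the ``sample`` document is hypothetical). The prefixed query and the fully spelled-out ``{namespace}tag`` query find the same elements, while a bare ``Contents`` finds nothing because the elements are namespaced.

```python
import xml.etree.ElementTree as ET

S3_URI = 'http://s3.amazonaws.com/doc/2006-03-01/'
ns = {'s3': S3_URI}

# Hypothetical miniature listing, just to exercise the three query styles.
sample = (
    '<ListBucketResult xmlns="' + S3_URI + '">'
    '<Contents><Key>a</Key></Contents>'
    '<Contents><Key>b</Key></Contents>'
    '</ListBucketResult>'
)
root = ET.fromstring(sample)

with_prefix = root.findall('s3:Contents', ns)               # prefix + mapping
with_full_name = root.findall('{' + S3_URI + '}Contents')   # full {uri}tag form
without_namespace = root.findall('Contents')                # no namespace at all

print(len(with_prefix), len(with_full_name), len(without_namespace))  # prints: 2 2 0
```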


Seems to work. Let's create a list of elements:

In [29]:
contents = document_root.findall('s3:Contents', ns)
assert len(contents) > 0 # Raises AssertionError if there are no files to download!
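
A related sanity check: the listing printed earlier reports ``<MaxKeys>1000</MaxKeys>``, so a single request returns at most 1000 entries. Our 609 fit comfortably, but for a bigger bucket it may be worth checking whether the listing was cut short. S3 listings normally carry an ``IsTruncated`` element for this. The printed output above stops mid-tag, so its presence in this particular document is an assumption, and ``listing_is_truncated`` is a hypothetical helper tried here on a made-up sample.

```python
import xml.etree.ElementTree as ET

ns = {'s3': 'http://s3.amazonaws.com/doc/2006-03-01/'}

def listing_is_truncated(root):
    """Return True if the listing says more keys remain on the server."""
    flag = root.find('s3:IsTruncated', ns)
    # A missing or empty element is treated as "not truncated".
    return flag is not None and (flag.text or '').strip().lower() == 'true'

# Hypothetical sample; the real document root could be passed in instead.
sample = ET.fromstring(
    '<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">'
    '<MaxKeys>1000</MaxKeys><IsTruncated>false</IsTruncated>'
    '</ListBucketResult>'
)
print(listing_is_truncated(sample))
```

If this ever came back ``True``, the list of keys would be incomplete and further requests would be needed to fetch the rest.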

Each ``Contents`` element has a ``Key`` element which stores the actual path to the file. Use a list comprehension to extract a list of keys:

In [35]:
keys = [c.find('s3:Key', ns).text for c in contents]
print('Number of keys: {}'.format(len(keys)))

Number of keys: 609


Now we need to form a URL for each key. Again Python has us covered. ``urllib.parse`` has a handy ``urljoin`` function which does the hard work for us. (In Python 2 it lives in the ``urlparse`` module.)

In [42]:
from urllib.parse import urljoin

urls = [urljoin(BUCKET_URL, key) for key in keys]

Let's take a look at the first few:

In [43]:
for u in urls[:5]:
    print(u)

http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/
http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/
http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3517.JPG
http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3518.JPG
http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3519.JPG
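
The keys shown above only contain URL-safe characters, so ``urljoin`` alone is enough here. If a key ever contained spaces or other awkward characters, it would probably need percent-encoding first. A minimal sketch, assuming a made-up bucket URL and key (``key_to_url`` is a hypothetical helper, not something used elsewhere in this notebook):

```python
from urllib.parse import quote, urljoin

BASE = 'http://example-bucket.s3.amazonaws.com/'  # hypothetical bucket URL

def key_to_url(base, key):
    """Percent-encode a key (quote leaves '/' alone by default) and join it to base."""
    return urljoin(base, quote(key))

print(key_to_url(BASE, 'Site A/Axe 1/IMG 0001.JPG'))
# prints: http://example-bucket.s3.amazonaws.com/Site%20A/Axe%201/IMG%200001.JPG
```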


OK, getting there. We now need to download the files. If the URL ends in a ``/``, it's a directory. If it doesn't we'll assume it's a file. As a trial run we only process the first ten keys; drop the ``[:10]`` slice to fetch everything.

In [47]:
import os
import shutil

for key in keys[:10]:
    # Get the URL for this key
    url = urljoin(BUCKET_URL, key)

    # Compute the local path to download this to
    path = os.path.join(TO_WHERE, key)
    dirname = os.path.dirname(path)

    # Skip URLs corresponding only to directories
    if url.endswith('/'):
        continue

    # Does the destination directory exist? If not, make it so (#1).
    if not os.path.isdir(dirname):
        os.makedirs(dirname)

    # Now we need to download the file, streaming it straight to disk
    print('Downloading {}...'.format(url))
    with urlopen(url) as src, open(path, 'wb') as dst:
        shutil.copyfileobj(src, dst)

Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3517.JPG...
Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3518.JPG...
Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3519.JPG...
Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3520.JPG...
Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3521.JPG...
Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3522.JPG...
Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3523.JPG...
Downloading http://micropasts-palstaves2.s3.amazonaws.com/2013T482_Lower_Hardres_Canterbury/Axe1/IMG_3524.JPG...
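
With around 600 files to fetch, it might be handy to be able to re-run the loop without grabbing everything again. One possible approach, sketched below, is to work out the list of (URL, local path) pairs up front, skipping directory keys and anything already on disk. The ``plan_downloads`` helper, the example keys, the bucket URL and the directory name are all hypothetical.

```python
import os
from urllib.parse import urljoin

def plan_downloads(keys, bucket_url, to_where):
    """Return (url, path) pairs for file keys not yet present on disk."""
    plan = []
    for key in keys:
        if key.endswith('/'):
            continue  # directory placeholder, nothing to fetch
        path = os.path.join(to_where, key)
        if os.path.exists(path):
            continue  # already downloaded on an earlier run
        plan.append((urljoin(bucket_url, key), path))
    return plan

# Hypothetical keys and locations, purely for illustration.
example_keys = ['site/', 'site/Axe1/', 'site/Axe1/IMG_0001.JPG']
plan = plan_downloads(example_keys,
                      'http://example-bucket.s3.amazonaws.com/',
                      'hypothetical_download_dir')
for url, path in plan:
    print(url, '->', path)
```

Each pair in the plan could then go through the same ``urlopen`` and ``shutil.copyfileobj`` step as in the loop above. Checking only for existence is a simplification: a download interrupted half-way would leave a partial file behind that this check would wrongly skip.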
