In this notebook we attempt to sample randomly from CMR collections or granules. This gives us a starting database to work with and also exposes the complications of working with different data types.

In [1]:
from pyCMR.pyCMR import CMR
cmr = CMR("cmr.cfg")

Idea: sample randomly using a uniform index into the data. What is the maximum index?

In [2]:
# Get all CMR results - how high can we boost the page number?
results = cmr.searchCollection(page_num=10000000)
len(results)

100

In [None]:
import json
js1 = json.dumps(results[0])
# print js1

This approach does not work: every result appears to be from Bowen Island, differing only slightly in revision date.

New strategy: randomly sample from around the globe.
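As a minimal sketch of this strategy, the helper below draws one random box and formats it as a `west,south,east,north` string with decimals preserved. The name `random_bounding_box` and the `rng` parameter are illustrative assumptions, not part of pyCMR, and the block is written to run on Python 3 independently of the notebook cells.

```python
import random

def random_bounding_box(size_deg, rng=random):
    """Return a random 'west,south,east,north' box of size_deg x size_deg degrees.

    Hypothetical helper for illustration; it only builds the string.
    """
    south = rng.uniform(-90.0, 90.0 - size_deg)
    west = rng.uniform(-180.0, 180.0 - size_deg)
    north = south + size_deg
    east = west + size_deg
    # Keep decimals: integer formatting (%d) would truncate each corner toward zero
    return "%.4f,%.4f,%.4f,%.4f" % (west, south, east, north)

# Seeded generator so the draw is repeatable
box = random_bounding_box(5.0, random.Random(0))
```

Note that drawing latitude uniformly does not give boxes of equal surface area, so polar regions are over-represented relative to their size.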

In [13]:
import random

TRIALS = 10
BB_SIZE = 5 # 5x5 random bounding box

sampled_results_place = []

for trial in xrange(TRIALS):
    
    # Generate a random lat/lng bounding box
    lowerLat = random.uniform(-90.0, 90.0 - BB_SIZE)
    lowerLng = random.uniform(-180.0, 180.0 - BB_SIZE)
    upperLat = lowerLat + BB_SIZE
    upperLng = lowerLng + BB_SIZE
    # Use %.4f rather than %d so the corners are not truncated to whole degrees
    bounding_box = "%.4f,%.4f,%.4f,%.4f" % (lowerLng, lowerLat, upperLng, upperLat)
    
    # Request URL
    query_results = cmr.searchCollection(bounding_box=bounding_box)
    sampled_results_place.append(query_results)
    
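# NOTE: `results` is still the list from the earlier page_num query;
# len(sampled_results_place) would count the queries made in this cell.
len(results)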


100

The same 10 collections are returned every time, so this is not an effective way to sample. Sampling by the creation dates of the datasets may work better.
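A minimal sketch of date-based sampling, assuming timestamps in the `YYYY-MM-DDTHH:MM:SSZ` form: pick a random start inside a fixed range and return a window of a given length as two UTC strings. The function name `random_time_window` and its parameters are hypothetical; `calendar.timegm` is used so the `Z` timestamps are treated as UTC rather than local time.

```python
import calendar
import random
import time

FORMAT = '%Y-%m-%dT%H:%M:%SZ'

def random_time_window(range_start, range_end, window_days, rng=random):
    """Return (start, end) UTC timestamp strings for a random window.

    Hypothetical helper for illustration; it performs no API calls.
    """
    # calendar.timegm interprets the parsed struct as UTC, matching the 'Z' suffix
    lo = calendar.timegm(time.strptime(range_start, FORMAT))
    hi = calendar.timegm(time.strptime(range_end, FORMAT))
    start = lo + rng.random() * (hi - lo)
    end = start + 86400 * window_days
    return (time.strftime(FORMAT, time.gmtime(start)),
            time.strftime(FORMAT, time.gmtime(end)))

start, end = random_time_window('1995-01-01T12:00:00Z', '2017-01-01T12:00:00Z', 365)
time_range = start + ',' + end  # comma-separated range string
```

The window may extend up to `window_days` past the end of the range, because only the start is constrained.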

In [3]:
# WARNING: uncommenting the next two lines ERASES everything sampled so far.
# sampled_results = []
# short_names = set()

# Pick up from where we left off
import pickle
with open('metadata.p', 'rb') as f:
    short_names, sampled_results = pickle.load(f)


In [21]:
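import calendar
import random
import time

TRIALS = 50
TIME_WINDOW_DAYS = 365
FORMAT = '%Y-%m-%dT%H:%M:%SZ'

# Sampling range as UTC epoch seconds. calendar.timegm treats the parsed
# struct as UTC, matching the 'Z' suffix and the time.gmtime calls below
# (time.mktime would interpret it as local time).
range_start = calendar.timegm(time.strptime('1995-01-01T12:00:00Z', FORMAT))
range_end = calendar.timegm(time.strptime('2017-01-01T12:00:00Z', FORMAT))

new_found = 0
for trial in xrange(TRIALS):
    
    # Progress marker, one per trial
    print '*',
    
    # Generate a random time window of TIME_WINDOW_DAYS days
    start_time = range_start + random.random() * (range_end - range_start)
    end_time = start_time + 86400 * TIME_WINDOW_DAYS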
        
    query_time1 = time.strftime(FORMAT, time.gmtime(start_time))
    query_time2 = time.strftime(FORMAT, time.gmtime(end_time))
    # print '[%s TO\n%s]' % (query_time1, query_time2)
    
    query_results = cmr.searchCollection(created_at=query_time1 + ',' + query_time2)
    
    # Add if they have a unique short name
    for collection in query_results:
        sn = collection['Collection']['ShortName']
        if sn not in short_names:
            sampled_results.append(collection)
            short_names.add(sn)
            new_found += 1
    
    # Don't overload the API
    time.sleep(random.random() * 2.0)
    
print 'Found %d new results' % new_found

    

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * Found 337 new results
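The de-duplication step in the loop above can be sketched as a standalone function. This is an illustration only: `add_unique` is a hypothetical name, and the sample records are placeholders that merely mimic the `['Collection']['ShortName']` nesting used in the cell.

```python
def add_unique(collections, short_names, sampled_results):
    """Append collections whose ShortName is unseen; return how many were added."""
    added = 0
    for collection in collections:
        sn = collection['Collection']['ShortName']
        if sn not in short_names:
            sampled_results.append(collection)
            short_names.add(sn)
            added += 1
    return added

# Placeholder records, not real CMR output
batch = [
    {'Collection': {'ShortName': 'A'}},
    {'Collection': {'ShortName': 'B'}},
    {'Collection': {'ShortName': 'A'}},  # duplicate short name, skipped
]
seen, kept = set(), []
added = add_unique(batch, seen, kept)
```

Keeping the set of short names alongside the list makes each membership check constant time, which matters as the sample grows across runs.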


In [None]:
# For debugging
"""range2 = '1997-11-17T22:25:35Z,1997-12-17T22:25:35Z'
results = cmr.searchCollection(created_at=range2)
len(results)"""

In [None]:
short_names

Now that we have an adequate sampling method, let's save some of this data to a "database" (for now, just a pickle file).

In [22]:
len(sampled_results)

2136

In [23]:
# Save our progress

import pickle
with open('metadata.p', 'wb') as f:
    pickle.dump((short_names, sampled_results), f)
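As a hedged sketch, saving and resuming can be wrapped in two small helpers that close the file handle reliably. The names `save_progress` and `load_progress` are assumptions for illustration, and the round trip below uses a temporary file with placeholder data rather than the real `metadata.p`.

```python
import os
import pickle
import tempfile

def save_progress(path, short_names, sampled_results):
    """Write the (short_names, sampled_results) pair to a pickle file."""
    with open(path, 'wb') as f:
        pickle.dump((short_names, sampled_results), f)

def load_progress(path):
    """Read back the (short_names, sampled_results) pair."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Round-trip check on a temporary file with placeholder data
path = os.path.join(tempfile.mkdtemp(), 'metadata_demo.p')
save_progress(path, {'A'}, [{'Collection': {'ShortName': 'A'}}])
names, results = load_progress(path)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.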