# Sentinel 2 Demo - Kenya

The following notebook demonstrates how to access Sentinel 2 data using the Planet OS API. Note that the data and API endpoints used in this notebook are intended to be used for demonstration purposes only, and may be updated or removed without notice.

We'll be working with two Sentinel 2 datasets from the Planet OS Datahub, [RGB bands](http://data.planetos.com/datasets/sentinel2_kenya_clouds_rgb) and an [NDVI](http://data.planetos.com/datasets/sentinel2_kenya_clouds_ndvi) product calculated by Planet OS. Both datasets provide spatial coverage in Kenya, include cloud coverage percentages, and have had a cloud mask applied to remove cloudy data points.

If you'd prefer to work with the original data that includes clouds, you can replace the Planet OS dataset IDs used later in this notebook accordingly:

* sentinel2_kenya_clouds_rgb => [sentinel2_kenya_rgb](http://data.planetos.com/datasets/sentinel2_kenya_rgb)
* sentinel2_kenya_clouds_ndvi => [sentinel2_kenya_ndvi](http://data.planetos.com/datasets/sentinel2_kenya_ndvi)

In [1]:
%matplotlib inline

import folium
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
import dateutil.parser
from shapely.geometry import Point
import simplejson as json
from urllib.parse import urlencode
from urllib.request import urlopen, Request

A Planet OS API key is required to run this notebook. Keys are displayed in the [account settings](http://data.planetos.com/account/settings/) page on the Planet OS Datahub. If you do not have a Planet OS account, you can [sign up for free](http://data.planetos.com/plans).

In [2]:
apikey = 'YOUR-API-KEY-GOES-HERE'
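
If you would rather not paste the key into the notebook, it can be read from the environment instead. The sketch below assumes a hypothetical environment variable named `PLANETOS_API_KEY` (a name chosen for this example, not one defined by Planet OS) and falls back to the placeholder when it is unset.

```python
import os

# PLANETOS_API_KEY is a hypothetical variable name chosen for this sketch;
# it is not something the Planet OS API itself defines.
PLACEHOLDER = 'YOUR-API-KEY-GOES-HERE'

def load_apikey(env_var='PLANETOS_API_KEY', default=PLACEHOLDER):
    '''Return the API key from the environment, or the placeholder if unset.'''
    return os.environ.get(env_var, default)

apikey = load_apikey()
if apikey == PLACEHOLDER:
    print("No API key found; set PLANETOS_API_KEY or edit the placeholder.")
```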

## API Queries

Below we set up some query functions for the Planet OS API.

We'll be using two API endpoints: the Dataset endpoint to query for metadata and available variables, and the Point endpoint to query for actual data. We recommend reviewing the [API documentation](http://docs.planetos.com/#rest-api-v1) for more details on response format and available query parameters.

Note that we'll also be using some undocumented features in this demo, so the required endpoints may change in the future as features such as polygon selection evolve.

In [3]:
def api_query(url):
#     print("API Url: %s" % url) # debug helper: display API urls in output
    request = Request(url)
    response = urlopen(request)
    return json.loads(response.read())

def api_dataset_query(id, params):
    '''
    Queries the Planet OS API dataset endpoint and returns the
    response as a JSON object.
    id (str): Planet OS dataset ID
    params (dict): Dict of API query parameters
    '''
    query = urlencode(params)
    url = "http://api.planetos.com/v1/datasets/%s?%s" % (id, query)
    return api_query(url)

def api_point_query(id, params):
    '''
    Queries the Planet OS API point endpoint and returns the
    response as a JSON object.
    
    id (str): Planet OS dataset ID
    params (dict): Dict of API query parameters
    '''
    query = urlencode(params)
    url = "http://api.planetos.com/v1/datasets/%s/point?%s" % (id, query)
    return api_query(url)
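
To see what `urlencode` produces before sending a request, the URL can be built and inspected on its own. This is a minimal sketch that mirrors the string formatting in `api_point_query`; `build_point_url` is a helper name introduced here for illustration, and no request is made.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_point_url(dataset_id, params):
    '''Build (but do not request) a point endpoint URL.'''
    query = urlencode(params)
    return "http://api.planetos.com/v1/datasets/%s/point?%s" % (dataset_id, query)

example_url = build_point_url(
    'sentinel2_kenya_clouds_rgb',
    {'apikey': 'YOUR-API-KEY-GOES-HERE', 'lat': -0.9585, 'lon': 36.6212,
     'var': 'red,green,blue'})
print(example_url)

# urlencode percent-encodes the commas in 'var' as %2C; parsing the query
# string back shows that the original value round-trips unchanged.
decoded = parse_qs(urlsplit(example_url).query)
print(decoded['var'])
```

Printing the URL this way is a lighter alternative to the commented-out debug line in `api_query`, since it works without network access.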

## Dataset Metadata

Let's query the API for the RGB and NDVI datasets and display some information about them.

In [4]:
# The Planet OS dataset IDs are required in the request URL.
# Dataset IDs can be found on the dataset detail pages in the right-hand
# data access column once a user has authenticated.
ds_rgb_id = 'sentinel2_kenya_clouds_rgb'
ds_ndvi_id = 'sentinel2_kenya_clouds_ndvi'

# All queries require a valid API key
params = {'apikey': apikey, }

# Query the dataset endpoint for both RGB and NDVI
ds_rgb_json = api_dataset_query(ds_rgb_id, params)
ds_ndvi_json = api_dataset_query(ds_ndvi_id, params)

In [5]:
# Let's summarize the results in ASCII
def ds_summary(ds):
    '''
    Prints a summary of a dataset including title, variable names,
    variable long names, and unit.
    
    ds (json): Planet OS API dataset response in JSON format
    '''
    print(ds['Title'])
    print('-' * 80)
    print("{0:<30} {1:<40} {2:<10}".format("Variable", "Long Name", "Unit"))
    print('-' * 80)
    for v in ds['Variables']:
        # fall back to '-' when a field is missing or empty
        name = v.get('name') or '-'
        long_name = v.get('longName') or '-'
        unit = v.get('unit') or '-'
        print("{0:<30} {1:<40} {2:<10}".format(name, long_name, unit))

ds_summary(ds_rgb_json)
print()
ds_summary(ds_ndvi_json)

Sentinel2 RGB (Kenya)
--------------------------------------------------------------------------------
Variable                       Long Name                                Unit      
--------------------------------------------------------------------------------
blue                           Band 2 blue                              -         
lon                            longitude                                degrees_east
green                          Band 3 green                             -         
context_time_lat_lon           -                                        -         
mx_dataset                     -                                        -         
crs                            -                                        -         
lat                            latitude                                 degrees_north
cloudy_pixels_percentage       cloudy_pixels_percentage                 percentage
red                            Band 4 red                       

## Requesting RGB Data at a Point

The output above lists the available variables. Variable names can be passed in the `var` parameter to request values for one or more comma-separated variables within the dataset.
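
The `var` value can also be assembled from the dataset response rather than typed by hand. The sketch below assumes the `Variables` list carries the `name` key used in `ds_summary`; `build_var_param` is a hypothetical helper, and `sample_ds` is a hand-written stand-in for a real response.

```python
def build_var_param(ds, wanted):
    '''
    Join the names in `wanted` into a comma separated string, raising
    KeyError if any of them is not listed in the dataset's variables.
    '''
    available = {v['name'] for v in ds['Variables']}
    missing = [name for name in wanted if name not in available]
    if missing:
        raise KeyError("Variables not in dataset: %s" % ', '.join(missing))
    return ','.join(wanted)

# Hand-written stand-in for a dataset response (names from the summary above)
sample_ds = {'Variables': [{'name': 'blue'}, {'name': 'green'},
                           {'name': 'red'},
                           {'name': 'cloudy_pixels_percentage'}]}

print(build_var_param(sample_ds, ['red', 'green', 'blue']))
```

Checking names up front turns a typo in a variable name into an immediate local error instead of an unexpected API response.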

Let's request RGB values using the Point endpoint from the Sentinel-2 RGB (Kenya) dataset at the nearest available point from a specific coordinate.

In [6]:
# Select a point in decimal degrees to query. We'll use the centroid of a farm in Kenya.
# This particular point is interesting because it falls within 4 unique Sentinel 2
# grid tiles. As a result, our response will return values from each of the 4 tiles.
lon = 36.62117917585404
lat = -0.9584909201199898

params = {'apikey': apikey, # always required
          'count': 1, # number of values to return per classifier (e.g. tile)
          'lat': lat, # latitude of interest
          'lon': lon, # longitude of interest
          'max_count': 'true', # return total count of available values
          'nearest': 'true', # return data from the nearest available point
          'time_order': 'desc', # return data in descending chronological order
          'var': 'red,green,blue', # return red, green and blue variables
         }

ds_rgb_point = api_point_query(ds_rgb_id, params)
# print(json.dumps(ds_rgb_point, indent=2))

In [7]:
# The raw response contains two top level elements: 'entries' which contains the values
# and 'stats' which contains some metadata about the values. We'll create a Pandas
# dataframe with the values in 'entries'.

# pd.json_normalize requires pandas 1.0+; older versions expose it as pd.io.json.json_normalize
df = pd.json_normalize(ds_rgb_point['entries'])
print(df.count())
df.head()

axes.latitude       4
axes.longitude      4
axes.time           4
classifiers.tile    4
context             4
data.blue           2
data.green          2
data.red            2
dtype: int64


| | axes.latitude | axes.longitude | axes.time | classifiers.tile | context | data.blue | data.green | data.red |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.958516 | 36.62122 | 2016-08-15T08:08:34 | 36MZE | time_lat_lon | NaN | NaN | NaN |
| 1 | -0.958492 | 36.621159 | 2016-08-12T07:57:10 | 37MBV | time_lat_lon | 14153.0 | 12845.0 | 11983.0 |
| 2 | -0.958452 | 36.621212 | 2016-08-15T08:08:34 | 36MZD | time_lat_lon | NaN | NaN | NaN |
| 3 | -0.958468 | 36.621216 | 2016-08-12T07:57:10 | 37MBU | time_lat_lon | 14574.0 | 13190.0 | 12168.0 |


The query above requested the most recent values at the point nearest our coordinate of interest. The response values are shown with their respective axes (`axes.latitude`, `axes.longitude`, and `axes.time`), as well as the tile (`classifiers.tile`) each value was sourced from.

Now depending on _when_ this query was performed, there may have been sufficient cloud cover to prevent the acquisition of data in the RGB bands. This would result in `NaN` values for the `data.blue`, `data.green`, and `data.red` variables.
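
The dotted column names come from flattening each nested entry. The sketch below reproduces that flattening in plain Python on two hand-written entries. The nested layout (`axes`, `classifiers`, `context`, `data`) and the use of `None` for missing values are inferred from the column names and `NaN` cells above rather than taken from the API documentation, and `flatten` is a helper introduced here.

```python
def flatten(entry, prefix=''):
    '''Flatten nested dicts into a single dict with dotted keys.'''
    flat = {}
    for key, value in entry.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + '.'))
        else:
            flat[name] = value
    return flat

# Two entries shaped like the rows shown above; None marks a missing value
sample_entries = [
    {'axes': {'latitude': -0.958516, 'longitude': 36.62122,
              'time': '2016-08-15T08:08:34'},
     'classifiers': {'tile': '36MZE'},
     'context': 'time_lat_lon',
     'data': {'blue': None, 'green': None, 'red': None}},
    {'axes': {'latitude': -0.958492, 'longitude': 36.621159,
              'time': '2016-08-12T07:57:10'},
     'classifiers': {'tile': '37MBV'},
     'context': 'time_lat_lon',
     'data': {'blue': 14153.0, 'green': 12845.0, 'red': 11983.0}},
]

rows = [flatten(e) for e in sample_entries]
print(sorted(rows[0]))

# Tiles whose entry carries no RGB values
empty_tiles = [r['classifiers.tile'] for r in rows if r['data.red'] is None]
print(empty_tiles)
```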

If we want the five most recent passes for each tile rather than only the most recent one, we can request them with the `count` parameter.

In [8]:
# Let's increase the value count per tile to 5.
# Alternatively, the maxCount value can be used to request all available values:
# params['count'] = ds_rgb_point['stats']['maxCount']
params['count'] = 5

ds_rgb_point = api_point_query(ds_rgb_id, params)

In [9]:
# pd.json_normalize requires pandas 1.0+; older versions expose it as pd.io.json.json_normalize
df = pd.json_normalize(ds_rgb_point['entries'])
print(df.count())
df.head()

axes.latitude       17
axes.longitude      17
axes.time           17
classifiers.tile    17
context             17
data.blue           10
data.green          10
data.red            10
dtype: int64


| | axes.latitude | axes.longitude | axes.time | classifiers.tile | context | data.blue | data.green | data.red |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.958516 | 36.62122 | 2016-08-15T08:08:34 | 36MZE | time_lat_lon | NaN | NaN | NaN |
| 1 | -0.958516 | 36.62122 | 2016-08-12T07:57:10 | 36MZE | time_lat_lon | 13241.0 | 12131.0 | 11451.0 |
| 2 | -0.958516 | 36.62122 | 2016-08-05T08:03:48 | 36MZE | time_lat_lon | NaN | NaN | NaN |
| 3 | -0.958516 | 36.62122 | 2016-08-02T07:55:56 | 36MZE | time_lat_lon | 167.0 | 384.0 | 441.0 |
| 4 | -0.958516 | 36.62122 | 2016-07-26T08:06:25 | 36MZE | time_lat_lon | NaN | NaN | NaN |


Let's clean up the `NaN` values and use the `describe()` method to print some stats about the data.

In [10]:
# drop NaN values and save as clean dataframe
df_clean = df.dropna()

# index by time using the axes.time column and sort descending;
# the ISO 8601 strings sort chronologically, so they are left unparsed
df_clean = df_clean.set_index('axes.time').sort_index(ascending=False)

print(df_clean.describe())
df_clean.head()

       axes.latitude  axes.longitude     data.blue   data.green      data.red
count      10.000000       10.000000     10.000000     10.00000     10.000000
mean       -0.958482       36.621199   5852.800000   5426.60000   5052.200000
std         0.000024        0.000028   7217.431842   6442.11051   5972.401542
min        -0.958516       36.621159    167.000000    384.00000    357.000000
25%        -0.958492       36.621172    262.000000    437.00000    425.250000
50%        -0.958480       36.621214    344.500000    505.00000    486.500000
75%        -0.958468       36.621216  13925.000000  12666.50000  11850.000000
max        -0.958452       36.621220  14929.000000  13441.00000  12344.000000


| axes.time | axes.latitude | axes.longitude | classifiers.tile | context | data.blue | data.green | data.red |
|---|---|---|---|---|---|---|---|
| 2016-08-12T07:57:10 | -0.958516 | 36.62122 | 36MZE | time_lat_lon | 13241.0 | 12131.0 | 11451.0 |
| 2016-08-12T07:57:10 | -0.958492 | 36.621159 | 37MBV | time_lat_lon | 14153.0 | 12845.0 | 11983.0 |
| 2016-08-12T07:57:10 | -0.958452 | 36.621212 | 36MZD | time_lat_lon | 14929.0 | 13441.0 | 12344.0 |
| 2016-08-12T07:57:10 | -0.958468 | 36.621216 | 37MBU | time_lat_lon | 14574.0 | 13190.0 | 12168.0 |
| 2016-08-02T07:55:56 | -0.958516 | 36.62122 | 36MZE | time_lat_lon | 167.0 | 384.0 | 441.0 |


## Additional API Query Parameters

The code below highlights some additional parameters that can be used in an API query.

* `buffer` is used to query a bounding box around the point of interest. The value is expressed in decimal degrees.
* `classifier:tile` can be used to request data from a single Sentinel-2 tile.
* `grouping` is used to compact data values along a particular axis. Currently we support the `location` value, which will return an array of values with indices corresponding to the latitude and longitude axes.

Let's use a combination of these parameters to request a compact response of data from the `37MBU` tile that resides within `0.001` degrees of our latitude and longitude.

In [11]:

params_compact = {'apikey': apikey, # always required
                  'buffer': 0.001, # return data from all points within 0.001 degree bounding box centered on lat/lon
                  'classifier:tile': '37MBU', # only return values from 37MBU tile
                  'count': 10, # number of values to return per classifier (e.g. tile)
                  'grouping': 'location', # compact into 2-d array by location axis (lat/lon)
                  'lat': lat, # latitude of interest
                  'lon': lon, # longitude of interest
                  'time_order': 'desc', # return data in descending chronological order
                  'var': 'red,green,blue', # return red, green and blue variables
                 }
ds_rgb_point_compact = api_point_query(ds_rgb_id, params_compact)
print(json.dumps(ds_rgb_point_compact))

{"entries": [{"indexAxes": [["latitude", [-0.9594591856002808, -0.9593690633773804, -0.95927894115448, -0.9591888189315796, -0.9590986967086792, -0.9590085744857788, -0.9589184522628784, -0.958828330039978, -0.9587382078170776, -0.9586480855941772, -0.9585579633712769, -0.9584679007530212, -0.9583777785301208, -0.9582876563072205, -0.9581975340843201, -0.9581074118614197, -0.9580172896385193, -0.9579271674156189, -0.9578370451927185, -0.9577469229698181, -0.9576568007469177, -0.9575666785240173]], ["longitude", [36.62022399902344, 36.62031555175781, 36.62040710449219, 36.6204948425293, 36.62058639526367, 36.62067794799805, 36.620765686035156, 36.62085723876953, 36.62094497680664, 36.621036529541016, 36.62112808227539, 36.6212158203125, 36.621307373046875, 36.62139892578125, 36.62148666381836, 36.621578216552734, 36.621665954589844, 36.62175750732422, 36.621849060058594, 36.6219367980957, 36.62202835083008, 36.62211990356445]]], "data": {"green": [[13694.0, 13655.0, 13607.0, 13668.0, 13
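
In the compact response, each variable is a 2-D array and `indexAxes` lists the coordinates along each dimension. The sketch below looks up the value nearest a coordinate. It assumes the first array dimension follows latitude and the second follows longitude, which the truncated output above does not confirm. The entry is a tiny hand-written stand-in with made-up values, and `nearest_index` and `value_at` are helpers introduced here.

```python
def nearest_index(values, target):
    '''Return the index of the element of `values` closest to `target`.'''
    return min(range(len(values)), key=lambda i: abs(values[i] - target))

def value_at(entry, var, lat, lon):
    '''Look up `var` at the grid cell nearest (lat, lon) in a compact entry.'''
    # indexAxes is a list of [name, coordinates] pairs
    axes = dict((name, coords) for name, coords in entry['indexAxes'])
    i = nearest_index(axes['latitude'], lat)
    j = nearest_index(axes['longitude'], lon)
    # Assumption: data[var] is indexed [latitude][longitude]
    return entry['data'][var][i][j]

# Hand-written stand-in for one compact entry; the values are illustrative only
sample_entry = {
    'indexAxes': [['latitude', [-0.9586, -0.9585, -0.9584]],
                  ['longitude', [36.6211, 36.6212]]],
    'data': {'green': [[1.0, 2.0],
                       [3.0, 4.0],
                       [5.0, 6.0]]},
}

print(value_at(sample_entry, 'green', -0.95849, 36.62118))
```

With a real response, `sample_entry` would be replaced by an element of `ds_rgb_point_compact['entries']`, after checking the axis order against the array shapes.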

## Cloud Cover

Cloud coverage is calculated on a per-tile, per-timestamp basis and stored in the `cloudy_pixels_percentage` variable. Because the coverage is on a per-tile basis, there are no latitude or longitude axes, only a time axis. Let's grab the cloud cover data and associate it with our RGB values from our original query.

In [12]:
# Use the same params query, but update 'var' to request cloud percentage
params['var'] = 'cloudy_pixels_percentage'
print(params, '\n') # output to refresh our memory

# Request the data and store in a dataframe
ds_rgb_clouds = api_point_query(ds_rgb_id, params)
# pd.json_normalize requires pandas 1.0+; older versions expose it as pd.io.json.json_normalize
df_clouds = pd.json_normalize(ds_rgb_clouds['entries'])

print(df_clouds.count())
df_clouds.head()

{'time_order': 'desc', 'var': 'cloudy_pixels_percentage', 'max_count': 'true', 'nearest': 'true', 'lon': 36.62117917585404, 'count': 5, 'lat': -0.9584909201199898, 'apikey': 'YOUR-API-KEY-GOES-HERE'} 

axes.time                        641
classifiers.tile                 641
context                          641
data.cloudy_pixels_percentage    641
dtype: int64


| | axes.time | classifiers.tile | context | data.cloudy_pixels_percentage |
|---|---|---|---|---|
| 0 | 2016-08-09T07:49:17 | 37MEV | time | 88.624298 |
| 1 | 2016-07-30T07:43:34 | 37MEV | time | 0.2132 |
| 2 | 2016-07-20T07:49:17 | 37MEV | time | 15.8947 |
| 3 | 2016-08-09T07:42:40 | 37NFD | time | 51.823601 |
| 4 | 2016-07-30T07:43:34 | 37NFD | time | 24.503901 |


In order to merge our RGB and cloud coverage datasets we'll first index both by time and tile.
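
Indexing by time and tile works because the pair acts as a join key: an RGB row and a cloud coverage row belong together when both values match. The plain-Python sketch below spells out that inner-join logic on a few values taken from this notebook's outputs; `inner_join` is a helper introduced here and is not how pandas implements the join.

```python
def inner_join(left, right):
    '''
    Join two dicts keyed by (time, tile), keeping only keys present in
    both and merging their column dicts.
    '''
    joined = {}
    for key in left:
        if key in right:
            row = dict(left[key])
            row.update(right[key])
            joined[key] = row
    return joined

rgb = {
    ('2016-08-12T07:57:10', '37MBU'): {'data.red': 12168.0},
    ('2016-08-02T07:55:56', '37MBU'): {'data.red': 385.0},
}
clouds = {
    ('2016-08-12T07:57:10', '37MBU'): {'data.cloudy_pixels_percentage': 92.040802},
    ('2016-08-02T07:55:56', '37MBU'): {'data.cloudy_pixels_percentage': 0.0},
    # A tile with no matching RGB row; the inner join drops it
    ('2016-08-15T08:08:34', '36MYE'): {'data.cloudy_pixels_percentage': 7.707},
}

merged = inner_join(rgb, clouds)
for key in sorted(merged, reverse=True):
    print(key, merged[key])
```

The cloud coverage response covers many more tiles than the four our point falls in, which is why an inner join, rather than an outer one, keeps the result limited to the rows we have RGB data for.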

In [13]:
# Index rgb dataframe by time and tile, sort in descending order
df_tt = df.set_index(['axes.time','classifiers.tile']).sort_index(ascending=False)
print(df_tt.count())
df_tt.head()

axes.latitude     17
axes.longitude    17
context           17
data.blue         10
data.green        10
data.red          10
dtype: int64


| axes.time | classifiers.tile | axes.latitude | axes.longitude | context | data.blue | data.green | data.red |
|---|---|---|---|---|---|---|---|
| 2016-08-15T08:08:34 | 36MZE | -0.958516 | 36.62122 | time_lat_lon | NaN | NaN | NaN |
| 2016-08-15T08:08:34 | 36MZD | -0.958452 | 36.621212 | time_lat_lon | NaN | NaN | NaN |
| 2016-08-12T07:57:10 | 37MBV | -0.958492 | 36.621159 | time_lat_lon | 14153.0 | 12845.0 | 11983.0 |
| 2016-08-12T07:57:10 | 37MBU | -0.958468 | 36.621216 | time_lat_lon | 14574.0 | 13190.0 | 12168.0 |
| 2016-08-12T07:57:10 | 36MZE | -0.958516 | 36.62122 | time_lat_lon | 13241.0 | 12131.0 | 11451.0 |


In [14]:
# Index cloud coverage dataframe by time and tile, sort in descending order
df_clouds_tt = df_clouds.set_index(['axes.time','classifiers.tile']).sort_index(ascending=False)
print(df_clouds_tt.count())
df_clouds_tt.head()

context                          641
data.cloudy_pixels_percentage    641
dtype: int64


| axes.time | classifiers.tile | context | data.cloudy_pixels_percentage |
|---|---|---|---|
| 2016-08-15T08:08:34 | 36MZE | time | 10.8187 |
| 2016-08-15T08:08:34 | 36MZD | time | 55.430099 |
| 2016-08-15T08:08:34 | 36MZC | time | 61.775501 |
| 2016-08-15T08:08:34 | 36MYE | time | 7.707 |
| 2016-08-15T08:08:34 | 36MYD | time | 6.3943 |


In [15]:
# Concatenate the RGB and cloud percentage dataframes using an inner join 
df_rgbc = pd.concat([df_tt, df_clouds_tt], axis=1, join='inner')

# Drop the context columns to clean up the dataframe
df_rgbc.drop(['context'], axis=1, inplace=True)

# Sort by descending time
df_rgbc.sort_index(ascending=False, inplace=True)

print(df_rgbc.count())
df_rgbc.head(10)

axes.latitude                    17
axes.longitude                   17
data.blue                        10
data.green                       10
data.red                         10
data.cloudy_pixels_percentage    17
dtype: int64


| axes.time | classifiers.tile | axes.latitude | axes.longitude | data.blue | data.green | data.red | data.cloudy_pixels_percentage |
|---|---|---|---|---|---|---|---|
| 2016-08-15T08:08:34 | 36MZE | -0.958516 | 36.62122 | NaN | NaN | NaN | 10.8187 |
| 2016-08-15T08:08:34 | 36MZD | -0.958452 | 36.621212 | NaN | NaN | NaN | 55.430099 |
| 2016-08-12T07:57:10 | 37MBV | -0.958492 | 36.621159 | 14153.0 | 12845.0 | 11983.0 | 58.606201 |
| 2016-08-12T07:57:10 | 37MBU | -0.958468 | 36.621216 | 14574.0 | 13190.0 | 12168.0 | 92.040802 |
| 2016-08-12T07:57:10 | 36MZE | -0.958516 | 36.62122 | 13241.0 | 12131.0 | 11451.0 | 34.561001 |
| 2016-08-12T07:57:10 | 36MZD | -0.958452 | 36.621212 | 14929.0 | 13441.0 | 12344.0 | 73.080704 |
| 2016-08-05T08:03:48 | 36MZE | -0.958516 | 36.62122 | NaN | NaN | NaN | 99.553596 |
| 2016-08-05T08:03:48 | 36MZD | -0.958452 | 36.621212 | NaN | NaN | NaN | 32.312599 |
| 2016-08-02T07:55:56 | 37MBV | -0.958492 | 36.621159 | 237.0 | 470.0 | 512.0 | 4.1371 |
| 2016-08-02T07:55:56 | 37MBU | -0.958468 | 36.621216 | 283.0 | 387.0 | 385.0 | 0.0 |


The dataframe `df_rgbc` now contains RGB and cloud coverage data. We can again use the `dropna` method to remove rows with `NaN` values for RGB.

In [16]:
df_rgbc_clean = df_rgbc.dropna()
print(df_rgbc_clean.count())
df_rgbc_clean.head()

axes.latitude                    10
axes.longitude                   10
data.blue                        10
data.green                       10
data.red                         10
data.cloudy_pixels_percentage    10
dtype: int64


| axes.time | classifiers.tile | axes.latitude | axes.longitude | data.blue | data.green | data.red | data.cloudy_pixels_percentage |
|---|---|---|---|---|---|---|---|
| 2016-08-12T07:57:10 | 37MBV | -0.958492 | 36.621159 | 14153.0 | 12845.0 | 11983.0 | 58.606201 |
| 2016-08-12T07:57:10 | 37MBU | -0.958468 | 36.621216 | 14574.0 | 13190.0 | 12168.0 | 92.040802 |
| 2016-08-12T07:57:10 | 36MZE | -0.958516 | 36.62122 | 13241.0 | 12131.0 | 11451.0 | 34.561001 |
| 2016-08-12T07:57:10 | 36MZD | -0.958452 | 36.621212 | 14929.0 | 13441.0 | 12344.0 | 73.080704 |
| 2016-08-02T07:55:56 | 37MBV | -0.958492 | 36.621159 | 237.0 | 470.0 | 512.0 | 4.1371 |


In [17]:
# We can also use groupby to compute statistics over all available timestamps at each unique point.
df_rgbc_clean.groupby(['axes.latitude','axes.longitude']).describe()

| axes.latitude | axes.longitude | | data.blue | data.cloudy_pixels_percentage | data.green | data.red |
|---|---|---|---|---|---|---|
| -0.958516 | 36.62122 | count | 2.0 | 2.0 | 2.0 | 2.0 |
| -0.958516 | 36.62122 | mean | 6704.0 | 19.514601 | 6257.5 | 5946.0 |
| -0.958516 | 36.62122 | std | 9244.714057 | 21.278823 | 8306.383359 | 7785.245661 |
| -0.958516 | 36.62122 | min | 167.0 | 4.4682 | 384.0 | 441.0 |
| -0.958516 | 36.62122 | 25% | 3435.5 | 11.9914 | 3320.75 | 3193.5 |
| -0.958516 | 36.62122 | 50% | 6704.0 | 19.514601 | 6257.5 | 5946.0 |
| -0.958516 | 36.62122 | 75% | 9972.5 | 27.037801 | 9194.25 | 8698.5 |
| -0.958516 | 36.62122 | max | 13241.0 | 34.561001 | 12131.0 | 11451.0 |
| -0.958492 | 36.621159 | count | 3.0 | 3.0 | 3.0 | 3.0 |
| -0.958492 | 36.621159 | mean | 4881.666667 | 23.5995 | 4618.333333 | 4318.666667 |
