### Purpose:

In order to get PaddockTS running without a dependency on DEA, we need a new way to download analysis-ready Sentinel data (or to download raw Sentinel data and pre-process it ourselves).

A prime candidate is the Microsoft Planetary Computer (MPC), because anyone can get an account and access Sentinel-1 and Sentinel-2 data.
https://planetarycomputer.microsoft.com/docs/quickstarts/reading-stac/

But it seems that the S2 data from MPC might have slightly different pre-processing to DEA. MPC mentions a Bottom of Atmosphere correction, and the cloud mask is not automatically applied.

This notebook will evaluate S2 time series downloaded using the MPC and DEA for the same time and place of interest. 

### Tests:
1. Are the same captures included using both methods?
2. Are the band values correlated? Visualise side-by-side and check biplots.
3. Can the s2cloudless mask be applied to the data downloaded from MPC to make it compatible?

### Step 1: prep and download some sample data:

In [1]:
import geopandas as gpd
import pandas as pd
import numpy as np
import pickle
import xarray as xr
import rioxarray  # activate the rio accessor
import matplotlib.pyplot as plt 
import seaborn as sns
import rasterio


In [2]:
stub = 'TEST4'
out_dir = '/g/data/xe2/John/Data/PadSeg/'

In [3]:
## DEA Sentinel 2 data
with open(out_dir+stub+'_ds2.pkl', 'rb') as handle:
    ds2_DEA = pickle.load(handle)

print(ds2_DEA)

<xarray.Dataset>
Dimensions:                     (time: 10, y: 205, x: 194)
Coordinates:
  * time                        (time) datetime64[ns] 2019-06-25T00:27:29.882...
  * y                           (y) float64 -4.424e+06 -4.424e+06 ... -4.426e+06
  * x                           (x) float64 1.388e+07 1.388e+07 ... 1.388e+07
    spatial_ref                 int32 6933
Data variables: (12/31)
    nbart_coastal_aerosol       (time, y, x) float32 57.0 57.0 ... 263.0 263.0
    nbart_blue                  (time, y, x) float32 221.0 219.0 ... 584.0 332.0
    nbart_green                 (time, y, x) float32 413.0 441.0 ... 916.0 560.0
    nbart_red                   (time, y, x) float32 579.0 558.0 ... 614.0
    nbart_red_edge_1            (time, y, x) float32 861.0 861.0 ... 1.267e+03
    nbart_red_edge_2            (time, y, x) float32 1.298e+03 ... 2.395e+03
    ...                          ...
    oa_nbart_contiguity         (time, y, x) uint8 1 1 1 1 1 1 1 ... 1 1 1 1 1 1
    oa_s2cloud

In [4]:
## MPC Sentinel 2 data
with open(out_dir+stub+'_ds2-L2A.pkl', 'rb') as handle:
    ds2_L2A = pickle.load(handle)

print(ds2_L2A)

<xarray.Dataset>
Dimensions:      (y: 205, x: 194, time: 30)
Coordinates:
  * y            (y) float64 -4.424e+06 -4.424e+06 ... -4.426e+06 -4.426e+06
  * x            (x) float64 1.388e+07 1.388e+07 ... 1.388e+07 1.388e+07
    spatial_ref  int32 6933
  * time         (time) datetime64[ns] 2019-06-05T00:20:59.024000 ... 2019-10...
Data variables:
    B01          (time, y, x) float32 5.115e+03 5.115e+03 ... 459.0 225.0
    B02          (time, y, x) float32 5.036e+03 5.176e+03 ... 691.0 449.0
    B03          (time, y, x) float32 4.676e+03 4.688e+03 ... 1.13e+03 690.0
    B04          (time, y, x) float32 4.32e+03 4.284e+03 ... 1.36e+03 830.0
    B05          (time, y, x) float32 4.467e+03 4.38e+03 ... 1.94e+03 1.121e+03
    B06          (time, y, x) float32 4.424e+03 4.33e+03 ... 2.671e+03 2.652e+03
    B07          (time, y, x) float32 4.315e+03 4.181e+03 ... 3.121e+03 3.32e+03
    B08          (time, y, x) float32 4.744e+03 4.828e+03 ... 3.692e+03
    B8A          (time, y, x) float3

In [5]:
## MPC Sentinel 1 data
with open(out_dir+stub+'_ds1.pkl', 'rb') as handle:
    ds1 = pickle.load(handle)


In [6]:
print("N. images")
print('Sentinel 2 DEA:', len(ds2_DEA.time))
print('Sentinel 2 MPC:',len(ds2_L2A.time))
print('Sentinel 1 MPC:',len(ds1.time))


N. images
Sentinel 2 DEA: 10
Sentinel 2 MPC: 30
Sentinel 1 MPC: 30


In [7]:
#ds2_DEA.time.values

In [8]:
#ds2_L2A.time.values

In [9]:
#ds1.time.values
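
The image counts above differ (10 vs 30), so Test 1 can be checked directly from the two `time` coordinates. The following is a minimal sketch rather than part of the original workflow: `compare_capture_dates` is a hypothetical helper that truncates timestamps to calendar days, mirroring the date-only matching used later in this notebook, and reports which dates appear in both sources or in only one. The toy timestamps stand in for `ds2_DEA.time.values` and `ds2_L2A.time.values`.

```python
import numpy as np

def compare_capture_dates(times_a, times_b):
    """Compare two arrays of datetime64 timestamps at day resolution."""
    days_a = np.unique(np.asarray(times_a).astype('datetime64[D]'))
    days_b = np.unique(np.asarray(times_b).astype('datetime64[D]'))
    return {
        'common': np.intersect1d(days_a, days_b),   # dates present in both sources
        'only_a': np.setdiff1d(days_a, days_b),     # dates only in the first source
        'only_b': np.setdiff1d(days_b, days_a),     # dates only in the second source
    }

# Toy timestamps (assumed values, for illustration only)
dea_times = np.array(['2019-06-25T00:27:29', '2019-07-05T00:27:31'],
                     dtype='datetime64[ns]')
mpc_times = np.array(['2019-06-05T00:20:59', '2019-06-25T00:27:29',
                      '2019-07-05T00:27:31'], dtype='datetime64[ns]')

result = compare_capture_dates(dea_times, mpc_times)
print('Common dates:', result['common'])
print('Only in second source:', result['only_b'])
```

With the real data this would be called as `compare_capture_dates(ds2_DEA.time.values, ds2_L2A.time.values)`.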

### Normalise data:

### Step 2 - Compare the S2 data from both sources
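
Comparing the two sources band by band needs a mapping between their variable names, since DEA uses `nbart_*` names and MPC uses Sentinel-2 band IDs. The sketch below is an assumption based on the usual DEA naming convention; only the first six `nbart_*` names are visible in the printed dataset above, so the remaining entries should be checked against `list(ds2_DEA.data_vars)`. `DEA_TO_MPC` and `rename_dea_bands` are hypothetical names.

```python
# Assumed correspondence between DEA NBART variable names and Sentinel-2 band IDs.
DEA_TO_MPC = {
    'nbart_coastal_aerosol': 'B01',
    'nbart_blue': 'B02',
    'nbart_green': 'B03',
    'nbart_red': 'B04',
    'nbart_red_edge_1': 'B05',
    'nbart_red_edge_2': 'B06',
    'nbart_red_edge_3': 'B07',
    'nbart_nir_1': 'B08',
    'nbart_nir_2': 'B8A',
    'nbart_swir_2': 'B11',
    'nbart_swir_3': 'B12',
}

def rename_dea_bands(names, mapping=DEA_TO_MPC):
    """Return MPC-style names for DEA variables, leaving unknown names unchanged."""
    return [mapping.get(name, name) for name in names]

renamed = rename_dea_bands(['nbart_red', 'nbart_nir_1', 'oa_nbart_contiguity'])
print(renamed)
```

On the dataset itself, the same mapping could be applied with `ds2_DEA.rename({k: v for k, v in DEA_TO_MPC.items() if k in ds2_DEA})`.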

In [10]:
import numpy as np
import xarray as xr
from s2cloudless import S2PixelCloudDetector

def add_cloud_probability_and_mask(ds, 
                                   bands=['B02', 'B03', 'B04', 'B05', 'B06', 
                                          'B07', 'B08', 'B8A', 'B11', 'B12'],
                                   scale_factor=10000,
                                  threshold = 0.3):
    """
    Adds cloud probability and binary cloud mask variables to a Sentinel-2 time series dataset.
    
    Parameters:
      ds (xarray.Dataset): Input dataset with Sentinel-2 time series data.
                           It must have dimensions 'time', 'y', and 'x' and contain
                           the specified bands.
      bands (list of str): List of band names to use.
                           For 10-band mode, use:
                           ['B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B11', 'B12'].
      scale_factor (float): Factor to convert digital numbers to reflectance.
                            Default is 10000.
      threshold (float): Cloud probability threshold (passed to the cloud detector).
    
    Returns:
      xarray.Dataset: A new dataset with two additional variables:
                      - 'cloud_probability' (float32, range [0,1])
                      - 'cloud_mask' (int, 0=clear, 1=cloud)
                      Both with shape (time, y, x).
    """
    # Initialize the cloud detector in 10-band mode (all_bands=False is default)
    cloud_detector = S2PixelCloudDetector(threshold=threshold, average_over=4, dilation_size=2, all_bands=False)

    # Lists to hold cloud probability maps and cloud masks for each time step
    cloud_prob_list = []
    cloud_mask_list = []
    
    # Loop over each time step
    for t in ds.time:
        band_arrays = []
        for band in bands:
            # Select the data for the current time, convert to float32, and scale to reflectance.
            arr = ds[band].sel(time=t).values.astype(np.float32) / scale_factor
            band_arrays.append(arr)
        # Stack the bands along the last axis; resulting shape: (y, x, len(bands))
        img = np.stack(band_arrays, axis=-1)
        
        # The detector expects an input with a leading dimension for the number of images,
        # so add an extra axis: shape becomes (1, y, x, len(bands))
        img_expanded = np.expand_dims(img, axis=0)
        
        # Compute the cloud probability map.
        # This returns an array of shape (N, y, x). Since N=1, take the first element.
        cp = cloud_detector.get_cloud_probability_maps(img_expanded)[0]
        cloud_prob_list.append(cp)
        
        # Compute the binary cloud mask using the probability map.
        # The method expects an array of shape (N, y, x), so add a leading axis to cp.
        mask = cloud_detector.get_mask_from_prob(np.expand_dims(cp, axis=0), threshold=threshold)[0]
        cloud_mask_list.append(mask)
    
    # Stack the lists along the time axis; resulting shape: (time, y, x)
    cloud_prob_array = np.stack(cloud_prob_list, axis=0)
    cloud_mask_array = np.stack(cloud_mask_list, axis=0)
    
    # Create a new dataset (or copy the original) and add the new variables
    ds_out = ds.copy()
    ds_out['cloud_probability'] = (('time', 'y', 'x'), cloud_prob_array.astype(np.float32))
    ds_out['cloud_mask'] = (('time', 'y', 'x'), cloud_mask_array.astype(np.int8))
    
    return ds_out

# Example usage:
# ds_with_clouds = add_cloud_probability_and_mask(your_xarray_dataset)
# print(ds_with_clouds)



In [None]:
ds2_L2A_cm = add_cloud_probability_and_mask(ds2_L2A)
print(ds2_L2A_cm)
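
For Test 3, the computed mask still has to be applied to the band data. The following is a rough sketch of one way to do that on plain arrays: cloudy pixels are set to NaN, and time steps that are mostly cloud are flagged for removal. `mask_clouds` and the `max_cloud_fraction` value of 0.5 are assumptions for illustration, not settings used by DEA.

```python
import numpy as np

def mask_clouds(band, cloud_mask, max_cloud_fraction=0.5):
    """
    band:       float array of shape (time, y, x)
    cloud_mask: integer array of shape (time, y, x), 1 = cloud, 0 = clear
    Returns (masked, keep): cloudy pixels set to NaN, and a boolean array
    flagging time steps whose cloud fraction is <= max_cloud_fraction.
    """
    band = np.asarray(band, dtype=np.float32)
    cloudy = np.asarray(cloud_mask) == 1
    masked = np.where(cloudy, np.nan, band)
    cloud_fraction = cloudy.mean(axis=(1, 2))  # fraction of cloudy pixels per time step
    keep = cloud_fraction <= max_cloud_fraction
    return masked, keep

# Toy example: two time steps of 2 x 2 pixels
band = np.full((2, 2, 2), 500.0, dtype=np.float32)
mask = np.array([[[0, 0], [0, 1]],
                 [[1, 1], [1, 0]]], dtype=np.int8)

masked, keep = mask_clouds(band, mask)
print(keep)
```

On the dataset, `ds2_L2A_cm.where(ds2_L2A_cm['cloud_mask'] == 0)` masks every variable in the same way, and `ds2_L2A_cm.isel(time=keep)` would drop the flagged time steps.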

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def plot_l2a_clouds(ds2_L2A_cm):
    """
    Create a subplot for each time step in ds2_L2A_cm showing:
      - Left: RGB composite using bands B04 (red), B03 (green), B02 (blue)
      - Middle: Cloud Probability (with a colorbar)
      - Right: Cloud Mask (binary image)
      
    Parameters:
        ds2_L2A_cm (xarray.Dataset): Dataset with Sentinel-2 L2A data including 
                                     B02, B03, B04, cloud_probability, and cloud_mask.
    """
    # Determine the number of time steps
    n_time = ds2_L2A_cm.sizes['time']

    # Create subplots: one row per time step, three columns (RGB, Cloud Probability, Cloud Mask)
    fig, axes = plt.subplots(n_time, 3, figsize=(15, 2.5 * n_time))

    # Ensure axes is 2D even if there's only one time step
    if n_time == 1:
        axes = np.array([axes])
    
    # Loop over each time step
    for i, t in enumerate(ds2_L2A_cm.time.values):
        # --- Construct RGB image ---
        red = ds2_L2A_cm['B04'].sel(time=t).values
        green = ds2_L2A_cm['B03'].sel(time=t).values
        blue = ds2_L2A_cm['B02'].sel(time=t).values
        rgb = np.stack([red, green, blue], axis=-1)
        # Normalize the RGB image; adjust the divisor as needed (e.g., 3000 here)
        rgb_norm = np.clip(rgb / 3000.0, 0, 1)
        
        # --- Plot RGB (Left Column) ---
        ax_rgb = axes[i, 0]
        ax_rgb.imshow(rgb_norm)
        ax_rgb.set_title(f"RGB {np.datetime_as_string(t, unit='D')}")
        ax_rgb.axis('off')
        
        # --- Extract and Plot Cloud Probability (Middle Column) ---
        cp = ds2_L2A_cm['cloud_probability'].sel(time=t).values
        ax_cp = axes[i, 1]
        im_cp = ax_cp.imshow(cp, cmap='viridis')
        ax_cp.set_title("Cloud Probability")
        ax_cp.axis('off')
        fig.colorbar(im_cp, ax=ax_cp, fraction=0.046, pad=0.04)
        
        # --- Extract and Plot Cloud Mask (Right Column) ---
        cm = ds2_L2A_cm['cloud_mask'].sel(time=t).values
        ax_cm = axes[i, 2]
        ax_cm.imshow(cm, cmap='gray')
        ax_cm.set_title("Cloud Mask")
        ax_cm.axis('off')

    plt.tight_layout()
    plt.show()




In [None]:
plot_l2a_clouds(ds2_L2A_cm)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def compare_sentinel_data(ds2_DEA, ds2_L2A_cm):
    """
    Compare Sentinel-2 data from two processing chains by plotting four panels for each time step,
    matching on the date (ignoring time-of-day).
    
    The four panels (columns) are:
      1. L2A RGB composite (using B04, B03, B02) from ds2_L2A_cm.
      2. DEA (NBART) RGB composite (using nbart_red, nbart_green, nbart_blue) from ds2_DEA (if available for that date).
      3. Cloud Mask from ds2_L2A_cm.
      4. Scatter plot comparing the red band values (L2A vs. DEA) if DEA data is available.
      
    Parameters:
      ds2_DEA (xarray.Dataset): DEA-processed dataset with bands nbart_red, nbart_green, nbart_blue.
      ds2_L2A_cm (xarray.Dataset): L2A dataset with bands B04, B03, B02 and cloud_mask.
    """
    # Build a dictionary mapping dates (YYYY-MM-DD) to the corresponding time value in ds2_DEA.
    dea_date_dict = {}
    for t in ds2_DEA.time.values:
        date_str = np.datetime_as_string(t, unit='D')
        # If multiple DEA observations exist for the same date, we keep the first occurrence.
        if date_str not in dea_date_dict:
            dea_date_dict[date_str] = t
    
    # Use the time values from ds2_L2A_cm as the reference timeline.
    l2a_times = ds2_L2A_cm.time.values
    num_times = len(l2a_times)
    
    # Create a figure with one row per time step and 4 columns.
    fig, axes = plt.subplots(num_times, 4, figsize=(16, 4 * num_times))
    
    # Ensure axes is 2D even if there's only one time step.
    if num_times == 1:
        axes = np.expand_dims(axes, axis=0)
    
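    # Simple normalization function: scales image using the 2nd and 98th percentiles.
    # NaN-aware, so any nodata pixels do not blank out the whole panel.
    def normalize(img):
        img_min, img_max = np.nanpercentile(img, [2, 98])
        if img_max == img_min:
            return np.zeros_like(img, dtype=float)
        return np.clip((img - img_min) / (img_max - img_min), 0, 1)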
    
    # Loop over each time step from ds2_L2A_cm.
    for i, t in enumerate(l2a_times):
        # Use the full timestamp for display and the date-only string for matching.
        time_str = np.datetime_as_string(t, unit='s')
        date_str = np.datetime_as_string(t, unit='D')
        
        # --- Column 1: L2A RGB Composite ---
        ds2_L2A_time = ds2_L2A_cm.sel(time=t)
        l2a_red = ds2_L2A_time['B04'].values
        l2a_green = ds2_L2A_time['B03'].values
        l2a_blue = ds2_L2A_time['B02'].values
        l2a_rgb = np.stack((l2a_red, l2a_green, l2a_blue), axis=-1)
        
        # Normalize each channel for display.
        l2a_rgb_norm = np.zeros_like(l2a_rgb, dtype=float)
        for band in range(3):
            l2a_rgb_norm[..., band] = normalize(l2a_rgb[..., band])
        
        ax0 = axes[i, 0]
        ax0.imshow(l2a_rgb_norm)
        ax0.set_title(f"L2A RGB\n{time_str}")
        ax0.axis('off')
        
        # --- Column 2: DEA RGB Composite (if available for the matching date) ---
        if date_str in dea_date_dict:
            ds2_DEA_time = ds2_DEA.sel(time=dea_date_dict[date_str])
            dea_red = ds2_DEA_time['nbart_red'].values
            dea_green = ds2_DEA_time['nbart_green'].values
            dea_blue = ds2_DEA_time['nbart_blue'].values
            dea_rgb = np.stack((dea_red, dea_green, dea_blue), axis=-1)
            
            dea_rgb_norm = np.zeros_like(dea_rgb, dtype=float)
            for band in range(3):
                dea_rgb_norm[..., band] = normalize(dea_rgb[..., band])
            
            ax1 = axes[i, 1]
            ax1.imshow(dea_rgb_norm)
            ax1.set_title(f"DEA RGB\n{date_str}")
            ax1.axis('off')
        else:
            ax1 = axes[i, 1]
            ax1.text(0.5, 0.5, 'Missing Data', ha='center', va='center', fontsize=12)
            ax1.axis('off')
        
        # --- Column 3: Cloud Mask from L2A ---
        ax2 = axes[i, 2]
        cloud_mask = ds2_L2A_time['cloud_mask'].values
        ax2.imshow(cloud_mask, cmap='gray')
        ax2.set_title("Cloud Mask")
        ax2.axis('off')
        
        # --- Column 4: Scatter Plot (Red band: L2A vs. DEA) ---
        ax3 = axes[i, 3]
        if date_str in dea_date_dict:
            # Flatten arrays so each point represents a pixel.
            dea_red_flat = dea_red.flatten()
            l2a_red_flat = l2a_red.flatten()
            ax3.scatter(l2a_red_flat, dea_red_flat, s=1, alpha=0.5)
            ax3.set_xlabel("L2A Red")
            ax3.set_ylabel("DEA Red")
            ax3.set_title("Red Band Scatter")
        else:
            ax3.text(0.5, 0.5, 'Missing Data', ha='center', va='center', fontsize=12)
            ax3.axis('off')
    
    plt.tight_layout()
    plt.show()


In [None]:
compare_sentinel_data(ds2_DEA, ds2_L2A_cm)
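
The scatter panels give a visual answer to Test 2; a numeric summary makes it easier to compare dates and bands. The sketch below is an assumed approach, with `band_agreement` as a hypothetical helper: it computes the Pearson correlation and a least-squares line between two matching band arrays, using only pixels that are finite in both.

```python
import numpy as np

def band_agreement(a, b):
    """Pearson r, slope and intercept of b against a over pixels valid in both."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    valid = np.isfinite(a) & np.isfinite(b)
    a, b = a[valid], b[valid]
    r = np.corrcoef(a, b)[0, 1]
    slope, intercept = np.polyfit(a, b, 1)  # fit b = slope * a + intercept
    return {'n': int(valid.sum()), 'r': r, 'slope': slope, 'intercept': intercept}

# Toy example: the second array is a linear function of the first, with one missing pixel
l2a_red = np.array([[100.0, 200.0], [300.0, np.nan]])
dea_red = 0.9 * l2a_red + 20.0

stats = band_agreement(l2a_red, dea_red)
print(stats)
```

For a matched date this could be called as `band_agreement(ds2_L2A_time['B04'].values, ds2_DEA_time['nbart_red'].values)`, ideally after cloudy pixels have been set to NaN so they do not dominate the fit.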

# Trying Copernicus download method:

In [11]:
from sentinelsat import SentinelAPI, geojson_to_wkt
import datetime

In [12]:
# connect to the API
# NOT PROPERLY SET UP

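import os
from sentinelsat import SentinelAPI, read_geojson, geojson_to_wkt
from datetime import date

# Read credentials from the environment rather than hard-coding them in the notebook.
# COPERNICUS_USER and COPERNICUS_PASSWORD are placeholder variable names.
api = SentinelAPI(os.environ['COPERNICUS_USER'], os.environ['COPERNICUS_PASSWORD'], 'https://apihub.copernicus.eu/apihub')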

ROI = '/home/106/jb5097/Projects/PaddockTS/Planet_dl/ARBO.geojson'
# search by polygon, time, and SciHub query keywords
footprint = geojson_to_wkt(read_geojson(ROI))

products = api.query(footprint,
                     date=('20151219', date(2015, 12, 29)),
                     platformname='Sentinel-2')

# convert to Pandas DataFrame
products_df = api.to_dataframe(products)

# sort and limit to first 5 sorted products
products_df_sorted = products_df.sort_values(['cloudcoverpercentage', 'ingestiondate'], ascending=[True, True])
products_df_sorted = products_df_sorted.head(5)

# download sorted and reduced products
api.download_all(products_df_sorted.index)

ConnectTimeout: HTTPSConnectionPool(host='apihub.copernicus.eu', port=443): Max retries exceeded with url: /apihub/search?format=json&rows=100&start=0&q=beginPosition%3A%5B%222015-12-19T00%3A00%3A00Z%22+TO+%222015-12-29T00%3A00%3A00Z%22%5D+platformname%3A%22Sentinel-2%22+footprint%3A%22Intersects%28GEOMETRYCOLLECTION%28POLYGON%28%28149.0413+-35.2574%2C149.0413+-35.3157%2C149.1131+-35.3157%2C149.1131+-35.2574%2C149.0413+-35.2574%29%29%29%29%22 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x14ec08de3c50>, 'Connection to apihub.copernicus.eu timed out. (connect timeout=None)'))

In [None]:
import os
import json
import glob
import zipfile
import datetime
import numpy as np
import xarray as xr
import rioxarray
from sentinelsat import SentinelAPI, geojson_to_wkt
from s2cloudless import S2PixelCloudDetector



In [None]:
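# Read credentials from the environment rather than hard-coding them in the notebook.
# COPERNICUS_USER and COPERNICUS_PASSWORD are placeholder variable names.
api = SentinelAPI(os.environ['COPERNICUS_USER'], os.environ['COPERNICUS_PASSWORD'], 'https://apihub.copernicus.eu/apihub')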

ROI = '/home/106/jb5097/Projects/PaddockTS/Planet_dl/ARBO.geojson'

# Load area of interest from a GeoJSON file
with open(ROI) as f:
    roi_geojson = json.load(f)
# Convert the first feature to a WKT string
footprint = geojson_to_wkt(roi_geojson['features'][0])

# Define date range (YYYYMMDD format)
start_date = '20220101'
end_date   = '20220131'
date_range = (start_date, end_date)

# Query for Sentinel-2 Level-2A products with moderate cloud cover
products = api.query(footprint,
                     date=date_range,
                     platformname='Sentinel-2',
                     processinglevel='Level-2A',
                     cloudcoverpercentage=(0, 30))

# Optionally inspect results
products_df = api.to_dataframe(products)
print(products_df.head())


In [None]:

# For this example, pick the first product from the query
product_id = list(products.keys())[0]

# Download the product into a directory called 'downloads'
download_dir = 'downloads'
os.makedirs(download_dir, exist_ok=True)
api.download(product_id, directory_path=download_dir)

# --- Step 2: Extract & Load Selected Bands ---

# The product is downloaded as a zip file.
# Its name is typically the product title with a .zip extension.
product_title = products_df.loc[product_id]['title']
zip_path = os.path.join(download_dir, product_title + '.zip')

# Extract the SAFE archive
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(download_dir)

# Locate the SAFE directory (ends with .SAFE)
safe_dir = os.path.join(download_dir, product_title + '.SAFE')

# Sentinel-2 L2A stores data in the GRANULE directory, with one folder per resolution.
# s2cloudless in 10-band mode needs B02, B03, B04, B05, B06, B07, B08, B8A, B11 and B12.
# B08 only exists at 10 m (R10m); the other nine are all available at 20 m (R20m).
granule_dir = glob.glob(os.path.join(safe_dir, 'GRANULE', '*'))[0]
img_data_dir = os.path.join(granule_dir, 'IMG_DATA')

s2cloudless_bands = ['B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B11', 'B12']

# Build a dictionary mapping band names to file paths.
# (Adjust the glob pattern if needed.)
band_files = {}
for band in s2cloudless_bands:
    res = '10m' if band == 'B08' else '20m'
    band_files[band] = glob.glob(os.path.join(img_data_dir, 'R' + res, f'*_{band}_{res}.jp2'))[0]

# Open each band as a DataArray with rioxarray and merge into one Dataset.
# B08 is resampled onto the 20 m grid of B02 so that all bands share the same shape.
ref = rioxarray.open_rasterio(band_files['B02']).squeeze('band', drop=True)
da_list = []
for band, path in band_files.items():
    # Remove the band dimension and rename the DataArray
    da = rioxarray.open_rasterio(path).squeeze('band', drop=True)
    if da.shape != ref.shape:
        da = da.rio.reproject_match(ref).assign_coords(x=ref.x, y=ref.y)
    da_list.append(da.rename(band))

# Merge into an xarray Dataset
ds = xr.merge(da_list)

# Convert digital numbers to reflectance.
# Sentinel-2 L2A products are typically scaled by 10000.
# Products from processing baseline 04.00 onwards (late January 2022) also carry
# a -1000 offset (BOA_ADD_OFFSET), which is not handled here.
ds = ds / 10000.0

# --- Step 3: Compute and Apply the s2cloudless Mask ---

# Prepare the image for s2cloudless:
# s2cloudless expects an array of shape (n_images, height, width, n_bands),
# with the bands in the order listed in s2cloudless_bands.
img_array = np.stack([ds[band].values for band in s2cloudless_bands], axis=-1)
img_array = img_array[np.newaxis, ...].astype(np.float32)

# Initialize the s2cloudless cloud detector in 10-band mode (tweak threshold as needed)
cloud_detector = S2PixelCloudDetector(threshold=0.4, average_over=4, dilation_size=2, all_bands=False)

# Generate the binary cloud mask (shape: height x width, 0 = clear, 1 = cloud)
cloud_mask = cloud_detector.get_cloud_masks(img_array)[0]

# Add the cloud mask as a new variable in the dataset.
# Here we assume the spatial dimensions are named 'y' and 'x' (as set by rioxarray).
ds['cloud_mask'] = (('y', 'x'), cloud_mask)

# Now, ds is an xarray Dataset containing your selected Sentinel-2 bands and a cloud mask.
print(ds)