# SENTINEL2 Level-1C Data Preprocessor
## Capstone - Fall 2020
### TP Goter

This notebook preprocesses the Sentinel-2 images and irrigation labels extracted from Google Earth Engine using a modified version of the [Irrigation30 tool](https://github.com/AngelaWuGitHub/irrigation30) developed by a capstone team from Summer 2020. To minimize the number of files written to Google Drive, predictions and images were gathered for areas of land that are larger than what we want to use for training our model. The purpose of this notebook is therefore to:
1. Combine the separate multispectral bands for each large image (12 channels).
2. Split the MSI into non-overlapping subregions.
3. Convert cluster numbers into irrigated/non-irrigated labels by month.
4. Create a dataframe for each initial image. The dataframe contains a row for every sub-image created, along with the prediction array and some metadata.
5. Serialize the dataframe as a pickle file.

In [9]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import cv2
import os
import rasterio

In [162]:
os.listdir()

['irrigation30',
 'S2_2018_35.625_-119.125',
 'test01.png',
 'S2_2018_35.625_-119.375',
 'GDAL python samples',
 'test02.png',
 'S2_2018_35.875_-119.375',
 'S2_2018_35.375_-119.875',
 'test03.png',
 'S2_2018_35.875_-119.125',
 '.DS_Store',
 'S2_2018_35.125_-119.625',
 'test04.png',
 'preprocessor.ipynb',
 'S2_2018_35.875_-119.625',
 'S2_2018_36.375_-119.625',
 'S2_2018_35.625_-119.625',
 'S2_2018_36.625_-119.875.csv',
 'S2_2018_36.375_-119.375.csv',
 'S2_2018_35.125_-119.375',
 'S2_2018_36.125_-119.875',
 'gather_california_data.ipynb',
 'S2_2018_35.125_-119.125',
 'S2_2018_35.375_-119.625',
 'BigEarthData',
 'S2_2018_36.125_-119.125',
 'data_qa.ipynb',
 'S2_2018_35.125_-119.875',
 'S2_2018_36.125_-119.375',
 '.ipynb_checkpoints',
 'S2_2018_35.625_-119.875',
 'S2_2018_35.375_-119.125',
 'S2_2018_36.375_-119.875',
 'S2_2018_35.875_-119.875',
 'S2_2018_35.375_-119.375',
 'S2_2018_36.125_-119.625',
 'S2_2018_36.375_-119.125.csv']
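
The listing above mixes image folders, key CSV files, notebooks, and test images. As a minimal sketch, the image folders could be picked out by name and their coordinates parsed. The pattern `S2_<year>_<lat>_<lon>` is inferred from the folder names shown, and the helper names `parse_tile_name` and `select_tile_folders` are hypothetical, not part of this notebook.

```python
import re

# Pattern assumed from the folder names in the listing: S2_<year>_<lat>_<lon>
FOLDER_PATTERN = re.compile(r'^S2_(\d{4})_(-?\d+\.\d+)_(-?\d+\.\d+)$')

def parse_tile_name(name):
    """Return (year, lat, lon) for a tile folder name, or None if it does not match."""
    match = FOLDER_PATTERN.match(name)
    if match is None:
        return None
    year, lat, lon = match.groups()
    return int(year), float(lat), float(lon)

def select_tile_folders(names):
    """Keep only names that look like tile folders, sorted for reproducibility."""
    return sorted(n for n in names if parse_tile_name(n) is not None)

# A few entries copied from the listing above
listing = ['irrigation30', 'S2_2018_35.625_-119.125', 'test01.png',
           'S2_2018_36.625_-119.875.csv', '.DS_Store']
print(select_tile_folders(listing))
print(parse_tile_name('S2_2018_35.625_-119.125'))
```

This only filters by name. A check such as `os.path.isdir` could be added if a file ever shares the folder naming pattern.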

## General Setup

In [218]:
# Use all bands but B10
BAND_NAMES = ['B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B8A', 'B9', 'B11', 'B12']

# According to the Sentinel guide, all MSI data is scaled by a factor of 10000
SCALE_FACTOR = 10000
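
To illustrate what the scale factor does, here is a small sketch that divides stored integer values by `SCALE_FACTOR` and casts the result to `float32`, which is the same conversion applied to each sub-image later in this notebook. The input values are made up for illustration.

```python
import numpy as np

SCALE_FACTOR = 10000  # same value as in the setup cell

# Hypothetical stored values for a 2x2 patch of one band
raw = np.array([[0, 1250], [5000, 10000]], dtype=np.uint16)

# Divide by the scale factor and store as float32 to halve memory versus float64
scaled = np.float32(raw / SCALE_FACTOR)
print(scaled)
```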

## Process File

The function below does everything described in the initial markdown cell of this notebook. 

### Load Predictions
The cluster predictions from Irrigation30 are stored in a tiff file, which essentially gives us irrigation by month. One problem with these labels is that cluster 0 can correspond to irrigated land, but zero is also assigned to non-cropland. To account for this properly, we load a second tiff file that simply marks each pixel as irrigated or not irrigated.
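
The ambiguity of cluster 0 can be sketched with two small arrays. The values are made up, and the encoding of the second raster (1 for irrigated, 0 for not irrigated) is an assumption for illustration.

```python
import numpy as np

# Hypothetical 2x3 cluster raster: 0 is ambiguous (irrigated cluster 0 or non-cropland)
clusters = np.array([[0, 0, 2],
                     [1, 0, 2]])

# Hypothetical binary raster: 1 = irrigated, 0 = not irrigated (assumed encoding)
irrigated_mask = np.array([[1, 0, 1],
                           [0, 0, 1]])

# Suppose clusters 0 and 2 are the irrigated clusters for the month of interest
candidate = np.isin(clusters, [0, 2]).astype(np.int16)

# Multiplying by the mask zeroes out pixels the binary raster marks as not irrigated
labels = candidate * irrigated_mask
print(labels)
```

Without the mask, every zero-valued pixel would be labeled irrigated whenever cluster 0 is an irrigated cluster.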

### Decode the Clusters
While running Irrigation30 for each image, we parsed the cluster labels to generate a decoder key, which was stored as a dataframe (saved to a csv file). We can use this key to expand our cluster tiff into irrigated/not irrigated (i.e., 1 or 0) predictions by month.
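
A minimal sketch of the decoding step follows. It assumes a key with one row per cluster and one column per month, where a positive value means the cluster is irrigated in that month. The column names `m1` to `m3`, the values, and the helper `decode_month` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical decoder key: one row per cluster, one column per month (3 months shown)
key_df = pd.DataFrame({'cluster': [0, 1, 2],
                       'm1': [0, 1, 0],
                       'm2': [1, 0, 0],
                       'm3': [0, 1, 1]})

# Hypothetical 2x2 cluster raster
clusters = np.array([[0, 1],
                     [2, 1]])

def decode_month(clusters, key_df, month):
    """Return a 0/1 array marking pixels whose cluster is irrigated in `month` (1-based)."""
    # Column 0 holds the cluster id, so column `month` holds that month's flag
    irrigated = key_df.loc[key_df.iloc[:, month] > 0, 'cluster'].to_numpy()
    return np.isin(clusters, irrigated).astype(np.int16)

print(decode_month(clusters, key_df, 1))
print(decode_month(clusters, key_df, 3))
```

Using `np.isin` handles all irrigated clusters for a month in one pass, as an alternative to summing one `np.where` result per cluster.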

### Chunk the MSI and Label Data
Use NumPy's `split` function to break the large image into many small, non-overlapping images.
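
The splitting can be sketched on a tiny array. The helper name `split_into_tiles` is hypothetical. It mirrors the slicing used in this notebook, in which the last piece along each axis is always discarded.

```python
import numpy as np

def split_into_tiles(image, tile_dim):
    """Split a (rows, cols, bands) array into tile_dim x tile_dim tiles.

    The last piece along each axis is dropped, as in the notebook's function.
    """
    tiles = []
    row_chunks = np.split(image, np.arange(tile_dim, image.shape[0], tile_dim), axis=0)
    for row_chunk in row_chunks[:-1]:
        col_chunks = np.split(row_chunk, np.arange(tile_dim, image.shape[1], tile_dim), axis=1)
        tiles.extend(col_chunks[:-1])
    return tiles

# Hypothetical 5x7 image with 2 bands, cut into 2x2 tiles
image = np.arange(5 * 7 * 2).reshape(5, 7, 2)
tiles = split_into_tiles(image, 2)
print(len(tiles), tiles[0].shape)
```

One consequence of dropping the last piece unconditionally is that a full-size final tile is also discarded when the image size is an exact multiple of the tile size.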

### Store data with metadata
Store the sub-images in a dataframe along with some metadata, then serialize the dataframe as a pickle file.
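
As a rough sketch of this storage pattern, the example below puts small arrays into dataframe cells, adds a metadata column and a per-row irrigated pixel count, and round-trips the result through a pickle file. The array sizes and the temporary file path are made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Two hypothetical 2x2 sub-images with 3 bands, plus matching label arrays
msi = [np.zeros((2, 2, 3), dtype=np.float32), np.ones((2, 2, 3), dtype=np.float32)]
preds = [np.array([[0, 1], [1, 1]], dtype=np.int16), np.zeros((2, 2), dtype=np.int16)]

# Each cell of the 'msi' and 'predictions' columns holds a whole array
df = pd.DataFrame({'msi': msi, 'predictions': preds})
df['month'] = 1
df['tot_irr_locs'] = df['predictions'].map(lambda x: np.float32(x.sum()))

# Round-trip through a pickle file in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.pkl')
    df.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored[['month', 'tot_irr_locs']])
```

Pickle keeps the array-valued cells intact, which a flat format such as CSV would not do without extra encoding.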


In [224]:
def pandatize_image(base_path, band_names=BAND_NAMES, img_dim=100, scale=SCALE_FACTOR):
    '''
    base_path: String - Path to folder with tif file for each spectral band and month
    band_names: List - List of bands to iterate over
    img_dim: int -  Dimension of final images desired - assumes square i.e., nxn
    scale: numeric - Value by which to divide the arrays of data in the tiff file
    
    '''
    # Get the cluster labels by pixel
    with rasterio.open(f'{base_path}/{base_path}.tif') as pred_ds:
        predictions = pred_ds.read(1)
    
    # Get irrigated/not irrigated by pixel 
    with rasterio.open(f'{base_path}/{base_path}_RI.tif') as mask_ds:
        rain_irr = mask_ds.read(1)  
    
    # Read the cluster key df
    key_df = pd.read_csv(f'{base_path}/{base_path}.csv').rename({'Unnamed: 0': 'cluster'}, axis=1)
    #print(key_df)
    
    # Final desired dimension of each image
    img_array_dim = (img_dim, img_dim, len(band_names))

    # List to store temporary dataframes
    dfs = []
    
    # Iterate over all months
    for month in range(1,13):
        ia_list = []
        pred_list = []

        # Find the cluster labels that had irrigation for the current month
        irrigated_clusters = key_df[key_df.iloc[:, month] > 0].iloc[:,month].index.values
        
        # Initialize our prediction matrix
        month_preds = np.zeros_like(predictions)
        
        # Loop over the irrigated cluster
        for cluster in irrigated_clusters:
            # Identify the pixels that have the cluster of interest. Turn their label to 1
            print(f'Irrigation detected in cluster {cluster} for month {month}')
            month_preds = month_preds + np.where(predictions == cluster, 1, 0)
        
        # Account for the non-cropland areas (multiply by zero)
        month_preds = month_preds * rain_irr

        try:
            # Loop over all bands and read the actual MSI data in and create 3d array
            for b, band in enumerate(band_names):
                with rasterio.open(f'{base_path}/{base_path}_msi_{band}_{month}.tif') as data_ds:
                    data = data_ds.read(1)
                    if b == 0:
                        combined_data = np.zeros((data.shape[0], data.shape[1], len(band_names)))
                    combined_data[:,:,b] = data

            # Split the large 3d array into many subpieces - first in the row direction
            rows = np.split(combined_data, np.arange(img_dim, data.shape[0], img_dim))
            pred_rows = np.split(month_preds, np.arange(img_dim, data.shape[0], img_dim))

            # Loop over the broken up rows and split up columns
            for c, col_chunk in enumerate(rows[:-1]):
                img_arrays = np.split(col_chunk, np.arange(img_dim, data.shape[1], img_dim), axis=1)
                pred_arrays = np.split(pred_rows[c], np.arange(img_dim, data.shape[1], img_dim), axis=1)

                # Store the small MSI and prediction data to a list
                for i, ia in enumerate(img_arrays[:-1]):
                    ia_list.append(np.float32(ia / scale))
                    pred_list.append(np.int16(pred_arrays[i]))

            # Create temporary dataframe with msi data, predictions and metadata
            temp_df = pd.DataFrame({"msi": ia_list, "predictions" : pred_list})
            temp_df['month'] = month
            temp_df['lat'] = base_path.split('_')[2]
            temp_df['lon'] = base_path.split('_')[3]

            dfs.append(temp_df)
            del temp_df
        
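        except Exception:
            # Most likely a band tif is missing for this month, so skip the month.
            # Catching Exception (not a bare except) lets KeyboardInterrupt propagate.
            print(f"Missing Data for Month {month}")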


    # Concatenate the list of monthly dataframes
    data_df = pd.concat(dfs).reset_index(drop=True)
    del dfs

    # Count number of irrigated pixels by sub-image
    data_df['tot_irr_locs'] = data_df.predictions.map(lambda x: np.float32(x.sum()))
    
    data_df.to_pickle(f'{base_path}/{base_path}.pkl')
    
    return data_df
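
The band-combining step inside the function preallocates a 3D array and fills one band at a time. A possible alternative, sketched here with made-up arrays standing in for the per-band tif reads, is to collect the bands in a list and stack them along a new last axis.

```python
import numpy as np

# Hypothetical single-band rasters standing in for the 12 per-band tif reads
band_arrays = [np.full((3, 4), fill_value=i, dtype=np.uint16) for i in range(12)]

# Stack along a new last axis to get shape (rows, cols, bands)
combined = np.stack(band_arrays, axis=-1)
print(combined.shape)
```

Unlike filling an `np.zeros` array, which yields `float64`, this keeps the dtype of the input bands, so a cast may be needed before scaling.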

In [225]:
# List of folders to iterate over
base_paths = ['S2_2018_35.875_-119.875',
              'S2_2018_35.875_-119.625',
              'S2_2018_35.875_-119.375',
              'S2_2018_35.875_-119.125']

for base_path in base_paths:
    df = pandatize_image(base_path)
    print(50*'=')
    #print(df.tot_irr_locs.value_counts())

Irrigation detected in cluster 1 for month 1
Irrigation detected in cluster 3 for month 2
Irrigation detected in cluster 8 for month 2
Irrigation detected in cluster 2 for month 5
Irrigation detected in cluster 9 for month 5
Irrigation detected in cluster 5 for month 6
Irrigation detected in cluster 6 for month 7
Irrigation detected in cluster 2 for month 10
Irrigation detected in cluster 9 for month 10
Irrigation detected in cluster 3 for month 1
Irrigation detected in cluster 5 for month 2
Irrigation detected in cluster 7 for month 2
Irrigation detected in cluster 4 for month 4
Irrigation detected in cluster 8 for month 6
Irrigation detected in cluster 1 for month 7
Irrigation detected in cluster 3 for month 7
Irrigation detected in cluster 2 for month 8
Irrigation detected in cluster 1 for month 1
Missing Data for Month 1
Irrigation detected in cluster 0 for month 3
Irrigation detected in cluster 4 for month 3
Irrigation detected in cluster 6 for month 3
Irrigation detected in clust

In [227]:
len(data_df)

1188

## Note on missing data in January

At first I thought this was odd, but I went to the [Sentinel Hub playground](https://apps.sentinel-hub.com/sentinel-playground/) and saw the image below. After looking at the January 2018 satellite imagery for that region, it appears that the very southern tip of the Central Valley got lucky at the end of January, which is why we have usable satellite imagery there. For the rest of the region, however, we should not expect to see much during that month.

![Central Valley](./Sentinel-2_L1C_2018-01-25.jpg)
