# Preprocessing Functions:

### The following functions form a preprocessing pipeline that extracts 1km squared images of a given country (represented by a shapefile) together with the corresponding radiance level of each region. The process is divided into three main steps, applied to a single country or region: 1) Extracting the Radiance Level, 2) Extracting Sentinel Images, and 3) Mapping the Radiance Levels to Satellite Images.

## 1) Extracting Radiance Level (From NOAA Lights at Night tiff)

#### The NOAA Lights at Night tiff provides the radiance level (serving as a basic representation of energy consumption) for the entire globe over the course of a year. This data can be accessed at https://eogdata.mines.edu/download_dnb_composites.html and is split into regions that each cover one sixth of the world. Once the appropriate region is downloaded, it must be cut down to the area of the specific country or region of interest and converted to a CSV containing the longitude, latitude and radiance level of each point within that area. The following functions perform these steps and aggregate the radiance levels to 1km squared regions, with the latitude and longitude denoting the bottom left hand corner of each 1km squared region:

In [2]:
def cut_raster_with_shape (tiff, shape, outfile):
    '''This function takes a larger tiff file and a shapefile representing a boundary (of a country or region), and 
    writes a tiff file to the path denoted by 'outfile' containing just the information from the original tiff within
    the boundary of the shapefile.'''
    with fiona.open(shape, "r") as shapefile: # open the shape file 
        shapes = [feature["geometry"] for feature in shapefile]
        
    with rasterio.open(tiff) as src: # open the tiff file
        out_image, out_transform = rasterio.mask.mask(src, shapes, crop=True, nodata = -10000) #Mask the larger tiff file with the country shapefile replacing values outside the shapefile with -10,000
        out_meta = src.meta
        
    out_meta.update({"driver": "GTiff",
                 "height": out_image.shape[1],
                 "width": out_image.shape[2],
                 "transform": out_transform})
    
    with rasterio.open(outfile, "w", **out_meta) as dest: # write the masked tiff to the outfile.
        dest.write(out_image)

In [3]:
def raster_to_csv (filename):
    '''This function takes the path to a raster file (without the .tif ending), writes a csv in the same directory 
    containing the XYZ values from the raster file (longitude, latitude and corresponding radiance level), and returns the path to that csv.'''
    inDs = gdal.Open('{}.tif'.format(filename))
    outDS = gdal.Translate('{}.xyz'.format(filename), inDs, format='XYZ')
    os.rename('{}.xyz'.format(filename), '{}.csv'.format(filename))
    
    return (f'{filename}.csv')

In [4]:
def pre_process_csv(df,df_bool = True):
    '''This function takes a DataFrame (or, if df_bool is False, a path to a csv file) and returns a cleaned 
    version limited only to the land area of that region / country. (The data should be in the format 
    Lon Lat Radiance as extracted by raster_to_csv above.)
    
    If df_bool is False you are passing a file to be read as well as preprocessed.'''

    if df_bool is False:
        df = pd.read_csv(df,header = None) # read a pandas df from the file path passed in place of a DataFrame.
    
    df.columns = ['Lon','Lat','Radiance'] # rename columns
    df['Radiance'] = df['Radiance'].replace(-10000.0, np.nan) # Replace NoData values (set as -10,000 in cut_raster_with_shape) with NaN and drop all NaN values.
    df = df.dropna()
    df['Radiance'] = df['Radiance'].apply(lambda row: max(0,row)) #make negative values 0
    df.reset_index(drop = True,inplace = True)
    
    return df
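
The cleaning rules above can be sketched on plain Python values. This is a minimal illustration, not the pandas implementation: it assumes the XYZ file holds whitespace-separated `lon lat radiance` lines, and the helper name `clean_xyz_lines` and the sample values are hypothetical.

```python
NODATA = -10000.0  # marker written by cut_raster_with_shape for pixels outside the shapefile

def clean_xyz_lines(lines):
    """Hypothetical helper: parse 'lon lat radiance' lines, drop NoData, clamp negatives to 0."""
    cleaned = []
    for line in lines:
        lon, lat, radiance = (float(value) for value in line.split())
        if radiance == NODATA:
            continue  # the point lies outside the country boundary
        cleaned.append((lon, lat, max(0.0, radiance)))
    return cleaned

sample = [
    "30.1021 -1.4563 0.42",
    "30.1063 -1.4563 -10000",
    "30.1104 -1.4563 -0.07",
]
rows = clean_xyz_lines(sample)
print(rows)
```

If the file written by raster_to_csv is indeed whitespace separated, `pd.read_csv` would need a matching `sep` argument; whether that applies depends on how the XYZ file was produced.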


In [17]:
def is_outlier (row, df, n_sig = None):
    '''This function tests a row of a data frame with columns Lon, Lat, Radiance for an outlying radiance value.
    It returns False if the row is an outlier (and should be dropped) and True if the row should be kept.
    A point is considered an outlier if its radiance is more than 3 standard deviations away from the average 
    radiance of the points within 0.05 degrees of the row in question. If n_sig is passed, a point is also an 
    outlier if its radiance is at least n_sig standard deviations above the mean radiance of the whole df.'''
    
    if n_sig is not None: # if a significance level is passed, first apply a global threshold of n_sig standard deviations above the mean.
        thresh = df['Radiance'].std()*n_sig + df['Radiance'].mean()
        if row[-1] >= thresh:
            return False
    
    # calculate the mean and standard deviation of the points within 0.05 degrees of the row in question. 
    square = df[(df['Lon'] > row[0] - .05) & (df['Lon'] < row[0] + .05) & 
               (df['Lat'] > row[1] - .05) & (df['Lat'] < row[1] +.05)]
    mean = square.Radiance.mean()
    stdev = square.Radiance.std()
    
    # if the radiance is more than 3 sigma away from the mean of its neighbourhood then it is an outlier
    if abs(row[-1] - mean) > stdev *3:
        return False

    # If not the above condition then the point is not an outlier and we keep it
    else:
        return True

In [7]:
def remove_outliers (df,n_sig = None):
    '''This function applies the is_outlier test defined above to each row in the radiance levels df and drops 
    any row that is flagged as an outlier.'''
    
    df['Drop_Outlier'] = df.swifter.apply(lambda row: is_outlier(row, df,n_sig), axis = 1)
    df_clean = df[df.Drop_Outlier]
    df_clean = df_clean.drop(['Drop_Outlier'],axis = 1)
    
    return df_clean
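
As a rough sketch of the local 3 sigma test, the snippet below repeats the same rule on a list of `(lon, lat, radiance)` tuples using only the standard library. The function name `keep_point` and the sample grid are hypothetical, and the optional global `n_sig` threshold is left out.

```python
import statistics

def keep_point(point, points, half_window=0.05, n_local_sigma=3):
    """Hypothetical pure-Python version of the local test: True means keep, False means outlier."""
    lon, lat, radiance = point
    # Neighbourhood: every point strictly within half_window degrees in both directions
    neighbours = [r for (x, y, r) in points
                  if abs(x - lon) < half_window and abs(y - lat) < half_window]
    mean = statistics.mean(neighbours)
    stdev = statistics.stdev(neighbours)  # sample standard deviation, like the pandas default
    return abs(radiance - mean) <= n_local_sigma * stdev

# 5 x 5 grid of points 0.01 degrees apart, all with radiance 1 except a spike in the middle
points = [(30.0 + 0.01 * i, -1.0 + 0.01 * j, 1.0) for i in range(5) for j in range(5)]
points[12] = (points[12][0], points[12][1], 101.0)

kept = [p for p in points if keep_point(p, points)]
print(len(points), len(kept))
```

Two properties of this rule are worth keeping in mind. With the sample standard deviation, a single point among n can sit at most (n - 1) / sqrt(n) standard deviations from the mean, so a window holding ten or fewer points can never flag anything at 3 sigma. The sketch also assumes at least two points per window, since `statistics.stdev` raises an error for a single value.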

In [8]:
def round_down(number:float, decimals:int=1):
    """Returns a value rounded down to a specific number of decimal places (default is one decimal place)"""
    
    new_val = math.floor(number * (10**decimals)) / (10**decimals)
    
    return new_val
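
A few worked values make the behaviour of round_down concrete. The function is repeated here so the snippet runs on its own; the inputs are arbitrary examples.

```python
import math

def round_down(number: float, decimals: int = 1):
    """Same arithmetic as round_down above: floor to the given number of decimal places."""
    return math.floor(number * (10 ** decimals)) / (10 ** decimals)

print(round_down(36.8219, 2))   # 36.82
print(round_down(-0.15, 1))     # -0.2, because floor moves towards negative infinity
print(round_down(0.57, 2))      # 0.56, because 0.57 * 100 is slightly below 57 in floating point
```

Flooring rather than rounding is what makes each coordinate pair denote the bottom left hand corner of its region, including for negative longitudes and latitudes. The last line shows that a value already on the grid can occasionally fall into the cell below because of floating point representation.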

In [9]:
def agg_radiance (df, agg_size):
    '''This function takes a DataFrame with columns Lon, Lat, and Radiance (all of type float) and aggregates the 
    radiance column to regions of size agg_size in km squared (i.e. if you want to aggregate to 1 km squared enter
    1, 10 for 10km squared and so forth) - The parameter agg_size only takes integers of the form 10^x.'''
    
    size_dict = {1: 2, 10: 1, 100: 0, 1000: -1}
    new_df = df.sort_values('Lon',ascending = True)
    new_df['Lon'] = new_df['Lon'].apply(lambda num: round_down(num,size_dict[agg_size]))
    new_df['Lat'] = new_df['Lat'].apply(lambda num: round_down(num,size_dict[agg_size]))
    new_df = new_df.groupby(['Lon','Lat'],as_index=False).mean()
    
    return new_df
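
The aggregation step can be illustrated without pandas. In size_dict, an agg_size of 1 maps to 2 decimal places, i.e. cells of 0.01 degrees, which is roughly 1.1 km of latitude. The sketch below is a simplified stand-in for the round-down, groupby and mean steps; `aggregate_mean` and the sample points are hypothetical.

```python
import math
from collections import defaultdict

def round_down(number, decimals=1):
    return math.floor(number * (10 ** decimals)) / (10 ** decimals)

def aggregate_mean(points, decimals=2):
    """Hypothetical pure-Python equivalent of the round-down + groupby + mean steps."""
    cells = defaultdict(list)
    for lon, lat, radiance in points:
        # Every point is assigned to the bottom left corner of the cell that contains it
        cells[(round_down(lon, decimals), round_down(lat, decimals))].append(radiance)
    return {cell: sum(values) / len(values) for cell, values in cells.items()}

points = [(30.123, -1.456, 2.0), (30.128, -1.451, 4.0), (30.131, -1.456, 10.0)]
cells = aggregate_mean(points, decimals=2)
print(cells)
```

The first two points fall into the cell with corner (30.12, -1.46) and are averaged to 3.0, while the third point sits alone in the neighbouring cell.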

## 2) Extracting Sentinel Images (Implementing Sentinel Hub API)

#### With the radiance levels extracted for a specific country, cleaned and aggregated to 1km squared regions, we now need one image for every radiance level. To reduce the run time of extracting these images, and given the usage constraints of Sentinel Hub (there is a cap on the number of monthly image extractions), the following code determines the coordinates of 10km squared regions of the country (covering all the 1km squared regions) that can later be segmented and mapped to the radiance level df created in part 1.

In [10]:
def make_boxes (row, step = .0999999):
    '''This function writes the coordinates of a region in wgs84 format in order to be passed to the sentinel hub API.
    The API will use this coordinate representation to extract the appropriate satellite image. The step is defaulted to 
    create boxes of size 10km squared (adding or removing a leading decimal 0 will decrease or increase the area size 
    respectively.)'''
    
    box = [row[0],row[1], row[0]+step, row[1] + step]
    
    return box

In [11]:
def get_boxes (df):
    '''This function takes a df with columns Lon, Lat, and Radiance aggregated to a factor of 10 and aggregates 
    these radiance labels to size 10 km squared (maximum size you can download from sentinel api at 10m level)'''
    
    df_new = df.sort_values('Lon',ascending = True) # sort the radiance df
    df_new['Lon'] = df_new['Lon'].apply(lambda num: round_down(num,1)) # round the lon down to represent the bottom left corner of the 10km squared region
    df_new['Lat'] = df_new['Lat'].apply(lambda num: round_down(num,1)) # round the lat down to represent the bottom left corner of the 10km squared region
    df_new = df_new.groupby(['Lon','Lat'],as_index=False).mean()
    df_new['boxes'] = df_new.apply(lambda row: make_boxes(row), axis = 1) # create wgs84 boxes for each 10km squared region
    
    return df_new
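
To show how the 1km squared cells collapse into 10km squared tiles and their bounding boxes, here is a small sketch on plain tuples. It mirrors the rounding in get_boxes and the box layout of make_boxes, but `tile_boxes` and the sample cells are hypothetical.

```python
import math

def round_down(number, decimals=1):
    return math.floor(number * (10 ** decimals)) / (10 ** decimals)

def tile_boxes(cells, step=0.0999999):
    """Hypothetical helper: map 1km cell corners to the bounding boxes of their 10km tiles."""
    # A set removes duplicates, so each tile is requested only once
    tiles = sorted({(round_down(lon, 1), round_down(lat, 1)) for lon, lat in cells})
    # Box layout: [min_lon, min_lat, max_lon, max_lat]
    return {tile: [tile[0], tile[1], tile[0] + step, tile[1] + step] for tile in tiles}

cells = [(30.12, -1.46), (30.13, -1.46), (30.21, -1.46)]
boxes = tile_boxes(cells)
for tile, box in boxes.items():
    print(tile, box)
```

Three cells produce only two tiles here. A fully populated tile covers up to 100 cells, which is where the saving in image requests comes from.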

In [12]:
def get_sentinel_image (row, client_id, secret, res = 10):
    '''This function takes a row whose last entry is a list representing a latitude longitude box of approximately 
    size 10km squared (this box should be a list of len 4), Sentinel Hub credentials (client_id and secret), and a 
    desired image resolution (10m, 20m or 60m with default 10m) in order to return an array representing an image 
    that corresponds to the lat lon box. '''
    
    config = SHConfig()
    config.sh_client_id = client_id
    config.sh_client_secret = secret
    
    # The below eval script defaults our image preferences from sentinel hub to visible light (Bands "B02", "B03", "B04")
    evalscript_true_color = """
    //VERSION=3

        function setup() {
            return {
                input: [{
                    bands: ["B02", "B03", "B04"]
                }],
                output: {
                    bands: 3
                }
            };
        }

        function evaluatePixel(sample) {
            return [3.5*sample.B04, 3.5*sample.B03, 3.5*sample.B02];
        }
    """

    row_coords_wgs84 = row[-1]
    row_bbox = BBox(bbox=row_coords_wgs84, crs=CRS.WGS84)
    row_size = bbox_to_dimensions(row_bbox, resolution=res) #calculate the number of pixels requested based on area and resolution size.
    if row_size[0] <= 2500 and row_size[1]<= 2500: # Sentinel Hub requires images extracted to be within this size.
        request_true_color = SentinelHubRequest( #make a request to the sentinel hub API
            evalscript=evalscript_true_color,
            input_data=[
                SentinelHubRequest.input_data(
                    data_collection=DataCollection.SENTINEL2_L1C, #request level 1C images
                    time_interval=('2019-01-01', '2019-12-31'), # From the calendar year of 2019
                    mosaicking_order='leastCC' #choosing the images that have the least cloud cover.
                 )
             ],
            responses=[
                SentinelHubRequest.output_response('default', MimeType.PNG)
            ],
            bbox=row_bbox,
            size=row_size,
            config=config
        )
        return request_true_color.get_data()[0] # return the image
    else:
        print ('Region is too large to access, please limit image size and try again') # Warn the user (no exception is raised) if the region is too large
        return None
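
The size check above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is only an approximation under assumed metres-per-degree figures; it does not reproduce how `bbox_to_dimensions` computes the size, and `approx_pixel_size` is a hypothetical name.

```python
import math

def approx_pixel_size(box, resolution=10):
    """Rough estimate of (width, height) in pixels for a WGS84 box; not the library's calculation."""
    min_lon, min_lat, max_lon, max_lat = box
    mid_lat = math.radians((min_lat + max_lat) / 2)
    # Assumed round figures: about 111.32 km per degree of longitude at the equator
    # and about 110.57 km per degree of latitude
    width_m = (max_lon - min_lon) * 111_320 * math.cos(mid_lat)
    height_m = (max_lat - min_lat) * 110_574
    return round(width_m / resolution), round(height_m / resolution)

box = [30.1, -1.5, 30.1999999, -1.4000001]
for res in (10, 20, 60):
    width, height = approx_pixel_size(box, res)
    print(res, width, height, width <= 2500 and height <= 2500)
```

Under these assumptions a 0.1 degree box near the equator comes out at a little over 1,100 pixels per side at 10m resolution, comfortably inside the 2500 pixel limit checked above, while a box three times as wide would exceed it.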

In [13]:
def map_sentinel_images (df, client_id, secret):
    '''Apply the get_sentinel_image function to an entire dataframe and return that dataframe (dataframe should
    be of columns Lon, Lat, Radiance, and boxes). client_id and secret are given by the Sentinel Hub API.'''
    
    df['image'] = df.swifter.apply(lambda row: get_sentinel_image(row, client_id = client_id,
                                                    secret = secret), axis = 1)
    
    return df

In [1]:
def save_photos (df, directory, make = True):
    '''This function will save the extracted satellite images corresponding to 10km squared regions, represented as 
    ndarrays, as jpeg photos in a specified directory. If make equals True and the directory does not exist, a new 
    directory will be made at the path specified under directory.'''
    
    if not os.path.isdir(directory):
        if make:
            os.mkdir(directory)
        else:
            raise FileNotFoundError(f'{directory} does not exist; pass make = True to create it.')
    for i in range(len(df)):
        lon = df['Lon'][i]
        lat = df['Lat'][i]
        rad = df['Radiance'][i]
        im_label = f'{lon}_{lat}_{rad}.jpeg' # the file name encodes the coordinates and the radiance level
        im = Image.fromarray(df['image'][i])
        full_label = directory +'/'+im_label
        im.save(full_label)

## 3) Map the Sentinel Images to Corresponding Radiance Levels

#### For the training of the final neural network we want to have images representing approximately 1km squared regions of a country and the corresponding radiance level for that region. The following functions will take the 10km squared images extracted from sentinel hub and map a portion of each picture to its given radiance level leaving a final data set of images, radiance levels, and coordinate pairs representing the bottom left corner of the area. 

In [14]:
def map_image_to_labels (labels, images):
    '''This function takes two dataframes: Labels should be a dataframe with columns Lon Lat and Radiance (as given by
    agg_radiance) while images should contain Lon, Lat, Radiance, boxes, and image (as given by map_sentinel_images). 
    This function will map the large images contained in the images dataframe to each class label in labels contained 
    within the geographical bounds of that larger image.'''
    
    # Write a string representation of the larger 10km squared region that each 1km squared region should belong to.
    labels = labels.copy()
    labels.index = labels.apply(lambda row: str(round_down(row[0], 1)) + " " + str(round_down(row[1], 1)), axis = 1)
    
    labels['image_large'] = np.nan
    
    # Using the strings as keys a large 10km squared image is mapped to each radiance level that exists within its bounds
    image_dict = {}
    for i in range(len(images)):
        index = str(images['Lon'][i]) + " " + str(images['Lat'][i])
        key = images.index[i]
        labels.loc[labels.index == index, 'image_large'] = key # assign through .loc so the labels df itself is updated
        image_dict[key] = images['image'][i]
        
    labels['image_large'] = labels['image_large'].map(image_dict, na_action='ignore') 
    
    labels.reset_index(level=0, inplace=True)
    
    return labels
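
The string keys used for this mapping can be sketched in isolation. The snippet below builds the key for a 1km squared cell the same way the index is built above and looks it up in a dictionary of tiles; `tile_key`, the placeholder image strings and the sample labels are hypothetical.

```python
import math

def round_down(number, decimals=1):
    return math.floor(number * (10 ** decimals)) / (10 ** decimals)

def tile_key(lon, lat):
    """Key of the 10km tile containing a point, built the same way as the index above."""
    return str(round_down(lon, 1)) + " " + str(round_down(lat, 1))

# Placeholder strings stand in for the image arrays
tile_images = {"30.1 -1.5": "image A", "30.2 -1.5": "image B"}
labels = [(30.12, -1.46, 3.0), (30.13, -1.46, 10.0), (30.45, -1.46, 1.0)]

# Each label picks up the image of the tile it falls in, or None if no such tile was downloaded
joined = [(lon, lat, rad, tile_images.get(tile_key(lon, lat))) for lon, lat, rad in labels]
for row in joined:
    print(row)
```

Because the keys are strings built from floats, the match relies on both data frames producing identical rounded values, which holds when both go through round_down with one decimal place.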

In [15]:
def extract_image (row):
    '''This function takes a row and proportionally maps the radiance level to a specific portion of the larger image
    it is associated with (from map_image_to_labels) based on latitude and longitude'''
    
    # Extract the image from the row and find its shape
    img = row[4]
    img_shp = img.shape
    
    # Determine the wanted lon and lat of the 1km squared region
    tile = row[0]
    tile_lon = float(tile.split(" ")[0])
    tile_lat = float(tile.split(" ")[1])

    lon = row[1]
    lat = row[2]
    
    # Find the proportional location of the lon and lat in terms of pixels of the larger image
    lon_loc = round((lon - tile_lon) *10000)
    lat_loc = round((lat - tile_lat) *10000)

    # 1km squared will be approximately 1/10th of the larger image so find this step from the locs found above.
    lon_step = (.1*(img_shp[1]))
    lat_step = (.1*(img_shp[0]))
    
    lon_prop = (lon_loc//100)
    lat_prop = (lat_loc//100)

    # Use the above info to find the bottom left corner (in terms of NDarray Indices) within the larger image
    lon_index_bottom = round(lon_step*lon_prop)
    lat_index_top = img_shp[0] - math.floor(lat_step*lat_prop)

    # If the desired image is the last edge (i.e. right hand side of the photo) take everything up to the edge of the larger image.
    if lon_prop != 9:
        lon_index_top = round(lon_index_bottom + lon_step)
    else: 
        lon_index_top = round(lon_index_bottom + lon_step) +1
        
    lat_index_bottom = math.floor(lat_index_top - lat_step)

    # If the image is a 3d array (rows, columns, colour bands) then extract the appropriate 1km squared image. Otherwise return None (this deals with special cases encountered with sentinel hub api)
    if len(img_shp) == 3:
        mapped_img = img[lat_index_bottom: lat_index_top, lon_index_bottom: lon_index_top]
        return mapped_img
    else:
        return None
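
The index arithmetic in extract_image is easier to follow with concrete numbers. The sketch below repeats the same calculation for an assumed 1000 x 1000 pixel tile; `cell_slice` is a hypothetical helper that returns the slice bounds instead of the cropped image.

```python
import math

def cell_slice(img_shape, tile_lon, tile_lat, lon, lat):
    """Hypothetical helper repeating the index arithmetic of extract_image.
    Returns (row_start, row_stop, col_start, col_stop)."""
    lon_prop = round((lon - tile_lon) * 10000) // 100   # column of the cell within the tile, 0 to 9
    lat_prop = round((lat - tile_lat) * 10000) // 100   # row of the cell, counted from the bottom
    lon_step = 0.1 * img_shape[1]
    lat_step = 0.1 * img_shape[0]
    col_start = round(lon_step * lon_prop)
    col_stop = round(col_start + lon_step) + (1 if lon_prop == 9 else 0)
    row_stop = img_shape[0] - math.floor(lat_step * lat_prop)   # array rows run from north to south
    row_start = math.floor(row_stop - lat_step)
    return row_start, row_stop, col_start, col_stop

# Cell with corner (30.12, -1.46) inside the tile with corner (30.1, -1.5)
print(cell_slice((1000, 1000, 3), 30.1, -1.5, 30.12, -1.46))   # (500, 600, 200, 300)
```

The cell is two steps east and four steps north of the tile corner, so it occupies columns 200 to 300 and, because array rows are counted from the top of the image, rows 500 to 600.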


## 4) Read in Image / Labels for use:

#### Because the computational requirements of the above preprocessing pipeline are relatively large, it is a good idea to save the extracted images (using save_photos above). The following functions read the saved JPEG photos back into a pandas df containing the image label (the path to each photo), coordinate pair, and radiance level of each photo in order to be used in the Neural Network.

In [16]:
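def extract_info_from_image_label (row):
    '''This function takes a row containing the path to a saved photo (named Lon_Lat_Radiance.jpeg by save_photos) 
    and returns the Lon, Lat and Radiance encoded in the file name as a pandas Series.'''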
    row = np.array(row)
    img = row[0][:-5]
    img = img.split('/')[-1]
    row_out = [float(img.split('_')[0]),float(img.split('_')[1]),float(img.split('_')[2])]
    if len(row_out) == 3:
        return pd.Series([row_out[0],row_out[1],row_out[2]])
    else:
        return(np.nan,np.nan,np.nan)
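
The file name acts as a small record format: save_photos writes `Lon_Lat_Radiance.jpeg` and the function above reads the three numbers back. A minimal round trip on plain strings, with hypothetical helper names and a hypothetical `photos` directory, looks like this:

```python
def make_label(lon, lat, radiance):
    """Builds the file name the same way save_photos does."""
    return f'{lon}_{lat}_{radiance}.jpeg'

def parse_label(path):
    """Hypothetical stand-alone version of the parsing in extract_info_from_image_label."""
    name = path.split('/')[-1][:-5]   # drop the directory and the '.jpeg' ending
    lon, lat, radiance = (float(part) for part in name.split('_'))
    return lon, lat, radiance

label = make_label(30.12, -1.46, 3.0)
print(label)
print(parse_label('photos/' + label))
```

Splitting on the underscore is safe for negative coordinates because the minus sign stays attached to its number.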

In [18]:
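def photo_df (path):
    '''This function takes the path to a directory of photos written by save_photos and returns a df with the 
    columns image_label (path to each jpeg), Lon, Lat and Radiance. If the directory does not exist it returns None.'''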
    if  os.path.isdir(path):
        image_labels = [f'{path}/{f}' for f in os.listdir(path) 
                        if os.path.splitext(f)[-1] == '.jpeg']

        data = pd.DataFrame(image_labels, columns = ['image_label'])

        data[['Lon','Lat','Radiance']] = data.apply(lambda row: 
                                extract_info_from_image_label(row), axis = 1)

        return(data)
    else:
        print("Directory Not Found! Please enter a valid directory.")
        return None
        

In [None]:
# Required imports for all of the above functions: 
# import pandas as pd
# import numpy as np
# import requests
# from requests.auth import HTTPBasicAuth
# from sentinelhub import MimeType, CRS, BBox, SentinelHubRequest, SentinelHubDownloadClient, \
#     DataCollection, bbox_to_dimensions, DownloadRequest, SentinelHubBatch
# import os
# import datetime
# import matplotlib.pyplot as plt
# from sentinelhub import SHConfig
# from datetime import datetime
# import math
# import re, ast
# import struct
# import csv
# from osgeo import gdal  # older GDAL releases also accepted a plain `import gdal`
# import fiona
# import rasterio
# import rasterio.mask
# import swifter
# from PIL import Image