<a href="https://colab.research.google.com/github/kode-git/Copernicus-river-discharges/blob/main/Initial_Exploratory_Spatial_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="./src/copernicus-logo.png"><span style="margin-left: 40px"></span><img src="./src/cds-logo.jpeg">

# Initial Exploratory Spatial Data Analysis

In this initial exploratory analysis of spatial data, we focus on the data format. The notebook supports data management and introduces the data aggregation phase by documenting the metadata. We also describe the internal structure of a NetCDF4 file, a widely used format for N-dimensional array collections. The information extracted from these data structures will guide the decisions taken in the next steps.


## Library Dependencies

In [1]:
# installation of dependencies for remote notebook (Jupyter or Google Colab)
# !pip install xarray 
# !pip install netCDF4 dask bottleneck
# !pip install pandas
# !pip install geopandas
# !pip install cdsapi

# installation of dependencies for local notebook
%pip install xarray 
%pip install netCDF4 dask bottleneck
%pip install pandas
%pip install geopandas
%pip install cdsapi

Note: you may need to restart the kernel to use updated packages.


In [3]:
# collections libraries
import netCDF4 as nc4
from netCDF4 import Dataset
import xarray as xr
import numpy as np

# file management
from glob import glob

# utilities 
from datetime import datetime as dt

# uncomment the import below only when running on Google Colaboratory
# from google.colab import drive

In [5]:
# uncomment the line below only when running on Google Colaboratory
# drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Initial Exploratory on Spatial River Discharge Data (RDH)

RDH is the first of the two datasets used in this project. Its original name is "River discharge and related historical data from the European Flood Awareness System". It describes the geographical distribution of river discharges over the whole European domain in $m^3/s$. The grid is much larger than the set of cells that carry real, non-null values. The original dataset is a single product; we split it into several files because of the file sizes and the download limits of the Climate Data Store (CDS). In this exploration we discuss only a subset covering the two years 2021-2022, both because handling the full data volume is time-consuming and because the metadata and layout do not change over time. Note that the 2021-2022 subset is not complete: the dataset holds only past and present data, so the remainder of 2022 is not yet available. During the data aggregation phase we will match geographical locations with precipitation and temperature, after any shift needed to align the measurements. The complete, documented data source is available on the Climate Data Store (CDS) at the following <a href="https://cds.climate.copernicus.eu/cdsapp#!/dataset/efas-historical?tab=overview">link</a>. 


In [140]:
# pass the format by keyword: the third positional argument of Dataset is `clobber`
nc_dis = Dataset("samples/rdh-2021-2022-6.nc", "r", format="NETCDF4")
print(nc_dis.variables)

{'y': <class 'netCDF4._netCDF4.Variable'>
float64 y(y)
    _FillValue: nan
    units: Meter
    long_name: Y coordinate of projection
    standard_name: projection_Y_coordinate
unlimited dimensions: 
current shape = (950,)
filling on, 'x': <class 'netCDF4._netCDF4.Variable'>
float64 x(x)
    _FillValue: nan
    units: Meter
    long_name: x coordinate of projection
    standard_name: projection_x_coordinate
unlimited dimensions: 
current shape = (1000,)
filling on, 'time': <class 'netCDF4._netCDF4.Variable'>
int64 time(time)
    long_name: initial time of forecast
    standard_name: forecast_reference_time
    units: seconds since 1970-01-01
    calendar: proleptic_gregorian
unlimited dimensions: 
current shape = (498,)
filling on, default _FillValue of -9223372036854775806 used, 'step': <class 'netCDF4._netCDF4.Variable'>
float64 step()
    _FillValue: nan
    long_name: time since forecast_reference_time
    standard_name: forecast_period
    units: hours
unlimited dimensions: 
curre

### Data Information

In [116]:
print('Variables:')
print("River Discharges: {}".format(nc_dis.variables.keys()))
print('-'*10)
print("Dimensions:")
print("River Discharges {}".format(nc_dis.dimensions.keys()))

Variables:
River Discharges: dict_keys(['y', 'x', 'time', 'step', 'surface', 'latitude', 'longitude', 'valid_time', 'dis06', 'lambert_azimuthal_equal_area', 'land_binary_mask', 'upArea'])
----------
Dimensions:
River Discharges dict_keys(['y', 'x', 'time'])


Each key gives access to the corresponding variable and its metadata.

In [119]:
dis = nc_dis.variables['dis06']
print("Discharge variables metadata:\n{}".format(dis))

Discharge variables metadata:
<class 'netCDF4._netCDF4.Variable'>
float32 dis06(time, y, x)
    _FillValue: nan
    GRIB_paramId: 240023
    GRIB_dataType: sfo
    GRIB_numberOfPoints: 950000
    GRIB_typeOfLevel: surface
    GRIB_stepUnits: 1
    GRIB_stepType: avg
    GRIB_gridType: lambert_azimuthal_equal_area
    GRIB_NV: 0
    GRIB_cfName: unknown
    GRIB_cfVarName: dis06
    GRIB_gridDefinitionDescription: Lambert azimuthal equal area projection
    GRIB_missingValue: 9999
    GRIB_name: Mean discharge in the last 6 hours
    GRIB_shortName: dis06
    GRIB_units: m**3 s**-1
    long_name: Mean discharge in the last 6 hours
    units: m**3 s**-1
    standard_name: unknown
    grid_mapping: lambert_azimuthal_equal_area
    coordinates: time step surface latitude longitude valid_time
unlimited dimensions: 
current shape = (498, 950, 1000)
filling on


The discharge variable dis06 holds 32-bit floating-point values over three dimensions, $[time, y, x]$. The metadata also expose the associated coordinates, the descriptive names, and the units of measurement, and we can check the size of each dimension.

In [123]:
for d in nc_dis.dimensions.items():
  print("Items on dimensions: {}".format(d))

Items on dimensions: ('y', <class 'netCDF4._netCDF4.Dimension'>: name = 'y', size = 950)
Items on dimensions: ('x', <class 'netCDF4._netCDF4.Dimension'>: name = 'x', size = 1000)
Items on dimensions: ('time', <class 'netCDF4._netCDF4.Dimension'>: name = 'time', size = 498)


Note that $(x,y)$ are the coordinates of each grid point in the map projection (Lambert azimuthal equal area), expressed in meters; they are not the geographical coordinates (latitude and longitude) used in the final data. The sizes of the two dimensions give the extent $(N, M)$ of the projected grid, not the position of points on the globe.

In [105]:
print("River Discharge dimensions: {}".format(dis.dimensions))

River Discharge dimensions: ('time', 'y', 'x')


Printing the dimensions of the discharge variable shows how it is laid out and confirms our first assumption.

In [124]:
print('Total shape: {}'.format(dis.shape))

Total shape: (498, 950, 1000)
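
The axis order matters when slicing the variable. The sketch below uses a small synthetic NumPy array in place of dis06; the toy shape and all names in it are illustrative assumptions, only the axis order (time, y, x) comes from the output above.

```python
import numpy as np

# Hypothetical stand-in for dis06 with the same axis order (time, y, x),
# but a much smaller shape so it is cheap to build.
n_time, n_y, n_x = 4, 3, 5
toy = np.arange(n_time * n_y * n_x, dtype=np.float32).reshape(n_time, n_y, n_x)

# Fixing the time index gives a 2D map with shape (y, x).
first_map = toy[0]

# Fixing a grid cell (row 1, column 2) gives a 1D time series with shape (time,).
cell_series = toy[:, 1, 2]

print(first_map.shape, cell_series.shape)
```

With the real variable the same indexing pattern applies, but each full time slice holds 950 x 1000 values.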


Similarly, we can also inspect the variables associated with each dimension:

In [125]:
time = nc_dis.variables['time']
x, y = nc_dis.variables['x'], nc_dis.variables['y']
print("Time variables :->", time)
print("X coordinate :->", x)
print("Y coordinate :->", y)

Time variables :-> <class 'netCDF4._netCDF4.Variable'>
int64 time(time)
    long_name: initial time of forecast
    standard_name: forecast_reference_time
    units: seconds since 1970-01-01
    calendar: proleptic_gregorian
unlimited dimensions: 
current shape = (498,)
filling on, default _FillValue of -9223372036854775806 used
X coordinate :-> <class 'netCDF4._netCDF4.Variable'>
float64 x(x)
    _FillValue: nan
    units: Meter
    long_name: x coordinate of projection
    standard_name: projection_x_coordinate
unlimited dimensions: 
current shape = (1000,)
filling on
Y coordinate :-> <class 'netCDF4._netCDF4.Variable'>
float64 y(y)
    _FillValue: nan
    units: Meter
    long_name: Y coordinate of projection
    standard_name: projection_Y_coordinate
unlimited dimensions: 
current shape = (950,)
filling on


Here we obtain some information about each of the three dimensions. The time variable is the initial time of the discharge forecast, stored as seconds since 1970-01-01. Meanwhile, x and y are the projection coordinates in meters. All three are one-dimensional, so we can read them directly as NumPy arrays:

In [128]:
tm = time[:]
print("First 20 times: {}".format(tm[0:20]))
print("Shape of time: {}".format(time.shape))

First 20 times: [1609502400 1609588800 1609675200 1609761600 1609848000 1609934400
 1610020800 1610107200 1610193600 1610280000 1610366400 1610452800
 1610539200 1610625600 1610712000 1610798400 1610884800 1610971200
 1611057600 1611144000]
Shape of time: (498,)
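
The raw values are easier to read as dates. A minimal sketch, using only the standard library and the first three values copied from the output above; the helper name to_utc is an assumption introduced here. The netCDF4 package also provides num2date, which reads the units and calendar attributes directly.

```python
from datetime import datetime, timezone

# First three values copied from the printed output above.
raw_seconds = [1609502400, 1609588800, 1609675200]

def to_utc(seconds):
    """Convert seconds since 1970-01-01 to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

timestamps = [to_utc(s) for s in raw_seconds]
for ts in timestamps:
    print(ts.isoformat())

# Spacing between consecutive time steps, in hours.
step_hours = (raw_seconds[1] - raw_seconds[0]) / 3600
print(step_hours)
```

For these values the first timestamp is 2021-01-01 at 12:00 UTC and consecutive steps are 24 hours apart.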


The same holds for the other dimensions:

In [130]:
X = x[:]
Y = y[:]
print("First 10 x: {}".format(X[0:10]))
print("First 10 y: {}".format(Y[0:10]))
print('Shape x : {}; shape y : {}'.format(X.shape, Y.shape))

First 10 x: [2502500. 2507500. 2512500. 2517500. 2522500. 2527500. 2532500. 2537500.
 2542500. 2547500.]
First 10 y: [5497500. 5492500. 5487500. 5482500. 5477500. 5472500. 5467500. 5462500.
 5457500. 5452500.]
Shape x : (1000,); shape y : (950,)
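
The printed values suggest a regular grid with 5000 m spacing, with x increasing and y decreasing along the index. Under that assumption (only the first ten values are shown, so the spacing over the full axis is not verified here), a projected coordinate can be turned into a grid index as sketched below. The helper nearest_index and the query point are hypothetical.

```python
import numpy as np

# Axes rebuilt from the printed output, assuming the 5000 m spacing seen in
# the first ten values holds along the whole axis.
x_axis = 2502500.0 + 5000.0 * np.arange(1000)
y_axis = 5497500.0 - 5000.0 * np.arange(950)   # y decreases with the index

def nearest_index(axis, value):
    """Return the index of the axis entry closest to value."""
    return int(np.abs(axis - value).argmin())

# Hypothetical query point, in projected meters.
col = nearest_index(x_axis, 2513000.0)
row = nearest_index(y_axis, 5486000.0)
print(row, col)
```

The same lookup works on the arrays X and Y read from the file, without any assumption on the spacing.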


$x$ and $y$ are therefore not the geographical coordinates of the river discharges; they only locate points on the 2D projected grid. The first snippet, which printed the data structure, also showed variables holding the real geographical coordinates: latitude and longitude.

In [132]:
lat, lon = nc_dis.variables['latitude'], nc_dis.variables['longitude']
print("Latitude :-> {}".format(lat))
print("Longitude :-> {}".format(lon))
print('Latitude values:')
print(lat[:10])
print('Longitude values:')
print(lon[:10])

Latitude :-> <class 'netCDF4._netCDF4.Variable'>
float32 latitude(y, x)
    _FillValue: nan
    grid_mapping: lambert_azimuthal_equal_area
    long_name: latitude
    standard_name: latitude
    units: degrees_north
    esri_pe_string: PROJCS["ETRS_1989_LAEA",GEOGCS["GCS_ETRS_1989",DATUM["D_ETRS_1989",SPHEROID["GRS_1980",6378137.0,298.257222101]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],PROJECTION["Lambert_Azimuthal_Equal_Area"],PARAMETER["false_easting",4321000.0],PARAMETER["false_northing",3210000.0],PARAMETER["central_meridian",10.0],PARAMETER["latitude_of_origin",52.0],UNIT["Meter",1.0]]
unlimited dimensions: 
current shape = (950, 1000)
filling on
Longitude :-> <class 'netCDF4._netCDF4.Variable'>
float32 longitude(y, x)
    _FillValue: nan
    grid_mapping: lambert_azimuthal_equal_area
    long_name: longitude
    standard_name: longitude
    units: degrees_east
    esri_pe_string: PROJCS["ETRS_1989_LAEA",GEOGCS["GCS_ETRS_1989",DATUM["D_ETRS_1989",SPHEROID["GRS_1

## Initial Exploratory on Spatial Temperatures and Precipitations Data (TPI)

TPI is the second of the two datasets. Its original name is "Temperature and precipitation gridded data for global and regional domains derived from in-situ and satellite observations". It provides gridded measurements of precipitation and temperature over the European area through time. The original dataset is a single product; we split it into several files because of the file sizes and the download limits of the Climate Data Store (CDS). In this exploration we discuss only a subset covering one year, 2021, both because handling the full data volume is time-consuming and because the metadata and layout do not change over time. The data source can be downloaded from the following <a href="https://cds.climate.copernicus.eu/cdsapp#!/dataset/insitu-gridded-observations-global-and-regional?tab=overview">link</a>.

In [133]:
# pass the format by keyword: the third positional argument of Dataset is `clobber`
nc_temp = Dataset("samples/tpi-temp-2021-10.nc", "r", format="NETCDF4")
nc_prec = Dataset("samples/tpi-prec-2021-10.nc", "r", format="NETCDF4")

In [135]:
print("Temperatures metadata:\n{}".format(nc_temp))

Temperatures metadata:
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4_CLASSIC data model, file format HDF5):
    CDI: Climate Data Interface version 1.9.6 (http://mpimet.mpg.de/cdi)
    Conventions: CF-1.6
    CDO: Climate Data Operators version 1.9.6 (http://mpimet.mpg.de/cdo)
    creation_date: 2020-02-14T09:31:29ZCET+0100
    NCO: netCDF Operators version 4.7.7 (Homepage = http://nco.sf.net, Code = http://github.com/nco/nco)
    acknowledgements: This work was performed within Copernicus Climate Change Service - C3S_424_SMHI, https://climate.copernicus.eu/operational-service-water-sector, on behalf of ECMWF and EU.
    contact: Hydro.fou@smhi.se
    data_quality: Testing of EURO-CORDEX data performed by ESGF nodes. In the contract C3S_424_SMHI additional tests were performed during the bias adjustment and the prodcution of the CII.
    domain: EUR-11
    institution: SMHI, www.smhi.se
    invar_bc_institution: Swedish Meteorological and Hydrological Institute
    invar_bc_me

The dataset has the same spatial dimensions as the RDH. This is useful: if the data references coincide for each point, we can move straight to our statistical hypotheses after data aggregation. That process should be much simpler than procedures based on clustering and geographical approximation, such as k-means techniques.

In [141]:
print("Precipitations metadata:\n{}".format(nc_prec))

Precipitations metadata:
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4_CLASSIC data model, file format HDF5):
    CDI: Climate Data Interface version 1.9.6 (http://mpimet.mpg.de/cdi)
    Conventions: CF-1.6
    CDO: Climate Data Operators version 1.9.6 (http://mpimet.mpg.de/cdo)
    creation_date: 2020-02-14T09:03:09ZCET+0100
    NCO: netCDF Operators version 4.7.7 (Homepage = http://nco.sf.net, Code = http://github.com/nco/nco)
    acknowledgements: This work was performed within Copernicus Climate Change Service - C3S_424_SMHI, https://climate.copernicus.eu/operational-service-water-sector, on behalf of ECMWF and EU.
    contact: Hydro.fou@smhi.se
    domain: EUR-11
    institution: SMHI, www.smhi.se
    invar_bc_institution: Swedish Meteorological and Hydrological Institute
    invar_bc_method: TimescaleBC, description in deliverable C3S_D424.SMHI.1.3b
    invar_bc_method_id: TimescaleBC v1.02
    invar_bc_observation: EFAS-Meteo, https://ec.europa.eu/jrc/en/publication/eur

In [114]:
print('Variables:')
print("Temperature: {}".format(nc_temp.variables.keys()) )
print("Precipitations: {}".format(nc_prec.variables.keys()))
print('-'*10)
print("Dimensions:")
print("Temperature: {}".format(nc_temp.dimensions.keys()))
print("Precipitations: {}".format(nc_prec.dimensions.keys()))

Variables:
Temperature: dict_keys(['time', 'lon', 'lat', 'tasAdjust', 'height'])
Precipitations: dict_keys(['time', 'lon', 'lat', 'prAdjust'])
----------
Dimensions:
Temperature: dict_keys(['time', 'x', 'y'])
Precipitations: dict_keys(['time', 'x', 'y'])


In [144]:
print('Precipitation variable: {}'.format(nc_prec['prAdjust']))

Precipitation variable: <class 'netCDF4._netCDF4.Variable'>
float32 prAdjust(time, y, x)
    _FillValue: 1e+20
    missing_value: 1e+20
    cell_measures: area:areacella
    cell_methods: time:mean
    long_name: precipitation
    standard_name: mean_precipitation_index_per_time_period
    units: kg m-2 s-1
    variable: prAdjust
    coordinates: lat lon
unlimited dimensions: time
current shape = (365, 950, 1000)
filling on


The precipitation variable has three dimensions, like the river discharge, and its spatial shape matches the RDH. However, we still need to check the orientation of the grid to obtain correct geographical references when comparing and aggregating the measurements with RDH. Values are in $kg\ m^{-2}\ s^{-1}$, which for water is equivalent to $mm/s$; the spatial dimensions have a fixed size, while time is an unlimited dimension.
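
Precipitation fluxes in these units are very small numbers, so a daily total in millimetres is often easier to interpret. The conversion below is a sketch: the function name and the example flux are assumptions, and the conversion relies on 1 kg of liquid water per square metre forming a layer 1 mm deep.

```python
SECONDS_PER_DAY = 86_400

def flux_to_mm_per_day(flux_kg_m2_s):
    """Convert a precipitation flux in kg m-2 s-1 to millimetres per day.

    Assumes liquid water with a density of 1000 kg/m^3, so that
    1 kg per square metre corresponds to a 1 mm layer.
    """
    return flux_kg_m2_s * SECONDS_PER_DAY

# Hypothetical flux value, not taken from the dataset.
example_flux = 5.0e-5
mm_per_day = flux_to_mm_per_day(example_flux)
print(mm_per_day)
```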

In [145]:
print('Temperature variable: {}'.format(nc_temp['tasAdjust']))

Temperature variable: <class 'netCDF4._netCDF4.Variable'>
float32 tasAdjust(time, y, x)
    _FillValue: 1e+20
    missing_value: 1e+20
    cell_measures: area:areacella
    cell_methods: time:mean
    long_name: mean air temperature
    standard_name: mean_temperature_index_per_time_period
    units: K
    variable: tasAdjust
    coordinates: lat lon height
unlimited dimensions: time
current shape = (365, 950, 1000)
filling on


The temperature variable has the same shape as the precipitation variable, which helps when integrating precipitation and temperature on the same geographical points. According to the dataset definition and the metadata descriptions, each temperature or precipitation value is a mean referred to the geographical area associated with a specific point. So the mean can directly represent the conditions around that point, smoothing the local variance.
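
Both variables declare a fill value of 1e+20 for missing cells, which must be excluded before computing any statistic. The sketch below shows one way to do this with a NumPy masked array on a small hypothetical field; the numbers are invented for illustration. Note that netCDF4 usually returns masked arrays already, so this step mainly matters once data are converted to plain arrays.

```python
import numpy as np

FILL_VALUE = 1e20  # _FillValue / missing_value reported in the metadata

# Hypothetical 2x3 temperature field in kelvin with one missing cell.
tas_kelvin = np.array([[280.15, 281.15, FILL_VALUE],
                       [279.15, 282.15, 283.15]], dtype=np.float32)

# Mask the missing cell, then convert kelvin to degrees Celsius.
masked = np.ma.masked_values(tas_kelvin, FILL_VALUE)
tas_celsius = masked - 273.15

valid_cells = int(tas_celsius.count())   # number of unmasked cells
mean_celsius = float(tas_celsius.mean()) # mean over unmasked cells only
print(valid_cells, mean_celsius)
```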

## Comparison between RDH and TPI on coordinates

While exploring RDH and TPI, and from the dataset descriptions on the CDS, we saw how the data are distributed and which dimensions they have. We also saw that the size of each spatial dimension is the same in both datasets, but the coordinate layout may not be. We need to find a function $f: f(x,y) = (x', y') : x,y \in RDH \land x',y' \in TPI$, or vice versa. According to the official dataset documentation, each measurement in TPI and RDH refers to the same time dimension (one value per day, at the same hour every day of the year; the RDH timestamps printed earlier fall at 12:00 UTC). Therefore, we can analyze the main features of each measurement in the same way and compare the results, which should help the decisions taken during data management. 

In [182]:
# NOTE: len() reports only the first axis of an array; latitude and longitude
# are 2D (y, x) grids, so both sizes below are the number of rows (y).
print("-"*20)
print('Temperatures variables size for 1 year: ')
print("Size latitude: {}".format(len(nc_temp.variables['lat'][:])))
print("Size longitude: {} ".format(len(nc_temp.variables['lon'][:])))
print("Size time for each point: {}".format(len(nc_temp.variables['time'][:])))

print("-"*20)
print('Precipitations variables size for 1 year: ')
print("Size latitude: {}".format(len(nc_prec.variables['lat'][:])))
print("Size longitude: {} ".format(len(nc_prec.variables['lon'][:])))
print("Size time for each point: {}".format(len(nc_prec.variables['time'][:])))

print("-"*20)
print('Discharges variables size for 2 years: ')
print("Size latitude: {}".format(len(nc_dis.variables['latitude'][:])))
print("Size longitude: {} ".format(len(nc_dis.variables['longitude'][:])))
print("Size time for each point: {}".format(len(nc_dis.variables['time'][:])))
print("-"*20)


--------------------
Temperatures variables size for 1 year: 
Size latitude: 950
Size longitude: 950 
Size time for each point: 365
--------------------
Precipitations variables size for 1 year: 
Size latitude: 950
Size longitude: 950 
Size time for each point: 365
--------------------
Discharges variables size for 2 years: 
Size latitude: 950
Size longitude: 950 
Size time for each point: 498
--------------------


The discharge data contain only $498$ time steps because 2022 is still in progress and Copernicus provides observed data only, without predictions.
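
Latitude and longitude both report 950 above because len() returns only the size of the first axis. A short sketch on a synthetic array with the (y, x) shape printed earlier for the RDH latitude variable shows the difference between len(), shape, and size:

```python
import numpy as np

# Synthetic grid with the (y, x) shape reported for the RDH latitude variable.
lat_grid = np.zeros((950, 1000), dtype=np.float32)

n_first_axis = len(lat_grid)   # first axis only (rows, y)
full_shape = lat_grid.shape    # every axis
n_points = lat_grid.size       # total number of grid points

print(n_first_axis, full_shape, n_points)
```

The total of 950000 points agrees with the GRIB_numberOfPoints attribute of dis06.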

Let us now analyze how latitude and longitude relate between TPI and RDH. The aim is to explore the location layouts and find a correspondence between the geographical references of TPI and RDH measurements. The result of this phase is a function that maps each TPI measurement to the RDH point with the same latitude and longitude pair.

In [164]:
print('Variables Precipitations: {}'.format(nc_prec.variables.keys()))
print('Variables Temperatures: {}'.format(nc_temp.variables.keys()))

Variables Precipitations: dict_keys(['time', 'lon', 'lat', 'prAdjust'])
Variables Temperatures: dict_keys(['time', 'lon', 'lat', 'tasAdjust', 'height'])


In [167]:
print('Size of the precipitations and temperatures on a same location over time:')
# read the length of the time axis from the metadata instead of loading the full array
print("Precipitations: {}".format(nc_prec.variables['prAdjust'].shape[0]))
print("Temperatures: {}".format(nc_temp.variables['tasAdjust'].shape[0]))

Size of the precipitations and temperatures on a same location over time:
Precipitations: 365
Temperatures: 365


We have the same number of measurements over time for temperature and precipitation, so the only remaining question for this pair is a possible mismatch in coordinates. First, we check whether the mapping between precipitation and temperature coordinates is the identity function:

$f(x,y) = (x',y') : x = x' \land y = y'\ \forall\ x,y \in T\ \land \forall\ x',y' \in P$ with $T$ the set of temperature measurements and $P$ the set of precipitation measurements.

In [196]:
# number of rows in the 2D latitude/longitude grids of each dataset
size_prec_lon = len(nc_prec.variables['lon'][:])
size_temp_lon = len(nc_temp.variables['lon'][:])
size_prec_lat = len(nc_prec.variables['lat'][:])
size_temp_lat = len(nc_temp.variables['lat'][:])
if size_prec_lon != size_temp_lon or size_prec_lat != size_temp_lat:
    print('Geographical references mismatch on the number of locations.')

flag = True
n_rows = 950   # size of the y dimension (first index)
n_cols = 1000  # size of the x dimension (second index)
for i in range(n_rows):
    # walk down the first column (second index fixed at 0)
    lat_prec = nc_prec.variables['lat'][i][0]
    lat_temp = nc_temp.variables['lat'][i][0]
    lon_prec = nc_prec.variables['lon'][i][0]
    lon_temp = nc_temp.variables['lon'][i][0]
    if lat_prec != lat_temp or lon_prec != lon_temp:
        flag = False
        break

for j in range(n_cols):
    # walk along the first row (first index fixed at 0)
    lat_prec = nc_prec.variables['lat'][0][j]
    lat_temp = nc_temp.variables['lat'][0][j]
    lon_prec = nc_prec.variables['lon'][0][j]
    lon_temp = nc_temp.variables['lon'][0][j]
    if lat_prec != lat_temp or lon_prec != lon_temp:
        flag = False
        break

print(flag)

True


Given $(time, y, x)$, the coordinate grids do not depend on time, so we test only the spatial dimensions. We hold each index constant in turn and scan the first column and the first row of the grid. The output is consistent with an identity mapping between the spatial coordinates: temperature and precipitation share the same latitude and longitude references for every time step. This is convenient for data management and aggregation, because we can merge temperature and precipitation directly on the same coordinates along the time dimension. 
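
The check above scans only one row and one column. A cell-by-cell comparison of the whole grid is a stronger test and is cheap with NumPy. The sketch below is an illustration on synthetic grids; the function same_grid and its tolerance are assumptions, not part of the original analysis.

```python
import numpy as np

def same_grid(lat_a, lon_a, lat_b, lon_b, tol=1e-6):
    """Return True when two latitude/longitude grids match cell by cell."""
    if lat_a.shape != lat_b.shape or lon_a.shape != lon_b.shape:
        return False
    return bool(np.allclose(lat_a, lat_b, atol=tol) and
                np.allclose(lon_a, lon_b, atol=tol))

# Synthetic grids standing in for the arrays read from the files.
lat = np.linspace(28.0, 70.0, 12).reshape(3, 4)
lon = np.linspace(-20.0, 40.0, 12).reshape(3, 4)

identical = same_grid(lat, lon, lat.copy(), lon.copy())

changed = lat.copy()
changed[1, 2] += 0.5   # interior cell, which a border-only scan would miss
different = same_grid(lat, lon, changed, lon)

print(identical, different)
```

With the real files, the four arguments would be the full arrays read from the lat and lon variables of the two datasets.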

Finally, we analyze the correspondence between the geographical references of TPI and RDH, the last piece still missing. Since temperature and precipitation share the same spatial references, it is enough to compare only one of the two against the river discharge grid. 

In [198]:
# number of rows in the 2D latitude/longitude grids of each dataset
size_prec_lon = len(nc_prec.variables['lon'][:])
size_dis_lon = len(nc_dis.variables['longitude'][:])
size_prec_lat = len(nc_prec.variables['lat'][:])
size_dis_lat = len(nc_dis.variables['latitude'][:])
if size_prec_lon != size_dis_lon or size_prec_lat != size_dis_lat:
    print('Geographical references mismatch on the number of locations.')

flag = True
n_rows = 950   # size of the y dimension (first index)
n_cols = 1000  # size of the x dimension (second index)
for i in range(n_rows):
    # walk down the first column (second index fixed at 0)
    lat_prec = nc_prec.variables['lat'][i][0]
    lat_dis = nc_dis.variables['latitude'][i][0]
    lon_prec = nc_prec.variables['lon'][i][0]
    lon_dis = nc_dis.variables['longitude'][i][0]
    if lat_prec != lat_dis or lon_prec != lon_dis:
        flag = False
        break

for j in range(n_cols):
    # walk along the first row (first index fixed at 0)
    lat_prec = nc_prec.variables['lat'][0][j]
    lat_dis = nc_dis.variables['latitude'][0][j]
    lon_prec = nc_prec.variables['lon'][0][j]
    lon_dis = nc_dis.variables['longitude'][0][j]
    if lat_prec != lat_dis or lon_prec != lon_dis:
        flag = False
        break

print(flag)

    

False


Unfortunately, TPI and RDH are not related by the identity function as precipitation and temperature are. So we need a spatial reference on latitude and longitude and a procedure that maps TPI locations onto RDH locations so that their measurements can be aggregated. We first test only a few samples and, if the result is positive, extend the check to the entire domain. Our hypothesis is that the two grids differ by a reversal (flip) of the x and/or y index, because the sizes of the spatial dimensions already match.

In [201]:
# last valid indices: 949 on the first axis (950 rows), 999 on the second (1000 columns)
if nc_prec.variables['lat'][0][0] == nc_dis.variables['latitude'][949][999]:
    print("f(x,y) = (-x,-y) hypothesis possible")

if nc_prec.variables['lat'][0][0] == nc_dis.variables['latitude'][949][0]:
    print("f(x,y) = (-x, y) hypothesis possible")

if nc_prec.variables['lat'][0][0] == nc_dis.variables['latitude'][0][999]:
    print("f(x,y) = (x, -y) hypothesis possible")

f(x,y) = (-x, y) hypothesis possible
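
The corner test compares a single cell per hypothesis. As a sketch of how the same idea extends to the whole grid, the hypothetical helper below tries each index reversal with np.flip and reports the ones under which two grids coincide. The arrays are synthetic; the function name and labels are assumptions made for illustration.

```python
import numpy as np

def find_flip(reference, candidate):
    """Return the names of the index reversals that map candidate onto reference."""
    options = {
        "identity": candidate,
        "flip rows": np.flip(candidate, axis=0),
        "flip columns": np.flip(candidate, axis=1),
        "flip both": np.flip(candidate, axis=(0, 1)),
    }
    return [name for name, grid in options.items()
            if grid.shape == reference.shape and np.allclose(grid, reference)]

# Synthetic example: the candidate is the reference with its rows reversed.
reference = np.arange(12, dtype=float).reshape(3, 4)
candidate = reference[::-1, :]

matches = find_flip(reference, candidate)
print(matches)
```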


We tested every possible index reversal on the spatial matrix (x,y); the identity function ($f(x,y) = (x,y)$) had already failed in the previous snippet. Because the TPI and RDH grids do not change over time, we can confirm or reject the hypothesis on the correspondence between TPI and RDH coordinates using only our sample files.

In [206]:
flag = True
n_rows = 950  # size of the first (row) index
for i in range(n_rows):
    # compare row i of TPI with the mirrored row of RDH, along the first column
    lat_prec = nc_prec.variables['lat'][i][0]
    lat_dis = nc_dis.variables['latitude'][n_rows-i-1][0]
    lon_prec = nc_prec.variables['lon'][i][0]
    lon_dis = nc_dis.variables['longitude'][n_rows-i-1][0]
    if lat_prec != lat_dis or lon_prec != lon_dis:
        flag = False
        break

print(flag)
    

True


In [209]:
# the latitude grid has 950 rows; print a few mirrored pairs in column 10 as a spot check
for i in range(0,10):
    print("({},{}) : {} <-> ({},{}) : {}".format(i, 10, nc_prec.variables['lat'][i][10], 949-i, 10, nc_dis.variables['latitude'][949-i][10])) # f(i,j) == (-i, j)

(0,10) : 27.906766891479492 <-> (949,10) : 27.906766891479492
(1,10) : 27.951608657836914 <-> (948,10) : 27.951608657836914
(2,10) : 27.996444702148438 <-> (947,10) : 27.996444702148438
(3,10) : 28.041275024414062 <-> (946,10) : 28.041275024414062
(4,10) : 28.08609962463379 <-> (945,10) : 28.08609962463379
(5,10) : 28.130918502807617 <-> (944,10) : 28.130918502807617
(6,10) : 28.175731658935547 <-> (943,10) : 28.175731658935547
(7,10) : 28.220539093017578 <-> (942,10) : 28.220539093017578
(8,10) : 28.265338897705078 <-> (941,10) : 28.265338897705078
(9,10) : 28.310134887695312 <-> (940,10) : 28.310134887695312


In conclusion, the initial exploratory analysis of the spatial data yields essential information about the distribution of measurements, the size and dimensions of the data, and the metadata, seen from several views. This information should help the data aggregation along the time dimension, and the index reversal found between the TPI and RDH spatial grids simplifies the data manipulation that follows.