<font size="8"> **Adding environmental data from available observations to unique background points** </font>  
In this notebook, we will extract environmental data from sea ice and sea surface temperature observations and add them to our data frame containing unique background points per month and grid cell (see `04b_Creating_background_masks.ipynb` for more information).

# Setting working directory
To ensure these notebooks work correctly, we will set the working directory. We assume that you have saved a copy of this repository in your home directory (represented by `~` in the code chunk below). If you have saved this repository elsewhere on your machine, update this line with the correct file path to these notebooks.

In [1]:
import os
os.chdir(os.path.expanduser('~/Chapter2_Crabeaters/Scripts'))
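
If the repository was saved somewhere else, `os.chdir` fails with a fairly terse error. The sketch below wraps the same call in a guard that reports the expanded path. The helper name `set_working_dir` is hypothetical and is not part of this repository.

```python
import os

def set_working_dir(path):
    """Change to `path` (with `~` expanded) or raise a clear error if it is missing."""
    full_path = os.path.expanduser(path)
    if not os.path.isdir(full_path):
        raise FileNotFoundError(
            f'Folder not found: {full_path}. '
            'Update the path to match where you saved the repository.')
    os.chdir(full_path)
    # Returning the new working directory makes it easy to confirm the change
    return os.getcwd()
```

It would be called as `set_working_dir('~/Chapter2_Crabeaters/Scripts')`.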

# Loading other relevant libraries

In [2]:
from dask.distributed import Client
from glob import glob
#Accessing model data
import cosima_cookbook as cc
#Useful functions
import UsefulFunctions as uf
#Dealing with data
import xarray as xr
import pandas as pd
import numpy as np
#Data visualisation
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

# Parallelising work

In [3]:
client = Client()
client

Client
Connection method: Cluster object, Cluster type: distributed.LocalCluster
Dashboard: /proxy/46269/status

Cluster summary
Workers: 7, Total threads: 14, Total memory: 63.00 GiB
Status: running, Using processes: True

Per worker (7 workers)
Total threads: 2, Memory: 9.00 GiB


# Loading unique crabeater seal observations data frame

In [13]:
#Loading dataset as pandas data frame
crabeaters = pd.read_csv('../Cleaned_Data/unique_background_20x_obs_grid.csv')

#Ensuring date column is formatted correctly (year-month)
crabeaters['date'] = crabeaters.apply(lambda x: f'{x.year}-{str(x.month).zfill(2)}', axis = 1)

#Checking results
crabeaters

Unnamed: 0,date,year,month,yt_ocean,xt_ocean,yu_ocean,xu_ocean,season_year,life_stage,decade,sector,zone,presence
0,1999-12,1999,12,-64.157,85.55,-64.135,85.5,summer,weaning,1990,East Indian,Antarctic,0
1,1999-12,1999,12,-66.156,75.45,-66.135,75.5,summer,weaning,1990,Central Indian,Antarctic,0
2,1999-12,1999,12,-64.547,142.85,-64.568,142.8,summer,weaning,1990,Central Indian,Antarctic,0
3,1999-12,1999,12,-64.761,109.25,-64.739,109.2,summer,weaning,1990,Central Indian,Antarctic,0
4,1999-12,1999,12,-64.547,102.45,-64.568,102.4,summer,weaning,1990,Central Indian,Antarctic,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
35561,1998-11,1998,11,-65.565,74.95,-65.543,75.0,autumn,weaning,1990,Central Indian,Antarctic,0
35562,1987-11,1987,11,-62.955,108.05,-62.932,108.0,autumn,weaning,1980,Central Indian,Antarctic,0
35563,1997-12,1997,12,-62.864,91.25,-62.841,91.2,summer,weaning,1990,Central Indian,Antarctic,0
35564,1998-11,1998,11,-64.461,88.55,-64.439,88.5,autumn,weaning,1990,Central Indian,Antarctic,0
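
The row-wise `apply` used above works, but it calls a Python function once per row. The sketch below builds the same `year-month` string with vectorised string operations. The data frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy data frame with the same `year` and `month` columns (values are made up)
toy = pd.DataFrame({'year': [1999, 1987, 2003], 'month': [12, 11, 1]})

# Vectorised equivalent of the row-wise `apply`: zero-pad the month and join with '-'
toy['date'] = toy['year'].astype(str) + '-' + toy['month'].astype(str).str.zfill(2)
```

Both approaches should give identical labels, so this is only worth changing if the row-wise version becomes a bottleneck.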


# Loading crabeater seal masks

In [132]:
mask_all = xr.open_dataarray('/g/data/v45/la6889/Chapter2_Crabeaters/mask_background_20x_obs_ocean_grid.nc')

# Adding values for static variables only
Static variables are physical variables that do not change over time (at least not during our period of interest). Examples include depth of the water column and distance to the coastline. Because each grid cell has a single value for these variables, extracting them is relatively simple: we do not need to take into account the date when observations were collected.

## Defining dictionary with information about static variables
This dictionary contains the column label (key) and the file name (value) for each static variable to be included in our analysis. We will also define a variable containing the full path to the folder where all static variables are stored.

In [6]:
#Full path to static variables
base_dir_static = '/g/data/v45/la6889/Chapter2_Crabeaters/Static_Variables/'

#List of static variables
varDict = {'bottom_slope_deg': 'bathy_slope_GEBCO_2D.nc',
           'dist_shelf_km': 'distance_shelf.nc',
           'dist_coast_km': 'distance_coastline.nc',
           'depth_m': 'bathy_GEBCO_2D.nc'}

## Extracting data for each observation and adding it to a new column in crabeater data

In [16]:
#Looping through dictionary keys
for var in varDict:
    #Creating full path to file of interest
    file_path = os.path.join(base_dir_static, varDict[var])
    #Load as raster
    ras = xr.open_dataarray(file_path).sel(yt_ocean = slice(-80, -45))
    ras.name = var
    #Applying mask
    ras_masked = ras.where(mask_all == 0)
    #Transforming masked array into data frame
    ras_df = ras_masked.to_series().dropna().reset_index()
    #Rounding coordinate values to three decimal places
    ras_df = ras_df.round({'yt_ocean': 3, 'xt_ocean': 3})
    #Renaming masked data before merging to observations
    ras_df.rename(columns = {0: var}, inplace = True)
    #Adding to crabeater observations data frame
    crabeaters = crabeaters.merge(ras_df, on = ['yt_ocean', 'xt_ocean'], how = 'left').sort_values(['yt_ocean', 'xt_ocean'])
    
#Checking results
crabeaters

Unnamed: 0,date,year,month,yt_ocean,xt_ocean,yu_ocean,xu_ocean,season_year,life_stage,decade,sector,zone,presence,bottom_slope_deg,dist_shelf_km,dist_coast_km,depth_m
0,1999-12,1999,12,-69.239,75.05,-69.260,75.1,summer,weaning,1990,East Indian,Antarctic,0,89.812088,504.080212,140.255558,777.866638
1,1999-12,1999,12,-69.239,75.85,-69.260,75.9,summer,weaning,1990,Central Indian,Antarctic,0,89.953110,506.158223,143.956368,503.000000
2,1999-12,1999,12,-69.239,76.05,-69.260,76.1,summer,weaning,1990,East Indian,Antarctic,0,89.948738,507.046161,145.996809,522.900024
3,1999-12,1999,12,-69.239,77.25,-69.260,77.3,summer,weaning,1990,Central Indian,Antarctic,0,,509.340574,166.208399,
4,1999-12,1999,12,-69.155,73.95,-69.134,74.0,summer,weaning,1990,Central Indian,Antarctic,0,,494.994640,130.924977,599.817383
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35561,1999-12,1999,12,-59.442,69.25,-59.468,69.3,summer,weaning,1990,Central Indian,Antarctic,0,89.832825,556.221299,688.617402,4624.497070
35562,1999-12,1999,12,-59.442,71.95,-59.468,72.0,summer,weaning,1990,East Indian,Antarctic,0,89.652817,536.647811,699.380306,4429.308105
35563,1987-11,1987,11,-59.340,70.85,-59.366,70.9,autumn,weaning,1980,Central Indian,Antarctic,0,89.979530,551.535259,703.061410,4767.583496
35564,1998-11,1998,11,-59.340,71.55,-59.366,71.6,autumn,weaning,1990,Central Indian,Antarctic,0,89.929642,548.570655,707.329724,4608.486328
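
The loop above rounds the grid coordinates before merging because joins on floating-point keys only match when the values are exactly equal. The sketch below illustrates this with made-up points and depths. It assumes the coordinates in the background data frame are already stored to three decimal places, as they appear in the table above.

```python
import pandas as pd

# Toy background points (coordinates with three decimals, values are made up)
points = pd.DataFrame({'yt_ocean': [-64.157, -66.156], 'xt_ocean': [85.55, 75.45]})

# Toy grid values whose coordinates carry floating-point noise
grid = pd.DataFrame({'yt_ocean': [-64.157 + 1e-9, -66.156 - 1e-9],
                     'xt_ocean': [85.55 + 1e-9, 75.45],
                     'depth_m': [3200.0, 410.0]})

# Without rounding, the float keys are not exactly equal, so nothing matches
unrounded = points.merge(grid, on=['yt_ocean', 'xt_ocean'], how='left')

# Rounding the grid coordinates to three decimals makes the keys comparable
grid_rounded = grid.round({'yt_ocean': 3, 'xt_ocean': 3})
rounded = points.merge(grid_rounded, on=['yt_ocean', 'xt_ocean'], how='left')
```

In `unrounded` every `depth_m` value is missing, while `rounded` recovers both depths. If many unexpected `NaN` values appear after a merge, mismatched coordinate precision is one possible cause worth checking.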


## Saving data frame with static variables
Given that the dynamic variables take some time to extract, we will save intermediate results to avoid having to repeat this step.

In [17]:
#Defining output folder
folder_out = '../Cleaned_Data/Env_obs'
#Checking folder exists
os.makedirs(folder_out, exist_ok = True)

crabeaters.to_csv(os.path.join(folder_out, 'unique_background_20x_obs_static_env.csv'), index = False)
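
One way to reuse intermediate results is to load the saved file when it exists and only run the extraction otherwise. This is a minimal sketch: `load_or_build` and `build_fn` are hypothetical names and are not part of this repository.

```python
import os
import pandas as pd

def load_or_build(csv_path, build_fn):
    """Return the data frame stored at `csv_path`, building and saving it first if needed."""
    if os.path.isfile(csv_path):
        # A saved copy exists, so skip the expensive extraction
        return pd.read_csv(csv_path)
    df = build_fn()
    # Creating the output folder if it does not exist yet
    os.makedirs(os.path.dirname(csv_path) or '.', exist_ok=True)
    df.to_csv(csv_path, index=False)
    return df
```

Here `build_fn` would be a function with no arguments that runs the extraction and returns the resulting data frame. Note that a stale file is reused silently, so the saved copy should be deleted whenever the inputs change.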

# Adding values for dynamic variables
Given the number of crabeater seal observations and the time period covered by this dataset, extracting these values may take some time. We recommend saving the data frame each time a new variable is extracted to avoid losing data.

In [133]:
crabeaters = pd.read_csv('../Cleaned_Data/Env_obs/unique_background_20x_obs_static_env.csv')
#Ensuring date column is formatted correctly (year-month)
crabeaters['date'] = crabeaters.apply(lambda x: f'{x.year}-{str(x.month).zfill(2)}', axis = 1)
crabeaters

Unnamed: 0,date,year,month,yt_ocean,xt_ocean,yu_ocean,xu_ocean,season_year,life_stage,decade,sector,zone,presence,bottom_slope_deg,dist_shelf_km,dist_coast_km,depth_m
0,1999-12,1999,12,-69.239,75.05,-69.260,75.1,summer,weaning,1990,East Indian,Antarctic,0,89.812088,504.080212,140.255558,777.86664
1,1999-12,1999,12,-69.239,75.85,-69.260,75.9,summer,weaning,1990,Central Indian,Antarctic,0,89.953110,506.158223,143.956368,503.00000
2,1999-12,1999,12,-69.239,76.05,-69.260,76.1,summer,weaning,1990,East Indian,Antarctic,0,89.948738,507.046161,145.996809,522.90000
3,1999-12,1999,12,-69.239,77.25,-69.260,77.3,summer,weaning,1990,Central Indian,Antarctic,0,,509.340574,166.208399,
4,1999-12,1999,12,-69.155,73.95,-69.134,74.0,summer,weaning,1990,Central Indian,Antarctic,0,,494.994640,130.924977,599.81740
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35561,1999-12,1999,12,-59.442,69.25,-59.468,69.3,summer,weaning,1990,Central Indian,Antarctic,0,89.832825,556.221299,688.617402,4624.49700
35562,1999-12,1999,12,-59.442,71.95,-59.468,72.0,summer,weaning,1990,East Indian,Antarctic,0,89.652817,536.647811,699.380306,4429.30800
35563,1987-11,1987,11,-59.340,70.85,-59.366,70.9,autumn,weaning,1980,Central Indian,Antarctic,0,89.979530,551.535259,703.061410,4767.58350
35564,1998-11,1998,11,-59.340,71.55,-59.366,71.6,autumn,weaning,1990,Central Indian,Antarctic,0,89.929642,548.570655,707.329724,4608.48630


## Loading environmental data from observations

In [159]:
#Creating dictionary with useful information
varDict = {'var_name': 'dist_ice_edge_km',
           #Folder containing obs
           'obs_main': '/g/data/v45/la6889/Chapter2_Crabeaters/SeaIceObs/Distance_Edge/*.nc',
           #Output folder
           'base_out': '../Cleaned_Data'}

In [160]:
#Getting list of all obs in folder
files_var = sorted(glob(varDict['obs_main']))

#Loading all data into single dataset
var_df = xr.open_mfdataset(files_var)
var_df = var_df.rename_vars({'dist_km': 'dist_ice_edge_km'})
# var_df = var_df.rename_vars({'__xarray_dataarray_variable__': 'SST_degC'})
var_df = var_df.dist_ice_edge_km.rename({'lon': 'xt_ocean', 'lat': 'yt_ocean'})

#Subsetting data to Indian sectors
var_df = var_df.sel(yt_ocean = slice(-80, -40), xt_ocean = slice(30, 170))

#Checking results
var_df

Unnamed: 0,Array,Chunk
Bytes,3.67 GiB,7.62 MiB
Shape,"(494, 713, 1400)","(1, 713, 1400)"
Dask graph,494 chunks in 990 graph layers,494 chunks in 990 graph layers
Data type,float64 numpy.ndarray,float64 numpy.ndarray


## Cropping mask to cover the Indian sectors only

In [136]:
mask_all = mask_all.sel(yt_ocean = slice(-80, -40), xt_ocean = slice(30, 170))

## Subsetting variables by time to match observations
By subsetting the original dataset, we will reduce computing time.

In [161]:
#Getting years and month available in the env data from observations
var_dates = [f'{y}-{str(m).zfill(2)}' for y, m in zip(var_df.time.dt.year.values.tolist(), 
                                                      var_df.time.dt.month.values.tolist())]

#Matching with unique dates when crabeaters were observed
timesteps = sorted([d for d in crabeaters.date.unique() if d in var_dates])

#Creating an empty list to keep the subset of environmental data
var_df_int = []
#Looping through each time
for t in timesteps:
    var_df_int.append(var_df.sel(time = t))
    
#Creating a new data array with the time steps of interest
var_df_int = xr.concat(var_df_int, dim = 'time')
#Deleting original data to free up memory
del var_df

#Checking results - Time steps now match dates in the crabeater data frame
var_df_int

Unnamed: 0,Array,Chunk
Bytes,190.39 MiB,7.62 MiB
Shape,"(25, 713, 1400)","(1, 713, 1400)"
Dask graph,25 chunks in 1016 graph layers,25 chunks in 1016 graph layers
Data type,float64 numpy.ndarray,float64 numpy.ndarray
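
The same time selection can be done without a loop by comparing year-month labels directly. The sketch below uses a made-up monthly array and a made-up list of dates, and it assumes the time axis uses standard `datetime64` values.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy monthly data array (made-up values) standing in for the environmental data
time = pd.date_range('1999-10-01', periods=6, freq='MS')
toy = xr.DataArray(np.arange(6.0), coords={'time': time}, dims='time')

# Year-month labels we want to keep, in the same format as the `date` column
wanted = ['1999-12', '2000-02', '2005-01']

# Year-month label of every time step in the data
labels = toy.time.dt.strftime('%Y-%m')
# Integer positions of the time steps whose label is in the wanted list
keep = np.flatnonzero(labels.isin(wanted).values)
subset = toy.isel(time=keep)
```

Labels with no matching time step (here `2005-01`) are ignored, which mirrors the filtering applied to `timesteps` in the cell above.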


## Applying crabeater observations mask

In [162]:
#Applying mask
var_masked = var_df_int.where(~np.isnan(mask_all), drop = True)
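
To make the effect of `where(..., drop = True)` concrete, the sketch below applies the same call to a made-up 3 x 3 field. It assumes, as in the cell above, that `NaN` in the mask marks grid cells to discard.

```python
import numpy as np
import xarray as xr

# Toy 3 x 3 field and mask (made-up values); NaN in the mask marks cells to discard
coords = {'yt_ocean': [-66.0, -65.0, -64.0], 'xt_ocean': [70.0, 71.0, 72.0]}
field = xr.DataArray(np.arange(9.0).reshape(3, 3), coords=coords,
                     dims=('yt_ocean', 'xt_ocean'))
mask = xr.DataArray([[np.nan, np.nan, np.nan],
                     [np.nan, 0.0, np.nan],
                     [np.nan, 0.0, 0.0]], coords=coords,
                    dims=('yt_ocean', 'xt_ocean'))

# Keep values where the mask is not NaN; rows and columns that are
# entirely masked out are dropped, the remaining masked cells become NaN
masked = field.where(~np.isnan(mask), drop=True)
```

The result is a 2 x 2 array: the first row and first column are removed because they contain no valid cells, and the one masked cell left inside the remaining block is set to `NaN`.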

## Transforming masked data array into data frame
This will return values for all grid cells identified in the mask.

In [163]:
#Converting to pandas data frame
var_pd = var_masked.to_series().dropna().reset_index()

#Adding year and month column prior to merging with crabeater observations
var_pd['year'] = var_pd.apply(lambda i: i.time.year, axis = 1)
var_pd['month'] = var_pd.apply(lambda i: i.time.month, axis = 1)

#Finding names of coordinate columns to round
round_cols = [i for i in var_pd.columns if 'ocean' in i]
#Rounding coordinate values prior to merging
var_pd = var_pd.round({round_cols[0]: 3, round_cols[1]: 3})
#Removing time column that is not needed
var_pd = var_pd.drop(columns = 'time')

#Checking results
var_pd.head()

Unnamed: 0,yt_ocean,xt_ocean,dist_ice_edge_km,year,month
0,-69.239,75.05,634.139191,1981,12
1,-69.239,75.85,626.247187,1981,12
2,-69.239,76.05,624.569508,1981,12
3,-69.239,77.25,617.057024,1981,12
4,-69.155,73.95,629.533499,1981,12
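
When the `time` column holds `datetime64` values, the year and month can also be extracted with the pandas `.dt` accessor instead of a row-wise `apply`. This sketch uses made-up values and assumes a `datetime64` column. It would not work unchanged if the time axis used a non-standard calendar.

```python
import pandas as pd

# Toy data frame with a datetime64 `time` column (made-up values)
toy = pd.DataFrame({'time': pd.to_datetime(['1981-12-01', '1982-01-01']),
                    'dist_ice_edge_km': [634.1, 590.3]})

# Vectorised extraction of year and month
toy['year'] = toy['time'].dt.year
toy['month'] = toy['time'].dt.month
```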


In [164]:
#Getting column names for merging
cols = var_pd.drop(columns = varDict['var_name']).columns.tolist()
cols

['yt_ocean', 'xt_ocean', 'year', 'month']

## Joining masked data frame with background data frame
We will use the grid cell coordinates and dates to perform this join.

In [165]:
crabeaters = crabeaters.merge(var_pd, on = cols, how = 'left')
crabeaters

Unnamed: 0,date,year,month,yt_ocean,xt_ocean,yu_ocean,xu_ocean,season_year,life_stage,decade,...,zone,presence,bottom_slope_deg,dist_shelf_km,dist_coast_km,depth_m,SIC,SST_degC,lt_pack_ice,dist_ice_edge_km
0,1999-12,1999,12,-69.239,75.05,-69.260,75.1,summer,weaning,1990,...,Antarctic,0,89.812088,504.080212,140.255558,777.86664,,-1.506463,0.000000,724.237926
1,1999-12,1999,12,-69.239,75.85,-69.260,75.9,summer,weaning,1990,...,Antarctic,0,89.953110,506.158223,143.956368,503.00000,0.955706,-1.604321,0.535714,708.715428
2,1999-12,1999,12,-69.239,76.05,-69.260,76.1,summer,weaning,1990,...,Antarctic,0,89.948738,507.046161,145.996809,522.90000,0.945087,-1.573719,0.547619,705.058053
3,1999-12,1999,12,-69.239,77.25,-69.260,77.3,summer,weaning,1990,...,Antarctic,0,,509.340574,166.208399,,,,0.000000,685.100828
4,1999-12,1999,12,-69.155,73.95,-69.134,74.0,summer,weaning,1990,...,Antarctic,0,,494.994640,130.924977,599.81740,,-1.313767,0.000000,740.003921
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35561,1999-12,1999,12,-59.442,69.25,-59.468,69.3,summer,weaning,1990,...,Antarctic,0,89.832825,556.221299,688.617402,4624.49700,0.000000,-0.626995,0.000000,289.197201
35562,1999-12,1999,12,-59.442,71.95,-59.468,72.0,summer,weaning,1990,...,Antarctic,0,89.652817,536.647811,699.380306,4429.30800,0.000000,-0.988253,0.035714,276.617004
35563,1987-11,1987,11,-59.340,70.85,-59.366,70.9,autumn,weaning,1980,...,Antarctic,0,89.979530,551.535259,703.061410,4767.58350,0.044812,-1.306596,0.000000,28.266072
35564,1998-11,1998,11,-59.340,71.55,-59.366,71.6,autumn,weaning,1990,...,Antarctic,0,89.929642,548.570655,707.329724,4608.48630,0.038312,-1.382793,0.035714,32.938866
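
A left join keeps every background point, so rows without a matching grid cell and date end up with missing values rather than raising an error. The sketch below uses made-up data to show how the `validate` and `indicator` arguments of `merge` can check the join.

```python
import pandas as pd

# Toy background points and environmental values (made-up numbers)
points = pd.DataFrame({'yt_ocean': [-69.239, -69.239, -59.34],
                       'xt_ocean': [75.05, 75.85, 70.85],
                       'year': [1999, 1999, 1987],
                       'month': [12, 12, 11]})
env = pd.DataFrame({'yt_ocean': [-69.239, -59.34],
                    'xt_ocean': [75.05, 70.85],
                    'year': [1999, 1987],
                    'month': [12, 11],
                    'dist_ice_edge_km': [724.2, 28.3]})

keys = ['yt_ocean', 'xt_ocean', 'year', 'month']
# `validate` raises an error if a key combination is duplicated in `env`;
# `indicator` adds a `_merge` column flagging rows with no match
merged = points.merge(env, on=keys, how='left', validate='many_to_one', indicator=True)
n_unmatched = (merged['_merge'] == 'left_only').sum()
```

Here one of the three points has no environmental value, so `n_unmatched` is 1. The `_merge` column would need to be dropped before saving the data frame.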


## Saving data frame to disk

In [167]:
crabeaters.to_csv('../Cleaned_Data/Env_obs/unique_background_20x_obs_all_env.csv', index = False)