<font size="8"> **Adding environmental data from available observations to unique background points** </font>  
In this notebook, we will extract environmental data from sea ice and sea surface temperature observations using the coordinates of the unique background points.

# Setting working directory
To ensure these notebooks work correctly, we will set the working directory. We assume that you have saved a copy of this repository in your home directory (represented by `~` in the code chunk below). If you saved the repository elsewhere on your machine, update this line with the correct file path.

In [1]:
import os
os.chdir(os.path.expanduser('~/Chapter2_Crabeaters/Scripts'))

# Loading other relevant libraries

In [2]:
from dask.distributed import Client
from glob import glob
#Accessing model data
import cosima_cookbook as cc
#Useful functions
import UsefulFunctions as uf
#Dealing with data
import xarray as xr
import pandas as pd
import numpy as np
#Data visualisation
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

# Parallelising work

In [3]:
client = Client()
client

Client
Connection method: Cluster object, Cluster type: distributed.LocalCluster
Dashboard: /proxy/37059/status
Status: running, Using processes: True
Scheduler comm: tcp://127.0.0.1:44613
Workers: 7, Total threads: 14, Total memory: 56.00 GiB
Each worker: 2 threads, 8.00 GiB memory


# Adding values for dynamic variables
Given the number of background points and the time period covered by this dataset, extracting these values may take some time. We recommend saving the data frame each time a new variable is extracted to avoid losing data.
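One way to follow this recommendation is to wrap the save step in a small helper and call it after each extraction. The sketch below is only illustrative: the helper name `save_checkpoint`, the toy data frame, and the file name `toy_checkpoint.csv` are assumptions that do not appear elsewhere in this notebook, and a temporary folder is used so no real output is overwritten.

```python
import os
import tempfile
import pandas as pd

def save_checkpoint(df, out_folder, file_name):
    """Write df to out_folder/file_name as a csv file and return the file path."""
    # Hypothetical helper: creates the folder if needed, then saves without the index
    os.makedirs(out_folder, exist_ok=True)
    file_out = os.path.join(out_folder, file_name)
    df.to_csv(file_out, index=False)
    return file_out

# Toy data frame standing in for the background points (values are made up)
toy = pd.DataFrame({'year': [1987, 1996], 'month': [11, 11], 'presence': [0, 0]})

# Temporary folder used so the sketch does not touch real outputs
with tempfile.TemporaryDirectory() as tmp:
    # Pretend a new variable has just been extracted
    toy['new_var'] = [1.5, 2.5]
    # Save straight after adding the new column
    path = save_checkpoint(toy, tmp, 'toy_checkpoint.csv')
    # Reload to confirm the checkpoint contains the new column
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
```

If the session stops unexpectedly, the most recent csv file can be reloaded with `pd.read_csv` and the workflow resumed from the last variable that was saved.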

In [4]:
#Loading dataset as pandas data frame
crabeaters = pd.read_csv('../Biological_Data/BG_points/unique_background_20x_obs_grid.csv')

#Ensuring date column is formatted correctly (year-month)
crabeaters['date'] = crabeaters.apply(lambda x: f'{x.year}-{str(x.month).zfill(2)}', axis = 1)

#Checking results
crabeaters

Unnamed: 0,date,year,sector,longitude,latitude,xt_ocean,yt_ocean,zone,month,season_year,life_stage,decade,presence,bottom_slope_deg,dist_shelf_km,dist_coast_km,depth_m
0,1987-11,1987,Central Indian,71.45,-69.65,71.45,-69.662,Antarctic,11,autumn,weaning,1980,0,,,,
1,1987-11,1987,Central Indian,73.05,-69.65,73.05,-69.662,Antarctic,11,autumn,weaning,1980,0,,,,
2,1996-11,1996,Central Indian,74.45,-69.65,74.45,-69.662,Antarctic,11,autumn,weaning,1990,0,89.985,-550.881,187.224,518.817
3,1998-11,1998,Central Indian,76.55,-69.55,76.55,-69.535,Subantarctic,11,autumn,weaning,1990,0,,,,
4,1996-11,1996,Central Indian,73.75,-69.45,73.75,-69.451,Antarctic,11,autumn,weaning,1990,0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30666,1989-11,1989,Central Indian,74.05,-59.45,74.05,-59.442,Antarctic,11,autumn,weaning,1980,0,89.919,543.100,725.785,1618.292
30667,1998-11,1998,Central Indian,71.75,-59.25,71.75,-59.238,Antarctic,11,autumn,weaning,1990,0,89.976,559.524,720.119,4481.736
30668,1996-11,1996,Central Indian,76.35,-59.25,76.35,-59.238,Antarctic,11,autumn,weaning,1990,0,89.961,506.485,705.324,1267.889
30669,1989-11,1989,Central Indian,73.95,-58.85,73.95,-58.827,Subantarctic,11,autumn,weaning,1980,0,89.971,610.227,790.155,2223.069


## Loading environmental data from observations

In [14]:
#Creating dictionary with useful information
varDict = {'var_original': 'dist_ice',
           #Folder containing obs
           'obs_main': '/g/data/v45/la6889/Chapter2_Crabeaters/SeaIceObs/Distance_Edge/north_south/*.nc',
           #Name to store variable in final data frame
           'var_short': 'dist_ice_edge_km',
           #Output folder
           'base_out': '../Environmental_Data/Env_obs'}

In [15]:
#Getting list of all obs in folder
files_var = sorted(glob(varDict['obs_main']))
var = varDict['var_original']

#Loading all data into single dataset
var_df = xr.open_mfdataset(files_var)[var]
#Renaming data array variable
var_df.name = varDict['var_short']
#Changing coordinates names if needed
if 'xt_ocean' not in var_df.coords:
    var_df = var_df.rename({'lon': 'xt_ocean', 'lat': 'yt_ocean'})
#Selecting dates between 1981 and 2013 and for the Indian sectors
var_df = var_df.sel(time = slice('1981-11', '2013-12'), xt_ocean = slice(30, 170))
#Rechunking dataset
var_df = var_df.chunk((1, 135, 180))

#Checking results
var_df

,Array,Chunk
Bytes,2.98 GiB,189.84 kiB
Shape,"(386, 740, 1400)","(1, 135, 180)"
Dask graph,18528 chunks in 87 graph layers,18528 chunks in 87 graph layers
Data type,float64 numpy.ndarray,float64 numpy.ndarray


## Extracting environmental data
We will use the `latitude` and `longitude` columns together with the `date` column from the background points to find the nearest grid cell and time step in the observations, and extract the value of the environmental variable of interest.
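The pointwise lookup described above can be sketched on a toy grid. Everything in this example is made up for illustration (the grid values, the two points, and the names `grid`, `points` and `toy_sub`). It relies on the xarray behaviour that indexers sharing a single dimension return one value per point rather than a full latitude by longitude block.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy monthly grid: 3 time steps x 3 latitudes x 4 longitudes (made-up values)
times = pd.to_datetime(['1990-10-16', '1990-11-16', '1990-12-16'])
lats = np.array([-66.0, -65.0, -64.0])
lons = np.array([70.0, 71.0, 72.0, 73.0])
grid = xr.DataArray(np.arange(36, dtype=float).reshape(3, 3, 4),
                    coords={'time': times, 'yt_ocean': lats, 'xt_ocean': lons},
                    dims=['time', 'yt_ocean', 'xt_ocean'], name='toy_var')

# Two toy points, each with its own coordinates and year-month
points = pd.DataFrame({'latitude': [-65.2, -64.1],
                       'longitude': [70.4, 72.6],
                       'date': ['1990-11', '1990-12']})

# All three indexers share the 'points' dimension, so the selection is pointwise
lat = xr.DataArray(points.latitude.values, dims=['points'])
lon = xr.DataArray(points.longitude.values, dims=['points'])
time = xr.DataArray(pd.to_datetime(points.date + '-16').values, dims=['points'])

# Nearest grid cell and time step for each point
toy_sub = grid.sel(time=time, yt_ocean=lat, xt_ocean=lon, method='nearest')

# One row per point, with the coordinates of the matched grid cell
print(toy_sub.to_dataframe())
```

Note that the coordinates returned are those of the matched grid cell, not the original point coordinates, which is why grid cell coordinates can later be used as merge keys.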

In [18]:
#Getting coordinates from the crabeater data
lat = xr.DataArray(crabeaters.latitude)
lon = xr.DataArray(crabeaters.longitude)
#Getting date of each point from the crabeater data
time = xr.DataArray(crabeaters.apply(lambda x: pd.to_datetime(f'{x.date}-16'), axis = 1))

## Extracting data

In [19]:
#Extracting data
var_sub = var_df.sel(time = time, yt_ocean = lat, xt_ocean = lon, method = 'nearest')

#Transforming to data frame
var_pd = var_sub.to_dataframe().sort_values(['time', 'xt_ocean', 'yt_ocean'])
#Adding year and month
var_pd['year'] = var_pd.time.dt.year
var_pd['month'] = var_pd.time.dt.month
#Removing time column that is no longer needed
var_pd.drop(columns = 'time', inplace = True)
#Finding names of coordinate columns to round
round_cols = [i for i in var_pd.columns if 'ocean' in i]
#Rounding coordinate values prior to merging
var_pd = var_pd.round({round_cols[0]: 3, round_cols[1]: 3})
#Getting column names for merging
cols = var_pd.drop(columns = varDict['var_short']).columns.tolist()

#Checking results
print(cols); var_pd

['yt_ocean', 'xt_ocean', 'year', 'month']


dim_0,yt_ocean,xt_ocean,dist_ice_edge_km,year,month
313,-67.465,75.15,-442.943011,1981,12
20658,-64.331,75.35,-104.973925,1981,12
7426,-65.269,77.15,-186.808554,1981,12
2782,-66.240,77.65,-284.395662,1981,12
9918,-65.058,77.65,-156.536614,1981,12
...,...,...,...,...,...
17295,-64.633,146.95,-82.078142,2013,12
11107,-65.058,149.05,-158.077744,2013,12
20526,-64.461,149.95,-115.941450,2013,12
34218,-63.136,150.05,15.851961,2013,12


## Joining extracted data with background data frame
We will use the grid cell coordinates and dates to perform this join.

In [20]:
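#Adding extracted values to background data using grid cell coordinates and dates
crabeaters = crabeaters.merge(var_pd, on = cols, how = 'left')
#Removing any rows repeated by the merge (several points can share a grid cell and month)
crabeaters = crabeaters.drop_duplicates()
#Checking results
crabeaters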

Unnamed: 0,date,year,sector,longitude,latitude,xt_ocean,yt_ocean,zone,month,season_year,...,decade,presence,bottom_slope_deg,dist_shelf_km,dist_coast_km,depth_m,SIC,SST_degC,lt_pack_ice,dist_ice_edge_km
0,1987-11,1987,Central Indian,71.45,-69.65,71.45,-69.662,Antarctic,11,autumn,...,1980,0,,,,,,-1.410222,0.000000,
1,1987-11,1987,Central Indian,73.05,-69.65,73.05,-69.662,Antarctic,11,autumn,...,1980,0,,,,,,-1.548782,0.000000,
2,1996-11,1996,Central Indian,74.45,-69.65,74.45,-69.662,Antarctic,11,autumn,...,1990,0,89.985,-550.881,187.224,518.817,,,0.000000,
3,1998-11,1998,Central Indian,76.55,-69.55,76.55,-69.535,Subantarctic,11,autumn,...,1990,0,,,,,,,0.000000,
4,1996-11,1996,Central Indian,73.75,-69.45,73.75,-69.451,Antarctic,11,autumn,...,1990,0,,,,,,-1.581244,0.000000,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37880,1989-11,1989,Central Indian,74.05,-59.45,74.05,-59.442,Antarctic,11,autumn,...,1980,0,89.919,543.100,725.785,1618.292,0.153513,-1.342190,0.047619,-48.185653
37881,1998-11,1998,Central Indian,71.75,-59.25,71.75,-59.238,Antarctic,11,autumn,...,1990,0,89.976,559.524,720.119,4481.736,0.030800,-1.371573,0.035714,37.979983
37882,1996-11,1996,Central Indian,76.35,-59.25,76.35,-59.238,Antarctic,11,autumn,...,1990,0,89.961,506.485,705.324,1267.889,0.078127,-1.264371,0.071429,8.036973
37883,1989-11,1989,Central Indian,73.95,-58.85,73.95,-58.827,Subantarctic,11,autumn,...,1980,0,89.971,610.227,790.155,2223.069,0.095507,-1.304765,0.000000,5.751460
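A left join on grid cell coordinates and dates can repeat rows when several points fall in the same grid cell and month, because the extracted table then contains the same key more than once. The sketch below uses made-up toy data (`bg`, `env`, `toy_var` and `point_id` are hypothetical names) to show this effect and one possible alternative: removing repeated keys before the merge and asking pandas to check the relationship with `validate`. It is not the approach used in this notebook, which removes duplicates after merging.

```python
import pandas as pd

# Toy background points: the first two share the same grid cell and month
bg = pd.DataFrame({'point_id': [1, 2, 3],
                   'yt_ocean': [-65.058, -65.058, -64.461],
                   'xt_ocean': [77.65, 77.65, 149.95],
                   'year': [1981, 1981, 2013],
                   'month': [12, 12, 12]})

# Toy extracted values: one row per point, so the shared cell appears twice
env = pd.DataFrame({'yt_ocean': [-65.058, -65.058, -64.461],
                    'xt_ocean': [77.65, 77.65, 149.95],
                    'toy_var': [-156.5, -156.5, -115.9],
                    'year': [1981, 1981, 2013],
                    'month': [12, 12, 12]})

keys = ['yt_ocean', 'xt_ocean', 'year', 'month']

# Merging directly repeats rows for the shared cell (2 x 2 matches)
merged_raw = bg.merge(env, on=keys, how='left')

# Dropping repeated keys first keeps one row per background point;
# validate='many_to_one' raises an error if the right side still has repeated keys
env_unique = env.drop_duplicates(subset=keys)
merged = bg.merge(env_unique, on=keys, how='left', validate='many_to_one')

print(len(bg), len(merged_raw), len(merged))
```

Comparing the number of rows before and after the join, as in the last line, is a quick check that no background points were repeated or lost.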


## Saving data frame to disk

In [21]:
#Ensure output folder exists
os.makedirs(varDict['base_out'], exist_ok = True)

#Create file path where data will be saved
file_out = os.path.join(varDict['base_out'], 'unique_background_20x_obs_all_env.csv')

#Saving as csv file
crabeaters.to_csv(file_out, index = False)