# Xtractopy - `xoak` incorporation
*Andrew Chin, 11/13/2021*

Draft of incorporating the `xoak` package, per a recommendation at the OHW2021 feedback session. `xoak` readily matches oceanographic and atmospheric data, but only between `xarray` objects. This notebook tests an `xoak` workflow where:

1. the track df is converted into an `xarray` Dataset,
2. then concatenated with the environmental data,
3. then converted back into a df so it can be written easily to a csv.


Inputs remain the same as the prior version:
1. tag data of latitude, longitude, and date/time from the animal in a `pandas` df
2. environmental dataset(s) as an `xarray` Dataset
3. filename

In addition, I will try to incorporate a workflow skeleton for generalized subsetting of environmental data, to focus on a region of interest. This will need some help from the spatial ecologists of the group to figure out the inputs.

*Outputs*
1. a `.csv` file containing tag coordinates and corresponding environmental variables

## Generalized function `xtractopy()`

In [None]:
# necessary packages
import datetime as dt
import xarray as xr
import numpy as np
import pandas as pd
from typing import Dict, Union
import fsspec
import matplotlib.pyplot as plt
import xoak  # importing xoak registers the .xoak accessor on xarray objects

## Tutorial
Below is an example of an `xtractopy` workflow from OHW 2021. We will be working with tiger sharks (*Galeocerdo cuvier*) tagged in the Gulf Stream system of the Western Atlantic Ocean.

![tigershark](tigershark_lauramcdonnell.png)

First, let's load in the track data:

In [None]:
shark_dir = "shark track data/track_shark144020.csv"
track_ex = pd.read_csv(shark_dir, parse_dates=['datetime'])  # parse the datetime column as timestamps

# optional: shift longitudes from [-180, 180) to [0, 360) if the environmental grid uses 0-360
# track_ex["lon"] = np.where(
#     track_ex["lon"] < 0,
#     track_ex["lon"] + 360,
#     track_ex["lon"])

# bounding box around the track, padded by 2 degrees on each side
lat_min = track_ex["lat"].min() - 2.0
lat_max = track_ex["lat"].max() + 2.0
lon_min = track_ex["lon"].min() - 2.0
lon_max = track_ex["lon"].max() + 2.0

xy_bbox = dict(latitude=slice(lat_min, lat_max), longitude=slice(lon_min, lon_max))

plt.plot(track_ex.lon,track_ex.lat)

xy_bbox

In [None]:
track_ex

In [None]:
# grab the first 100 tag positions from the track
track_2014 = track_ex.iloc[0:100]
track_2014

## load in environmental data
We want to retrieve high-resolution data from web repositories and servers and load them into the Python environment as an `xarray` Dataset. In addition, we recommend subsetting the data to the study region for faster run times.

Here we use the sea surface temperature (SST) from MUR, available [here](https://registry.opendata.aws/mur/).

Also, ensure that your coordinate names match the naming conventions of the function, which use abbreviated versions of latitude and longitude as "lat" and "lon", respectively. This can be done with the `rename` function:

```
ds_env_data_renamed = ds_env_data.rename({'latitude':'lat', 'longitude':'lon', 'time':'time'}) # "old name" : "new name"
ds_env_data_renamed
```

In [None]:
# bring in data for SST
file_location = 's3://mur-sst/zarr'
ikey = fsspec.get_mapper(file_location, anon=True)
ds_sst = xr.open_zarr(ikey,consolidated=True)
ds_sst

In [None]:
# longitude bounds for the Gulf Stream subset
max_lon_glf = -70
min_lon_glf = -82

### generalized data subset function

In [None]:
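def subset_area(env_data,
                max_lon,
                min_lon):
    """Keep only the part of env_data with min_lon <= lon <= max_lon."""
    # boolean mask along the lon coordinate
    subset_lon = (env_data.lon >= min_lon) & (env_data.lon <= max_lon)
    # drop=True removes the longitudes that fall outside the mask
    subset_env_data = env_data.where(subset_lon, drop=True)
    return subset_env_data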

In [None]:
gulf_stream_sst = subset_area(ds_sst, max_lon_glf, min_lon_glf)
gulf_stream_sst
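
As a possible alternative to the mask-based subset above, label-based slicing with `.sel` can select a latitude and longitude window in one step. The sketch below is only an illustration: `subset_bbox` is a hypothetical helper, the small synthetic grid stands in for the MUR dataset, and it assumes `lat` and `lon` are sorted in ascending order.

```python
import numpy as np
import xarray as xr

# Small synthetic grid standing in for the MUR SST dataset (values are made up).
lat = np.arange(20.0, 41.0, 1.0)    # 20 ... 40
lon = np.arange(-90.0, -59.0, 1.0)  # -90 ... -60
time = np.array(["2014-11-15", "2014-11-16"], dtype="datetime64[ns]")
ds = xr.Dataset(
    {"analysed_sst": (("time", "lat", "lon"), np.zeros((time.size, lat.size, lon.size)))},
    coords={"time": time, "lat": lat, "lon": lon},
)


def subset_bbox(env_data, lat_min, lat_max, lon_min, lon_max):
    """Hypothetical helper: label-based subset of a regular grid.

    Assumes the lat and lon coordinates are sorted in ascending order.
    Both ends of each slice are included.
    """
    return env_data.sel(lat=slice(lat_min, lat_max), lon=slice(lon_min, lon_max))


gulf_stream = subset_bbox(ds, 25.0, 35.0, -82.0, -70.0)
print(gulf_stream.sizes)
```

The same bounds could come from the track itself, for example the padded `lat_min`, `lat_max`, `lon_min`, and `lon_max` computed earlier.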

## convert shark tracks to an xarray

In [97]:
# convert the track to an xarray Dataset, then promote lat, lon, and datetime to coordinates
track_2014_xarray = track_2014.to_xarray()
track_2014_xarray = track_2014_xarray.set_coords(("lat", "lon", "datetime"))

track_2014_xarray.coords

Coordinates:
  * index     (index) int64 0 1 2 3 4 5 6 7 8 9 ... 91 92 93 94 95 96 97 98 99
    lon       (index) float64 -78.98 -78.95 -78.92 ... -78.32 -78.47 -78.9
    lat       (index) float64 27.19 27.17 27.16 27.15 ... 27.62 27.96 28.0 27.66
    datetime  (index) datetime64[ns] 2014-11-15 2014-11-16 ... 2015-03-06

In [98]:
# xoak complains that "Coordinates {coords} must all have the same dimensions in the same order",
# so try combining lat, lon, and datetime into a single index
track_2014_xarray = track_2014_xarray.set_index(index=["lat","lon",'datetime'])
track_2014_xarray

## run `xoak` on the xarrays (STUCK HERE)

In [99]:
gulf_stream_sst.coords

Coordinates:
  * lat      (lat) float32 -89.99 -89.98 -89.97 -89.96 ... 89.97 89.98 89.99
  * lon      (lon) float32 -82.0 -81.99 -81.98 -81.97 ... -70.02 -70.01 -70.0
  * time     (time) datetime64[ns] 2002-06-01T09:00:00 ... 2020-01-20T09:00:00

In [100]:
gulf_stream_sst_index = gulf_stream_sst.xoak.set_index(["lat","lon","time"], "sklearn_geo_balltree")

ValueError: Coordinates {coords} must all have the same dimensions in the same order

In [None]:
## run xoak

ds_selection = gulf_stream_sst_index.xoak.sel(
    lat=track_2014_xarray.lat,
    lon=track_2014_xarray.lon,
    time=track_2014_xarray.datetime,
)

ds_selection
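
The `ValueError` above seems to come from the shape of the MUR grid: `lat`, `lon`, and `time` are each 1-D coordinates on their own dimension, while `xoak` expects the indexed coordinates to share the same dimensions. For a regular grid like this, one possible workaround that does not need `xoak` is plain `xarray` vectorized selection: passing indexers that share one dimension to `.sel(..., method="nearest")` returns one value per track point. The sketch below uses a tiny synthetic grid and a made-up variable name `sst`; it is an illustration under those assumptions, not the final workflow.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic regular grid (hypothetical stand-in for the MUR subset).
lat = np.arange(25.0, 31.0, 1.0)    # 25 ... 30
lon = np.arange(-82.0, -75.0, 1.0)  # -82 ... -76
time = pd.date_range("2014-11-15", periods=3, freq="D")

# Encode the grid position in each value so the match is easy to check:
# value = 10000 * time_index + 100 * lat_index + lon_index
values = (
    np.arange(time.size)[:, None, None] * 10000
    + np.arange(lat.size)[None, :, None] * 100
    + np.arange(lon.size)[None, None, :]
).astype(float)
ds = xr.Dataset(
    {"sst": (("time", "lat", "lon"), values)},
    coords={"time": time, "lat": lat, "lon": lon},
)

# Made-up track points in a pandas DataFrame, as in the tutorial.
track = pd.DataFrame(
    {
        "lat": [27.19, 27.62, 28.40],
        "lon": [-78.98, -78.32, -77.60],
        "datetime": pd.to_datetime(
            ["2014-11-15 03:00", "2014-11-16 10:00", "2014-11-17 00:00"]
        ),
    }
)

# to_xarray() puts every column on a shared dimension named "index".
points = track.to_xarray()

# Indexers that share a dimension trigger pointwise (vectorized) selection:
# the result has one nearest-neighbour match per track point.
matched = ds.sel(
    lat=points["lat"],
    lon=points["lon"],
    time=points["datetime"],
    method="nearest",
)
print(matched["sst"].values)
```

The result lives on the `index` dimension of the track, so it lines up row by row with the original DataFrame. `xoak` may still be the better fit when the environmental data sit on an irregular or curvilinear grid, where latitude and longitude are 2-D coordinates on shared dimensions.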

## concatenate variables, convert, and write

In [None]:
## concatenate and convert to df, then write to csv
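
A minimal sketch of this last step, assuming the matched environmental values are already aligned with the track on a shared `index` dimension. The numbers, the `matched` Dataset, and the column names are made up for illustration, and the CSV is written to an in-memory buffer rather than a file.

```python
import io

import pandas as pd
import xarray as xr

# Made-up track points.
track = pd.DataFrame(
    {
        "lat": [27.19, 27.17],
        "lon": [-78.98, -78.95],
        "datetime": pd.to_datetime(["2014-11-15", "2014-11-16"]),
    }
)

# Hypothetical stand-in for the matched environmental values, one per track point.
matched = xr.Dataset(
    {"analysed_sst": ("index", [298.1, 297.6])},
    coords={"index": track.index},
)

# Back to pandas, then join column-wise on the shared index.
env_df = matched.to_dataframe()
combined = pd.concat([track, env_df], axis=1)

# Write the CSV (to a buffer here; a file path works the same way).
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
print(buffer.getvalue())
```

With a real filename this would be `combined.to_csv(f"{filename}.csv", index=False)`. If the matched Dataset still carries the grid's own `lat`, `lon`, or `time` coordinates, `to_dataframe()` may return them as extra columns, so they would need to be dropped or renamed before the concat to avoid duplicate column names.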

In [None]:
### DONT TOUCH - TEMPLATE FOR REFERENCE

In [None]:

def xtractopy(envdata,
              tagdata: pd.DataFrame,
              filename: str):
    """
    envdata: environmental data as an xarray Dataset
    tagdata: tag data as a pandas DataFrame
    filename: name of the output .csv file (without the extension), as a string
    """
    def function_dataset_point(**kwargs) -> Dict[str, Union[float, int]]:
        pass

    def extract(function_dataset_point, 
                df: pd.DataFrame, 
                map_coordinates: Dict[str, str], 
                rename_variables: Dict[str, str]
               ) -> pd.DataFrame:
        """
        function_dataset_point: function that returns the environmental values at a single point
        df: tag data, one row per position
        map_coordinates: key is name of column in dataframe, value is the name of the coordinate in dataset
        rename_variables: TBD
        """
    
        def get_row(row) -> Dict[str, Union[float, int]]:
            extract_coordinates = {}
        
            for key, val in map_coordinates.items():
                extract_coordinates[val] = row[key]
        
            result = function_dataset_point(**extract_coordinates)
        
            # rename variables here and transform result TBD
            return result
    
        return df.apply(get_row, axis=1, result_type="expand")


    def envdata_point(lat, lon, time) -> Dict[str, Union[float, int]]:
        ds = envdata.sel(lat=lat, lon=lon, time=time, method="nearest")

        results = {}
    
        # keep only the data variables; the coordinates are already in the tag data
        for var in ds.data_vars:
            results[var] = ds[var].values.item()  # 0-d array -> plain Python scalar
    
        return results

    combined_dat = pd.concat([tagdata, 
                        extract(envdata_point,
                                tagdata, 
                                {"lat": "lat", "lon": "lon", "datetime": "time"},  # tagdata_label:envdata_point_label
                                {}
                               )
                       ], axis=1)
    combined_dat.to_csv(f"{filename}.csv")  # written as <filename>.csv in the working directory
    return combined_dat


In [None]:
# test
xtractopy(gulf_stream_sst, track_2014, "test_sst")