# Accessing the OBIS thermal dataset from Python

The `obistherm` dataset includes OBIS occurrence data matched with multiple sources of monthly temperature. Temperature data are extracted for each occurrence based on the date it was collected, at the recorded depth or across multiple depths. See below how to download the dataset and how to use it. The current version of `obistherm` is based on the OBIS full export of 2024-07-23 and covers the period from 1986 to 2024.

## Accessing the dataset

### Download a local copy

The final dataset is available through the OBIS AWS S3 bucket `s3://obis-products/obistherm`. If you have the **AWS** CLI program installed on your computer, you can run the following on the command line:

``` bash
aws s3 cp --recursive s3://obis-products/obistherm . --no-sign-request
```
This will download all files to your current folder. Alternatively, in Python you can use the `boto3` library:

In [None]:
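import os

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# AWS S3 configuration
bucket_name = "obis-products"
s3_folder = "obistherm"
local_folder = "obistherm"

# Create a local folder if it doesn't exist
os.makedirs(local_folder, exist_ok=True)

# Initialize an unsigned (anonymous) S3 client, the equivalent of --no-sign-request,
# so that no AWS credentials are needed for this public bucket
s3_client = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List all objects in the specified S3 folder; a paginator is used because
# a single list_objects_v2 call returns at most 1000 keys
paginator = s3_client.get_paginator("list_objects_v2")
s3_objects = []
for page in paginator.paginate(Bucket=bucket_name, Prefix=s3_folder):
    s3_objects.extend(page.get("Contents", []))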

# Download objects
total = len(s3_objects)
for i, obj in enumerate(s3_objects, start=1):
    s3_key = obj['Key']
    
    # Skip folders (keys ending with "/")
    if not s3_key.endswith('/'):
        local_file = os.path.join(local_folder, os.path.relpath(s3_key, s3_folder))
        
        # Ensure the local directory exists
        os.makedirs(os.path.dirname(local_file), exist_ok=True)
        
        # Download the object
        print(f"Downloading {i} out of {total}: {s3_key}")
        s3_client.download_file(bucket_name, s3_key, local_file)
        print(f"Downloaded: {s3_key} to {local_file}")
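
The download loop derives each local file path from the S3 key. A minimal sketch of that mapping as a standalone helper is shown below. It uses `posixpath` because S3 keys always use `/` as separator, whatever the local operating system. The helper name `local_path_for_key`, the `data` folder and the example key are hypothetical, chosen only for illustration.

```python
import posixpath
from pathlib import Path


def local_path_for_key(s3_key: str, s3_prefix: str, local_root: str) -> Path:
    """Map an S3 object key to a path under local_root (hypothetical helper)."""
    # S3 keys always use "/" as separator, regardless of the local OS
    relative = posixpath.relpath(s3_key, s3_prefix)
    # Path joins the parts using the separator of the local OS
    return Path(local_root).joinpath(*relative.split("/"))


# Hypothetical key that follows a hive layout (one "year=YYYY" folder per year)
example = local_path_for_key("obistherm/year=2020/part-0.parquet", "obistherm", "data")
print(example.as_posix())
```

Keeping the `year=YYYY` folder names intact in the local copy matters, because the year information is read from them when the dataset is opened.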

Once you have downloaded the data to a local folder, you can open it using `pyarrow`:

In [14]:
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

# Important to add "partitioning='hive'" so that the year information is captured
dataset = ds.dataset(local_folder, format="parquet", partitioning="hive")

This does not load the data into memory, which lets you work with such a big dataset (more than 103 million records) seamlessly in Python. The function `dataset` opens a representation of the dataset, which you can later filter and run other operations on. You can learn more about `arrow` [here](https://arrow.apache.org/docs/python/index.html) and in this [short tutorial](https://resources.obis.org/tutorials/arrow-obis/) (for R, but the principles are the same).

You can quickly see all columns that are available on the dataset by running this:

In [15]:
print(dataset.schema)

AphiaID: int32
species: string
family: string
id: string
dataset_id: string
occurrenceID: string
datasetID: string
minimumDepthInMeters: double
maximumDepthInMeters: double
decimalLongitude: double
decimalLatitude: double
eventDate: string
date_mid: int64
month: int32
surfaceTemperature: double
midTemperature: double
deepTemperature: double
bottomTemperature: double
midDepth: double
deepDepth: double
minimumDepthTemperature: double
maximumDepthTemperature: double
minimumDepthClosestDepth: double
maximumDepthClosestDepth: double
coraltempSST: double
murSST: double
ostiaSST: double
flag: int32
geometry: binary
h3_7: string
year: int32
-- schema metadata --
r: 'A
3
262913
197888
5
UTF-8
531
2
531
2
16
1
262153
8
geometry
781
29
N' + 1837
geo: '{"primary_column":"geometry","columns":{"geometry":{"crs":"GEOGCRS[' + 1212


### Accessing through the S3 storage

Downloading a local copy is the best solution to speed up any operation you need to do. However, it is also possible to access the dataset directly from the S3 storage:

In [None]:
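from pyarrow import fs
# Public bucket: connect anonymously; partitioning="hive" captures the year column
s3 = fs.S3FileSystem(anonymous=True, region=fs.resolve_s3_region("obis-products"))
ds_s3 = ds.dataset("obis-products/obistherm/", filesystem=s3, format="parquet", partitioning="hive")
print(ds_s3.schema)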

The speed of any operation on the S3 version depends on your internet connection and the type of operation. **The dataset is organized by year (note that the files use a [hive structure](https://arrow.apache.org/docs/r/articles/dataset.html)), so any operation that filters the data to a single year will be faster, because Arrow only needs to read one file.**

## Filtering and aggregating

Once you have opened the dataset, you can quickly generate summaries and filter the data. Let's start by looking at the number of records for the family Ocypodidae across years.

In [None]:
%%time
# Filter directly on the dataset to keep operations on disk
filtered_dataset = dataset.to_table(
    filter=(
        (ds.field("family") == "Ocypodidae") &  # Filter where family is "Ocypodidae"
        (~ds.field("species").is_null())        # Filter out rows where species is NA/None
    )
)

# Convert the filtered dataset to a Pandas DataFrame for further grouping
df = filtered_dataset.to_pandas()

# Perform grouping
ocyp_recs = (
    df.groupby(["species", "year"])
    .size()  # Count occurrences
    .reset_index(name="count")  # Reset index and name the count column
)

print(ocyp_recs)

#CPU times: user 22.8 s, sys: 5.15 s, total: 28 s
#Wall time: 3.82 s

Note how quick the operation is. Here is what we did:

1. We filter the data to keep only the family "Ocypodidae".
2. We remove records with no "species" name, that is, records not identified to the species rank.
3. We convert the result to an Arrow table, and then to a `pandas` DataFrame.
4. We group the data by _species_ and _year_.
5. We use `.size()` to count the number of records in each group.
 
Let's work with the 4 species with the largest number of records. We did the counts by year, so we will aggregate to get the total. 

In [22]:
# Identify the top 4 species
top_ocyp = (
    ocyp_recs.groupby("species")["count"]  # Group by 'species' and sum 'count'
    .sum()
    .reset_index(name="total")  
    .sort_values(by="total", ascending=False)  
    .head(4) 
)

print(top_ocyp)


Top Ocypodidae Species:
                    species  total
24          Leptuca thayeri    260
4           Austruca lactea    132
79      Ucides occidentalis    123
37  Ocypode ceratophthalmus     86
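
The sort-then-`head` pattern can also be written with `nlargest`, which returns the largest totals already sorted. A minimal sketch on made-up counts (the species labels `sp_a`, `sp_b` and `sp_c` are placeholders):

```python
import pandas as pd

# Toy per-year counts (made-up numbers), same layout as ocyp_recs
toy_recs = pd.DataFrame({
    "species": ["sp_a", "sp_a", "sp_b", "sp_c", "sp_c"],
    "year": [2019, 2020, 2019, 2019, 2020],
    "count": [5, 7, 3, 10, 1],
})

top2 = (
    toy_recs.groupby("species")["count"]
    .sum()          # total records per species
    .nlargest(2)    # keep the 2 largest totals, in descending order
    .reset_index(name="total")
)
print(top2)
```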


We will now filter the data for those species to check the temperatures across time.

In [None]:
top_species_list = top_ocyp["species"].tolist()

top_ocyp_data = (
    df[df["species"].isin(top_species_list)]  
    .loc[:, [ 
        "species", "surfaceTemperature", "coraltempSST", 
        "murSST", "ostiaSST", "year", "month", 
        "decimalLongitude", "decimalLatitude"
    ]]
)

# Rename column
top_ocyp_data = top_ocyp_data.rename(columns={"surfaceTemperature": "glorysSST"})

print(top_ocyp_data.head())

print(top_ocyp_data.describe(include="all"))

Here we use `.loc` to select only the columns that we are going to use. We also rename the surfaceTemperature column (which is the GLORYS product) to glorysSST.
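
One way to compare how complete each temperature product is for a set of records is to compute the share of non-missing values per column. The sketch below uses made-up values; on the real data the same expression could be applied to the four SST columns of `top_ocyp_data`.

```python
import numpy as np
import pandas as pd

# Toy temperatures (made up); NaN marks a record with no value for that product
toy = pd.DataFrame({
    "glorysSST": [27.0, np.nan, 26.5, np.nan],
    "coraltempSST": [27.2, 25.9, 26.4, 28.1],
    "murSST": [np.nan, np.nan, 26.6, 28.0],
    "ostiaSST": [27.1, np.nan, np.nan, np.nan],
})

# Share of records with a temperature value, per product, most complete first
completeness = toy.notna().mean().sort_values(ascending=False)
print(completeness)
```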

CoralTemp is the most complete product in this case, so we will focus on it. We can quickly produce a plot of temperature over time for those 4 species.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from plotnine import (
    ggplot, aes, geom_point, facet_wrap, theme_light, geom_boxplot, geom_hline
)

# Add a 'date' column combining year and month
top_ocyp_data["date"] = pd.to_datetime(
    top_ocyp_data["year"].astype(str) + "-" +
    top_ocyp_data["month"].astype(str) + "-01"
)

(
    ggplot(top_ocyp_data, aes(x="date", y="coraltempSST")) +
    geom_point(aes(color="species")) +
    facet_wrap("~species") +
    theme_light()
)
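
The `date` column above is built by concatenating strings. As a sketch of an alternative, `pd.to_datetime` can also assemble dates directly from columns named `year`, `month` and `day`. The values below are made up.

```python
import pandas as pd

toy = pd.DataFrame({"year": [2019, 2020], "month": [3, 11]})

# Add a constant day of 1 and let pandas assemble the date from the three columns
toy["date"] = pd.to_datetime(toy[["year", "month"]].assign(day=1))
print(toy)
```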

We can also draw a boxplot of the full data for each species. We will also compute the 0.95 quantile, which we can use as an indication of the upper thermal limit.

In [None]:
limits = (
    top_ocyp_data.groupby("species")["coraltempSST"]
    .quantile(0.95)
    .reset_index(name="top_limit")
)

(
    ggplot(top_ocyp_data, aes(x="species", y="coraltempSST")) +
    geom_boxplot(aes(fill="species")) +
    geom_hline(aes(yintercept="top_limit", color="species"), data=limits) +
    theme_light()
)
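
To put a number on the thermal range of each species, the same quantile approach can be extended with a lower quantile. The sketch below uses synthetic temperatures. The choice of 0.05 as the lower quantile and the column names `low_limit` and `range` are assumptions for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic temperatures: sp_a spans 20-30 °C, sp_b spans 10-30 °C
toy = pd.DataFrame({
    "species": ["sp_a"] * 101 + ["sp_b"] * 101,
    "coraltempSST": np.concatenate([np.linspace(20, 30, 101),
                                    np.linspace(10, 30, 101)]),
})

limits = (
    toy.groupby("species")["coraltempSST"]
    .quantile([0.05, 0.95])   # lower and upper quantiles per species
    .unstack()                # one column per quantile
)
limits.columns = ["low_limit", "top_limit"]
limits["range"] = limits["top_limit"] - limits["low_limit"]
print(limits)
```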

It appears that _Austruca lactea_ has the widest thermal range. Let's plot the records on a map:

In [None]:
import geopandas as gpd
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature

fig, ax = plt.subplots(
    1, 1, figsize=(12, 8), 
    subplot_kw={'projection': ccrs.PlateCarree()}  # WGS84 projection
)

ax.add_feature(cfeature.LAND, facecolor='grey')
ax.add_feature(cfeature.COASTLINE, edgecolor='black')

from shapely.geometry import Point

geometry = [Point(xy) for xy in zip(top_ocyp_data["decimalLongitude"], top_ocyp_data["decimalLatitude"])]
top_ocyp_data_geo = gpd.GeoDataFrame(top_ocyp_data, geometry=geometry, crs="EPSG:4326")

top_ocyp_data_geo.plot(
    ax=ax, 
    transform=ccrs.PlateCarree(), 
    marker="o", 
    column="species", 
    legend=True,
    cmap="Set2"
)

plt.title("Ocypodidae species records")
plt.show()


Check the README of the repository for more information about the dataset.