# NOAA Sea Surface Temperature (SST) Data Processing

A notebook that introduces working with netCDF files and computes the average Sea Surface Temperature (SST) per month between 1981 and 2025, based on netCDF files from the [NOAA Sea Surface Temperature Dataset](https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/).

The notebook was created in the context of the modules Think 4 / Play 4 in Spring semester 2025 at Data Design + Art by Max Frischknecht for a project by Elisee Kukulu and Eva Brinksma about the relation between hurricane intensity and sea surface temperature. 


In [1]:
import xarray as xr
print(xr.__version__) # should output 2023.12.0

import netCDF4
print(netCDF4.__version__) # should output 1.7.2

from xarray.backends import list_engines
print(list_engines()) # check that 'netcdf4' is among the available engines

# import pandas and altair for further processing & visualization
import pandas as pd
import altair as alt
alt.data_transformers.enable("vegafusion") ## for big data sets

import os # to read many files from the operating system


2023.12.0
1.7.2
{'netcdf4': <NetCDF4BackendEntrypoint>
  Open netCDF (.nc, .nc4 and .cdf) and most HDF5 files using netCDF4 in Xarray
  Learn more at https://docs.xarray.dev/en/stable/generated/xarray.backends.NetCDF4BackendEntrypoint.html, 'h5netcdf': <H5netcdfBackendEntrypoint>
  Open netCDF (.nc, .nc4 and .cdf) and most HDF5 files using h5netcdf in Xarray
  Learn more at https://docs.xarray.dev/en/stable/generated/xarray.backends.H5netcdfBackendEntrypoint.html, 'scipy': <ScipyBackendEntrypoint>
  Open netCDF files (.nc, .nc4, .cdf and .gz) using scipy in Xarray
  Learn more at https://docs.xarray.dev/en/stable/generated/xarray.backends.ScipyBackendEntrypoint.html, 'store': <StoreBackendEntrypoint>
  Open AbstractDataStore instances in Xarray
  Learn more at https://docs.xarray.dev/en/stable/generated/xarray.backends.StoreBackendEntrypoint.html}
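
The version checks above are compared by eye against the expected output. As a minimal sketch, the installed versions could also be looked up programmatically with the standard library. The helper name `installed_version` is hypothetical and not part of the notebook:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the packages this notebook depends on
for name in ("xarray", "netCDF4", "pandas", "altair"):
    print(name, installed_version(name) or "not installed")
```

This reports a missing package instead of failing on the first `import`.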


## 1. Exploring netCDF and the Data Structure

This section explores how to work with netCDF files. 
First we open one example file to get to know the netCDF4 format a bit better. 

In [2]:
# load one file and print the data
ds = xr.open_dataset("./data/oisst-avhrr-v02r01.20250101_example.nc", engine='netcdf4')
print(ds)

<xarray.Dataset>
Dimensions:  (time: 1, zlev: 1, lat: 720, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 2025-01-01T12:00:16.364011520
  * zlev     (zlev) float32 0.0
  * lat      (lat) float32 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float32 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
Data variables:
    sst      (time, zlev, lat, lon) float32 ...
    anom     (time, zlev, lat, lon) float32 ...
    err      (time, zlev, lat, lon) float32 ...
    ice      (time, zlev, lat, lon) float32 ...
Attributes: (12/37)
    Conventions:                CF-1.6, ACDD-1.3
    title:                      NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...
    references:                 Reynolds, et al.(2007) Daily High-Resolution-...
    source:                     ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
    id:                         oisst-avhrr-v02r01.20250101.nc
    naming_authority:           gov.noaa.ncei
    ...                         ...

SST (sea surface temperature) is 4-dimensional:
- time: 1 time point (for now, just the 20250101 file)
- zlev: 1 depth level (surface = 0m)
- lat and lon: your spatial grid
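
The dimension sizes follow directly from the quarter-degree grid spacing. A short worked check, assuming a regular global grid with one cell every 0.25° (the variable names are illustrative):

```python
# Grid spacing of the dataset in degrees (1/4 degree product)
RESOLUTION = 0.25

n_lat = int(180 / RESOLUTION)      # latitude spans -90..90
n_lon = int(360 / RESOLUTION)      # longitude spans 0..360
n_values = 1 * 1 * n_lat * n_lon   # time x zlev x lat x lon

print(n_lat, n_lon, n_values)  # 720 1440 1036800
```

The product matches the number of values xarray reports for the `sst` variable of a single daily file.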

In [3]:
# get the sst data section and print it
sst = ds['sst']
print(sst)

<xarray.DataArray 'sst' (time: 1, zlev: 1, lat: 720, lon: 1440)>
[1036800 values with dtype=float32]
Coordinates:
  * time     (time) datetime64[ns] 2025-01-01T12:00:16.364011520
  * zlev     (zlev) float32 0.0
  * lat      (lat) float32 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float32 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
Attributes:
    long_name:  Daily sea surface temperature
    units:      Celsius
    valid_min:  -300
    valid_max:  4500


### Understanding Longitude and Latitude

Geographic latitude is measured from the equator: 0° to 90° north (North Pole) and 0° to 90° south (South Pole). Geographic longitude is measured from the prime meridian: 0° to 180° east and 0° to 180° west.

![](image.png)



The NOAA dataset uses a different convention for longitude:

- Latitude (-89.875 to 89.875) already follows the standard convention (degrees north). No conversion needed.
- Longitude (0.125 to 359.875) uses a 0–360 system (degrees east), whereas most mapping tools like QGIS or Google Maps use -180 to 180.

| 0–360° System | -180 to 180° System | Meaning                    |
|---------------|----------------------|-----------------------------|
| 0°            | 0°                   | Prime Meridian (Greenwich) |
| 90°           | 90°                  | 90° East                    |
| 180°          | 180°                 | International Date Line     |
| 270°          | -90°                 | 90° West                    |
| 360°          | 0°                   | Wraps back to Prime Meridian|


**Conversion Formula**

```python
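def convert_longitude(lon):
    """Convert a longitude from the 0-360 system to the -180..180 system."""
    return lon - 360 if lon > 180 else lon

# example longitudes in the 0-360 system
longitudes = [0.125, 90.0, 270.0, 359.875]
converted = [convert_longitude(lon) for lon in longitudes]
print(converted)  # [0.125, 90.0, -90.0, -0.125]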
```
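
The same conversion can also be written with modular arithmetic, which avoids the branch and works unchanged on NumPy arrays or pandas Series. A small sketch (the name `wrap_to_180` is an assumption, not part of the notebook). Note that it maps exactly 180° to -180°, whereas the conditional version leaves 180° as it is:

```python
def wrap_to_180(lon):
    """Map a longitude from the 0-360 system to the -180..180 system."""
    return ((lon + 180.0) % 360.0) - 180.0


for lon in (0.125, 90.0, 270.0, 359.875):
    print(lon, "->", wrap_to_180(lon))
```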

In [9]:
lat = ds["lat"]
lon = ds["lon"]
print(lat)
print("-" * 80)
print(lon)

<xarray.DataArray 'lat' (lat: 720)>
array([-89.875, -89.625, -89.375, ...,  89.375,  89.625,  89.875],
      dtype=float32)
Coordinates:
  * lat      (lat) float32 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
Attributes:
    long_name:  Latitude
    units:      degrees_north
    grids:      Uniform grid from -89.875 to 89.875 by 0.25
--------------------------------------------------------------------------------
<xarray.DataArray 'lon' (lon: 1440)>
array([1.25000e-01, 3.75000e-01, 6.25000e-01, ..., 3.59375e+02, 3.59625e+02,
       3.59875e+02], dtype=float32)
Coordinates:
  * lon      (lon) float32 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
Attributes:
    long_name:  Longitude
    units:      degrees_east
    grids:      Uniform grid from 0.125 to 359.875 by 0.25
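
Because both axes are uniform grids, the array index of a cell can be computed directly from its coordinate. A sketch under that assumption, using the grid start and spacing from the `grids` attributes above. `lat_index` and `lon_index` are hypothetical helpers, and they do not clamp values outside the grid:

```python
GRID_STEP = 0.25
LAT_START = -89.875  # first latitude cell center
LON_START = 0.125    # first longitude cell center (0-360 system)


def lat_index(lat):
    """Index of the grid cell whose center is nearest to `lat`."""
    return round((lat - LAT_START) / GRID_STEP)


def lon_index(lon):
    """Index of the grid cell whose center is nearest to `lon` (0-360 system)."""
    return round((lon - LON_START) / GRID_STEP)


print(lat_index(-89.875), lat_index(89.875))  # 0 719
print(lon_index(0.125), lon_index(359.875))   # 0 1439
```

In practice xarray performs this lookup itself, for example with `ds.sel(lat=..., lon=..., method="nearest")`.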


### Extract the SST for a specific region (bounding box)

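To focus on a region like the North Atlantic, you can slice the dataset by coordinate labels. Because the dataset stores longitudes in the 0–360 system, the range 80°W to 0° becomes 280 to 360:

```python
northAtlantic = ds.sst.sel(lat=slice(30, 60), lon=slice(280, 360))
```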

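If we have a [bounding box](http://bboxfinder.com/#9.985143,-98.833008,45.673728,-12.963867), e.g. `-98.833008,9.985143,-12.963867,45.673728` (in the order lon_min, lat_min, lon_max, lat_max), its longitudes first have to be converted to the 0–360 system: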

In [12]:
lon_min = -98.833008
lat_min = 9.985143
lon_max = -12.963867
lat_max = 45.673728

def to_0360(lon):
    return lon + 360 if lon < 0 else lon

lon_min = to_0360(-98.833008)  # 261.166992
lon_max = to_0360(-12.963867)  # 347.036133

print("selecting: ", lat_min, lat_max, lon_min, lon_max)
print("-" * 80)

northAtlantic = ds.sst.sel(lat=slice(lat_min, lat_max), lon=slice(lon_min, lon_max))
print(northAtlantic)

selecting:  9.985143 45.673728 261.166992 347.036133
--------------------------------------------------------------------------------
<xarray.DataArray 'sst' (time: 1, zlev: 1, lat: 143, lon: 343)>
[49049 values with dtype=float32]
Coordinates:
  * time     (time) datetime64[ns] 2025-01-01T12:00:16.364011520
  * zlev     (zlev) float32 0.0
  * lat      (lat) float32 10.12 10.38 10.62 10.88 ... 44.88 45.12 45.38 45.62
  * lon      (lon) float32 261.4 261.6 261.9 262.1 ... 346.1 346.4 346.6 346.9
Attributes:
    long_name:  Daily sea surface temperature
    units:      Celsius
    valid_min:  -300
    valid_max:  4500
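
The shape of the selection (143 x 343) can be verified by counting the grid-cell centers that fall inside the bounding box. A minimal sketch, assuming that label-based slicing includes both bounds and that the grid is the uniform 0.25° grid described above (`count_centers` is a hypothetical helper):

```python
import math

GRID_STEP = 0.25
LAT_START = -89.875  # first latitude cell center
LON_START = 0.125    # first longitude cell center (0-360 system)


def count_centers(lower, upper, start):
    """Count grid-cell centers c = start + k * GRID_STEP with lower <= c <= upper."""
    first = math.ceil((lower - start) / GRID_STEP)
    last = math.floor((upper - start) / GRID_STEP)
    return max(0, last - first + 1)


n_lat = count_centers(9.985143, 45.673728, LAT_START)
n_lon = count_centers(261.166992, 347.036133, LON_START)
print(n_lat, n_lon, n_lat * n_lon)  # 143 343 49049
```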


To see the actual temperature values, print the `values` attribute of the selection. The output is the sea surface temperature (SST) in degrees Celsius:

- Warm water (around 25°C) near the southern end of the region.
- Colder water (around 8-9°C) near the northern end of the region (about 45.6°N).
- Some `nan` values (no data or land areas).

In [20]:
print(northAtlantic.values)

[[[[25.82       25.849998   25.46       ...         nan         nan
            nan]
   [25.699999   25.869999   25.599998   ...         nan         nan
            nan]
   [25.25       25.789999   25.75       ...         nan         nan
            nan]
   ...
   [-0.5        -0.17999999  0.06       ...  8.74        8.67
     8.54      ]
   [-0.74       -0.42999998 -0.14       ...  8.84        8.82
     8.72      ]
   [-0.89       -0.56       -0.21       ...  8.9         8.94
     8.9       ]]]]
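
Averages over such a region have to skip the `nan` cells, otherwise the result is itself `nan`. A small sketch with a few values from the first row printed above. pandas' `mean()` skips NaN by default, while this standard-library version makes the step explicit:

```python
import math

# A few values from the first row printed above (nan = land or missing data)
row = [25.82, 25.849998, 25.46, math.nan, math.nan, math.nan]

# Keep only the cells that actually carry a temperature
valid = [value for value in row if not math.isnan(value)]
mean_sst = sum(valid) / len(valid)

print(len(valid), round(mean_sst, 2))  # 3 25.71
```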


### Transform the Values into a Pandas Dataframe

In [41]:
# Convert to DataFrame (one row per time/zlev/lat/lon combination)
df = northAtlantic.to_dataframe().reset_index()
print(df.head())

# Optionally drop rows without an SST value (NaN, e.g. land areas)
# df = df.dropna(subset=['sst'])

# close the dataset to free memory
ds.close()
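
`to_dataframe()` produces a long-format table: one row per combination of coordinate values, with the data variable as a column. A conceptual sketch with a tiny hypothetical 2 x 3 grid, using only the standard library (the SST values are made up for illustration):

```python
from itertools import product

# Hypothetical miniature grid: 2 latitudes x 3 longitudes
lats = [10.125, 10.375]
lons = [261.375, 261.625, 261.875]
sst_grid = [
    [27.1, 27.2, 27.3],  # values along lat = 10.125
    [26.9, 27.0, 27.1],  # values along lat = 10.375
]

# Flatten the grid: one (lat, lon, sst) record per cell
rows = [
    {"lat": lats[i], "lon": lons[j], "sst": sst_grid[i][j]}
    for i, j in product(range(len(lats)), range(len(lons)))
]

print(len(rows))  # 6
print(rows[0])
```

For the real selection this yields 143 x 343 = 49049 rows per daily file.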

## 2. Scaling Up

So far so good. However, the actual task is more complex: the [NOAA directory](https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/) contains one file per day for each year (1981 to 2025), and we want to calculate an average for each month of each year.

In [2]:
# define where the data is
BASE_DIR = './data/'

# define the time frame
START_YEAR = 1993
END_YEAR = 1995

In [3]:
# define the area (bounding box in the -180 to 180 system)
lon_min = -98.833008
lat_min = 9.985143
lon_max = -12.963867
lat_max = 45.673728

# function to transform -180 to 180 longitude values into 0-360 values
def to_0360(lon):
    return lon + 360 if lon < 0 else lon

lon_min = to_0360(lon_min)  # 261.166992
lon_max = to_0360(lon_max)  # 347.036133

print("selecting: ", lat_min, lat_max, lon_min, lon_max)

selecting:  9.985143 45.673728 261.166992 347.036133


The folders should be inside `./data/` like so: 
- 199301
- 199302
- 199303
- ...
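
Given this layout, the expected daily files of a month can be enumerated up front, which helps to spot missing downloads. A sketch assuming the file naming pattern `oisst-avhrr-v02r01.YYYYMMDD.nc` seen in the example file's `id` attribute. The helper `expected_files` is hypothetical, and files with a different name would not be matched:

```python
import calendar
import os

BASE_DIR = "./data/"


def expected_files(year, month):
    """Return the paths of all daily files expected for one month."""
    folder = f"{year}{month:02d}"
    n_days = calendar.monthrange(year, month)[1]
    return [
        os.path.join(BASE_DIR, folder, f"oisst-avhrr-v02r01.{folder}{day:02d}.nc")
        for day in range(1, n_days + 1)
    ]


files = expected_files(1993, 2)
print(len(files))  # 28
print(files[0])

# Compare against what is actually on disk
missing = [path for path in files if not os.path.exists(path)]
print(f"{len(missing)} of {len(files)} files missing")
```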

In [4]:
def process_sst_data():
    # Initialize a list to store data frames
    data_list = []
    
    # Iterate over each month folder (YYYYMM) in the base directory
    for month_folder in sorted(os.listdir(BASE_DIR)):
        # Ensure the folder name is a valid number before extracting the year
        if not month_folder.isdigit() or len(month_folder) != 6:
            continue  # Skip invalid folders
        # Extract year from YYYYMM format
        year = int(month_folder[:4])
        
        # Skip folders outside the selected range
        if year < START_YEAR or year > END_YEAR:
            continue
        
        month_path = os.path.join(BASE_DIR, month_folder)
        
        if not os.path.isdir(month_path):
            continue  # Skip if it's not a directory

        print(year, month_folder, month_path)
        
        # List all .nc files within the month folder
        files = sorted(
            [os.path.join(month_path, file) for file in os.listdir(month_path) if file.endswith('.nc')],
            key=lambda x: x.split('.')[-2]  # Extract the date part from filename
        )
        
        # Process each file
        for file in files:
            # The context manager closes the file even when it is skipped
            with xr.open_dataset(file) as current_ds:
                if 'sst' not in current_ds.variables:
                    print(f"Warning: 'sst' not found in {file}")
                    continue  # Skip files missing SST variable

                # Select the North Atlantic region
                area_select = current_ds.sst.sel(lat=slice(lat_min, lat_max), lon=slice(lon_min, lon_max))

                # Convert to DataFrame and reset index (loads the values into memory)
                new_df = area_select.to_dataframe().reset_index()

            # Tag each row with year and month
            new_df['year'] = year
            new_df['month'] = month_folder[-2:]  # Extract last 2 digits of YYYYMM folder

            # Append to list
            data_list.append(new_df)
    
    # Combine all monthly data
    #print(data_list)
    final_df = pd.concat(data_list, ignore_index=True)
    
    # Calculate the mean SST for each (year, month, lat, lon)
    mean_sst_df = final_df.groupby(['year', 'month', 'lat', 'lon'])['sst'].mean().reset_index()
    
    return mean_sst_df
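
Collecting every daily DataFrame in a list keeps all rows in memory until the final `groupby`. An alternative is to keep only a running sum and count per key. A simplified standard-library sketch of that idea, with made-up records standing in for the daily rows (`MonthlyMean` is a hypothetical helper, not part of the notebook):

```python
import math
from collections import defaultdict


class MonthlyMean:
    """Accumulate a running sum and count per (year, month, lat, lon) key."""

    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def add(self, year, month, lat, lon, sst):
        if math.isnan(sst):
            return  # skip land / missing values
        key = (year, month, lat, lon)
        self.sums[key] += sst
        self.counts[key] += 1

    def means(self):
        return {key: self.sums[key] / self.counts[key] for key in self.sums}


acc = MonthlyMean()
# Three hypothetical daily values for the same grid cell, one of them missing
for sst in (27.0, 28.0, math.nan):
    acc.add(1993, "01", 10.125, 261.375, sst)

print(acc.means())  # {(1993, '01', 10.125, 261.375): 27.5}
```

Cells without any valid value do not appear in this result at all, whereas the pandas version keeps them as NaN rows until `dropna` is called.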

In [5]:
average_sst_per_month = process_sst_data()

1993 199301 ./data/199301
1993 199302 ./data/199302
1993 199303 ./data/199303
1993 199304 ./data/199304
1993 199305 ./data/199305
1993 199306 ./data/199306
1993 199307 ./data/199307
1993 199308 ./data/199308
1993 199309 ./data/199309
1993 199310 ./data/199310
1993 199311 ./data/199311
1993 199312 ./data/199312
1994 199401 ./data/199401
1994 199402 ./data/199402
1994 199403 ./data/199403
1994 199404 ./data/199404
1994 199405 ./data/199405
1994 199406 ./data/199406
1994 199407 ./data/199407
1994 199408 ./data/199408
1994 199409 ./data/199409
1994 199410 ./data/199410
1994 199411 ./data/199411
1994 199412 ./data/199412
1995 199501 ./data/199501
1995 199502 ./data/199502
1995 199503 ./data/199503
1995 199504 ./data/199504
1995 199505 ./data/199505
1995 199506 ./data/199506
1995 199507 ./data/199507
1995 199508 ./data/199508
1995 199509 ./data/199509
1995 199510 ./data/199510
1995 199511 ./data/199511
1995 199512 ./data/199512


In [48]:
average_sst_per_month.head()

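|   | year | month | lat    | lon     | sst       |
|---|------|-------|--------|---------|-----------|
| 0 | 1993 | 1     | 10.125 | 261.375 | 27.358709 |
| 1 | 1993 | 1     | 10.125 | 261.625 | 27.380644 |
| 2 | 1993 | 1     | 10.125 | 261.875 | 27.405806 |
| 3 | 1993 | 1     | 10.125 | 262.125 | 27.408064 |
| 4 | 1993 | 1     | 10.125 | 262.375 | 27.423548 |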


## Further Data Adjustments

Create a new column with the longitude values converted to the -180 to 180 system.

In [12]:
average_sst_per_month["lon_converted"] = average_sst_per_month["lon"].apply(lambda x: x - 360 if x > 180 else x)
average_sst_per_month.head()

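|   | year | month | lat    | lon     | sst       | lon_converted |
|---|------|-------|--------|---------|-----------|---------------|
| 0 | 1993 | 1     | 10.125 | 261.375 | 27.358709 | -98.625       |
| 1 | 1993 | 1     | 10.125 | 261.625 | 27.380644 | -98.375       |
| 2 | 1993 | 1     | 10.125 | 261.875 | 27.405806 | -98.125       |
| 3 | 1993 | 1     | 10.125 | 262.125 | 27.408064 | -97.875       |
| 4 | 1993 | 1     | 10.125 | 262.375 | 27.423548 | -97.625       |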


Remove every row (coordinate) where SST has no value (land mass).

In [13]:
total_values = len(average_sst_per_month)
export_data = average_sst_per_month.dropna(subset=["sst"])

print(f"Original length: {total_values}")
print(f"No SST (Land Mass): {total_values - len(export_data)}")
print(f"Total Export Data: {len(export_data)}")
print("-" * 80)

export_data.head()

Original length: 1765764
No SST (Land Mass): 307800
Total Export Data: 1457964
--------------------------------------------------------------------------------


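|   | year | month | lat    | lon     | sst       | lon_converted |
|---|------|-------|--------|---------|-----------|---------------|
| 0 | 1993 | 1     | 10.125 | 261.375 | 27.358709 | -98.625       |
| 1 | 1993 | 1     | 10.125 | 261.625 | 27.380644 | -98.375       |
| 2 | 1993 | 1     | 10.125 | 261.875 | 27.405806 | -98.125       |
| 3 | 1993 | 1     | 10.125 | 262.125 | 27.408064 | -97.875       |
| 4 | 1993 | 1     | 10.125 | 262.375 | 27.423548 | -97.625       |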
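
These row counts can be cross-checked against the grid: each of the 36 months (1993 to 1995) contributes one row per cell of the 143 x 343 selection. A short arithmetic check using the numbers printed above, assuming the same cells are without SST in every month:

```python
cells_per_month = 143 * 343          # lat x lon cells in the bounding box
months = 3 * 12                      # 1993 to 1995

total_rows = cells_per_month * months
land_rows = 307800                   # rows without SST, from the output above
land_cells = land_rows // months     # cells without SST in the bounding box

print(total_rows)                    # 1765764
print(land_cells)                    # 8550
print(total_rows - land_rows)        # 1457964
```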


Keep only one year (here 1993).

In [16]:
export_1993 = export_data[export_data["year"] == 1993]
export_1993 = export_1993.copy() # make a full copy!
export_1993

|        | year | month | lat    | lon     | sst       | lon_converted |
|--------|------|-------|--------|---------|-----------|---------------|
| 0      | 1993 | 01    | 10.125 | 261.375 | 27.358709 | -98.625       |
| 1      | 1993 | 01    | 10.125 | 261.625 | 27.380644 | -98.375       |
| 2      | 1993 | 01    | 10.125 | 261.875 | 27.405806 | -98.125       |
| 3      | 1993 | 01    | 10.125 | 262.125 | 27.408064 | -97.875       |
| 4      | 1993 | 01    | 10.125 | 262.375 | 27.423548 | -97.625       |
| ...    | ...  | ...   | ...    | ...     | ...       | ...           |
| 588583 | 1993 | 12    | 45.625 | 345.875 | 12.557096 | -14.125       |
| 588584 | 1993 | 12    | 45.625 | 346.125 | 12.503225 | -13.875       |
| 588585 | 1993 | 12    | 45.625 | 346.375 | 12.458064 | -13.625       |
| 588586 | 1993 | 12    | 45.625 | 346.625 | 12.431613 | -13.375       |


**Bin** the coordinates (set a resolution for the map)

In [17]:
bin_size = 1.0  # or 0.5
export_1993["lat_bin"] = (export_1993["lat"] // bin_size) * bin_size
export_1993["lon_bin"] = (export_1993["lon_converted"] // bin_size) * bin_size

export_1993.head()

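|   | year | month | lat    | lon     | sst       | lon_converted | lat_bin | lon_bin |
|---|------|-------|--------|---------|-----------|---------------|---------|---------|
| 0 | 1993 | 1     | 10.125 | 261.375 | 27.358709 | -98.625       | 10.0    | -99.0   |
| 1 | 1993 | 1     | 10.125 | 261.625 | 27.380644 | -98.375       | 10.0    | -99.0   |
| 2 | 1993 | 1     | 10.125 | 261.875 | 27.405806 | -98.125       | 10.0    | -99.0   |
| 3 | 1993 | 1     | 10.125 | 262.125 | 27.408064 | -97.875       | 10.0    | -98.0   |
| 4 | 1993 | 1     | 10.125 | 262.375 | 27.423548 | -97.625       | 10.0    | -98.0   |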
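
The floor division `//` rounds toward negative infinity, so every bin is labeled by its lower (southern or western) edge, also for negative longitudes. A small sketch of that behavior. `bin_lower_edge` and `bin_center` are illustrative helpers, not part of the notebook:

```python
def bin_lower_edge(value, bin_size=1.0):
    """Lower edge of the bin containing `value` (same formula as in the cell above)."""
    return (value // bin_size) * bin_size


def bin_center(value, bin_size=1.0):
    """Center of the bin containing `value`, often preferable for plotting."""
    return bin_lower_edge(value, bin_size) + bin_size / 2


print(bin_lower_edge(-98.625))   # -99.0
print(bin_lower_edge(-97.875))   # -98.0
print(bin_lower_edge(10.125))    # 10.0
print(bin_center(-98.625))       # -98.5
```

If the map should place each value in the middle of its cell, the bin center is the more natural label than the lower edge.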


In [20]:
binned_1993 = export_1993.groupby(["year", "month", "lat_bin", "lon_bin"]).agg({"sst": "mean"}).reset_index()
print(f"Binned length: {len(binned_1993)}")
print("-" * 80)

binned_1993.head()

Binned length: 31956
--------------------------------------------------------------------------------


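|   | year | month | lat_bin | lon_bin | sst       |
|---|------|-------|---------|---------|-----------|
| 0 | 1993 | 1     | 10.0    | -99.0   | 27.372742 |
| 1 | 1993 | 1     | 10.0    | -98.0   | 27.414032 |
| 2 | 1993 | 1     | 10.0    | -97.0   | 27.713728 |
| 3 | 1993 | 1     | 10.0    | -96.0   | 28.06641  |
| 4 | 1993 | 1     | 10.0    | -95.0   | 28.274092 |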


## Export

In [21]:
output_dir = './export/'
os.makedirs(output_dir, exist_ok=True)  # make sure the export folder exists

# export_data.to_csv(f'{output_dir}sst-{START_YEAR}-{END_YEAR}.csv', index=False)
# export the binned 1993 data (binned_1993, not the unbinned export_1993)
binned_1993.to_csv(f'{output_dir}sst-binned-1993.csv', index=False)
# export_data.to_json(f'{output_dir}sst-{START_YEAR}-{END_YEAR}.json')