&nbsp;
# 01 - Retrieving raw data

---

&nbsp;

How should we select which data feed is relevant? That was the first question we asked ourselves.

As the specifications state, our final goal is to "understand and simulate the evolution of the SST parameter in time, study the stability of the model and evaluate its capacity to reproduce or predict observed fluctuations". The success of this mission therefore depends first and foremost on the quality of the data on which we base our work. Following this reasoning, we decided to download and compare two similar products from two providers: NOAA and Copernicus Marine.

&nbsp;

> ### How did we select the following products?

First, we focused on Level 4 (L4) products. The term “Lx” refers to the processing level of a dataset. L4 corresponds to a highly processed data product, in which multiple observation sources are combined and optimal interpolation techniques are applied to fill data gaps (hence the "OI" marker in the NOAA product name). Additional corrections and quality controls may also be included.

As a result, L4 products provide spatially and temporally complete gridded fields, with no missing values over the ocean and reduced inconsistencies. Since our objective is to model, predict, and quantitatively compare ocean surface temperature datasets, we consider L4 products to be the most appropriate choice for this study.

Finally, we focus mainly on the SST (Sea Surface Temperature) parameter for the moment, so we select products marked SST.
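The completeness expected from an L4 product can be checked once a file is loaded. The sketch below is only an illustration on synthetic data: the helper `completeness_report` is a hypothetical name, and it assumes that a cell which is NaN at every time step is land, while a cell that is NaN only part of the time is a real gap.

```python
import numpy as np

def completeness_report(times, field):
    """Summarise gaps in a (time, lat, lon) field.

    times: 1-D array of datetime64 values, expected to be daily.
    field: float array of shape (time, lat, lon), NaN where data is missing.
    """
    # A daily product should advance by exactly one day between time steps.
    steps = np.diff(times)
    irregular_steps = int((steps != np.timedelta64(1, "D")).sum())

    nan_mask = np.isnan(field)
    # Cells that are NaN at every time step are treated as land (assumption).
    always_nan = nan_mask.all(axis=0)
    # Cells that are NaN only some of the time are genuine data gaps.
    gappy = nan_mask.any(axis=0) & ~always_nan

    return {
        "irregular_time_steps": irregular_steps,
        "land_cells": int(always_nan.sum()),
        "gappy_cells": int(gappy.sum()),
    }

# Synthetic example: 10 days on a 3 x 4 grid with one permanently masked cell.
times = np.arange("2010-01-01", "2010-01-11", dtype="datetime64[D]")
field = np.full((times.size, 3, 4), 12.0)
field[:, 0, 0] = np.nan

report = completeness_report(times, field)
print(report)
```

On real data, the same function could be called with `ds["time"].values` and the values of the SST variable (`sst` for NOAA, `analysed_sst` for Copernicus). A gap-free L4 product should report zero irregular time steps and zero gappy cells.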

In [1]:
import sys
print(sys.executable)
import xarray as xr
import netCDF4
print(xr.__version__)


C:\Users\gaoks\Isen\ProjetM1\m1Project_SciML\venv\Scripts\python.exe
2025.12.0


In [None]:
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt

# test loading a part of COPERNICUS dataset using xarray


ds = xr.open_dataset("../../data/raw/unmerged/C3S-GLO-SST-L4-REP-OBS-SST_1767389520200_part1.nc")

ModuleNotFoundError: No module named 'xarray'

We manually downloaded the NetCDF files from the official NOAA and Copernicus Marine websites.

For NOAA, we had to download the global dataset year by year. We therefore need to merge these 10 files and keep only the data of interest (the English Channel, "La Manche": latitude between 48 and 51, longitude between -2 and 5).
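One point to watch when cutting out this region is the longitude convention. The sketch below uses a synthetic grid and assumes, without having checked the NOAA files, that longitudes are stored from 0 to 360. In that case a naive `slice(-2, 5)` silently drops everything west of Greenwich. The helper names `to_signed_longitude` and `select_channel` are hypothetical.

```python
import numpy as np
import xarray as xr

# Synthetic 1-degree global grid with longitudes on 0..360 (an assumption).
lat = np.arange(-89.5, 90.0, 1.0)
lon = np.arange(0.5, 360.0, 1.0)
sst = xr.DataArray(
    np.zeros((lat.size, lon.size)),
    coords={"lat": lat, "lon": lon},
    dims=("lat", "lon"),
    name="sst",
)

def to_signed_longitude(da):
    """Re-express longitudes on -180..180 and keep them sorted."""
    signed = ((da["lon"] + 180) % 360) - 180
    return da.assign_coords(lon=signed).sortby("lon")

def select_channel(da):
    """Label-based selection of latitude 48..51 and longitude -2..5."""
    return da.sel(lat=slice(48, 51), lon=slice(-2, 5))

# Naive selection on the 0..360 axis misses the part west of Greenwich.
naive = select_channel(sst)
# Converting first recovers the full longitude range.
channel = select_channel(to_signed_longitude(sst))

print(naive.sizes["lon"], channel.sizes["lon"])
```

If the files already store longitudes from -180 to 180, the conversion is a no-op apart from the sort, so applying it first is a cheap safeguard.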

For Copernicus Marine, the website lets us download only the data we need by applying spatial and temporal constraints, so we only need to merge 2 NetCDF files (downloads from the website are size-limited, so we had to download the data in two parts: 2010-2015 and 2015-2020).
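Because the two download periods meet in 2015, the parts may share some dates. Whether they actually do was not verified here, so the following is only a sketch on synthetic data: it concatenates two parts along `time` and keeps the first occurrence of any repeated timestamp. The helper `make_part` and the dates are made up for the illustration.

```python
import numpy as np
import xarray as xr

def make_part(start, stop):
    """Build a small synthetic dataset covering [start, stop) at daily steps."""
    time = np.arange(start, stop, dtype="datetime64[D]")
    data = np.zeros((time.size, 2, 2))
    return xr.Dataset(
        {"analysed_sst": (("time", "lat", "lon"), data)},
        coords={"time": time, "lat": [48.5, 49.5], "lon": [-1.5, -0.5]},
    )

# Two parts that share one day (2016-01-01) at their boundary.
part1 = make_part("2015-12-25", "2016-01-02")
part2 = make_part("2016-01-01", "2016-01-06")

# Concatenate along time, then drop repeated timestamps (first one wins).
merged = xr.concat([part1, part2], dim="time")
duplicated = merged.get_index("time").duplicated()
merged = merged.isel(time=~duplicated)

print(merged.sizes["time"])
```

Checking `merged.get_index("time").is_unique` before writing the merged file is a simple way to confirm that no day is counted twice.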

In [None]:
# open_mfdataset relies on Dask, so Dask must be installed or the merge below will fail

# Build the lists of files we want to merge

filesNOAA = sorted([
    "../../data/raw/unmerged/sst.day.mean.2010_NOAA.nc",
    "../../data/raw/unmerged/sst.day.mean.2011_NOAA.nc",
    "../../data/raw/unmerged/sst.day.mean.2012_NOAA.nc",
    "../../data/raw/unmerged/sst.day.mean.2013_NOAA.nc",
    "../../data/raw/unmerged/sst.day.mean.2014_NOAA.nc",
    "../../data/raw/unmerged/sst.day.mean.2015_NOAA.nc",
    "../../data/raw/unmerged/sst.day.mean.2016_NOAA.nc",
    "../../data/raw/unmerged/sst.day.mean.2017_NOAA.nc",
    "../../data/raw/unmerged/sst.day.mean.2018_NOAA.nc",
    "../../data/raw/unmerged/sst.day.mean.2019_NOAA.nc",
])

filesCOPERNICUS = sorted([
    "../../data/raw/unmerged/C3S-GLO-SST-L4-REP-OBS-SST_1767389520200_part1.nc",
    "../../data/raw/unmerged/C3S-GLO-SST-L4-REP-OBS-SST_1767389601469_part2.nc"
])

# Use xarray to lazily open and merge all the .nc files

dsNOAA = xr.open_mfdataset(
    filesNOAA, # specify the list of files we want to merge
    chunks={"time": 365} # specify chunks size to optimize time and space allocation
)

dsCOPERNICUS = xr.open_mfdataset(
    filesCOPERNICUS, # specify the list of files we want to merge
    chunks={"time":365} # specify chunks size to optimize time and space allocation
)

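# Select the region of interest: the English Channel (La Manche),
# latitude 48 to 51, longitude -2 to 5.
# Note: a negative longitude bound only matches if the file stores
# longitudes on -180..180 rather than 0..360 (not checked here).
dsNOAA = dsNOAA.sel(lat=slice(48, 51), lon=slice(-2, 5))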

# Write each merged dataset to a single compressed NetCDF file

# NOAA: the SST variable is named "sst"
dsNOAA.to_netcdf("../../data/raw/merged/sstNOAA20102019.nc", encoding={"sst": {"zlib": True, "complevel": 4}})
# Copernicus: the SST variable is named "analysed_sst"
dsCOPERNICUS.to_netcdf("../../data/raw/merged/sstCOPERNICUS20102019.nc", encoding={"analysed_sst": {"zlib": True, "complevel": 4}})

PermissionError: [Errno 13] Permission denied: 'c:\\Users\\gaoks\\OneDrive\\Desktop\\Isen\\FISE M1\\PROJET M1\\m1Project_SciML\\data\\raw\\merged\\sstNOAA20102019.nc'