# Update - Download New Daily OISSTv2 Data

This is a foundational step towards continuous integration.

In this notebook we use BeautifulSoup to scrape the directory listing for the most recently available OISST data and join it with the rest of the data for the current year. This replaces preliminary data as finalized data becomes available while keeping our data current: preliminary files are re-downloaded and replaced with finalized files, and the most recent OISST data (including any remaining preliminary files) is exported for use in the remaining workflows. This step is the primer for any subsequent `Update` workflow steps.

In [1]:
# Libraries
from bs4 import BeautifulSoup
import requests
import os
import xarray as xr

# Set the workspace - local/ docker
workspace = "local"

# Root paths
root_locations = {"local"  : "/Users/akemberling/Box/",
                  "docker" : "/home/jovyan/"}


# Set root
box_root = root_locations[workspace]
print(f"Working via {workspace} directory at: {box_root}")

## Set Destinations for Downloads and Updated Files

In this section we specify which year and month we are interested in, and where the daily netcdf files should be downloaded to. I also set up a dictionary for quick access to other resources.

In [2]:
# Update year to search for among links
update_yr = 2021
update_month = 12

# Out Destination - Cache globals
_cache_root = f"{box_root}RES_Data/OISST/oisst_mainstays/"

# Cache Subdirectory Locations
cache_locs = {
  "cache_root"        : f"{_cache_root}",
  "annual_obs"        : f"{_cache_root}annual_observations/",
  "daily_file_caches" : f"{_cache_root}update_caches/{update_month}/",
}

# Set the output location for where things should save to:
cache_loc = cache_locs["daily_file_caches"] # Where daily caches should go
annual_loc = cache_locs["annual_obs"]       # Where the complete year files live

# Print paths to validate
print("Output folder for daily caches:      " + cache_loc)
print("Access folder for yearly aggregates: " + annual_loc)

Output folder for daily caches:      RES_Data/OISST/oisst_mainstays/update_caches/12/
Access folder for yearly aggregates: RES_Data/OISST/oisst_mainstays/annual_observations/


### Identify the Root url for Web Scraping

This part is where we identify the web pages where the files and their directories can be accessed. `update_yr` and `update_month` are used to navigate to the right folders for updating the correct files.

In [3]:
# Root url where the yearly files are stored

# This url does not work as it is the return for a data query. When using bs4 it scrapes the site prior to the search
# https://psl.noaa.gov/data/gridded/data.noaa.oisst.v2.html


# This URL is from Eric Bridgers repo oisst-clim-daily
# Use the url from Eric's Repo, thank you Eric

# This URL will grab the desired update month - need to add functionality for the transition to new months
# The month is zero-padded so single-digit months still match the YYYYMM folder names (e.g. 202101)
fetch_url = f"https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/{update_yr}{update_month:02d}/"
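
The comment above mentions that the transition to a new month is not handled yet. One possible approach is sketched below, on the assumption that finalized files for the end of one month may only appear after the next month has started, so both the previous and the current month folders are worth checking. The helper names `month_url` and `months_to_check` are hypothetical and not part of this notebook.

```python
from datetime import date

# Base of the directory listing used in this notebook
BASE_URL = ("https://www.ncei.noaa.gov/data/"
            "sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/")

def month_url(year, month):
    """Directory listing URL for one year-month folder, zero-padding the month."""
    return f"{BASE_URL}{year}{month:02d}/"

def months_to_check(today):
    """Return (year, month) pairs for the month before `today` and the month of `today`."""
    if today.month == 1:
        previous = (today.year - 1, 12)   # January rolls back to December of the prior year
    else:
        previous = (today.year, today.month - 1)
    return [previous, (today.year, today.month)]

# Example: early January, so the December folder of the previous year is included
urls = [month_url(y, m) for y, m in months_to_check(date(2021, 1, 5))]
```

Looping the download step over `urls` instead of a single `fetch_url` would be one way to keep the cache current across a month boundary.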

### Open the http directory listing

In [4]:
# open the http directory listing.
req = requests.get(fetch_url)

# Print error message if link does not work
if req.status_code != requests.codes.ok:
    #logger.info("{} {} {}".format(req.status_code, req.reason, req.url))
    print(f"Request Error, Reason: {req.reason}")

### Parse URL contents with BS

In [5]:
# Parse the url with BS and its html parser.
soup = BeautifulSoup(req.text, 'html.parser')

Once parsed, the next step is to look for the correct html elements to access the files we need.

In [6]:
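# Find all anchors in the html text and report the ones that link to netcdf files
anchors = soup.find_all("a")
for link in anchors:
    #print(link)
    href = link.get("href")
    # Anchors without an href return None, so check before calling endswith
    if href and href.endswith(".nc"):
        print(f"Found File: {href}")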

Found File: oisst-avhrr-v02r01.20201201.nc
Found File: oisst-avhrr-v02r01.20201202.nc
Found File: oisst-avhrr-v02r01.20201203.nc
Found File: oisst-avhrr-v02r01.20201204.nc
Found File: oisst-avhrr-v02r01.20201205.nc
Found File: oisst-avhrr-v02r01.20201206.nc
Found File: oisst-avhrr-v02r01.20201207.nc
Found File: oisst-avhrr-v02r01.20201208.nc
Found File: oisst-avhrr-v02r01.20201209.nc
Found File: oisst-avhrr-v02r01.20201210.nc
Found File: oisst-avhrr-v02r01.20201211.nc
Found File: oisst-avhrr-v02r01.20201212.nc
Found File: oisst-avhrr-v02r01.20201213.nc
Found File: oisst-avhrr-v02r01.20201214.nc
Found File: oisst-avhrr-v02r01.20201215.nc
Found File: oisst-avhrr-v02r01.20201215_preliminary.nc
Found File: oisst-avhrr-v02r01.20201216.nc
Found File: oisst-avhrr-v02r01.20201216_preliminary.nc
Found File: oisst-avhrr-v02r01.20201217.nc
Found File: oisst-avhrr-v02r01.20201217_preliminary.nc
Found File: oisst-avhrr-v02r01.20201218.nc
Found File: oisst-avhrr-v02r01.20201219.nc
Found File: oisst-

## Download Netcdf Files to Cache Folder

Once we have verified which daily files are available, the next step is to download each of them to the cache in their current daily file format.

In [7]:
# list to store download paths
new_downloads = []

# Make sure the cache folder exists before writing to it
os.makedirs(cache_loc, exist_ok = True)

# Size of each chunk written to disk, in bytes
chunk_size = 17000000

# Find all the links in fetch_url which end with ".nc"
for link in anchors:

    # Skip anchors with no href, and anything that is not a netcdf file
    href = link.get('href')
    if not href or not href.endswith('.nc'):
        continue
    #print(f"Download link match: {href}")

    # Use requests to download
    # raise_for_status() raises an HTTPError on a failed request and returns None otherwise
    req = requests.get(fetch_url + href)
    req.raise_for_status()

    # Write the response to the cache in chunks, closing the file when done
    dl_path = f"{cache_loc}{href}"
    with open(dl_path, 'wb') as file:
        for chunk in req.iter_content(chunk_size):
            file.write(chunk)

    # Add to log once the file has been written
    new_downloads.append(dl_path)
    print(f"Cached Daily NETCDF File: {href}")
        

Cached Daily NETCDF File: oisst-avhrr-v02r01.20201201.nc
Cached Daily NETCDF File: oisst-avhrr-v02r01.20201202.nc
Cached Daily NETCDF File: oisst-avhrr-v02r01.20201203.nc
Cached Daily NETCDF File: oisst-avhrr-v02r01.20201204.nc
Cached Daily NETCDF File: oisst-avhrr-v02r01.20201205.nc
Cached Daily NETCDF File: oisst-avhrr-v02r01.20201206.nc
Cached Daily NETCDF File: oisst-avhrr-v02r01.20201207.nc
Cached Daily NETCDF File: oisst-avhrr-v02r01.20201208.nc
Cached Daily NETCDF File: oisst-avhrr-v02r01.20201209.nc
Cached Daily NETCDF File: oisst-avhrr-v02r01.20201210.nc
Cached Daily NETCDF File: oisst-avhrr-v02r01.20201211.nc
Cached Daily NETCDF File: oisst-avhrr-v02r01.20201212.nc
Cached Daily NETCDF File: oisst-avhrr-v02r01.20201213.nc
Cached Daily NETCDF File: oisst-avhrr-v02r01.20201214.nc
Cached Daily NETCDF File: oisst-avhrr-v02r01.20201215.nc
Cached Daily NETCDF File: oisst-avhrr-v02r01.20201215_preliminary.nc
Cached Daily NETCDF File: oisst-avhrr-v02r01.20201216.nc
Cached Daily NETCDF
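
Because later cells scan the cache folder with `os.listdir`, an interrupted download could leave a partial `.nc` file that gets picked up as if it were complete. The sketch below shows one way to guard against that: write to a temporary file and rename it only once every chunk has been written. `write_atomically` is a hypothetical helper, not something this notebook defines, and the demo writes a few dummy bytes rather than real OISST data.

```python
import os
import tempfile

def write_atomically(chunks, dest_path):
    """Write an iterable of byte chunks to dest_path via a temporary file.

    The temporary file ends in ".part", so a scan for files ending in ".nc"
    never sees a half-written download.
    """
    dest_dir = os.path.dirname(dest_path) or "."
    os.makedirs(dest_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        # Rename only after all chunks were written successfully
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Clean up the partial file, then re-raise the original error
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest_path

# Demo with dummy bytes in a throwaway folder
demo_dir = tempfile.mkdtemp()
demo_path = write_atomically([b"abc", b"def"], os.path.join(demo_dir, "demo.nc"))
```

In the download loop this could be called as `write_atomically(req.iter_content(chunk_size), dl_path)`.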

## Review the files we just accessed and cached:

Now that the daily files have been accessed and saved individually, the next step is to stack them all up and align them with the annual netcdf file. Things that need to be done before that happens are:
 * The removal of duplicated dates
 * The overwriting of preliminary data if available
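
Both checks can be expressed with a regular expression and a set intersection. The sketch below assumes the file naming pattern seen in the directory listing (`oisst-avhrr-v02r01.YYYYMMDD.nc`, with an optional `_preliminary` suffix); the function names and the example file list are made up for illustration.

```python
import re

# Pattern assumed from the file names in the directory listing
NAME_RE = re.compile(r"oisst-avhrr-v02r01\.(\d{8})(_preliminary)?\.nc$")

def split_by_status(filenames):
    """Return two dicts mapping date id -> file name: finalized and preliminary."""
    finals, prelims = {}, {}
    for name in filenames:
        match = NAME_RE.search(name)
        if match is None:
            continue  # not a daily OISST file
        date_id, prelim_suffix = match.group(1), match.group(2)
        if prelim_suffix:
            prelims[date_id] = name
        else:
            finals[date_id] = name
    return finals, prelims

def superseded_prelims(filenames):
    """Preliminary files whose date also has a finalized file."""
    finals, prelims = split_by_status(filenames)
    return sorted(prelims[d] for d in prelims.keys() & finals.keys())

# Illustrative file list, not real cache contents
example = [
    "oisst-avhrr-v02r01.20201214.nc",
    "oisst-avhrr-v02r01.20201215.nc",
    "oisst-avhrr-v02r01.20201215_preliminary.nc",
    "oisst-avhrr-v02r01.20201218_preliminary.nc",
]
to_remove = superseded_prelims(example)
```

Here only the preliminary file for 20201215 is flagged, since 20201218 has no finalized counterpart in the example list.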

In [16]:
# # 1. Find the dates with preliminary data, flag them
# #

# prelim_dates = []
# for link in new_downloads:
#     if link.endswith("_preliminary.nc"): 
#         start_idx = link.find(f"{update_yr}{update_month}")
#         end_idx = start_idx + 8
#         step = int(1)
#         date_id = link[start_idx: end_idx: step]
#         prelim_dates.append(date_id)
        
# print("Preliminary data found for: ")
# prelim_dates

In [8]:
# 1. Find the dates with preliminary data, flag them
#
# Look for preliminary files in the cache folder rather than the download list,
# this way no files slip through the cracks

# Year and zero-padded month as they appear in the file names, e.g. "202012"
yr_month = f"{update_yr}{update_month:02d}"

prelim_dates = []
for cache_file in os.listdir(cache_loc):
    if cache_file.endswith("_preliminary.nc"):
        # The date id is the 8 characters (YYYYMMDD) starting at the year-month
        start_idx = cache_file.find(yr_month)
        if start_idx == -1:
            continue  # file is from a different year/month
        date_id = cache_file[start_idx : start_idx + 8]
        prelim_dates.append(date_id)

print("Preliminary data found for: ")
prelim_dates

Preliminary data found for: 


['20201215',
 '20201216',
 '20201217',
 '20201225',
 '20201226',
 '20201227',
 '20201228',
 '20201229',
 '20201230',
 '20201231']

Once all the preliminary data files have been flagged, the next step is to check if those dates now have files without `_preliminary` on the end, indicating that the data has been finalized.

In [13]:
# 2. Check if those dates only have prelim data, or if there is also finalized data:

# list of duplicated dates, dates where we can ignore the preliminary data
finalized_dates = []

# check all files in the cache folder
for link in os.listdir(cache_loc):
    
    # check each preliminary data date
    for prelim_date in prelim_dates:
        
        # A flagged date is added once per file that contains it, so a date that
        # ends up in the list twice has both a preliminary and a finalized file
        if prelim_date in link:
            finalized_dates.append(prelim_date)
            


# Program to check for repeated list contents
# This code is contributed  
# by Sandeep_anand 
def Repeat(x): 
    _size = len(x) 
    repeated = [] 
    for i in range(_size): 
        k = i + 1
        for j in range(k, _size): 
            if x[i] == x[j] and x[i] not in repeated: 
                repeated.append(x[i]) 
    return repeated 
  
    
    
# Report the dates with both preliminary and finalized data  
print("Prelim and Finalized Data Found for:")
print (Repeat(finalized_dates))     

Prelim and Finalized Data Found for:
['20201225', '20201226', '20201227', '20201228', '20201229', '20201230', '20201231']


## Remove preliminaries that have final data

If any preliminary files now have a finalized counterpart, the next step is to remove them so we don't get duplicate dates when we combine the individual netcdfs.

In [14]:
# Pull the dates for situations where there is both a preliminary and final file.
remove_prelim = Repeat(finalized_dates)

# Build out the preliminary names that these would be, drop them from cache.
for repeated_date in remove_prelim:
    
    # Build full file name
    file_name = f"{cache_loc}oisst-avhrr-v02r01.{repeated_date}_preliminary.nc"
    
    # Remove them from the folder and report it. Don't need them anymore.
    # Only print when a file was actually removed.
    if os.path.exists(file_name):
        os.remove(file_name)
        print(f"File Removed for Finalized Data: {file_name}")
    

File Removed for Finalized Data: RES_Data/OISST/oisst_mainstays/update_caches/12/oisst-avhrr-v02r01.20201225_preliminary.nc
File Removed for Finalized Data: RES_Data/OISST/oisst_mainstays/update_caches/12/oisst-avhrr-v02r01.20201226_preliminary.nc
File Removed for Finalized Data: RES_Data/OISST/oisst_mainstays/update_caches/12/oisst-avhrr-v02r01.20201227_preliminary.nc
File Removed for Finalized Data: RES_Data/OISST/oisst_mainstays/update_caches/12/oisst-avhrr-v02r01.20201228_preliminary.nc
File Removed for Finalized Data: RES_Data/OISST/oisst_mainstays/update_caches/12/oisst-avhrr-v02r01.20201229_preliminary.nc
File Removed for Finalized Data: RES_Data/OISST/oisst_mainstays/update_caches/12/oisst-avhrr-v02r01.20201230_preliminary.nc
File Removed for Finalized Data: RES_Data/OISST/oisst_mainstays/update_caches/12/oisst-avhrr-v02r01.20201231_preliminary.nc


## Building Annual File

Now that each daily file has been checked for updates and the duplicates have been removed, we can start gluing everything together again.

In [15]:
# First generate new list of filenames since removing the duplicates etc.
daily_files = []
for file in os.listdir(f"{cache_loc}"):
    if file.endswith(".nc"):
        daily_files.append(f"{cache_loc}{file}")
        print(f"Appended File: {file}")

        
# Use open_mfdataset to access the new downloads
oisst_update = xr.open_mfdataset(daily_files, combine = "by_coords")

Appended File: oisst-avhrr-v02r01.20201201.nc
Appended File: oisst-avhrr-v02r01.20201202.nc
Appended File: oisst-avhrr-v02r01.20201203.nc
Appended File: oisst-avhrr-v02r01.20201204.nc
Appended File: oisst-avhrr-v02r01.20201205.nc
Appended File: oisst-avhrr-v02r01.20201206.nc
Appended File: oisst-avhrr-v02r01.20201207.nc
Appended File: oisst-avhrr-v02r01.20201208.nc
Appended File: oisst-avhrr-v02r01.20201209.nc
Appended File: oisst-avhrr-v02r01.20201210.nc
Appended File: oisst-avhrr-v02r01.20201211.nc
Appended File: oisst-avhrr-v02r01.20201212.nc
Appended File: oisst-avhrr-v02r01.20201213.nc
Appended File: oisst-avhrr-v02r01.20201214.nc
Appended File: oisst-avhrr-v02r01.20201215.nc
Appended File: oisst-avhrr-v02r01.20201216.nc
Appended File: oisst-avhrr-v02r01.20201217.nc
Appended File: oisst-avhrr-v02r01.20201218.nc
Appended File: oisst-avhrr-v02r01.20201219.nc
Appended File: oisst-avhrr-v02r01.20201220.nc
Appended File: oisst-avhrr-v02r01.20201221.nc
Appended File: oisst-avhrr-v02r01.

In [24]:
# Check time index
#oisst_update.get_index("time")

# Get all dates where the time indexes are not (~) duplicated
oisst_noreps = oisst_update.sel(time = ~oisst_update.get_index("time").duplicated())

# drop variables not in the annual file
update_prepped = oisst_noreps.drop_vars(["anom", "err", "ice"])

# remove attributes, going to add back later
update_prepped.attrs = {}

# dropping zlev: the result has to be assigned back for the change to stick
# (assumes zlev is a length-1 dimension in the daily files)
update_prepped = update_prepped.squeeze("zlev", drop = True)
update_prepped


---

# Next Steps:

## Load and Append to Yearly File

This will change at the start of a new year, but currently we have an annual file with the same structure as all the others. We want to append all the new days to it, avoiding any overlap or repetition.

Once that is done we just need to format everything and it can be saved out as the annual netcdf with daily increments.

In [17]:
# Load the yearly file we're appending to
oisst = xr.open_dataset(f"{annual_loc}sst.day.mean.{update_yr}.v2.nc")
oisst

In [17]:
# Remove dates from annual file that overlap with updates.
# This will make it so the current month will overwrite as it gets finalized.

# Boolean flag for whether time falls outside the update_month set in beginning
def not_update_month(month):
    return (month != update_month)

# Subset time so there isn't overlap on update month
oisst_prepped = oisst.sel(time = not_update_month(oisst['time.month']))
oisst_prepped

KeyError: 'time'

In [17]:
# append/combine updates to previous months
# The list is passed positionally because the keyword name differs between xarray versions
oisst_updated = xr.combine_by_coords([oisst_prepped, update_prepped])
oisst_updated

## Build Netcdf Attributes

Things like the datetime origin, and the other attributes of the array should match the yearly netcdfs we have so that they can all append correctly without information loss.

In [18]:
# Take attribute information from oisst_prepped
oisst_updated.attrs = oisst_prepped.attrs
oisst_updated.attrs


#OR, make a manual addition of what they should be based on the other files, so we aren't transferring



{'Conventions': 'CF-1.5',
 'title': 'NOAA/NCEI 1/4 Degree Daily Optimum Interpolation Sea Surface Temperature (OISST) Analysis, Version 2.1',
 'institution': 'NOAA/National Centers for Environmental Information',
 'source': 'NOAA/NCEI https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/',
 'References': 'https://www.psl.noaa.gov/data/gridded/data.noaa.oisst.v2.highres.html',
 'dataset_title': 'NOAA Daily Optimum Interpolation Sea Surface Temperature',
 'version': 'Version 2.1',
 'comment': 'Reynolds, et al.(2007) Daily High-Resolution-Blended Analyses for Sea Surface Temperature (available at https://doi.org/10.1175/2007JCLI1824.1). Banzon, et al.(2016) A long-term record of blended satellite and in situ sea-surface temperature for climate monitoring, modeling and environmental studies (available at https://doi.org/10.5194/essd-8-165-2016). Huang et al. (2020) Improvements of the Daily Optimum Interpolation Sea Surface Temperature (DOISST) Ver

# Save out Annual File

Now that there is data up through the current date, and with all attributes correctly set, the annual file can be saved out.

In [22]:
# Build out destination folder:
out_folder = annual_loc
naming_structure = f"sst.day.mean.{update_yr}.v2.nc"
out_path = f"TESTING{out_folder}{naming_structure}"
out_path

NameError: name 'annual_loc' is not defined

In [20]:
# close oisst and oisst_prepped
oisst.close()
oisst_prepped.close()

In [21]:
# Export the finished file
#oisst_updated.to_netcdf(path = out_path)

ValueError: Variable 'sst' has conflicting _FillValue (nan) and missing_value (-9.969209968386869e+36). Cannot encode data.
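
The error above says that `sst` carries both a `_FillValue` and a `missing_value` and that the two disagree. One possible workaround, sketched here under the assumption that `missing_value` is not needed in the output file, is to drop it from the variable's attributes and encoding before calling `to_netcdf`. `drop_missing_value` is a hypothetical helper, and the demo dataset is a tiny stand-in built only to mirror the conflicting values from the error message.

```python
import numpy as np
import xarray as xr

def drop_missing_value(ds, var="sst"):
    """Return a shallow copy of ds with `missing_value` removed from one variable."""
    out = ds.copy()
    # The conflicting entry may live in attrs or encoding, so clear both
    out[var].attrs.pop("missing_value", None)
    out[var].encoding.pop("missing_value", None)
    return out

# Tiny stand-in dataset with the same kind of conflict as in the error message
demo = xr.Dataset({"sst": (("time", "lat"), np.zeros((1, 2), dtype="float32"))})
demo["sst"].encoding.update({"_FillValue": np.nan,
                             "missing_value": -9.969209968386869e36})

cleaned = drop_missing_value(demo)
```

With the real data this would look like `drop_missing_value(oisst_updated).to_netcdf(path = out_path)`. Setting `missing_value` equal to `_FillValue` instead of removing it is another option if downstream tools expect both.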