# Tutorial to Download NOAA Satellite Data Files

This code was written in December 2022 by Dr. Amy Huff, IMSG at NOAA/NESDIS/STAR (amy.huff@noaa.gov) and Dr. Rebekah Esmaili, STC at NOAA/JPSS (rebekah.esmaili@noaa.gov) using:
- Python 3.9.12
- S3Fs 2022.5.0
- Requests 2.27.1

This tutorial will demonstrate how to download satellite data files from the NOAA Open Data Dissemination (NODD) GOES-R & JPSS application programming interfaces (APIs) on Amazon Web Services (AWS) and the NOAA/NESDIS/STAR gridded aerosol data archive website.

The downloaded files will include:
- From the GOES-R NODD on AWS:
    - GOES-16 (GOES-East) ABI L2 Multichannel Cloud & Moisture Imagery Products (MCMIP) - Mesoscale sector (M) data file for Dec 2, 2022 at 20:30 UTC
- From the JPSS NODD on AWS:
    - NOAA-20 VIIRS L2 Active Fires (AF) I-band Environmental Data Record (EDR) data file for Oct 16, 2022 at 21:18 UTC
    - NOAA-20 NUCAPS Sounding Environmental Data Record (EDR) data file for Nov 29, 2022 at 19:07 UTC
- From the NOAA/NESDIS/STAR gridded aerosol data archive website:
    - SNPP VIIRS L3 Aerosol Optical Depth (AOD) global gridded data file for Sep 11, 2022

## Step 1: Import Python packages

We will use two Python packages (libraries) and two Python modules in this tutorial:
- The **S3Fs** library is used to set up a filesystem interface with the Amazon Simple Storage Service (S3)
- The **Requests** library is used to send HTTP requests
- The **datetime** module is used to manipulate dates and times
- The **pathlib** module is used to set filesystem paths for the user's operating system

In [None]:
# Standard library modules
import datetime
from pathlib import Path

# Third-party libraries
import requests
import s3fs

## Step 2: Set directory path where satellite data files will be saved

We set the directory path for the downloaded satellite data files using the **pathlib** module, which automatically uses the correct path format for the user's operating system. This helps avoid errors when more than one person uses the same code file, because Windows uses backslashes in directory paths, while macOS and Linux use forward slashes.

More information about the **pathlib** module: https://docs.python.org/3/library/pathlib.html#module-pathlib

To keep things simple for this training, we will put the satellite data files we download in the current working directory ("cwd"), i.e., the same Jupyter Notebook folder where this code file is located.

In [None]:
directory_path = Path.cwd()

## Step 3: Connect to AWS S3 (Simple Storage Service)

The NODD program is increasing access to NOAA satellite data, including data from the GOES-R geostationary satellites and JPSS polar-orbiting satellites. 

More information on the NODD program: https://www.noaa.gov/information-technology/open-data-dissemination

NOAA data are available through NODD via collaborations with AWS, Google Earth Engine, and Microsoft Azure. We will use the AWS platform in this tutorial because it is free to access and does not require any additional registration or a password.

Think of the NODD AWS S3 "buckets" as online data archives. You do **not** need an AWS cloud computing account to access NOAA satellite data!

The **S3Fs** package allows us to set up a filesystem ("fs") interface to AWS S3 "buckets", so we can view and download NOAA data files. We use an anonymous connection ("anon=True") because the NODD "buckets" are publicly available and read-only.

Documentation for the S3Fs package: https://s3fs.readthedocs.io/en/latest/

In [None]:
fs = s3fs.S3FileSystem(anon=True)

## Step 4a: Browse the GOES-R NODD on AWS

NOAA geostationary satellite data from the GOES-16, GOES-17, and GOES-18 satellites are available from the GOES-R NODD on AWS: https://registry.opendata.aws/noaa-goes/

There are separate "buckets" for each satellite, which can be viewed in a web browser:
- GOES-16: https://noaa-goes16.s3.amazonaws.com/index.html
- GOES-17: https://noaa-goes17.s3.amazonaws.com/index.html
- GOES-18: https://noaa-goes18.s3.amazonaws.com/index.html

Data on the GOES-R NODD are updated in near real-time, and there is a full archive of the publicly available data (generally provisional and full maturity data; for descriptions of satellite product maturity levels, see https://www.star.nesdis.noaa.gov/atmospheric-composition-training/satellite_data_maturity_levels.php). 

Data files in each of the satellite "buckets" are organized by product name. Let's access the GOES-16 "bucket" and list ("fs.ls") the available products, and then print the product names.

GOES-R product name descriptions: https://github.com/awslabs/open-data-docs/tree/main/docs/noaa/noaa-goes16#about-the-data

In [None]:
bucket = 'noaa-goes16'

products_path = bucket

products = fs.ls(products_path)

for product in products:
    print(product.split('/')[-1])

## Step 4b: GOES-16 ABI MCMIPM data (find Julian day)

We can see that the GOES-R product names are abbreviations that begin with the sensor (ABI, EXIS, GLM, MAG, SEIS, SUVI) and data processing level (L1b, L2). For more information on satellite data processing levels, see https://www.star.nesdis.noaa.gov/atmospheric-composition-training/satellite_data_processing_levels.php. 

ABI product names include an abbreviation of the data product (e.g., Rad, AOD, MCMIP) and the ABI scan sector: F (full disk), C (CONUS), or M (mesoscale). Not all ABI data products are generated for each scan sector. For more information on ABI scan sectors, see https://www.star.nesdis.noaa.gov/atmospheric-composition-training/satellite_data_abi_scan_sectors.php.

In this tutorial, we are going to download the GOES-16 ABI L2 Multichannel Cloud & Moisture Imagery Products Mesoscale 2 sector (ABI-L2-MCMIPM) data file for Dec 2, 2022 at 20:30 UTC, when strong winds associated with the passage of a cold front generated blowing dust in eastern CO & western KS. You will use the data in this file to create composite color imagery to visualize the dust storm, including dust RGB and (simulated) true color.

The L2 Cloud and Moisture Imagery Product (CMIP) files are easier to work with than the L1b radiance (Rad) files for making composite imagery because the CMIP files contain the ABI band values needed to calculate the composites: brightness temperature (BT) at the Top-Of-Atmosphere (TOA) in Kelvin for the ABI emissive bands (7-16) and the dimensionless reflectance (normalized by the solar zenith angle) for the ABI reflective bands (1-6). The multiband product file (MCMIP) includes all band values in one file at a consistent spatial resolution of 2 km.

ABI data files on the AWS NODD are organized in the same way for each of the satellite "buckets":
- Product
- Year
- Julian day
- Hour
- Filename

To browse the "ABI-L2-MCMIPM" data for Dec 2, 2022, we need to know the 3-digit Julian day, i.e., the day of the year counted from 001 on January 1. ABI data are organized by the Julian day of the year instead of the month and day of the month.

The following code uses the **datetime** module to return the 3-digit Julian day for any year, month, and day we enter as **integers**.

In [None]:
year = 2022
month = 12
day = 2

julian_day = datetime.datetime(year, month, day).strftime('%j')
print(julian_day)
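As a small sketch of the conversion above, the two helper functions below (the names "day_of_year" and "date_from_day_of_year" are our own, not part of any library) wrap the **datetime** calls so we can go from a calendar date to the 3-digit day of year and back again. The reverse direction is handy for checking which calendar date a directory on the NODD corresponds to.

```python
import datetime


def day_of_year(year, month, day):
    """Return the zero-padded, 3-digit day of year for a calendar date."""
    return datetime.date(year, month, day).strftime('%j')


def date_from_day_of_year(year, day_number):
    """Return the calendar date for a year and a day-of-year number."""
    text = f'{int(year):04d}{int(day_number):03d}'
    return datetime.datetime.strptime(text, '%Y%j').date()


print(day_of_year(2022, 12, 2))          # 336
print(day_of_year(2022, 1, 5))           # 005
print(date_from_day_of_year(2022, 336))  # 2022-12-02
```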

## Step 4c: Browse GOES-16 ABI MCMIPM data for Dec 2, 2022 at 20:00-20:59 UTC

Now that we know the Julian day for Dec 2, 2022, we can set the "data_path" for GOES-16 ABI MCMIPM files for "hour" 20, which corresponds to observations between 20:00-20:59 UTC. 

The Python string method "str.zfill(width)" ensures the "julian" string in the "data_path" is 3 digits and the "hour" string in the "data_path" is 2 digits; "str.zfill(width)" returns a copy of the string left-filled with ASCII '0' digits to make a string of length "width". This ensures that the "data_path" syntax is correct for "julian" variable integers < 100 and "hour" variable integers < 10.

Let's list ("fs.ls") the available MCMIPM files for Dec 2, 2022 at 20:00-20:59 UTC, and then print the total number of files and print the first 10 and last 10 file names.

In [None]:
bucket = 'noaa-goes16'
product = 'ABI-L2-MCMIPM'
year = 2022
julian = 336
hour = 20

data_path = bucket + '/' + product + '/'  + str(year) + '/' + str(julian).zfill(3) + '/' + str(hour).zfill(2)

files = fs.ls(data_path)

print('Total number of files:', len(files), '\n')

for file in files[:10]:
    print(file.split('/')[-1])
for file in files[-10:]:
    print(file.split('/')[-1])
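The path construction above can also be written with an f-string, which pads the numbers in the same way as "str.zfill(width)". The function below is only a sketch; its name ("abi_data_path") and argument order are our own choices, and it assumes the Product/Year/Julian day/Hour layout described in Step 4b.

```python
def abi_data_path(bucket, product, year, day_of_year, hour):
    """Build the S3 path for one hour of ABI files.

    The day of year is padded to 3 digits and the hour to 2 digits,
    matching the directory layout of the GOES-R "buckets".
    """
    return f'{bucket}/{product}/{int(year):04d}/{int(day_of_year):03d}/{int(hour):02d}'


print(abi_data_path('noaa-goes16', 'ABI-L2-MCMIPM', 2022, 336, 20))
print(abi_data_path('noaa-goes16', 'ABI-L2-MCMIPM', 2022, 5, 3))
```

The returned string can be passed to "fs.ls" exactly like the "data_path" variable above.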

## Step 4d: Find the GOES-16 ABI MCMIPM2 data file for Dec 2, 2022 at 20:30 UTC

We can see that there are 120 MCMIPM files each hour: one observation every minute for each of the two Mesoscale sectors, M1 and M2. We want to download the M2 file for 20:30 UTC.

ABI file names contain a lot of information about the data in the file. For an explanation of how to decode ABI L2 file names, see https://www.star.nesdis.noaa.gov/atmospheric-composition-training/satellite_data_decoding_data_file_names.php#abi_level2

Using slicing and list comprehension, we can identify the one file we want by using the information in the file name to match the starting ("s") observation time of "2030" and the product name "MCMIPM2". Then we print the file name to confirm it's the one we want and check the approximate file size ("fs.size") before we download the file.

In [None]:
observation_time = '2030'
product_name = 'MCMIPM2'

matches = [file for file in files if (file.split('/')[-1].split('_')[3][8:12] == observation_time and file.split('/')[-1].split('-')[2] == product_name)]

for match in matches:
    print(match.split('/')[-1])
    print('Approximate file size (MB):', round((fs.size(match)/1.0E6), 2))
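To make the slicing in the list comprehension easier to follow, here is a sketch that pulls the same two pieces of information out of an ABI L2 file name with a small function. The function name ("parse_abi_file_name") is our own, and the file name in the example is made up to follow the naming convention (the seconds in its time fields are placeholders, not the values of the real file).

```python
import datetime


def parse_abi_file_name(file_name):
    """Return (product name, observation start time) from an ABI L2 file name."""
    fields = file_name.split('/')[-1].split('_')
    product_field = fields[1]  # e.g., 'ABI-L2-MCMIPM2-M6'
    start_field = fields[3]    # 's' + YYYY + JJJ + HHMMSS + tenths of a second
    product_name = product_field.split('-')[2]
    # Skip the leading 's' and drop the final digit (tenths of a second)
    start_time = datetime.datetime.strptime(start_field[1:14], '%Y%j%H%M%S')
    return product_name, start_time


# Illustrative file name only: the time fields are placeholders
example = ('noaa-goes16/ABI-L2-MCMIPM/2022/336/20/'
           'OR_ABI-L2-MCMIPM2-M6_G16_s20223362030000_e20223362030590_c20223362031000.nc')

product_name, start_time = parse_abi_file_name(example)
print(product_name)
print(start_time.strftime('%H%M'))
```

Comparing "start_time.strftime('%H%M')" to "observation_time" gives the same match as the slice "[8:12]" used above.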

## Step 4e: Download the GOES-16 ABI MCMIPM2 data file for Dec 2, 2022 at 20:30 UTC

Now that we have identified the file we want to download from the GOES-16 "bucket" and checked the file size, we can proceed to download ("fs.get") the file to the directory we set ("directory_path", the cwd) on our local computer. 

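We set the full path for the downloaded data file using **pathlib** syntax, which uses the "/" operator to join the "directory_path" and the file name for the downloaded file. The "/" operator works the same way on every operating system; **pathlib** inserts the correct path separator for us.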

In [None]:
for match in matches:
    fs.get(match, str(directory_path / match.split('/')[-1]))
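If we run the download cell more than once, the file is downloaded again each time. One possible way to avoid that, sketched below, is to check whether the file already exists locally before calling "fs.get". The helper names ("local_path_for", "needs_download") and the S3 key in the demonstration are our own inventions; the demonstration uses a temporary directory, so it runs without any network access.

```python
import tempfile
from pathlib import Path


def local_path_for(remote_key, directory):
    """Join the download directory and the file name taken from an S3 key."""
    return Path(directory) / remote_key.split('/')[-1]


def needs_download(remote_key, directory):
    """Return True if the file is not yet present in the local directory."""
    return not local_path_for(remote_key, directory).exists()


# Demonstration with a temporary directory and a made-up S3 key
example_key = 'noaa-goes16/ABI-L2-MCMIPM/2022/336/20/example_file.nc'
with tempfile.TemporaryDirectory() as temp_dir:
    print(needs_download(example_key, temp_dir))   # True: file not there yet
    local_path_for(example_key, temp_dir).touch()  # create an empty stand-in file
    print(needs_download(example_key, temp_dir))   # False: file found locally
```

In the download loop, we could then call "fs.get" only when "needs_download(match, directory_path)" is True.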

## Step 5a: Browse the JPSS NODD on AWS

NOAA polar-orbiting satellite data from the SNPP and NOAA-20 satellites are available from the JPSS NODD on AWS: https://registry.opendata.aws/noaa-jpss/

There is one "bucket" that contains all of the JPSS data, which can be viewed in a web browser: https://noaa-jpss.s3.amazonaws.com/index.html

**We thank Lihang Zhou of NOAA/NESDIS/JPSS for her leadership of the JPSS NODD, and Gian Dilawari of NOAA/NESDIS/JPSS and his team for their hard work adding the massive JPSS datasets to the NODD!**

Data availability on the JPSS NODD varies widely; some JPSS products are not yet included in the NODD, and some products don't have a full archive of files on the NODD. More products are being added all the time, in response to end user requests.

Data files in the JPSS "bucket" are organized first by satellite and then by sensor name. The sensors on the SNPP and NOAA-20 satellites are the same. Let's access the JPSS "bucket" and list ("fs.ls") the available sensors for the NOAA-20 satellite, and then print the sensor names.

In [None]:
bucket = 'noaa-jpss'
satellite = 'NOAA20'

sensors_path = bucket + '/' + satellite 

sensors = fs.ls(sensors_path)

for sensor in sensors:
    print(sensor.split('/')[-1])

## Step 5b: Browse NOAA-20 Soundings data

We can see there are data available from four NOAA-20 sensors (ATMS, CrIS, OMPS, VIIRS) and for "Soundings". In this tutorial, we are going to download a Soundings data file and a VIIRS data file, both from the NOAA-20 satellite. 

Let's start by listing ("fs.ls") the available NOAA-20 Soundings products.

In [None]:
bucket = 'noaa-jpss'
satellite = 'NOAA20'
sensor = 'SOUNDINGS'

products_path = bucket + '/' + satellite + '/' + sensor

products = fs.ls(products_path)

for product in products:
    print(product.split('/')[-1])

## Step 5c: Browse NOAA-20 NUCAPS-EDR Soundings data for November 29, 2022

We can see the JPSS Soundings product names include the satellite ("NOAA20") and an abbreviation of the data product. In this tutorial, we are going to download a NOAA Unique Combined Atmospheric Processing System (NUCAPS) Environmental Data Record (EDR) file for November 29, 2022 at 19:07 UTC. On November 29, severe thunderstorms and tornadoes moved through parts of Mississippi and Alabama, killing two people. You will use the water vapor and temperature profile data in this file to generate a skew-T/log-P plot.

JPSS data files are organized on the NODD as follows:
- Satellite
- Sensor
- Product
- Year
- Month
- Day
- Filename

Let's list ("fs.ls") the available NOAA-20 NUCAPS-EDR Soundings files for November 29, 2022, and then print the total number of files and the first 10 file names.

In [None]:
bucket = 'noaa-jpss'
satellite = 'NOAA20'
sensor = 'SOUNDINGS'
product = 'NOAA20_NUCAPS-EDR'
year = 2022
month = 11
day = 29

files_path = bucket + '/' + satellite + '/' + sensor + '/' + product + '/' + str(year) + '/' + str(month).zfill(2)  + '/' + str(day).zfill(2)

files = fs.ls(files_path)

print('Total number of files:', len(files), '\n')

for file in files[:10]:
    print(file.split('/')[-1])
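Because JPSS files are organized by year, month, and day, the path can also be built directly from a "datetime.date" object. The sketch below assumes the Satellite/Sensor/Product/Year/Month/Day layout listed above; the function name ("jpss_files_path") and its default "bucket" argument are our own choices.

```python
import datetime


def jpss_files_path(satellite, sensor, product, date, bucket='noaa-jpss'):
    """Build the S3 path for one day of JPSS files from a date object."""
    # The format spec pads the month and day to 2 digits, like str.zfill(2)
    return f'{bucket}/{satellite}/{sensor}/{product}/{date:%Y/%m/%d}'


nucaps_path = jpss_files_path('NOAA20', 'SOUNDINGS', 'NOAA20_NUCAPS-EDR',
                              datetime.date(2022, 11, 29))
print(nucaps_path)
```

Using a date object also guards against impossible dates: "datetime.date(2022, 11, 31)" raises a "ValueError" before any request is sent.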

## Step 5d: Find the NOAA-20 NUCAPS-EDR Soundings data file for November 29, 2022 at 19:07 UTC

We can see that there are a lot of NUCAPS-EDR files for November 29: 2,691! This is because the JPSS satellites have global coverage. We want to download the file for 19:07 UTC.

Similar to ABI file names, JPSS file names also contain a lot of information about the data in the file. For an explanation of how to decode JPSS L2 granules (EDR) file names, see https://www.star.nesdis.noaa.gov/atmospheric-composition-training/satellite_data_decoding_data_file_names.php#viirs_level2

Using slicing and list comprehension, we can identify the one file we want by using the information in the file name to match the starting ("s") observation time of "1907510"; there are two files with observations starting at 19:07 UTC, so we need to use the seconds in the observation time to select the correct file. Then we print the file name to confirm it's the one we want and check the approximate file size ("fs.size") before we download the file.

In [None]:
observation_time = '1907510'

matches = [file for file in files if (file.split('/')[-1].split('_')[3][9:16] == observation_time)]

for match in matches:
    print(match.split('/')[-1])
    print('Approximate file size (MB):', round((fs.size(match)/1.0E6), 2))
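An alternative to matching an exact string of digits is to convert the start ("s") field of each file name to a "datetime" object and select files whose start time falls inside a time window. The sketch below shows the idea; the function name ("jpss_start_time") is our own, and the two file names are made up to follow the naming convention (only the start time "1907510" comes from this tutorial).

```python
import datetime


def jpss_start_time(file_name):
    """Parse the start ("s") field of a JPSS granule file name into a datetime."""
    start_field = file_name.split('/')[-1].split('_')[3]
    # Skip the leading 's' and drop the final digit (tenths of a second)
    return datetime.datetime.strptime(start_field[1:15], '%Y%m%d%H%M%S')


# Illustrative file names only: version, end, and creation fields are placeholders
example_files = [
    'NUCAPS-EDR_v0r0_j01_s202211291907190_e202211291907490_c202211291940000.nc',
    'NUCAPS-EDR_v0r0_j01_s202211291907510_e202211291908210_c202211291940000.nc',
]

window_start = datetime.datetime(2022, 11, 29, 19, 7, 30)
window_end = datetime.datetime(2022, 11, 29, 19, 8, 0)

selected = [name for name in example_files
            if window_start <= jpss_start_time(name) < window_end]
print(selected)
```

With the real "files" list from "fs.ls", the same comprehension would select every granule that starts inside the window.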

## Step 5e: Download the NOAA-20 NUCAPS-EDR Soundings data file for November 29, 2022 at 19:07 UTC

We use the same code as in Step 4e to download the NUCAPS-EDR file to our local computer.

In [None]:
for match in matches:
    fs.get(match, str(directory_path / match.split('/')[-1]))

## Step 5f: Browse NOAA-20 VIIRS data

We also need to download a NOAA-20 VIIRS data file. Let's list ("fs.ls") the available VIIRS products.

In [None]:
bucket = 'noaa-jpss'
satellite = 'NOAA20'
sensor = 'VIIRS'

products_path = bucket + '/' + satellite + '/' + sensor

products = fs.ls(products_path)

for product in products:
    print(product.split('/')[-1])

## Step 5g: Browse NOAA-20 VIIRS AF I-band data for October 16, 2022

There are a lot of VIIRS data products: > 80! In this tutorial, we are going to download a VIIRS Active Fires (AF) I-band Environmental Data Record (EDR) file for October 16, 2022 at 21:18 UTC, when wildfires in southern Washington State, near the Oregon border, underwent explosive growth. You will use the data in this file to plot fire detections on a map.

Let's list ("fs.ls") the available NOAA-20 VIIRS AF I-band data for October 16, 2022, and then print the total number of files and the first 10 file names.

In [None]:
bucket = 'noaa-jpss'
satellite = 'NOAA20'
sensor = 'VIIRS'
product = 'NOAA20_VIIRS_AF_I-Band_EDR_NRT'
year = 2022
month = 10
day = 16

files_path = bucket + '/' + satellite + '/' + sensor + '/' + product + '/' + str(year) + '/' + str(month).zfill(2)  + '/' + str(day).zfill(2)

files = fs.ls(files_path)

print('Total number of files:', len(files), '\n')

for file in files[:10]:
    print(file.split('/')[-1])

## Step 5h: Find the NOAA-20 VIIRS AF I-band data file for October 16, 2022 at 21:18 UTC

We can see there are a lot of VIIRS AF I-band EDR files for October 16: 1,011! Again, this is because the JPSS satellites have global coverage.

As we did in Step 5d, we use slicing and list comprehension to identify the one file we want by using the information in the file name to match the starting ("s") observation time of "2118". Then we print the file name to confirm it's the one we want and check the approximate file size ("fs.size") before we download the file.

In [None]:
observation_time = '2118'

matches = [file for file in files if (file.split('/')[-1].split('_')[3][9:13] == observation_time)]

for match in matches:
    print(match.split('/')[-1])
    print('Approximate file size (MB):', round((fs.size(match)/1.0E6), 2))

## Step 5i: Download the NOAA-20 VIIRS AF I-band data file for October 16, 2022 at 21:18 UTC

We use the same code as in Steps 4e and 5e to download the VIIRS AF I-band file to our local computer.

In [None]:
for match in matches:
    fs.get(match, str(directory_path / match.split('/')[-1]))

## Step 6a: Request information from a website

Not all satellite data are accessible via APIs, such as the NODD on AWS. Often, higher processing level data products such as Level 3 (L3) and Level 4 (L4) files are hosted by individual science teams on regular websites.  An example is the NOAA/NESDIS/STAR archive of VIIRS gridded aerosol data: https://www.star.nesdis.noaa.gov/pub/smcd/VIIRS_Aerosol/viirs_aerosol_gridded_data/ 

In this tutorial, we will use the **Requests** package to request information about a file on the VIIRS gridded aerosol data archive website, and then download the file. 

Documentation for the Requests package: https://requests.readthedocs.io/en/latest/

We will download a SNPP VIIRS L3 Aerosol Optical Depth (AOD) data file for September 11, 2022. L3 data are L2 data that have been mapped on a uniform space-time grid, i.e., the data have been averaged over space and/or time. The VIIRS L3 AOD product is a gridded global composite at 0.10° or 0.25° resolution, available as a daily or monthly average.  We will download a daily average AOD file at 0.10° resolution, and use the data in the file to plot AOD on a global map to see areas of optically thick aerosols from smoke, blowing dust, and haze.

The files on the VIIRS gridded aerosol data archive website are organized by satellite (NOAA-20 or SNPP), time averaging period (daily or monthly), and observation year. All of the daily or monthly files for a given year are located in the corresponding observation year directory. We want to download a SNPP satellite daily file for September 11, 2022, so we need to access the directory with all of the files for 2022: https://www.star.nesdis.noaa.gov/pub/smcd/VIIRS_Aerosol/viirs_aerosol_gridded_data/snpp/aod/eps/2022/

We can open the directory in a web browser and see the list of data files. The name of the file we want to download is "viirs_eps_npp_aod_0.100_deg_20220911.nc". Let's use **Requests** to get a "response" ("requests.get") for the URL corresponding to the file we want to download. 

Note that for simplicity, we are assuming you will open the webpage for the online archive of interest in a web browser to determine the file naming convention before sending a request using **Requests**. If you need to browse the files/content on a website with Python, use the **beautifulsoup4** package (part of the standard Anaconda installation) to scrape the "response.text" from **Requests**.

Documentation for beautifulsoup4: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [None]:
url = 'https://www.star.nesdis.noaa.gov/pub/smcd/VIIRS_Aerosol/viirs_aerosol_gridded_data/snpp/aod/eps/2022/'
file_name = 'viirs_eps_npp_aod_0.100_deg_20220911.nc'

response = requests.get(url + file_name)

## Step 6b: Check the status code of the website URL

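As a first step, it's a good idea to check the **HTTP status code** for the URL corresponding to the data file. This way, we can confirm that the file exists before we try to save it. Note that "requests.get" has already retrieved the full response, including the file content, into memory; the remaining steps inspect that response and write its content to a file on our local computer.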

List of HTTP status codes: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

In [None]:
status_code = response.status_code
print(status_code)
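A bare number like "200" or "404" is not very descriptive. As a sketch, the standard library's "http.HTTPStatus" can translate a status code into its standard phrase; the helper names below ("describe_status", "is_success") are our own. The **Requests** package also offers "response.ok" and "response.raise_for_status()" for the same kind of check.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return the status code with its standard phrase, e.g., '200 OK'."""
    try:
        status = HTTPStatus(status_code)
    except ValueError:
        return f'{status_code} (unrecognized status code)'
    return f'{status.value} {status.phrase}'


def is_success(status_code):
    """Return True for status codes in the 2xx (success) range."""
    return 200 <= status_code < 300


print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
print(is_success(404))       # False
```

For example, "describe_status(response.status_code)" would print a readable message for the response from Step 6a.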

## Step 6c: Get the approximate size of a file on a website archive

Great - we see that the status code for the URL corresponding to the file we want to download is "200" ("success"), so that means the file exists on the website. 

Before we download the file, let's check the approximate file size by listing the "content-length" of "response.headers".

In [None]:
file_size = response.headers['content-length']
print('Approximate file size (MB):', round((float(file_size)/1.0E6), 2))
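Not every web server includes a "content-length" header in its response, in which case the lookup above raises a "KeyError". The sketch below shows one defensive way to handle that; the function name ("size_in_mb") is our own, and plain dictionaries stand in for "response.headers" so the example runs without a network connection.

```python
def size_in_mb(headers):
    """Return the file size in MB from the headers, or None if it is not given."""
    length = headers.get('content-length')
    if length is None:
        return None
    return round(int(length) / 1.0E6, 2)


# Plain dictionaries stand in for response.headers in this demonstration
print(size_in_mb({'content-length': '12345678'}))  # 12.35
print(size_in_mb({}))                              # None
```

Unlike the plain dictionaries used here, "response.headers" ignores upper and lower case in header names, so "response.headers.get('Content-Length')" finds the same value.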

## Step 6d: Download the content of a request from a website (i.e., download a data file)

Now that we know the approximate size of the file, we can proceed to download it to our local computer. As in Steps 4e, 5e, and 5i, we set the "full_path" for the downloaded data file using **pathlib** syntax.

We download the data file by writing the "response.content" to a file using the built-in Python function "open(filename, mode)": "open(full_path, 'wb')". The "wb" argument opens the file for writing in binary mode. It's good practice to use the "with" keyword when dealing with file objects, because the file is closed automatically at the end of the "with" block, so we don't have to call "file.close()" ourselves.

In [None]:
full_path = str(directory_path / file_name)

with open(full_path, 'wb') as file:
    file.write(response.content)
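For very large files, holding the entire "response.content" in memory may not be practical. **Requests** can instead stream a response in pieces: "requests.get(url, stream=True)" followed by "response.iter_content(chunk_size=...)" yields the file as a sequence of byte chunks. The sketch below shows only the writing side; the function name ("write_chunks") is our own, and a short list of byte strings stands in for the chunks so the example runs without a network connection.

```python
import tempfile
from pathlib import Path


def write_chunks(chunks, full_path):
    """Write an iterable of byte chunks to a file; return the bytes written."""
    total = 0
    with open(full_path, 'wb') as file:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                total += file.write(chunk)
    return total


# A list of byte strings stands in for response.iter_content(...)
chunks = [b'abc', b'', b'defg']
with tempfile.TemporaryDirectory() as temp_dir:
    demo_path = Path(temp_dir) / 'demo.bin'
    bytes_written = write_chunks(chunks, demo_path)
    saved = demo_path.read_bytes()

print(bytes_written)  # 7
```

With a streamed response, the call would look like "write_chunks(response.iter_content(chunk_size=8192), full_path)", where the chunk size of 8192 bytes is an arbitrary choice.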