# Tutorial to Download NOAA Satellite Data Files

This tutorial was written in December 2022 by Dr. Amy Huff, IMSG at NOAA/NESDIS/STAR (amy.huff@noaa.gov) and Dr. Rebekah Esmaili, STC at NOAA/JPSS (rebekah.esmaili@noaa.gov). It demonstrates how to download satellite data files from the GOES-R & JPSS Amazon Web Services (AWS) Simple Storage Service (S3) buckets and the NOAA/NESDIS/STAR gridded aerosol data archive website.

The downloaded files will include:
- From the GOES-R S3 buckets:
    - GOES-16 (GOES-East) ABI L2 Multichannel Cloud & Moisture Imagery Products (MCMIP) - Mesoscale sector (M) data file for Dec 2, 2022 at 20:30 UTC (1 file)
- From the JPSS S3 bucket:
    - NOAA-20 VIIRS L2 Active Fires (AF) I-band Environmental Data Record (EDR) data files for Oct 16, 2022 at 21:16-21:19 UTC (3 files)
    - NOAA-20 NUCAPS Sounding Environmental Data Record (EDR) data file for Nov 29, 2022 at 19:07 UTC (1 file)
- From the NOAA/NESDIS/STAR gridded aerosol data archive website:
    - SNPP VIIRS L3 Aerosol Optical Depth (AOD) global gridded data file for Sep 11, 2022 (1 file)

## Topic 1: Getting Started with Jupyter Notebook

### Step 1.1: Import Python packages

We will use two Python packages (libraries) and two Python modules in this tutorial:
- The **S3Fs** library is used to set up a filesystem interface with the Amazon Simple Storage Service (S3)
- The **Requests** library is used to send HTTP requests
- The **datetime** module is used to manipulate dates and times
- The **pathlib** module is used to set filesystem paths for the user's operating system

In [None]:
import s3fs

import requests

import datetime

from pathlib import Path

### Step 1.2: Set directory path where satellite data files will be saved

We set the directory path for the satellite data files using the [pathlib module](https://docs.python.org/3/library/pathlib.html#module-pathlib), which automatically uses the correct format for the user's operating system. This helps avoid errors when more than one person is using the same code file, because Windows uses backslashes in directory paths, while macOS and Linux use forward slashes.

To keep things simple for this training, we put the satellite data files we download in the current working directory (```Path.cwd()```), i.e., the same Jupyter Notebook folder where this code file is located.

In [None]:
directory_path = Path.cwd()
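As a small illustration of the point about path separators, the sketch below joins a directory and a file name with the **pathlib** ```/``` operator. The file name ```example_file.nc``` is a placeholder chosen for this example, not one of the tutorial's data files.

```python
from pathlib import Path

# Current working directory, as in the cell above
directory_path = Path.cwd()

# Hypothetical file name, used only to illustrate path joining
example_name = 'example_file.nc'

# The "/" operator joins path components with the separator for the current OS
example_path = directory_path / example_name

print(example_path.name)    # file name only
print(example_path.parent)  # directory that would contain the file
```

Nothing is written to disk here; the ```Path``` object only describes where a file would be saved.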

## Topic 2: Downloading Satellite Data Files

### Step 2.1: Connect to AWS S3 (Simple Storage Service)

The [NOAA Open Data Dissemination (NODD) program](https://www.noaa.gov/information-technology/open-data-dissemination) is increasing access to NOAA satellite data, including data from the GOES-R geostationary satellites and JPSS polar-orbiting satellites. 

The NODD program disseminates data through collaborations with AWS, Google Earth Engine, and Microsoft Azure. We will use the AWS S3 buckets in this tutorial because they are free to access and do not require any additional registration or a password.

Think of the S3 buckets as online data archives. You do **not** need an AWS cloud computing account to access NOAA satellite data!

The [S3Fs package](https://s3fs.readthedocs.io/en/latest/) allows us to set up a filesystem (```fs```) interface to S3 buckets. We use an anonymous connection (```anon=True```) because the NODD S3 buckets are publicly available & read-only.

In [None]:
fs = s3fs.S3FileSystem(anon=True)

### Step 2.2: Download data from the GOES-R S3 buckets

The NODD program makes NOAA GOES-R geostationary satellite data from the GOES-16, GOES-17, and GOES-18 satellites available via [AWS](https://registry.opendata.aws/noaa-goes/).

There are separate S3 buckets for each satellite, which can be viewed in a web browser:
- [GOES-16](https://noaa-goes16.s3.amazonaws.com/index.html)
- [GOES-17](https://noaa-goes17.s3.amazonaws.com/index.html)
- [GOES-18](https://noaa-goes18.s3.amazonaws.com/index.html)

Data on the GOES-R S3 buckets are updated in near real-time, and there is a full archive of the publicly available data (generally provisional and full maturity data; [descriptions of satellite product maturity levels](https://www.star.nesdis.noaa.gov/atmospheric-composition-training/satellite_data_maturity_levels.php)). 

#### Step 2.2.1: Browse the GOES-16 S3 bucket

Data files in each of the satellite buckets are organized by product name. Let's access the GOES-16 bucket and list (```fs.ls```) the available products, and then print the product names.

In [None]:
bucket = 'noaa-goes16'

products_path = bucket

products = fs.ls(products_path)

for product in products:
    print(product.split('/')[-1])

#### Step 2.2.2: Find the Julian day

We can see that the [GOES-R product names](https://github.com/awslabs/open-data-docs/tree/main/docs/noaa/noaa-goes16#about-the-data) are abbreviations that begin with the sensor (ABI, EXIS, GLM, MAG, SEIS, SUVI) and [data processing level](https://www.star.nesdis.noaa.gov/atmospheric-composition-training/satellite_data_processing_levels.php) (L1b, L2). ABI product names include an abbreviation of the data product (e.g., Rad, AOD, MCMIP) and the [ABI scan sector](https://www.star.nesdis.noaa.gov/atmospheric-composition-training/satellite_data_abi_scan_sectors.php): F (full disk), C (CONUS), or M (mesoscale). Not all ABI data products are generated for each scan sector.

ABI data files are organized in the S3 buckets as follows:
- Product
- Year
- Julian day
- Hour
- Filename

ABI data are classified by the 3-digit Julian day of the year instead of the month and day of the month. So in order to browse any of the ABI data files, we need to know the Julian day for our observation day of interest. 

The following code uses the Python **datetime** module to return the 3-digit Julian day for any year, month, and day we enter as **integers**. We are going to download ABI data for December 2, 2022.

In [None]:
year = 2022
month = 12
day = 2

julian_day = datetime.datetime(year, month, day).strftime('%j')
print(julian_day)
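The same conversion can be wrapped in small helper functions, which may be convenient when looking up several dates. This is a minimal sketch; the function names ```julian_day_string``` and ```calendar_date``` are hypothetical and are not used elsewhere in this tutorial. The second function goes in the reverse direction, from a day of year back to a calendar date.

```python
import datetime

def julian_day_string(year, month, day):
    """Return the zero-padded 3-digit day of year for a calendar date."""
    return datetime.date(year, month, day).strftime('%j')

def calendar_date(year, julian_day):
    """Return the calendar date for a year and a day of year."""
    text = f'{year}{int(julian_day):03d}'
    return datetime.datetime.strptime(text, '%Y%j').date()

print(julian_day_string(2022, 12, 2))   # 336
print(calendar_date(2022, 336))         # 2022-12-02
```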

#### Step 2.2.3: Browse GOES-16 ABI-L2-MCMIPM data for Dec 2, 2022 at 20:00-20:59 UTC

In this tutorial, we are going to download the GOES-16 ABI L2 Multichannel Cloud & Moisture Imagery Products Mesoscale 2 sector (ABI-L2-MCMIPM) data file for Dec 2, 2022 at 20:30 UTC, when strong winds associated with the passage of a cold front generated blowing dust in eastern CO & western KS. You will use the data in this file to create composite color imagery to visualize the dust storm, including dust RGB and (simulated) true color.

The L2 Cloud and Moisture Imagery Product (CMIP) files are easier to work with than the L1b radiance (Rad) files for making composite imagery because the CMIP files contain the ABI band values needed to calculate the composites: brightness temperature (BT) at the Top-Of-Atmosphere (TOA) in Kelvin for the ABI emissive bands (7-16) and the dimensionless reflectance (normalized by the solar zenith angle) for the ABI reflective bands (1-6); the multiband product file (MCMIP) includes all band values in one file at a consistent spatial resolution of 2 km.

Now that we know the Julian day for Dec 2, 2022, we can set the ```data_path``` for GOES-16 ABI MCMIPM files for ```hour=20```, which corresponds to observations between 20:00-20:59 UTC. 

The Python string method ```str.zfill(width)``` ensures that the ```julian``` value in the ```data_path``` is 3 digits and the ```hour``` value in the ```data_path``` is 2 digits; ```str.zfill(width)``` returns a copy of the string left-filled with ASCII '0' digits to make a string of length ```width```. This ensures that the ```data_path``` syntax is correct for ```julian``` values < 100 and ```hour``` values < 10.
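To see the zero-padding in isolation, here is a short sketch with made-up values (a day of year of 5 and an hour of 7), chosen only because they are small enough to need padding.

```python
# Hypothetical values that need zero-padding
julian = 5   # day of year < 100
hour = 7     # hour < 10

print(str(julian).zfill(3))   # 005
print(str(hour).zfill(2))     # 07

# Strings that are already long enough are returned unchanged
print(str(336).zfill(3))      # 336

# An f-string format specification gives the same result for non-negative integers
print(f'{julian:03d}/{hour:02d}')   # 005/07
```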

Let's list (```fs.ls```) the available ```MCMIPM``` files for Dec 2, 2022 at 20:00-20:59 UTC, and then print the total number of files and print the first 10 and last 10 file names.

In [None]:
bucket = 'noaa-goes16'
product = 'ABI-L2-MCMIPM'
year = 2022
julian = 336
hour = 20

data_path = bucket + '/' + product + '/'  + str(year) + '/' + str(julian).zfill(3) + '/' + str(hour).zfill(2)

files = fs.ls(data_path)

print('Total number of files:', len(files), '\n')

for file in files[:10]:
    print(file.split('/')[-1])
for file in files[-10:]:
    print(file.split('/')[-1])

#### Step 2.2.4: Find the GOES-16 ABI-L2-MCMIPM2 data file for Dec 2, 2022 at 20:30 UTC

We can see that there are 120 ```MCMIPM``` files each hour: one observation every minute for each of the two Mesoscale sectors, M1 and M2. We want to download the M2 file for 20:30 UTC.

[ABI file names](https://www.star.nesdis.noaa.gov/atmospheric-composition-training/satellite_data_decoding_data_file_names.php#abi_level2) contain a lot of information about the data in the file. Using slicing and list comprehension, we can identify the one file we want by using the information in the file name to match the starting (```s```) observation time of ```2030``` and the product name ```MCMIPM2```. Then we print the file name to confirm it's the one we want and check the approximate file size (```fs.size```) before we download the file.

In [None]:
observation_time = '2030'
product_name = 'MCMIPM2'

matches = [file for file in files if (file.split('/')[-1].split('_')[3][8:12] == observation_time and file.split('/')[-1].split('-')[2] == product_name)]

for match in matches:
    print(match.split('/')[-1])
    print('Approximate file size (MB):', round((fs.size(match)/1.0E6), 2))
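The slicing above can be easier to follow when applied to a single file name. The sketch below uses a hypothetical file name that follows the ABI L2 naming pattern (the exact seconds in the real file name may differ) and a hypothetical helper, ```parse_abi_file_name```, that returns the product name and the observation start time as a ```datetime``` object. It assumes the start field has the form ```s``` + year + Julian day + hour + minute + second + tenths of a second.

```python
import datetime

# Hypothetical file name following the ABI L2 naming pattern (illustration only)
file_name = 'OR_ABI-L2-MCMIPM2-M6_G16_s20223362030550_e20223362030619_c20223362031112.nc'

def parse_abi_file_name(name):
    """Return (product name, start time) decoded from an ABI file name."""
    fields = name.split('_')
    product = fields[1].split('-')[2]   # e.g., 'MCMIPM2'
    start_field = fields[3]             # 's' + YYYYJJJHHMMSS + tenths of a second
    # Characters 1-13 hold year, Julian day, hour, minute, and second
    start = datetime.datetime.strptime(start_field[1:14], '%Y%j%H%M%S')
    return product, start

product_name, start_time = parse_abi_file_name(file_name)
print(product_name, start_time)
```

Working with a ```datetime``` object instead of a string slice makes it possible to compare observation times across days, at the cost of a little more code.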

#### Step 2.2.5: Download the GOES-16 ABI-L2-MCMIPM2 data file for Dec 2, 2022 at 20:30 UTC

Now that we have identified the file we want to download from the GOES-16 S3 bucket and checked the file size, we can proceed to download (```fs.get```) the file to the directory we set (```directory_path```, the cwd) on our local computer. 

We set the full path for the downloaded data file using **pathlib** syntax, which uses a forward slash ("/") to join the ```directory_path``` and the file name for the downloaded file.

In [None]:
for match in matches:
    fs.get(match, str(directory_path / match.split('/')[-1]))
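When a notebook is re-run, it can be useful to skip files that are already on disk. The sketch below shows one way to do that, assuming that a file with the same name in the target directory is a complete copy. The helper name ```download_if_missing``` is hypothetical; it only relies on the ```fs.get``` call already used above.

```python
from pathlib import Path

def download_if_missing(fs, remote_path, directory):
    """Download remote_path into directory unless a local copy already exists.

    fs is any object with a get(remote, local) method, such as the
    s3fs.S3FileSystem created in Step 2.1. Returns the local path.
    """
    local_path = Path(directory) / remote_path.split('/')[-1]
    if local_path.exists():
        print('Already downloaded:', local_path.name)
    else:
        fs.get(remote_path, str(local_path))
    return local_path

# Usage with the variables defined in the cells above:
# for match in matches:
#     download_if_missing(fs, match, directory_path)
```

This check does not detect partially downloaded files; comparing the local file size with ```fs.size``` would be one way to strengthen it.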

### Step 2.3: Download data from the JPSS S3 bucket

The NODD program makes NOAA JPSS polar-orbiting satellite data from the SNPP and NOAA-20 satellites available via [AWS](https://registry.opendata.aws/noaa-jpss/).

There is one S3 bucket that contains all of the JPSS data, which can be viewed in a web browser: 
- [JPSS satellites](https://noaa-jpss.s3.amazonaws.com/index.html)

The JPSS satellites generate an enormous volume of data products, which are gradually being added to the NODD. As a result, JPSS data availability on the NODD varies widely; some JPSS products are not yet included in the NODD, and some products don't have a full archive of files on the NODD. More products are being added all the time, in response to end user requests.

**We thank Lihang Zhou of NOAA/NESDIS/JPSS for her leadership of the JPSS NODD, and Gian Dilawari of NOAA/NESDIS/JPSS and his team for their hard work adding the massive JPSS datasets to the NODD!**

#### Step 2.3.1: Browse the JPSS S3 bucket

Data files in the JPSS S3 bucket are organized by satellite (SNPP or NOAA-20) and sensor name. There is also a category for blended products (containing data from both satellites).

In this tutorial, we are going to download four data files, all from the NOAA-20 satellite.

Let's access the JPSS bucket and list (```fs.ls```) the available sensors for the NOAA-20 satellite, and then print the sensor names.

In [None]:
bucket = 'noaa-jpss'
satellite = 'NOAA20'

sensors_path = bucket + '/' + satellite 

sensors = fs.ls(sensors_path)

for sensor in sensors:
    print(sensor.split('/')[-1])

#### Step 2.3.2: Browse NOAA-20 Soundings data

We can see there are data available from four NOAA-20 sensors (ATMS, CrIS, OMPS, VIIRS) and for "Soundings". We are going to download a Soundings data file and three VIIRS data files. 

Let's start by listing (```fs.ls```) the available NOAA-20 Soundings products.

In [None]:
bucket = 'noaa-jpss'
satellite = 'NOAA20'
sensor = 'SOUNDINGS'

products_path = bucket + '/' + satellite + '/' + sensor

products = fs.ls(products_path)

for product in products:
    print(product.split('/')[-1])

#### Step 2.3.3: Browse NOAA-20 Soundings NUCAPS-EDR data for November 29, 2022

We can see the JPSS Soundings product names include the satellite ("NOAA20") and an abbreviation of the data product. We are going to download a NOAA Unique Combined Atmospheric Processing System (NUCAPS) Environmental Data Record (EDR) file for November 29, 2022 at 19:07 UTC. On November 29, severe thunderstorms and tornadoes moved through parts of Mississippi and Alabama, killing two people. You will use the water vapor and temperature profiles data in this file to generate a skew-T/log-P plot. 

JPSS data files are organized in the S3 bucket as follows:
- Satellite
- Sensor
- Product
- Year
- Month
- Day
- Filename

Let's list (```fs.ls```) the available NOAA-20 Soundings NUCAPS-EDR files for November 29, 2022, and then print the total number of files and the first 10 file names.

In [None]:
bucket = 'noaa-jpss'
satellite = 'NOAA20'
sensor = 'SOUNDINGS'
product = 'NOAA20_NUCAPS-EDR'
year = 2022
month = 11
day = 29

files_path = bucket + '/' + satellite + '/' + sensor + '/' + product + '/' + str(year) + '/' + str(month).zfill(2)  + '/' + str(day).zfill(2)

files = fs.ls(files_path)

print('Total number of files:', len(files), '\n')

for file in files[:10]:
    print(file.split('/')[-1])
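Because the JPSS bucket is organized by calendar date, the path can also be built from a ```datetime.date``` object, which handles the zero-padding of the month and day. This is a minimal sketch; the function name ```jpss_files_path``` is hypothetical.

```python
import datetime

def jpss_files_path(satellite, sensor, product, date, bucket='noaa-jpss'):
    """Build the S3 path for one day of files, following the
    Satellite/Sensor/Product/Year/Month/Day layout listed above."""
    return '/'.join([bucket, satellite, sensor, product, date.strftime('%Y/%m/%d')])

files_path = jpss_files_path('NOAA20', 'SOUNDINGS', 'NOAA20_NUCAPS-EDR',
                             datetime.date(2022, 11, 29))
print(files_path)   # noaa-jpss/NOAA20/SOUNDINGS/NOAA20_NUCAPS-EDR/2022/11/29
```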

#### Step 2.3.4: Find the NOAA-20 Soundings NUCAPS-EDR data file for November 29, 2022 at 19:07 UTC

We can see that there are a lot of NUCAPS-EDR files for November 29: 2,691! This is because the JPSS satellites have global coverage. We want to download the file for 19:07 UTC.

Similar to ABI file names, [JPSS file names](https://www.star.nesdis.noaa.gov/atmospheric-composition-training/satellite_data_decoding_data_file_names.php#viirs_level2) also contain a lot of information about the data in the file. Using slicing and list comprehension, we can identify the one file we want by using the information in the file name to match the starting (```s```) observation time of ```1907510```; there are two files with observations starting at 19:07 UTC, so we need to use the seconds in the observation time to select the correct file. Then we print the file name to confirm it's the one we want and check the approximate file size (```fs.size```) before we download the file.

In [None]:
observation_time = '1907510'

matches = [file for file in files if (file.split('/')[-1].split('_')[3][9:16] == observation_time)]

for match in matches:
    print(match.split('/')[-1])
    print('Approximate file size (MB):', round((fs.size(match)/1.0E6), 2))

#### Step 2.3.5: Download the NOAA-20 Soundings NUCAPS-EDR data file for November 29, 2022 at 19:07 UTC

We use the same code as in Step 2.2.5 to download the NUCAPS-EDR file to our local computer.

In [None]:
for match in matches:
    fs.get(match, str(directory_path / match.split('/')[-1]))

#### Step 2.3.6: Browse NOAA-20 VIIRS data

We also need to download three NOAA-20 VIIRS data files. Let's list (```fs.ls```) the available VIIRS products.

In [None]:
bucket = 'noaa-jpss'
satellite = 'NOAA20'
sensor = 'VIIRS'

products_path = bucket + '/' + satellite + '/' + sensor

products = fs.ls(products_path)

for product in products:
    print(product.split('/')[-1])

#### Step 2.3.7: Browse NOAA-20 VIIRS AF I-band data for October 16, 2022

There are a lot of VIIRS data products: > 80! We are going to download VIIRS Active Fires (AF) I-band Environmental Data Record (EDR) files for October 16, 2022 at 21:16-21:19 UTC, when wildfires in the US Pacific Northwest underwent explosive growth. You will combine these three individual netCDF4 files into one large netCDF4 file, and use the data in this file to plot fire detections on a map.

Let's list (```fs.ls```) the available NOAA-20 VIIRS AF I-band data for October 16, 2022, and then print the total number of files and the first 10 file names.

In [None]:
bucket = 'noaa-jpss'
satellite = 'NOAA20'
sensor = 'VIIRS'
product = 'NOAA20_VIIRS_AF_I-Band_EDR_NRT'
year = 2022
month = 10
day = 16

files_path = bucket + '/' + satellite + '/' + sensor + '/' + product + '/' + str(year) + '/' + str(month).zfill(2)  + '/' + str(day).zfill(2)

files = fs.ls(files_path)

print('Total number of files:', len(files), '\n')

for file in files[:10]:
    print(file.split('/')[-1])

#### Step 2.3.8: Find the NOAA-20 VIIRS AF I-band data files for October 16, 2022 at 21:16-21:19 UTC

We can see there are a lot of VIIRS AF I-band EDR files for October 16: 1,011! Again, this is because the JPSS satellites have global coverage.

As we did in Step 2.3.4, we use slicing and list comprehension to identify the three files we want by using the information in the file names to match the starting (```s```) observation time range of ```2116``` to ```2119```. Then we print the file names to confirm they are the ones we want and check the approximate size of each file (```fs.size```) before we download them.

In [None]:
start_time = '2116'
end_time = '2119'

matches = [file for file in files if (file.split('/')[-1].split('_')[3][9:13] >= start_time and file.split('/')[-1].split('_')[3][9:13] <= end_time)]

for match in matches:
    print(match.split('/')[-1])
    print('Approximate file size (MB):', round((fs.size(match)/1.0E6), 2))
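The time-range match relies on comparing strings, which works because the hour and minute are zero-padded to a fixed width. The sketch below makes that explicit with two hypothetical helper functions and a list of made-up file names that only imitate the structure assumed by the slicing above (the fourth underscore-separated field starts with ```s``` followed by the date and time).

```python
def start_hhmm(path):
    """Return the HHMM part of the start-time field in a JPSS file name."""
    start_field = path.split('/')[-1].split('_')[3]   # 's' + YYYYMMDDHHMMSS + tenths
    return start_field[9:13]

def in_time_window(path, start_time, end_time):
    """True if the file's start time falls within [start_time, end_time].

    Comparing strings works here because HHMM values are zero-padded to the
    same width, so alphabetical order matches chronological order.
    """
    return start_time <= start_hhmm(path) <= end_time

# Made-up file names for illustration only
examples = [
    'AF-Iband_v1r0_j01_s202210162114000_e202210162115000_c202210162200000.nc',
    'AF-Iband_v1r0_j01_s202210162116000_e202210162117000_c202210162200000.nc',
    'AF-Iband_v1r0_j01_s202210162119000_e202210162120000_c202210162200000.nc',
    'AF-Iband_v1r0_j01_s202210162121000_e202210162122000_c202210162200000.nc',
]

selected = [name for name in examples if in_time_window(name, '2116', '2119')]
print(len(selected), 'of', len(examples), 'example names fall in the window')
```

Putting the slicing in one function avoids repeating the same ```split``` expression twice inside the list comprehension.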

#### Step 2.3.9: Download the NOAA-20 VIIRS AF I-band data files for October 16, 2022 at 21:16-21:19 UTC

We use the same code as in Steps 2.2.5 and 2.3.5 to download the NOAA-20 VIIRS AF I-band files to our local computer.

In [None]:
for match in matches:
    fs.get(match, str(directory_path / match.split('/')[-1]))

### Step 2.4: Request information from a website

Not all NOAA satellite data products are available from the S3 buckets. Many higher processing level data products - Level 3 (L3) and Level 4 (L4) - are hosted by individual science teams on regular websites.  An example is the NOAA/NESDIS/STAR archive of [VIIRS gridded aerosol data](https://www.star.nesdis.noaa.gov/pub/smcd/VIIRS_Aerosol/viirs_aerosol_gridded_data/). 

We will use the [Requests package](https://requests.readthedocs.io/en/latest/) to request information about a file on the VIIRS gridded aerosol data archive website, and then download the file. 

#### Step 2.4.1: Request information about a SNPP VIIRS L3 AOD data file for September 11, 2022

We will download a SNPP VIIRS L3 Aerosol Optical Depth (AOD) data file for September 11, 2022. L3 data are L2 data that have been mapped on a uniform space-time grid, i.e., the data have been averaged over space and/or time. The VIIRS L3 AOD product is a gridded global composite at 0.10° or 0.25° resolution, available as a daily or monthly average.  We will download a daily average AOD file at 0.10° resolution, and use the data in the file to plot AOD on a global map to see areas of optically thick aerosols from smoke, blowing dust, and haze.

The files on the VIIRS gridded aerosol data archive website are organized by satellite (NOAA-20 or SNPP), time averaging period (daily or monthly), and observation year. All of the daily or monthly files for a given year are located in the corresponding observation year directory. We want to download a SNPP satellite daily file for September 11, 2022, so we need to access the [directory with all of the files for 2022](https://www.star.nesdis.noaa.gov/pub/smcd/VIIRS_Aerosol/viirs_aerosol_gridded_data/snpp/aod/eps/2022/).

We can open the directory in a web browser and see the list of data files. The name of the file we want to download is ```viirs_eps_npp_aod_0.100_deg_20220911.nc```. Let's use **Requests** to get a "response" (```requests.get()```) for the URL corresponding to the file we want to download. 

Note that for simplicity, we are assuming you will open the webpage for the online archive of interest in a web browser to determine the file naming convention before sending a request using **Requests**. If you need to browse the files/content on a website with Python, use the [beautifulsoup4 package](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) (part of the standard Anaconda installation) to scrape the ```response.text``` from **Requests**.

In [None]:
url = 'https://www.star.nesdis.noaa.gov/pub/smcd/VIIRS_Aerosol/viirs_aerosol_gridded_data/snpp/aod/eps/2022/'
file_name = 'viirs_eps_npp_aod_0.100_deg_20220911.nc'

response = requests.get(url + file_name)

#### Step 2.4.2: Check the status code of the website URL

As a first step, it's a good idea to check the [HTTP status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) for the URL corresponding to the data file. This way, we can confirm that the file exists before we try to download it.

In [None]:
status_code = response.status_code
print(status_code)
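A numeric status code is easier to interpret alongside its standard description. The sketch below uses the ```http.HTTPStatus``` enumeration from the Python standard library to look up the description for a code; the function name ```describe_status``` is hypothetical. The comments at the end mention two shortcuts that **Requests** provides on the response object.

```python
from http import HTTPStatus

def describe_status(status_code):
    """Return the standard reason phrase for an HTTP status code."""
    return HTTPStatus(status_code).phrase

for code in (200, 404):
    print(code, describe_status(code))

# With the response from Step 2.4.1, Requests offers two shortcuts:
# response.ok                  is False for 4xx and 5xx error codes
# response.raise_for_status()  raises requests.HTTPError for 4xx and 5xx codes
```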

#### Step 2.4.3: Get the approximate size of a file on a website archive

Great - we see that the status code for the URL corresponding to the file we want to download is ```200``` ("success"), so that means the file exists on the website. 

Before we download the file, let's check the approximate file size by listing the ```content-length``` of ```response.headers```.

In [None]:
file_size = response.headers['content-length']
print('Approximate file size (MB):', round((float(file_size)/1.0E6), 2))

#### Step 2.4.4: Download the content of a request from a website (i.e., download a data file)

Now that we know the approximate size of the file, we can proceed to download it to our local computer. As in Steps 2.2.5, 2.3.5, and 2.3.9, we set the ```full_path``` for the downloaded data file using **pathlib** syntax.

We download the data file by writing the ```response.content``` to a file using the built-in Python ```open(filename, mode)``` function: ```open(full_path, 'wb')```. The ```'wb'``` argument means write-only in binary mode. It's good practice to use the ```with``` keyword when dealing with file objects, so we don't have to call ```file.close()``` to close the open file when we're done writing content to it.

In [None]:
full_path = str(directory_path / file_name)

with open(full_path, 'wb') as file:
    file.write(response.content)
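For larger files, holding the entire ```response.content``` in memory may not be practical. An alternative, sketched below, is to request the file with ```stream=True``` and write it to disk in chunks using ```response.iter_content()```. The helper name ```save_chunks``` and the 1 MB chunk size are choices made for this example only.

```python
def save_chunks(chunks, full_path):
    """Write an iterable of byte chunks to full_path; return bytes written."""
    total = 0
    with open(full_path, 'wb') as file:
        for chunk in chunks:
            total += file.write(chunk)
    return total

# Usage sketch (requires network access; url, file_name, and full_path as above):
# import requests
# with requests.get(url + file_name, stream=True) as response:
#     response.raise_for_status()
#     save_chunks(response.iter_content(chunk_size=1024 * 1024), full_path)
```

For a file of the size used in this tutorial, writing ```response.content``` in one step, as in the cell above, is simpler and works well.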