# Getting Started with the Subseasonal Data Python Package

The `subseasonal_data` package provides an API for loading and manipulating the **SubseasonalClimateUSA** dataset, which was developed for training and benchmarking subseasonal forecasting models. Here, _subseasonal_ refers to climate and weather forecasts made 2 to 6 weeks in advance.
See [DATA.md](../DATA.md) for a detailed description of dataset contents, sources, and processing.

The underlying data is made available through Azure and is updated periodically. To download the data through this package, you will need to have the Azure Storage CLI [`azcopy`](https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy) installed on your machine.
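Before calling the package, it can help to confirm that `azcopy` is discoverable on your `PATH`. The sketch below uses only the Python standard library. `find_azcopy` is a hypothetical helper name introduced for illustration and is not part of `subseasonal_data`; the package's own check, `downloader.check_azcopy_install()`, appears later in this notebook.

```python
import shutil


def find_azcopy(executable="azcopy"):
    """Return the full path of the executable, or None if it is not on PATH.

    Hypothetical helper for illustration; not part of subseasonal_data.
    """
    return shutil.which(executable)


path = find_azcopy()
if path is None:
    print("azcopy was not found on PATH; install it before downloading data.")
else:
    print(f"azcopy found at {path}")
```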

**Data directory structure (on Azure):**
* `dataframes`: individual dataframes containing meteorological measurements and dynamical model forecasts
* `combined_dataframes`: combination dataframes pairing temperature and precipitation target variables with lagged measurement and forecast features
* `masks`: lat-lon filters for the contiguous U.S. and Western U.S.

**Package structure:**
* `downloader`: methods for manually downloading the data
* `data_loaders`: methods for loading the data. By default, these can download data on demand.


In this notebook:

1. [Download all subseasonal forecasting data](#Download-all-subseasonal-forecasting-data)
2. [Download one file](#Download-one-file)
3. [Download on demand](#Download-on-demand)

Requirements: `Python 3.6+`, `azcopy`.

In [1]:
from subseasonal_data import downloader, data_loaders

# Download all subseasonal forecasting data

Use the `subseasonal_data.downloader.download` function to download the entire dataset to disk. **WARNING:** The entire dataset is approximately 175 GB in size, so proceed with caution.

There are two ways to specify where to download the data:
* **Default:** user's home directory
    * Windows: `C:\users\[username]\subseasonal_data`
    * Linux: `/home/[username]/subseasonal_data`
    * Mac: `/Users/[username]/subseasonal_data`
* **Environment variable:** specify the path via the `SUBSEASONALDATA_PATH` environment variable. **Note:** the package will also look for the data at this location in the future, so make sure the variable is set permanently. If the variable is undefined, the path defaults to the `subseasonal_data` folder in the user's home directory.
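The lookup order described above can be sketched with the standard library alone. This is not the package's implementation: `expected_data_path` is a hypothetical helper, and treating an empty value the same as an undefined variable is an assumption made here. For the authoritative answer, use `downloader.get_subseasonal_data_path()`.

```python
import os
from pathlib import Path

ENV_VAR = "SUBSEASONALDATA_PATH"  # variable name taken from the text above


def expected_data_path(environ=None):
    """Mirror the documented lookup: environment variable first, then home.

    Hypothetical helper; an empty value is treated as undefined (assumption).
    """
    environ = os.environ if environ is None else environ
    custom = environ.get(ENV_VAR)
    if custom:
        return Path(custom)
    return Path.home() / "subseasonal_data"


# With no variable set, the default under the home directory is used.
print(expected_data_path({}))
# With the variable set, its value wins. The path here is only an example.
print(expected_data_path({ENV_VAR: "/data/subseasonal"}))
```

Setting `os.environ["SUBSEASONALDATA_PATH"]` inside a Python session only affects that session, and this notebook does not say whether the package reads the variable at import time or at call time. Defining the variable permanently in your shell profile or system settings, as recommended above, avoids both issues.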

To find out where the data will be (or was) downloaded, use the `subseasonal_data.downloader.get_subseasonal_data_path` function.

**Note:** The initial download can take a while. Subsequent downloads only sync the data, which is usually faster.

In [2]:
# Test azcopy
downloader.check_azcopy_install()

In [3]:
# Get download path
downloader.get_subseasonal_data_path()

'/pool001/moprescu'

In [4]:
# Download/sync data
downloader.download()

Downloading data from the 'dataframes' directory...
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support

Job a291ab41-1a52-c14c-49c3-a3a62b041ef9 has started
Log file is located at: /home/moprescu/.azcopy/a291ab41-1a52-c14c-49c3-a3a62b041ef9.log

0 Files Scanned at Source, 0 Files Scanned at Destination

Job a291ab41-1a52-c14c-49c3-a3a62b041ef9 Summary
Files Scanned at Source: 199
Files Scanned at Destination: 199
Elapsed Time (Minutes): 0.0336
Number of Copy Transfers for Files: 0
Number of Copy Transfers for Folder Properties: 0 
Total Number Of Copy Transfers: 0
Number of Copy Transfers Completed: 0
Number of Copy Transfers Failed: 0
Number of Deletions at Destination: 0
Total Number of Bytes Transferred: 0
Total Number of Bytes Enumerated: 0
Final Job Status: Completed

Downloading data from the 'combined_dataframes' directory...
INFO: Any empty folders will not be processed, because source and/or destination doesn't hav

# Download one file

To download a single file, use the `subseasonal_data.downloader.download_file` function. You will need to specify the data directory on Azure (see the data directory structure under [Getting Started with the Subseasonal Data Python Package](#Getting-Started-with-the-Subseasonal-Data-Python-Package)) as well as the file name.

You can list the files in a data directory using `subseasonal_data.downloader.list_subdir_files`.



In [5]:
# List files in 'combined_dataframes'
downloader.list_subdir_files(data_subdir="combined_dataframes")

INFO: all_data-contest_precip_34w.feather;  Content Length: 5.28 GiB
INFO: all_data-contest_precip_56w.feather;  Content Length: 5.28 GiB
INFO: all_data-contest_tmp2m_34w.feather;  Content Length: 2.76 GiB
INFO: all_data-contest_tmp2m_56w.feather;  Content Length: 2.76 GiB
INFO: all_data-salient_fri.feather;  Content Length: 66.49 MiB
INFO: all_data-us_precip_34w.feather;  Content Length: 8.88 GiB
INFO: all_data-us_precip_56w.feather;  Content Length: 8.88 GiB
INFO: all_data-us_tmp2m_34w.feather;  Content Length: 4.63 GiB
INFO: all_data-us_tmp2m_56w.feather;  Content Length: 4.63 GiB
INFO: all_data_no_NA-contest_precip_34w.feather;  Content Length: 2.05 MiB
INFO: all_data_no_NA-contest_precip_56w.feather;  Content Length: 2.04 MiB
INFO: all_data_no_NA-contest_tmp2m_34w.feather;  Content Length: 2.01 MiB
INFO: all_data_no_NA-contest_tmp2m_56w.feather;  Content Length: 2.00 MiB
INFO: all_data_no_NA-us_precip_34w.feather;  Content Length: 3.38 MiB
INFO: all_data_no_NA-us_precip_56w.feathe
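Because file sizes in this directory range from a few MiB to several GiB, it can be useful to inspect sizes before downloading. The sketch below parses lines in the format shown in the listing output. It assumes the listing text has already been captured as a string (this notebook does not show whether `list_subdir_files` returns a value or only prints), and `parse_listing` is a hypothetical helper, not part of `subseasonal_data`.

```python
import re

# Binary unit multipliers matching the "MiB"/"GiB" suffixes in the listing.
UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}
LINE_RE = re.compile(
    r"^INFO:\s*(?P<name>[^;]+);\s*Content Length:\s*"
    r"(?P<size>[\d.]+)\s*(?P<unit>\w+)\s*$"
)


def parse_listing(text):
    """Map file name -> approximate size in bytes; skip lines that do not match."""
    sizes = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match["unit"] in UNITS:
            sizes[match["name"].strip()] = float(match["size"]) * UNITS[match["unit"]]
    return sizes


# Three lines copied from the listing output above.
sample = """\
INFO: all_data-contest_precip_34w.feather;  Content Length: 5.28 GiB
INFO: all_data-salient_fri.feather;  Content Length: 66.49 MiB
INFO: all_data_no_NA-contest_tmp2m_56w.feather;  Content Length: 2.00 MiB
"""

sizes = parse_listing(sample)
# Keep only files smaller than 100 MiB.
small = sorted(name for name, size in sizes.items() if size < 100 * 2**20)
print(small)
```

The sizes are approximate because the listing rounds to two decimal places. Truncated or otherwise malformed lines are skipped rather than raising an error.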

In [6]:
# Download/sync a file
downloader.download_file(data_subdir="combined_dataframes", filename="all_data-us_precip_34w.feather", verbose=True)

INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support

Job 4a18a083-ec54-e74c-7cd3-0233c470073e has started
Log file is located at: /home/moprescu/.azcopy/4a18a083-ec54-e74c-7cd3-0233c470073e.log

0 Files Scanned at Source, 1 Files Scanned at Destination

Job 4a18a083-ec54-e74c-7cd3-0233c470073e Summary
Files Scanned at Source: 1
Files Scanned at Destination: 1
Elapsed Time (Minutes): 0.0334
Number of Copy Transfers for Files: 0
Number of Copy Transfers for Folder Properties: 0 
Total Number Of Copy Transfers: 0
Number of Copy Transfers Completed: 0
Number of Copy Transfers Failed: 0
Number of Deletions at Destination: 0
Total Number of Bytes Transferred: 0
Total Number of Bytes Enumerated: 0
Final Job Status: Completed



# Download on demand

As a space-efficient alternative, the data loader methods can download data on demand when the `sync` flag is set to `True`, which is the default for these methods. Pass `sync=False` to skip the sync step and load whatever is already on disk.

In [7]:
df = data_loaders.get_ground_truth("us_tmp2m")

Syncing data....Set sync=False to avoid this step.
INFO: Any empty folders will not be processed, because source and/or destination doesn't have full folder support

Job 2ca29862-ed03-b640-6b21-0c4bf39b0135 has started
Log file is located at: /home/moprescu/.azcopy/2ca29862-ed03-b640-6b21-0c4bf39b0135.log

0 Files Scanned at Source, 1 Files Scanned at Destination

Job 2ca29862-ed03-b640-6b21-0c4bf39b0135 Summary
Files Scanned at Source: 1
Files Scanned at Destination: 1
Elapsed Time (Minutes): 0.0336
Number of Copy Transfers for Files: 0
Number of Copy Transfers for Folder Properties: 0 
Total Number Of Copy Transfers: 0
Number of Copy Transfers Completed: 0
Number of Copy Transfers Failed: 0
Number of Deletions at Destination: 0
Total Number of Bytes Transferred: 0
Total Number of Bytes Enumerated: 0
Final Job Status: Completed

Loading /pool001/moprescu/dataframes/gt-us_tmp2m-14d.h5


In [8]:
df.head()

| | lat | lon | start_date | tmp2m | tmp2m_sqd | tmp2m_std |
|---|---|---|---|---|---|---|
| 0 | 26.0 | 279.0 | 1979-01-01 | 18.932249 | 373.676582 | 3.90468 |
| 1 | 26.0 | 279.0 | 1979-01-02 | 18.53118 | 357.715187 | 3.782928 |
| 2 | 26.0 | 279.0 | 1979-01-03 | 18.178591 | 343.465285 | 3.606123 |
| 3 | 26.0 | 279.0 | 1979-01-04 | 18.764877 | 361.311215 | 3.031604 |
| 4 | 26.0 | 279.0 | 1979-01-05 | 19.305099 | 377.88983 | 2.281007 |
