<a href="https://colab.research.google.com/github/kpd19/Insect_Outbreaks/blob/main/Download_ERA_5_CDS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Download ERA-5 data from the climate data store using the API

This document is stored on KP Dixon's drive, so your edits will not be saved. To make changes, first save your own copy to your Drive:

`File > Save a copy in Drive`

You must first have an ECMWF account to access data from the climate data store. After you get your account, you can get your personal access token, which you must paste in the first cell below to use the API.

https://cds.climate.copernicus.eu/

You must also accept the licence agreement:

https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels?tab=download#manage-licences

The first cell below writes your API credentials to a configuration file, installs the `cdsapi` package, and mounts your Google Drive so that you can save files there. Click yes/continue on the pop-ups that appear when the cell runs. The `%cd` command then changes your working directory to Colab Notebooks.

In [None]:
%%capture
### make a file in your home directory with the API location and your personal key
!echo "url: https://cds.climate.copernicus.eu/api" > $HOME/.cdsapirc

# insert the personal access token from your account where it says your_key_here
# (do not share a notebook that contains your real token)
!echo "key: your_key_here" >> $HOME/.cdsapirc

!pip install "cdsapi>=0.7.2" # quoted so the shell does not treat >= as an output redirect
import cdsapi
import os, shutil, numpy as np, xarray as xr
from datetime import datetime
from google.colab import drive; drive.mount('/content/drive/')

# change working directory
%cd /content/drive/My Drive/Colab Notebooks/
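
If you would rather not paste your token into a notebook that might be shared, the same configuration file can be written from Python. The following is a minimal sketch and not part of the original workflow: `write_cdsapirc` is a hypothetical helper, and it assumes you type the token in interactively with `getpass` so that it is never saved in the notebook.

```python
import os
from getpass import getpass

CDS_URL = "https://cds.climate.copernicus.eu/api"  # API location used in the cell above

def write_cdsapirc(key, path=None):
    """Write the two-line cdsapi config file (hypothetical helper, not part of cdsapi)."""
    if path is None:
        path = os.path.join(os.path.expanduser("~"), ".cdsapirc")
    with open(path, "w") as f:
        f.write(f"url: {CDS_URL}\n")
        f.write(f"key: {key.strip()}\n")
    return path

# Usage in a notebook cell (prompts for the token without echoing it):
# write_cdsapirc(getpass("CDS personal access token: "))
```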

This notebook includes several basic examples of downloading climate data from ERA5, which is a gridded reanalysis of historical weather data. The first example downloads hourly data and the second downloads monthly data. I've also included some simple functions that convert NetCDF files into dataframes, which are easier for most people to work with. You will have to adapt the functions based on the type of data you are downloading.

The documentation for ERA5 can be found here:

https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation

Before downloading the data, you must first decide what you want to download and at what spatial and time scale:

*   **Variable:** There is an abundance of weather variables that can be downloaded, which are detailed in the documentation. Variable examples: 2m_temperature, total_precipitation, relative_humidity
*   **Spatial scale:** By default, the data are downloaded on a grid of 0.25 degree x 0.25 degree cells
*   **Time scale:** The data can be downloaded as hourly data, which takes longer and produces larger files, or as monthly aggregated data.

Below is an example of downloading the temperature at 2m (2m_temperature) for each hour for all the days in June 2020 for the state of Colorado. The file will be saved as a netCDF, which is a common file format for data with multiple dimensions (space, time).

*   **Variable:** 2m_temperature
*   **Spatial scale:** "41/-109/37/-102"
*   **Time scale:** Hourly time scale
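
To get a feel for the size of this request, the number of grid cells can be estimated from the bounding box. This is a rough sketch that assumes the default 0.25 degree grid and that both edges of the box fall on grid points; `grid_points` is a hypothetical helper, and the counts in the downloaded file may differ slightly.

```python
def grid_points(north, west, south, east, resolution=0.25):
    """Estimate the number of latitude and longitude points in a bounding box."""
    n_lat = round((north - south) / resolution) + 1  # both edges included
    n_lon = round((east - west) / resolution) + 1
    return n_lat, n_lon

# Colorado bounding box from the request: top/left/bottom/right
n_lat, n_lon = grid_points(41, -109, 37, -102)
n_hours = 30 * 24  # June has 30 days of hourly values

print(n_lat, n_lon)             # 17 29
print(n_lat * n_lon * n_hours)  # 354960 hourly temperature values
```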

Sometimes, the files will take a long time to aggregate and save to your drive, and your Google Colab will crash. However, your request will still be running in the system, which can be checked here:

https://cds.climate.copernicus.eu/requests?tab=all

You can also download your request from that website, if you don't feel like waiting for it to run in Google Colab. This request should only take a minute or so to run, and the NetCDF file will be saved in your drive.


In [None]:
c = cdsapi.Client()

c.retrieve(
    'reanalysis-era5-single-levels',
    {
        'product_type': 'reanalysis',
        'variable': '2m_temperature',
        'year': '2020',
        'month': '06',
        'day': [f'{day:02d}' for day in range(1, 32)],     # '01' to '31'
        'time': [f'{hour:02d}:00' for hour in range(24)],  # '00:00' to '23:00'
        'format': 'netcdf',
        'area': "41/-109/37/-102"    # top/left/bottom/right
    },
    'colorado_june_2020_t2m.nc')
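
Because a Colab session can crash or restart while a request is running, it can help to skip the download when the file is already on your drive. The sketch below is one way to do that; `download_if_missing` is a hypothetical helper (not part of `cdsapi`), and it takes the download step as a function so that it can wrap any request.

```python
import os

def download_if_missing(target, download):
    """Call download(target) only if the target file does not exist yet."""
    if os.path.exists(target):
        print(f"{target} already exists, skipping download")
        return False
    download(target)
    return True

# Usage with the request above (assumes `c` and a `request` dictionary are defined):
# download_if_missing(
#     'colorado_june_2020_t2m.nc',
#     lambda target: c.retrieve('reanalysis-era5-single-levels', request, target))
```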

### Converting from NetCDF to an easier-to-use dataframe format and downsampling from hourly to daily min, max, and average

In [None]:
import pandas as pd
import glob
import matplotlib.pyplot as plt

In [None]:
def downsample_temps(dataset):
  """ Function to take hourly temperature data (K) from an xarray and resample to daily min, max, and average temperature (C) in a dataframe """
  dataset = dataset.assign(t2m = dataset['t2m'] - 273.15) # converting from Kelvin to Celsius on a new object, so the input dataset is not modified
  dataset.t2m.attrs['units'] = 'deg C' # updating the units
  dataset = dataset.rename({'valid_time': 'time'})
  max_daily = dataset.resample(time = 'D').max(dim = 'time') # resampling to get the max hourly temperature for each day
  min_daily = dataset.resample(time = 'D').min(dim = 'time') # min temperature
  mean_daily = dataset.resample(time = 'D').mean(dim = 'time') # average temperature

  max_daily = max_daily.rename({'t2m':'max_t2m'}) # renaming variable accordingly
  min_daily = min_daily.rename({'t2m':'min_t2m'})
  mean_daily = mean_daily.rename({'t2m':'mean_t2m'})

  merged_data = xr.merge([max_daily,min_daily,mean_daily]) # merging dataset

  merged_data['year'] = merged_data['time'].dt.strftime('%Y') # getting the year from the date information
  merged_data['month'] = merged_data['time'].dt.strftime('%B') # month
  merged_data['day'] = merged_data['time'].dt.strftime('%d') # day

  df = merged_data.to_dataframe() # converting xarray to dataframe
  df = df.reset_index() # resetting the index

  return df
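
The resampling step can be tried on a small synthetic series before running it on real data. This is a toy illustration that uses pandas rather than xarray, with made-up temperatures for a single grid cell; the column names are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# 48 hourly timestamps covering June 1 and June 2, 2020
hours = pd.Timestamp('2020-06-01') + pd.to_timedelta(np.arange(48), unit='h')

# made-up temperatures in Kelvin: day 1 rises from 283.15 K, day 2 from 273.15 K
hour_of_day = np.arange(48) % 24
base = np.where(np.arange(48) < 24, 283.15, 273.15)
toy = pd.DataFrame({'time': hours, 't2m': base + hour_of_day})

toy['t2m'] = toy['t2m'] - 273.15  # Kelvin to Celsius

# daily min, max, and mean of the hourly values
daily = toy.set_index('time')['t2m'].resample('D').agg(['min', 'max', 'mean'])
print(daily.round(2))
```

With these made-up values, the first day has a minimum of 10, a maximum of 33, and a mean of 21.5 deg C.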


Loading the NetCDF file we just downloaded from the Climate Data Store as an xarray dataset:

In [None]:
col_jun_xr = xr.open_dataset('colorado_june_2020_t2m.nc') # open the NetCDF file we just downloaded as an xarray dataset

Let's look at the format first, since the variable names sometimes differ from what we expect after updates to the Climate Data Store. The dataset used to say `time` and now it says `valid_time`. You'll have to update the downsampling function accordingly in future analyses. Also, if you change the variable from temperature at 2m to something else, you will have to change the variable name (`t2m`) in the function as well.

In [None]:
col_jun_xr
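
Since the name of the time coordinate has changed between versions of the Climate Data Store, one option is to look the name up rather than hardcode it. A minimal sketch, assuming the coordinate is called either `valid_time` or `time`; `find_time_name` is a hypothetical helper:

```python
def find_time_name(names, candidates=('valid_time', 'time')):
    """Return the first candidate time coordinate name found in `names`."""
    for name in candidates:
        if name in names:
            return name
    raise KeyError(f"none of {candidates} found in {list(names)}")

# Example with the coordinate names of a newer download
print(find_time_name(['valid_time', 'latitude', 'longitude']))  # valid_time

# With an xarray dataset, pass its coordinates:
# time_name = find_time_name(col_jun_xr.coords)
# if time_name != 'time':
#     col_jun_xr = col_jun_xr.rename({time_name: 'time'})
```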

Using our function to convert the data:

In [None]:
col_jun_df = downsample_temps(col_jun_xr)

Now we have a dataframe that we can save:

In [None]:
col_jun_df.to_csv("colorado_june_2020_t2m_df.csv")

Plotting the new dataframe along a latitudinal gradient for one longitude value, to get an idea of the structure:

In [None]:
col_jun_df.loc[col_jun_df['longitude'] == -102.75].plot(kind = 'scatter', x = 'time',y = 'min_t2m', c = 'latitude')

plt.show()

## Monthly Precipitation for Colorado

Below is an example of downloading the monthly total precipitation for 2010-2019 for the state of Colorado. The file will be saved as a netCDF.

*   **Variable:** total_precipitation
*   **Spatial scale:** "41/-109/37/-102"
*   **Time scale:** Monthly

In [None]:
c = cdsapi.Client()

c.retrieve(
    'reanalysis-era5-single-levels-monthly-means',
    {
        'product_type': 'monthly_averaged_reanalysis',
        'variable': 'total_precipitation',
        'year': [str(year) for year in range(2010, 2020)],    # '2010' to '2019'
        'month': [f'{month:02d}' for month in range(1, 13)],  # '01' to '12'
        'time': '00:00',
        'format': 'netcdf',
        'area': "41/-109/37/-102"    # top/left/bottom/right
    },
    'colorado_2010s_total_precipitation.nc')

The following function takes monthly precipitation data and converts it to a dataframe. The monthly dataset I downloaded gave the date in the format '20100101' for January 1st, 2010, so I used `datetime` to convert it to a readable date.

In [None]:
def monthly_precip(dataset):
  """ Function to take monthly precipitation data and convert to dataframe """
  df = dataset.to_dataframe()
  df = df.reset_index()

  # the 'date' column holds values like 20100101, so parse them into datetimes
  df['time'] = [datetime.strptime(str(int(date)), '%Y%m%d').strftime('%Y-%m-%d') for date in df['date']]
  df['time'] = pd.DatetimeIndex(df['time'].values)

  df['year'] = df['time'].dt.strftime('%Y') # getting the year from the date information
  df['month'] = df['time'].dt.strftime('%B') # month name
  df = df[['latitude','longitude','tp','time','year','month']] # keeping only the columns we need

  return df
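
The date conversion inside the function can be checked on a single value. A small example, assuming the dates arrive as numbers such as 20100101 (they may be stored as floats, hence the `int` call):

```python
from datetime import datetime

raw_date = 20100101.0  # example value in the format described above
parsed = datetime.strptime(str(int(raw_date)), '%Y%m%d')

print(parsed.strftime('%Y-%m-%d'))  # 2010-01-01
```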

In [None]:
col_precip_xr = xr.open_dataset('colorado_2010s_total_precipitation.nc', decode_times = False)

Let's again look at our dataset and update the function accordingly:

In [None]:
col_precip_xr

Convert to dataframe:

In [None]:
col_precip_df = monthly_precip(col_precip_xr)

In [None]:
col_precip_df

In [None]:
col_precip_df.to_csv("colorado_2010s_tp_monthly.csv")

This time we can generate a map of November total precipitation in 2019 across Colorado:

In [None]:
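# select November 2019 and reshape to a latitude x longitude grid
nov_2019 = col_precip_df.loc[col_precip_df['time'] == '2019-11-01', ['latitude','longitude','tp']]
dat = nov_2019.pivot(index = 'latitude', columns = 'longitude', values = 'tp')

plt.pcolor(dat)
plt.yticks(np.arange(0.5, len(dat.index), 1), dat.index) # label each row with its latitude
plt.xticks(np.arange(0.5, len(dat.columns), 1), dat.columns) # label each column with its longitude
plt.show()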
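
One point to check before interpreting the map: according to the ERA5 documentation, accumulated variables such as `total_precipitation` in the monthly averaged product are expressed per day (m of water per day), so a monthly total requires multiplying by the number of days in the month. The sketch below shows that conversion under this assumption; `monthly_total_mm` is a hypothetical helper, and you should confirm the units in the `tp` attributes of your own file.

```python
import calendar

def monthly_total_mm(tp_m_per_day, year, month):
    """Convert a monthly mean daily precipitation (m/day) to a monthly total (mm)."""
    days_in_month = calendar.monthrange(year, month)[1]
    return tp_m_per_day * 1000 * days_in_month  # m -> mm, then per day -> per month

# Example: a mean of 0.001 m/day in November 2019 (30 days)
print(round(monthly_total_mm(0.001, 2019, 11), 3))  # 30.0
```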
