# Formatting climate indices to fit our code base

* Climate indices from different sources come in different formats
* This code base assumes data is in a certain format
* This notebook is an example of processing data to fit that template, using Sudden Stratospheric Warming (SSW) events as the example
    * This data comes from Hauchecorne, Alain. (2021). ERA5 dataset for the categorization of Stratospheric Final Warming (Version 1). Zenodo. https://doi.org/10.5281/zenodo.5744919
* Downloaded data needs to be inspected and processed manually before being added to the analysis

In [3]:
import h5py
import pandas as pd 
import datetime 
from datetime import date, timedelta 

* This data is in the he5 format; we want to produce a CSV from it

In [4]:
indir = '/example/data/dir/'
filename = f'{indir}Stratospheric_Final_Warming_data.he5'  # full path: directory + file name

* Let's open this and see what we are dealing with

In [5]:
with h5py.File(filename, "r") as f:
    print(f.keys())
    group_key = list(f.keys())[0]
    data = f[group_key]
    print(data.keys())
    U60N = data['U60N'][()]
    HFlux = data['HFlux'][()]
    Tpole = data['Tpole'][()]
    Z1_60N = data['Z1_60N'][()]

<KeysViewHDF5 ['HDF5_data']>
<KeysViewHDF5 ['HFlux', 'Tpole', 'U60N', 'Z1_60N']>


* We have 4 variables associated with this file
* U60N, the zonal (U) wind at 60N and 10 hPa, is the one we are interested in
* I'll add all the data into a dataframe first; this only works if every array has the same shape, so it doubles as a sanity check
    * We can also print the shapes

In [6]:
U60N.shape , HFlux.shape, Tpole.shape, Z1_60N.shape

((25933,), (25933,), (25933,), (25933,))

* One issue is that this data has no time stamps associated with it, so we will need to add them

In [7]:
sdate = date(1950,1,1)   # start date
edate = date(2021,1,1)   # end date 

* Start and end dates are as defined in the data repo; the end date is treated as exclusive, so one day is subtracted below
* We can now use pandas to create a datetime index

In [8]:
pd.date_range(sdate,edate-timedelta(days=1),freq='d')

DatetimeIndex(['1950-01-01', '1950-01-02', '1950-01-03', '1950-01-04',
               '1950-01-05', '1950-01-06', '1950-01-07', '1950-01-08',
               '1950-01-09', '1950-01-10',
               ...
               '2020-12-22', '2020-12-23', '2020-12-24', '2020-12-25',
               '2020-12-26', '2020-12-27', '2020-12-28', '2020-12-29',
               '2020-12-30', '2020-12-31'],
              dtype='datetime64[ns]', length=25933, freq='D')
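
* As a quick sanity check, the number of days spanned by the start and end dates should equal the array length of 25933 printed earlier. A minimal sketch using only the standard library (the name `n_days` is introduced here purely for illustration):

```python
from datetime import date

sdate = date(1950, 1, 1)   # start date (inclusive)
edate = date(2021, 1, 1)   # end date (treated as exclusive)

# Subtracting two dates gives a timedelta; .days counts the days from
# sdate up to, but not including, edate.
n_days = (edate - sdate).days
print(n_days)  # 25933, the same as the length of each array
```

* If the two numbers disagree, the assumed start or end date is probably wrong and the time stamps would be misaligned with the data.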

* We can now create our dataframe

In [9]:
df = pd.DataFrame({'U60N':U60N,
              'HFlux':HFlux,
              'Tpole':Tpole,
              'Z1_60N':Z1_60N,
                'Date':pd.date_range(sdate,edate-timedelta(days=1),freq='d')
             })

In [10]:
df.head()

,U60N,HFlux,Tpole,Z1_60N,Date
0,17.525972,5.271371,211.642372,326.567664,1950-01-01
1,19.777361,8.61802,211.31585,280.259097,1950-01-02
2,20.784653,15.365684,210.199111,195.846716,1950-01-03
3,22.356251,2.391595,208.343646,156.116303,1950-01-04
4,23.118334,-4.185827,206.620259,151.00689,1950-01-05


In [11]:
df.set_index('Date', inplace = True)

In [12]:
df.head()

Date,U60N,HFlux,Tpole,Z1_60N
1950-01-01,17.525972,5.271371,211.642372,326.567664
1950-01-02,19.777361,8.61802,211.31585,280.259097
1950-01-03,20.784653,15.365684,210.199111,195.846716
1950-01-04,22.356251,2.391595,208.343646,156.116303
1950-01-05,23.118334,-4.185827,206.620259,151.00689


* We have daily data but need monthly values, so we group the rows by calendar month with pandas groupby and take the mean

In [13]:
df_monthly_mean = df.groupby(pd.PeriodIndex(df.index, freq="M")).mean()

In [14]:
df_monthly_mean.head()

Date,U60N,HFlux,Tpole,Z1_60N
1950-01,30.861825,53.635743,205.543982,250.457566
1950-02,38.451986,99.817699,206.807619,681.824895
1950-03,5.489621,35.583926,226.713104,539.618329
1950-04,-5.098322,5.772435,227.533828,131.814979
1950-05,-4.832007,0.602856,230.276704,62.952333
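
* To see what the grouping does, here is a minimal sketch on synthetic data (the values and the names `toy` and `monthly` are made up for illustration, not taken from the SSW file). Each day is mapped to its monthly period, and all days sharing a period are averaged:

```python
import numpy as np
import pandas as pd

# Synthetic daily series: every January day is 1.0, every February day is 3.0
idx = pd.date_range("1950-01-01", "1950-02-28", freq="D")
values = np.where(idx.month == 1, 1.0, 3.0)
toy = pd.DataFrame({"U60N": values}, index=idx)

# Convert each daily time stamp to a monthly period and average within it
monthly = toy.groupby(pd.PeriodIndex(toy.index, freq="M")).mean()
print(monthly)
```

* The result has one row per month (1950-01 and 1950-02), and its index is a PeriodIndex, which is why `.month` and `.year` are available on it.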


* Now let's add the year and month as their own columns

In [17]:
df_monthly_mean['Month'] = df_monthly_mean.index.month
df_monthly_mean['Year'] = df_monthly_mean.index.year

In [18]:
df_monthly_mean.head()

Date,U60N,HFlux,Tpole,Z1_60N,Month,Year
1950-01,30.861825,53.635743,205.543982,250.457566,1,1950
1950-02,38.451986,99.817699,206.807619,681.824895,2,1950
1950-03,5.489621,35.583926,226.713104,539.618329,3,1950
1950-04,-5.098322,5.772435,227.533828,131.814979,4,1950
1950-05,-4.832007,0.602856,230.276704,62.952333,5,1950


* We require months to be written as strings, so let's create a dictionary for this

In [19]:
num_to_month = {1:'Jan',
               2:'Feb',
               3:'Mar',
               4:'Apr',
               5:'May',
               6:'Jun',
               7:'Jul',
               8:'Aug',
               9:'Sep',
               10:'Oct',
               11:'Nov',
               12:'Dec'}
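
* The same mapping could also be built from the standard library rather than typed out by hand. This is only a sketch of an alternative: the name `num_to_month_alt` is hypothetical, and it assumes an English (or default "C") locale, because `calendar.month_abbr` follows the active locale:

```python
import calendar

# calendar.month_abbr[0] is an empty string, so the months start at index 1.
# Assumption: the active locale yields English abbreviations ('Jan', 'Feb', ...).
num_to_month_alt = {i: calendar.month_abbr[i] for i in range(1, 13)}
print(num_to_month_alt)
```

* The hand-written dictionary has the advantage of not depending on the locale, which matters if the code base expects exactly these English labels.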

In [21]:
df_monthly_mean.replace({'Month':num_to_month}, inplace=True) # replace the month numbers with their abbreviated names

In [24]:
df_monthly_mean.head()

Date,U60N,HFlux,Tpole,Z1_60N,Month,Year
1950-01,30.861825,53.635743,205.543982,250.457566,Jan,1950
1950-02,38.451986,99.817699,206.807619,681.824895,Feb,1950
1950-03,5.489621,35.583926,226.713104,539.618329,Mar,1950
1950-04,-5.098322,5.772435,227.533828,131.814979,Apr,1950
1950-05,-4.832007,0.602856,230.276704,62.952333,May,1950


* Let's drop the columns we don't need

In [28]:
df_monthly_mean.drop(['HFlux','Tpole','Z1_60N'],axis=1, inplace=True)

In [29]:
df_monthly_mean.head()

Date,U60N,Month,Year
1950-01,30.861825,Jan,1950
1950-02,38.451986,Feb,1950
1950-03,5.489621,Mar,1950
1950-04,-5.098322,Apr,1950
1950-05,-4.832007,May,1950


* Use the pivot function to have months as the column headers

In [30]:
df_monthly_mean = df_monthly_mean.pivot(index = 'Year',
                                        columns = 'Month',
                                        values = 'U60N')

In [31]:
df_monthly_mean.head()

Year,Apr,Aug,Dec,Feb,Jan,Jul,Jun,Mar,May,Nov,Oct,Sep
1950,-5.098322,-3.760094,19.401472,38.451986,30.861825,-9.25963,-7.909236,5.489621,-4.832007,20.970926,19.850237,7.529322
1951,12.351736,-3.703555,24.348734,4.170357,18.500327,-9.97192,-9.860956,9.093483,-3.621378,21.984424,22.727825,6.583759
1952,9.820597,-2.602935,30.842829,7.52358,40.392686,-9.486326,-10.382162,4.571577,-5.173109,10.940232,17.639572,8.738449
1953,3.348074,-2.730733,20.512426,20.497101,9.261478,-9.817928,-12.752176,29.229572,-8.514693,34.714838,22.073531,11.370838
1954,-7.398718,-2.862303,20.544897,28.173348,16.570961,-10.976635,-12.592391,5.876707,-9.658215,22.913912,23.319659,9.805539


* The pivot sorts the month columns alphabetically, so let's reorder them into calendar order

In [33]:
df_monthly_mean = df_monthly_mean[['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']]

In [34]:
df_monthly_mean.head()

Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1950,30.861825,38.451986,5.489621,-5.098322,-4.832007,-7.909236,-9.25963,-3.760094,7.529322,19.850237,20.970926,19.401472
1951,18.500327,4.170357,9.093483,12.351736,-3.621378,-9.860956,-9.97192,-3.703555,6.583759,22.727825,21.984424,24.348734
1952,40.392686,7.52358,4.571577,9.820597,-5.173109,-10.382162,-9.486326,-2.602935,8.738449,17.639572,10.940232,30.842829
1953,9.261478,20.497101,29.229572,3.348074,-8.514693,-12.752176,-9.817928,-2.730733,11.370838,22.073531,34.714838,20.512426
1954,16.570961,28.173348,5.876707,-7.398718,-9.658215,-12.592391,-10.976635,-2.862303,9.805539,23.319659,22.913912,20.544897
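
* The pivot and reorder steps can be sketched on a small made-up table (the values and the names `long`, `wide_sorted`, `wide` and `month_order` are illustrative only). The pivot turns one row per (Year, Month) pair into one row per year, and its columns come out in alphabetical order until they are explicitly selected in calendar order:

```python
import pandas as pd

# Long format: one row per (Year, Month) pair, with made-up values
long = pd.DataFrame({
    "Year":  [1950, 1950, 1950, 1951, 1951, 1951],
    "Month": ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "U60N":  [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Wide format: one row per year, one column per month (sorted alphabetically)
wide_sorted = long.pivot(index="Year", columns="Month", values="U60N")
print(list(wide_sorted.columns))  # ['Feb', 'Jan', 'Mar']

# Select the columns in calendar order
month_order = ["Jan", "Feb", "Mar"]
wide = wide_sorted[month_order]
print(wide)
```

* Note that `pivot` raises an error if any (Year, Month) pair appears more than once, so it also acts as a check that the monthly means are unique per month.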


* Finally, save the monthly table to a CSV file (the call is commented out here)

In [35]:
#df_monthly_mean.to_csv('Zonal_wind_60N_10hPa.csv')
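
* Before adding a file to the analysis, it can be worth confirming that a table in this layout reads back as expected. A minimal sketch using an in-memory buffer instead of a real file, with made-up values (the names `table`, `buffer` and `restored` are illustrative):

```python
import io
import pandas as pd

# A tiny table in the target layout: years as the index, months as columns
table = pd.DataFrame(
    {"Jan": [30.5, 18.25], "Feb": [38.5, 4.75]},
    index=pd.Index([1950, 1951], name="Year"),
)

# Write to an in-memory text buffer rather than to disk
buffer = io.StringIO()
table.to_csv(buffer)

# Read it back, telling pandas that the 'Year' column is the index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="Year")
print(restored)
```

* With a real file, the buffer would simply be replaced by the file path in both the `to_csv` and `read_csv` calls.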