# Notebook 2: Loading and Visualizing Proxy Data

## Introduction

Proxy records (e.g., ice cores, sediment cores, and speleothems) provide the main source of climate information used in paleoclimate data assimilation. Recent efforts have compiled proxy data into large machine-readable databases.

In this activity, we'll load and plot proxy data from the Temperature 12k database. These proxy records can be visualized at https://lipdverse.org/Temp12k/current_version/ and downloaded from the left side of that page. We'll use the pickle file for v1.0.2 of the database, which is already downloaded in the "da_workshop/data/" folder. The Temp12k proxy records are a main source of data for the Erb et al. 2022 Holocene Reconstruction. Before diving into data assimilation, a good first step is to better understand the data that you're working with.

*As you go through the notebook, read text cells, comments in the code, and the code itself.*

Feel free to use this code however you want, even after the workshop ends. All code below (other than commands that start with an exclamation point, which are shell commands) can be used normally in Python.

Note: When programming, there are usually a variety of ways to accomplish the same thing. The code in these notebooks is pretty basic, so feel free to find more powerful ways of accomplishing these tasks in the future.

## 1. Setting up Google Colab

Google Colab has a lot of Python libraries pre-installed, but we need to install two more: lipd (for loading LiPD files) and cartopy (for making maps). Run the cell below by pressing the arrow button in the upper left. This code may take a minute or two, but most of the code later on will be faster.

In [None]:
# Install lipd and cartopy
!pip install lipd
!pip install cartopy

To load data from Google Drive, you'll have to mount your Google Drive locally. Run the code below and, when prompted, allow it to connect with your Google account. (Since you made a shortcut to the da_workshop folder in Notebook 1, you'll be able to load that data without copying the data.)

In [None]:
# Mount Google Drive locally
from google.colab import drive
drive.mount('/content/drive')

## 2. Importing necessary libraries

Python libraries provide additional functionality. Here, we import some that will be used later.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import pickle
import lipd
plt.style.use('ggplot')  # This sets the plotting style for the figures we'll be making later.

## 3. Loading proxy data

Now, let's load the Temp12k database v1.0.2 (https://lipdverse.org/Temp12k/1_0_2/).  The commands below will do several things:
- Load the data
- Extract proxy record time series
- Select the proxy records which are in the Temp12k collection and have units of degrees Celsius.

In [None]:
# Load the Temp12k proxy database (the 'with' block closes the file automatically)
data_dir = '/content/drive/MyDrive/da_workshop/data/'
with open(data_dir+'Temp12k1_0_2.pkl','rb') as file_to_open:
    proxies_all = pickle.load(file_to_open)['D']

# Extract the time series and use only those which are in Temp12k and in units of degC
all_ts = lipd.extractTs(proxies_all)
proxy_ts = lipd.filterTs(all_ts,  'paleoData_inCompilation == Temp12k')
proxy_ts = lipd.filterTs(proxy_ts,'paleoData_units == degC')
n_proxies = len(proxy_ts)
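
To make the filtering step less of a black box, here is a minimal sketch of the same idea in plain Python. It is not how lipd.filterTs is implemented; the records and the helper name "filter_records" are made up for illustration.

```python
# Toy records standing in for entries of proxy_ts (all values are made up)
toy_records = [
    {'dataSetName': 'SiteA', 'paleoData_units': 'degC', 'paleoData_inCompilation': 'Temp12k'},
    {'dataSetName': 'SiteB', 'paleoData_units': 'mm',   'paleoData_inCompilation': 'Temp12k'},
    {'dataSetName': 'SiteC', 'paleoData_units': 'degC'},  # The compilation key is missing here
]

# A hypothetical helper: keep only the records where the key exists and matches the value
def filter_records(records, key, value):
    return [rec for rec in records if rec.get(key) == value]

# Apply two filters in sequence, as in the cell above
selected = filter_records(toy_records, 'paleoData_inCompilation', 'Temp12k')
selected = filter_records(selected, 'paleoData_units', 'degC')
print([rec['dataSetName'] for rec in selected])
```

Each filter returns a shorter list, so chaining two filters keeps only the records that pass both tests.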

## 4. Exploring the Proxy Metadata

The proxy database is now stored as a list called "proxy_ts", which is 1276 entries long. Each entry in the list is a dictionary. You can see for yourself:

In [None]:
print(type(proxy_ts),len(proxy_ts))
print(type(proxy_ts[0]))

Python is a 0-indexed language, so to print the first proxy record, we would use the command:

In [None]:
print(proxy_ts[0])

As you can see, there's a lot of data and metadata. Since "proxy_ts[0]" is a dictionary, we need to use the right key to get a specific piece of data or metadata from it. To see all of the keys, you could use the command:

> print(proxy_ts[0].keys())

Some of the more important keys are:

| Key | Explanation |
| --- | --- |
| paleoData_values | The proxy record data |
| age | The proxy record ages |
| dataSetName | Data set name |
| paleoData_TSid | The "TSid," which is a unique identifier for the proxy record |
| archiveType | Archive type |
| paleoData_proxyGeneral | General proxy type |
| paleoData_proxy | Specific proxy type |
| paleoData_variableName | Variable |
| geo_meanLat | Latitude (-90 to 90) |
| geo_meanLon | Longitude (-180 to 180) |
| geo_meanElev | Elevation (m) |
| paleoData_interpretation | Notes about the interpretation of the proxy record |
| originalDataUrl | The URL of the original proxy record |
| paleoData_units | Units of the data |
| ageUnits | Units of the ages |
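
Not every record is guaranteed to contain every key. As a small sketch of dictionary access, the toy record below (made-up values, not taken from the database) shows direct lookup with square brackets and the ".get()" method, which returns a fallback value instead of raising an error when a key is missing.

```python
# A toy record with only a few keys (values are made up for illustration)
toy_record = {'dataSetName': 'SiteA', 'archiveType': 'LakeSediment', 'geo_meanLat': 45.2}

# Direct lookup works when the key exists
print(toy_record['archiveType'])

# .get() returns the fallback when the key is missing, rather than raising a KeyError
elev = toy_record.get('geo_meanElev', 'not given')
print(elev)

# Loop over all keys in alphabetical order
for key in sorted(toy_record.keys()):
    print(key, ':', toy_record[key])
```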

To demonstrate how keys are used, the code below gets some data and metadata from the proxy record, then makes a simple figure.

In [None]:
# Get data
index_chosen = 0
proxy_data   = np.array(proxy_ts[index_chosen]['paleoData_values']).astype(float)
proxy_ages   = np.array(proxy_ts[index_chosen]['age']).astype(float)
dataset_name = proxy_ts[index_chosen]['dataSetName']

# Make a simple figure
plt.plot(proxy_ages,proxy_data)
plt.xlabel('Age ('+proxy_ts[index_chosen]['ageUnits']+')')
plt.ylabel('T ('+proxy_ts[index_chosen]['paleoData_units']+')')
plt.title(dataset_name)
plt.show()
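
Beyond plotting, a few summary numbers can help describe a record. The sketch below uses made-up ages and values (not a real record) and assumes missing data are stored as NaN; it counts the valid points, takes the mean, and estimates the typical spacing between samples.

```python
import numpy as np

# Made-up ages (yr BP) and temperatures (degC); np.nan marks a missing value
toy_ages = np.array([0., 500., 1000., 1500., 2000.])
toy_data = np.array([10.0, 10.5, np.nan, 11.5, 12.0])

# Keep only the points where both the age and the value are valid numbers
valid = np.isfinite(toy_ages) & np.isfinite(toy_data)
n_valid = int(np.sum(valid))

# Mean of the valid values (np.nanmean ignores NaNs)
mean_value = np.nanmean(toy_data)

# Median spacing between consecutive valid ages, a rough measure of resolution
median_spacing = np.median(np.diff(np.sort(toy_ages[valid])))

print('Valid points:', n_valid)
print('Mean value:', mean_value)
print('Median spacing (yr):', median_spacing)
```

To try this on a real record, the same lines could be applied to "proxy_ages" and "proxy_data" from the cell above.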

To get a better sense of what's in the database, let's make a function to summarize a chosen metadata key across all 1276 records.

Note: Defining functions is useful when you want to run the same code multiple times in different contexts. We'll create a variety of functions in this notebook.

In [None]:
# Print a sorted list of a selected variable
def print_sorted_list(key):
    #
    # Make a list of all values of the given key
    variable_all = []
    for i in range(n_proxies):
        try:             variable_all.append(proxy_ts[i][key])
        except KeyError: variable_all.append('not given')  # This record doesn't have the key
    #
    # Count the number of each name
    name_words,name_counts = np.unique(variable_all,return_counts=True)
    count_sort_ind = np.argsort(-name_counts)
    name_words_sorted  = name_words[count_sort_ind]
    name_counts_sorted = name_counts[count_sort_ind]
    #
    # Print the counts
    print('===',key,'===')
    for i in range(len(name_counts_sorted)):
        print('%25s %5s' % (name_words_sorted[i],name_counts_sorted[i]))
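
As noted earlier, there is usually more than one way to do things. A possible alternative to the counting logic above is Python's built-in collections.Counter, sketched here on a toy list of records (made-up values) rather than on proxy_ts.

```python
from collections import Counter

# Toy records (made-up values); the last one has no archiveType key
toy_records = [
    {'archiveType': 'LakeSediment'},
    {'archiveType': 'MarineSediment'},
    {'archiveType': 'LakeSediment'},
    {},
]

# Count each value, using 'not given' when the key is missing
counts = Counter(rec.get('archiveType', 'not given') for rec in toy_records)

# most_common() returns (value, count) pairs sorted from most to least frequent
for name, count in counts.most_common():
    print('%25s %5s' % (name, count))
```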

Run the code below to display the counts for the archive types across all of the records.

In [None]:
print_sorted_list('archiveType')

**Try this:** Edit the code cell above to summarize the 'paleoData_proxyGeneral' key instead. What kinds of proxy types are in the database?

## 5. Making figures

Okay, let's make some more figures. First, let's make a map of all proxy locations. To do this, let's get the lats and lons of all of our proxy records.

In [None]:
# Create empty arrays to store the lats and lons
lats_all = np.full(n_proxies,np.nan)
lons_all = np.full(n_proxies,np.nan)

# Loop through all proxy records, storing the lats and lons in the newly-created arrays
for i in range(n_proxies):
    lats_all[i] = proxy_ts[i]['geo_meanLat']
    lons_all[i] = proxy_ts[i]['geo_meanLon']

Now, let's create two functions:
- **proxy_map:** This will create a map of all proxy locations in a given region.
- **proxy_metadata:** This will print selected metadata of all proxies in a given region.

In [None]:
# A function to make a map of all proxy locations in a given region
def proxy_map(map_bounds):
    #
    # Count the number of proxy records in this region
    n_selected = len(np.where((lons_all >= map_bounds[0]) & (lons_all <= map_bounds[1]) & (lats_all >= map_bounds[2]) & (lats_all <= map_bounds[3]))[0])
    #
    # Plot the locations of all proxy records in the region
    plt.figure(figsize=(12,20))
    ax1 = plt.subplot2grid((1,1),(0,0),projection=ccrs.PlateCarree()); ax1.set_extent(map_bounds,crs=ccrs.PlateCarree())
    ax1.scatter(lons_all,lats_all,25,c='r',marker='o',alpha=1,transform=ccrs.PlateCarree())
    ax1.coastlines()
    ax1.gridlines(color='k',linestyle='--',draw_labels=True)
    ax1.set_title('Locations of '+str(n_selected)+' proxy records',fontsize=14,loc='center')
    plt.show()

# A function to print selected metadata of all proxies in a given region.
def proxy_metadata(region_bounds):
    #
    ind_selected = np.where((lons_all >= region_bounds[0]) & (lons_all <= region_bounds[1]) & (lats_all >= region_bounds[2]) & (lats_all <= region_bounds[3]))[0]
    print('Records found in the region',region_bounds,':',len(ind_selected))
    #
    # Print some metadata for the selected proxies
    print_fmt = '%5s %40s %16s %15s %22s %10s %10s %12s %-10s'
    print(print_fmt % ('Index','dataSetName','Archive','Proxy','Variable','Lat','Lon','Season','Original_URL'))
    print(print_fmt % ('=====','===========','=======','=====','========','===','===','======','============'))
    for i in ind_selected:
        print(print_fmt % (i, \
                           proxy_ts[i]['dataSetName'], \
                           proxy_ts[i]['archiveType'], \
                           proxy_ts[i]['paleoData_proxy'], \
                           proxy_ts[i]['paleoData_variableName'], \
                           proxy_ts[i]['geo_meanLat'], \
                           proxy_ts[i]['geo_meanLon'], \
                           proxy_ts[i]['paleoData_interpretation'][0]['seasonalityGeneral'], \
                           proxy_ts[i]['originalDataUrl']))

Both of the functions above use an input list with four values: [lon_min, lon_max, lat_min, lat_max]. Let's try it:

In [None]:
# Make a global map
proxy_map([-180,180,-90,90])
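
Both functions select records by combining four comparisons into a single True/False mask. The sketch below shows that logic on a handful of made-up coordinates, so you can see which indices are kept.

```python
import numpy as np

# Made-up coordinates for five records; the last one has a missing longitude
toy_lons = np.array([-120.0, 75.0, 100.0, 110.0, np.nan])
toy_lats = np.array([  40.0, 10.0,  25.0,  45.0,  20.0])
bounds = [60, 120, 0, 30]  # [lon_min, lon_max, lat_min, lat_max]

# Each comparison gives an array of True/False values; '&' keeps points that pass all four
in_region = ((toy_lons >= bounds[0]) & (toy_lons <= bounds[1]) &
             (toy_lats >= bounds[2]) & (toy_lats <= bounds[3]))

# np.where returns the indices where the mask is True
ind_selected = np.where(in_region)[0]
print(ind_selected)
```

Comparisons with NaN are always False, so records with missing coordinates are never selected. Also note that, with longitudes running from -180 to 180, a region that crosses the date line can't be described by a single set of bounds in this scheme.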

Now, let's use both functions to make a map and list the proxies in a particular region. In the code below, I've selected a part of southern Asia.

In [None]:
# Map and list the proxies in a given region
region_selected = [60,120,0,30]  # Give values in the format [lon_min, lon_max, lat_min, lat_max]
proxy_map(region_selected)
proxy_metadata(region_selected)

**Try this:** Edit the code above to explore the proxies in a region you're interested in.

Next, let's make a function to make a visual summary of a single proxy record (i.e., a proxy "dashboard").

In [None]:
# A function to plot a dashboard of a selected proxy record
def make_dashboard(index_chosen,save_figure=False):
    #
    # Get data
    proxy_ages = np.array(proxy_ts[index_chosen]['age']).astype(float)
    proxy_data = np.array(proxy_ts[index_chosen]['paleoData_values']).astype(float)
    #
    # Specify subplots
    plt.figure(figsize=(16.5,11))
    ax1 = plt.subplot2grid((2,2),(0,0),colspan=2)
    ax2 = plt.subplot2grid((2,2),(1,0),projection=ccrs.Robinson(central_longitude=0)); ax2.set_global()
    #
    # Print title
    tsid_str = proxy_ts[index_chosen]['paleoData_TSid']
    if len(tsid_str) > 40: tsid_str = tsid_str[0:40]+'...'
    plt.suptitle('Proxy data and metadata\n'+proxy_ts[index_chosen]['archiveType']+'  |  '+proxy_ts[index_chosen]['paleoData_proxy']+'  |  '+proxy_ts[index_chosen]['dataSetName']+'  |  '+tsid_str,y=0.98,fontweight='bold',fontsize=18)
    #
    # Plot the time series of the chosen proxy
    plot_dots, = ax1.plot(proxy_ages,proxy_data,'-o',color='tab:blue',linewidth=2,markersize=5)
    ax1.set_title('Proxy time series: '+proxy_ts[index_chosen]['paleoData_variableName'],fontsize=14)
    ax1.set_xlabel('Age ('+proxy_ts[index_chosen]['ageUnits']+')',fontsize=14)
    ax1.set_ylabel(proxy_ts[index_chosen]['paleoData_units'],fontsize=14)
    ax1.tick_params(labelsize=14)
    ax1.invert_xaxis()
    #
    # Make a map of the proxy location
    ax2.coastlines(zorder=2)
    ax2.add_feature(cfeature.LAKES,facecolor='none',edgecolor='k')
    ax2.gridlines(zorder=3,color='k',linewidth=1,linestyle=(0,(1,5)))
    ax2.scatter(proxy_ts[index_chosen]['geo_meanLon'],proxy_ts[index_chosen]['geo_meanLat'],300,marker='o',facecolor='r',edgecolor='k',linewidth=3,alpha=1,zorder=1,transform=ccrs.PlateCarree())
    lon_str = str('%1.2f' % proxy_ts[index_chosen]['geo_meanLon'])+r'$^\circ$E'
    lat_str = str('%1.2f' % proxy_ts[index_chosen]['geo_meanLat'])+r'$^\circ$N'
    ax2.set_title('Proxy location\n('+lat_str+', '+lon_str+')',fontsize=14)
    #
    # Save some metadata to a new variable, to make it easier to access in a loop
    proxy_ts[index_chosen]['interp_variable']       = proxy_ts[index_chosen]['paleoData_interpretation'][0]['variable']
    proxy_ts[index_chosen]['interp_season']         = proxy_ts[index_chosen]['paleoData_interpretation'][0]['seasonalityGeneral']
    proxy_ts[index_chosen]['interp_direction']      = proxy_ts[index_chosen]['paleoData_interpretation'][0]['direction']
    try:             proxy_ts[index_chosen]['interp_detail'] = proxy_ts[index_chosen]['paleoData_interpretation'][0]['variableDetail']
    except KeyError: proxy_ts[index_chosen]['interp_detail'] = ''
    #
    # Print metadata on each figure.
    metadata_selected = ['dataSetName','archiveType','paleoData_proxy','paleoData_variableName','geo_meanLat','geo_meanLon','geo_meanElev','interp_variable',\
                         'interp_detail','interp_season','interp_direction','pub1_title','pub1_author','pub1_year','pub1_doi','originalDataUrl']
    fntsize = 14; offsetscale = 0.065; initial_offset = -.22
    for offset, key in enumerate(metadata_selected):
        try:             metadata_str = str(proxy_ts[index_chosen][key])
        except KeyError: metadata_str = ''
        if len(metadata_str) > 40: metadata_str = metadata_str[0:40]+'...'
        plt.text(.47,initial_offset-offsetscale*offset,key+':',     transform=ax1.transAxes,fontsize=fntsize)
        plt.text(.65, initial_offset-offsetscale*offset,metadata_str,transform=ax1.transAxes,fontsize=fntsize)
    #
    plt.subplots_adjust(left=0.085,right=0.95,top=0.9,bottom=0.05)
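    if save_figure:
        import os  # Only needed when saving
        output_dir = '/content/drive/MyDrive/Colab Notebooks/figures/'
        os.makedirs(output_dir,exist_ok=True)  # Create the folder if it doesn't exist yet
        plt.savefig(output_dir+'proxy_i'+str(index_chosen).zfill(4)+'_'+proxy_ts[index_chosen]['dataSetName']+'.png',dpi=300,format='png',bbox_inches='tight')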
    #
    plt.show()

To use the function above, give it the index of a proxy. Remember that Python is a 0-indexed programming language. So, to plot the first proxy, we just need to give our function the value "0". The "save_figure" variable is set to False by default, so it will just display the figure, not save it.

In [None]:
make_dashboard(0)

As you can see, the make_dashboard function gives an overview of a chosen proxy record. Take a moment to look at it.

**Try this:** Edit the code below to focus on a region that you're interested in. Then get the index of a proxy you're interested in (from the displayed table) and use the make_dashboard function to take a closer look.

If you want to save an image of a proxy dashboard, right-click on it and select "Save Image As". Alternatively, set the "save_figure" variable to "True" when calling make_dashboard. The figure will be saved to the "Colab Notebooks/figures/" sub-folder on your Google Drive.

In [None]:
# Map and list the proxies in a given region
region_selected = [60,120,0,30]  # Give values in the format [lon_min, lon_max, lat_min, lat_max]
proxy_map(region_selected)
proxy_metadata(region_selected)

In [None]:
make_dashboard(365,save_figure=False)
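
If you save many dashboards, it helps to understand how the file names are built. The sketch below reproduces the naming pattern from make_dashboard and adds two optional safeguards: creating the output folder if needed, and replacing characters that can cause trouble in file names. The helper name "dashboard_filename", the example data set name, and the temporary folder are assumptions for illustration only.

```python
import os
import re
import tempfile

# A hypothetical helper that builds a file name like the one used in make_dashboard
def dashboard_filename(index_chosen, dataset_name):
    # Replace anything other than letters, digits, '.', '_', or '-' with '_'
    safe_name = re.sub(r'[^A-Za-z0-9._-]', '_', dataset_name)
    # zfill(4) pads the index with zeros so that files sort in numerical order
    return 'proxy_i' + str(index_chosen).zfill(4) + '_' + safe_name + '.png'

# Use a temporary folder here; on Colab you would point this at your Google Drive
output_dir = os.path.join(tempfile.gettempdir(), 'figures_demo')
os.makedirs(output_dir, exist_ok=True)  # Does nothing if the folder already exists

filename = dashboard_filename(365, 'Example Lake/Core 1')
print(os.path.join(output_dir, filename))
```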

## 6. Interactive figures

As a final figure, let's make an interactive figure of all of the proxies in a chosen region. Interactive figures may or may not be useful to you, but they do provide some extra functionality.

We'll use "bokeh" to make an interactive figure. First, load the new libraries:

In [None]:
from bokeh.plotting import figure,show
from bokeh.io import output_notebook
from bokeh.models import HoverTool
from bokeh.models import Range1d

Next, we'll create a function to plot all records within a selected region.

In [None]:
# A function to plot all proxy records within a chosen region
def plot_proxies_in_region(map_bounds):
    #
    # Find all proxies in a given region
    indicies_in_region = np.where((lons_all >= map_bounds[0]) & (lons_all <= map_bounds[1]) & (lats_all >= map_bounds[2]) & (lats_all <= map_bounds[3]))[0]
    print('Found '+str(len(indicies_in_region))+' proxies in region')
    #
    # Set up the interactive figure
    output_notebook()
    p1 = figure(width=1200,
                height=700,
                title='Proxy records in the selected region (\u00B0C)',
                tools='pan,box_zoom,hover,save,reset',
                active_drag='box_zoom',active_inspect='hover')
    #
    p1.xaxis.axis_label = 'Age (yr BP)'
    p1.yaxis.axis_label = 'Temperature (\u00B0C)'
    p1.x_range.start = 12000
    p1.x_range.end   = 0
    #
    colors = ['black','blue','brown','green','orange','pink','purple','red']*200
    for i,index_chosen in enumerate(indicies_in_region):
        # Get the data for this record and add it to the figure as a line
        proxy_ages = np.array(proxy_ts[index_chosen]['age']).astype(float)
        proxy_data = np.array(proxy_ts[index_chosen]['paleoData_values']).astype(float)
        proxy_datasetname = proxy_ts[index_chosen]['dataSetName']
        p1.line(proxy_ages,proxy_data,line_width=1,color=colors[i],legend_label=proxy_datasetname)
    #
    # Set some more figure properties
    p1.legend.location     = 'bottom_right'
    p1.legend.click_policy = 'hide'
    p1.background_fill_color           = '#e0e0e0'
    p1.grid.grid_line_color            = 'white'
    p1.axis.axis_label_text_font_style = 'normal'
    p1.axis.axis_label_text_font_size  = '16px'
    p1.title.text_font_size            = '16px'
    p1.title.align                     = 'center'
    #
    hover = p1.select_one(HoverTool)
    hover.tooltips = [
            ('Age','@x{int} yr BP'),
            ('Temp','@y \u00B0C'),
            ]
    #
    # Plot the figure
    show(p1)

Finally, run this function for your region of interest.

You can also call the "proxy_map" and "proxy_metadata" functions to get more information about the region.

In [None]:
region_selected = [60,120,0,30]  # Give values in the format [lon_min, lon_max, lat_min, lat_max]
plot_proxies_in_region(region_selected)
#proxy_map(region_selected)
#proxy_metadata(region_selected)

In the interactive graph above, you can do several things:
- Use the tools on the right to pan, zoom, save, or reset the figure
- Click on names in the legend to hide or show lines
- Mouse over lines to show specific values

Try it out!
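
A side note on the "colors" list in plot_proxies_in_region: it is repeated 200 times so that there are always enough colors for the lines. One possible alternative, sketched below with made-up record names, is itertools.cycle, which repeats a short list for as long as needed.

```python
from itertools import cycle

colors = ['black','blue','brown','green','orange','pink','purple','red']

# Ten made-up record names, which is more than the number of colors
toy_names = ['record_%d' % i for i in range(10)]

# zip stops when toy_names runs out, while cycle keeps supplying colors
assigned = dict(zip(toy_names, cycle(colors)))
for name, color in assigned.items():
    print(name, color)
```

After the eighth record, the colors start over from the beginning of the list.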

## 7. Keep Exploring!

Use the functions above to explore more proxy data in the Temperature 12k database. Focus on a region you're interested in. Save a figure of something cool.

These Temp12k v1.0.2 proxies are used in the Holocene Reconstruction, which we'll look at soon. A newer version of this database has also been released: v1.1.0. Lots of analyses are based on proxy data; data assimilation is just one of many!

For more tools to explore proxy data, check out:
- LiPD Utilities: https://nickmckay.github.io/LiPD-utilities/python/index.html
- Pyleoclim: https://pyleoclim-util.readthedocs.io/en/latest/

**If you'd like to try out Pyleoclim, check out the extra activity in the folder da_workshop/optional_activity_Pyleoclim.**

When you're ready to continue, open the next notebook, which focuses on loading and analysing paleoclimate reconstructions.