# Filter compact dataframe

This notebook reads a compact dataframe and filters it for specific records (e.g. moisture-sensitive records).
The filtered dataset is saved in a separate directory and can be loaded for further analysis or plotting.

Author: Lucie Luecke

Date produced: 21/01/2025

Input: a compact dataframe with the following keys:
  - ```archiveType```
  - ```dataSetName```
  - ```datasetId```
  - ```geo_meanElev```
  - ```geo_meanLat```
  - ```geo_meanLon```
  - ```geo_siteName```
  - ```interpretation_direction``` (new in v2.0)
  - ```interpretation_variable```
  - ```interpretation_variableDetail```
  - ```interpretation_seasonality``` (new in v2.0)
  - ```originalDataURL```
  - ```originalDatabase```
  - ```paleoData_notes```
  - ```paleoData_proxy```
  - ```paleoData_sensorSpecies```
  - ```paleoData_units```
  - ```paleoData_values```
  - ```paleoData_variableName```
  - ```year```
  - ```yearUnits```
  - (optional: `DuplicateDetails`)
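
A quick way to confirm that a loaded dataframe carries the keys listed above is to compare the list against `df.columns`. The sketch below is illustrative only: `EXPECTED_KEYS` and `missing_keys` are hypothetical names, not part of `dod2k_utilities`, and the toy dataframe stands in for a real compact dataframe.

```python
import pandas as pd

# Keys listed in the header of this notebook (the optional DuplicateDetails is left out).
EXPECTED_KEYS = [
    'archiveType', 'dataSetName', 'datasetId',
    'geo_meanElev', 'geo_meanLat', 'geo_meanLon', 'geo_siteName',
    'interpretation_direction', 'interpretation_variable',
    'interpretation_variableDetail', 'interpretation_seasonality',
    'originalDataURL', 'originalDatabase',
    'paleoData_notes', 'paleoData_proxy', 'paleoData_sensorSpecies',
    'paleoData_units', 'paleoData_values', 'paleoData_variableName',
    'year', 'yearUnits',
]

def missing_keys(df, expected=EXPECTED_KEYS):
    """Return the expected keys that are not columns of df (hypothetical helper)."""
    return [key for key in expected if key not in df.columns]

# Toy dataframe with only two of the expected columns.
toy = pd.DataFrame({'archiveType': ['tree'], 'datasetId': ['toy_001']})
print(missing_keys(toy))
```

An empty list means that all keys are present. Since some keys are marked as new in v2.0, a non-empty result may simply indicate an older database version rather than a corrupted file.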



## Set up working environment

Make sure that repo_root is set correctly; it should be your_root_dir/dod2k.
This should be the working directory throughout this notebook (and all other notebooks).

In [1]:
%load_ext autoreload
%autoreload 2

import sys
import os
from pathlib import Path

# Add parent directory to path (works from any notebook in notebooks/)
# the repo_root should be the parent directory of the notebooks folder
current_dir = Path().resolve()
# Determine repo root
if current_dir.name == 'dod2k':
    repo_root = current_dir
elif current_dir.parent.name == 'dod2k':
    repo_root = current_dir.parent
else:
    raise Exception('Please review the repo root structure (see first cell).')

# Update cwd and path only if needed
if os.getcwd() != str(repo_root):
    os.chdir(repo_root)
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))

print(f"Repo root: {repo_root}")
if str(os.getcwd())==str(repo_root):
    print(f"Working directory matches repo root. ")

Repo root: /home/jupyter-lluecke/compile_proxy_database_v2.2/dod2k
Working directory matches repo root. 


In [2]:
import pandas as pd
import numpy as np

from dod2k_utilities import ut_functions as utf # contains utility functions


## Read dataframe

Read compact dataframe for filtering.

{db_name} is the name of the database to load. Available options include:
  - database of databases:
    - dod2k_dupfree_dupfree (twice filtered for duplicates)
    - dod2k_dupfree_dupfree_MT (twice filtered for duplicates and filtered for MT sensitive proxies only)
    - dod2k_dupfree (once filtered for duplicates)
    - dod2k (NOT filtered for duplicates, only fusion of the input databases)
  - original databases:
    - fe23
    - ch2k
    - sisal
    - pages2k
    - iso2k

All compact dataframes are saved in {repo_root}/data/{db_name} as {db_name}_compact.csv.
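
To see which values of `db_name` are available locally, one option is to list the sub-directories of `{repo_root}/data`. This is a minimal sketch under the directory layout described above; `list_database_dirs` is a hypothetical helper, not a function of `dod2k_utilities`, and the demonstration uses a temporary directory instead of the real repository.

```python
import tempfile
from pathlib import Path

def list_database_dirs(repo_root):
    """Return the names of all sub-directories of <repo_root>/data, sorted.

    Hypothetical helper: each name is a candidate for db_name.
    """
    data_dir = Path(repo_root) / 'data'
    if not data_dir.is_dir():
        return []
    return sorted(p.name for p in data_dir.iterdir() if p.is_dir())

# Demonstration with a temporary directory that mimics the layout.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('dod2k', 'fe23', 'sisal'):
        (Path(tmp) / 'data' / name).mkdir(parents=True)
    found = list_database_dirs(tmp)

print(found)
```

In the notebook itself, `list_database_dirs(repo_root)` would be the corresponding call. The sketch only checks that a directory exists, not that it contains the compact csv files.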

In [3]:

db_name = 'dod2k_v1.2'

df = utf.load_compact_dataframe_from_csv(db_name)
print(df.originalDatabase.unique())
df.name = db_name
print(df.info())

['FE23 (Breitenmoser et al. (2014))' 'CoralHydro2k v1.0.0'
 'dod2k_composite_standardised' 'Iso2k v1.0.1'
 'PAGES2k v2.0.0 (Ocn_103 updated with Dee et al. 2020)' 'SISAL v3']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4516 entries, 0 to 4515
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   archiveType                           4516 non-null   object 
 1   climateInterpretation_variable        4516 non-null   object 
 2   climateInterpretation_variableDetail  4516 non-null   object 
 3   dataSetName                           4516 non-null   object 
 4   datasetId                             4516 non-null   object 
 5   duplicateDetails                      4516 non-null   object 
 6   geo_meanElev                          4433 non-null   float32
 7   geo_meanLat                           4516 non-null   float32
 8   geo_meanLon                           4516 

## Filter dataframe for specific record types

Here you can filter the dataframe for specific record types. The active example in the next cell filters for moisture-sensitive records via interpretation_variable (including records which are sensitive to both temperature and moisture).

The same approach works for any column and any value (e.g. for a specific archive type).

Further examples are included in the same cell, commented out for future use.

In [4]:
# if you want to filter for specific metadata, e.g. temperature or moisture records, run this:


# ---> interpretation_variable
# e.g.

# # filter for >>moisture<< sensitive records only (also include records which are moisture and temperature sensitive)
df_filter = df.loc[(df['interpretation_variable']=='moisture')|(df['interpretation_variable']=='temperature+moisture')]

# # filter for >>exclusively moisture<< sensitive records only (without t+m)
# df_filter = df.loc[(df['interpretation_variable']=='moisture')]

# # filter for >>temperature<< sensitive records only (also include records which are moisture and temperature sensitive)
# df_filter = df.loc[(df['interpretation_variable']=='temperature')|(df['interpretation_variable']=='temperature+moisture')]

# # filter for >>exclusively temperature<< sensitive records only (without t+m)
# df_filter = df.loc[(df['interpretation_variable']=='temperature')]

# ---> archiveType and paleoData_proxy
# e.g.

# # filter for specific proxy type, e.g. archiveType='speleothem' and paleoData_proxy='d18O'
# df_filter = df.loc[(df['archiveType']=='speleothem')&(df['paleoData_proxy']=='d18O')]


# ---> paleoData_proxy only
# e.g. 

# df_filter = df.loc[(df['paleoData_proxy']=='MXD')]

# etc.
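
The filters in the cell above all follow the same pattern: one or more columns are compared against allowed values and the conditions are combined. As a sketch, this can be wrapped in a small helper based on `Series.isin`. The function `filter_records` and the toy dataframe are hypothetical and only illustrate the pattern. Column names should be checked against `df.columns` first, since the `df.info()` output printed earlier lists `climateInterpretation_variable` rather than `interpretation_variable`.

```python
import pandas as pd

def filter_records(df, **criteria):
    """Keep rows where every given column matches one of its allowed values.

    Hypothetical helper: each keyword is a column name, each value is either
    a single allowed value or a list of allowed values.
    """
    mask = pd.Series(True, index=df.index)
    for column, allowed in criteria.items():
        if not pd.api.types.is_list_like(allowed):
            allowed = [allowed]  # wrap a single value so that isin accepts it
        mask &= df[column].isin(allowed)
    return df.loc[mask].copy()

# Toy data standing in for a compact dataframe.
toy = pd.DataFrame({
    'archiveType': ['tree', 'speleothem', 'coral', 'speleothem'],
    'paleoData_proxy': ['MXD', 'd18O', 'd18O', 'growth rate'],
    'interpretation_variable': ['temperature', 'moisture',
                                'temperature+moisture', 'moisture'],
})

# moisture-sensitive records, including temperature+moisture
moist = filter_records(toy, interpretation_variable=['moisture', 'temperature+moisture'])

# speleothem d18O records only
speleo_d18O = filter_records(toy, archiveType='speleothem', paleoData_proxy='d18O')

print(len(moist), list(speleo_d18O.index))
```

Returning a copy avoids pandas warnings if the filtered dataframe is modified later. The original row labels are kept, as with the `df.loc[...]` filters in the cell above.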

IMPORTANT: the database name needs to be adjusted according to the filtering.

Please add an identifier to the dataframe name, which is used for displaying and saving the data.

df.name determines where the filtered data is saved, so it must differ from the original db_name; otherwise the original data will be overwritten!

In [5]:
# df_filter is a new object, so the name assigned to df is not carried over and needs to be set again

# for the moisture-filtered example use the suffix _filtered_M (e.g. _filtered_MT for an M+T filter)
df_filter.name = db_name + "_filtered_M"
print(df_filter.name)

assert df_filter.name!=db_name

dod2k_v1.2_filtered_M
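
Beyond the assert above, it can help to check whether an output directory with the new name already exists before anything is written. The following is a sketch only: `make_filtered_name` is a hypothetical helper, and the `data_dir` argument is assumed to point at `{repo_root}/data` (a temporary directory is used here for demonstration).

```python
import tempfile
from pathlib import Path

def make_filtered_name(db_name, suffix, data_dir='data'):
    """Build '<db_name>_<suffix>' and report whether its output directory exists.

    Hypothetical helper, not part of dod2k_utilities.
    """
    if not suffix:
        raise ValueError('suffix must not be empty, otherwise db_name would be reused')
    new_name = f'{db_name}_{suffix}'
    already_exists = (Path(data_dir) / new_name).is_dir()
    return new_name, already_exists

# Demonstration in a temporary directory that stands in for {repo_root}/data.
with tempfile.TemporaryDirectory() as tmp:
    name, exists_before = make_filtered_name('dod2k_v1.2', 'filtered_M', data_dir=tmp)
    (Path(tmp) / name).mkdir()
    _, exists_after = make_filtered_name('dod2k_v1.2', 'filtered_M', data_dir=tmp)

print(name, exists_before, exists_after)
```

If the second return value is `True`, an earlier filtered dataset with the same name is already present and would be replaced by a new save.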


Display the filtered dataframe

In [6]:
print(df_filter.info())

<class 'pandas.core.frame.DataFrame'>
Index: 1597 entries, 2 to 4513
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   archiveType                           1597 non-null   object 
 1   climateInterpretation_variable        1597 non-null   object 
 2   climateInterpretation_variableDetail  1597 non-null   object 
 3   dataSetName                           1597 non-null   object 
 4   datasetId                             1597 non-null   object 
 5   duplicateDetails                      1597 non-null   object 
 6   geo_meanElev                          1565 non-null   float32
 7   geo_meanLat                           1597 non-null   float32
 8   geo_meanLon                           1597 non-null   float32
 9   geo_siteName                          1597 non-null   object 
 10  originalDataURL                       1597 non-null   object 
 11  originalDatabase      

## Save filtered dataframe

Saves the filtered dataframe in:

{repo_root}/data/{df_filter.name}

In [7]:
# create the output directory if it does not exist yet (relative to the working directory, i.e. repo_root)
path = os.path.join('data', df_filter.name)
os.makedirs(path, exist_ok=True)

In [8]:
# save as pickle
df_filter.to_pickle(f'data/{df_filter.name}/{df_filter.name}_compact.pkl')
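
A pickle round trip keeps the dataframe exactly as it is in memory, including dtypes such as `float32` and the original row labels that remain after filtering. The sketch below illustrates this with a toy dataframe and a temporary file; the column values and the file name `toy_compact.pkl` are made up for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy data standing in for a compact dataframe.
toy = pd.DataFrame({
    'datasetId': ['a', 'b', 'c', 'd'],
    'interpretation_variable': ['temperature', 'moisture', 'temperature', 'moisture'],
    'geo_meanLat': np.array([10.5, -3.0, 45.25, 60.0], dtype='float32'),
})

# Filtering keeps the original row labels (here 1 and 3).
toy_filter = toy.loc[toy['interpretation_variable'] == 'moisture']

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / 'toy_compact.pkl'
    toy_filter.to_pickle(pkl_path)
    reloaded = pd.read_pickle(pkl_path)

# Raises an AssertionError if values, dtypes or index differ.
pd.testing.assert_frame_equal(toy_filter, reloaded)
print(list(reloaded.index), reloaded['geo_meanLat'].dtype)
```

If consecutive row labels are preferred in the saved file, `reset_index(drop=True)` can be applied to the filtered dataframe before saving.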

In [9]:
# save csv
utf.write_compact_dataframe_to_csv(df_filter)

METADATA: archiveType, climateInterpretation_variable, climateInterpretation_variableDetail, dataSetName, datasetId, duplicateDetails, geo_meanElev, geo_meanLat, geo_meanLon, geo_siteName, originalDataURL, originalDatabase, paleoData_notes, paleoData_proxy, paleoData_sensorSpecies, paleoData_units, yearUnits
Saved to /home/jupyter-lluecke/compile_proxy_database_v2.2/dod2k/data/dod2k_v1.2_filtered_M/dod2k_v1.2_filtered_M_compact_%s.csv


In [10]:
# load dataframe
utf.load_compact_dataframe_from_csv(df_filter.name).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1597 entries, 0 to 1596
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   archiveType                           1597 non-null   object 
 1   climateInterpretation_variable        1597 non-null   object 
 2   climateInterpretation_variableDetail  1597 non-null   object 
 3   dataSetName                           1597 non-null   object 
 4   datasetId                             1597 non-null   object 
 5   duplicateDetails                      1597 non-null   object 
 6   geo_meanElev                          1565 non-null   float32
 7   geo_meanLat                           1597 non-null   float32
 8   geo_meanLon                           1597 non-null   float32
 9   geo_siteName                          1597 non-null   object 
 10  originalDataURL                       1597 non-null   object 
 11  originalDatabase 