## 
This notebook runs a duplicate detection algorithm on a dataframe with the following columns:
- 'archiveType'       (used for duplicate detection algorithm)
- 'climateInterpretation_variable'
- 'dataSetName'
- 'datasetId'
- 'geo_meanElev'      (used for duplicate detection algorithm)
- 'geo_meanLat'       (used for duplicate detection algorithm)
- 'geo_meanLon'       (used for duplicate detection algorithm)
- 'geo_siteName'      (used for duplicate detection algorithm)
- 'originalDataURL'
- 'originalDatabase'
- 'paleoData_notes'
- 'paleoData_proxy'   (used for duplicate detection algorithm)
- 'paleoData_units'
- 'paleoData_values'  (used for duplicate detection algorithm, test for correlation, RMSE, correlation of 1st difference, RMSE of 1st difference)
- 'year'              (used for duplicate detection algorithm)
- 'yearUnits'

The key function for duplicate detection is find_duplicates, defined in f_duplicate_search.py.
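
The four time-series tests named in the column list (correlation, RMSE, and both again on the first difference) can be illustrated with a minimal sketch. This is not the code of find_duplicates: the helper name overlap_metrics, its return keys, and the choice to compare raw rather than standardized values on exactly matching years are assumptions made for illustration only.

```python
import numpy as np

def overlap_metrics(year_a, values_a, year_b, values_b, n_points_thresh=10):
    """Compare two records on their shared years (hypothetical helper)."""
    year_a, year_b = np.asarray(year_a), np.asarray(year_b)
    values_a = np.asarray(values_a, dtype=float)
    values_b = np.asarray(values_b, dtype=float)

    # Restrict both records to the years they have in common.
    _, idx_a, idx_b = np.intersect1d(year_a, year_b, return_indices=True)
    if len(idx_a) < n_points_thresh:
        return None  # too little overlap for a meaningful comparison
    a, b = values_a[idx_a], values_b[idx_b]

    # First differences emphasize year-to-year changes over slow offsets.
    da, db = np.diff(a), np.diff(b)
    return {
        'n_overlap': len(a),
        'correlation': np.corrcoef(a, b)[0, 1],
        'rmse': np.sqrt(np.mean((a - b) ** 2)),
        'correlation_diff': np.corrcoef(da, db)[0, 1],
        'rmse_diff': np.sqrt(np.mean((da - db) ** 2)),
    }

# Synthetic example: the second record is a truncated copy of the first.
years = np.arange(1900, 1950)
rng = np.random.default_rng(0)
record = rng.normal(size=years.size)
metrics = overlap_metrics(years, record, years[5:], record[5:])
print(metrics)
```

A pair would then count as a candidate only if every metric passes its threshold. Missing values (NaN) are not handled in this sketch and would need to be masked first.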

The output is saved as CSV files in the directory dup_detection/DATABASENAME:
- pot_dup_correlations_DATABASENAME.csv          
- pot_dup_distances_km_DATABASENAME.csv          
- pot_dup_IDs_DATABASENAME.csv                   (saves the IDs of each pair)
- pot_dup_indices_DATABASENAME.csv               (saves the dataframe indices of each pair)

Summary figures of the potential duplicate pairs are saved in the same directory, with file names following the pattern:
duplicatenumber_ID1_ID2_index1_index2.jpg
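
As a quick check of the naming scheme described above, the expected CSV paths can be assembled and tested for existence. This is a sketch only: the helper expected_output_files is hypothetical, and the output directory is passed in explicitly because the layout on disk may differ from the description.

```python
import os

def expected_output_files(db_name, out_dir):
    """Build the four CSV paths listed above (hypothetical helper)."""
    stems = [
        'pot_dup_correlations',
        'pot_dup_distances_km',
        'pot_dup_IDs',
        'pot_dup_indices',
    ]
    return [os.path.join(out_dir, '%s_%s.csv' % (stem, db_name)) for stem in stems]

db_name = 'dod2k_dupfree'
# Directory as described in the text; adjust if the files live elsewhere.
out_dir = os.path.join('dup_detection', db_name)
paths = expected_output_files(db_name, out_dir)
for path in paths:
    status = 'found' if os.path.exists(path) else 'missing'
    print('%s (%s)' % (path, status))
```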

27/11/2024: Fixed a bug in find_duplicates (in f_duplicate_search) and relaxed site criteria.
27/9/2024 v0: Notebook written by Lucie J. Luecke 



In [1]:
%load_ext autoreload
%autoreload 2

# Set up environment

In [2]:
# import pickle
# import gzip
import os
import pandas as pd
# import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature 
from matplotlib.gridspec import GridSpec as GS
from copy import deepcopy as dc
import functions as f
import geopy
import datetime
import f_duplicate_search as dupdet

In [3]:
# choose working directory
wdir = '/home/jupyter-lluecke/compile_proxy_database_v2.1'
os.chdir(wdir)
print(wdir)

/home/jupyter-lluecke/compile_proxy_database_v2.1


## Load dataset

In [4]:
# read dataframe 

# db_name = 'dod2k'
db_name = 'dod2k_dupfree'
# db_name = 'ch2k'
# db_name = 'fe23'
# db_name = 'iso2k'
# db_name = 'pages2k'
# db_name = 'sisal'


# load dataframe
df = f.load_compact_dataframe_from_csv(db_name)
# databasedir    = '%s/%s_compact.pkl'%(db_name, db_name)
# df = pd.read_pickle(databasedir)

print(df.info())
df.name = db_name


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4517 entries, 0 to 4516
Data columns (total 18 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   archiveType                           4517 non-null   object 
 1   climateInterpretation_variable        4517 non-null   object 
 2   climateInterpretation_variableDetail  4517 non-null   object 
 3   dataSetName                           4517 non-null   object 
 4   datasetId                             4517 non-null   object 
 5   geo_meanElev                          4434 non-null   float32
 6   geo_meanLat                           4517 non-null   float32
 7   geo_meanLon                           4517 non-null   float32
 8   geo_siteName                          4517 non-null   object 
 9   originalDataURL                       4517 non-null   object 
 10  originalDatabase                      4517 non-null   object 
 11  paleoData_notes  

In [5]:
df.year

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
        ..
4512   NaN
4513   NaN
4514   NaN
4515   NaN
4516   NaN
Name: year, Length: 4517, dtype: float64

In [5]:
# for ii in df.index:
#     # if type(df.at[ii, 'paleoData_values'])==np.ma.core.MaskedArray: continue
#     dd=f.convert_to_nparray(df.at[ii, 'paleoData_values'])
#     # print
#     df.at[ii, 'paleoData_values']=dd.data[~dd.mask]
#     df.at[ii, 'year']=df.at[ii, 'year'][~dd.mask]

# Duplicate Detection

### Find duplicates

In [6]:
# Run the duplicate search; n_points_thresh is passed on to find_duplicates
out = dupdet.find_duplicates(df, n_points_thresh=10)
pot_dup_inds, pot_dup_IDs, distances_km, correlations = out

# To load previously saved duplicates from CSV instead, comment this cell out.


dod2k_dupfree
Start duplicate search:
checking parameters:
proxy archive                  :  must match     
proxy type                     :  must match     
distance (km)                  < 8               
elevation                      :  must match     
time overlap                   > 10              
correlation                    > 0.9             
RMSE                           < 0.1             
1st difference rmse            < 0.1             
correlation of 1st difference  > 0.9             
Start duplicate search
Progress: 0/4517
Progress: 10/4517
Progress: 20/4517
Progress: 30/4517
Progress: 40/4517


KeyboardInterrupt: 
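
The distance criterion in the parameter list above (distance < 8 km) can be illustrated with a great-circle calculation. The notebook imports geopy, which the search may rely on; the sketch below instead uses a plain NumPy haversine formula, and the function name, Earth radius, and example coordinates are all assumptions for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))

# One reference site compared against three made-up candidate sites.
site_lat, site_lon = 46.5, 8.0
other_lats = np.array([46.5, 46.55, 47.5])
other_lons = np.array([8.0, 8.0, 8.0])

dist = haversine_km(site_lat, site_lon, other_lats, other_lons)
close = dist < 8  # same threshold as printed in the parameter list
print(dist, close)
```

Because the function is vectorized, one site can be screened against all remaining records in a single call before the more expensive time-series comparisons are run.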

### Plot duplicate candidate pairs

In [None]:
dupdet.plot_duplicates(df, save_figures=True)

In [None]:

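# date = '24-11-22'
# Current UTC date as 'yy-mm-dd' (datetime.utcnow() is deprecated since Python 3.12)
date = datetime.datetime.now(datetime.timezone.utc).strftime('%y-%m-%d')
# Look for the final output file of the detection step
fn = f.find('pot_dup_meta_short_%s.csv'%df.name, 
     '%s/dup_detection'%df.name)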



In [None]:
if fn != []:
    print('----------------------------------------------------')
    print('Successfully finished the duplicate detection process!'.upper())
    print('----------------------------------------------------')
    print('Saved the detection output file in:')
    print()
    print('%s.'%', '.join(fn))
    print()
    print('You are now able to proceed to the next notebook: dup_decision.ipynb')
else:
    print('Final output file is missing.')
    print()
    print('Please re-run the notebook to complete the duplicate detection process.')