# Open cluster cataloging
##### In this notebook we study the discrepancies between various open cluster catalogues. We use the open cluster catalogue by Hunt & Reffert (2023) as the reference catalogue, and we base our results on the assumption that it provides the most accurate and precise data for the open clusters in question.

First we create a data handler that extracts tables from the catalogues in question. We crossmatch each literature catalogue against the crossmatch table by Hunt & Reffert (2023), and then check which of those crossmatched clusters occur in the original literature data. Each literature cluster ends up in one of two states, and each state yields its own data tables:
- Matched = the literature cluster is confirmed by the Hunt catalogue
  - Out of $N$ literature clusters, $C$ are confirmed, which yields $C$ records in both the Hunt catalogue and the literature catalogue (I and II in the code)
- Not matched = the literature cluster is not confirmed
  - Out of $N$ literature clusters, $N-C$ are refuted, which yields $N-C$ records in the literature catalogue (III in the code)
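The two states can be illustrated with a minimal sketch on toy data. The tables `lit` and `ref` and their contents are hypothetical and only stand in for a literature catalogue and the list of Hunt synonyms; the split relies on the `indicator=True` option of `pd.merge`, which is also what the data handler uses.

```python
import pandas as pd

# Hypothetical toy tables (not real catalogue data).
lit = pd.DataFrame({'Cluster': ['A', 'B', 'C', 'D']})              # N = 4 literature clusters
ref = pd.DataFrame({'synonym': ['A', 'C', 'X'], 'ID': [1, 2, 3]})  # reference names

# A left merge keeps every literature cluster; '_merge' records where each row was found.
merged = pd.merge(lit, ref, left_on='Cluster', right_on='synonym',
                  how='left', indicator=True)

matched = merged[merged['_merge'] == 'both']           # C confirmed clusters
not_matched = merged[merged['_merge'] == 'left_only']  # N - C unconfirmed clusters

n_total, n_matched, n_not = len(lit), len(matched), len(not_matched)
print(f'N = {n_total}, C = {n_matched}, N - C = {n_not}')
```

In this toy case the counts add up to $N$ exactly because the names in `lit` are unique and each name matches at most one reference row.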



In [9]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

cantat = pd.read_csv('Data\\CantatGaudin\\cantatgaudinfile.csv')
hunt = pd.read_csv('Data\\Hunt\\huntfile.csv')
xmatch = pd.read_csv('Data\\Hunt\\xmatchfile.csv').dropna(subset=['Sep'])  # keep only rows with a separation value
khar = pd.read_csv('Data\\Kharchenko\\kharchenkofile.csv').query('Type != "g"')
dias = pd.read_csv('Data\\Dias\\diasfile.csv')
dias['Cluster'] = dias['Cluster'].str.replace(' ', '_').str.replace('-', '_')

In [15]:
def datahandler(df_lit, df=hunt, crossmatch=xmatch):

    if df_lit is cantat:
        sourcecat = 'Cantat-Gaudin+20'
        NameCol = 'Cluster'
    elif df_lit is khar:
        sourcecat = 'Kharchenko+13'
        NameCol = 'Name'
    elif df_lit is dias:
        sourcecat = 'Dias+02'
        NameCol = 'Cluster'
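        # df_lit['Cluster'] = df_lit['Cluster'].str.replace(' ', '_').str.replace('-', '_')
        # df_lit.drop_duplicates(subset=['Cluster'])
    else:
        # Without this branch an unknown table would fail later with a NameError
        raise ValueError('df_lit must be one of cantat, khar or dias')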

    df = df.query('Type == "o"') #Only open clusters
    crossmatch = crossmatch.query('SourceCat == @sourcecat').drop_duplicates('ID')

    xm = pd.merge(crossmatch, df, on='ID', how='inner') #Crossmatched clusters
    # Split AllNames into one row per synonym; every column gets the suffix '_h' (Hunt)
    allnames = xm.assign(synonym = xm['AllNames'].str.split(',')).explode('synonym').add_suffix('_h')
    
    # if df_lit is dias:
    #     allnames['synonym_h'] = allnames['synonym_h'].str.replace('-', '_').str.replace(' ', '_')
    
    df_matched = pd.merge(df_lit, allnames, left_on=NameCol, right_on='synonym_h', how='outer', indicator=True).drop_duplicates(NameCol) #Crossmatched clusters (matched with literature)
    
    matched = df_matched.query('_merge == "both"') #Matched clusters
    not_matched = df_matched.query('_merge == "left_only"')
    
    hunt_matched = matched.filter(regex='_h$').drop(columns=['synonym_h'])
    lit_matched = matched[df_lit.columns]
    lit_not_matched = not_matched[df_lit.columns]
    
    return hunt_matched, lit_matched, lit_not_matched
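The synonym step inside `datahandler` is the least obvious part, so here is a minimal sketch of it in isolation. The two rows below are illustrative values, not taken from the catalogue files; the only assumption carried over from the code above is that `AllNames` holds comma-separated names.

```python
import pandas as pd

# Illustrative rows only; 'AllNames' is assumed to be a comma-separated string.
xm_toy = pd.DataFrame({
    'ID': [1, 2],
    'AllNames': ['Melotte_22,Pleiades', 'NGC_2632'],
})

# Split each AllNames string into a list, then give every synonym its own row.
synonyms = (
    xm_toy.assign(synonym=xm_toy['AllNames'].str.split(','))
          .explode('synonym')
          .add_suffix('_h')   # mark all columns as coming from the Hunt side
)
print(synonyms)
```

Each Hunt cluster now appears once per known name, so a literature cluster can match on any of the synonyms rather than on the primary name only.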


In [16]:
cantat_matched, cantat_lit, cantat_not_matched = datahandler(cantat)
khar_matched, khar_lit, khar_not_matched = datahandler(khar)
dias_matched, dias_lit, dias_not_matched = datahandler(dias)

print(f'Cantat matched: {cantat_matched.shape[0]}, Cantat not matched: {cantat_not_matched.shape[0]}, Cantat total: {cantat.shape[0]}')
print(f'Kharchenko matched: {khar_matched.shape[0]}, Kharchenko not matched: {khar_not_matched.shape[0]}, Kharchenko total: {khar.shape[0]}')
print(f'Dias matched: {dias_matched.shape[0]}, Dias not matched: {dias_not_matched.shape[0]}, Dias total: {dias.shape[0]}')



Cantat matched: 1427, Cantat not matched: 54, Cantat total: 1481
Kharchenko matched: 1391, Kharchenko not matched: 1468, Kharchenko total: 2859
Dias matched: 1167, Dias not matched: 999, Dias total: 2167
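In the output above, the Dias counts sum to 1167 + 999 = 2166, one fewer than the total of 2167, while the Cantat-Gaudin and Kharchenko counts add up exactly. One possible explanation, which is an assumption and not verified here, is a duplicated cluster name that `drop_duplicates` removes inside `datahandler`. The sketch below shows the effect on toy data; `count_gap` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def count_gap(df_lit, lit_matched, lit_not_matched, name_col):
    """Return (rows unaccounted for, number of duplicated names in df_lit)."""
    gap = len(df_lit) - (len(lit_matched) + len(lit_not_matched))
    n_dupes = int(df_lit[name_col].duplicated().sum())
    return gap, n_dupes

# Toy literature table with one duplicated name ('B').
toy = pd.DataFrame({'Cluster': ['A', 'B', 'B', 'C']})
deduped = toy.drop_duplicates('Cluster')

# Pretend 'A' and 'B' were confirmed and 'C' was not.
toy_matched = deduped[deduped['Cluster'].isin(['A', 'B'])]
toy_not_matched = deduped[~deduped['Cluster'].isin(['A', 'B'])]

gap, n_dupes = count_gap(toy, toy_matched, toy_not_matched, 'Cluster')
print(f'gap = {gap}, duplicated names = {n_dupes}')
```

Calling `count_gap(dias, dias_lit, dias_not_matched, 'Cluster')` on the real tables would show whether the missing row is explained by a duplicated name.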


## Analyzing
Now that the relevant clusters have been selected from the literature and the Hunt catalogue, several techniques will be employed to investigate which properties make a cluster belong to the "real" data or, conversely, what makes a cluster appear in the literature but not in the Gaia DR3 data.

We will pursue several avenues to characterise these discrepancies:
- Correlation matrices
- Minimum Spanning Trees / Hierarchical Clustering
- Binary classification
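As a rough idea of what the first avenue could look like, the sketch below computes a correlation matrix between a binary matched flag and some cluster properties. Everything in it is an assumption for illustration: the data are synthetic, and the column names `n_stars` and `distance_pc` are hypothetical and do not refer to columns of the actual catalogues.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for a literature table; no real catalogue values are used.
toy = pd.DataFrame({
    'n_stars': rng.integers(10, 500, size=n),
    'distance_pc': rng.uniform(100, 5000, size=n),
})

# Artificial rule so the flag depends on one property: rich clusters count as "matched".
toy['matched'] = (toy['n_stars'] > 100).astype(int)

# Pearson correlation between every pair of columns, including the matched flag.
corr = toy.corr()
print(corr.round(2))
```

With real data, the `matched` flag would come from stacking the matched and not matched literature tables, and the `matched` row of the matrix would then indicate which properties are associated with confirmation.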