# Purpose of this script

This script builds a catalog of planets from the NASA Exoplanet Archive "Planetary Systems (Alpha release)" table, which can be downloaded from: https://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=PS.

After loading the table, users specify their parameters of interest (e.g. Rp, Mp, etc.) in the "interesting_cols" list. For each of these parameters, the script considers every value reported in the Planetary Systems table and keeps the one with the smallest error, taken as the upper and lower uncertainties added in quadrature. It also adds a column giving the bibliographic reference for that value. All other columns of the Planetary Systems table are kept.

Date: 15 May 2020

# User input

In [1]:
infile = "PS_2020.05.15_03.52.12.csv"
interesting_cols = ["pl_rade", "pl_masse", "pl_radj", "pl_massj", "st_rad", "st_mass", "pl_eqt"] 
radius_max = None  # Upper limit on the planetary radius to be returned; None: skip step 2
mass_only = False  # True: return only true mass measurements (step 5); False: return mass and Msini measurements (step 5 is skipped)

# Code

In [2]:
import numpy as np
import pandas as pd

In [3]:
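def get_min_value(col_name, df):
    """Return, for each planet, the row of `col_name` with the smallest combined error."""
    # The reference column shares the prefix of the parameter ("pl_refname" or "st_refname")
    ref_colname = "{}_refname".format(col_name.split("_")[0])
    err1_col, err2_col = "{}err1".format(col_name), "{}err2".format(col_name)
    cols = ["pl_name", col_name, err1_col, err2_col, ref_colname]
    df_aux = df[cols].copy()
    # Combine the upper and lower uncertainties in quadrature
    err_col = "{}_err".format(col_name)
    df_aux[err_col] = np.sqrt(df_aux[err1_col]**2 + df_aux[err2_col]**2)
    df_aux = df_aux.rename(columns={ref_colname: "{}_ref".format(col_name)})
    # dropna() discards planets for which no error is available
    idxs = df_aux.groupby("pl_name")[err_col].idxmin().dropna()
    return df_aux.loc[idxs].drop(err_col, axis=1)

def update_column_with_min_err(col_name, df_orig, df_fin):
    """Overwrite `col_name`, its errors and its reference in `df_fin` with the min-error values."""
    df_aux = get_min_value(col_name, df_orig).set_index("pl_name")
    *cols, ref_col = df_aux.columns  # value and error columns, then the reference column
    df_fin.loc[df_aux.index, cols] = df_aux[cols]
    df_fin[ref_col] = ""  # planets without a min-error value keep an empty reference
    df_fin.loc[df_aux.index, ref_col] = df_aux[ref_col]
    return df_fin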

# 1. Preparation

## 1a. Load the NASA Exoplanet Archive Catalog

Download the Planetary Systems (Alpha release) table from the NASA Exoplanet Archive (https://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=PS) and set "infile" to the path of the downloaded CSV file.

In [4]:
df_original = pd.read_csv(infile, comment="#")
len(np.unique(df_original['pl_name']))

4154
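
As a minimal sketch of what the loading step does, the example below reads a tiny, hypothetical stand-in for the archive export (the header text, planet names and values are invented for illustration). It assumes the export starts with "#" comment lines, as the `comment="#"` argument above implies. Large exports can mix types within a column, which may make pandas emit a `DtypeWarning`; passing `low_memory=False` is one common way to avoid that.

```python
import io
import pandas as pd

# Hypothetical miniature of an archive export: "#" header lines,
# then one row per (planet, reference) pair.
sample_csv = io.StringIO(
    "# Illustrative header line\n"
    "# COLUMN pl_name: Planet Name\n"
    "pl_name,default_flag,pl_rade\n"
    "Planet A b,1,1.10\n"
    "Planet A b,0,1.25\n"
    "Planet B c,1,2.40\n"
)

# comment="#" skips the header lines; low_memory=False lets pandas infer each
# column's dtype from the whole file instead of chunk by chunk.
df_sample = pd.read_csv(sample_csv, comment="#", low_memory=False)

n_rows = len(df_sample)
n_planets = df_sample["pl_name"].nunique()
print(n_rows, n_planets)
```

Here `nunique()` plays the same role as `len(np.unique(...))` in the cell above: the table has more rows than planets because each planet can appear once per reference.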

## 1b. [Optional] Select Parameters of Interest 

Choose your parameters of interest (e.g. "pl_rade" and "st_rad") by listing them in "interesting_cols" in the User input section. For each of these parameters, the script chooses the value with the smallest error and creates a reference column giving the paper it comes from. You may also skip this step if you prefer to keep the NASA default parameters.

Note: The names of the chosen parameters must match the column names listed here: https://exoplanetarchive.ipac.caltech.edu/docs/API_PS_columns.html
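
The selection rule can be illustrated on a toy table. This is only a sketch of the idea behind `get_min_value`, not a replacement for it; the planet names, values and reference labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements: two references for "Planet A b", one for "Planet B c".
toy = pd.DataFrame({
    "pl_name": ["Planet A b", "Planet A b", "Planet B c"],
    "pl_rade": [1.10, 1.25, 2.40],
    "pl_radeerr1": [0.30, 0.05, 0.20],
    "pl_radeerr2": [-0.30, -0.05, -0.20],
    "pl_refname": ["Ref 1", "Ref 2", "Ref 3"],
})

# Combine the upper and lower uncertainties in quadrature.
toy["pl_rade_err"] = np.sqrt(toy["pl_radeerr1"]**2 + toy["pl_radeerr2"]**2)

# Row label of the smallest combined error for each planet.
best_rows = toy.groupby("pl_name")["pl_rade_err"].idxmin()
best = toy.loc[best_rows].set_index("pl_name")
print(best[["pl_rade", "pl_refname"]])
```

For "Planet A b" the second row is kept because its combined error is smaller, even though it is not the first row listed for that planet.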



# 2. [Optional] Do you want a Planetary Radius Mask? 

Do you want to impose a maximum value on the planetary radius (e.g. for studies focused on a specific type of planet, such as sub-Neptunes or super-Earths)?
* If not, leave radius_max = None in the User input section; the cell in Section 2 then does nothing.
* If you want to impose a cut at XX, set radius_max = XX, in the units selected by unit_radius.

In [5]:
if radius_max is not None:
    unit_radius = 'pl_rade'  # Unit of the planetary radius: 'pl_rade' (Earth radii) or 'pl_radj' (Jupiter radii)
    prad_mask = radius_max   # Radius cut, in the units chosen above

    # "Min_error" approach: take each planet's radius to be the value with the smallest error, and cut on it
    min_radi_per_pl = get_min_value(unit_radius, df_original).set_index("pl_name")
    radi_mask = min_radi_per_pl[unit_radius] <= prad_mask
    planets_to_consider = min_radi_per_pl[radi_mask].index
    print(len(planets_to_consider))

    # Keep every row of the planets that pass the cut
    planet_mask = df_original.pl_name.isin(planets_to_consider)
    df_original = df_original[planet_mask]
    print(df_original.shape)
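
The radius cut can be sketched on toy data as follows. The planet names, radii and the value of `radius_cut` are hypothetical; the point is that the cut is evaluated once per planet on its best radius, and then every row of the surviving planets is kept.

```python
import pandas as pd

# Hypothetical per-planet best radii (Earth radii), indexed by planet name.
best_radii = pd.DataFrame(
    {"pl_rade": [1.25, 2.40, 11.0]},
    index=pd.Index(["Planet A b", "Planet B c", "Planet C b"], name="pl_name"),
)
# Hypothetical multi-row table: one row per (planet, reference) pair.
rows = pd.DataFrame({
    "pl_name": ["Planet A b", "Planet A b", "Planet B c", "Planet C b"],
    "pl_rade": [1.10, 1.25, 2.40, 11.0],
})

radius_cut = 4.0  # assumed example cut, in Earth radii

# Planets whose best radius passes the cut...
keep = best_radii.index[best_radii["pl_rade"] <= radius_cut]
# ...and every row of those planets, whichever reference it came from.
rows_kept = rows[rows["pl_name"].isin(keep)]
print(list(keep), len(rows_kept))
```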

# 3. Confirm that the Catalog Entries are Unique

In [6]:
print(np.unique(df_original.pl_name).size)
print(df_original.default_flag.sum())  # number of rows with default_flag = 1; should match the number of unique planets

4154
4154


To create the final table, with one row per planet and the best measurements, we follow these steps:
1. Drop duplicates on "pl_name" to get the proper shape of the final table, `df_final`
1. For every planet that has a row with default_flag = 1, use the parameter values from that row
1. For the parameters of interest, overwrite the defaults (coming either from dropping duplicates or from the default flag) with the available values that have the smallest error
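
The first two steps can be sketched on a toy table. The planet names and temperatures are hypothetical, and the third step (the min-error overwrite) is left out here.

```python
import pandas as pd

# Hypothetical table: "Planet A b" has two rows, and the second one is the default.
rows = pd.DataFrame({
    "pl_name": ["Planet A b", "Planet A b", "Planet B c"],
    "default_flag": [0, 1, 1],
    "pl_eqt": [900.0, 950.0, 400.0],
})

# Step 1: one row per planet (drop_duplicates keeps the first occurrence).
final = rows.drop_duplicates("pl_name").set_index("pl_name").drop("default_flag", axis=1)
first_pass_eqt = final.loc["Planet A b", "pl_eqt"]

# Step 2: overwrite with the default-flagged rows where they exist.
defaults = rows[rows["default_flag"] == 1].drop("default_flag", axis=1).set_index("pl_name")
final.loc[defaults.index] = defaults
print(first_pass_eqt, final.loc["Planet A b", "pl_eqt"])
```

After step 1 "Planet A b" carries the value of whichever row happened to come first; step 2 replaces it with the value from the default row.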

In [7]:
df_final = df_original.copy(deep=True)
df_final = df_final.drop_duplicates("pl_name").set_index("pl_name").drop("default_flag", axis=1)
df_final.shape

(4154, 294)

In [8]:
df_aux = df_original[df_original.default_flag == 1].drop("default_flag", axis=1).set_index("pl_name")
df_final.loc[df_aux.index] = df_aux
df_final["pl_masse"] = np.nan  # reset the default mass; it is refilled in steps 4 and 5
df_final.shape

(4154, 294)

# 4. [Optional]: Retrieve Parameters of Interest with Min_error Approach

In [9]:
for col in interesting_cols:
    df_final = update_column_with_min_err(col, df_original, df_final)
    
df_final.shape

(4154, 301)

# 5. [Optional]: Do you want a Mass Measurement, or M*sin(i)? 

Set mass_only = True in the User input section if you require actual mass measurements rather than M*sin(i) values; the cell in this section then keeps only planets with a measured mass.

In [10]:
if mass_only:
    # Select masses only from rows flagged as true masses; the radius cut, if any, has already been applied to df_original
    mass_mask = df_original.pl_bmassprov == "Mass"
    # Clear the mass values filled in step 4, which may be M*sin(i)
    df_final[["pl_masse", "pl_masseerr1", "pl_masseerr2"]] = np.nan
    df_final = update_column_with_min_err("pl_masse", df_original[mass_mask], df_final)

    # Drop the provenance column and the planets without a mass measurement
    df_final = df_final.drop("pl_bmassprov", axis=1).dropna(subset=["pl_masse"])
    print(df_final.shape)
    # Planets with a mass and both errors reported, then planets with any "Mass" row
    print(np.unique(df_original[mass_mask].dropna(subset=["pl_masse", "pl_masseerr1", "pl_masseerr2"]).pl_name).size)
    print(np.unique(df_original[mass_mask].pl_name).size)
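
The provenance filter can be sketched as below. The rows are hypothetical, and "Msini" is used only as an illustrative non-"Mass" label; check the values actually present in your "pl_bmassprov" column.

```python
import pandas as pd

# Hypothetical rows: only "Planet A b" has a row flagged as a true mass.
rows = pd.DataFrame({
    "pl_name": ["Planet A b", "Planet A b", "Planet B c"],
    "pl_masse": [5.0, 6.2, 30.0],
    "pl_bmassprov": ["Msini", "Mass", "Msini"],
})

# Keep only the rows whose mass is flagged as a true mass.
mass_rows = rows[rows["pl_bmassprov"] == "Mass"]
planets_with_mass = mass_rows["pl_name"].unique().tolist()

# Planets that would be dropped when only true masses are requested.
planets_without_mass = sorted(set(rows["pl_name"]) - set(planets_with_mass))
print(planets_with_mass, planets_without_mass)
```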

# 6. Create your final dataset

In [11]:
exo_archive = df_final.copy(deep=True) #final dataset
exo_archive.shape

(4154, 301)

In [12]:
exo_archive.to_csv('final_catalog_planets.csv')

In [13]:
np.unique(exo_archive.index).size

4154
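
As an optional check, the saved catalog can be read back with the planet name as index to confirm that each planet appears once. The sketch below uses an in-memory buffer and a hypothetical two-planet table instead of the real 'final_catalog_planets.csv'.

```python
import io
import pandas as pd

# Hypothetical final catalog, indexed by planet name like exo_archive.
catalog = pd.DataFrame(
    {"pl_rade": [1.25, 2.40]},
    index=pd.Index(["Planet A b", "Planet B c"], name="pl_name"),
)

# Write to an in-memory buffer (stand-in for the CSV file) and read it back.
buffer = io.StringIO()
catalog.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col="pl_name")

is_unique = reloaded.index.is_unique
print(reloaded.shape, is_unique)
```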