# Quickstart Guide

In [1]:
import numpy as np
import pandas as pd
import sherlock

## Read a local csv file
The function `get_exoarchive` can either download a fresh copy of the database or read a local file that was downloaded manually. For convenience, `sherlock` ships with a csv file downloaded from the NASA Exoplanet Archive, which is loaded when no filename is passed to `get_exoarchive`:

In [2]:
complete_catalog = sherlock.get_exoarchive(local=True)
complete_catalog.head()



Unnamed: 0,rowid,pl_name,hostname,pl_letter,hd_name,hip_name,tic_id,default_flag,sy_snum,sy_pnum,...,sy_kepmagerr1,sy_kepmagerr2,rowupdate,pl_pubdate,releasedate,pl_nnotes,st_nphot,st_nrvc,st_ntranspec,st_nspec
0,1,11 Com b,11 Com,b,,HIP 60202 b,TIC 72437047 b,1,2,1,...,,,2014-05-14,2008-01,2014-05-14,2.0,0.0,2.0,0.0,0.0
1,2,11 Com b,11 Com,b,,HIP 60202 b,TIC 72437047 b,0,2,1,...,,,2014-07-23,2011-08,2014-07-23,2.0,0.0,2.0,0.0,0.0
2,3,11 UMi b,11 UMi,b,,HIP 74793 b,TIC 230061010 b,1,1,1,...,,,2018-09-04,2017-03,2018-09-04,0.0,0.0,1.0,0.0,0.0
3,4,11 UMi b,11 UMi,b,,HIP 74793 b,TIC 230061010 b,0,1,1,...,,,2018-04-25,2009-10,2018-04-25,0.0,0.0,1.0,0.0,0.0
4,5,11 UMi b,11 UMi,b,,HIP 74793 b,TIC 230061010 b,0,1,1,...,,,2018-04-25,2011-08,2018-04-25,0.0,0.0,1.0,0.0,0.0


## Selecting a row per planet
The alpha release of the archive includes a `default_flag` column, which helps select a single row per planet. However, the default row may not always be the best choice. We may prefer to select values based on the lowest reported error, and we can also combine different references for a single object, for example taking the planetary radius from one reference and the planetary mass from another.

### `get_exoarchive`
We can get the catalog restricted to `default_flag == 1` with the same `get_exoarchive` function:

In [3]:
catalog = sherlock.get_exoarchive(local=True, default_pars=True)
catalog.head()



Unnamed: 0,rowid,pl_name,hostname,pl_letter,hd_name,hip_name,tic_id,default_flag,sy_snum,sy_pnum,...,sy_kepmagerr1,sy_kepmagerr2,rowupdate,pl_pubdate,releasedate,pl_nnotes,st_nphot,st_nrvc,st_ntranspec,st_nspec
0,1,11 Com b,11 Com,b,,HIP 60202 b,TIC 72437047 b,1,2,1,...,,,2014-05-14,2008-01,2014-05-14,2.0,0.0,2.0,0.0,0.0
2,3,11 UMi b,11 UMi,b,,HIP 74793 b,TIC 230061010 b,1,1,1,...,,,2018-09-04,2017-03,2018-09-04,0.0,0.0,1.0,0.0,0.0
5,6,14 And b,14 And,b,,HIP 116076 b,TIC 333225860 b,1,1,1,...,,,2014-05-14,2008-12,2014-05-14,0.0,0.0,1.0,0.0,0.0
12,13,14 Her b,14 Her,b,,HIP 79248 b,TIC 219483057 b,1,1,1,...,,,2018-09-04,2017-03,2018-09-04,0.0,0.0,4.0,0.0,1.0
14,15,16 Cyg B b,16 Cyg B,b,,HIP 96901 b,TIC 27533327 b,1,3,1,...,,,2018-09-04,2017-03,2018-09-04,5.0,0.0,4.0,0.0,3.0


Unlike the first `complete_catalog` DataFrame, `catalog` does not have multiple rows for a single planet.
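As a rough sketch of what this filtering amounts to, the toy DataFrame below copies a few values from the table above and keeps only the rows with `default_flag == 1` using plain pandas. Whether `default_pars=True` is implemented exactly this way inside `sherlock` is an assumption; the snippet only illustrates the effect.

```python
import pandas as pd

# Toy subset of the catalog, with values copied from the output shown above
toy_catalog = pd.DataFrame(
    {
        "pl_name": ["11 Com b", "11 Com b", "11 UMi b", "11 UMi b", "11 UMi b"],
        "default_flag": [1, 0, 1, 0, 0],
        "pl_pubdate": ["2008-01", "2011-08", "2017-03", "2009-10", "2011-08"],
    }
)

# Keep only the row flagged as the default parameter set for each planet
default_rows = toy_catalog.query("default_flag == 1")

# The unfiltered table repeats planets, the filtered one does not
print(toy_catalog["pl_name"].is_unique)
print(default_rows["pl_name"].is_unique)
```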

### `get_from_exoarchive`
To control in more detail which data end up in the DataFrame, we can use `get_from_exoarchive`. It selects the values with the lowest reported error for a user-defined subset of columns and takes the values from the default parameter set for all other columns. It also integrates querying into the workflow.
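The selection logic described above can be sketched with plain pandas on a toy table. This is only an illustration under stated assumptions: the single error column `pl_rade_err` and the planet names are hypothetical, and how `sherlock` actually computes the "lowest reported error" is not shown in this guide.

```python
import pandas as pd

# Toy table with two references per planet (all values are made up)
rows = pd.DataFrame(
    {
        "pl_name": ["p1", "p1", "p2", "p2"],
        "default_flag": [1, 0, 0, 1],
        "pl_rade": [1.9, 2.0, 11.0, 10.5],
        "pl_rade_err": [0.3, 0.1, 0.4, 0.9],  # hypothetical error column
        "pl_orbper": [0.74, 0.73, 3.5, 3.6],
    }
)

# Start from the default parameter set, one row per planet
defaults = rows.query("default_flag == 1").set_index("pl_name")

# For the chosen column, find the row with the lowest error for each planet
best = rows.loc[rows.groupby("pl_name")["pl_rade_err"].idxmin()].set_index("pl_name")

# Overwrite only the chosen column; every other column keeps its default value
result = defaults.assign(pl_rade=best["pl_rade"])
print(result[["pl_rade", "pl_orbper"]])
```

In this toy case the radius comes from the lowest-error row while `pl_orbper` still comes from the default row, which mirrors the mixing of references described earlier.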

A simple example is to request the measurements with the lowest reported error for `pl_rade` and `st_rad`:

In [4]:
catalog1 = sherlock.get_from_exoarchive(
    local = True, 
    col_names = ["pl_rade", "st_rad"],
)



We can now check that for some objects the value is the same as the one in the row with `default_flag == 1`, but that is not always the case:

In [5]:
pct = np.isclose(catalog.pl_rade, catalog1.pl_rade).mean() * 100
print(f"The value in the default parameter set is the one with\n\
lowest reported error for {pct:.1f} % of the planets")

The value in the default parameter set is the one with
lowest reported error for 20.0 % of the planets


`sherlock` automatically stores the references used for all columns passed as `col_names` in a new column: `<col_name>_ref`:

In [6]:
catalog1.loc[["55 Cnc e"],["pl_rade_ref", "pl_refname"]]

pl_name,pl_rade_ref,pl_refname
55 Cnc e,<a refstr=DEMORY_ET_AL__2011 href=https://ui.a...,<a refstr=DEMORY_ET_AL__2016 href=https://ui.a...


## Querying
Querying capabilities are possible thanks to [`pandas.DataFrame.query`](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-query). Here we assume you are familiar with this method.

`get_from_exoarchive` supports 2 types of queries: queries enforced **before** selecting the values with minimum error and queries enforced **after** selecting the values with minimum error. Let's dive in with one example:

In [7]:
df = pd.DataFrame(
    [
        ["a", 2,   .3, "I"],
        ["a", 2.3, .6, "I"],
        ["a", 1.9, .2, "II"],
        ["b", 7,   .1, "I"],
        ["b", 8,   .5, "II"],
    ],
    columns=("name", "measure", "error", "type")
)
df

Unnamed: 0,name,measure,error,type
0,a,2.0,0.3,I
1,a,2.3,0.6,I
2,a,1.9,0.2,II
3,b,7.0,0.1,I
4,b,8.0,0.5,II


If we select the measures with the lowest error for every object, we end up with the 3rd and 4th rows. Then, selecting only objects whose measure is of type `I` removes `a` from the dataset, even though `a` has two measures of type `I` available.

However, if we first select only the measures of type `I` and then select the measure with the lowest error, `a` is still present in the final dataset.

`get_from_exoarchive` has `pre_queries` and `post_query` to handle both cases.
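The difference between the two orders can be reproduced on the toy DataFrame above with plain pandas. This is a minimal sketch of the idea, not of the `sherlock` internals; the helper `lowest_error` is a hypothetical name introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["a", 2, 0.3, "I"],
        ["a", 2.3, 0.6, "I"],
        ["a", 1.9, 0.2, "II"],
        ["b", 7, 0.1, "I"],
        ["b", 8, 0.5, "II"],
    ],
    columns=("name", "measure", "error", "type"),
)


def lowest_error(frame):
    """Keep, for every name, the row with the smallest error (hypothetical helper)."""
    return frame.loc[frame.groupby("name")["error"].idxmin()]


# Query AFTER selecting the lowest error: `a` is lost
post = lowest_error(df).query("type == 'I'")

# Query BEFORE selecting the lowest error: `a` is kept
pre = lowest_error(df.query("type == 'I'"))

print(post["name"].tolist())  # ['b']
print(pre["name"].tolist())   # ['a', 'b']
```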

### Using `pre_queries`
`pre_queries` is designed to be applied on a per-column basis. Thus, restricting the mass measurement to a specific type of measure is done independently of restricting the radius measurement. One paper may report the radius with great accuracy while reporting a type of mass measure we are not interested in; discarding the radius measurement because the mass measurement is not of the desired type would be unwarranted.

For this reason, `pre_queries` takes a dictionary whose keys are the columns to which a query is applied and whose values are the queries themselves (passed to `pandas.DataFrame.query`). Here is one example that selects only the planets for which a true mass measurement is available (discarding the `Mass*sin(i)` measurements):

In [8]:
mass_catalog = sherlock.get_from_exoarchive(
    local = True, 
    col_names = ["pl_masse"],
    pre_queries={"pl_masse": "pl_bmassprov == 'Mass'"}
)



<div class="alert alert-warning">

Note that while the condition on `pl_bmassprov` is enforced, the values of `pl_bmassprov` are not updated. Thus, `pl_bmassprov` no longer describes the value of `pl_masse` in the same row; it still holds the value from the default parameter set!

</div>
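Both the per-column behaviour and the caveat in the warning can be illustrated with a toy table and plain pandas. The sketch below is an assumption about the general mechanism, not the actual `sherlock` code: the error column `pl_masse_err`, the planet names and the `'Msini'` label are illustrative only.

```python
import pandas as pd

# Toy table with two references per planet (all values are made up)
rows = pd.DataFrame(
    {
        "pl_name": ["p1", "p1", "p2", "p2"],
        "default_flag": [1, 0, 1, 0],
        "pl_masse": [5.0, 4.0, 300.0, 310.0],
        "pl_masse_err": [0.5, 1.0, 20.0, 5.0],  # hypothetical error column
        "pl_bmassprov": ["Msini", "Mass", "Mass", "Mass"],
    }
)

# One query per column, applied before the lowest-error selection
pre_queries = {"pl_masse": "pl_bmassprov == 'Mass'"}

# Start from the default parameter set, one row per planet
result = rows.query("default_flag == 1").set_index("pl_name")

for col, query in pre_queries.items():
    candidates = rows.query(query)
    best = candidates.loc[candidates.groupby("pl_name")[f"{col}_err"].idxmin()]
    # Only `col` is replaced; the other columns keep their default values
    result[col] = best.set_index("pl_name")[col]

print(result[["pl_masse", "pl_bmassprov"]])
```

For `p1` the mass now comes from the row labelled `'Mass'`, yet `pl_bmassprov` still shows the label of the default row, which is exactly the mismatch the warning describes.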

### Using `post_query`
`post_query` is equivalent to `get_from_exoarchive().query(post_query)`; it is provided mainly for convenience and readability. We can combine it with the two previous examples:

In [9]:
query_catalog = sherlock.get_from_exoarchive(
    local = True, 
    col_names = ["pl_rade", "st_rad", "pl_masse"],
    pre_queries={"pl_masse": "pl_bmassprov == 'Mass'"},
    post_query="pl_rade < 4 & st_rad > .2"
)
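Since `post_query` is passed on to `pandas.DataFrame.query`, the expression can be tried out on any DataFrame first. The sketch below uses a made-up table (the values and index labels are hypothetical) to show that, inside a query string, `&` combines the two comparisons just like `and`, so no parentheses are needed.

```python
import pandas as pd

# Made-up planets, only to exercise the query string
planets = pd.DataFrame(
    {
        "pl_rade": [1.5, 3.0, 11.0, 2.0],
        "st_rad": [0.9, 0.15, 1.1, 0.5],
    },
    index=["p1", "p2", "p3", "p4"],
)

post_query = "pl_rade < 4 & st_rad > .2"

# Query string, as it would be passed through `post_query`
via_query = planets.query(post_query)

# Equivalent explicit boolean mask (parentheses are required here)
via_mask = planets[(planets["pl_rade"] < 4) & (planets["st_rad"] > 0.2)]

print(via_query.index.tolist())  # ['p1', 'p4']
```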
