In [1]:
import numpy as np
import pandas as pd
import os
from tess_stars2px import tess_stars2px_function_entry

In [2]:
# Mount the GCP filesystem onto this VM
data_dir = "/home/parsellsx/tesslcs/"
os.system(f"gcsfuse --implicit-dirs tess-goddard-lcs {data_dir}")

256
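
A note on the `256` above: `os.system` returns an encoded wait status on Unix, not the command's exit code. A status of 256 decodes to exit code 1, meaning the `gcsfuse` call reported a failure; the notebook does not record why. A minimal sketch of decoding it, assuming Python 3.9 or later:

```python
import os

status = 256  # the value printed by the os.system call above

# Convert the encoded wait status into the command's actual exit code
exit_code = os.waitstatus_to_exitcode(status)
print(exit_code)
```

A nonzero exit code here suggests checking that the mount point is populated before relying on the paths under `data_dir`.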

## Kepler EBs to TIC IDs and GCP Filepaths
In this notebook, I try out the plan I discussed with Daniel, Ann Marie, and Steve in our weekly meeting on 7/6: take the list of Kepler EBs from the Villanova catalog (keplerebs.villanova.edu), get their TIC IDs, and then use those TIC IDs to get the filepaths to their light curves in our GCP storage bucket. (**Note:** the original plan was to match on RA/dec, but in this streamlined version I am not actually using RA/dec, I'm using the names, i.e. the KIC IDs.) Once I have the light curves, I can load them with the loaders in Daniel's SPOcc repository and then start extracting features and training a model. That will be in a separate notebook - here I just want to get all the filepaths.

**Note:** (Update 7/13/21) In this notebook I am loading a version of the Villanova Kepler EB catalog that is already filtered to K-mags between 10 and 15. In the next notebook, "Sector 14 Kepler EBs Lightcurve Loading and Featurization", I realized that a good number (> 300) of additional EBs can be included: their K-mags are not between 10 and 15, but their TESS mags from the lookup tables are. For that reason, I recommend using that notebook to access the Villanova EBs and not this one. This notebook is still good for understanding the procedure, however.

In [3]:
lookup14_path = '~/tesslcs/sector14lookup.csv'
lookup14 = pd.read_csv(lookup14_path,header=None,names=['filename','RA','dec','TIC ID','sector','camera','CCD',
    'mag'],index_col=False,dtype={'filename':str,'RA':float,'dec':float,'TIC ID':int,'sector':int,'camera':int,
    'CCD':int,'mag':float},skiprows=1)

In [5]:
lookup14

Unnamed: 0,filename,RA,dec,TIC ID,sector,camera,CCD,mag
0,tesslcs_sector_14_104/2_min_cadence_targets/te...,296.037464,52.358691,27693449,14,2,4,9.83679
1,tesslcs_sector_14_104/2_min_cadence_targets/te...,296.347333,52.487350,27837522,14,2,4,9.59878
2,tesslcs_sector_14_104/2_min_cadence_targets/te...,296.262922,52.729809,27837873,14,2,4,12.93180
3,tesslcs_sector_14_104/2_min_cadence_targets/te...,298.571902,53.778422,264393379,14,2,4,11.26480
4,tesslcs_sector_14_104/2_min_cadence_targets/te...,296.146668,52.858682,27692689,14,2,4,10.99730
...,...,...,...,...,...,...,...,...
4009712,tesslcs_sector_14_104/tesslcs_tmag_9_10/tesslc...,266.543113,76.482123,1401093064,14,3,3,9.52726
4009713,tesslcs_sector_14_104/tesslcs_tmag_9_10/tesslc...,266.545287,76.482495,1401093065,14,3,3,9.65109
4009714,tesslcs_sector_14_104/tesslcs_tmag_9_10/tesslc...,216.211428,76.585895,156457298,14,3,3,9.82640
4009715,tesslcs_sector_14_104/tesslcs_tmag_9_10/tesslc...,285.164309,76.615343,420107253,14,3,2,9.95180


OK, so I found a GitHub repo that seems to be exactly what I want (https://github.com/jradavenport/kic2tic) - the only problem is that it has almost no documentation, so I'm not sure how accurate it is. But it has a big CSV file with KIC ID in one column and TIC ID in the other. I'm going to test it out by taking a few (maybe 10) objects from the Kepler catalog, looking them up on SIMBAD to get their TIC IDs, and then checking whether those match the ones given by the CSV file from this GitHub repo. Hopefully they do!
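
The spot check described above could be scripted roughly as follows. This is a sketch, not what was actually run: `spot_check`, `toy_k2t`, and `toy_verified` are hypothetical names, the IDs are made-up placeholders rather than real catalog entries, and it assumes each KIC ID appears only once in the crossmatch table.

```python
import pandas as pd

def spot_check(k2t, verified):
    """Compare hand-verified KIC -> TIC pairs against a crossmatch table.

    Returns a list of (kic, expected_tic, found_tic) for every disagreement;
    found_tic is None when the KIC ID is missing from the table.
    """
    # Assumes KIC IDs are unique in k2t, so each lookup gives a single value
    table = k2t.set_index('KIC ID')['TIC ID']
    mismatches = []
    for kic, expected in verified.items():
        found = table.get(kic)  # None if the KIC ID is not in the table
        if found is None or int(found) != expected:
            mismatches.append((kic, expected, None if found is None else int(found)))
    return mismatches

# Toy stand-in data (placeholder values, NOT real catalog IDs)
toy_k2t = pd.DataFrame({'KIC ID': [1, 2, 3], 'TIC ID': [101, 102, 103]})
toy_verified = {1: 101, 2: 999, 4: 104}  # pairs one would copy by hand from SIMBAD

result = spot_check(toy_k2t, toy_verified)
print(result)
```

An empty list would mean every hand-verified pair agrees with the CSV file.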

In [6]:
# Read in that CSV file
kic2tic_path = '~/kic2tic/KIC2TIC.csv'
k2t = pd.read_csv(kic2tic_path,header=None,names=['KIC ID','TIC ID'],index_col=False,
                  dtype={'KIC ID':int,'TIC ID':int},skiprows=1)

In [10]:
# Load in the Villanova Kepler data
kepler_path = '~/berkeley-seti/keplerebs_villanova_kmag_10-15.csv'
keplerebs = pd.read_csv(kepler_path,header=None,names=['KIC ID','period','period err','bjd0','bjd0 err',
    'morphology','RA','dec','gal lon','gal lat','kmag','Teff','SC'],index_col=False,usecols=['KIC ID','period',
    'period err','RA','dec','kmag','Teff'],skiprows=6)

In [11]:
keplerebs

Unnamed: 0,KIC ID,period,period err,RA,dec,kmag,Teff
0,8912468,0.094838,0.000000e+00,300.1156,45.1679,11.751,6194.0
1,8758716,0.107205,0.000000e+00,293.8519,44.9494,13.531,-1.0
2,10855535,0.112782,0.000000e+00,289.2119,48.2031,13.870,7555.0
3,9472174,0.125765,1.000000e-07,294.6359,46.0664,12.264,10645.0
4,9612468,0.133471,1.000000e-07,299.9821,46.2624,11.531,7202.0
...,...,...,...,...,...,...,...
1992,9408440,989.985000,-1.000000e+00,293.5906,45.9317,13.199,5688.0
1993,8054233,1058.000000,-1.000000e+00,299.2436,43.8144,11.783,4733.0
1994,7672940,1064.270000,-1.000000e+00,288.2586,43.3765,12.328,-1.0
1995,11961695,1082.815000,-1.000000e+00,290.6239,50.3904,13.718,5768.0


And now I want to loop through all 1997 of these EBs and for each one, search for the KIC ID in the k2t dataframe and get the corresponding TIC ID.

In [18]:
ticids = []
not_found_counter = 0 # Will count how many KIC IDs aren't found in the k2t dataframe
for kicid in keplerebs['KIC ID']:
    inds = np.where(k2t['KIC ID'] == kicid)[0]
    if inds.size == 0:
        not_found_counter += 1
        continue
    elif inds.size > 1:
        print('More than one line present for KIC ID ' + str(kicid))
    ticids.append(k2t['TIC ID'].iloc[inds[0]]) # Add the 1st TIC ID to the list (presumably it will be the only TIC ID); np.where gives positions, so use .iloc
print('Number not found: ' + str(not_found_counter))

Number not found: 0
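
As an aside, the same lookup can be done without an explicit loop by using a pandas left merge. The sketch below uses toy stand-ins (`toy_ebs`, `toy_k2t`, placeholder IDs) rather than the real dataframes. Unlike the loop above, it assumes each KIC ID appears at most once in the crossmatch table, and `validate='many_to_one'` makes pandas raise an error if that assumption is violated.

```python
import pandas as pd

# Toy stand-ins for keplerebs and k2t (placeholder IDs, not real catalog entries)
toy_ebs = pd.DataFrame({'KIC ID': [10, 20, 20, 30]})
toy_k2t = pd.DataFrame({'KIC ID': [10, 20, 40], 'TIC ID': [111, 222, 444]})

# A left merge keeps every EB row, in order; indicator=True adds a '_merge'
# column recording whether a match was found in the crossmatch table
merged = toy_ebs.merge(toy_k2t, on='KIC ID', how='left',
                       indicator=True, validate='many_to_one')

# Count EBs with no match, and collect TIC IDs for the ones that matched
n_not_found = int((merged['_merge'] == 'left_only').sum())
ticids_toy = merged.loc[merged['_merge'] == 'both', 'TIC ID'].astype(int).tolist()

print(n_not_found, ticids_toy)
```

Note that a KIC ID listed twice in the EB table (20 in the toy data) produces two entries in the result, just as it would in the loop.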


In [20]:
ticids[:5]

[239276223, 270700726, 299157009, 271164763, 239233211]

In [22]:
print(len(ticids))

1997


Let's go! It worked, and every EB was found! So now I have a list of 1997 TIC IDs, each corresponding to a confirmed eclipsing binary. Now I need to take these TIC IDs and figure out which ones are in sector 14, and for the ones that are, get their filepaths.

I'm going to want to loop through the sector 14 lookup table, and for each star in there, see if it's in my "ticids" list. If it is, I'll pull out the whole line and store it in a separate dataframe, then save that to a file.

In [42]:
ticids = np.asarray(ticids)
sector_14_eb_indices = []
for lookup_index, ticid in enumerate(lookup14['TIC ID']):
    inds = np.where(ticids == ticid)[0]
    if inds.size == 0:
        continue
    elif inds.size > 1:
        print('More than one line present in ticids for TIC ID ' + str(ticid) + ' - this means that 2 KIC IDs point to the same TIC ID in k2t')
    sector_14_eb_indices.append(lookup_index) # This time I am just appending the index so I can then make a new df

More than one line present in ticids for TIC ID 121121622 - this means that 2 KIC IDs point to the same TIC ID in k2t
More than one line present in ticids for TIC ID 63366439 - this means that 2 KIC IDs point to the same TIC ID in k2t
More than one line present in ticids for TIC ID 273379486 - this means that 2 KIC IDs point to the same TIC ID in k2t
More than one line present in ticids for TIC ID 120499724 - this means that 2 KIC IDs point to the same TIC ID in k2t
More than one line present in ticids for TIC ID 270782003 - this means that 2 KIC IDs point to the same TIC ID in k2t
More than one line present in ticids for TIC ID 272177504 - this means that 2 KIC IDs point to the same TIC ID in k2t
More than one line present in ticids for TIC ID 243271716 - this means that 2 KIC IDs point to the same TIC ID in k2t
More than one line present in ticids for TIC ID 158489694 - this means that 2 KIC IDs point to the same TIC ID in k2t
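
The loop above scans `ticids` once for every row of the lookup table (about 4 million rows), which is slow. A vectorized sketch of the same selection, using `Series.isin` on toy stand-ins (`toy_lookup` and `toy_ticids` are hypothetical placeholders), might look like this:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for lookup14 and ticids (placeholder values)
toy_lookup = pd.DataFrame({'filename': ['a', 'b', 'c', 'd'],
                           'TIC ID': [5, 7, 9, 11]})
toy_ticids = np.array([7, 11, 11, 13])

# Boolean mask: True for lookup rows whose TIC ID is in the EB list
mask = toy_lookup['TIC ID'].isin(toy_ticids)
toy_sector_ebs = toy_lookup[mask]  # rows keep the lookup table's order

# TIC IDs that occur more than once in the EB list
values, counts = np.unique(toy_ticids, return_counts=True)
repeated = values[counts > 1].tolist()

print(toy_sector_ebs.index.tolist(), repeated)
```

One difference from the loop: `repeated` lists every TIC ID that is repeated in the EB list, whether or not it appears in the lookup table.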


**Note:** (Update 7/13/21) The message printed by the cell above says that more than one line in ticids for a given TIC ID means that 2 different KIC IDs point to that TIC ID in Jim's k2t file. However, I realized today that it could also mean that the original Villanova data lists the same KIC ID multiple times. After looking at a snippet of the Villanova data by eye, I already found a KIC ID for which this is the case: 2856960. So there might not be any duplicates in Jim's file. Update from later today: that's the case! There are no duplicate TIC IDs in Jim's data - to see this, compare len(k2t['TIC ID']) with len(set(k2t['TIC ID'])); the two are equal.
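
Both checks mentioned in this note can be written compactly with pandas. The sketch below runs on toy stand-ins (`toy_k2t` and `toy_ebs` hold placeholder IDs, not real catalog values): `is_unique` is equivalent to comparing the two lengths, and `duplicated(keep=False)` flags every row whose KIC ID is repeated.

```python
import pandas as pd

# Toy stand-ins for k2t and keplerebs (placeholder IDs)
toy_k2t = pd.DataFrame({'KIC ID': [1, 2, 3], 'TIC ID': [101, 102, 103]})
toy_ebs = pd.DataFrame({'KIC ID': [1, 2, 2, 3]})

# True when no TIC ID is repeated; same as len(col) == len(set(col))
k2t_tic_unique = toy_k2t['TIC ID'].is_unique

# KIC IDs that are listed more than once in the EB catalog
is_repeated = toy_ebs['KIC ID'].duplicated(keep=False)
repeated_kics = toy_ebs.loc[is_repeated, 'KIC ID'].unique().tolist()

print(k2t_tic_unique, repeated_kics)
```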

In [43]:
len(sector_14_eb_indices)

1809

In [50]:
# Let's remove duplicates from sector_14_eb_indices
# According to a StackOverflow post, I can use set() to make a list into a set (thereby eliminating duplicates),
# then use list() on that set to convert it back into a list. Note however that order is not preserved - but for me
# that shouldn't be an issue
sector_14_eb_indices_unique = list(set(sector_14_eb_indices))
print(len(sector_14_eb_indices_unique))

1809


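Nice - there were no duplicates in this list. That's expected, though: each entry is a distinct row position in lookup14, so this check can't by itself tell me whether any TIC ID appears more than once in the lookup table (for that I'd have to look at the TIC ID column of the selected rows directly). Either way, I now have 1809 indices into my lookup14 dataframe, and I can pull those rows out to get the filenames of these light curves. I think it's just about time for me to record them to a file, then start a new notebook for the actual lightcurve manipulation.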

In [59]:
cleaned_sector_14_eb_array = lookup14.iloc[sector_14_eb_indices_unique]
cleaned_sector_14_eb_array

Unnamed: 0,filename,RA,dec,TIC ID,sector,camera,CCD,mag
1417218,tesslcs_sector_14_104/tesslcs_tmag_13_14/tessl...,284.417796,48.680907,267667898,14,2,3,13.4082
8197,tesslcs_sector_14_104/2_min_cadence_targets/te...,298.828625,44.489136,268305489,14,2,4,10.8095
1785863,tesslcs_sector_14_104/tesslcs_tmag_13_14/tessl...,296.839695,43.408060,272716209,14,2,4,13.2177
557064,tesslcs_sector_14_104/tesslcs_tmag_12_13/tessl...,300.975737,44.112761,185057845,14,2,4,12.4576
1335315,tesslcs_sector_14_104/tesslcs_tmag_13_14/tessl...,295.940920,44.578345,271970252,14,2,4,13.5418
...,...,...,...,...,...,...,...,...
3850193,tesslcs_sector_14_104/tesslcs_tmag_14_15/tessl...,279.856103,43.667885,383752517,14,2,3,14.2124
450534,tesslcs_sector_14_104/tesslcs_tmag_12_13/tessl...,287.856867,44.102747,158629066,14,2,3,12.2402
696308,tesslcs_sector_14_104/tesslcs_tmag_12_13/tessl...,295.573993,39.700610,184089092,14,1,1,12.5710
1679358,tesslcs_sector_14_104/tesslcs_tmag_13_14/tessl...,288.869256,47.369978,158840322,14,2,3,13.7675


Awesome - now all I need to do is write this dataframe to a file and I should be good to move on.

In [60]:
# I set index=False because I don't care about saving the indices (the numbers in bold in the dataframe above) - 
# these are just the indices of the relevant EBs within the lookup table, which is now irrelevant info
cleaned_sector_14_eb_array.to_csv('cleaned_sector_14_ebs_filepaths.csv',index=False)

And I now have 1809 EBs with filepath and camera info recorded in the file '~/berkeley-seti/cleaned_sector_14_ebs_filepaths.csv'.