In [1]:
import numpy as np
import matplotlib.pyplot as plt
import astropy
import pandas as pd
import os
import time
import itertools
import lightkurve
import pickle
from tess_stars2px import tess_stars2px_function_entry

In [2]:
# Mount the GCP filesystem onto this VM
data_dir = "/home/parsellsx/tesslcs/"
os.system(f"gcsfuse --implicit-dirs tess-goddard-lcs {data_dir}")

256

## Kepler EBs to TIC IDs and GCP Filepaths
In this notebook, I am going to try out the approach I discussed with Daniel, Ann Marie, and Steve in our weekly meeting on 7/6: take the list of Kepler EBs (from Villanova; keplerebs.villanova.edu), use their RA/dec to get their TIC IDs with tesspoint, then use their TIC IDs to get the filepaths to their light curves in our GCP storage bucket. Once I have the light curves, I can load them with the loaders in Daniel's SPOcc repository and start extracting features and training a model. That will be in a separate notebook - here I just want to get all the filepaths.

Before I get the RA/dec for all 1997 Kepler EBs, I first want to try out tesspoint on a single example star. I'll take the first star from the sector 14 lookup table.

In [3]:
lookup14_path = '~/tesslcs/sector14lookup.csv'
lookup14 = pd.read_csv(
    lookup14_path, header=None, index_col=False, skiprows=1,
    names=['filename', 'RA', 'dec', 'TIC ID', 'sector', 'camera', 'CCD', 'mag'],
    dtype={'filename': str, 'RA': float, 'dec': float, 'TIC ID': int,
           'sector': int, 'camera': int, 'CCD': int, 'mag': float},
)

In [4]:
# Now I can access the RA and dec of my example star with lookup14['RA'][0] and lookup14['dec'][0]
# Let's print the TIC ID of that star so we can tell if tesspoint gets it right later
print(lookup14['TIC ID'][0])

27693449


Now I just need to figure out how to pass the RA/dec into tesspoint to get the corresponding TIC ID. The GitHub page (https://github.com/christopherburke/tess-point) has a few examples, but they're not totally clear. It sounds like you can supply just the RA and dec of a star and get back various info about it, including TIC ID, RA, dec (even though you just put those in), ecliptic coordinates, sector, camera, CCD, and column and row in pixels. It's unclear whether you also have to supply the TIC ID or some other numeric identifier to get that info, but it sounds like you don't need to know the TIC ID in advance, which is good. I'm going to pass in the RA and dec of my example star and see what happens.

After reading about it more here (https://github.com/christopherburke/tess-point/blob/master/example_use_tess_stars2py_byfunction.py), I figured out how it works. You can use tesspoint in your Python program by importing the function they show there, so I added that to my import cell at the top of this file. You can then feed in one or multiple targets at a time. The "TIC ID" that you pass in doesn't actually get used by the program, so it doesn't have to be the real TIC ID. It only labels that star in the output, because when you feed in a star, tesspoint returns not just one but _every_ sector/camera/CCD that the star appears in.

Let me now try passing in the RA/dec of the first 2 stars from the sector 14 lookup table, and see what I get back. For their IDs, I'll just call them 0 and 1, and if this works, then when I do this for the entire list of 1997 Kepler EBs, I'll use np.linspace or np.arange to generate IDs from 0 through 1996.
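Both np.arange and np.linspace can generate the placeholder IDs, but np.linspace returns floats, so it needs a cast if integer labels are wanted. A minimal sketch (the variable names are illustrative, and the IDs are zero-based to match the code cells in this notebook):

```python
import numpy as np

n_ebs = 1997  # number of Kepler EBs in the Villanova list

# np.arange yields integers directly, which suits ID labels
ids_arange = np.arange(n_ebs)

# np.linspace yields floats, so cast to int if integer IDs are wanted
ids_linspace = np.linspace(0, n_ebs - 1, num=n_ebs).astype(int)

print(ids_arange[:3], ids_arange[-1])
print(np.array_equal(ids_arange, ids_linspace))
```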

In [16]:
ra = lookup14['RA'][:2]
dec = lookup14['dec'][:2]
iden = [0,1]
outID, outEclipLong, outEclipLat, outSec, outCam, outCcd, outColPix, outRowPix, scinfo = tess_stars2px_function_entry(iden, ra, dec)

Problem: I'm realizing now that tesspoint doesn't actually give you the TIC ID. So I'm going to need some other way to get the filepaths - either get the TIC ID from someplace else, or find a way to use the lookup tables with just RA and dec.

In [5]:
lookup14

Unnamed: 0,filename,RA,dec,TIC ID,sector,camera,CCD,mag
0,tesslcs_sector_14_104/2_min_cadence_targets/te...,296.037464,52.358691,27693449,14,2,4,9.83679
1,tesslcs_sector_14_104/2_min_cadence_targets/te...,296.347333,52.487350,27837522,14,2,4,9.59878
2,tesslcs_sector_14_104/2_min_cadence_targets/te...,296.262922,52.729809,27837873,14,2,4,12.93180
3,tesslcs_sector_14_104/2_min_cadence_targets/te...,298.571902,53.778422,264393379,14,2,4,11.26480
4,tesslcs_sector_14_104/2_min_cadence_targets/te...,296.146668,52.858682,27692689,14,2,4,10.99730
...,...,...,...,...,...,...,...,...
4009712,tesslcs_sector_14_104/tesslcs_tmag_9_10/tesslc...,266.543113,76.482123,1401093064,14,3,3,9.52726
4009713,tesslcs_sector_14_104/tesslcs_tmag_9_10/tesslc...,266.545287,76.482495,1401093065,14,3,3,9.65109
4009714,tesslcs_sector_14_104/tesslcs_tmag_9_10/tesslc...,216.211428,76.585895,156457298,14,3,3,9.82640
4009715,tesslcs_sector_14_104/tesslcs_tmag_9_10/tesslc...,285.164309,76.615343,420107253,14,3,2,9.95180


OK, that is a bummer, but I found a GitHub repo that seems to be exactly what I want (https://github.com/jradavenport/kic2tic). The only problem is that it has almost no documentation, so I'm not sure how accurate it is. It does have a big CSV file with KIC ID in one column and TIC ID in the other. I'm going to test it by taking a few (maybe 10) objects from the Kepler catalog, looking them up on SIMBAD to get their TIC IDs, and seeing whether they match the ones given by the CSV file from this repo. Hopefully they do!

In [6]:
# Read in that CSV file
kic2tic_path = '~/kic2tic/KIC2TIC.csv'
k2t = pd.read_csv(kic2tic_path,header=None,names=['KIC ID','TIC ID'],index_col=False,
                  dtype={'KIC ID':int,'TIC ID':int},skiprows=1)

In [7]:
# Print the first 10 IDs from the file
k2t[:10]

Unnamed: 0,KIC ID,TIC ID
0,2285208,120970531
1,5369520,138426805
2,6435985,122066934
3,7975417,272717183
4,9287814,271164174
5,2976009,121395656
6,5567762,172770539
7,10092003,273378207
8,3946338,121939328
9,8160525,159221152


Out of the 10 KIC IDs above, 8 of them had the matching TIC ID in SIMBAD, and as for the other 2, SIMBAD couldn't find them based on either the KIC ID or the TIC ID. So I'm feeling a lot more confident now that this CSV file is legit. Let me try another few from somewhere in the middle of the file just to reassure myself. 

In [8]:
k2t['KIC ID'].size

199421

In [9]:
k2t[85324:85329]

Unnamed: 0,KIC ID,TIC ID
85324,5273184,137090434
85325,10157123,273129783
85326,4839259,137897159
85327,8394035,239291099
85328,5653887,172770809


1 of the 5 IDs above wasn't found by SIMBAD, but the other 4 were accurate. I feel good enough to try this out.

So the next step is to load in the Kepler EBs Villanova data.

In [10]:
# Load in the Villanova Kepler data
kepler_path = '~/berkeley-seti/keplerebs_villanova_kmag_10-15.csv'
keplerebs = pd.read_csv(kepler_path,header=None,names=['KIC ID','period','period err','bjd0','bjd0 err',
    'morphology','RA','dec','gal lon','gal lat','kmag','Teff','SC'],index_col=False,usecols=['KIC ID','period',
    'period err','RA','dec','kmag','Teff'],skiprows=6)

In [11]:
keplerebs

Unnamed: 0,KIC ID,period,period err,RA,dec,kmag,Teff
0,8912468,0.094838,0.000000e+00,300.1156,45.1679,11.751,6194.0
1,8758716,0.107205,0.000000e+00,293.8519,44.9494,13.531,-1.0
2,10855535,0.112782,0.000000e+00,289.2119,48.2031,13.870,7555.0
3,9472174,0.125765,1.000000e-07,294.6359,46.0664,12.264,10645.0
4,9612468,0.133471,1.000000e-07,299.9821,46.2624,11.531,7202.0
...,...,...,...,...,...,...,...
1992,9408440,989.985000,-1.000000e+00,293.5906,45.9317,13.199,5688.0
1993,8054233,1058.000000,-1.000000e+00,299.2436,43.8144,11.783,4733.0
1994,7672940,1064.270000,-1.000000e+00,288.2586,43.3765,12.328,-1.0
1995,11961695,1082.815000,-1.000000e+00,290.6239,50.3904,13.718,5768.0


And now I want to loop through all 1997 of these EBs and for each one, search for the KIC ID in the k2t dataframe and get the corresponding TIC ID.

In [15]:
# Can we use np.where on a pandas Series?
np.where(keplerebs['KIC ID'] == 9408440)[0]

array([1992])

I can! That should make things easier. I have the KIC IDs in the k2t dataframe. As a side note, I contacted Jim Davenport, the man who wrote the kic2tic Github repository that I got this CSV file from. He said it should work fine for my purposes, with the caveat that it might only include the 2-minute cadence targets. I'll see if that's true or not by seeing how many of my 1997 EBs it finds in that list. Let's go!

In [18]:
ticids = []
not_found_counter = 0 # Will count how many KIC IDs aren't found in the k2t dataframe
for kicid in keplerebs['KIC ID']:
    inds = np.where(k2t['KIC ID'] == kicid)[0]
    if inds.size == 0:
        not_found_counter += 1
        continue
    elif inds.size > 1:
        print('More than one line present for KIC ID ' + str(kicid))
    ticids.append(k2t['TIC ID'][inds[0]]) # Add the 1st TIC ID to the list (presumably it will be the only TIC ID)
print('Number not found: ' + str(not_found_counter))

Number not found: 0
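The same lookup can also be written without an explicit loop. The sketch below uses a pandas left merge on toy stand-ins for k2t and keplerebs (all names ending in _toy and all values are made up); indicator=True adds a _merge column that flags KIC IDs with no match.

```python
import pandas as pd

# Toy stand-ins for the k2t and keplerebs dataframes (values are made up)
k2t_toy = pd.DataFrame({"KIC ID": [101, 102, 103], "TIC ID": [9001, 9002, 9003]})
ebs_toy = pd.DataFrame({"KIC ID": [103, 101, 999]})

# A left merge keeps every EB row in its original order;
# KIC IDs missing from the mapping get NaN in the TIC ID column
merged = ebs_toy.merge(k2t_toy, on="KIC ID", how="left", indicator=True)

# Count the EBs that had no match, and collect the TIC IDs of those that did
n_not_found = int((merged["_merge"] == "left_only").sum())
ticids_toy = merged.loc[merged["_merge"] == "both", "TIC ID"].astype(int).tolist()

print("Number not found:", n_not_found)
print(ticids_toy)
```

One behavioral difference to keep in mind: if a KIC ID appeared more than once in the mapping table, the merge would produce one output row per match, whereas the loop above keeps only the first.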


In [20]:
ticids[:5]

[239276223, 270700726, 299157009, 271164763, 239233211]

In [22]:
print(len(ticids))

1997


Let's go! It worked, and every EB was found! So now I have a list of 1997 TIC IDs, each corresponding to a confirmed eclipsing binary. Now I need to take these TIC IDs and figure out which ones are in sector 14, and for the ones that are, get their filepaths.

I'm going to want to loop through the sector 14 lookup table, and for each star in there, see if it's in my "ticids" list. If it is, I'll pull out the whole line and store it in a separate dataframe, then save that to a file.

In [25]:
sector_14_eb_indices = []
for lookup_index, ticid in enumerate(lookup14['TIC ID']):
    inds = np.where(ticids == ticid)[0]
    if inds.size == 0:
        continue
    elif inds.size > 1:
        print('More than one line present in ticids for TIC ID ' + str(ticid) + ' - this means that 2 KIC IDs point to the same TIC ID in k2t')
    sector_14_eb_indices.append(lookup_index) # This time I am just appending the index so I can then make a new df

In [26]:
len(sector_14_eb_indices)

0

Hmm...why are there none showing up in sector 14? Not what I expected at all. 

In [27]:
lookup14['TIC ID']

0            27693449
1            27837522
2            27837873
3           264393379
4            27692689
              ...    
4009712    1401093064
4009713    1401093065
4009714     156457298
4009715     420107253
4009716     524593265
Name: TIC ID, Length: 4009717, dtype: int64

In [29]:
# Let's try using tesspoint to get the (TESS) sector of all my Kepler EBs - I'll see what looks most common
ra = keplerebs['RA']
dec = keplerebs['dec']
iden = np.arange(1997)  # integer placeholder IDs 0..1996, one per EB
outID, outEclipLong, outEclipLat, outSec, outCam, outCcd, outColPix, outRowPix, scinfo = tess_stars2px_function_entry(iden, ra, dec)

In [30]:
outSec

array([14, 15, 41, ..., 41, 54, 55])

In [38]:
outID[:5]

array([0, 0, 0, 0, 0])

In [31]:
np.where(outSec==14)[0].size

1885

Based on that, it looks like 1885 of my 1997 Kepler EBs actually are in sector 14, which is a relief. But why didn't they show up in my search above?
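Note that tesspoint returns one entry per (star, sector) pair, which is why the first five entries of outID are all 0: the first EB falls in several sectors. Counting unique IDs among the sector 14 entries guards against counting any star twice. A minimal sketch with made-up arrays in the same shape as the tesspoint output:

```python
import numpy as np

# Made-up arrays shaped like tesspoint output: one entry per (star, sector) pair
out_id_toy = np.array([0, 0, 0, 1, 1, 2])
out_sec_toy = np.array([14, 15, 41, 14, 41, 15])

# The boolean mask selects the sector 14 entries;
# np.unique then collapses any repeated star IDs
stars_in_14 = np.unique(out_id_toy[out_sec_toy == 14])

print(stars_in_14.size)
```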

Since tesspoint says that the very first EB I gave it (the first one in keplerebs, and therefore the first one in the ticids list) is in sector 14, let me check whether that star's TIC ID shows up anywhere in the lookup table. If not, I'll know there's some sort of problem.

In [39]:
np.where(lookup14['TIC ID'] == ticids[0])[0]

array([8889])

Good news, the first TIC ID in ticids is actually in the lookup table. So it must just be some bug in my code that led to me not getting any sector 14 ones before.

In [42]:
# I think the problem might be that ticids is a list and not an array. Let me convert it and try again:
ticids = np.asarray(ticids)
sector_14_eb_indices = []
for lookup_index, ticid in enumerate(lookup14['TIC ID']):
    inds = np.where(ticids == ticid)[0]
    if inds.size == 0:
        continue
    elif inds.size > 1:
        print('More than one line present in ticids for TIC ID ' + str(ticid) + ' - this means that 2 KIC IDs point to the same TIC ID in k2t')
    sector_14_eb_indices.append(lookup_index) # This time I am just appending the index so I can then make a new df

More than one line present in ticids for TIC ID 121121622 - this means that 2 KIC IDs point to the same TIC ID in k2t
More than one line present in ticids for TIC ID 63366439 - this means that 2 KIC IDs point to the same TIC ID in k2t
More than one line present in ticids for TIC ID 273379486 - this means that 2 KIC IDs point to the same TIC ID in k2t
More than one line present in ticids for TIC ID 120499724 - this means that 2 KIC IDs point to the same TIC ID in k2t
More than one line present in ticids for TIC ID 270782003 - this means that 2 KIC IDs point to the same TIC ID in k2t
More than one line present in ticids for TIC ID 272177504 - this means that 2 KIC IDs point to the same TIC ID in k2t
More than one line present in ticids for TIC ID 243271716 - this means that 2 KIC IDs point to the same TIC ID in k2t
More than one line present in ticids for TIC ID 158489694 - this means that 2 KIC IDs point to the same TIC ID in k2t


In [43]:
len(sector_14_eb_indices)

1809
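The first attempt found nothing because ticids was a plain Python list: `ticids == ticid` compares the whole list to one integer and evaluates to a single False, so no match is ever reported. A small sketch of the difference, plus np.isin as one possible vectorized replacement for the loop (the lookup_tics values are made up for illustration):

```python
import numpy as np

ticids_list = [239276223, 270700726, 299157009]  # first three TIC IDs printed earlier
target = 270700726

# A list compared to an int gives one Python bool, not an elementwise result
list_cmp = (ticids_list == target)

# A NumPy array broadcasts the comparison across every element
array_cmp = (np.asarray(ticids_list) == target)
match_inds = np.where(array_cmp)[0]

# np.isin tests a whole column at once, avoiding the Python-level loop
lookup_tics = np.array([111, 270700726, 222, 239276223])  # made-up lookup column
eb_indices = np.flatnonzero(np.isin(lookup_tics, ticids_list))

print(list_cmp, match_inds, eb_indices)
```

With roughly 4 million rows in the lookup table, the vectorized form is likely to be much faster than calling np.where once per row, though that was not timed here.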

This seems a lot better, but I'm surprised that there were times where TIC IDs showed up multiple times in ticids. 

In [44]:
np.where(ticids == 121121622)[0]

array([1127, 1134])

In [48]:
np.where(k2t['TIC ID'] == 121121622)[0]

array([47384])

In [49]:
# Interesting, so this TIC ID shows up twice in ticids, indicating two entries in the Kepler KIC ID list map to it
# through k2t, but it only shows up once in k2t, which would seem to mean that there are 2 entries of the same
# KIC ID (same star) in the Kepler EBs dataset - let's see if that's the case
keplerebs['KIC ID'][1125:1135]

1125     7695093
1126     9835416
1127     4247791
1128    10149845
1129     9881258
1130     4059656
1131     8621026
1132    10480952
1133     8230809
1134     4247791
Name: KIC ID, dtype: int64

I can see that is the case - the KIC ID 4247791 is repeated. I'm guessing this is why the other 7 TIC IDs got flagged for showing up multiple times too (above).
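pandas can find such repeats directly. A short sketch using only the ten KIC IDs printed above (this excerpt, not the full catalog):

```python
import pandas as pd

# The ten KIC IDs printed above, with their original row labels
kic_excerpt = pd.Series(
    [7695093, 9835416, 4247791, 10149845, 9881258,
     4059656, 8621026, 10480952, 8230809, 4247791],
    index=range(1125, 1135), name="KIC ID",
)

# keep=False marks every member of a duplicated group, not just the later copies
dupes = kic_excerpt[kic_excerpt.duplicated(keep=False)]

print(dupes)
```

Running the same duplicated(keep=False) call on the full keplerebs['KIC ID'] column should list every repeated entry at once.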

One thing I haven't checked yet is whether there are any duplicate TIC IDs in the lookup table. If there are, two different lookup_index values would get appended to the sector_14_eb_indices list even though they represent only one EB (making it look like I have more unique EBs than I really do). But I guess I could filter those out just from the sector_14_eb_indices list by running some kind of duplicate removal function.

The duplicates in the ticids list (which, as mentioned in the first paragraph of this cell, above, come from duplicate KIC IDs on the Villanova site) shouldn't be a problem because I'm only appending the lookup_index in the lookup table file, so only one number gets recorded per EB.

In [50]:
# Let's remove duplicates from sector_14_eb_indices
# According to this StackOverflow post, I can use set() to make a list into a set (thereby eliminating duplicates),
# then use list() on that set to convert it back into a list. Note however that order is not preserved - but for me
# that shouldn't be an issue
sector_14_eb_indices_unique = list(set(sector_14_eb_indices))
print(len(sector_14_eb_indices_unique))

1809


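Nice - there were no duplicates in this list. That was guaranteed, though: each lookup_index comes from enumerate and is appended at most once, so this check can't reveal duplicate TIC IDs in the lookup table. Ruling those out would require checking the 'TIC ID' column itself. For now, I have 1809 indices in my lookup14 dataframe, and I can pull those out to get the filenames of these light curves. I think it's just about time for me to record them to a file, then start a new notebook for the actual lightcurve manipulation.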
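Checking for repeated TIC IDs means looking at the TIC ID values rather than at the row positions. A minimal sketch on a made-up miniature lookup table (lookup_toy, its filenames, and its values are all hypothetical, not the real lookup14 data):

```python
import pandas as pd

# Made-up miniature lookup table in which one TIC ID appears on two rows
lookup_toy = pd.DataFrame({
    "filename": ["lc_a", "lc_b", "lc_c", "lc_d"],
    "TIC ID": [11, 22, 11, 33],
})
selected_rows = [0, 2, 3]  # row positions, unique by construction

subset = lookup_toy.iloc[selected_rows]

# Duplicates live in the TIC ID column, so test that column, not the row positions
n_repeated = int(subset["TIC ID"].duplicated().sum())

# Keep only the first row for each TIC ID
deduped = subset.drop_duplicates(subset="TIC ID", keep="first")

print(n_repeated, deduped["TIC ID"].tolist())
```

On the real data, the analogous check would be `lookup14.iloc[sector_14_eb_indices]['TIC ID'].duplicated().sum()`; its result is not shown because it was not run in this notebook.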

In [59]:
cleaned_sector_14_eb_array = lookup14.iloc[sector_14_eb_indices_unique]
cleaned_sector_14_eb_array

Unnamed: 0,filename,RA,dec,TIC ID,sector,camera,CCD,mag
1417218,tesslcs_sector_14_104/tesslcs_tmag_13_14/tessl...,284.417796,48.680907,267667898,14,2,3,13.4082
8197,tesslcs_sector_14_104/2_min_cadence_targets/te...,298.828625,44.489136,268305489,14,2,4,10.8095
1785863,tesslcs_sector_14_104/tesslcs_tmag_13_14/tessl...,296.839695,43.408060,272716209,14,2,4,13.2177
557064,tesslcs_sector_14_104/tesslcs_tmag_12_13/tessl...,300.975737,44.112761,185057845,14,2,4,12.4576
1335315,tesslcs_sector_14_104/tesslcs_tmag_13_14/tessl...,295.940920,44.578345,271970252,14,2,4,13.5418
...,...,...,...,...,...,...,...,...
3850193,tesslcs_sector_14_104/tesslcs_tmag_14_15/tessl...,279.856103,43.667885,383752517,14,2,3,14.2124
450534,tesslcs_sector_14_104/tesslcs_tmag_12_13/tessl...,287.856867,44.102747,158629066,14,2,3,12.2402
696308,tesslcs_sector_14_104/tesslcs_tmag_12_13/tessl...,295.573993,39.700610,184089092,14,1,1,12.5710
1679358,tesslcs_sector_14_104/tesslcs_tmag_13_14/tessl...,288.869256,47.369978,158840322,14,2,3,13.7675


Awesome - now all I need to do is write this dataframe to a file and I should be good to move on.

In [60]:
# I set index=False because I don't care about saving the indices (the numbers in bold in the dataframe above) - 
# these are just the indices of the relevant EBs within the lookup table, which is now irrelevant info
cleaned_sector_14_eb_array.to_csv('cleaned_sector_14_ebs_filepaths.csv',index=False)

And I now have 1809 EBs with filepath and camera info recorded in the file '~/berkeley-seti/cleaned_sector_14_ebs_filepaths.csv'.