# Young Host Stars Lightcurve Downloader

This notebook builds on `identify_young_stars.ipynb`, which identifies young host stars and saves them as a database in `results/young_stars_below199Myr.csv`. Our goal is to determine how many of these stars have lightcurves available for download using the `lightkurve` Python library.

To achieve this, we import the necessary modules, load the `young_stars_below199Myr.csv` file, and loop through the stars, searching for each one with `lightkurve`. If a lightcurve is found, we download it and save it to a folder for further analysis.

By the end of this notebook, we will have a dataset of lightcurves for young host stars that can be used for various scientific analyses. 

In [4]:
import numpy as np
import pandas as pd
import lightkurve as lk
import re
import os

### Reading Data from a CSV File

We read the young host stars of interest from `results/young_stars_below199Myr.csv`, which we generated with `identify_young_stars.ipynb` from the NASA Exoplanet Archive.

Each host star is listed in our database under several names (host name, HD, HIP, TIC and Gaia identifiers). We loop over these names until we find one that `lightkurve` recognizes.
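The fallback over alternative names can be sketched in isolation. This is a minimal illustration rather than the notebook's exact code: `is_missing`, `first_matching_name`, `FAKE_ARCHIVE` and `fake_search` are hypothetical names, and `fake_search` stands in for `lk.search_targetpixelfile` so that the sketch runs offline.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN (how pandas marks empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def first_matching_name(names, search):
    """Return (name, results) for the first alias with a non-empty search result.

    `search` is any callable mapping a name to a sized collection of results.
    Returns (None, None) when no alias matches.
    """
    for name in names:
        if is_missing(name):
            continue  # this catalogue has no identifier for the star
        results = search(name)
        if len(results) > 0:
            return name, results
    return None, None


# Hypothetical stand-in for lk.search_targetpixelfile, for offline illustration.
FAKE_ARCHIVE = {"TIC 12345": ["sector 1", "sector 2"]}


def fake_search(name):
    return FAKE_ARCHIVE.get(name, [])


# Made-up aliases: only the TIC identifier is known to the fake archive.
aliases = ["Some Star", float("nan"), "TIC 12345", "Gaia DR2 999"]
found_name, found_results = first_matching_name(aliases, fake_search)
print(found_name, len(found_results))
```

Passing the search function as an argument keeps the matching logic independent of the network, which makes it easy to check before running the full loop.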

We then build a table with the relevant information for each host star:
- Whether lightcurves are available
- How many there are
- Number of lightcurves per cadence
- Missions that collected the lightcurves

The table is saved in `results/TPF_dataframe.csv`.
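As a rough sketch of how one row of this table could be assembled, the snippet below tallies the number of results per cadence with `collections.Counter`. The helper `summarise_exptimes` and the input values are hypothetical. The column names follow the `exptime_<seconds>` pattern used in the cell below, and the missions are sorted here only to make the output deterministic.

```python
from collections import Counter


def summarise_exptimes(exptimes, missions):
    """Build one summary row from per-observation exposure times and missions."""
    row = {
        "TPF_Found": len(exptimes) > 0,
        "Num_Results": len(exptimes),
        "Mission": "/".join(sorted(set(missions))),
    }
    # One column per cadence, e.g. exptime_120, holding the number of results.
    counts = Counter(f"exptime_{exptime:.0f}" for exptime in exptimes)
    row.update(counts)
    return row


# Made-up values for illustration only.
row = summarise_exptimes([120.0, 120.0, 1800.0], ["TESS", "TESS", "Kepler"])
print(row)
```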

Retrieving the data can take up to half an hour, as we loop through 247 host stars.

In [146]:
path_data = '../results/young_stars_below_100Myr.csv'
df = pd.read_csv(path_data, index_col=0)
name_heads = ['hostname', 'hd_name', 'hip_name', 'tic_id', 'gaia_id']
star_names = [list(df[name_head]) for name_head in name_heads]

# Initialize DataFrame
result_df = pd.DataFrame(columns=['Star_Name', 'TPF_Found', 'Found_Star_Name', 'Num_Results', 'Target_Name'])

for j in range(len(star_names[0])):

    # Initialize row dictionary
    row_data = {}
    
    row_data['Star_Name'] = star_names[0][j]
    
    for i in range(len(star_names)):
        star_name = star_names[i][j]
        
        # pd.notna also catches NaN values that are not the np.nan object itself
        if pd.notna(star_name):
            tpf = lk.search_targetpixelfile(star_name)

            
            if len(tpf) > 0:
                missions = list(set(list(tpf.table['obs_collection'])))
                missions_str = '/'.join(missions)
                row_data['TPF_Found'] = True
                row_data['Mission'] = missions_str
                row_data['Found_Star_Name'] = star_name
                row_data['Num_Results'] = len(tpf)
                row_data['Target_Name'] = tpf.table['target_name'][0]
                
                # Adding dynamic columns for exptime
                for exptime in tpf.table['exptime']:
                    col_name = f'exptime_{exptime:.0f}'
                    if col_name not in result_df.columns:
                        result_df[col_name] = 0
                    row_data[col_name] = row_data.get(col_name, 0) + 1
                
                break  # Exit if TPF is found
            else:
                row_data['TPF_Found'] = False
                
    # Append the row to DataFrame
    result_df = pd.concat([result_df, pd.DataFrame(row_data, index=[0])], ignore_index=True)
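
# Move the 'Mission' column to the third position, if any star had a match
if 'Mission' in result_df.columns:
    cols = list(result_df.columns)
    cols.insert(2, cols.pop(cols.index('Mission')))
    result_df = result_df[cols]
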
# Replace NaNs with appropriate defaults (e.g., 0 or False)
result_df.fillna({'TPF_Found': False, 'Found_Star_Name': 'Not Found', 'Num_Results': 0}, inplace=True)
result_df.fillna(0, inplace=True)

savefold = '../results/'
# Create the output folder if it does not exist
os.makedirs(savefold, exist_ok=True)
savepath = os.path.join(savefold, 'TPF_dataframe.csv')
result_df.to_csv(savepath, index=False)

### Downloading the data

We use the dataframe we just generated to download the data. This cell can be run without running the previous one if the dataset has already been generated. We retrieve all the Target Pixel Files (TPFs) for each host star and use the built-in pipeline mask to extract a lightcurve from each of them.

Lightcurves are saved in `results/TPF_data` as `.fits` files, in one folder per star with one subfolder per cadence.

The downloading can take up to an hour.
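The folder layout can be sketched without touching the network. In this minimal illustration, `exptime_seconds` and `lightcurve_dir` are hypothetical helpers, and the string `'[120.] s'` is an assumed example of how an exposure time may print. The regular expression simply takes the first run of digits, as the download cell below does.

```python
import re
from pathlib import PurePosixPath


def exptime_seconds(exptime):
    """Extract the leading integer from an exposure-time value such as '[120.] s'."""
    match = re.search(r"\d+", str(exptime))
    if match is None:
        raise ValueError(f"no digits in exposure time {exptime!r}")
    return int(match.group())


def lightcurve_dir(root, star_name, exptime):
    """Return <root>/<star name with underscores>/exp_<seconds>."""
    safe_name = star_name.replace(" ", "_")
    return PurePosixPath(root) / safe_name / f"exp_{exptime_seconds(exptime)}"


# Made-up star name and exposure time, for illustration only.
print(lightcurve_dir("../results/TPF_data", "Some Star", "[120.] s"))
```

Raising an error when no digits are found makes a malformed exposure time visible immediately, instead of failing later on a missing match object.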

In [11]:
path_data = '../results/TPF_dataframe.csv'
df = pd.read_csv(path_data)

n_light = len(df[df['TPF_Found'] == True])

print(f'{n_light} light curves found out of {len(df)} stars.')


244 light curves found out of 247 stars.


In [None]:

folder = '../results/TPF_data/'
# Make sure the output folder exists before listing it
os.makedirs(folder, exist_ok=True)
subfolders = [f.path for f in os.scandir(folder) if f.is_dir()]

for index, row in df.iterrows():
    # Check if TPF was found for this star
    if row['TPF_Found']:
        # Create a folder for the star
        name = row['Star_Name']
        
        #join name with underscores
        name = name.replace(" ", "_")
        star_folder = f"{folder}{name}"

        if star_folder in subfolders:
            print(f"Folder for {name} already exists. Skipping...")
            continue

        os.makedirs(star_folder, exist_ok=True)
        # Search for the TPF
        found_name = row['Found_Star_Name']
        tpf = lk.search_targetpixelfile(found_name)
        n_tpf = len(tpf)

        print(f'{name} ({found_name})', end=': ')
        print(f"Found {n_tpf} TPFs")
        # Loop through the search result
        for i in range(n_tpf):
            print(f"Downloading TPF {i+1}/{n_tpf}", end='\r')
            # Get the exposure time
            exptime = tpf[i].exptime
            match = re.search(r'\d+', str(exptime))
            number = int(match.group())
            # Create a folder for this exposure time within the star's folder
            exptime_folder = f"{star_folder}/exp_{number}"
            os.makedirs(exptime_folder, exist_ok=True)

            try:
                tpf_file = tpf[i].download()
                fits_hdu = tpf_file.to_lightcurve(aperture_mask=tpf_file.pipeline_mask).to_fits()
                header = fits_hdu[0].header
                # Use `obj` rather than `object` to avoid shadowing the Python builtin
                telescope, date, obj = header['TELESCOP'], header['DATE'], header['OBJECT']
                path = f"{exptime_folder}/{name}_{telescope}_{date}_{obj}_{i}.fits"
                fits_hdu.writeto(path, overwrite=True)
                
            except lk.LightkurveError as e:
                print(f"Error downloading TPF for {row['Star_Name']}: {e}")
            except FileNotFoundError as e:
                print(f"Error downloading TPF for {row['Star_Name']}: {e}")
            except IndexError as e:
                print(f"IndexError: {e}. Skipping iteration {i}.")