# NASA Exoplanet Data Cleaning Pipeline
## Datasets: Kepler, K2, and TESS

This notebook explores, cleans, and (where the schemas allow) combines three NASA exoplanet discovery datasets:
- **Kepler**: Kepler Objects of Interest (KOI) from the original Kepler mission
- **K2**: Planets and candidates from K2, Kepler's extended mission
- **TESS**: Transiting Exoplanet Survey Satellite data

### Analysis Plan:
1. **Setup & Imports** - Load required libraries
2. **Load Datasets** - Load all three CSV files
3. **Individual Exploration** - Examine each dataset's structure, statistics, and quality
4. **Schema Comparison** - Identify common/unique columns across datasets
5. **Data Cleaning** - Handle missing values, duplicates, and inconsistencies
6. **Combination Strategy** - Determine if/how to combine datasets
7. **Export** - Save cleaned data for analysis
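
As a rough illustration of step 4, the schema comparison can be expressed with plain set operations on the column names. This is a minimal sketch, not the notebook's actual implementation: `compare_schemas` is a hypothetical helper, and the toy frames below use a few illustrative column names (`toi` in particular is an assumed placeholder for a TESS-specific column).

```python
import pandas as pd

def compare_schemas(frames: dict[str, pd.DataFrame]) -> dict[str, set[str]]:
    """Return the columns shared by all frames plus the columns unique to each."""
    column_sets = {name: set(df.columns) for name, df in frames.items()}
    result = {"common": set.intersection(*column_sets.values())}
    for name, cols in column_sets.items():
        # Columns that appear in any of the other frames
        others = set().union(*(c for n, c in column_sets.items() if n != name))
        result[f"only_{name}"] = cols - others
    return result

# Toy frames standing in for the real tables (column names are illustrative)
toy = {
    "kepler": pd.DataFrame(columns=["rowid", "ra", "dec", "koi_period"]),
    "k2": pd.DataFrame(columns=["rowid", "ra", "dec", "pl_orbper"]),
    "tess": pd.DataFrame(columns=["rowid", "ra", "dec", "toi"]),
}
schema = compare_schemas(toy)
print(sorted(schema["common"]))
```

With the real data, the call would be `compare_schemas({"kepler": kepler_df, "k2": k2_df, "tess": tess_df})` once the frames are loaded.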

## 1. Setup & Imports

In [5]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

# Configure display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


## 2. Load Datasets

Load all three NASA exoplanet datasets from the raw data folder.

In [8]:
# Define data paths
data_dir = Path('../data/raw')

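# Load datasets
# NASA Exoplanet Archive CSV exports prefix their metadata lines with '#'.
# Note: on_bad_lines='skip' silently drops malformed rows, so the row counts
# printed below may be lower than the number of data lines in the files.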
print("Loading datasets...")
kepler_df = pd.read_csv(data_dir / 'kepler.csv', comment='#', on_bad_lines='skip')
k2_df = pd.read_csv(data_dir / 'k2.csv', comment='#', on_bad_lines='skip')
tess_df = pd.read_csv(data_dir / 'tess.csv', comment='#', on_bad_lines='skip')

print(f"✓ Kepler dataset loaded: {kepler_df.shape[0]:,} rows × {kepler_df.shape[1]} columns")
print(f"✓ K2 dataset loaded: {k2_df.shape[0]:,} rows × {k2_df.shape[1]} columns")
print(f"✓ TESS dataset loaded: {tess_df.shape[0]:,} rows × {tess_df.shape[1]} columns")

Loading datasets...
✓ Kepler dataset loaded: 9,564 rows × 141 columns
✓ K2 dataset loaded: 4,004 rows × 295 columns
✓ TESS dataset loaded: 7,703 rows × 87 columns
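
Because `on_bad_lines='skip'` discards malformed rows without reporting them, it can help to count how many were dropped. The sketch below assumes pandas 1.4 or later, where `on_bad_lines` accepts a callable when `engine='python'` is used. `read_archive_csv` is a hypothetical helper, and the in-memory sample stands in for a real file.

```python
import io
import pandas as pd

def read_archive_csv(source, **kwargs):
    """Read a '#'-commented CSV and report how many malformed rows were dropped."""
    skipped = []

    def record_bad_line(bad_line):
        # Called by the python engine for each row with too many fields;
        # returning None tells pandas to drop that row.
        skipped.append(bad_line)
        return None

    df = pd.read_csv(source, comment="#", engine="python",
                     on_bad_lines=record_bad_line, **kwargs)
    return df, len(skipped)

# In-memory sample: one comment line, a header, and one row with an extra field
sample = io.StringIO("# header comment\na,b\n1,2\n3,4,5\n6,7\n")
df, n_skipped = read_archive_csv(sample)
print(f"{len(df)} rows kept, {n_skipped} skipped")
```

The python engine is slower than the default C engine, so this is probably best used as a one-off check rather than as the standard loader.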


## 3. Explore Individual Datasets

### 3.1 Kepler Dataset Exploration

In [14]:
# Kepler: Basic information
print("=" * 80)
print("KEPLER DATASET EXPLORATION")
print("=" * 80)

print(f"\nDataset shape: {kepler_df.shape[0]:,} rows × {kepler_df.shape[1]} columns\n")

print("Column names and types:")
print(kepler_df.dtypes)

print("\n" + "-" * 80)
print("First few rows:")
display(kepler_df.head())

print("\n" + "-" * 80)
print("Missing values:")
missing_kepler = kepler_df.isnull().sum()
missing_pct_kepler = (missing_kepler / len(kepler_df) * 100).round(2)
missing_df_kepler = pd.DataFrame({
    'Missing Count': missing_kepler[missing_kepler > 0],
    'Percentage': missing_pct_kepler[missing_kepler > 0]
}).sort_values('Percentage', ascending=False)
display(missing_df_kepler)

print("\n" + "-" * 80)
print("Basic statistics:")
display(kepler_df.describe())

KEPLER DATASET EXPLORATION

Dataset shape: 9,564 rows × 141 columns

Column names and types:
rowid                   int64
kepid                   int64
kepoi_name             object
kepler_name            object
koi_disposition        object
                       ...   
koi_dikco_mra_err     float64
koi_dikco_mdec        float64
koi_dikco_mdec_err    float64
koi_dikco_msky        float64
koi_dikco_msky_err    float64
Length: 141, dtype: object

--------------------------------------------------------------------------------
First few rows:


Unnamed: 0,rowid,kepid,kepoi_name,kepler_name,koi_disposition,koi_vet_stat,koi_vet_date,koi_pdisposition,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_disp_prov,koi_comment,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,koi_time0bk_err1,koi_time0bk_err2,koi_time0,koi_time0_err1,koi_time0_err2,koi_eccen,koi_eccen_err1,koi_eccen_err2,koi_longp,koi_longp_err1,koi_longp_err2,koi_impact,koi_impact_err1,koi_impact_err2,koi_duration,koi_duration_err1,koi_duration_err2,koi_ingress,koi_ingress_err1,koi_ingress_err2,koi_depth,koi_depth_err1,koi_depth_err2,koi_ror,koi_ror_err1,koi_ror_err2,koi_srho,koi_srho_err1,koi_srho_err2,koi_fittype,koi_prad,koi_prad_err1,koi_prad_err2,koi_sma,koi_sma_err1,koi_sma_err2,koi_incl,koi_incl_err1,koi_incl_err2,koi_teq,koi_teq_err1,koi_teq_err2,koi_insol,koi_insol_err1,koi_insol_err2,koi_dor,koi_dor_err1,koi_dor_err2,koi_limbdark_mod,koi_ldm_coeff4,koi_ldm_coeff3,koi_ldm_coeff2,koi_ldm_coeff1,koi_parm_prov,koi_max_sngle_ev,koi_max_mult_ev,koi_model_snr,koi_count,koi_num_transits,koi_tce_plnt_num,koi_tce_delivname,koi_quarters,koi_bin_oedp_sig,koi_trans_mod,koi_model_dof,koi_model_chisq,koi_datalink_dvr,koi_datalink_dvs,koi_steff,koi_steff_err1,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_smet,koi_smet_err1,koi_smet_err2,koi_srad,koi_srad_err1,koi_srad_err2,koi_smass,koi_smass_err1,koi_smass_err2,koi_sage,koi_sage_err1,koi_sage_err2,koi_sparprov,ra,dec,koi_kepmag,koi_gmag,koi_rmag,koi_imag,koi_zmag,koi_jmag,koi_hmag,koi_kmag,koi_fwm_stat_sig,koi_fwm_sra,koi_fwm_sra_err,koi_fwm_sdec,koi_fwm_sdec_err,koi_fwm_srao,koi_fwm_srao_err,koi_fwm_sdeco,koi_fwm_sdeco_err,koi_fwm_prao,koi_fwm_prao_err,koi_fwm_pdeco,koi_fwm_pdeco_err,koi_dicco_mra,koi_dicco_mra_err,koi_dicco_mdec,koi_dicco_mdec_err,koi_dicco_msky,koi_dicco_msky_err,koi_dikco_mra,koi_dikco_mra_err,koi_dikco_mdec,koi_dikco_mdec_err,koi_dikco_msky,koi_dikco_msky_err
0,1,10797460,K00752.01,Kepler-227 b,CONFIRMED,Done,2018-08-16,CANDIDATE,1.0,0,0,0,0,q1_q17_dr25_sup_koi,NO_COMMENT,9.488,0.0,-0.0,170.539,0.002,-0.002,2455003.539,0.002,-0.002,0.0,,,,,,0.146,0.318,-0.146,2.958,0.082,-0.082,,,,615.8,19.5,-19.5,0.022,0.001,-0.001,3.208,0.332,-1.1,LS+MCMC,2.26,0.26,-0.15,0.085,,,89.66,,,793.0,,,93.59,29.45,-16.65,24.81,2.6,-2.6,Claret (2011 A&A 529 75) ATLAS LS,0.0,0.0,0.229,0.46,q1_q17_dr25_koi,5.136,28.471,35.8,2,142.0,1.0,q1_q17_dr25_tce,11111111111111111000000000000000,0.686,Mandel and Agol (2002 ApJ 580 171),,,010/010797/010797460/dv/kplr010797460-20160209...,010/010797/010797460/dv/kplr010797460-001-2016...,5455.0,81.0,-81.0,4.467,0.064,-0.096,0.14,0.15,-0.15,0.927,0.105,-0.061,0.919,0.052,-0.046,,,,q1_q17_dr25_stellar,291.934,48.142,15.347,15.89,15.27,15.114,15.006,14.082,13.751,13.648,0.002,19.462,0.0,48.142,0.0,0.43,0.51,0.94,0.48,-0.0,0.0,-0.001,0.0,-0.01,0.13,0.2,0.16,0.2,0.17,0.08,0.13,0.31,0.17,0.32,0.16
1,2,10797460,K00752.02,Kepler-227 c,CONFIRMED,Done,2018-08-16,CANDIDATE,0.969,0,0,0,0,q1_q17_dr25_sup_koi,NO_COMMENT,54.418,0.0,-0.0,162.514,0.004,-0.004,2454995.514,0.004,-0.004,0.0,,,,,,0.586,0.059,-0.443,4.507,0.116,-0.116,,,,874.8,35.5,-35.5,0.028,0.009,-0.001,3.024,2.205,-2.496,LS+MCMC,2.83,0.32,-0.19,0.273,,,89.57,,,443.0,,,9.11,2.87,-1.62,77.9,28.4,-28.4,Claret (2011 A&A 529 75) ATLAS LS,0.0,0.0,0.229,0.46,q1_q17_dr25_koi,7.028,20.11,25.8,2,25.0,2.0,q1_q17_dr25_tce,11111111111111111000000000000000,0.002,Mandel and Agol (2002 ApJ 580 171),,,010/010797/010797460/dv/kplr010797460-20160209...,010/010797/010797460/dv/kplr010797460-002-2016...,5455.0,81.0,-81.0,4.467,0.064,-0.096,0.14,0.15,-0.15,0.927,0.105,-0.061,0.919,0.052,-0.046,,,,q1_q17_dr25_stellar,291.934,48.142,15.347,15.89,15.27,15.114,15.006,14.082,13.751,13.648,0.003,19.462,0.0,48.142,0.0,-0.63,0.72,1.23,0.68,0.001,0.001,-0.001,0.001,0.39,0.36,0.0,0.48,0.39,0.36,0.49,0.34,0.12,0.73,0.5,0.45
2,3,10811496,K00753.01,,CANDIDATE,Done,2018-08-16,CANDIDATE,0.0,0,0,0,0,q1_q17_dr25_sup_koi,DEEP_V_SHAPED,19.899,0.0,-0.0,175.85,0.001,-0.001,2455008.85,0.001,-0.001,0.0,,,,,,0.969,5.126,-0.077,1.782,0.034,-0.034,,,,10829.0,171.0,-171.0,0.154,5.034,-0.042,7.296,35.033,-2.755,LS+MCMC,14.6,3.92,-1.31,0.142,,,88.96,,,638.0,,,39.3,31.04,-10.49,53.5,25.7,-25.7,Claret (2011 A&A 529 75) ATLAS LS,0.0,0.0,0.271,0.386,q1_q17_dr25_koi,37.16,187.449,76.3,1,56.0,1.0,q1_q17_dr25_tce,11111101110111011000000000000000,0.662,Mandel and Agol (2002 ApJ 580 171),,,010/010811/010811496/dv/kplr010811496-20160209...,010/010811/010811496/dv/kplr010811496-001-2016...,5853.0,158.0,-176.0,4.544,0.044,-0.176,-0.18,0.3,-0.3,0.868,0.233,-0.078,0.961,0.11,-0.121,,,,q1_q17_dr25_stellar,297.005,48.134,15.436,15.943,15.39,15.22,15.166,14.254,13.9,13.826,0.278,19.8,0.0,48.134,0.0,-0.021,0.069,-0.038,0.071,0.001,0.002,0.001,0.003,-0.025,0.07,-0.034,0.07,0.042,0.072,0.002,0.071,-0.027,0.074,0.027,0.074
3,4,10848459,K00754.01,,FALSE POSITIVE,Done,2018-08-16,FALSE POSITIVE,0.0,0,1,0,0,q1_q17_dr25_sup_koi,MOD_ODDEVEN_DV---MOD_ODDEVEN_ALT---DEEP_V_SHAPED,1.737,0.0,-0.0,170.308,0.0,-0.0,2455003.308,0.0,-0.0,0.0,,,,,,1.276,0.115,-0.092,2.406,0.005,-0.005,,,,8079.2,12.8,-12.8,0.387,0.109,-0.085,0.221,0.009,-0.018,LS+MCMC,33.46,8.5,-2.83,0.027,,,67.09,,,1395.0,,,891.96,668.95,-230.35,3.278,0.136,-0.136,Claret (2011 A&A 529 75) ATLAS LS,0.0,0.0,0.286,0.356,q1_q17_dr25_koi,39.067,541.895,505.6,1,621.0,1.0,q1_q17_dr25_tce,11111110111011101000000000000000,0.0,Mandel and Agol (2002 ApJ 580 171),,,010/010848/010848459/dv/kplr010848459-20160209...,010/010848/010848459/dv/kplr010848459-001-2016...,5805.0,157.0,-174.0,4.564,0.053,-0.168,-0.52,0.3,-0.3,0.791,0.201,-0.067,0.836,0.093,-0.077,,,,q1_q17_dr25_stellar,285.535,48.285,15.597,16.1,15.554,15.382,15.266,14.326,13.911,13.809,0.0,19.036,0.0,48.285,0.0,-0.111,0.031,0.002,0.027,0.003,0.001,-0.001,0.001,-0.249,0.072,0.147,0.078,0.289,0.079,-0.257,0.072,0.099,0.077,0.276,0.076
4,5,10854555,K00755.01,Kepler-664 b,CONFIRMED,Done,2018-08-16,CANDIDATE,1.0,0,0,0,0,q1_q17_dr25_sup_koi,NO_COMMENT,2.526,0.0,-0.0,171.596,0.001,-0.001,2455004.596,0.001,-0.001,0.0,,,,,,0.701,0.235,-0.478,1.655,0.042,-0.042,,,,603.3,16.9,-16.9,0.024,0.004,-0.002,1.986,2.711,-1.745,LS+MCMC,2.75,0.88,-0.35,0.037,,,85.41,,,1406.0,,,926.16,874.33,-314.24,8.75,4.0,-4.0,Claret (2011 A&A 529 75) ATLAS LS,0.0,0.0,0.284,0.366,q1_q17_dr25_koi,4.75,33.192,40.9,1,515.0,1.0,q1_q17_dr25_tce,01111111111111111000000000000000,0.309,Mandel and Agol (2002 ApJ 580 171),,,010/010854/010854555/dv/kplr010854555-20160209...,010/010854/010854555/dv/kplr010854555-001-2016...,6031.0,169.0,-211.0,4.438,0.07,-0.21,0.07,0.25,-0.3,1.046,0.334,-0.133,1.095,0.151,-0.136,,,,q1_q17_dr25_stellar,288.755,48.226,15.509,16.015,15.468,15.292,15.241,14.366,14.064,13.952,0.733,19.25,0.0,48.226,0.0,-0.01,0.35,0.23,0.37,0.0,0.0,-0.0,0.0,0.03,0.19,-0.09,0.18,0.1,0.14,0.07,0.18,0.02,0.16,0.07,0.2



--------------------------------------------------------------------------------
Missing values:


Unnamed: 0,Missing Count,Percentage
koi_ingress_err2,9564,100.000
koi_longp_err2,9564,100.000
koi_teq_err1,9564,100.000
koi_incl_err2,9564,100.000
koi_incl_err1,9564,100.000
...,...,...
koi_kmag,25,0.260
koi_hmag,25,0.260
koi_jmag,25,0.260
koi_rmag,9,0.090
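
Several columns in the table above are 100% missing and carry no information. One possible cleaning step is to drop columns at or above a missing-fraction threshold. This is a sketch only: `drop_sparse_columns` and its `max_missing_frac` parameter are hypothetical names, and the toy values are for illustration.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_frac=1.0):
    """Drop columns whose missing fraction is >= max_missing_frac."""
    missing_frac = df.isna().mean()
    to_drop = missing_frac[missing_frac >= max_missing_frac].index.tolist()
    return df.drop(columns=to_drop), to_drop

# Toy frame: one entirely empty column, one partially missing column
toy = pd.DataFrame({
    "koi_period": [9.488, 54.418, 19.899],
    "koi_teq_err1": [np.nan, np.nan, np.nan],   # entirely empty, dropped
    "koi_kmag": [13.648, np.nan, 13.826],       # partially missing, kept
})
cleaned, dropped = drop_sparse_columns(toy)
print(dropped)
```

Returning the list of dropped names alongside the frame keeps the step auditable; a lower threshold (say 0.9) would be a judgment call that depends on the downstream analysis.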



--------------------------------------------------------------------------------
Basic statistics:


Unnamed: 0,rowid,kepid,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,koi_time0bk_err1,koi_time0bk_err2,koi_time0,koi_time0_err1,koi_time0_err2,koi_eccen,koi_eccen_err1,koi_eccen_err2,koi_longp,koi_longp_err1,koi_longp_err2,koi_impact,koi_impact_err1,koi_impact_err2,koi_duration,koi_duration_err1,koi_duration_err2,koi_ingress,koi_ingress_err1,koi_ingress_err2,koi_depth,koi_depth_err1,koi_depth_err2,koi_ror,koi_ror_err1,koi_ror_err2,koi_srho,koi_srho_err1,koi_srho_err2,koi_prad,koi_prad_err1,koi_prad_err2,koi_sma,koi_sma_err1,koi_sma_err2,koi_incl,koi_incl_err1,koi_incl_err2,koi_teq,koi_teq_err1,koi_teq_err2,koi_insol,koi_insol_err1,koi_insol_err2,koi_dor,koi_dor_err1,koi_dor_err2,koi_ldm_coeff4,koi_ldm_coeff3,koi_ldm_coeff2,koi_ldm_coeff1,koi_max_sngle_ev,koi_max_mult_ev,koi_model_snr,koi_count,koi_num_transits,koi_tce_plnt_num,koi_bin_oedp_sig,koi_model_dof,koi_model_chisq,koi_steff,koi_steff_err1,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_smet,koi_smet_err1,koi_smet_err2,koi_srad,koi_srad_err1,koi_srad_err2,koi_smass,koi_smass_err1,koi_smass_err2,koi_sage,koi_sage_err1,koi_sage_err2,ra,dec,koi_kepmag,koi_gmag,koi_rmag,koi_imag,koi_zmag,koi_jmag,koi_hmag,koi_kmag,koi_fwm_stat_sig,koi_fwm_sra,koi_fwm_sra_err,koi_fwm_sdec,koi_fwm_sdec_err,koi_fwm_srao,koi_fwm_srao_err,koi_fwm_sdeco,koi_fwm_sdeco_err,koi_fwm_prao,koi_fwm_prao_err,koi_fwm_pdeco,koi_fwm_pdeco_err,koi_dicco_mra,koi_dicco_mra_err,koi_dicco_mdec,koi_dicco_mdec_err,koi_dicco_msky,koi_dicco_msky_err,koi_dikco_mra,koi_dikco_mra_err,koi_dikco_mdec,koi_dikco_mdec_err,koi_dikco_msky,koi_dikco_msky_err
count,9564.0,9564.0,8054.0,9564.0,9564.0,9564.0,9564.0,9564.0,9110.0,9110.0,9564.0,9110.0,9110.0,9564.0,9110.0,9110.0,9201.0,0.0,0.0,0.0,0.0,0.0,9201.0,9110.0,9110.0,9564.0,9110.0,9110.0,0.0,0.0,0.0,9201.0,9110.0,9110.0,9201.0,9201.0,9201.0,9243.0,9243.0,9243.0,9201.0,9201.0,9201.0,9201.0,0.0,0.0,9200.0,0.0,0.0,9201.0,0.0,0.0,9243.0,9243.0,9243.0,9201.0,9110.0,9110.0,9201.0,9201.0,9201.0,9201.0,8422.0,8422.0,9201.0,9564.0,8422.0,9218.0,8054.0,0.0,0.0,9201.0,9096.0,9081.0,9201.0,9096.0,9096.0,9178.0,9177.0,9177.0,9201.0,9096.0,9096.0,9201.0,9096.0,9096.0,0.0,0.0,0.0,9564.0,9564.0,9563.0,9523.0,9555.0,9410.0,8951.0,9539.0,9539.0,9539.0,8488.0,9058.0,9058.0,9058.0,9058.0,9109.0,9109.0,9109.0,9109.0,8734.0,8734.0,8747.0,8747.0,8965.0,8965.0,8965.0,8965.0,8965.0,8965.0,8994.0,8994.0,8994.0,8994.0,8994.0,8994.0
mean,4782.5,7690628.327,0.481,0.209,0.233,0.198,0.12,75.671,0.002,-0.002,166.183,0.01,-0.01,2454999.183,0.01,-0.01,0.0,,,,,,0.735,1.96,-0.333,5.622,0.34,-0.34,,,,23791.336,123.198,-123.198,0.284,1.782,-0.101,9.164,18.065,-5.489,102.892,17.658,-33.023,0.224,,,82.469,,,1085.386,,,7745.737,3750.698,-4043.522,76.736,23.679,-23.679,0.0,0.0,0.254,0.408,176.846,1025.665,259.895,1.406,385.007,1.244,0.409,,,5706.823,144.636,-162.265,4.31,0.121,-0.143,-0.124,0.229,-0.252,1.729,0.362,-0.395,1.024,0.123,-0.139,,,,292.06,43.81,14.265,14.831,14.222,14.075,13.992,12.993,12.621,12.543,0.151,19.471,0.0,43.829,0.0,-0.316,0.704,-0.166,0.706,-0.0,0.222,-0.001,0.309,-0.012,0.434,-0.045,0.446,1.867,0.49,-0.024,0.425,-0.077,0.437,1.813,0.476
std,2761.033,2653459.081,0.477,4.767,0.423,0.398,0.325,1334.744,0.008,0.008,67.919,0.023,0.023,67.919,0.023,0.023,0.0,,,,,,3.349,9.422,1.25,6.472,0.67,0.67,,,,82242.683,4112.615,4112.615,3.307,9.407,1.241,53.808,76.801,32.337,3077.639,391.139,1193.52,0.566,,,15.224,,,856.351,,,159204.665,55044.209,88388.311,845.275,298.215,298.215,0.0,0.0,0.065,0.106,770.902,4154.122,795.807,0.873,545.756,0.665,0.501,,,796.858,47.052,72.746,0.433,0.133,0.085,0.282,0.077,0.085,6.127,0.931,2.168,0.349,0.086,0.179,,,,4.767,3.601,1.385,1.502,1.384,1.293,1.23,1.292,1.267,1.268,0.253,0.319,0.0,3.6,0.0,20.255,0.664,20.535,0.662,0.058,9.581,0.093,12.782,2.407,0.601,2.574,0.57,2.989,0.646,2.382,0.602,2.554,0.568,2.986,0.648
min,1.0,757450.0,0.0,0.0,0.0,0.0,0.0,0.242,0.0,-0.172,120.516,0.0,-0.569,2454953.516,0.0,-0.569,0.0,,,,,,0.0,0.0,-59.32,0.052,0.0,-20.2,,,,0.0,0.0,-388600.0,0.001,0.0,-59.326,0.0,0.0,-696.089,0.08,0.0,-77180.0,0.006,,,2.29,,,25.0,,,0.0,0.0,-5600031.33,0.373,0.0,-27952.0,0.0,0.0,-0.121,0.125,2.417,7.105,0.0,1.0,0.0,1.0,-1.0,,,2661.0,0.0,-1762.0,0.047,0.0,-1.207,-2.5,0.0,-0.75,0.109,0.0,-116.137,0.0,0.0,-2.432,,,,279.853,36.577,6.966,7.225,7.101,7.627,6.702,4.097,3.014,2.311,0.0,18.657,0.0,36.577,0.0,-742.43,0.0,-417.9,0.0,-4.0,0.0,-6.0,0.0,-25.1,0.067,-75.9,0.067,0.0,0.067,-27.8,0.067,-76.6,0.067,0.0,0.067
25%,2391.75,5556034.25,0.0,0.0,0.0,0.0,0.0,2.734,0.0,-0.0,132.762,0.001,-0.011,2454965.762,0.001,-0.011,0.0,,,,,,0.197,0.04,-0.445,2.438,0.051,-0.35,,,,159.9,9.6,-49.5,0.012,0.0,-0.005,0.229,0.054,-1.13,1.4,0.23,-1.94,0.038,,,83.92,,,539.0,,,20.15,9.19,-287.31,5.358,0.754,-11.3,0.0,0.0,0.229,0.327,3.998,10.733,12.0,1.0,41.0,1.0,0.135,,,5310.0,106.0,-198.0,4.218,0.042,-0.196,-0.26,0.15,-0.3,0.829,0.129,-0.25,0.845,0.072,-0.141,,,,288.661,40.777,13.44,13.896,13.393,13.294,13.276,12.253,11.915,11.843,0.0,19.244,0.0,40.799,0.0,-0.6,0.17,-0.68,0.17,-0.0,0.0,-0.0,0.0,-0.32,0.094,-0.387,0.098,0.17,0.1,-0.31,0.087,-0.39,0.09,0.21,0.094
50%,4782.5,7906892.0,0.334,0.0,0.0,0.0,0.0,9.753,0.0,-0.0,137.225,0.004,-0.004,2454970.225,0.004,-0.004,0.0,,,,,,0.537,0.193,-0.207,3.793,0.142,-0.142,,,,421.1,20.75,-20.75,0.021,0.001,-0.001,0.957,0.437,-0.224,2.39,0.52,-0.3,0.085,,,88.5,,,878.0,,,141.6,72.83,-40.26,15.46,3.1,-3.1,0.0,0.0,0.271,0.392,5.59,19.254,23.0,1.0,143.0,1.0,0.487,,,5767.0,157.0,-160.0,4.438,0.07,-0.128,-0.1,0.25,-0.3,1.0,0.251,-0.111,0.974,0.106,-0.098,,,,292.261,43.678,14.52,15.064,14.471,14.317,14.254,13.236,12.834,12.744,0.006,19.485,0.0,43.694,0.0,-0.001,0.57,-0.034,0.57,0.0,0.0,0.0,0.0,0.0,0.26,0.0,0.28,0.61,0.31,-0.004,0.25,-0.017,0.27,0.583,0.29
75%,7173.25,9873066.5,0.998,0.0,0.0,0.0,0.0,40.715,0.0,-0.0,170.695,0.011,-0.001,2455003.695,0.011,-0.001,0.0,,,,,,0.889,0.378,-0.046,6.276,0.35,-0.051,,,,1473.4,49.5,-9.6,0.095,0.005,-0.001,2.897,2.483,-0.026,14.93,2.32,-0.14,0.214,,,89.77,,,1379.0,,,870.29,519.415,-5.16,45.37,11.3,-0.754,0.0,0.0,0.3,0.464,16.948,71.998,78.0,1.0,469.0,1.0,0.81,,,6112.0,174.0,-114.0,4.543,0.149,-0.088,0.07,0.3,-0.15,1.345,0.364,-0.069,1.101,0.151,-0.061,,,,295.859,46.715,15.322,15.936,15.275,15.063,14.943,13.968,13.551,13.485,0.196,19.727,0.0,46.721,0.0,0.57,1.1,0.5,1.1,0.0,0.001,0.0,0.001,0.309,0.6,0.3,0.61,2.16,0.68,0.29,0.59,0.3,0.6,1.97,0.66
max,9564.0,12935144.0,1.0,465.0,1.0,1.0,1.0,129995.778,0.172,0.0,1472.522,0.569,-0.0,2456305.522,0.569,-0.0,0.0,,,,,,100.806,85.54,0.0,138.54,20.2,0.0,,,,1541400.0,388600.0,0.0,99.871,85.532,0.0,980.854,835.242,0.0,200346.0,21640.0,0.0,44.989,,,90.0,,,14667.0,,,10947554.55,3617132.59,0.0,79614.0,27952.0,0.0,0.0,0.0,0.482,0.949,22982.162,120049.68,9054.7,7.0,2664.0,8.0,1.0,,,15896.0,676.0,0.0,5.364,1.472,0.0,0.56,0.5,0.0,229.908,33.091,0.0,3.735,1.5,0.0,,,,301.721,52.336,20.003,21.15,19.96,19.9,17.403,17.372,17.615,17.038,1.0,20.115,0.0,52.338,0.003,549.5,11.0,712.5,11.0,1.19,663.0,5.0,870.0,45.68,33.0,27.5,22.0,88.6,32.0,46.57,33.0,34.0,22.0,89.6,32.0
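
One anomaly stands out in the statistics above: `koi_fpflag_nt` has a maximum of 465, while the other three false-positive flags top out at 1. Assuming these flags are meant to be binary, a check along the following lines could surface the offending rows. `find_invalid_flags` is a hypothetical helper, and the toy identifiers are made up.

```python
import pandas as pd

FLAG_COLUMNS = ["koi_fpflag_nt", "koi_fpflag_ss", "koi_fpflag_co", "koi_fpflag_ec"]

def find_invalid_flags(df, flag_columns=FLAG_COLUMNS):
    """Return rows where any flag column holds a value other than 0 or 1."""
    present = [c for c in flag_columns if c in df.columns]
    invalid = ~df[present].isin([0, 1])
    # Missing flags are a separate issue, so do not report them here
    invalid &= df[present].notna()
    return df[invalid.any(axis=1)]

# Toy frame with one out-of-range flag (identifiers are placeholders)
toy = pd.DataFrame({
    "kepoi_name": ["KOI-A", "KOI-B", "KOI-C"],
    "koi_fpflag_nt": [0, 465, 1],
    "koi_fpflag_ss": [0, 0, 1],
    "koi_fpflag_co": [0, 0, 0],
    "koi_fpflag_ec": [0, 0, 0],
})
bad = find_invalid_flags(toy)
print(bad[["kepoi_name", "koi_fpflag_nt"]])
```

Whether such rows should be dropped, clipped, or set to missing is left open here; the sketch only locates them.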


### 3.2 K2 Dataset Exploration

In [15]:
# K2: Basic information
print("=" * 80)
print("K2 DATASET EXPLORATION")
print("=" * 80)

print(f"\nDataset shape: {k2_df.shape[0]:,} rows × {k2_df.shape[1]} columns\n")

print("Column names and types:")
print(k2_df.dtypes)

print("\n" + "-" * 80)
print("First few rows:")
display(k2_df.head())

print("\n" + "-" * 80)
print("Missing values:")
missing_k2 = k2_df.isnull().sum()
missing_pct_k2 = (missing_k2 / len(k2_df) * 100).round(2)
missing_df_k2 = pd.DataFrame({
    'Missing Count': missing_k2[missing_k2 > 0],
    'Percentage': missing_pct_k2[missing_k2 > 0]
}).sort_values('Percentage', ascending=False)
display(missing_df_k2)

print("\n" + "-" * 80)
print("Basic statistics:")
display(k2_df.describe())

K2 DATASET EXPLORATION

Dataset shape: 4,004 rows × 295 columns

Column names and types:
rowid             int64
pl_name          object
hostname         object
pl_letter        object
k2_name          object
                 ...   
st_nrvc         float64
st_nspec        float64
pl_nespec       float64
pl_ntranspec    float64
pl_ndispec      float64
Length: 295, dtype: object

--------------------------------------------------------------------------------
First few rows:


Unnamed: 0,rowid,pl_name,hostname,pl_letter,k2_name,epic_hostname,epic_candname,hd_name,hip_name,tic_id,gaia_id,default_flag,disposition,disp_refname,sy_snum,sy_pnum,sy_mnum,cb_flag,discoverymethod,disc_year,disc_refname,disc_pubdate,disc_locale,disc_facility,disc_telescope,disc_instrument,rv_flag,pul_flag,ptv_flag,tran_flag,ast_flag,obm_flag,micro_flag,etv_flag,ima_flag,dkin_flag,soltype,pl_controv_flag,pl_refname,pl_orbper,pl_orbpererr1,pl_orbpererr2,pl_orbperlim,pl_orbsmax,pl_orbsmaxerr1,pl_orbsmaxerr2,pl_orbsmaxlim,pl_rade,pl_radeerr1,pl_radeerr2,pl_radelim,pl_radj,pl_radjerr1,pl_radjerr2,pl_radjlim,pl_masse,pl_masseerr1,pl_masseerr2,pl_masselim,pl_massj,pl_massjerr1,pl_massjerr2,pl_massjlim,pl_msinie,pl_msinieerr1,pl_msinieerr2,pl_msinielim,pl_msinij,pl_msinijerr1,pl_msinijerr2,pl_msinijlim,pl_cmasse,pl_cmasseerr1,pl_cmasseerr2,pl_cmasselim,pl_cmassj,pl_cmassjerr1,pl_cmassjerr2,pl_cmassjlim,pl_bmasse,pl_bmasseerr1,pl_bmasseerr2,pl_bmasselim,pl_bmassj,pl_bmassjerr1,pl_bmassjerr2,pl_bmassjlim,pl_bmassprov,pl_dens,pl_denserr1,pl_denserr2,pl_denslim,pl_orbeccen,pl_orbeccenerr1,pl_orbeccenerr2,pl_orbeccenlim,pl_insol,pl_insolerr1,pl_insolerr2,pl_insollim,pl_eqt,pl_eqterr1,pl_eqterr2,pl_eqtlim,pl_orbincl,pl_orbinclerr1,pl_orbinclerr2,pl_orbincllim,pl_tranmid,pl_tranmiderr1,pl_tranmiderr2,pl_tranmidlim,pl_tsystemref,ttv_flag,pl_imppar,pl_impparerr1,pl_impparerr2,pl_impparlim,pl_trandep,pl_trandeperr1,pl_trandeperr2,pl_trandeplim,pl_trandur,pl_trandurerr1,pl_trandurerr2,pl_trandurlim,pl_ratdor,pl_ratdorerr1,pl_ratdorerr2,pl_ratdorlim,pl_ratror,pl_ratrorerr1,pl_ratrorerr2,pl_ratrorlim,pl_occdep,pl_occdeperr1,pl_occdeperr2,pl_occdeplim,pl_orbtper,pl_orbtpererr1,pl_orbtpererr2,pl_orbtperlim,pl_orblper,pl_orblpererr1,pl_orblpererr2,pl_orblperlim,pl_rvamp,pl_rvamperr1,pl_rvamperr2,pl_rvamplim,pl_projobliq,pl_projobliqerr1,pl_projobliqerr2,pl_projobliqlim,pl_trueobliq,pl_trueobliqerr1,pl_trueobliqerr2,pl_trueobliqlim,st_refname,st_spectype,st_teff,st_tefferr1,st_tefferr2,st_tefflim,st_rad,st_raderr1,st_raderr2,st_radlim,st_mass,st_masserr1,st_masserr2,st_masslim,st_met,st_meterr1,st_meterr2,st_metlim,st_metratio,st_lum,st_lumerr1,st_lumerr2,st_lumlim,st_logg,st_loggerr1,st_loggerr2,st_logglim,st_age,st_ageerr1,st_ageerr2,st_agelim,st_dens,st_denserr1,st_denserr2,st_denslim,st_vsin,st_vsinerr1,st_vsinerr2,st_vsinlim,st_rotp,st_rotperr1,st_rotperr2,st_rotplim,st_radv,st_radverr1,st_radverr2,st_radvlim,sy_refname,rastr,ra,decstr,dec,glat,glon,elat,elon,sy_pm,sy_pmerr1,sy_pmerr2,sy_pmra,sy_pmraerr1,sy_pmraerr2,sy_pmdec,sy_pmdecerr1,sy_pmdecerr2,sy_dist,sy_disterr1,sy_disterr2,sy_plx,sy_plxerr1,sy_plxerr2,sy_bmag,sy_bmagerr1,sy_bmagerr2,sy_vmag,sy_vmagerr1,sy_vmagerr2,sy_jmag,sy_jmagerr1,sy_jmagerr2,sy_hmag,sy_hmagerr1,sy_hmagerr2,sy_kmag,sy_kmagerr1,sy_kmagerr2,sy_umag,sy_umagerr1,sy_umagerr2,sy_gmag,sy_gmagerr1,sy_gmagerr2,sy_rmag,sy_rmagerr1,sy_rmagerr2,sy_imag,sy_imagerr1,sy_imagerr2,sy_zmag,sy_zmagerr1,sy_zmagerr2,sy_w1mag,sy_w1magerr1,sy_w1magerr2,sy_w2mag,sy_w2magerr1,sy_w2magerr2,sy_w3mag,sy_w3magerr1,sy_w3magerr2,sy_w4mag,sy_w4magerr1,sy_w4magerr2,sy_gaiamag,sy_gaiamagerr1,sy_gaiamagerr2,sy_icmag,sy_icmagerr1,sy_icmagerr2,sy_tmag,sy_tmagerr1,sy_tmagerr2,sy_kepmag,sy_kepmagerr1,sy_kepmagerr2,rowupdate,pl_pubdate,releasedate,pl_nnotes,k2_campaigns,k2_campaigns_num,st_nphot,st_nrvc,st_nspec,pl_nespec,pl_ntranspec,pl_ndispec
0,1,BD+20 594 b,BD+20 594,b,K2-56 b,EPIC 210848071,EPIC 210848071.01,,,TIC 26123781,Gaia DR2 58200934326315136,0,CONFIRMED,Espinoza et al. 2016,1.0,1.0,0.0,0.0,Transit,2016.0,<a refstr=ESPINOZA_ET_AL__2016 href=https://ui...,2016-10,Space,K2,0.95 m Kepler Telescope,Kepler CCD Array,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,Published Confirmed,0.0,<a refstr=MAYO_ET_AL__2018 href=https://ui.ads...,41.689,0.003,-0.003,0.0,,,,,2.355,0.31,-0.167,0.0,0.21,0.028,-0.015,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,89.526,0.329,-0.535,0.0,2457068.529,0.003,-0.003,0.0,,0.0,,,,,,,,,,,,,54.721,5.823,-12.996,0.0,0.023,0.002,-0.001,0.0,,,,,,,,,,,,,,,,,,,,,,,,,<a refstr=MAYO_ET_AL__2018 href=https://ui.ads...,,5703.0,50.0,-50.0,0.0,0.956,0.099,-0.055,0.0,0.964,0.03,-0.032,0.0,-0.06,0.08,-0.08,0.0,[Fe/H],,,,,4.38,0.1,-0.1,0.0,,,,,,,,,,,,,,,,,,,,,<a refstr=STASSUN_ET_AL__2019 href=https://ui....,03h34m36.27s,53.651,+20d35m56.47s,20.599,-28.053,166.797,1.312,56.289,63.054,0.06,-0.06,36.571,0.075,-0.075,-51.365,0.051,-0.051,179.461,1.257,-1.24,5.544,0.038,-0.038,11.765,0.161,-0.161,10.849,0.012,-0.012,9.77,0.022,-0.022,9.432,0.022,-0.022,9.368,0.018,-0.018,,,,,,,,,,,,,,,,9.31,0.024,-0.024,9.344,0.02,-0.02,9.332,0.038,-0.038,8.976,0.527,-0.527,10.864,0.0,-0.0,,,,10.402,0.006,-0.006,11.04,,,2018-04-25,2018-03,2018-02-15,1.0,4,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,BD+20 594 b,BD+20 594,b,K2-56 b,EPIC 210848071,EPIC 210848071.01,,,TIC 26123781,Gaia DR2 58200934326315136,0,CONFIRMED,Espinoza et al. 2016,1.0,1.0,0.0,0.0,Transit,2016.0,<a refstr=ESPINOZA_ET_AL__2016 href=https://ui...,2016-10,Space,K2,0.95 m Kepler Telescope,Kepler CCD Array,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,Published Confirmed,0.0,<a refstr=ESPINOZA_ET_AL__2016 href=https://ui...,41.685,0.003,-0.003,0.0,0.241,0.019,-0.017,0.0,2.23,0.14,-0.11,0.0,0.199,0.012,-0.01,0.0,16.3,6.0,-6.1,0.0,0.051,0.019,-0.019,0.0,,,,,,,,,,,,,,,,,16.3,6.0,-6.1,0.0,0.051,0.019,-0.019,0.0,Mass,7.89,3.4,-3.1,0.0,0.0,,,0.0,,,,,546.0,19.0,-18.0,0.0,89.55,0.17,-0.14,0.0,2457151.902,0.004,-0.005,0.0,BJD-TDB,0.0,,,,,,,,,,,,,55.8,3.3,-3.3,0.0,0.022,0.001,-0.001,0.0,,,,,,,,,,,,,3.1,1.1,-1.1,0.0,,,,,,,,,<a refstr=ESPINOZA_ET_AL__2016 href=https://ui...,G,5766.0,99.0,-99.0,0.0,0.928,0.055,-0.04,0.0,0.961,0.032,-0.029,0.0,-0.15,0.05,-0.05,0.0,[Fe/H],-0.056,0.068,-0.064,0.0,4.5,0.08,-0.08,0.0,3.34,1.95,-1.49,0.0,1.7,0.2,-0.26,0.0,3.3,0.31,-0.31,0.0,,,,,-20.336,0.001,-0.001,0.0,<a refstr=STASSUN_ET_AL__2019 href=https://ui....,03h34m36.27s,53.651,+20d35m56.47s,20.599,-28.053,166.797,1.312,56.289,63.054,0.06,-0.06,36.571,0.075,-0.075,-51.365,0.051,-0.051,179.461,1.257,-1.24,5.544,0.038,-0.038,11.765,0.161,-0.161,10.849,0.012,-0.012,9.77,0.022,-0.022,9.432,0.022,-0.022,9.368,0.018,-0.018,,,,,,,,,,,,,,,,9.31,0.024,-0.024,9.344,0.02,-0.02,9.332,0.038,-0.038,8.976,0.527,-0.527,10.864,0.0,-0.0,,,,10.402,0.006,-0.006,11.04,,,2018-04-25,2016-10,2016-07-28,1.0,4,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,BD+20 594 b,BD+20 594,b,K2-56 b,EPIC 210848071,EPIC 210848071.01,,,TIC 26123781,Gaia DR2 58200934326315136,1,CONFIRMED,Espinoza et al. 2016,1.0,1.0,0.0,0.0,Transit,2016.0,<a refstr=ESPINOZA_ET_AL__2016 href=https://ui...,2016-10,Space,K2,0.95 m Kepler Telescope,Kepler CCD Array,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,Published Confirmed,0.0,<a refstr=STASSUN_ET_AL__2017 href=https://ui....,41.685,0.003,-0.003,0.0,,,,,2.578,0.112,-0.112,0.0,0.23,0.01,-0.01,0.0,22.248,9.535,-9.535,0.0,0.07,0.03,-0.03,0.0,,,,,,,,,,,,,,,,,22.248,9.535,-9.535,0.0,0.07,0.03,-0.03,0.0,Mass,,,,,0.0,,,0.0,,,,,,,,,89.55,0.16,-0.16,0.0,,,,,,0.0,,,,,0.049,0.003,-0.003,0.0,,,,,55.8,3.3,-3.3,0.0,,,,,,,,,,,,,,,,,3.1,1.1,-1.1,0.0,,,,,,,,,<a refstr=STASSUN_ET_AL__2017 href=https://ui....,,5766.0,99.0,-99.0,0.0,1.08,0.06,-0.06,0.0,1.67,0.4,-0.4,0.0,-0.15,,,0.0,[Fe/H],,,,,4.5,0.08,-0.08,0.0,,,,,1.89,0.34,-0.34,0.0,,,,,,,,,,,,,<a refstr=STASSUN_ET_AL__2019 href=https://ui....,03h34m36.27s,53.651,+20d35m56.47s,20.599,-28.053,166.797,1.312,56.289,63.054,0.06,-0.06,36.571,0.075,-0.075,-51.365,0.051,-0.051,179.461,1.257,-1.24,5.544,0.038,-0.038,11.765,0.161,-0.161,10.849,0.012,-0.012,9.77,0.022,-0.022,9.432,0.022,-0.022,9.368,0.018,-0.018,,,,,,,,,,,,,,,,9.31,0.024,-0.024,9.344,0.02,-0.02,9.332,0.038,-0.038,8.976,0.527,-0.527,10.864,0.0,-0.0,,,,10.402,0.006,-0.006,11.04,,,2018-04-25,2017-03,2018-04-26,1.0,4,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,EPIC 201111557.01,EPIC 201111557,,,EPIC 201111557,EPIC 201111557.01,,,TIC 176942156,Gaia DR2 3596276829630866432,1,CANDIDATE,Livingston et al. 2018,1.0,0.0,0.0,0.0,Transit,2018.0,<a refstr=MAYO_ET_AL__2018 href=https://ui.ads...,2018-03,Space,K2,0.95 m Kepler Telescope,Kepler CCD Array,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,Published Candidate,0.0,<a refstr=LIVINGSTON_ET_AL__2018 href=https://...,2.302,0.0,-0.0,0.0,,,,,1.12,0.11,-0.08,0.0,0.1,0.01,-0.007,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1054.0,55.0,-55.0,0.0,,,,,2457583.169,0.005,-0.005,0.0,BJD,0.0,0.42,0.33,-0.28,0.0,2.268,,,0.0,1.901,,,0.0,11.8,2.1,-3.0,0.0,0.014,0.001,-0.001,0.0,,,,,,,,,,,,,,,,,,,,,,,,,<a refstr=STASSUN_ET_AL__2019 href=https://ui....,,4616.52,82.36,-115.56,0.0,0.763,0.054,-0.038,0.0,0.73,0.084,-0.081,0.0,-0.03,0.034,-0.034,0.0,[M/H],-0.623,0.018,-0.014,0.0,4.537,0.075,-0.091,0.0,,,,,2.321,0.536,-0.566,0.0,,,,,,,,,,,,,<a refstr=STASSUN_ET_AL__2019 href=https://ui....,12h15m23.10s,183.846,-06d16m05.98s,-6.268,55.483,286.981,-4.224,186.017,73.414,0.096,-0.096,-70.261,0.099,-0.099,-21.284,0.059,-0.059,97.18,0.464,-0.46,10.262,0.049,-0.049,12.737,0.028,-0.028,11.727,0.046,-0.046,9.873,0.023,-0.023,9.391,0.023,-0.023,9.22,0.019,-0.019,,,,,,,,,,,,,,,,9.186,0.022,-0.022,9.243,0.019,-0.019,9.183,0.039,-0.039,8.2,,,11.399,0.001,-0.001,,,,10.761,0.006,-0.006,11.363,,,2018-08-02,2018-08,2018-08-02,0.0,10,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,EPIC 201111557.01,EPIC 201111557,,,EPIC 201111557,EPIC 201111557.01,,,TIC 176942156,Gaia DR2 3596276829630866432,0,CANDIDATE,Livingston et al. 2018,1.0,0.0,0.0,0.0,Transit,2018.0,<a refstr=MAYO_ET_AL__2018 href=https://ui.ads...,2018-03,Space,K2,0.95 m Kepler Telescope,Kepler CCD Array,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,Published Candidate,0.0,<a refstr=MAYO_ET_AL__2018 href=https://ui.ads...,2.302,0.0,-0.0,0.0,,,,,1.313,0.524,-0.121,0.0,0.12,0.05,-0.01,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,87.444,1.875,-9.524,0.0,2457583.162,0.002,-0.002,0.0,BJD,0.0,,,,,,,,,,,,,12.62,2.838,-8.019,0.0,0.017,0.007,-0.001,0.0,,,,,,,,,,,,,,,,,,,,,,,,,<a refstr=MAYO_ET_AL__2018 href=https://ui.ads...,,4720.0,50.0,-50.0,0.0,0.711,0.019,-0.02,0.0,,,,,-0.06,0.08,-0.08,0.0,[Fe/H],,,,,4.5,0.1,-0.1,0.0,,,,,,,,,,,,,,,,,,,,,<a refstr=STASSUN_ET_AL__2019 href=https://ui....,12h15m23.10s,183.846,-06d16m05.98s,-6.268,55.483,286.981,-4.224,186.017,73.414,0.096,-0.096,-70.261,0.099,-0.099,-21.284,0.059,-0.059,97.18,0.464,-0.46,10.262,0.049,-0.049,12.737,0.028,-0.028,11.727,0.046,-0.046,9.873,0.023,-0.023,9.391,0.023,-0.023,9.22,0.019,-0.019,,,,,,,,,,,,,,,,9.186,0.022,-0.022,9.243,0.019,-0.019,9.183,0.039,-0.039,8.2,,,11.399,0.001,-0.001,,,,10.761,0.006,-0.006,11.363,,,2018-02-15,2018-03,2018-02-15,0.0,10,1.0,0.0,0.0,0.0,0.0,0.0,0.0
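
The preview above shows that the K2 table holds several rows per planet (one per published parameter set), with `default_flag` marking the set designated as the default. A sketch of reducing the table to one row per planet follows, assuming `pl_name` is a suitable key and that a default row is preferred whenever one exists. `select_default_rows` is a hypothetical helper.

```python
import pandas as pd

def select_default_rows(df, key="pl_name", flag="default_flag"):
    """Keep one row per planet, preferring the row flagged as default."""
    # Stable sort puts flagged rows first within the original ordering,
    # so drop_duplicates(keep="first") retains the default row when present.
    ordered = df.sort_values(flag, ascending=False, kind="stable")
    return ordered.drop_duplicates(subset=key, keep="first").sort_index()

# Toy frame mirroring the first five rows shown above
toy = pd.DataFrame({
    "pl_name": ["BD+20 594 b"] * 3 + ["EPIC 201111557.01"] * 2,
    "default_flag": [0, 0, 1, 1, 0],
    "pl_orbper": [41.689, 41.685, 41.685, 2.302, 2.302],
})
k2_default = select_default_rows(toy)
print(k2_default)
```

Unlike a plain `df[df["default_flag"] == 1]` filter, this approach still keeps a row for any planet that happens to have no default entry.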



--------------------------------------------------------------------------------
Missing values:


Unnamed: 0,Missing Count,Percentage
sy_kepmagerr2,4004,100.000
sy_kepmagerr1,4004,100.000
sy_icmagerr2,4004,100.000
sy_icmagerr1,4004,100.000
sy_icmag,4004,100.000
...,...,...
disc_year,17,0.420
discoverymethod,17,0.420
rv_flag,17,0.420
tic_id,4,0.100
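
The missing-value summary is computed identically for each dataset, so it could be factored into a small function. This is a sketch: `missing_summary` is a hypothetical name, and the toy frame uses placeholder values.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Tabulate missing counts and percentages for columns with any gaps."""
    counts = df.isna().sum()
    counts = counts[counts > 0]
    summary = pd.DataFrame({
        "Missing Count": counts,
        "Percentage": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("Percentage", ascending=False)

# Toy frame: one complete column, one empty column, one with a single gap
toy = pd.DataFrame({
    "pl_name": ["a", "b", "c", "d"],
    "sy_icmag": [np.nan] * 4,
    "tic_id": ["TIC 1", None, "TIC 3", "TIC 4"],
})
print(missing_summary(toy))
```

Each exploration cell could then call `display(missing_summary(kepler_df))`, `display(missing_summary(k2_df))`, and so on.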



--------------------------------------------------------------------------------
Basic statistics:


Unnamed: 0,rowid,default_flag,sy_snum,sy_pnum,sy_mnum,cb_flag,disc_year,rv_flag,pul_flag,ptv_flag,tran_flag,ast_flag,obm_flag,micro_flag,etv_flag,ima_flag,dkin_flag,pl_controv_flag,pl_orbper,pl_orbpererr1,pl_orbpererr2,pl_orbperlim,pl_orbsmax,pl_orbsmaxerr1,pl_orbsmaxerr2,pl_orbsmaxlim,pl_rade,pl_radeerr1,pl_radeerr2,pl_radelim,pl_radj,pl_radjerr1,pl_radjerr2,pl_radjlim,pl_masse,pl_masseerr1,pl_masseerr2,pl_masselim,pl_massj,pl_massjerr1,pl_massjerr2,pl_massjlim,pl_msinie,pl_msinieerr1,pl_msinieerr2,pl_msinielim,pl_msinij,pl_msinijerr1,pl_msinijerr2,pl_msinijlim,pl_cmasse,pl_cmasseerr1,pl_cmasseerr2,pl_cmasselim,pl_cmassj,pl_cmassjerr1,pl_cmassjerr2,pl_cmassjlim,pl_bmasse,pl_bmasseerr1,pl_bmasseerr2,pl_bmasselim,pl_bmassj,pl_bmassjerr1,pl_bmassjerr2,pl_bmassjlim,pl_dens,pl_denserr1,pl_denserr2,pl_denslim,pl_orbeccen,pl_orbeccenerr1,pl_orbeccenerr2,pl_orbeccenlim,pl_insol,pl_insolerr1,pl_insolerr2,pl_insollim,pl_eqt,pl_eqterr1,pl_eqterr2,pl_eqtlim,pl_orbincl,pl_orbinclerr1,pl_orbinclerr2,pl_orbincllim,pl_tranmid,pl_tranmiderr1,pl_tranmiderr2,pl_tranmidlim,ttv_flag,pl_imppar,pl_impparerr1,pl_impparerr2,pl_impparlim,pl_trandep,pl_trandeperr1,pl_trandeperr2,pl_trandeplim,pl_trandur,pl_trandurerr1,pl_trandurerr2,pl_trandurlim,pl_ratdor,pl_ratdorerr1,pl_ratdorerr2,pl_ratdorlim,pl_ratror,pl_ratrorerr1,pl_ratrorerr2,pl_ratrorlim,pl_occdep,pl_occdeperr1,pl_occdeperr2,pl_occdeplim,pl_orbtper,pl_orbtpererr1,pl_orbtpererr2,pl_orbtperlim,pl_orblper,pl_orblpererr1,pl_orblpererr2,pl_orblperlim,pl_rvamp,pl_rvamperr1,pl_rvamperr2,pl_rvamplim,pl_projobliq,pl_projobliqerr1,pl_projobliqerr2,pl_projobliqlim,pl_trueobliq,pl_trueobliqerr1,pl_trueobliqerr2,pl_trueobliqlim,st_teff,st_tefferr1,st_tefferr2,st_tefflim,st_rad,st_raderr1,st_raderr2,st_radlim,st_mass,st_masserr1,st_masserr2,st_masslim,st_met,st_meterr1,st_meterr2,st_metlim,st_lum,st_lumerr1,st_lumerr2,st_lumlim,st_logg,st_loggerr1,st_loggerr2,st_logglim,st_age,st_ageerr1,st_ageerr2,st_agelim,st_dens,st_denserr1,st_denserr2,st_denslim,st_vsin,st_vsinerr1,st_vsinerr2,st_vsinlim,st_rotp,st_rotperr1,st_rotperr2,st_rotplim,st_radv,st_radverr1,st_radverr2,st_radvlim,ra,dec,glat,glon,elat,elon,sy_pm,sy_pmerr1,sy_pmerr2,sy_pmra,sy_pmraerr1,sy_pmraerr2,sy_pmdec,sy_pmdecerr1,sy_pmdecerr2,sy_dist,sy_disterr1,sy_disterr2,sy_plx,sy_plxerr1,sy_plxerr2,sy_bmag,sy_bmagerr1,sy_bmagerr2,sy_vmag,sy_vmagerr1,sy_vmagerr2,sy_jmag,sy_jmagerr1,sy_jmagerr2,sy_hmag,sy_hmagerr1,sy_hmagerr2,sy_kmag,sy_kmagerr1,sy_kmagerr2,sy_umag,sy_umagerr1,sy_umagerr2,sy_gmag,sy_gmagerr1,sy_gmagerr2,sy_rmag,sy_rmagerr1,sy_rmagerr2,sy_imag,sy_imagerr1,sy_imagerr2,sy_zmag,sy_zmagerr1,sy_zmagerr2,sy_w1mag,sy_w1magerr1,sy_w1magerr2,sy_w2mag,sy_w2magerr1,sy_w2magerr2,sy_w3mag,sy_w3magerr1,sy_w3magerr2,sy_w4mag,sy_w4magerr1,sy_w4magerr2,sy_gaiamag,sy_gaiamagerr1,sy_gaiamagerr2,sy_icmag,sy_icmagerr1,sy_icmagerr2,sy_tmag,sy_tmagerr1,sy_tmagerr2,sy_kepmag,sy_kepmagerr1,sy_kepmagerr2,pl_nnotes,k2_campaigns_num,st_nphot,st_nrvc,st_nspec,pl_nespec,pl_ntranspec,pl_ndispec
count,4004.0,4004.0,3987.0,3987.0,3987.0,3987.0,3987.0,3987.0,3987.0,3987.0,3987.0,3987.0,3987.0,3987.0,3987.0,3987.0,3987.0,3987.0,3937.0,3052.0,3052.0,3937.0,812.0,805.0,805.0,812.0,3159.0,2874.0,2874.0,3159.0,3159.0,2874.0,2874.0,3159.0,408.0,366.0,366.0,408.0,408.0,366.0,366.0,408.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,428.0,386.0,386.0,428.0,428.0,386.0,386.0,428.0,352.0,327.0,327.0,352.0,422.0,226.0,226.0,422.0,629.0,447.0,447.0,629.0,845.0,677.0,677.0,845.0,987.0,945.0,945.0,987.0,3916.0,3019.0,3019.0,3916.0,3981.0,1478.0,1076.0,1076.0,1478.0,2085.0,1295.0,1295.0,2085.0,2767.0,1857.0,1854.0,2767.0,2191.0,2151.0,2151.0,2191.0,3004.0,2603.0,2603.0,3004.0,1.0,0.0,0.0,1.0,31.0,30.0,30.0,31.0,246.0,215.0,215.0,246.0,300.0,278.0,278.0,300.0,28.0,28.0,28.0,28.0,10.0,10.0,10.0,10.0,2877.0,2570.0,2564.0,2877.0,3856.0,3226.0,3220.0,3856.0,2089.0,1933.0,1927.0,2089.0,1691.0,1660.0,1660.0,1691.0,771.0,613.0,613.0,771.0,2347.0,2134.0,2134.0,2347.0,310.0,278.0,278.0,310.0,929.0,772.0,772.0,929.0,388.0,321.0,315.0,388.0,169.0,169.0,169.0,169.0,176.0,176.0,170.0,176.0,3981.0,3981.0,3981.0,3981.0,3981.0,3981.0,3940.0,3940.0,3940.0,3940.0,3940.0,3940.0,3940.0,3940.0,3940.0,3856.0,3725.0,3725.0,3734.0,3734.0,3734.0,3844.0,3844.0,3844.0,3939.0,3939.0,3939.0,3958.0,3956.0,3956.0,3958.0,3950.0,3950.0,3958.0,3950.0,3950.0,1798.0,1798.0,1798.0,1798.0,1798.0,1798.0,1798.0,1798.0,1798.0,1798.0,1798.0,1798.0,1798.0,1798.0,1798.0,3846.0,3846.0,3846.0,3846.0,3846.0,3846.0,3844.0,3538.0,3538.0,3843.0,604.0,604.0,3925.0,3925.0,3925.0,0.0,0.0,0.0,3977.0,3977.0,3977.0,3966.0,0.0,0.0,3981.0,3977.0,3981.0,3981.0,3981.0,3981.0,3981.0,3981.0
mean,2002.5,0.451,1.051,1.127,0.0,0.0,2017.554,0.241,0.0,0.0,0.994,0.0,0.0,0.001,0.0,0.0,0.0,0.0,40.476,14.278,-9.203,-0.001,0.105,0.006,-0.006,-0.004,8.445,2.871,-2.791,0.0,0.754,0.257,-0.249,0.0,119.137,14.647,-12.89,0.098,0.375,0.046,-0.041,0.098,265.133,34.072,-31.428,0.0,0.835,0.107,-0.099,0.0,145.193,51.185,-51.185,0.0,0.458,0.161,-0.161,0.0,127.745,14.792,-13.063,0.093,0.402,0.046,-0.041,0.093,3.991,1.174,-0.995,0.065,0.123,0.113,-0.075,0.192,384.226,49.137,-46.002,0.0,903.572,29.394,-28.045,0.004,87.6,1.526,-2.03,-0.004,2457393.661,0.053,-0.054,0.0,0.044,0.439,0.193,-0.201,0.0,2.455,21.194,-21.194,0.0,3.23,0.189,-0.207,0.0,22.487,4.146,-3.861,-0.001,0.061,0.015,-0.013,0.0,0.024,,,1.0,2457757.152,2.885,-3.191,0.0,72.035,71.619,-76.782,0.0,42.201,6.543,-3.372,0.067,11.47,20.91,-21.127,0.0,61.251,10.027,-8.894,0.0,5130.703,110.466,-109.85,0.0,1.177,0.118,-0.122,0.0,0.872,0.114,-0.091,0.0,-0.031,0.098,-0.098,0.0,-0.454,0.043,-0.053,0.0,4.436,0.08,-0.082,0.0,4.787,2.058,-1.814,-0.081,4.734,0.555,-0.522,0.0,4.9,0.768,-0.759,0.149,19.589,1.654,-1.678,0.0,10.891,0.438,-0.451,0.0,179.504,1.304,9.442,195.186,-1.232,179.279,57.247,0.243,-0.243,9.346,0.254,-0.254,-24.552,0.229,-0.229,391.947,23.823,-19.15,5.497,0.056,-0.056,14.042,0.104,-0.104,13.182,0.106,-0.106,11.277,0.025,-0.025,10.835,0.027,-0.027,10.714,0.024,-0.024,16.593,0.149,-0.149,14.612,0.06,-0.06,13.625,0.018,-0.018,13.331,0.004,-0.004,13.355,0.007,-0.007,10.673,0.024,-0.024,10.698,0.023,-0.023,10.542,0.139,-0.139,8.57,0.338,-0.338,12.844,0.001,-0.001,,,,12.198,0.009,-0.009,12.814,,,0.405,1.307,0.034,0.006,0.0,0.006,0.144,0.0
std,1156.0,0.498,0.253,1.384,0.0,0.0,2.033,0.428,0.0,0.0,0.077,0.0,0.0,0.022,0.0,0.0,0.0,0.016,1346.668,579.45,406.258,0.042,0.283,0.041,0.041,0.061,30.083,16.089,14.727,0.018,2.687,1.447,1.309,0.018,291.92,51.279,35.954,0.306,0.918,0.161,0.113,0.306,536.877,77.218,75.028,0.0,1.689,0.243,0.236,0.0,263.972,108.971,108.971,0.0,0.831,0.343,0.343,0.0,317.445,50.66,35.795,0.299,0.999,0.159,0.113,0.299,5.053,1.527,1.248,0.247,0.143,0.095,0.059,0.394,992.227,162.912,160.025,0.0,445.301,32.114,28.954,0.06,4.761,3.166,3.321,0.064,687.252,1.619,1.643,0.0,0.205,0.269,0.117,0.119,0.0,8.508,325.179,325.179,0.0,2.307,0.277,0.28,0.0,20.0,6.925,6.375,0.037,0.087,0.055,0.041,0.0,,,,,398.159,10.916,12.745,0.0,101.279,46.589,54.441,0.0,85.275,53.827,8.66,0.25,70.402,16.094,15.793,0.0,47.998,8.667,6.441,0.0,1233.586,146.291,147.57,0.0,2.409,0.615,0.764,0.0,0.427,0.247,0.205,0.0,0.228,0.079,0.079,0.0,0.948,0.037,0.067,0.0,0.32,0.084,0.103,0.0,3.27,1.729,1.565,0.273,12.586,1.143,1.073,0.0,7.785,0.465,0.469,0.357,11.709,2.241,2.261,0.0,30.29,2.469,2.511,0.0,94.713,15.191,42.04,99.229,3.8,94.269,114.537,0.888,0.888,106.373,0.888,0.888,66.267,0.889,0.889,545.255,133.304,92.096,5.47,0.093,0.093,2.065,0.132,0.132,1.901,0.146,0.146,1.452,0.009,0.009,1.418,0.011,0.011,1.411,0.012,0.012,2.039,2.375,2.375,1.945,1.667,1.667,1.796,0.439,0.439,1.578,0.009,0.009,1.064,0.015,0.015,1.38,0.01,0.01,1.37,0.011,0.011,1.214,0.116,0.116,0.429,0.127,0.127,1.739,0.001,0.001,,,,1.611,0.026,0.026,1.64,,,0.826,0.575,0.181,0.076,0.0,0.104,1.069,0.0
min,1.0,0.0,1.0,0.0,0.0,0.0,2011.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.176,0.0,-22320.0,-1.0,0.003,0.0,-1.0,-1.0,0.406,0.0,-367.2,0.0,0.036,0.0,-32.76,0.0,0.297,0.012,-381.396,-1.0,0.001,0.0,-1.2,-1.0,2.42,0.45,-381.396,0.0,0.008,0.001,-1.2,0.0,0.29,0.17,-382.232,0.0,0.001,0.001,-1.203,0.0,0.29,0.012,-381.396,-1.0,0.001,0.0,-1.2,-1.0,0.032,0.015,-10.0,0.0,0.0,0.0,-0.26,0.0,0.028,0.005,-2087.0,0.0,82.0,0.0,-253.0,0.0,25.034,0.001,-45.0,-1.0,2452910.267,0.0,-70.0,0.0,0.0,-0.01,0.0,-1.01,0.0,0.006,0.0,-9192.391,0.0,-0.068,0.0,-5.3,0.0,1.103,0.0,-177.0,-1.0,0.006,0.0,-0.35,0.0,0.024,,,1.0,2456689.01,0.0,-70.0,0.0,-210.0,0.0,-340.0,0.0,0.39,0.15,-110.0,0.0,-158.0,0.87,-53.0,0.0,3.7,1.7,-23.1,0.0,2520.0,0.0,-6064.9,0.0,0.11,0.0,-24.6,0.0,0.08,0.002,-7.732,0.0,-1.432,0.006,-1.0,0.0,-3.281,0.007,-0.955,0.0,1.773,0.0,-2.86,0.0,0.007,0.001,-7.0,-1.0,0.022,0.0,-11.29,0.0,0.1,0.04,-5.0,0.0,1.4,0.005,-10.0,0.0,-76.855,0.0,-31.0,0.0,8.697,-30.656,-66.817,0.072,-8.845,9.749,0.35,0.025,-8.0,-458.496,0.037,-8.0,-573.134,0.021,-8.0,21.818,0.031,-2235.67,-1.211,0.021,-3.65,5.838,0.0,-1.626,5.84,0.002,-1.133,4.865,0.018,-0.262,3.663,0.015,-0.248,3.447,0.014,-0.268,14.03,0.003,-50.0,11.105,0.0,-50.0,9.903,0.0,-13.155,10.329,0.0,-0.217,9.85,0.001,-0.371,3.322,0.015,-0.359,2.898,0.016,-0.388,2.6,0.015,-0.539,1.972,0.021,-0.54,5.809,0.0,-0.021,,,,5.822,0.006,-0.8,5.954,,,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1001.75,0.0,1.0,0.0,0.0,0.0,2016.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.861,0.0,-0.001,0.0,0.042,0.001,-0.003,0.0,1.8,0.15,-0.67,0.0,0.161,0.013,-0.06,0.0,6.22,1.1,-9.535,0.0,0.02,0.003,-0.03,0.0,7.905,1.15,-13.174,0.0,0.025,0.004,-0.041,0.0,6.753,2.251,-12.678,0.0,0.021,0.007,-0.04,0.0,6.292,1.093,-9.853,0.0,0.02,0.003,-0.031,0.0,0.927,0.185,-1.4,0.0,0.0,0.04,-0.104,0.0,20.5,2.15,-24.5,0.0,606.0,12.0,-34.0,0.0,87.284,0.31,-2.262,0.0,2456983.646,0.001,-0.004,0.0,0.0,0.25,0.09,-0.28,0.0,0.06,0.002,-0.012,0.0,1.89,0.065,-0.26,0.0,10.0,0.69,-4.8,0.0,0.021,0.001,-0.004,0.0,0.024,,,1.0,2457738.826,0.001,-0.89,0.0,0.0,22.0,-120.0,0.0,2.815,0.44,-2.7,0.0,-1.772,7.75,-32.125,0.0,20.05,4.275,-12.325,0.0,4469.51,53.407,-138.0,0.0,0.677,0.025,-0.07,0.0,0.69,0.026,-0.098,0.0,-0.153,0.047,-0.12,0.0,-0.971,0.017,-0.082,0.0,4.29,0.031,-0.1,0.0,1.9,0.49,-3.0,0.0,0.75,0.088,-0.46,0.0,2.0,0.5,-1.0,0.0,9.754,0.1,-2.4,0.0,-5.213,0.002,-0.17,0.0,127.883,-11.18,-27.451,125.339,-4.378,126.345,14.681,0.057,-0.095,-19.851,0.066,-0.105,-30.429,0.046,-0.077,156.951,1.195,-8.385,2.235,0.038,-0.059,12.736,0.03,-0.135,12.071,0.045,-0.114,10.414,0.022,-0.026,9.994,0.022,-0.027,9.899,0.02,-0.024,15.223,0.006,-0.011,13.021,0.001,-0.005,12.341,0.001,-0.004,12.054,0.001,-0.004,12.957,0.004,-0.008,9.83,0.023,-0.024,9.861,0.02,-0.022,9.887,0.055,-0.188,8.359,0.248,-0.449,11.804,0.0,-0.001,,,,11.287,0.006,-0.007,11.771,,,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2002.5,0.0,1.0,1.0,0.0,0.0,2018.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.771,0.0,-0.0,0.0,0.067,0.001,-0.001,0.0,2.6,0.26,-0.27,0.0,0.231,0.023,-0.024,0.0,12.0,2.9,-2.85,0.0,0.038,0.009,-0.009,0.0,34.961,5.403,-5.721,0.0,0.11,0.017,-0.018,0.0,35.003,6.044,-6.044,0.0,0.11,0.019,-0.019,0.0,12.0,2.85,-2.8,0.0,0.038,0.009,-0.009,0.0,2.7,0.66,-0.57,0.0,0.077,0.08,-0.059,0.0,78.0,8.4,-7.8,0.0,805.0,19.0,-19.0,0.0,88.7,0.7,-0.862,0.0,2457166.26,0.002,-0.002,0.0,0.0,0.41,0.198,-0.21,0.0,0.134,0.004,-0.004,0.0,2.8,0.12,-0.144,0.0,17.935,2.0,-2.35,0.0,0.03,0.002,-0.002,0.0,0.024,,,1.0,2457860.59,0.215,-0.25,0.0,73.0,72.0,-71.512,0.0,4.9,0.98,-0.915,0.0,4.45,15.6,-18.25,0.0,54.8,6.6,-7.5,0.0,5286.0,100.0,-99.955,0.0,0.863,0.049,-0.042,0.0,0.88,0.05,-0.05,0.0,-0.01,0.08,-0.08,0.0,-0.325,0.027,-0.027,0.0,4.491,0.08,-0.08,0.0,4.765,1.7,-1.3,0.0,1.765,0.261,-0.24,0.0,2.6,0.8,-0.8,0.0,17.35,1.0,-0.94,0.0,8.64,0.012,-0.015,0.0,172.122,1.055,26.834,206.933,-1.047,171.717,29.633,0.073,-0.073,-3.564,0.083,-0.083,-12.402,0.059,-0.059,264.78,3.178,-3.114,3.75,0.047,-0.047,13.775,0.052,-0.052,12.978,0.08,-0.08,11.292,0.023,-0.023,10.807,0.024,-0.024,10.713,0.023,-0.023,15.864,0.008,-0.008,14.714,0.003,-0.003,13.372,0.002,-0.002,13.172,0.002,-0.002,13.384,0.005,-0.005,10.674,0.023,-0.023,10.685,0.021,-0.021,10.617,0.096,-0.096,8.633,0.34,-0.34,12.673,0.0,-0.0,,,,12.143,0.007,-0.007,12.703,,,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,3003.25,1.0,1.0,2.0,0.0,0.0,2018.5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,14.075,0.001,-0.0,0.0,0.105,0.003,-0.001,0.0,5.878,0.639,-0.14,0.0,0.524,0.057,-0.012,0.0,86.609,9.853,-1.085,0.0,0.273,0.031,-0.003,0.0,387.0,13.0,-1.065,0.0,1.225,0.041,-0.003,0.0,102.823,12.678,-2.251,0.0,0.324,0.04,-0.007,0.0,100.514,10.793,-1.042,0.0,0.316,0.034,-0.003,0.0,5.525,1.685,-0.176,0.0,0.2,0.161,-0.03,0.0,263.077,26.5,-2.0,0.0,1114.0,35.0,-12.0,0.0,89.35,1.565,-0.3,0.0,2457590.154,0.004,-0.001,0.0,0.0,0.572,0.3,-0.101,0.0,0.82,0.012,-0.002,0.0,3.928,0.216,-0.067,0.0,28.8,5.2,-0.77,0.0,0.059,0.003,-0.001,0.0,0.024,,,1.0,2457968.707,1.35,-0.001,0.0,132.0,110.0,-27.0,0.0,31.04,2.7,-0.43,0.0,25.25,32.4,-8.525,0.0,106.65,11.075,-5.175,0.0,5744.0,138.0,-53.0,0.0,1.173,0.08,-0.022,0.0,1.024,0.113,-0.027,0.0,0.11,0.12,-0.047,0.0,0.211,0.077,-0.018,0.0,4.626,0.1,-0.034,0.0,7.175,3.4,-0.5,0.0,3.08,0.542,-0.082,0.0,4.3,1.0,-0.5,0.0,28.72,2.2,-0.1,0.0,32.633,0.17,-0.002,0.0,243.451,15.559,47.044,266.619,1.581,246.007,56.149,0.095,-0.057,13.553,0.105,-0.066,-1.645,0.077,-0.046,446.821,8.696,-1.174,6.425,0.059,-0.038,15.285,0.135,-0.03,14.321,0.114,-0.045,12.191,0.026,-0.022,11.72,0.027,-0.022,11.595,0.024,-0.02,17.332,0.011,-0.006,15.64,0.005,-0.001,14.668,0.004,-0.001,14.423,0.004,-0.001,13.828,0.008,-0.004,11.536,0.024,-0.023,11.551,0.022,-0.02,11.474,0.188,-0.055,8.852,0.449,-0.248,13.932,0.001,-0.0,,,,13.243,0.007,-0.006,13.899,,,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
max,4004.0,1.0,3.0,7.0,0.0,0.0,2025.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,83830.0,31600.0,0.0,1.0,4.5,1.0,0.0,0.0,1080.0,453.8,0.0,1.0,96.4,40.49,0.0,1.0,4131.79,731.009,-0.012,1.0,13.0,2.3,-0.0,1.0,2434.566,349.613,-0.45,0.0,7.66,1.1,-0.001,0.0,955.58,382.232,-0.17,0.0,3.007,1.203,-0.001,0.0,4131.79,731.009,-0.012,1.0,13.0,2.3,-0.0,1.0,51.4,12.0,-0.013,1.0,0.853,0.42,0.0,1.0,9667.923,2087.0,-0.005,0.0,2529.02,253.0,-1.0,1.0,140.52,56.232,-0.001,0.0,2460812.206,65.0,0.0,0.0,1.0,1.69,0.74,-0.0,0.0,135.545,9192.391,-0.0,0.0,53.6,5.9,0.0,0.0,283.0,172.0,-0.0,0.0,0.76,0.553,0.0,0.0,0.024,,,1.0,2458520.0,60.0,-0.0,0.0,368.81,180.0,0.0,0.0,558.7,890.0,-0.15,1.0,173.0,50.0,-0.86,0.0,124.0,28.17,-1.8,0.0,46696.0,6064.9,0.0,0.0,85.0,18.0,0.0,0.0,14.336,7.854,-0.002,0.0,0.528,0.96,-0.006,0.0,2.371,0.325,-0.007,0.0,5.276,2.0,0.0,0.0,13.5,7.0,-0.001,0.0,84.531,11.29,-0.0,0.0,74.92,5.0,-0.04,1.0,41.0,10.0,-0.005,0.0,107.2,31.0,-0.0,0.0,358.639,28.114,65.102,359.397,8.932,356.867,1026.579,8.0,-0.025,901.0,8.0,-0.037,216.071,8.0,-0.021,9319.51,3292.69,-0.031,45.804,3.65,-0.021,20.952,1.626,0.0,20.556,1.133,-0.002,16.82,0.262,-0.018,16.819,0.248,-0.015,15.74,0.268,-0.014,30.0,50.0,-0.003,30.0,50.0,-0.0,27.421,13.155,-0.0,22.543,0.217,-0.0,21.803,0.371,-0.001,16.685,0.359,-0.015,16.506,0.388,-0.016,12.905,0.539,-0.015,9.344,0.54,-0.021,20.248,0.021,-0.0,,,,20.399,0.8,-0.006,19.97,,,6.0,3.0,1.0,1.0,0.0,2.0,17.0,0.0


### 3.3 TESS Dataset Exploration

In [16]:
# TESS: Basic information
print("=" * 80)
print("TESS DATASET EXPLORATION")
print("=" * 80)

print(f"\nDataset shape: {tess_df.shape[0]:,} rows × {tess_df.shape[1]} columns\n")

print("Column names and types:")
print(tess_df.dtypes)

print("\n" + "-" * 80)
print("First few rows:")
display(tess_df.head())

print("\n" + "-" * 80)
print("Missing values:")
missing_tess = tess_df.isnull().sum()
missing_pct_tess = (missing_tess / len(tess_df) * 100).round(2)
missing_df_tess = pd.DataFrame({
    'Missing Count': missing_tess[missing_tess > 0],
    'Percentage': missing_pct_tess[missing_tess > 0]
}).sort_values('Percentage', ascending=False)
display(missing_df_tess)

print("\n" + "-" * 80)
print("Basic statistics:")
display(tess_df.describe())

TESS DATASET EXPLORATION

Dataset shape: 7,703 rows × 87 columns

Column names and types:
rowid                  int64
toi                  float64
toipfx                 int64
tid                    int64
ctoi_alias           float64
pl_pnum                int64
tfopwg_disp           object
rastr                 object
ra                   float64
raerr1               float64
raerr2               float64
decstr                object
dec                  float64
decerr1              float64
decerr2              float64
st_pmra              float64
st_pmraerr1          float64
st_pmraerr2          float64
st_pmralim           float64
st_pmrasymerr        float64
st_pmdec             float64
st_pmdecerr1         float64
st_pmdecerr2         float64
st_pmdeclim          float64
st_pmdecsymerr       float64
pl_tranmid           float64
pl_tranmiderr1       float64
pl_tranmiderr2       float64
pl_tranmidlim          int64
pl_tranmidsymerr       int64
pl_orbper            float64
pl_orbperer

Unnamed: 0,rowid,toi,toipfx,tid,ctoi_alias,pl_pnum,tfopwg_disp,rastr,ra,raerr1,raerr2,decstr,dec,decerr1,decerr2,st_pmra,st_pmraerr1,st_pmraerr2,st_pmralim,st_pmrasymerr,st_pmdec,st_pmdecerr1,st_pmdecerr2,st_pmdeclim,st_pmdecsymerr,pl_tranmid,pl_tranmiderr1,pl_tranmiderr2,pl_tranmidlim,pl_tranmidsymerr,pl_orbper,pl_orbpererr1,pl_orbpererr2,pl_orbperlim,pl_orbpersymerr,pl_trandurh,pl_trandurherr1,pl_trandurherr2,pl_trandurhlim,pl_trandurhsymerr,pl_trandep,pl_trandeperr1,pl_trandeperr2,pl_trandeplim,pl_trandepsymerr,pl_rade,pl_radeerr1,pl_radeerr2,pl_radelim,pl_radesymerr,pl_insol,pl_insolerr1,pl_insolerr2,pl_insollim,pl_insolsymerr,pl_eqt,pl_eqterr1,pl_eqterr2,pl_eqtlim,pl_eqtsymerr,st_tmag,st_tmagerr1,st_tmagerr2,st_tmaglim,st_tmagsymerr,st_dist,st_disterr1,st_disterr2,st_distlim,st_distsymerr,st_teff,st_tefferr1,st_tefferr2,st_tefflim,st_teffsymerr,st_logg,st_loggerr1,st_loggerr2,st_logglim,st_loggsymerr,st_rad,st_raderr1,st_raderr2,st_radlim,st_radsymerr,toi_created,rowupdate
0,1,1000.01,1000,50365310,50365310.01,1,FP,07h29m25.85s,112.358,,,-12d41m45.46s,-12.696,,,-5.964,0.085,-0.085,0.0,1.0,-0.076,0.072,-0.072,0.0,1.0,2459229.63,0.002,-0.002,0,1,2.171,0.0,-0.0,0,1,2.017,0.32,-0.32,0,1,656.886,37.778,-37.778,0,1,5.818,1.911,-1.911,0,1,22601.949,,,,,3127.204,,,,,9.604,0.013,-0.013,0,1,485.735,11.951,-11.951,0,1,10249.0,264.7,-264.7,0,1,4.19,0.07,-0.07,0,1,2.17,0.073,-0.073,0,1,2019-07-24 15:58:33,2024-09-09 10:08:01
1,2,1001.01,1001,88863718,88863718.01,1,PC,08h10m19.31s,122.58,,,-05d30m49.87s,-5.514,,,-4.956,0.102,-0.102,0.0,1.0,-15.555,0.072,-0.072,0.0,1.0,2459987.949,0.002,-0.002,0,1,1.932,0.0,-0.0,0,1,3.166,0.647,-0.647,0,1,1286.0,1186.49,-1186.49,0,1,11.215,2.624,-2.624,0,1,44464.5,,,,,4045.0,,,,,9.423,0.006,-0.006,0,1,295.862,5.91,-5.91,0,1,7070.0,126.4,-126.4,0,1,4.03,0.09,-0.09,0,1,2.01,0.09,-0.09,0,1,2019-07-24 15:58:33,2023-04-03 14:31:04
2,3,1002.01,1002,124709665,124709665.01,1,FP,06h58m54.47s,104.727,,,-10d34m49.64s,-10.58,,,-1.462,0.206,-0.206,0.0,1.0,-2.249,0.206,-0.206,0.0,1.0,2459224.688,0.001,-0.001,0,1,1.868,0.0,-0.0,0,1,1.408,0.184,-0.184,0,1,1500.0,1.758,-1.758,0,1,23.753,,,0,1,2860.61,,,,,2037.0,,,,,9.3,0.058,-0.058,0,1,943.109,106.333,-106.333,0,1,8924.0,124.0,-124.0,0,1,,,,0,1,5.73,,,0,1,2019-07-24 15:58:33,2022-07-11 16:02:02
3,4,1003.01,1003,106997505,106997505.01,1,FP,07h22m14.39s,110.56,,,-25d12m25.26s,-25.207,,,-0.939,0.041,-0.041,0.0,1.0,1.64,0.055,-0.055,0.0,1.0,2458493.396,0.005,-0.005,0,1,2.743,0.001,-0.001,0,1,3.167,0.642,-0.642,0,1,383.41,0.782,-0.782,0,1,,,,0,1,1177.36,,,,,1631.0,,,,,9.3,0.037,-0.037,0,1,7728.17,1899.57,-1899.57,0,1,5388.5,567.0,-567.0,0,1,4.15,1.64,-1.64,0,1,,,,0,1,2019-07-24 15:58:33,2022-02-23 10:10:02
4,5,1004.01,1004,238597883,238597883.01,1,FP,08h08m42.77s,122.178,,,-48d48m10.12s,-48.803,,,-4.496,0.069,-0.069,0.0,1.0,9.347,0.062,-0.062,0.0,1.0,2459987.047,0.004,-0.004,0,1,3.573,0.0,-0.0,0,1,3.37,1.029,-1.029,0,1,755.0,1306.55,-1306.55,0,1,11.311,3.247,-3.247,0,1,54679.3,,,,,4260.0,,,,,9.136,0.006,-0.006,0,1,356.437,4.617,-4.617,0,1,9219.0,171.1,-171.1,0,1,4.14,0.07,-0.07,0,1,2.15,0.06,-0.06,0,1,2019-07-24 15:58:33,2024-09-09 10:08:01



--------------------------------------------------------------------------------
Missing values:


Unnamed: 0,Missing Count,Percentage
raerr1,7703,100.0
raerr2,7703,100.0
decerr1,7703,100.0
decerr2,7703,100.0
pl_eqtsymerr,7703,100.0
pl_eqtlim,7703,100.0
pl_eqterr2,7703,100.0
pl_eqterr1,7703,100.0
pl_insolsymerr,7703,100.0
pl_insollim,7703,100.0



--------------------------------------------------------------------------------
Basic statistics:


Unnamed: 0,rowid,toi,toipfx,tid,ctoi_alias,pl_pnum,ra,raerr1,raerr2,dec,decerr1,decerr2,st_pmra,st_pmraerr1,st_pmraerr2,st_pmralim,st_pmrasymerr,st_pmdec,st_pmdecerr1,st_pmdecerr2,st_pmdeclim,st_pmdecsymerr,pl_tranmid,pl_tranmiderr1,pl_tranmiderr2,pl_tranmidlim,pl_tranmidsymerr,pl_orbper,pl_orbpererr1,pl_orbpererr2,pl_orbperlim,pl_orbpersymerr,pl_trandurh,pl_trandurherr1,pl_trandurherr2,pl_trandurhlim,pl_trandurhsymerr,pl_trandep,pl_trandeperr1,pl_trandeperr2,pl_trandeplim,pl_trandepsymerr,pl_rade,pl_radeerr1,pl_radeerr2,pl_radelim,pl_radesymerr,pl_insol,pl_insolerr1,pl_insolerr2,pl_insollim,pl_insolsymerr,pl_eqt,pl_eqterr1,pl_eqterr2,pl_eqtlim,pl_eqtsymerr,st_tmag,st_tmagerr1,st_tmagerr2,st_tmaglim,st_tmagsymerr,st_dist,st_disterr1,st_disterr2,st_distlim,st_distsymerr,st_teff,st_tefferr1,st_tefferr2,st_tefflim,st_teffsymerr,st_logg,st_loggerr1,st_loggerr2,st_logglim,st_loggsymerr,st_rad,st_raderr1,st_raderr2,st_radlim,st_radsymerr
count,7703.0,7703.0,7703.0,7703.0,7703.0,7703.0,7703.0,0.0,0.0,7703.0,0.0,0.0,7569.0,7569.0,7569.0,7569.0,7569.0,7569.0,7569.0,7569.0,7569.0,7569.0,7703.0,7692.0,7692.0,7703.0,7703.0,7596.0,7572.0,7572.0,7703.0,7703.0,7703.0,7690.0,7690.0,7703.0,7703.0,7703.0,7697.0,7697.0,7703.0,7703.0,7197.0,6080.0,6080.0,7703.0,7703.0,7527.0,0.0,0.0,0.0,0.0,7392.0,0.0,0.0,0.0,0.0,7703.0,7703.0,7703.0,7703.0,7703.0,7488.0,6996.0,6996.0,7703.0,7703.0,7542.0,7229.0,7229.0,7703.0,7703.0,6847.0,5432.0,5432.0,7703.0,7703.0,7196.0,5740.0,5740.0,7703.0,7703.0
mean,3852.0,3749.039,3749.029,245522652.752,245522652.762,1.049,179.804,,,1.157,,,-0.567,0.229,-0.229,0.0,1.0,-9.182,0.223,-0.223,0.0,1.0,2459552.363,0.004,-0.004,0.0,1.0,17.74,0.0,-0.0,0.0,1.0,3.059,0.362,-0.362,0.0,1.0,8256.683,489.867,-489.867,0.0,1.0,10.327,1.459,-1.459,0.0,1.0,2246.183,,,,,1282.691,,,,,11.564,0.01,-0.01,0.0,1.0,478.223,19.529,-19.529,0.0,1.0,5791.481,205.574,-205.574,0.0,1.0,4.305,0.175,-0.175,0.0,1.0,1.404,0.074,-0.074,0.0,1.0
std,2223.809,2152.357,2152.358,161696778.613,161696778.613,0.273,103.706,,,47.465,,,77.004,0.632,0.632,0.0,0.0,66.875,0.623,0.623,0.0,0.0,615.505,0.044,0.044,0.0,0.0,97.737,0.001,0.001,0.0,0.0,1.874,1.932,1.932,0.0,0.0,17502.276,2148.259,2148.259,0.0,0.0,8.527,3.747,3.747,0.0,0.0,10910.072,,,,,686.724,,,,,1.632,0.033,0.033,0.0,0.0,557.914,137.239,137.239,0.0,0.0,1481.355,550.05,550.05,0.0,0.0,0.304,0.352,0.352,0.0,0.0,1.598,0.082,0.082,0.0,0.0
min,1.0,101.01,101.0,2876.0,2876.01,1.0,0.085,,,-89.472,,,-1624.05,0.015,-8.0,0.0,1.0,-1230.62,0.016,-8.0,0.0,1.0,2457925.865,0.0,-3.305,0.0,1.0,0.152,0.0,-0.022,0.0,1.0,0.101,0.001,-167.615,0.0,1.0,24.583,0.013,-103577.016,0.0,1.0,0.553,0.004,-163.506,0.0,1.0,0.0,,,,,37.0,,,,,4.628,0.0,-1.0,0.0,1.0,6.531,0.003,-4020.6,0.0,1.0,2808.0,4.4,-7000.0,0.0,1.0,0.1,0.0,-2.011,0.0,1.0,0.115,0.003,-1.723,0.0,1.0
25%,1926.5,1861.51,1861.5,131138867.0,131138867.01,1.0,96.07,,,-43.355,,,-10.617,0.038,-0.08,0.0,1.0,-14.695,0.038,-0.071,0.0,1.0,2459196.571,0.001,-0.003,0.0,1.0,2.491,0.0,-0.0,0.0,1.0,1.846,0.139,-0.441,0.0,1.0,1418.165,9.302,-299.146,0.0,1.0,4.496,0.437,-1.108,0.0,1.0,85.293,,,,,813.759,,,,,10.394,0.006,-0.007,0.0,1.0,178.464,1.012,-12.072,0.0,1.0,5211.0,122.0,-157.0,0.0,1.0,4.124,0.08,-0.093,0.0,1.0,0.89,0.05,-0.08,0.0,1.0
50%,3852.0,3736.01,3736.0,249945230.0,249945230.01,1.0,161.157,,,4.715,,,-1.571,0.051,-0.051,0.0,1.0,-3.469,0.049,-0.049,0.0,1.0,2459585.368,0.002,-0.002,0.0,1.0,4.089,0.0,-0.0,0.0,1.0,2.732,0.263,-0.263,0.0,1.0,4750.326,74.654,-74.654,0.0,1.0,10.544,0.717,-0.717,0.0,1.0,363.901,,,,,1183.014,,,,,11.837,0.006,-0.006,0.0,1.0,365.008,4.305,-4.305,0.0,1.0,5800.55,129.4,-129.4,0.0,1.0,4.33,0.085,-0.085,0.0,1.0,1.234,0.06,-0.06,0.0,1.0
75%,5777.5,5615.51,5615.5,354091686.0,354091686.01,1.0,283.061,,,43.807,,,8.258,0.08,-0.038,0.0,1.0,4.909,0.071,-0.038,0.0,1.0,2459989.9,0.003,-0.001,0.0,1.0,7.924,0.0,-0.0,0.0,1.0,3.798,0.441,-0.139,0.0,1.0,10350.821,299.146,-9.302,0.0,1.0,14.019,1.108,-0.437,0.0,1.0,1160.92,,,,,1588.645,,,,,12.861,0.007,-0.006,0.0,1.0,647.808,12.072,-1.012,0.0,1.0,6295.65,157.0,-122.0,0.0,1.0,4.5,0.093,-0.08,0.0,1.0,1.66,0.08,-0.05,0.0,1.0
max,7703.0,7509.01,7509.0,2041563029.0,2041563029.01,5.0,359.941,,,89.087,,,2074.52,8.0,-0.015,0.0,1.0,1048.84,8.0,-0.016,0.0,1.0,2460863.076,3.305,-0.0,0.0,1.0,1837.89,0.022,-0.0,0.0,1.0,30.016,167.615,-0.001,0.0,1.0,767910.313,103577.016,-0.013,0.0,1.0,297.112,163.506,-0.004,0.0,1.0,280833.0,,,,,6413.0,,,,,18.332,1.0,0.0,0.0,1.0,14728.3,4020.6,-0.003,0.0,1.0,50000.0,7000.0,-4.4,0.0,1.0,5.961,2.011,-0.0,0.0,1.0,102.03,1.723,-0.003,0.0,1.0


## 4. Compare Dataset Schemas

Analyze columns across all three datasets to identify:
- Common columns (appear in all datasets)
- Shared columns (appear in 2 datasets)
- Unique columns (appear in only 1 dataset)

In [17]:
# Get column sets
kepler_cols = set(kepler_df.columns)
k2_cols = set(k2_df.columns)
tess_cols = set(tess_df.columns)

# Find common and unique columns
common_all = kepler_cols & k2_cols & tess_cols
kepler_k2_only = (kepler_cols & k2_cols) - tess_cols
kepler_tess_only = (kepler_cols & tess_cols) - k2_cols
k2_tess_only = (k2_cols & tess_cols) - kepler_cols
kepler_unique = kepler_cols - k2_cols - tess_cols
k2_unique = k2_cols - kepler_cols - tess_cols
tess_unique = tess_cols - kepler_cols - k2_cols

print("=" * 80)
print("SCHEMA COMPARISON")
print("=" * 80)

print(f"\n📊 Total columns:")
print(f"  - Kepler: {len(kepler_cols)}")
print(f"  - K2: {len(k2_cols)}")
print(f"  - TESS: {len(tess_cols)}")

print(f"\n✅ Common to ALL three datasets ({len(common_all)} columns):")
if common_all:
    for col in sorted(common_all):
        print(f"  - {col}")
else:
    print("  (none)")

print(f"\n🔄 Shared between Kepler & K2 only ({len(kepler_k2_only)} columns):")
if kepler_k2_only:
    for col in sorted(kepler_k2_only):
        print(f"  - {col}")
else:
    print("  (none)")

print(f"\n🔄 Shared between Kepler & TESS only ({len(kepler_tess_only)} columns):")
if kepler_tess_only:
    for col in sorted(kepler_tess_only):
        print(f"  - {col}")
else:
    print("  (none)")

print(f"\n🔄 Shared between K2 & TESS only ({len(k2_tess_only)} columns):")
if k2_tess_only:
    for col in sorted(k2_tess_only):
        print(f"  - {col}")
else:
    print("  (none)")

print(f"\n🔸 Unique to Kepler ({len(kepler_unique)} columns):")
if kepler_unique:
    for col in sorted(kepler_unique):
        print(f"  - {col}")
else:
    print("  (none)")

print(f"\n🔸 Unique to K2 ({len(k2_unique)} columns):")
if k2_unique:
    for col in sorted(k2_unique):
        print(f"  - {col}")
else:
    print("  (none)")

print(f"\n🔸 Unique to TESS ({len(tess_unique)} columns):")
if tess_unique:
    for col in sorted(tess_unique):
        print(f"  - {col}")
else:
    print("  (none)")

SCHEMA COMPARISON

📊 Total columns:
  - Kepler: 141
  - K2: 295
  - TESS: 87

✅ Common to ALL three datasets (3 columns):
  - dec
  - ra
  - rowid

🔄 Shared between Kepler & K2 only (0 columns):
  (none)

🔄 Shared between Kepler & TESS only (0 columns):
  (none)

🔄 Shared between K2 & TESS only (39 columns):
  - decstr
  - pl_eqt
  - pl_eqterr1
  - pl_eqterr2
  - pl_eqtlim
  - pl_insol
  - pl_insolerr1
  - pl_insolerr2
  - pl_insollim
  - pl_orbper
  - pl_orbpererr1
  - pl_orbpererr2
  - pl_orbperlim
  - pl_rade
  - pl_radeerr1
  - pl_radeerr2
  - pl_radelim
  - pl_trandep
  - pl_trandeperr1
  - pl_trandeperr2
  - pl_trandeplim
  - pl_tranmid
  - pl_tranmiderr1
  - pl_tranmiderr2
  - pl_tranmidlim
  - rastr
  - rowupdate
  - st_logg
  - st_loggerr1
  - st_loggerr2
  - st_logglim
  - st_rad
  - st_raderr1
  - st_raderr2
  - st_radlim
  - st_teff
  - st_tefferr1
  - st_tefferr2
  - st_tefflim

🔸 Unique to Kepler (138 columns):
  - kepid
  - kepler_name
  - kepoi_name
  - koi_bin_oedp_sig
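
The set arithmetic above can also be condensed into a single column-presence table, which scales more easily if another dataset is added. This is a minimal sketch, not part of the original notebook: `column_presence` is a hypothetical helper, and the toy frames only stand in for `kepler_df`, `k2_df`, and `tess_df`.

```python
import pandas as pd

def column_presence(frames):
    """Return a boolean table: one row per column name, one column per dataset."""
    all_cols = sorted(set().union(*(df.columns for df in frames.values())))
    presence = pd.DataFrame(
        {name: [col in df.columns for col in all_cols] for name, df in frames.items()},
        index=all_cols,
    )
    # Number of datasets in which each column name appears
    presence["n_datasets"] = presence.sum(axis=1)
    return presence

# Toy frames with a handful of illustrative column names (assumption, not the real schemas)
toy = {
    "kepler": pd.DataFrame(columns=["rowid", "ra", "dec", "kepid"]),
    "k2": pd.DataFrame(columns=["rowid", "ra", "dec", "pl_orbper"]),
    "tess": pd.DataFrame(columns=["rowid", "ra", "dec", "pl_orbper", "toi"]),
}
presence = column_presence(toy)
common_all = presence.index[presence["n_datasets"] == 3].tolist()
print(presence)
```

Filtering on `n_datasets` gives the common (3), shared (2), and unique (1) groups without writing one set expression per pair.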

## 5. Data Quality Analysis

Examine data quality issues for each dataset:
- Duplicates
- Missing value patterns
- Data type inconsistencies
- Outliers or anomalies

In [18]:
# Check for duplicates
print("=" * 80)
print("DATA QUALITY CHECKS")
print("=" * 80)

print("\n🔍 Duplicate Rows:")
print(f"  - Kepler: {kepler_df.duplicated().sum()} duplicates")
print(f"  - K2: {k2_df.duplicated().sum()} duplicates")
print(f"  - TESS: {tess_df.duplicated().sum()} duplicates")

print("\n📊 Missing Data Summary:")
print(f"  - Kepler: {kepler_df.isnull().sum().sum():,} total missing values")
print(f"  - K2: {k2_df.isnull().sum().sum():,} total missing values")
print(f"  - TESS: {tess_df.isnull().sum().sum():,} total missing values")

print("\n💾 Memory Usage:")
print(f"  - Kepler: {kepler_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"  - K2: {k2_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"  - TESS: {tess_df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

DATA QUALITY CHECKS

🔍 Duplicate Rows:
  - Kepler: 0 duplicates
  - K2: 0 duplicates
  - TESS: 0 duplicates

📊 Missing Data Summary:
  - Kepler: 237,112 total missing values
  - K2: 552,676 total missing values
  - TESS: 111,013 total missing values

💾 Memory Usage:
  - Kepler: 19.73 MB
  - K2: 16.77 MB
  - TESS: 7.10 MB
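
The cell above covers duplicates, missing values, and memory. The other two items in the list (data type inconsistencies and outliers) could be checked along the following lines. This is a rough sketch under stated assumptions: `dtype_mismatches` and `iqr_outlier_counts` are hypothetical helpers, the 1.5 × IQR rule is just one common outlier heuristic, and the toy values are made up for illustration.

```python
import pandas as pd

def dtype_mismatches(left, right):
    """List shared columns whose dtypes differ between two DataFrames."""
    shared = sorted(set(left.columns) & set(right.columns))
    rows = [
        (col, str(left[col].dtype), str(right[col].dtype))
        for col in shared
        if left[col].dtype != right[col].dtype
    ]
    return pd.DataFrame(rows, columns=["column", "left_dtype", "right_dtype"])

def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons against a Series align on column names; NaN compares as False
    outside = (numeric < (q1 - k * iqr)) | (numeric > (q3 + k * iqr))
    return outside.sum().sort_values(ascending=False)

# Toy data (made-up values): a flag column stored as float in one frame, int in the other
toy_k2 = pd.DataFrame({"pl_tranmidlim": [0.0, None, 0.0],
                       "st_teff": [5000.0, 5200.0, 5100.0]})
toy_tess = pd.DataFrame({"pl_tranmidlim": [0] * 8,
                         "st_teff": [5000.0, 5100.0, 5200.0, 5300.0,
                                     5400.0, 5500.0, 5600.0, 50000.0]})

mismatches = dtype_mismatches(toy_k2, toy_tess)
outliers = iqr_outlier_counts(toy_tess)
print(mismatches)
print(outliers)
```

A flagged value is only a candidate for review; whether an extreme value is an error or a genuine measurement is a domain decision.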


## 6. Data Cleaning Strategy

Based on the exploration above, we'll implement cleaning for each dataset:

### Key Findings:
- **No duplicate rows** in any dataset ✓
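- **Many missing values**: Kepler (237,112), K2 (552,676), TESS (111,013)
- **Different schemas**: Only 3 columns (`dec`, `ra`, `rowid`) are common to all three datasets; K2 and TESS share a further 39
- **Different data models**: Kepler is a KOI table (`koi_` prefixes), while K2 and TESS use planet (`pl_`) and stellar (`st_`) columns. The TESS table is a TOI list whose `tfopwg_disp` values include `FP` (false positive) and `PC` (planet candidate), so it is not limited to confirmed planets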

### Cleaning Approach:
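1. **Remove columns with 100% missing values** (uninformative)
2. **Remove columns with ≥95% missing values** (likely too sparse); the threshold is inclusive, so it also covers step 1
3. **Standardize column names** (lowercase, underscores)
4. **Identify key columns** for each dataset
5. **Handle remaining missing values** based on column importance

The `clean_dataset` function below implements steps 1 and 2 only; steps 3 to 5 are not part of it.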
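
Step 3 could be handled along these lines. This is a minimal sketch: `standardize_columns` is a hypothetical helper, not something defined elsewhere in this notebook, and the column names in the toy frame are invented. The column names printed in the earlier sections already appear to be lowercase with underscores, so on these files the step may change nothing.

```python
import re
import pandas as pd

def standardize_columns(df):
    """Return a copy with lowercase, underscore-separated column names."""
    def normalize(name):
        name = str(name).strip().lower()
        # Any run of characters other than a-z or 0-9 becomes a single underscore
        name = re.sub(r"[^0-9a-z]+", "_", name)
        return name.strip("_")

    renamed = df.rename(columns=normalize)
    # Guard against two different names collapsing into the same one
    if renamed.columns.duplicated().any():
        raise ValueError("column names collide after standardization")
    return renamed

# Toy frame with deliberately messy, made-up column names
toy = pd.DataFrame(columns=["Kep ID", "koi_period", " RA-deg "])
standardized = standardize_columns(toy)
print(list(standardized.columns))
```

Raising on collisions rather than silently renaming keeps the mapping between raw and cleaned columns unambiguous.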

In [19]:
# Function to clean individual datasets
def clean_dataset(df, dataset_name, missing_threshold=95):
    """
    Clean a dataset by removing columns with too many missing values
    
    Parameters:
    - df: DataFrame to clean
    - dataset_name: Name for logging
    - missing_threshold: Percentage threshold for removing columns (default 95%)
    
    Returns:
    - Cleaned DataFrame
    """
    print(f"\n{'='*80}")
    print(f"CLEANING {dataset_name.upper()} DATASET")
    print(f"{'='*80}")
    
    original_shape = df.shape
    print(f"Original shape: {original_shape[0]:,} rows × {original_shape[1]} columns")
    
    # 1. Identify columns with excessive missing values
    missing_pct = (df.isnull().sum() / len(df) * 100)
    cols_to_drop = missing_pct[missing_pct >= missing_threshold].index.tolist()
    
    print(f"\n📊 Columns with ≥{missing_threshold}% missing values: {len(cols_to_drop)}")
    if len(cols_to_drop) > 0 and len(cols_to_drop) <= 20:
        for col in cols_to_drop:
            print(f"   - {col}: {missing_pct[col]:.1f}% missing")
    elif len(cols_to_drop) > 20:
        print(f"   (Showing first 10 of {len(cols_to_drop)})")
        for col in cols_to_drop[:10]:
            print(f"   - {col}: {missing_pct[col]:.1f}% missing")
    
    # 2. Drop columns with excessive missing values
    df_cleaned = df.drop(columns=cols_to_drop)
    print(f"\n✂️  Removed {len(cols_to_drop)} columns with excessive missing values")
    
    # 3. Report remaining missing values
    remaining_missing = df_cleaned.isnull().sum().sum()
    total_values = df_cleaned.shape[0] * df_cleaned.shape[1]
    missing_pct_total = (remaining_missing / total_values * 100)
    
    print(f"\n📉 Remaining missing values: {remaining_missing:,} ({missing_pct_total:.2f}% of total)")
    
    # 4. Show columns with most missing values (that remain)
    remaining_missing_by_col = df_cleaned.isnull().sum()
    top_missing = remaining_missing_by_col[remaining_missing_by_col > 0].sort_values(ascending=False).head(10)
    
    if len(top_missing) > 0:
        print(f"\n🔝 Top columns with missing values (after cleaning):")
        for col, count in top_missing.items():
            pct = (count / len(df_cleaned) * 100)
            print(f"   - {col}: {count:,} ({pct:.1f}%)")
    
    # 5. Final shape
    final_shape = df_cleaned.shape
    print(f"\n✅ Final shape: {final_shape[0]:,} rows × {final_shape[1]} columns")
    print(f"   Reduced from {original_shape[1]} to {final_shape[1]} columns ({original_shape[1] - final_shape[1]} removed)")
    
    return df_cleaned

# Clean each dataset
print("Starting data cleaning process...")
print("Using 95% missing value threshold for column removal\n")

kepler_clean = clean_dataset(kepler_df.copy(), "Kepler", missing_threshold=95)
k2_clean = clean_dataset(k2_df.copy(), "K2", missing_threshold=95)
tess_clean = clean_dataset(tess_df.copy(), "TESS", missing_threshold=95)

print(f"\n{'='*80}")
print("CLEANING SUMMARY")
print(f"{'='*80}")
print(f"✓ Kepler: {kepler_df.shape[1]} → {kepler_clean.shape[1]} columns ({kepler_df.shape[0]:,} rows)")
print(f"✓ K2: {k2_df.shape[1]} → {k2_clean.shape[1]} columns ({k2_df.shape[0]:,} rows)")
print(f"✓ TESS: {tess_df.shape[1]} → {tess_clean.shape[1]} columns ({tess_df.shape[0]:,} rows)")
print(f"\n🎯 All datasets cleaned and ready for further analysis or export!")

Starting data cleaning process...
Using 95% missing value threshold for column removal


CLEANING KEPLER DATASET
Original shape: 9,564 rows × 141 columns

📊 Columns with ≥95% missing values: 19
   - koi_eccen_err1: 100.0% missing
   - koi_eccen_err2: 100.0% missing
   - koi_longp: 100.0% missing
   - koi_longp_err1: 100.0% missing
   - koi_longp_err2: 100.0% missing
   - koi_ingress: 100.0% missing
   - koi_ingress_err1: 100.0% missing
   - koi_ingress_err2: 100.0% missing
   - koi_sma_err1: 100.0% missing
   - koi_sma_err2: 100.0% missing
   - koi_incl_err1: 100.0% missing
   - koi_incl_err2: 100.0% missing
   - koi_teq_err1: 100.0% missing
   - koi_teq_err2: 100.0% missing
   - koi_model_dof: 100.0% missing
   - koi_model_chisq: 100.0% missing
   - koi_sage: 100.0% missing
   - koi_sage_err1: 100.0% missing
   - koi_sage_err2: 100.0% missing

✂️  Removed 19 columns with excessive missing values

📉 Remaining missing values: 55,396 (4.75% of total)

🔝 Top columns with missing values (a

## 7. Dataset Combination Decision

### Analysis of Schema Comparison:

Based on the schema comparison in Section 4, these datasets have **very different structures**. Column counts after cleaning:
- **Kepler**: 122 columns (KOI-focused with `koi_` prefixes)
- **K2**: 248 columns (Planet-focused with `pl_` and `st_` prefixes)
- **TESS**: 75 columns (Planet-focused with `pl_` and `st_` prefixes)

**Common columns across all three**: only three (`dec`, `ra`, and `rowid`)

### Recommended Strategy: **Keep Datasets Separate**

**Rationale:**
1. **Different data models**: Kepler uses KOI (Kepler Object of Interest) data, while K2 and TESS use confirmed planet data
2. **Different missions/timeframes**: Each telescope has unique observation characteristics
3. **Minimal column overlap**: With only three shared columns, combining would leave almost every other column empty for at least one mission
4. **Different analysis needs**: Researchers typically analyze these datasets separately by mission

### Alternative: Create Mission Identifier
To build a unified dataset for specific analyses:
1. Select only the common columns across datasets
2. Add a `mission` column to identify the source
3. Concatenate vertically

Let's keep them separate for now and export as individual cleaned files.
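
For reference, the alternative outlined above could look roughly like the following. This is a minimal sketch, not part of the pipeline: the helper name `combine_on_common_columns` and the toy frames are hypothetical stand-ins for `kepler_clean`, `k2_clean`, and `tess_clean`, and the values in them are made up for illustration.

```python
import pandas as pd


def combine_on_common_columns(frames):
    """Stack DataFrames vertically, keeping only the columns shared by all.

    `frames` maps a mission label (e.g. "Kepler") to its DataFrame.
    (Hypothetical helper, not defined elsewhere in this notebook.)
    """
    # Step 1: intersect the column sets of every frame
    common = set.intersection(*(set(df.columns) for df in frames.values()))
    ordered = sorted(common)

    parts = []
    for mission, df in frames.items():
        part = df[ordered].copy()
        # Step 2: tag each row with its source mission
        part["mission"] = mission
        parts.append(part)

    # Step 3: concatenate vertically with a fresh index
    return pd.concat(parts, ignore_index=True)


# Toy frames with illustrative values only
kepler_toy = pd.DataFrame(
    {"rowid": [1, 2], "ra": [291.9, 297.0], "dec": [48.1, 48.2], "koi_period": [9.5, 54.4]}
)
k2_toy = pd.DataFrame({"rowid": [1], "ra": [53.6], "dec": [20.6], "pl_orbper": [41.7]})
tess_toy = pd.DataFrame({"rowid": [1], "ra": [112.4], "dec": [-12.7], "pl_orbper": [2.2]})

combined = combine_on_common_columns(
    {"Kepler": kepler_toy, "K2": k2_toy, "TESS": tess_toy}
)
print(combined)
```

Because each table appears to number `rowid` independently, it will probably not be unique after concatenation; pairing it with `mission` is one way to keep rows identifiable.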

In [20]:
# Decision: Keep datasets separate due to different schemas
# This preserves the unique characteristics and columns of each mission

print("="*80)
print("COMBINATION DECISION")
print("="*80)

print("\n📋 Decision: Keep datasets SEPARATE")
print("\nReasons:")
print("  1. Different data models (KOI vs confirmed planets)")
print("  2. Different missions with unique observation characteristics")
print("  3. Minimal column overlap would create excessive sparsity if combined")
print("  4. Standard practice in exoplanet research is mission-specific analysis")

print("\n✓ Each cleaned dataset will be exported separately")
print("✓ This allows for mission-specific analyses while maintaining data integrity")

# Optional: Show what columns would be common if we wanted to combine later
common_cols = set(kepler_clean.columns) & set(k2_clean.columns) & set(tess_clean.columns)
print(f"\n💡 Note: Only {len(common_cols)} columns are common across all three datasets:")
if common_cols:
    for col in sorted(common_cols):
        print(f"     - {col}")
else:
    print("     (none)")
    
print("\n🔄 If needed, you can create a unified dataset later by selecting common columns")

COMBINATION DECISION

📋 Decision: Keep datasets SEPARATE

Reasons:
  1. Different data models (KOI vs confirmed planets)
  2. Different missions with unique observation characteristics
  3. Minimal column overlap would create excessive sparsity if combined
  4. Standard practice in exoplanet research is mission-specific analysis

✓ Each cleaned dataset will be exported separately
✓ This allows for mission-specific analyses while maintaining data integrity

💡 Note: Only 3 columns are common across all three datasets:
     - dec
     - ra
     - rowid

🔄 If needed, you can create a unified dataset later by selecting common columns


## 8. Export Cleaned Data

Save the cleaned datasets to the `data/clean/` folder.

In [21]:
# Create clean data directory if it doesn't exist
clean_dir = Path('../data/clean')
clean_dir.mkdir(parents=True, exist_ok=True)

print("="*80)
print("EXPORTING CLEANED DATASETS")
print("="*80)

# Export each cleaned dataset
print(f"\n📁 Export directory: {clean_dir.absolute()}\n")

# Save Kepler
kepler_path = clean_dir / 'kepler_clean.csv'
kepler_clean.to_csv(kepler_path, index=False)
print(f"✓ Kepler dataset saved: kepler_clean.csv")
print(f"  - {kepler_clean.shape[0]:,} rows × {kepler_clean.shape[1]} columns")
print(f"  - File size: {kepler_path.stat().st_size / 1024**2:.2f} MB")

# Save K2
k2_path = clean_dir / 'k2_clean.csv'
k2_clean.to_csv(k2_path, index=False)
print(f"\n✓ K2 dataset saved: k2_clean.csv")
print(f"  - {k2_clean.shape[0]:,} rows × {k2_clean.shape[1]} columns")
print(f"  - File size: {k2_path.stat().st_size / 1024**2:.2f} MB")

# Save TESS
tess_path = clean_dir / 'tess_clean.csv'
tess_clean.to_csv(tess_path, index=False)
print(f"\n✓ TESS dataset saved: tess_clean.csv")
print(f"  - {tess_clean.shape[0]:,} rows × {tess_clean.shape[1]} columns")
print(f"  - File size: {tess_path.stat().st_size / 1024**2:.2f} MB")

# Summary
print(f"\n{'='*80}")
print("EXPORT COMPLETE")
print(f"{'='*80}")
print(f"\n🎉 All cleaned datasets have been exported to: {clean_dir.absolute()}")
print(f"\n📊 Summary:")
print(f"   - Total rows exported: {kepler_clean.shape[0] + k2_clean.shape[0] + tess_clean.shape[0]:,}")
print(f"   - Kepler: {kepler_clean.shape[1]} columns (reduced from {kepler_df.shape[1]})")
print(f"   - K2: {k2_clean.shape[1]} columns (reduced from {k2_df.shape[1]})")
print(f"   - TESS: {tess_clean.shape[1]} columns (reduced from {tess_df.shape[1]})")
print(f"\n✅ Data cleaning pipeline complete!")
print(f"   Next steps: Use these cleaned datasets for analysis, visualization, or modeling")

EXPORTING CLEANED DATASETS

📁 Export directory: /Users/jorgesandoval/Documents/current/fermix/notebooks/../data/clean

✓ Kepler dataset saved: kepler_clean.csv
  - 9,564 rows × 122 columns
  - File size: 9.42 MB

✓ K2 dataset saved: k2_clean.csv
  - 4,004 rows × 248 columns
  - File size: 6.66 MB

✓ TESS dataset saved: tess_clean.csv
  - 7,703 rows × 75 columns
  - File size: 3.47 MB

EXPORT COMPLETE

🎉 All cleaned datasets have been exported to: /Users/jorgesandoval/Documents/current/fermix/notebooks/../data/clean

📊 Summary:
   - Total rows exported: 21,271
   - Kepler: 122 columns (reduced from 141)
   - K2: 248 columns (reduced from 295)
   - TESS: 75 columns (reduced from 87)

✅ Data cleaning pipeline complete!
   Next steps: Use these cleaned datasets for analysis, visualization, or modeling


## Summary & Results

### ✅ Completed Workflow:
1. **Setup Complete** - Libraries imported and environment configured
2. **Data Loaded** - Successfully loaded three NASA exoplanet datasets
3. **Exploration Complete** - Examined structure, types, and quality of each dataset
4. **Schema Compared** - Identified only 3 common columns (ra, dec, rowid) across all datasets
5. **Data Cleaned** - Removed columns with ≥95% missing values
6. **Decision Made** - Keep datasets separate due to different missions and data models
7. **Export Complete** - Saved cleaned datasets to `data/clean/` folder

### 📊 Final Results:

| Dataset | Original | Cleaned | Rows | File Size |
|---------|----------|---------|------|-----------|
| **Kepler** | 141 cols | 122 cols | 9,564 | 9.42 MB |
| **K2** | 295 cols | 248 cols | 4,004 | 6.66 MB |
| **TESS** | 87 cols | 75 cols | 7,703 | 3.47 MB |

### 🎯 Key Findings:
- **No duplicates** found in any dataset
- Removed **78 total columns** with excessive missing values (≥95%)
- **Different data models**: Kepler uses KOI data, K2/TESS use confirmed planet data
- **Minimal overlap**: Only 3 columns common across all datasets
- **Kept separate**: Best practice for mission-specific exoplanet research
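
The removed-column total can be cross-checked against the Final Results table. The snippet below is a small sketch that uses only the column counts reported in that table; the variable names are illustrative.

```python
# (original, cleaned) column counts as reported in the Final Results table
column_counts = {
    "Kepler": (141, 122),
    "K2": (295, 248),
    "TESS": (87, 75),
}

# Columns removed per dataset, then the overall total
removed = {name: orig - clean for name, (orig, clean) in column_counts.items()}
total_removed = sum(removed.values())

print(removed)        # {'Kepler': 19, 'K2': 47, 'TESS': 12}
print(total_removed)  # 78
```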

### 📁 Output Files:
All cleaned datasets are saved in `data/clean/`:
- `kepler_clean.csv`
- `k2_clean.csv`
- `tess_clean.csv`
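
One way to sanity-check an exported file is to read it back and compare its shape and column names with the in-memory frame. The sketch below assumes a hypothetical helper, `check_roundtrip`, and demonstrates it on a toy frame in a temporary directory rather than on the real files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def check_roundtrip(df, path):
    """Write `df` to `path` as CSV, read it back, and compare shape and columns.

    (Hypothetical helper; it does not compare cell values or dtypes.)
    """
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    return reloaded.shape == df.shape and list(reloaded.columns) == list(df.columns)


# Toy frame with one missing value, written to a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({"rowid": [1, 2], "ra": [291.9, 297.0], "dec": [48.1, None]})
    ok = check_roundtrip(toy, Path(tmp) / "toy_clean.csv")

print(ok)
```

In the notebook itself, the same comparison could be run after export with, for example, `pd.read_csv(clean_dir / 'kepler_clean.csv').shape == kepler_clean.shape`.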

### 🚀 Next Steps:
- Perform mission-specific analysis on each dataset
- Create visualizations of planet properties
- Build predictive models
- Compare discovery methods across missions

---
*Data cleaning pipeline completed successfully! The cleaned datasets are ready for analysis.*