# Kepler Exoplanet Search Results

## Dataset: Kepler Exoplanet Search Results

### Description
#### Context
The Kepler Space Observatory is a NASA-built satellite that was launched in 2009. The telescope is dedicated to searching for exoplanets in star systems other than our own, with the ultimate goal of finding other habitable planets. The original mission ended in 2013 due to mechanical failures, but the telescope has nevertheless been functional since 2014 on a "K2" extended mission.

Kepler had verified 1284 new exoplanets as of May 2016. As of October 2017 there are over 3000 confirmed exoplanets total (using all detection methods, including ground-based ones). The telescope is still active and continues to collect new data on its extended mission.

#### Content
This dataset is a cumulative record of all observed Kepler "objects of interest" — basically, all of the approximately 10,000 exoplanet candidates Kepler has taken observations on.

This dataset has an extensive data dictionary, which can be accessed <a href="https://exoplanetarchive.ipac.caltech.edu/docs/API_kepcandidate_columns.html">here</a>. Notable columns are:

<ul>
  <li>kepoi_name: A KOI is a target identified by the Kepler Project that displays at least one transit-like sequence within Kepler time-series photometry that appears to be of astrophysical origin and initially consistent with a planetary transit hypothesis</li>
  <li>kepler_name: [These names] are intended to clearly indicate a class of objects that have been confirmed or validated as planets—a step up from the planet candidate designation.</li>
  <li>koi_disposition: The disposition in the literature towards this exoplanet candidate. One of CANDIDATE, FALSE POSITIVE, NOT DISPOSITIONED or CONFIRMED.</li>
  <li>koi_pdisposition: The disposition Kepler data analysis has towards this exoplanet candidate. One of FALSE POSITIVE, NOT DISPOSITIONED, and CANDIDATE.</li>
  <li>koi_score: A value between 0 and 1 that indicates the confidence in the KOI disposition. For CANDIDATEs, a higher value indicates more confidence in its disposition, while for FALSE POSITIVEs, a higher value indicates less confidence in that disposition.</li>
</ul>

#### Acknowledgements
This dataset was published as-is by NASA. You can access the original table here. More data from the Kepler mission is available from the same source <a href="https://exoplanetarchive.ipac.caltech.edu/docs/data.html">here</a>.

#### Inspiration
<ul>
    <li>How often are exoplanets confirmed in the existing literature disconfirmed by measurements from Kepler? How about the other way round?</li>
    <li>What general characteristics about exoplanets (that we can find) can you derive from this dataset?</li>
    <li>What exoplanets get assigned names in the literature? What is the distribution of confidence scores?</li>
</ul>
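The first question can be approached with a cross-tabulation of the two disposition columns. The sketch below is only illustrative: the `sample` frame is a hypothetical stand-in that copies the dispositions of the five rows shown in the dataset preview later in this notebook, not the full table.

```python
import pandas as pd

# Illustrative sample: dispositions of the five rows shown in the dataset preview.
sample = pd.DataFrame({
    "koi_disposition": ["CONFIRMED", "CONFIRMED", "FALSE POSITIVE",
                        "FALSE POSITIVE", "CONFIRMED"],
    "koi_pdisposition": ["CANDIDATE", "CANDIDATE", "FALSE POSITIVE",
                         "FALSE POSITIVE", "CANDIDATE"],
})

# Rows: disposition in the literature; columns: disposition from Kepler data analysis.
agreement = pd.crosstab(sample["koi_disposition"], sample["koi_pdisposition"])
print(agreement)
```

On the full dataset, the same call on `df` would show how many objects each pair of dispositions contains.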

See also: the Kepler Labeled Time Series and Open Exoplanets Catalogue datasets.

Download the files
<a href="https://www.kaggle.com/datasets/nasa/kepler-exoplanet-search-results">https://www.kaggle.com/datasets/nasa/kepler-exoplanet-search-results</a>

## Imports

In [4]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split


from sklearn.ensemble import IsolationForest
from sklearn.experimental import enable_iterative_imputer # Required for IterativeImputer
from sklearn.impute import KNNImputer, IterativeImputer


In [2]:
import warnings
# Suppress NumPy's 'Mean of empty slice' RuntimeWarning (raised when a statistic is computed over an all-NaN slice)
warnings.filterwarnings('ignore', category=RuntimeWarning, message='Mean of empty slice')

## Helper functions

In [15]:
# Function that performs the full train/validation/test partition (60% / 20% / 20%)
def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)
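As a quick check of the split proportions, the sketch below repeats the helper so that it runs on its own and applies it to a hypothetical 100-row frame (`toy` and its `label` column are made up for illustration). The first call holds out 40% of the rows, and the second call halves that hold-out into validation and test sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)

# Hypothetical toy data: 100 rows with a balanced binary label.
toy = pd.DataFrame({"x": range(100), "label": ["a", "b"] * 50})

train_set, val_set, test_set = train_val_test_split(toy, stratify="label")
print(len(train_set), len(val_set), len(test_set))
print(train_set["label"].value_counts())
```

Passing `stratify="label"` keeps the class proportions the same in each of the three subsets.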

In [14]:
# Function that separates the input variables from the output variable
def remove_labels(df, label_name):
    X = df.drop(label_name, axis=1)
    y = df[label_name].copy()
    return (X, y)

In [95]:
# Transformer that encodes only the categorical columns and returns a DataFrame
class CustomOneHotEncoder(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        self._oh = OneHotEncoder()
        self._columns = None
        
    def fit(self, X, y=None):
        X_cat = X.select_dtypes(include=['object'])
        self._columns = pd.get_dummies(X_cat).columns
        self._oh.fit(X_cat)
        return self
        
    def transform(self, X, y=None):
        X_copy = X.copy()
        X_cat = X_copy.select_dtypes(include=['object'])
        X_num = X_copy.select_dtypes(exclude=['object'])
        X_cat_oh = self._oh.transform(X_cat)
        X_cat_oh = pd.DataFrame(X_cat_oh.toarray(), 
                                columns=self._columns, 
                                index=X_copy.index)
        X_copy.drop(list(X_cat), axis=1, inplace=True)
        return X_copy.join(X_cat_oh)

In [63]:
# Function that fills the null values in the DataFrame: median for numeric columns, mode for categorical columns.
# Note: the DataFrame passed in is modified in place and the same object is returned.
def fill_null_values(df_to_fill):
    for column in df_to_fill.columns:
        # Check whether the column has null values
        if df_to_fill[column].isna().any():
            
            # Check the data type of the column
            if pd.api.types.is_numeric_dtype(df_to_fill[column]):
                # Numeric column
                try: 
                    mediana = df_to_fill[column].median()
                    df_to_fill[column] = df_to_fill[column].fillna(mediana)
                    print(f"{column} imputada con la mediana ({mediana:.4f})")
                except Exception:
                    print(f"No se pudo aplicar la mediana a la columna '{column}'")
            
            elif df_to_fill[column].dtype == 'object':
                # Categorical column
                try:
                    # mode() can return several values, so the first one is taken
                    moda = df_to_fill[column].mode()[0]
                    df_to_fill[column] = df_to_fill[column].fillna(moda)
                    print(f"{column} imputada con la moda ('{moda}').")
                except Exception:
                    print(f"No se pudo aplicar la moda a la columna '{column}'")
                    
            else:
                # Other types such as boolean, datetime, etc.
                print(f"Columna '{column}': Tipo '{df_to_fill[column].dtype}' no manejado")     

    return df_to_fill
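The two imputation rules can be seen in isolation on a small example. This is a minimal sketch on hypothetical values (the contents of `toy` are made up). It works on a copy so the original frame is left untouched, unlike the helper above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column.
toy = pd.DataFrame({
    "koi_score": [1.0, np.nan, 0.0, 0.5],
    "koi_tce_delivname": pd.Series(
        ["q1_q17", None, "q1_q17", "q1_q16"], dtype=object),
})

filled = toy.copy()

# Numeric column: replace nulls with the median of the observed values.
filled["koi_score"] = filled["koi_score"].fillna(filled["koi_score"].median())

# Categorical column: replace nulls with the most frequent value.
most_frequent = filled["koi_tce_delivname"].mode()[0]
filled["koi_tce_delivname"] = filled["koi_tce_delivname"].fillna(most_frequent)

print(filled)
```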

In [97]:
# Applies capping (clipping) to a column using the IQR method.
# Note: df_in is modified in place and returned.
def iqr_capper(df_in, col_name, factor=1.5):
    Q1 = df_in[col_name].quantile(0.25)
    Q3 = df_in[col_name].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - factor * IQR
    upper_bound = Q3 + factor * IQR
    
    # Clip the values that fall outside the bounds
    df_in[col_name] = df_in[col_name].clip(lower=lower_bound, upper=upper_bound)
    
    return df_in
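A small worked example makes the IQR bounds concrete. The values below are hypothetical. With quartiles of 2 and 4, the IQR is 2, so the bounds with a factor of 1.5 are -1 and 7, and only the extreme value is clipped.

```python
import pandas as pd

# Hypothetical values with one obvious outlier.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside [lower, upper] are replaced by the nearest bound.
capped = values.clip(lower=lower, upper=upper)
print(lower, upper)
print(capped.tolist())
```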

## Reading the dataset

In [113]:
df = pd.read_csv("cumulative.csv")

## Preliminary look at the data

In [114]:
print("DataFrame dimensions:", df.shape)
print("\nFirst 5 rows of the DataFrame:")
print(df.head(5))

DataFrame dimensions: (9564, 50)

First 5 rows of the DataFrame:
   rowid     kepid kepoi_name   kepler_name koi_disposition koi_pdisposition  \
0      1  10797460  K00752.01  Kepler-227 b       CONFIRMED        CANDIDATE   
1      2  10797460  K00752.02  Kepler-227 c       CONFIRMED        CANDIDATE   
2      3  10811496  K00753.01           NaN  FALSE POSITIVE   FALSE POSITIVE   
3      4  10848459  K00754.01           NaN  FALSE POSITIVE   FALSE POSITIVE   
4      5  10854555  K00755.01  Kepler-664 b       CONFIRMED        CANDIDATE   

   koi_score  koi_fpflag_nt  koi_fpflag_ss  koi_fpflag_co  ...  \
0      1.000              0              0              0  ...   
1      0.969              0              0              0  ...   
2      0.000              0              1              0  ...   
3      0.000              0              1              0  ...   
4      1.000              0              0              0  ...   

   koi_steff_err2  koi_slogg  koi_slogg_err1  k

In [115]:
print("\nNull counts per column (top 10):")
print(df.isnull().sum().sort_values(ascending=False).head(10))


Null counts per column (top 10):
koi_teq_err2      9564
koi_teq_err1      9564
kepler_name       7270
koi_score         1510
koi_steff_err2     483
koi_srad_err2      468
koi_srad_err1      468
koi_slogg_err2     468
koi_slogg_err1     468
koi_steff_err1     468
dtype: int64


The columns *koi_teq_err2* and *koi_teq_err1* are null in every record (9564).

The null values for kepler_name and koi_score represent the following percentages:

In [116]:
total_registros = len(df)
num_null_kepler_name = df["kepler_name"].isna().sum()
porcentaje_kepler_name = (num_null_kepler_name / total_registros) * 100

print(f"The percentage of null values for kepler_name is: {porcentaje_kepler_name:.2f}%")

The percentage of null values for kepler_name is: 76.01%


In [117]:
num_null_koi_score = df["koi_score"].isna().sum()
porcentaje_koi_score = (num_null_koi_score / total_registros) * 100

print(f"The percentage of null values for koi_score is: {porcentaje_koi_score:.2f}%")

The percentage of null values for koi_score is: 15.79%
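The same percentages can be obtained for every column at once, without copying counts by hand. The sketch below uses a hypothetical four-row frame (`toy` is illustrative). `isna().mean()` gives the fraction of nulls per column, because `True` counts as 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real DataFrame.
toy = pd.DataFrame({
    "kepler_name": pd.Series(["Kepler-227 b", None, None, None], dtype=object),
    "koi_score": [1.0, 0.0, np.nan, 0.5],
    "koi_period": [9.49, 54.42, 19.90, 1.74],
})

# Percentage of null values per column, highest first.
null_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(null_pct.round(2))
```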


In [118]:
# Set the rowid column as the index
df.set_index("rowid", inplace=True)
df

rowid,kepid,kepoi_name,kepler_name,koi_disposition,koi_pdisposition,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
1,10797460,K00752.01,Kepler-227 b,CONFIRMED,CANDIDATE,1.000,0,0,0,0,...,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
2,10797460,K00752.02,Kepler-227 c,CONFIRMED,CANDIDATE,0.969,0,0,0,0,...,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
3,10811496,K00753.01,,FALSE POSITIVE,FALSE POSITIVE,0.000,0,1,0,0,...,-176.0,4.544,0.044,-0.176,0.868,0.233,-0.078,297.00482,48.134129,15.436
4,10848459,K00754.01,,FALSE POSITIVE,FALSE POSITIVE,0.000,0,1,0,0,...,-174.0,4.564,0.053,-0.168,0.791,0.201,-0.067,285.53461,48.285210,15.597
5,10854555,K00755.01,Kepler-664 b,CONFIRMED,CANDIDATE,1.000,0,0,0,0,...,-211.0,4.438,0.070,-0.210,1.046,0.334,-0.133,288.75488,48.226200,15.509
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9560,10031643,K07984.01,,FALSE POSITIVE,FALSE POSITIVE,0.000,0,0,0,1,...,-152.0,4.296,0.231,-0.189,1.088,0.313,-0.228,298.74921,46.973351,14.478
9561,10090151,K07985.01,,FALSE POSITIVE,FALSE POSITIVE,0.000,0,1,1,0,...,-166.0,4.529,0.035,-0.196,0.903,0.237,-0.079,297.18875,47.093819,14.082
9562,10128825,K07986.01,,CANDIDATE,CANDIDATE,0.497,0,0,0,0,...,-220.0,4.444,0.056,-0.224,1.031,0.341,-0.114,286.50937,47.163219,14.757
9563,10147276,K07987.01,,FALSE POSITIVE,FALSE POSITIVE,0.021,0,0,1,0,...,-236.0,4.447,0.056,-0.224,1.041,0.341,-0.114,294.16489,47.176281,15.385


## Dataset preprocessing

In [119]:
# Copy the dataset so the original is not altered
df_copy = df.copy()

Given the high proportion of missing data, the following columns are dropped:
* *koi_teq_err1* and *koi_teq_err2*: dropped because 100% of their values are null.
* *kepler_name*: dropped because of its high percentage of null values (76.01%), which limits its usefulness for modeling.

**Note:** The *koi_score* column is kept, since its percentage of nulls (15.79%) can be handled with imputation techniques.

In [120]:
# Drop the columns "kepler_name", "koi_teq_err1" and "koi_teq_err2"
df_copy = df_copy.drop(["kepler_name","koi_teq_err1","koi_teq_err2"], axis=1)
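The columns to drop can also be selected by rule rather than by name. The following sketch drops every column whose null fraction exceeds a cutoff. The 0.5 threshold and the `toy` frame are assumptions made for illustration, not values taken from this analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one fully null column, one mostly null, one mostly complete.
toy = pd.DataFrame({
    "koi_teq_err1": [np.nan, np.nan, np.nan, np.nan],
    "kepler_name": pd.Series(["Kepler-227 b", None, None, None], dtype=object),
    "koi_score": [1.0, 0.0, np.nan, 0.5],
})

threshold = 0.5  # assumed cutoff: drop columns with more than 50% nulls

null_fraction = toy.isna().mean()
to_drop = null_fraction[null_fraction > threshold].index.tolist()

reduced = toy.drop(columns=to_drop)
print(to_drop)
print(reduced.columns.tolist())
```

With this cutoff, a column such as *koi_score* (15.79% nulls in the real data) would be kept for imputation.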

Split the dataset into X (features) and y (target).
The target is *koi_disposition*.

In [121]:
# Split the dataset into X, y
X_df, y_df = remove_labels(df_copy, "koi_disposition")
X_df

rowid,kepid,kepoi_name,koi_pdisposition,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
1,10797460,K00752.01,CANDIDATE,1.000,0,0,0,0,9.488036,2.775000e-05,...,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
2,10797460,K00752.02,CANDIDATE,0.969,0,0,0,0,54.418383,2.479000e-04,...,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
3,10811496,K00753.01,FALSE POSITIVE,0.000,0,1,0,0,19.899140,1.494000e-05,...,-176.0,4.544,0.044,-0.176,0.868,0.233,-0.078,297.00482,48.134129,15.436
4,10848459,K00754.01,FALSE POSITIVE,0.000,0,1,0,0,1.736952,2.630000e-07,...,-174.0,4.564,0.053,-0.168,0.791,0.201,-0.067,285.53461,48.285210,15.597
5,10854555,K00755.01,CANDIDATE,1.000,0,0,0,0,2.525592,3.761000e-06,...,-211.0,4.438,0.070,-0.210,1.046,0.334,-0.133,288.75488,48.226200,15.509
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9560,10031643,K07984.01,FALSE POSITIVE,0.000,0,0,0,1,8.589871,1.846000e-04,...,-152.0,4.296,0.231,-0.189,1.088,0.313,-0.228,298.74921,46.973351,14.478
9561,10090151,K07985.01,FALSE POSITIVE,0.000,0,1,1,0,0.527699,1.160000e-07,...,-166.0,4.529,0.035,-0.196,0.903,0.237,-0.079,297.18875,47.093819,14.082
9562,10128825,K07986.01,CANDIDATE,0.497,0,0,0,0,1.739849,1.780000e-05,...,-220.0,4.444,0.056,-0.224,1.031,0.341,-0.114,286.50937,47.163219,14.757
9563,10147276,K07987.01,FALSE POSITIVE,0.021,0,0,1,0,0.681402,2.434000e-06,...,-236.0,4.447,0.056,-0.224,1.041,0.341,-0.114,294.16489,47.176281,15.385


### Imputation of missing data

Imputation is applied only to the features (X_df).

In [122]:
# Number of nulls in each column
X_df.isna().sum()

kepid                   0
kepoi_name              0
koi_pdisposition        0
koi_score            1510
koi_fpflag_nt           0
koi_fpflag_ss           0
koi_fpflag_co           0
koi_fpflag_ec           0
koi_period              0
koi_period_err1       454
koi_period_err2       454
koi_time0bk             0
koi_time0bk_err1      454
koi_time0bk_err2      454
koi_impact            363
koi_impact_err1       454
koi_impact_err2       454
koi_duration            0
koi_duration_err1     454
koi_duration_err2     454
koi_depth             363
koi_depth_err1        454
koi_depth_err2        454
koi_prad              363
koi_prad_err1         363
koi_prad_err2         363
koi_teq               363
koi_insol             321
koi_insol_err1        321
koi_insol_err2        321
koi_model_snr         363
koi_tce_plnt_num      346
koi_tce_delivname     346
koi_steff             363
koi_steff_err1        468
koi_steff_err2        483
koi_slogg             363
koi_slogg_err1        468
koi_slogg_er

Simple imputation is applied according to the variable type:
* If the variable is numeric, the median is used.
* If the variable is categorical, the mode is used.

In [123]:
# Null counts per column, computed once for each dtype group
num_nulls = X_df.select_dtypes(include=np.number).isnull().sum()
cat_nulls = X_df.select_dtypes(include=['object']).isnull().sum()
numerical_na = num_nulls[num_nulls > 0].index.tolist()
categorical_na = cat_nulls[cat_nulls > 0].index.tolist()

print("\nNUMERIC COLUMNS WITH NULLS")
print(numerical_na)
print("\nCATEGORICAL COLUMNS WITH NULLS")
print(categorical_na)


NUMERIC COLUMNS WITH NULLS
['koi_score', 'koi_period_err1', 'koi_period_err2', 'koi_time0bk_err1', 'koi_time0bk_err2', 'koi_impact', 'koi_impact_err1', 'koi_impact_err2', 'koi_duration_err1', 'koi_duration_err2', 'koi_depth', 'koi_depth_err1', 'koi_depth_err2', 'koi_prad', 'koi_prad_err1', 'koi_prad_err2', 'koi_teq', 'koi_insol', 'koi_insol_err1', 'koi_insol_err2', 'koi_model_snr', 'koi_tce_plnt_num', 'koi_steff', 'koi_steff_err1', 'koi_steff_err2', 'koi_slogg', 'koi_slogg_err1', 'koi_slogg_err2', 'koi_srad', 'koi_srad_err1', 'koi_srad_err2', 'koi_kepmag']

CATEGORICAL COLUMNS WITH NULLS
['koi_tce_delivname']


In [124]:
# Call the helper to impute X_df (it modifies X_df in place, so X_df_imputed refers to the same object)
X_df_imputed = fill_null_values(X_df)

koi_score imputada con la mediana (0.3340)
koi_period_err1 imputada con la mediana (0.0000)
koi_period_err2 imputada con la mediana (-0.0000)
koi_time0bk_err1 imputada con la mediana (0.0041)
koi_time0bk_err2 imputada con la mediana (-0.0041)
koi_impact imputada con la mediana (0.5370)
koi_impact_err1 imputada con la mediana (0.1930)
koi_impact_err2 imputada con la mediana (-0.2070)
koi_duration_err1 imputada con la mediana (0.1420)
koi_duration_err2 imputada con la mediana (-0.1420)
koi_depth imputada con la mediana (421.1000)
koi_depth_err1 imputada con la mediana (20.7500)
koi_depth_err2 imputada con la mediana (-20.7500)
koi_prad imputada con la mediana (2.3900)
koi_prad_err1 imputada con la mediana (0.5200)
koi_prad_err2 imputada con la mediana (-0.3000)
koi_teq imputada con la mediana (878.0000)
koi_insol imputada con la mediana (141.6000)
koi_insol_err1 imputada con la mediana (72.8300)
koi_insol_err2 imputada con la mediana (-40.2600)
koi_model_snr imputada con la mediana (23.0

In [125]:
print("\nNulls after simple imputation:")
print(X_df_imputed.isnull().sum())


Nulls after simple imputation:
kepid                0
kepoi_name           0
koi_pdisposition     0
koi_score            0
koi_fpflag_nt        0
koi_fpflag_ss        0
koi_fpflag_co        0
koi_fpflag_ec        0
koi_period           0
koi_period_err1      0
koi_period_err2      0
koi_time0bk          0
koi_time0bk_err1     0
koi_time0bk_err2     0
koi_impact           0
koi_impact_err1      0
koi_impact_err2      0
koi_duration         0
koi_duration_err1    0
koi_duration_err2    0
koi_depth            0
koi_depth_err1       0
koi_depth_err2       0
koi_prad             0
koi_prad_err1        0
koi_prad_err2        0
koi_teq              0
koi_insol            0
koi_insol_err1       0
koi_insol_err2       0
koi_model_snr        0
koi_tce_plnt_num     0
koi_tce_delivname    0
koi_steff            0
koi_steff_err1       0
koi_steff_err2       0
koi_slogg            0
koi_slogg_err1       0
koi_slogg_err2       0
koi_srad             0
koi_srad_err1        0
koi_srad_err2    

"""""""""""""""""

The variable __ will be imputed using KNN:

Note: exclude that column from the simple imputation step so that this method can be applied to it.

"""""""""""""""""

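The KNN step is still pending, so the following is only a sketch of how it might look with scikit-learn's `KNNImputer`, which is already imported above. The column choice, the values in `toy`, and `n_neighbors=2` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: koi_score has one missing value to impute.
toy = pd.DataFrame({
    "koi_period": [1.0, 2.0, 3.0, 4.0],
    "koi_score": [0.1, np.nan, 0.3, 0.4],
})

# Each missing value is replaced by the mean of that feature over the
# n_neighbors rows that are closest on the features that are present.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print(imputed)
```

Because the neighbors are found with a distance measure, features on very different scales would normally be scaled before this step.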
### Outlier detection

In [126]:
# Check whether there are infinite values
X_df_imputed.isin([np.inf, -np.inf]).any()

kepid                False
kepoi_name           False
koi_pdisposition     False
koi_score            False
koi_fpflag_nt        False
koi_fpflag_ss        False
koi_fpflag_co        False
koi_fpflag_ec        False
koi_period           False
koi_period_err1      False
koi_period_err2      False
koi_time0bk          False
koi_time0bk_err1     False
koi_time0bk_err2     False
koi_impact           False
koi_impact_err1      False
koi_impact_err2      False
koi_duration         False
koi_duration_err1    False
koi_duration_err2    False
koi_depth            False
koi_depth_err1       False
koi_depth_err2       False
koi_prad             False
koi_prad_err1        False
koi_prad_err2        False
koi_teq              False
koi_insol            False
koi_insol_err1       False
koi_insol_err2       False
koi_model_snr        False
koi_tce_plnt_num     False
koi_tce_delivname    False
koi_steff            False
koi_steff_err1       False
koi_steff_err2       False
koi_slogg            False
k

In [None]:
# Columns to cap
cols_to_cap = ['koi_period', 'koi_duration']

print("\n--- Outlier detection and capping (IQR) ---")
print("Original minima and maxima:")
print(X_df_imputed[cols_to_cap].agg(['min', 'max']))

for col in cols_to_cap:
    X_df_imputed = iqr_capper(X_df_imputed, col)

print("\nMinima and maxima after capping:")
print(X_df_imputed[cols_to_cap].agg(['min', 'max']))


--- Outlier detection and capping (IQR) ---
Original minima and maxima:
        koi_period  koi_duration
min       0.241843         0.052
max  129995.778400       138.540

Minima and maxima after capping:
     koi_period  koi_duration
min    0.241843      0.052000
max   97.687418     12.034625


Once the null values and outliers have been handled, CustomOneHotEncoder is used to encode the categorical columns as one-hot vectors.

In [None]:
#X_df_encoded = one_hot_encoder.fit_transform(X_df_imputed)

# Descriptive statistics

In [9]:
# Split the dataset into X, y
X_df, y_df = remove_labels(df_copy, "koi_disposition")
X_df

rowid,kepid,koi_pdisposition,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
1,10797460,CANDIDATE,1.000,0,0,0,0,9.488036,2.775000e-05,-2.775000e-05,...,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
2,10797460,CANDIDATE,0.969,0,0,0,0,54.418383,2.479000e-04,-2.479000e-04,...,-81.0,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
3,10811496,FALSE POSITIVE,0.000,0,1,0,0,19.899140,1.494000e-05,-1.494000e-05,...,-176.0,4.544,0.044,-0.176,0.868,0.233,-0.078,297.00482,48.134129,15.436
4,10848459,FALSE POSITIVE,0.000,0,1,0,0,1.736952,2.630000e-07,-2.630000e-07,...,-174.0,4.564,0.053,-0.168,0.791,0.201,-0.067,285.53461,48.285210,15.597
5,10854555,CANDIDATE,1.000,0,0,0,0,2.525592,3.761000e-06,-3.761000e-06,...,-211.0,4.438,0.070,-0.210,1.046,0.334,-0.133,288.75488,48.226200,15.509
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9560,10031643,FALSE POSITIVE,0.000,0,0,0,1,8.589871,1.846000e-04,-1.846000e-04,...,-152.0,4.296,0.231,-0.189,1.088,0.313,-0.228,298.74921,46.973351,14.478
9561,10090151,FALSE POSITIVE,0.000,0,1,1,0,0.527699,1.160000e-07,-1.160000e-07,...,-166.0,4.529,0.035,-0.196,0.903,0.237,-0.079,297.18875,47.093819,14.082
9562,10128825,CANDIDATE,0.497,0,0,0,0,1.739849,1.780000e-05,-1.780000e-05,...,-220.0,4.444,0.056,-0.224,1.031,0.341,-0.114,286.50937,47.163219,14.757
9563,10147276,FALSE POSITIVE,0.021,0,0,1,0,0.681402,2.434000e-06,-2.434000e-06,...,-236.0,4.447,0.056,-0.224,1.041,0.341,-0.114,294.16489,47.176281,15.385


In [45]:
# Use CustomOneHotEncoder to encode the categorical columns
one_hot_encoder = CustomOneHotEncoder()
X_df = one_hot_encoder.fit_transform(X_df)

# Descriptive statistics

In [46]:
df.describe()

Unnamed: 0,kepid,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
count,9564.0,8054.0,9564.0,9564.0,9564.0,9564.0,9564.0,9110.0,9110.0,9564.0,...,9081.0,9201.0,9096.0,9096.0,9201.0,9096.0,9096.0,9564.0,9564.0,9563.0
mean,7690628.0,0.480829,0.188206,0.231598,0.194898,0.120033,75.671358,0.002148,-0.002148,166.183251,...,-162.265059,4.310157,0.120738,-0.143161,1.728712,0.362292,-0.394806,292.060163,43.810433,14.264606
std,2653459.0,0.476928,0.390897,0.421875,0.396143,0.325018,1334.744046,0.008236,0.008236,67.91896,...,72.746348,0.432606,0.132837,0.085477,6.127185,0.93087,2.168213,4.766657,3.601243,1.385448
min,757450.0,0.0,0.0,0.0,0.0,0.0,0.241843,0.0,-0.1725,120.515914,...,-1762.0,0.047,0.0,-1.207,0.109,0.0,-116.137,279.85272,36.577381,6.966
25%,5556034.0,0.0,0.0,0.0,0.0,0.0,2.733684,5e-06,-0.000276,132.761718,...,-198.0,4.218,0.042,-0.196,0.829,0.129,-0.25,288.66077,40.777173,13.44
50%,7906892.0,0.334,0.0,0.0,0.0,0.0,9.752831,3.5e-05,-3.5e-05,137.224595,...,-160.0,4.438,0.07,-0.128,1.0,0.251,-0.111,292.261125,43.677504,14.52
75%,9873066.0,0.998,0.0,0.0,0.0,0.0,40.715178,0.000276,-5e-06,170.694603,...,-114.0,4.543,0.149,-0.088,1.345,0.364,-0.069,295.85916,46.714611,15.322
max,12935140.0,1.0,1.0,1.0,1.0,1.0,129995.7784,0.1725,0.0,1472.522306,...,0.0,5.364,1.472,0.0,229.908,33.091,0.0,301.72076,52.33601,20.003


In [47]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9564 entries, 1 to 9564
Data columns (total 49 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   kepid              9564 non-null   int64  
 1   kepoi_name         9564 non-null   object 
 2   kepler_name        2294 non-null   object 
 3   koi_disposition    9564 non-null   object 
 4   koi_pdisposition   9564 non-null   object 
 5   koi_score          8054 non-null   float64
 6   koi_fpflag_nt      9564 non-null   int64  
 7   koi_fpflag_ss      9564 non-null   int64  
 8   koi_fpflag_co      9564 non-null   int64  
 9   koi_fpflag_ec      9564 non-null   int64  
 10  koi_period         9564 non-null   float64
 11  koi_period_err1    9110 non-null   float64
 12  koi_period_err2    9110 non-null   float64
 13  koi_time0bk        9564 non-null   float64
 14  koi_time0bk_err1   9110 non-null   float64
 15  koi_time0bk_err2   9110 non-null   float64
 16  koi_impact         9201 non-n

In [48]:
# Check whether there are null values
df.isna().any()

kepid                False
kepoi_name           False
kepler_name           True
koi_disposition      False
koi_pdisposition     False
koi_score             True
koi_fpflag_nt        False
koi_fpflag_ss        False
koi_fpflag_co        False
koi_fpflag_ec        False
koi_period           False
koi_period_err1       True
koi_period_err2       True
koi_time0bk          False
koi_time0bk_err1      True
koi_time0bk_err2      True
koi_impact            True
koi_impact_err1       True
koi_impact_err2       True
koi_duration         False
koi_duration_err1     True
koi_duration_err2     True
koi_depth             True
koi_depth_err1        True
koi_depth_err2        True
koi_prad              True
koi_prad_err1         True
koi_prad_err2         True
koi_teq               True
koi_teq_err1          True
koi_teq_err2          True
koi_insol             True
koi_insol_err1        True
koi_insol_err2        True
koi_model_snr         True
koi_tce_plnt_num      True
koi_tce_delivname     True
k

In [49]:
# Number of nulls in each column
df.isna().sum()

kepid                   0
kepoi_name              0
kepler_name          7270
koi_disposition         0
koi_pdisposition        0
koi_score            1510
koi_fpflag_nt           0
koi_fpflag_ss           0
koi_fpflag_co           0
koi_fpflag_ec           0
koi_period              0
koi_period_err1       454
koi_period_err2       454
koi_time0bk             0
koi_time0bk_err1      454
koi_time0bk_err2      454
koi_impact            363
koi_impact_err1       454
koi_impact_err2       454
koi_duration            0
koi_duration_err1     454
koi_duration_err2     454
koi_depth             363
koi_depth_err1        454
koi_depth_err2        454
koi_prad              363
koi_prad_err1         363
koi_prad_err2         363
koi_teq               363
koi_teq_err1         9564
koi_teq_err2         9564
koi_insol             321
koi_insol_err1        321
koi_insol_err2        321
koi_model_snr         363
koi_tce_plnt_num      346
koi_tce_delivname     346
koi_steff             363
koi_steff_er

In [50]:
print("Number of nulls in kepler_name: ", df["kepler_name"].isnull().sum())
print("Number of nulls in koi_teq_err1: ", df["koi_teq_err1"].isnull().sum())
print("Number of nulls in koi_teq_err2: ", df["koi_teq_err2"].isnull().sum())

Number of nulls in kepler_name:  7270
Number of nulls in koi_teq_err1:  9564
Number of nulls in koi_teq_err2:  9564


In [51]:
# Check whether there are infinite values
df.isin([np.inf, -np.inf]).any()

kepid                False
kepoi_name           False
kepler_name          False
koi_disposition      False
koi_pdisposition     False
koi_score            False
koi_fpflag_nt        False
koi_fpflag_ss        False
koi_fpflag_co        False
koi_fpflag_ec        False
koi_period           False
koi_period_err1      False
koi_period_err2      False
koi_time0bk          False
koi_time0bk_err1     False
koi_time0bk_err2     False
koi_impact           False
koi_impact_err1      False
koi_impact_err2      False
koi_duration         False
koi_duration_err1    False
koi_duration_err2    False
koi_depth            False
koi_depth_err1       False
koi_depth_err2       False
koi_prad             False
koi_prad_err1        False
koi_prad_err2        False
koi_teq              False
koi_teq_err1         False
koi_teq_err2         False
koi_insol            False
koi_insol_err1       False
koi_insol_err2       False
koi_model_snr        False
koi_tce_plnt_num     False
koi_tce_delivname    False
k

In [52]:
df["koi_disposition"].value_counts()

koi_disposition
FALSE POSITIVE    5023
CONFIRMED         2293
CANDIDATE         2248
Name: count, dtype: int64
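The counts above are easier to compare as proportions. The sketch below rebuilds the three counts shown in the output as a Series (on the real data, `df["koi_disposition"].value_counts(normalize=True)` should give the same proportions directly) and shows that FALSE POSITIVE makes up a little over half of the records.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series(
    {"FALSE POSITIVE": 5023, "CONFIRMED": 2293, "CANDIDATE": 2248},
    name="count",
)

# Share of each class in the whole dataset.
proportions = (counts / counts.sum()).round(4)
print(proportions)
```

This imbalance is one reason to pass `stratify="koi_disposition"` to the splitting helper, so each subset keeps similar class shares.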

Now let's look at the changes in the version imputed with the helper function.

In [53]:
df_copy.describe()

Unnamed: 0,kepid,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
count,9218.0,9218.0,9218.0,9218.0,9218.0,9218.0,9218.0,9218.0,9218.0,9218.0,...,9218.0,9218.0,9218.0,9218.0,9218.0,9218.0,9218.0,9218.0,9218.0,9218.0
mean,7692401.0,0.462289,0.178455,0.238772,0.199935,0.123888,57.571081,0.001768956,-0.001768956,164.912096,...,-161.470493,4.313578,0.119052,-0.139743,1.712887,0.352308,-0.385183,292.088841,43.812032,14.271238
std,2650746.0,0.448457,0.382916,0.426356,0.399973,0.329472,118.387952,0.007157063,0.007157063,66.987629,...,71.475957,0.428219,0.130859,0.081144,6.101649,0.914593,2.148232,4.78082,3.599661,1.377314
min,757450.0,0.0,0.0,0.0,0.0,0.0,0.25982,1.1e-08,-0.1568,120.515914,...,-1762.0,0.047,0.0,-1.207,0.109,0.0,-116.137,279.85272,36.577381,6.966
25%,5557058.0,0.0,0.0,0.0,0.0,0.0,2.637068,5.6315e-06,-0.000216775,132.71343,...,-195.0,4.228,0.045,-0.193,0.832,0.133,-0.236,288.689642,40.777038,13.44825
50%,7901976.0,0.334,0.0,0.0,0.0,0.0,9.229543,3.5205e-05,-3.5205e-05,136.944065,...,-159.0,4.438,0.07,-0.127,1.0,0.2465,-0.111,292.29459,43.665781,14.519
75%,9872290.0,0.996,0.0,0.0,0.0,0.0,36.598536,0.000216775,-5.6315e-06,170.044539,...,-115.0,4.54,0.143,-0.09,1.321,0.35075,-0.07,295.90503,46.710811,15.32175
max,12935140.0,1.0,1.0,1.0,1.0,1.0,1071.232624,0.1568,-1.1e-08,1472.522306,...,0.0,5.364,1.472,0.0,229.908,33.091,0.0,301.72076,52.33601,20.003


In [54]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9218 entries, 1 to 9564
Data columns (total 45 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   kepid              9218 non-null   int64  
 1   koi_disposition    9218 non-null   object 
 2   koi_pdisposition   9218 non-null   object 
 3   koi_score          9218 non-null   float64
 4   koi_fpflag_nt      9218 non-null   int64  
 5   koi_fpflag_ss      9218 non-null   int64  
 6   koi_fpflag_co      9218 non-null   int64  
 7   koi_fpflag_ec      9218 non-null   int64  
 8   koi_period         9218 non-null   float64
 9   koi_period_err1    9218 non-null   float64
 10  koi_period_err2    9218 non-null   float64
 11  koi_time0bk        9218 non-null   float64
 12  koi_time0bk_err1   9218 non-null   float64
 13  koi_time0bk_err2   9218 non-null   float64
 14  koi_impact         9218 non-null   float64
 15  koi_impact_err1    9218 non-null   float64
 16  koi_impact_err2    9218 non-n

In [56]:
# Focus on the raw measurements for now: drop the measurement-error columns
df_copy2 = df_copy.drop(["koi_period_err1","koi_period_err2",
                         "koi_time0bk_err1","koi_time0bk_err2",
                         "koi_impact_err1","koi_impact_err2",
                         "koi_duration_err1","koi_duration_err2",
                         "koi_depth_err1","koi_depth_err2",
                         "koi_prad_err1","koi_prad_err2",
                         "koi_insol_err1","koi_insol_err2",
                         "koi_steff_err1","koi_steff_err2",
                         "koi_slogg_err1","koi_slogg_err2",
                         "koi_srad_err1","koi_srad_err2"
                         ], axis=1)
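
The explicit list above can also be derived from the column names. The following is only a minimal sketch, assuming that every measurement-error column ends in `_err1` or `_err2`; the `toy` frame is hypothetical and stands in for `df_copy`. Note that a pattern like this would also remove any other `_err` column still present in the frame, which may or may not be what the notebook intends.

```python
import pandas as pd

# Hypothetical toy frame that mimics the naming scheme of the KOI table
toy = pd.DataFrame({
    "koi_period": [9.49, 54.42],
    "koi_period_err1": [2.8e-05, 2.5e-04],
    "koi_period_err2": [-2.8e-05, -2.5e-04],
    "koi_depth": [615.8, 874.8],
})

# Select the columns whose name ends in _err1 or _err2 ...
err_cols = toy.filter(regex=r"_err[12]$").columns

# ... and drop them, leaving only the measurements themselves
toy_clean = toy.drop(columns=err_cols)
print(list(toy_clean.columns))
```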

In [57]:
df_copy2

rowid,kepid,koi_disposition,koi_pdisposition,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_time0bk,...,koi_insol,koi_model_snr,koi_tce_plnt_num,koi_tce_delivname,koi_steff,koi_slogg,koi_srad,ra,dec,koi_kepmag
1,10797460,CONFIRMED,CANDIDATE,1.000,0,0,0,0,9.488036,170.538750,...,93.59,35.8,1.0,q1_q17_dr25_tce,5455.0,4.467,0.927,291.93423,48.141651,15.347
2,10797460,CONFIRMED,CANDIDATE,0.969,0,0,0,0,54.418383,162.513840,...,9.11,25.8,2.0,q1_q17_dr25_tce,5455.0,4.467,0.927,291.93423,48.141651,15.347
3,10811496,FALSE POSITIVE,FALSE POSITIVE,0.000,0,1,0,0,19.899140,175.850252,...,39.30,76.3,1.0,q1_q17_dr25_tce,5853.0,4.544,0.868,297.00482,48.134129,15.436
4,10848459,FALSE POSITIVE,FALSE POSITIVE,0.000,0,1,0,0,1.736952,170.307565,...,891.96,505.6,1.0,q1_q17_dr25_tce,5805.0,4.564,0.791,285.53461,48.285210,15.597
5,10854555,CONFIRMED,CANDIDATE,1.000,0,0,0,0,2.525592,171.595550,...,926.16,40.9,1.0,q1_q17_dr25_tce,6031.0,4.438,1.046,288.75488,48.226200,15.509
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9560,10031643,FALSE POSITIVE,FALSE POSITIVE,0.000,0,0,0,1,8.589871,132.016100,...,176.40,8.4,1.0,q1_q17_dr25_tce,5638.0,4.296,1.088,298.74921,46.973351,14.478
9561,10090151,FALSE POSITIVE,FALSE POSITIVE,0.000,0,1,1,0,0.527699,131.705093,...,4500.53,453.3,1.0,q1_q17_dr25_tce,5638.0,4.529,0.903,297.18875,47.093819,14.082
9562,10128825,CANDIDATE,CANDIDATE,0.497,0,0,0,0,1.739849,133.001270,...,1585.81,10.6,1.0,q1_q17_dr25_tce,6119.0,4.444,1.031,286.50937,47.163219,14.757
9563,10147276,FALSE POSITIVE,FALSE POSITIVE,0.021,0,0,1,0,0.681402,132.181750,...,5713.41,12.3,1.0,q1_q17_dr25_tce,6173.0,4.447,1.041,294.16489,47.176281,15.385


After the initial cleaning and the removal of the measurement-error columns, we work with df_copy2, the cleaned version of the dataset.
In this section we compute the main descriptive statistics to understand the overall distribution of the variables: central tendency, dispersion, and behavior by class (koi_disposition).

In [58]:
# Summary statistics of the cleaned dataset (df_copy2)
df_copy2.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
kepid,9218.0,7692401.0,2650746.0,757450.0,5557058.0,7901976.0,9872290.0,12935140.0
koi_score,9218.0,0.4622886,0.4484573,0.0,0.0,0.334,0.996,1.0
koi_fpflag_nt,9218.0,0.1784552,0.3829162,0.0,0.0,0.0,0.0,1.0
koi_fpflag_ss,9218.0,0.238772,0.4263562,0.0,0.0,0.0,0.0,1.0
koi_fpflag_co,9218.0,0.1999349,0.3999729,0.0,0.0,0.0,0.0,1.0
koi_fpflag_ec,9218.0,0.123888,0.3294717,0.0,0.0,0.0,0.0,1.0
koi_period,9218.0,57.57108,118.388,0.25982,2.637068,9.229543,36.59854,1071.233
koi_time0bk,9218.0,164.9121,66.98763,120.515914,132.7134,136.9441,170.0445,1472.522
koi_impact,9218.0,0.7255348,3.185785,0.0,0.209,0.537,0.88175,100.806
koi_duration,9218.0,5.503251,6.402892,0.052,2.41085,3.737,6.109,138.54


Several astronomical variables, such as koi_period, koi_depth, koi_prad and koi_insol, span very wide ranges. This is expected, because planetary systems differ greatly from one another.

In [15]:
# Compute Q1, Q3 and the IQR for the numeric columns only
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1

IQR.to_frame(name="IQR")

Unnamed: 0,IQR
kepid,4315232.0
koi_score,0.996
koi_fpflag_nt,0.0
koi_fpflag_ss,0.0
koi_fpflag_co,0.0
koi_fpflag_ec,0.0
koi_period,33.96147
koi_time0bk,37.33111
koi_impact,0.67275
koi_duration,3.69815


The IQR obtained for each variable measures the spread of the central 50% of the data.
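
One common use of Q1, Q3 and the IQR is Tukey's rule, which flags points lying more than 1.5 × IQR beyond the quartiles. The notebook does not apply this rule; the sketch below only illustrates it on a hypothetical toy series (the values are made up, not rows from the dataset).

```python
import pandas as pd

# Hypothetical toy sample: mostly short periods plus one extreme value
periods = pd.Series([2.6, 3.1, 9.2, 12.0, 36.6, 40.0, 1071.2], name="koi_period")

q1 = periods.quantile(0.25)
q3 = periods.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: points beyond 1.5 * IQR from the quartiles are flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = periods[(periods < lower_fence) | (periods > upper_fence)]

print(f"IQR = {iqr:.2f}, fences = [{lower_fence:.2f}, {upper_fence:.2f}]")
print(outliers)
```

For heavily skewed variables such as the ones in this table, such a rule tends to flag many legitimate observations, so it is better read as a diagnostic than as a filter.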

In [34]:
# Skewness and kurtosis of the numeric variables
shape_stats = pd.DataFrame({
    'Skewness (asimetría)': df_copy2.skew(numeric_only=True),
    'Kurtosis (curtosis)': df_copy2.kurt(numeric_only=True)
})
shape_stats

Unnamed: 0,Skewness (asimetría),Kurtosis (curtosis)
kepid,-0.167847,-0.916096
koi_score,0.179075,-1.793948
koi_fpflag_nt,1.679817,0.821963
koi_fpflag_ss,1.225664,-0.497856
koi_fpflag_co,1.500753,0.252314
koi_fpflag_ec,2.283617,3.215605
koi_period,2.837088,8.099903
koi_time0bk,3.829318,27.113672
koi_impact,24.115522,634.343364
koi_duration,6.190406,68.990265


Most variables show positive skewness, meaning they have long right tails.
This is typical of astronomical data, where many systems are "ordinary" and a few have very extreme values. Several variables also show high kurtosis, which indicates heavy-tailed distributions.
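
To make the two statistics more concrete, here is a small sketch on a synthetic sample (an assumption for illustration, not the Kepler data). It shows that pandas reports excess kurtosis, and that a `log1p` transform, one common option for right-skewed positive variables, typically reduces both values.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample drawn from a log-normal distribution
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=0.0, sigma=1.5, size=5000))

# pandas reports excess kurtosis (a normal distribution gives about 0)
print("raw:   skew =", sample.skew(), " kurtosis =", sample.kurt())

# log1p compresses the right tail, so both statistics shrink here
logged = np.log1p(sample)
print("log1p: skew =", logged.skew(), " kurtosis =", logged.kurt())
```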

In [None]:
# Variables with the largest absolute skewness
shape_stats_abs_skew = shape_stats.reindex(
    shape_stats['Skewness (asimetría)'].abs().sort_values(ascending=False).index
)

shape_stats_abs_skew.head(5)

Unnamed: 0,Skewness (asimetría),Kurtosis (curtosis)
koi_prad,52.357532,2995.823627
koi_insol,51.160939,3078.150708
koi_impact,24.115522,634.343364
koi_srad,21.140204,552.28706
koi_duration,6.190406,68.990265


In [62]:
# Concatenate the categorical column koi_disposition with the numeric columns
df_group = pd.concat([df_copy2['koi_disposition'], numeric_df], axis=1)

In [63]:
# Compute the per-group means
group_stats = df_group.groupby('koi_disposition').mean(numeric_only=True)

group_stats

koi_disposition,kepid,koi_score,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_time0bk,koi_impact,koi_duration,...,koi_teq,koi_insol,koi_model_snr,koi_tce_plnt_num,koi_steff,koi_slogg,koi_srad,ra,dec,koi_kepmag
CANDIDATE,7796868.0,0.800519,0.0,0.014672,0.0,0.0,67.099452,169.46826,0.53901,4.74386,...,886.501146,5354.493173,45.386703,1.294819,5640.197157,4.332735,1.56133,291.793672,43.953132,14.339111
CONFIRMED,8106955.0,0.961497,0.00919,0.011379,0.002626,0.000438,26.779465,157.090885,0.426311,4.303189,...,839.580744,351.101856,87.82477,1.451204,5477.691028,4.410408,1.067121,290.958335,44.374334,14.341682
FALSE POSITIVE,7445115.0,0.067008,0.341751,0.450968,0.386574,0.240109,68.004049,166.581809,0.955025,6.428835,...,1301.386574,11879.960932,438.449179,1.12037,5838.141414,4.258225,2.092963,292.767916,43.47689,14.206214


In [68]:
vars_interes = ['koi_score', 'koi_period', 'koi_prad', 'koi_teq', 'koi_insol']

group_stats = (
    df_copy2
    .groupby('koi_disposition')[vars_interes]
    .agg(['mean', 'median', 'std'])
)

group_stats

koi_disposition,koi_score (mean),koi_score (median),koi_score (std),koi_period (mean),koi_period (median),koi_period (std),koi_prad (mean),koi_prad (median),koi_prad (std),koi_teq (mean),koi_teq (median),koi_teq (std),koi_insol (mean),koi_insol (median),koi_insol (std)
CANDIDATE,0.800519,0.967,0.272144,67.099452,13.062435,120.914384,15.844021,1.82,317.224293,886.501146,785.0,659.794741,5354.493173,89.81,156065.584837
CONFIRMED,0.961497,1.0,0.146744,26.779465,11.296894,52.211889,2.870858,2.17,3.363599,839.580744,782.0,386.675232,351.101856,88.23,1225.296647
FALSE POSITIVE,0.067008,0.0,0.131051,68.004049,5.594575,136.253552,186.952915,5.535,4270.029806,1301.386574,1066.0,1001.730847,11879.960932,323.185,192564.092189


In [None]:
# Count the labels in each disposition column (pd.value_counts is deprecated in recent pandas)
df_copy2[["koi_disposition", "koi_pdisposition"]].apply(pd.Series.value_counts)
# With Kepler's own disposition (koi_pdisposition) the number of candidates roughly doubles (2181 vs. 4422), because that column has no CONFIRMED label


Unnamed: 0,koi_disposition,koi_pdisposition
CANDIDATE,2181,4422.0
CONFIRMED,2285,
FALSE POSITIVE,4752,4796.0


In [22]:
disposition = ['koi_disposition','koi_pdisposition'] 
koi_flags = ['koi_fpflag_nt','koi_fpflag_ss','koi_fpflag_co','koi_fpflag_ec'] 
trans_prop = ['koi_period','koi_time0bk','koi_impact','koi_duration','koi_depth','koi_prad','koi_teq','koi_insol'] 
stellar_par = ['koi_steff','koi_slogg','koi_srad']

In [None]:
agr_perc = (df_copy2["koi_disposition"] == df_copy2["koi_pdisposition"]).mean()*100
agr_perc
# The two dispositions agree on about 75% of the KOIs

np.float64(75.211542633977)
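
In the value counts above, CONFIRMED has no entry under koi_pdisposition, so a CONFIRMED row can never match in a direct comparison. A possible refinement, sketched here on hypothetical toy labels, is to treat CONFIRMED as CANDIDATE before comparing, so that both columns share the same label set. Whether that mapping is appropriate is a modelling assumption, not something the notebook states.

```python
import pandas as pd

# Hypothetical toy labels, not rows from the KOI table
toy = pd.DataFrame({
    "koi_disposition":  ["CONFIRMED", "CANDIDATE", "FALSE POSITIVE", "CONFIRMED"],
    "koi_pdisposition": ["CANDIDATE", "CANDIDATE", "FALSE POSITIVE", "FALSE POSITIVE"],
})

# Raw agreement: CONFIRMED never matches because koi_pdisposition lacks that label
raw_agreement = (toy["koi_disposition"] == toy["koi_pdisposition"]).mean() * 100

# Treat CONFIRMED as CANDIDATE so both columns share the same label set
collapsed = toy["koi_disposition"].replace({"CONFIRMED": "CANDIDATE"})
adj_agreement = (collapsed == toy["koi_pdisposition"]).mean() * 100

print(raw_agreement, adj_agreement)
```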

In [None]:
df_copy2[koi_flags].mean()*100
# The ss flag (stellar eclipse) is the most common one

koi_fpflag_nt    17.845520
koi_fpflag_ss    23.877197
koi_fpflag_co    19.993491
koi_fpflag_ec    12.388805
dtype: float64

In [None]:
df_copy2[koi_flags].sum(axis=1).value_counts()
# This shows how many KOIs carry more than one flag; 4 of them have all four flags set and are most likely false positives

0    4388
1    3326
2    1011
3     489
4       4
Name: count, dtype: int64
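
The guess that heavily flagged KOIs are false positives could be checked by cross-tabulating the number of flags against the disposition. The sketch below uses a hypothetical toy frame with the same column names; it does not reproduce results from the real table.

```python
import pandas as pd

# Hypothetical toy rows that mimic the flag columns of the KOI table
toy = pd.DataFrame({
    "koi_disposition": ["CONFIRMED", "CANDIDATE", "FALSE POSITIVE",
                        "FALSE POSITIVE", "FALSE POSITIVE"],
    "koi_fpflag_nt": [0, 0, 1, 0, 1],
    "koi_fpflag_ss": [0, 0, 1, 1, 1],
    "koi_fpflag_co": [0, 0, 0, 1, 1],
    "koi_fpflag_ec": [0, 0, 0, 0, 1],
})
flag_cols = ["koi_fpflag_nt", "koi_fpflag_ss", "koi_fpflag_co", "koi_fpflag_ec"]

# Number of flags raised per row
n_flags = toy[flag_cols].sum(axis=1).rename("n_flags")

# Rows: disposition, columns: number of flags raised
table = pd.crosstab(toy["koi_disposition"], n_flags)
print(table)
```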

In [28]:
df_copy2[trans_prop].describe()


Unnamed: 0,koi_period,koi_time0bk,koi_impact,koi_duration,koi_depth,koi_prad,koi_teq,koi_insol
count,9218.0,9218.0,9218.0,9218.0,9218.0,9218.0,9218.0,9218.0
mean,57.571081,164.912096,0.725535,5.503251,23696.18,100.837055,1088.749186,7478.194
std,118.387952,66.987629,3.185785,6.402892,82157.95,3070.861154,840.025688,157792.8
min,0.25982,120.515914,0.0,0.052,0.0,0.08,92.0,0.02
25%,2.637068,132.71343,0.209,2.41085,166.125,1.42,562.0,23.6225
50%,9.229543,136.944065,0.537,3.737,421.1,2.39,878.0,141.6
75%,36.598536,170.044539,0.88175,6.109,1409.1,14.2025,1369.0,844.04
max,1071.232624,1472.522306,100.806,138.54,1541400.0,200346.0,14667.0,10947550.0


In [None]:
df_copy2[trans_prop + stellar_par].corr()
# Correlation between the transit properties and the stellar parameters

Unnamed: 0,koi_period,koi_time0bk,koi_impact,koi_duration,koi_depth,koi_prad,koi_teq,koi_insol,koi_steff,koi_slogg,koi_srad
koi_period,1.0,0.666775,0.05955,0.315817,-0.068785,0.063058,-0.383989,-0.02246,0.022263,-0.053964,0.01069
koi_time0bk,0.666775,1.0,0.050927,0.207852,-0.049513,0.035093,-0.313727,-0.021413,0.004137,-0.008496,-0.003556
koi_impact,0.05955,0.050927,1.0,0.039207,0.006701,0.696052,-0.009011,-0.003426,0.017009,-0.061851,0.024266
koi_duration,0.315817,0.207852,0.039207,1.0,0.074587,0.037524,-0.183056,-0.017885,0.101119,-0.124533,0.01486
koi_depth,-0.068785,-0.049513,0.006701,0.074587,1.0,0.002764,0.081186,-0.005922,0.11694,-0.010797,-0.016115
koi_prad,0.063058,0.035093,0.696052,0.037524,0.002764,1.0,-0.000995,0.003079,-0.013128,-0.098313,0.057014
koi_teq,-0.383989,-0.313727,-0.009011,-0.183056,0.081186,-0.000995,1.0,0.417941,0.246949,-0.529175,0.43954
koi_insol,-0.02246,-0.021413,-0.003426,-0.017885,-0.005922,0.003079,0.417941,1.0,-0.055887,-0.281153,0.52751
koi_steff,0.022263,0.004137,0.017009,0.101119,0.11694,-0.013128,0.246949,-0.055887,1.0,-0.138157,-0.118081
koi_slogg,-0.053964,-0.008496,-0.061851,-0.124533,-0.010797,-0.098313,-0.529175,-0.281153,-0.138157,1.0,-0.638819


## Computing Correlations

In [None]:
# Convert the target variable to numeric codes so that correlations can be computed
X_df_copy = X_df.join(y_df)
X_df_copy["koi_disposition"] = X_df_copy["koi_disposition"].factorize()[0]

In [None]:
# Compute the correlations
corr_matrix = X_df_copy.corr()
corr_matrix["koi_disposition"].sort_values(ascending=False)

In [None]:
X_df_copy.corr()

In [None]:
# We could consider keeping only the features with the highest correlation
corr_matrix[corr_matrix["koi_disposition"] > 0.05]
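
Two details of this step are worth keeping in mind. `factorize` numbers the classes in order of first appearance, so the resulting codes are nominal rather than ordinal, and a signed threshold such as `> 0.05` discards features that are strongly but negatively correlated with those codes. The sketch below illustrates both points on hypothetical toy data (`feat_pos`, `feat_neg` and `label` are made-up names); using the absolute value is shown as one alternative, not as the notebook's method.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's X_df / y_df
toy = pd.DataFrame({
    "feat_pos": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "feat_neg": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    "label": ["CONFIRMED", "CONFIRMED", "FALSE POSITIVE",
              "FALSE POSITIVE", "CANDIDATE", "CANDIDATE"],
})

# factorize numbers the classes in order of first appearance
codes, classes = toy["label"].factorize()
print(dict(enumerate(classes)))

toy["label"] = codes
corr_with_label = toy.corr()["label"].drop("label")

# A signed threshold keeps only positively correlated features ...
signed = corr_with_label[corr_with_label > 0.05].index.tolist()
# ... while an absolute threshold also keeps strong negative ones
absolute = corr_with_label[corr_with_label.abs() > 0.05].index.tolist()
print(signed, absolute)
```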

## Reducing the Number of Features

In [None]:
# Extract the 12 most relevant features for the algorithm
columns = list(corr_matrix[corr_matrix["koi_disposition"] > 0.05].index)
columns.remove("koi_disposition")

In [None]:
columns

In [None]:
X_df_reduced = X_df_copy[columns].copy()

In [None]:
X_df_reduced

In [None]:
df_prep = X_df_reduced.join(y_df)

In [None]:
df_prep

## Splitting the Dataset (Non-Reduced Dataset)

In [None]:
train_set, val_set, test_set = train_val_test_split(X_df_copy)

In [None]:
X_train, y_train = remove_labels(train_set, 'koi_disposition')
X_val, y_val = remove_labels(val_set, 'koi_disposition')
X_test, y_test = remove_labels(test_set, 'koi_disposition')
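
The helpers `train_val_test_split` and `remove_labels` are not defined in this part of the notebook. The following is a hypothetical sketch of what they might look like, assuming a 60/20/20 split built on scikit-learn's `train_test_split`; the proportions, the `rstate` parameter and the toy frame are assumptions, and the notebook's own helpers may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def train_val_test_split(df, rstate=42):
    """Hypothetical 60/20/20 split; the notebook's own helper may differ."""
    # First split off 40% of the rows, then halve that part into val and test
    train_set, rest = train_test_split(df, test_size=0.4, random_state=rstate)
    val_set, test_set = train_test_split(rest, test_size=0.5, random_state=rstate)
    return train_set, val_set, test_set

def remove_labels(df, label_name):
    """Separate a frame into the features X and the target y."""
    X = df.drop(label_name, axis=1)
    y = df[label_name].copy()
    return X, y

# Toy usage with a hypothetical frame
toy = pd.DataFrame({"x": range(10), "koi_disposition": [0, 1] * 5})
train_set, val_set, test_set = train_val_test_split(toy)
X_train, y_train = remove_labels(train_set, "koi_disposition")
print(len(train_set), len(val_set), len(test_set))
```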

## Random Forests (Non-Reduced Dataset)

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf_rnd = RandomForestClassifier(n_estimators=10000, random_state=42, n_jobs=-1)
clf_rnd.fit(X_train, y_train)

In [None]:
# Predict on the training set
y_train_pred = clf_rnd.predict(X_train)

In [None]:
from sklearn.metrics import f1_score  # f1_score expects (y_true, y_pred)
print("F1 Score Train Set:", f1_score(y_train, y_train_pred, average='weighted'))
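
scikit-learn's `f1_score` takes the true labels first and the predictions second. With `average='weighted'` the order matters, because the per-class scores are weighted by the class counts of the first argument. A small sketch on hypothetical toy labels:

```python
from sklearn.metrics import f1_score

# Hypothetical toy labels: three true 0s and one true 1
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]

# Correct order: per-class F1 is weighted by the support in y_true
correct = f1_score(y_true, y_pred, average="weighted")
# Swapped order: the weights now come from the predictions instead
swapped = f1_score(y_pred, y_true, average="weighted")

print(round(correct, 4), round(swapped, 4))
```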

In [None]:
# Predict on the validation set
y_val_pred = clf_rnd.predict(X_val)

In [None]:
print("F1 Score Validation Set:", f1_score(y_val, y_val_pred, average='weighted'))

In [None]:
# Predict on the test set
y_test_pred = clf_rnd.predict(X_test)

In [None]:
print("F1 Score Test Set:", f1_score(y_test, y_test_pred, average='weighted'))