In [1]:
import numpy as np
import pandas as pd

DATA = '/kaggle/input/exoplanet-dataset-scraped-from-science-nasa-gov/exoplanet_data.csv'

df = pd.read_csv(filepath_or_buffer=DATA).replace('Unknown', np.nan).dropna()
df[['Light_Years_From_Earth', 'Stellar_Magnitude']] = df[['Light_Years_From_Earth', 'Stellar_Magnitude']].astype(float)
# Planet_Mass is a string such as '2.59 Earths' or '4.992 Jupiters'; convert it to a single
# float in Earth masses (one Jupiter mass is taken to be 317.906 Earth masses here)
df['mass'] = df['Planet_Mass'].apply(func=lambda x: float(x.split()[0]) * (1.0 if x.split()[1] == 'Earths' else 317.906))
df['log mass'] = np.log(df['mass'])
df['Kepler'] = df['all_planet_names'].str.startswith('Kepler')
df.head()

Unnamed: 0,all_planet_names,Light_Years_From_Earth,Planet_Mass,Stellar_Magnitude,Discovery_Date,mass,log mass,Kepler
0,TOI-4559 b,31.6354,2.59 Earths,13.115,2024,2.59,0.951658,False
1,TOI-1739 b,70.9819,3.52 Earths,10.692,2024,3.52,1.258461,False
2,Kepler-1660 AB b,1188.54,4.992 Jupiters,13.595,2023,1586.986752,7.369592,True
4,TIC 139270665 c,189.875,4.89 Jupiters,10.385,2024,1554.56034,7.348948,False
5,TIC 139270665 b,189.875,0.463 Jupiters,10.385,2024,147.190478,4.991728,False
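
The one-line lambda in the cell above treats every unit other than 'Earths' as Jupiters. A slightly more defensive version might look like the sketch below. The helper name `parse_mass_in_earths` and the table `UNIT_FACTORS` are hypothetical, and accepting the singular spellings 'Earth' and 'Jupiter' is an assumption about the raw data, not something confirmed from the CSV.

```python
JUPITER_IN_EARTH_MASSES = 317.906  # conversion factor used in the notebook's first cell

# Hypothetical lookup table; the singular spellings are an assumption about the data.
UNIT_FACTORS = {
    'earth': 1.0,
    'earths': 1.0,
    'jupiter': JUPITER_IN_EARTH_MASSES,
    'jupiters': JUPITER_IN_EARTH_MASSES,
}


def parse_mass_in_earths(text):
    """Convert a string such as '2.59 Earths' to a float in Earth masses."""
    # Split into the numeric part and the unit part.
    value, unit = text.split(maxsplit=1)
    try:
        factor = UNIT_FACTORS[unit.strip().lower()]
    except KeyError:
        # Fail loudly instead of silently treating an unknown unit as Jupiters.
        raise ValueError(f'unrecognized mass unit in {text!r}') from None
    return float(value) * factor


print(parse_mass_in_earths('2.59 Earths'))
print(parse_mass_in_earths('4.992 Jupiters'))
```

With a helper like this, the column could be built with `df['Planet_Mass'].apply(parse_mass_in_earths)`, and an unexpected unit would raise an error rather than produce a wrong mass.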


In [2]:
df.shape

(5372, 8)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5372 entries, 0 to 5625
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   all_planet_names        5372 non-null   object 
 1   Light_Years_From_Earth  5372 non-null   float64
 2   Planet_Mass             5372 non-null   object 
 3   Stellar_Magnitude       5372 non-null   float64
 4   Discovery_Date          5372 non-null   int64  
 5   mass                    5372 non-null   float64
 6   log mass                5372 non-null   float64
 7   Kepler                  5372 non-null   bool   
dtypes: bool(1), float64(4), int64(1), object(2)
memory usage: 341.0+ KB


We eventually want to plot the stellar magnitude against the distance from Earth, treating the magnitude as if it were a function of the distance, which is not really true. First, though, let's look at the planet masses in sorted order.

In [4]:
from plotly import express

express.scatter(y=sorted(df['mass'].values))

The largest masses are so extreme that they make almost everything else look like it is the same size; let's look at the log of the mass instead.
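
As a rough illustration of why the log helps, the sketch below uses only the five masses shown in the `df.head()` output above and compares their spread on the linear scale with their spread on the natural-log scale (the same log that `np.log` computes). It is a toy calculation on five rows, not a summary of the full dataset.

```python
import math

# Masses in Earth masses, copied from the five rows shown in the df.head() output.
masses = [2.59, 3.52, 1586.986752, 1554.56034, 147.190478]
log_masses = [math.log(m) for m in masses]

# On the linear scale the largest value dwarfs the smallest one.
linear_ratio = max(masses) / min(masses)

# On the log scale the same values span only a few units.
log_range = max(log_masses) - min(log_masses)

print(f'largest / smallest mass: {linear_ratio:.1f}')
print(f'range of log mass: {log_range:.3f}')
```

The range of the log values is simply the log of the ratio, so a spread of several hundred on the linear axis becomes a spread of a few units on the log axis.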

Without looking at the data we might expect the sorted log mass curve to look logistic, but instead we see a left half that looks logistic and a right half that does not. One possible explanation is that our data was collected using multiple methods, and different methods might find exoplanets of different sizes.

But about half our data is Kepler data, and both the Kepler and non-Kepler data have the same broad shape, so maybe the kink in the curve is a feature of the actual phenomenon rather than an artifact of the collection method.

In [5]:
import warnings
from plotly import express

warnings.filterwarnings(action='ignore', category=FutureWarning)
express.scatter(data_frame=df.sort_values(by=['log mass'], ascending=True, ignore_index=True), y='log mass', hover_name='all_planet_names', facet_col='Kepler', height=800)
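
One way to back up the visual comparison with numbers is to tabulate a few quantiles of `log mass` for each group. The sketch below is a minimal example under stated assumptions: the helper name `log_mass_quantiles` is hypothetical, and the small `toy` frame is made-up data used only to show the call, not values from the exoplanet dataset.

```python
import pandas as pd


def log_mass_quantiles(frame, probs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Return one row per Kepler flag and one column per quantile of 'log mass'."""
    return frame.groupby('Kepler')['log mass'].quantile(list(probs)).unstack()


# Made-up values for illustration only; replace `toy` with the notebook's `df`.
toy = pd.DataFrame({
    'Kepler': [True, True, True, False, False, False],
    'log mass': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
})

summary = log_mass_quantiles(toy)
print(summary)
```

Calling `log_mass_quantiles(df)` on the real frame would give two rows to compare side by side; similar quantiles in both rows would support the idea that the two groups share the same broad shape.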

If we plot the stellar magnitude against the distance and color by the log of the mass, we get an interesting plot, but a few stellar magnitude outliers compress the interesting part of it.

In [6]:
from plotly import express

express.scatter(data_frame=df, x='Light_Years_From_Earth', y='Stellar_Magnitude', log_x=True, hover_name='all_planet_names', color='log mass', height=800)

Let's instead remove a few stellar magnitude outliers (magnitudes of 20 or more), plot the log of the mass against the distance, and color by the stellar magnitude.

In [7]:
from plotly import express

express.scatter(data_frame=df[df['Stellar_Magnitude'] < 20], x='Light_Years_From_Earth', color='Stellar_Magnitude', log_x=True, hover_name='all_planet_names', y='log mass', height=800)

What do we see? The data is dominated by smaller objects (low log mass) in a specific distance range (200 to 2,000 light-years away), orbiting brighter stars. This is not surprising given that the primary method for identifying exoplanets involves detecting their transit across their star as seen by the observer.