In [1]:
import pandas as pd

EXO = '/kaggle/input/nasaexoplanets/cleaned_5250.csv'

df = pd.read_csv(filepath_or_buffer=EXO).dropna(subset = ['mass_multiplier', 'mass_wrt'])

df.head()


,name,distance,stellar_magnitude,planet_type,discovery_year,mass_multiplier,mass_wrt,radius_multiplier,radius_wrt,orbital_radius,orbital_period,eccentricity,detection_method
0,11 Comae Berenices b,304.0,4.72307,Gas Giant,2007,19.4,Jupiter,1.08,Jupiter,1.29,0.892539,0.23,Radial Velocity
1,11 Ursae Minoris b,409.0,5.013,Gas Giant,2009,14.74,Jupiter,1.09,Jupiter,1.53,1.4,0.08,Radial Velocity
2,14 Andromedae b,246.0,5.23133,Gas Giant,2008,4.8,Jupiter,1.15,Jupiter,0.83,0.508693,0.0,Radial Velocity
3,14 Herculis b,58.0,6.61935,Gas Giant,2002,8.13881,Jupiter,1.12,Jupiter,2.773069,4.8,0.37,Radial Velocity
4,16 Cygni B b,69.0,6.215,Gas Giant,1996,1.78,Jupiter,1.2,Jupiter,1.66,2.2,0.68,Radial Velocity


Let's add two columns: one that gives the mass as a multiple of the mass of the Earth and one that gives the radius as a multiple of the radius of the Earth. For the mass we use a factor of 317.8 Earth masses per Jupiter mass. For the radius we use a factor of 11.2, which is the ratio of the equatorial radii of the two planets.

In [2]:
df['mass_earth'] = df.apply(axis='columns', func=lambda x: x['mass_multiplier'] if x['mass_wrt'] == 'Earth' else 317.8 * x['mass_multiplier'])
df['radius_earth'] = df.apply(axis='columns', func=lambda x: x['radius_multiplier'] if x['radius_wrt'] == 'Earth' else 11.2 * x['radius_multiplier'])
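
The row-wise `apply` above works, but the same conversion can be written as a vectorized column operation. Here is a minimal sketch on a tiny made-up frame; the `toy` frame, the helper name `to_earth_units`, and the constant names are ours for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

JUPITER_MASS_IN_EARTHS = 317.8    # same factor as in the cell above
JUPITER_RADIUS_IN_EARTHS = 11.2   # same factor as in the cell above

# Made-up rows that reuse the dataset's column names.
toy = pd.DataFrame({
    'mass_multiplier': [19.4, 2.0],
    'mass_wrt': ['Jupiter', 'Earth'],
    'radius_multiplier': [1.08, 1.5],
    'radius_wrt': ['Jupiter', 'Earth'],
})

def to_earth_units(multiplier, reference, jupiter_factor):
    """Keep rows referenced to Earth as-is; scale every other row by jupiter_factor."""
    values = np.where(reference == 'Earth', multiplier, jupiter_factor * multiplier)
    return pd.Series(values, index=multiplier.index)

toy['mass_earth'] = to_earth_units(toy['mass_multiplier'], toy['mass_wrt'], JUPITER_MASS_IN_EARTHS)
toy['radius_earth'] = to_earth_units(toy['radius_multiplier'], toy['radius_wrt'], JUPITER_RADIUS_IN_EARTHS)
print(toy[['mass_earth', 'radius_earth']])
```

Like the lambda version, this treats any reference other than 'Earth' (including a missing one) as Jupiter, so a missing multiplier still comes out as NaN.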


In [3]:
df.shape

(5227, 15)

This is a nice dataset; there are 5500-6000 identified exoplanets, so our 5227 rows cover somewhere between about 87% and 95% of them.
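
The coverage figure depends on which end of that range we take. A quick check, using the row count from `df.shape` above (the variable names are ours):

```python
n_rows = 5227                     # rows in our frame, from df.shape
known_low, known_high = 5500, 6000  # range of identified exoplanets quoted above

coverage_best = n_rows / known_low    # if only 5500 planets are known
coverage_worst = n_rows / known_high  # if 6000 planets are known
print(f'{coverage_worst:.1%} to {coverage_best:.1%}')
```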

In [4]:
df.nunique()

name                 5227
distance             2530
stellar_magnitude    2863
planet_type             4
discovery_year         31
mass_multiplier      1721
mass_wrt                2
radius_multiplier    1028
radius_wrt              2
orbital_radius       2818
orbital_period       1276
eccentricity          178
detection_method       11
mass_earth           2054
radius_earth         1223
dtype: int64

First of all, what are the planet types?

In [5]:
import warnings
from plotly import express

warnings.filterwarnings(action='ignore', category=FutureWarning)

express.pie(data_frame=df['planet_type'].value_counts().to_frame().reset_index(), names='planet_type',  values='count', color='planet_type').show()

Roughly a third of our sample is either terrestrial or super-Earth type; these are the only planets here that might be habitable.
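
To put a number on that share instead of reading it off the pie chart, we can normalize the value counts. This is a sketch on made-up rows; the labels 'Super Earth' and 'Neptune-like' are assumed spellings, so check them against `df['planet_type'].unique()` before running it on the real frame.

```python
import pandas as pd

# Made-up rows; only the column name matches the dataset.
toy = pd.DataFrame({'planet_type': [
    'Gas Giant', 'Neptune-like', 'Super Earth',
    'Terrestrial', 'Gas Giant', 'Super Earth',
]})

# Fraction of rows per planet type.
shares = toy['planet_type'].value_counts(normalize=True)

# Sum the two rocky types; reindex guards against a label being absent.
rocky_share = shares.reindex(['Terrestrial', 'Super Earth'], fill_value=0.0).sum()
print(f'{rocky_share:.3f} of the toy sample is terrestrial or super-Earth type')
```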

Second, what method was used to detect these planets?

In [6]:
express.histogram(data_frame=df, x='detection_method', color='detection_method').show()

In [7]:
express.pie(data_frame=df['detection_method'].value_counts().to_frame().reset_index(), names='detection_method', values='count')

Almost all of our exoplanets were identified using the transit method or radial velocity. Now let's try to tie the two sets of categorical data together: do we see any strong correspondence between detection method and the planet type? We would expect not, because two detection methods dominate the dataset and because one method has been used to find planets of all four types.

In [8]:
from plotly import express

express.parallel_categories(data_frame=df, dimensions=['detection_method', 'planet_type'], height=600)

This is kind of interesting: almost all of the terrestrial planets were found using the transit method.
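
The same relationship can be read as numbers with a cross-tabulation, normalized so that each planet type's column sums to 1. A small sketch on made-up rows follows; the label 'Transit' is an assumed spelling of the transit method in `detection_method`.

```python
import pandas as pd

# Made-up rows that reuse the dataset's column names.
toy = pd.DataFrame({
    'detection_method': ['Transit', 'Transit', 'Radial Velocity', 'Transit', 'Radial Velocity'],
    'planet_type': ['Terrestrial', 'Terrestrial', 'Gas Giant', 'Gas Giant', 'Gas Giant'],
})

# For each planet type, the fraction found by each detection method.
shares = pd.crosstab(toy['detection_method'], toy['planet_type'], normalize='columns')
print(shares.round(3))
```

On the real frame, replacing `toy` with `df` would show what fraction of the terrestrial planets came from each method.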

Let's make some histograms of our numerical data just as an exploratory analysis; some of our distributions are so spiky we will need to use a logarithmic plot in the count direction to see some of the smaller bins at all.

In [9]:
from plotly import express

# Histogram every numeric column; the spikiest ones get a logarithmic count axis.
log_columns = {'distance', 'mass_multiplier', 'orbital_radius', 'orbital_period', 'eccentricity', 'mass_earth'}
for column in df.select_dtypes(include=['int64', 'float64']).columns:
    express.histogram(data_frame=df, x=column, log_y=column in log_columns).show()

Can we make any scatter plots that show us anything interesting?

Our mass and radius data are not all expressed in the same units, so we need to split according to the reference planet (either Earth or Jupiter) to get sensible mass/radius scatter plots. Our four known planet types should mostly distinguish themselves according to their density, so we expect to see some clustering by planet type.

In [10]:
from plotly import express

express.scatter(data_frame=df.dropna(subset=['mass_wrt', 'radius_wrt']), x='mass_multiplier', y='radius_multiplier', color='planet_type', facet_col='mass_wrt', facet_row='radius_wrt',
               log_x=True, log_y=True, height=800, hover_name='name').show()
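
Since we expect the planet types to separate by density, it may help to compute a rough bulk density relative to Earth from the `mass_earth` and `radius_earth` columns we added earlier: mass divided by radius cubed, both in Earth units. This is a sketch on made-up rows, and the column name `density_earth` is ours.

```python
import pandas as pd

# Made-up rows: an Earth twin and a Jupiter twin in Earth units.
toy = pd.DataFrame({
    'name': ['earth-like', 'jupiter-like'],
    'mass_earth': [1.0, 317.8],
    'radius_earth': [1.0, 11.2],
})

# Volume scales with radius cubed, so density relative to Earth is M / R**3.
toy['density_earth'] = toy['mass_earth'] / toy['radius_earth'] ** 3
print(toy)
```

The Jupiter twin comes out at a little under a quarter of Earth's density, which is the kind of gap that should separate gas giants from rocky planets.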

The gas giants really cluster in these plots. Let's make a big plot of everything using the mass and radius we calculated above.

In [11]:
from plotly import express

express.scatter(data_frame=df, x='mass_earth', y='radius_earth', color='planet_type', log_x=True, log_y=True, height=800, hover_name='name',).show()

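There is an odd artifact in this data: many of the points line up along three line segments. One plausible explanation, which we have not verified here, is that some of these masses or radii were estimated from a piecewise mass-radius relation rather than measured independently.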

Let's trim our data down to planets that are potentially habitable based on the data in this dataset. How many of our planets are not gas giants?

In [12]:
print('{} planets of a total of {} or {:5.3f} are not gas giants.'.format(len(df[df['planet_type'] != 'Gas Giant']), len(df), len(df[df['planet_type'] != 'Gas Giant'])/len(df)))

3615 planets of a total of 5227 or 0.692 are not gas giants.


In [13]:
from plotly import express

express.scatter(data_frame=df[df['planet_type'] != 'Gas Giant'], x='mass_earth', y='radius_earth', color='planet_type', log_x=True, log_y=True, height=800, hover_name='name',).show()

In [14]:
from plotly import express

express.scatter(data_frame=df, x='orbital_period', y='radius_earth', hover_name='name', color='detection_method', height=800, log_x=True, log_y=True)

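Planets that are similar to Earth have a roughly year-long orbit and roughly the radius of Earth. How many planets in our data meet those criteria? We use generous bounds: an orbital period between 1/3 and 3 years and a radius between 1/10 and 3 Earth radii.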

In [15]:
orbit_criterion = (df['orbital_period'] < 3) & (df['orbital_period'] > 1/3)
radius_criterion = (df['radius_earth'] > 1/10) & (df['radius_earth'] < 3)
express.scatter(data_frame=df[orbit_criterion & radius_criterion], x='orbital_period', y='radius_earth', log_x=True, log_y=True, color='detection_method')

In [16]:
express.scatter(data_frame=df[orbit_criterion & radius_criterion], x='orbital_period', y='radius_earth', log_x=True, log_y=True, color='planet_type', hover_name='name')

Very few exoplanets in our data meet these criteria, even when we relax them somewhat. Also, most of our detection methods are not represented in this sample.
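
The scatter plots show the matches but not the count. A sketch of how to count them, and to see which detection methods are represented, using the same thresholds on made-up rows (on the real frame the two masks are `orbit_criterion` and `radius_criterion` from above):

```python
import pandas as pd

# Made-up rows that reuse the dataset's column names.
toy = pd.DataFrame({
    'orbital_period': [1.0, 0.01, 2.5, 10.0],
    'radius_earth': [1.0, 1.2, 5.0, 0.5],
    'detection_method': ['Transit', 'Transit', 'Radial Velocity', 'Imaging'],
})

# Same bounds as the cell above, with strict inequalities.
orbit_ok = (toy['orbital_period'] < 3) & (toy['orbital_period'] > 1/3)
radius_ok = (toy['radius_earth'] > 1/10) & (toy['radius_earth'] < 3)
matches = toy[orbit_ok & radius_ok]

n_matches = len(matches)
methods_present = matches['detection_method'].value_counts()
print(f'{n_matches} of {len(toy)} toy planets meet both criteria')
print(methods_present)
```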

What do we see if we just plot the terrestrial planets on these axes?

In [17]:
from plotly import express

express.scatter(data_frame=df[df['planet_type'] == 'Terrestrial'], x='orbital_period', y='radius_earth', log_x=True, log_y=True, hover_name='name', color='detection_method')

Most of the terrestrial planets orbit their hosts far too quickly to be anything similar to Earth.
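
To back that up with a number, we can compute the share of terrestrial planets whose period falls below our lower bound of 1/3, along with the median period in days. This is a sketch on made-up rows; it assumes `orbital_period` is in years, as the criteria above do, and the variable names are ours.

```python
import pandas as pd

DAYS_PER_YEAR = 365.25  # assumption: orbital_period is expressed in years

# Made-up rows that reuse the dataset's column names.
toy = pd.DataFrame({
    'planet_type': ['Terrestrial', 'Terrestrial', 'Terrestrial', 'Gas Giant'],
    'orbital_period': [0.01, 0.02, 1.0, 4.8],
})

terrestrial = toy[toy['planet_type'] == 'Terrestrial']

# Share of terrestrial planets orbiting faster than our lower bound.
share_fast = (terrestrial['orbital_period'] < 1/3).mean()

# Median period converted to days for easier reading.
median_days = terrestrial['orbital_period'].median() * DAYS_PER_YEAR
print(f'{share_fast:.3f} of toy terrestrial planets have a period under 1/3')
print(f'median period: {median_days:.1f} days')
```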