In [1]:
import pandas as pd

DATA = '/kaggle/input/phl-exoplanet-catalog/phl_exoplanet_catalog_2019.csv'

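# Load the catalog and parse the last-updated column as datetimes.
df = pd.read_csv(filepath_or_buffer=DATA, parse_dates=['P_UPDATED'])
# Boolean flag derived from the 0/1 habitable-zone column.
df['habzone'] = df['P_HABZONE_CON'].map({0: False, 1: True})
# Short labels for the three P_HABITABLE codes: 0 -> 'A', 1 -> 'B', 2 -> 'C'.
df['habitable'] = df['P_HABITABLE'].map({0: 'A', 1: 'B', 2: 'C'})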
df.head()

Unnamed: 0,P_NAME,P_STATUS,P_MASS,P_MASS_ERROR_MIN,P_MASS_ERROR_MAX,P_RADIUS,P_RADIUS_ERROR_MIN,P_RADIUS_ERROR_MAX,P_YEAR,P_UPDATED,...,P_HABITABLE,P_ESI,S_CONSTELLATION,S_CONSTELLATION_ABR,S_CONSTELLATION_ENG,P_RADIUS_EST,P_MASS_EST,P_SEMI_MAJOR_AXIS_EST,habzone,habitable
0,11 Com b,3.0,6165.8633,-476.742,476.742,,,,2007,2014-05-14,...,0,0.083813,Coma Berenices,Com,Berenice's Hair,12.082709,6165.8633,1.29,False,A
1,11 UMi b,3.0,4684.7848,-794.57001,794.57001,,,,2009,2018-09-06,...,0,0.082414,Ursa Minor,UMi,Little Bear,12.229641,4684.7848,1.53,False,A
2,14 And b,3.0,1525.5744,,,,,,2008,2014-05-14,...,0,0.081917,Andromeda,And,Andromeda,12.848516,1525.5744,0.83,False,A
3,14 Her b,3.0,1481.0785,-47.6742,47.6742,,,,2002,2018-09-06,...,0,0.145241,Hercules,Her,Hercules,12.865261,1481.0785,2.93,False,A
4,16 Cyg B b,3.0,565.73385,-25.42624,25.42624,,,,1996,2018-09-06,...,0,0.368627,Cygnus,Cyg,Swan,13.421749,565.73385,1.66,True,A


In [2]:
df.shape

(4048, 114)

This is a relatively early dataset (from 2019), and many more exoplanets have been confirmed since then, but it gives us a lot of data per planet: 114 columns.

What percentage of planets are in their host's habitable zone?

In [3]:
import warnings
from plotly import express

warnings.filterwarnings(action='ignore', category=FutureWarning)

express.pie(data_frame=df['P_HABZONE_CON'].map({0:False, 1:True}).value_counts(normalize=False).to_frame().reset_index(), names='P_HABZONE_CON', values='count', color='P_HABZONE_CON')
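
The pie chart is built from raw counts, while the question asks for a percentage. A minimal sketch of computing the share directly, using a small made-up series in place of the catalog column (the `toy` data and the helper name `share_true` are illustrative assumptions, not part of the notebook):

```python
import pandas as pd


def share_true(flags: pd.Series) -> float:
    """Return the percentage of non-null entries in a 0/1 column that equal 1."""
    valid = flags.dropna()
    return 100.0 * (valid == 1).mean()


# Hypothetical toy data standing in for df['P_HABZONE_CON'].
toy = pd.Series([0, 0, 0, 1], name='P_HABZONE_CON')
toy_share = share_true(toy)
print(f"{toy_share:.1f}% in the habitable zone")
```

On the real frame the call would be `share_true(df['P_HABZONE_CON'])`; dropping nulls first means the percentage is taken over planets that actually have a flag.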

What percentage of planets are habitable?

In [4]:
express.pie(data_frame=df['P_HABITABLE'].value_counts(normalize=False).to_frame().reset_index(), names='P_HABITABLE', values='count', color='P_HABITABLE')

How do our planets look if we plot them in terms of their orbital period and radius? Do the planets in their habitable zone cluster in this plot?

In [5]:
express.scatter(data_frame=df, x='P_PERIOD', y='P_RADIUS', log_x=True, log_y=True, color='habzone', height=800, hover_name='P_NAME', hover_data=['S_NAME']).update_traces(marker={'opacity': 0.5})

They don't cluster exactly, but they do appear along a frontier of sorts.

The markers are semi-transparent so that overlapping points stay visible, whether we color them by habitable-zone membership (above) or by habitability class (below).

In [6]:
express.scatter(data_frame=df, x='P_PERIOD', y='P_RADIUS', log_x=True, log_y=True, color='habitable', height=800, hover_name='P_NAME', hover_data=['S_NAME']).update_traces(marker={'opacity': 0.7})

Similarly, potentially habitable planets appear mostly along the longer-period, medium-radius frontier.

How many planets are in their habitable zone, habitable, or both?

In [7]:
is_habitable = df['habitable'].isin({'B', 'C'})
is_habitable.sum()

55

In [8]:
in_habitable_zone = df['habzone']
in_habitable_zone.sum()

138

In [9]:
df[is_habitable | in_habitable_zone].shape

(164, 114)
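
The three counts above also fix the size of the overlap by inclusion-exclusion: the number in both groups equals the sum of the two group sizes minus the size of their union. A small check using only the numbers printed above (the variable names are illustrative):

```python
# Counts copied from the outputs of cells 2, 7, 8 and 9.
n_total = 4048
n_habitable = 55
n_in_zone = 138
n_either = 164

# Inclusion-exclusion: planets that are both flagged habitable
# and in their habitable zone.
n_both = n_habitable + n_in_zone - n_either
print(n_both)  # 29

# Share of the whole catalog that falls in either group.
pct_either = 100 * n_either / n_total
print(f"{pct_either:.2f}%")
```

On the real frame, `(is_habitable & in_habitable_zone).sum()` should give the same overlap, assuming both masks are plain boolean series.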

Let's try looking at these planets in different ways.

In [10]:
express.scatter(data_frame=df[is_habitable | in_habitable_zone], x='P_MASS', y='P_RADIUS', log_x=True, log_y=True, hover_name='P_NAME')

Unfortunately, although this sample contains more than 150 exoplanets, only a few of them have both a measured mass and a measured radius.
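
To put a number on how many points survive, one can count the rows where both columns are non-null. A minimal sketch on made-up rows (the `toy` frame and the helper `count_complete` are hypothetical; only the column names come from the catalog):

```python
import numpy as np
import pandas as pd


def count_complete(frame: pd.DataFrame, columns: list[str]) -> int:
    """Count rows where every listed column is non-null."""
    return int(frame[columns].notna().all(axis=1).sum())


# Hypothetical rows standing in for df[is_habitable | in_habitable_zone].
toy = pd.DataFrame({
    'P_NAME': ['a', 'b', 'c', 'd'],
    'P_MASS': [1.2, np.nan, 5.0, np.nan],
    'P_RADIUS': [1.1, 1.5, np.nan, np.nan],
})
n_complete = count_complete(toy, ['P_MASS', 'P_RADIUS'])
print(n_complete)
```

Applied to the real subset, `count_complete(df[is_habitable | in_habitable_zone], ['P_MASS', 'P_RADIUS'])` would report how many planets can appear in the scatter plot at all.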

In [11]:
express.scatter(data_frame=df[is_habitable | in_habitable_zone], x='P_MASS_EST', y='P_RADIUS_EST', log_x=True, log_y=False, hover_name='P_NAME')

Mass and radius estimates are available for many more of these planets, but the fact that they line up so neatly on a single curve is suspicious. It doesn't mean the estimates are nonsense; it suggests that one value was probably derived from the other through a mass-radius relation, so they are likely low-accuracy.
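
One way to probe this is to separate rows where the `_EST` column repeats a measurement from rows where it fills a gap. In the `df.head()` output above, `P_MASS_EST` matches `P_MASS`, while `P_RADIUS_EST` is populated where `P_RADIUS` is missing. A minimal sketch on made-up rows (the helper `estimate_source` and the `toy` values are hypothetical):

```python
import numpy as np
import pandas as pd


def estimate_source(frame: pd.DataFrame, measured: str, estimated: str) -> pd.Series:
    """Label each row 'measured', 'estimated only', or 'missing' for one quantity."""
    labels = pd.Series('missing', index=frame.index, name=f'{measured}_source')
    # An estimate exists; overwritten below when a measurement also exists.
    labels.loc[frame[estimated].notna()] = 'estimated only'
    labels.loc[frame[measured].notna()] = 'measured'
    return labels


# Hypothetical rows: one measured radius, one estimate-only, one with neither.
toy = pd.DataFrame({
    'P_RADIUS': [1.1, np.nan, np.nan],
    'P_RADIUS_EST': [1.1, 12.08, np.nan],
})
sources = estimate_source(toy, 'P_RADIUS', 'P_RADIUS_EST')
print(sources.tolist())
```

Assigning the result to a new column and passing it as `color` to the scatter plot would show whether the neat curve is made up mostly of estimate-only points.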

In [12]:
express.scatter(data_frame=df[is_habitable | in_habitable_zone], x='P_MASS_EST', y='P_RADIUS_EST', log_x=True, log_y=False, hover_name='P_NAME', color='P_DETECTION')

Unsurprisingly, different detection methods tend to find exoplanets in different mass/radius ranges.

In [13]:
express.scatter(data_frame=df[(is_habitable | in_habitable_zone) & df['P_TEMP_MEASURED'].notnull()], x='P_MASS_EST', y='P_RADIUS_EST', log_x=True, log_y=False, hover_name='P_NAME', color='P_TEMP_MEASURED')

We have essentially no temperature measurements for these exoplanets: only three have a value in P_TEMP_MEASURED, and all three are zero.
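
A quick way to confirm this kind of observation is to count the non-null values and the zeros in the column. The sketch below uses a made-up series and additionally assumes that a zero is a placeholder rather than a real measurement, which the notebook does not establish:

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for P_TEMP_MEASURED in the habitable subset.
toy = pd.Series([0.0, np.nan, 0.0, np.nan, 0.0], name='P_TEMP_MEASURED')

n_present = int(toy.notna().sum())  # rows with any value at all
n_zero = int((toy == 0).sum())      # rows whose value is exactly zero

# Assumption: treat zeros as missing data, not as physical temperatures.
cleaned = toy.replace(0, np.nan)
n_after = int(cleaned.notna().sum())

print(n_present, n_zero, n_after)
```

If every present value is zero, as in the toy series, nothing usable remains after the replacement, which matches the reading that this column carries no information for the subset.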

In [14]:
express.scatter(data_frame=df[is_habitable | in_habitable_zone], x='P_MASS_EST', y='P_RADIUS_EST', log_x=True, log_y=False, hover_name='P_NAME', color='P_DISTANCE')

In [15]:
express.scatter(data_frame=df[is_habitable | in_habitable_zone], x='P_MASS_EST', y='P_RADIUS_EST', log_x=True, log_y=False, hover_name='P_NAME', color='S_METALLICITY')