In [1]:
import math
import pandas as pd
import warnings

DATA = '/kaggle/input/all-exoplanets-dataset/all_exoplanets_2021.csv'

warnings.filterwarnings(action='ignore', category=FutureWarning)
df = pd.read_csv(filepath_or_buffer=DATA, index_col=['No.'])
df['log distance'] = df['Distance'].apply(math.log)
df.head()

No.,Planet Name,Planet Host,Num Stars,Num Planets,Discovery Method,Discovery Year,Discovery Facility,Orbital Period Days,Orbit Semi-Major Axis,Mass,...,Spectral Type,Stellar Effective Temperature,Stellar Radius,Stellar Mass,Stellar Metallicity,Stellar Metallicity Ratio,Stellar Surface Gravity,Distance,Gaia Magnitude,log distance
1,11 Com b,11 Com,2,1,Radial Velocity,2007,Xinglong Station,326.03,1.29,6165.6,...,G8 III,4742.0,19.0,2.7,-0.35,[Fe/H],2.31,93.1846,4.44038,4.534582
2,11 UMi b,11 UMi,1,1,Radial Velocity,2009,Thueringer Landessternwarte Tautenburg,516.21997,1.53,4684.8142,...,,4213.0,29.79,2.78,-0.02,[Fe/H],1.93,125.321,4.56216,4.830878
3,14 And b,14 And,1,1,Radial Velocity,2008,Okayama Astrophysical Observatory,185.84,0.83,1525.5,...,K0 III,4813.0,11.0,2.2,-0.24,[Fe/H],2.63,75.4392,4.91781,4.323327
4,14 Her b,14 Her,1,2,Radial Velocity,2002,W. M. Keck Observatory,1773.40002,2.93,1481.0878,...,,5338.0,0.93,0.9,0.41,[Fe/H],4.45,17.9323,6.383,2.886604
5,16 Cyg B b,16 Cyg B,3,1,Radial Velocity,1996,Multiple Observatories,798.5,1.66,565.7374,...,,5750.0,1.13,1.08,0.06,[Fe/H],4.36,21.1397,6.06428,3.051153


Most star systems in which planets have been observed have a single observed star, so we expect single-star systems to dominate our dataset.

In [2]:
from plotly import express

express.histogram(data_frame=df, x='Num Stars')

Do we have a prior expectation about how many planets would typically orbit a star (that has planets)? Our own solar system has eight known planets, so we might expect the mean number of planets to be eight. But exoplanets are hard to observe, so we might expect the mean to be closer to one.

In [3]:
from plotly import express

express.histogram(data_frame=df, x='Num Planets')

In [4]:
df['Num Planets'].mean().round(2)

1.79

The mean is closer to two than to one. Is that surprising? Is it more likely a fact about stars or is it an artifact of the way this data is collected?
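One collection artifact is easy to check. The table has one row per planet, and a host's planet count appears to be repeated on each of its rows, so a mean over rows counts a host with n planets up to n times. The sketch below uses a small made-up table (the hosts `A`, `B`, `C` and their counts are hypothetical) to compare the per-row mean with the per-host mean.

```python
import pandas as pd

# Toy stand-in for the planet table: one row per planet, with the host's
# planet count repeated on every row (hypothetical values).
toy = pd.DataFrame({
    'Planet Host': ['A', 'B', 'C', 'C', 'C'],
    'Num Planets': [1, 1, 3, 3, 3],
})

# Mean over planet rows: a host with n planets is counted n times.
per_planet_mean = toy['Num Planets'].mean()

# Mean over hosts: keep one row per host before averaging.
per_host_mean = toy.drop_duplicates(subset='Planet Host')['Num Planets'].mean()

print(per_planet_mean, round(per_host_mean, 2))
```

Applying the same `drop_duplicates(subset='Planet Host')` step to `df` before taking the mean would show how much of the 1.79 comes from this weighting.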

In [5]:
from plotly import express

express.pie(data_frame=df['Discovery Method'].value_counts().to_frame().reset_index(), names='Discovery Method', values='count', color='Discovery Method')

Almost all of our data was collected using one of two methods: Transit or Radial Velocity.
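To put a number on "almost all", one option is to normalize the value counts. This is a minimal sketch on a made-up series of methods (the values are hypothetical, not taken from the dataset); on the real data the same calls would run on `df['Discovery Method']`.

```python
import pandas as pd

# Hypothetical sample of discovery methods, standing in for df['Discovery Method'].
methods = pd.Series(
    ['Transit', 'Transit', 'Transit', 'Radial Velocity', 'Imaging'],
    name='Discovery Method',
)

# Fraction of rows per method instead of raw counts.
shares = methods.value_counts(normalize=True)

# Combined share of the two dominant methods.
top_two_share = shares[['Transit', 'Radial Velocity']].sum()

print(shares)
print(round(top_two_share, 2))
```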

Let's briefly explore the relationship between discovery year and discovery method.

In [6]:
from plotly import express

express.parallel_categories(data_frame=df.sort_values(by='Discovery Year'), dimensions=['Discovery Method', 'Discovery Year',])

In [7]:
from plotly import express

express.histogram(data_frame=df[df['Discovery Method'].isin({'Transit', 'Radial Velocity'})], x='Discovery Year', color='Discovery Method')

Wow, 2016 was a big year for exoplanets discovered using the Transit method.
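The histogram can be backed by a table of counts per year and method. The sketch below builds one with `pd.crosstab` on a small made-up sample (the years and counts are hypothetical); on the real data the inputs would be `df['Discovery Year']` and `df['Discovery Method']`.

```python
import pandas as pd

# Hypothetical discoveries, standing in for the two columns of df.
toy = pd.DataFrame({
    'Discovery Year': [2014, 2014, 2016, 2016, 2016, 2016],
    'Discovery Method': ['Transit', 'Radial Velocity', 'Transit',
                         'Transit', 'Transit', 'Radial Velocity'],
})

# Rows are years, columns are methods, cells are discovery counts.
counts = pd.crosstab(toy['Discovery Year'], toy['Discovery Method'])

# Year with the most Transit discoveries in this toy sample.
peak_year = counts['Transit'].idxmax()

print(counts)
print(peak_year)
```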

In [8]:
from plotly import express

express.histogram(data_frame=df[~df['Discovery Method'].isin({'Transit', 'Radial Velocity'})], x='Discovery Year', color='Discovery Method')

Do we expect stellar mass and exoplanet mass to be correlated? 

In [9]:
from plotly import express

express.scatter(data_frame=df, x='Stellar Mass', y='Mass', log_x=True, log_y=True, color='Discovery Method', hover_name='Planet Name', height=700)

It really looks like different methods are better at finding planets of different masses.
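A quick numeric check of that impression is to compare a typical mass per method. The sketch below uses the median, which is less sensitive to a few very massive planets than the mean, on a made-up table (the masses are hypothetical); on the real data the same `groupby` would run on `df`.

```python
import pandas as pd

# Hypothetical planets; the missing mass mimics gaps in the real table.
toy = pd.DataFrame({
    'Discovery Method': ['Transit', 'Transit', 'Transit',
                         'Radial Velocity', 'Radial Velocity'],
    'Mass': [5.0, 10.0, None, 300.0, 900.0],
})

# Median mass per method; missing masses are skipped by default.
median_mass = toy.groupby('Discovery Method')['Mass'].median()

print(median_mass)
```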

In [10]:
from plotly import express

express.scatter(data_frame=df, x='Mass', y='Gaia Magnitude', log_y=True, log_x=True, hover_name='Planet Name', color='log distance', height=700)

In [11]:
from plotly import express

express.scatter(data_frame=df, x='Orbital Period Days', y='Mass', log_x=True, color='Discovery Method', log_y=True, height=800, hover_name='Planet Name')

This tells us that the Transit method mostly finds planets with relatively short orbital periods. This is not surprising: a transit candidate typically needs to be seen transiting about three times to be confirmed, so within a fixed observing window we would expect the Transit method to favor planets that orbit their host star quickly.
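The arithmetic behind this is simple: three transits span at least two full orbits, so an observing window of a given length can only catch three transits of planets whose period is at most half that window. The sketch below encodes this best case, where the first transit falls right at the start of the window. The function name and the 1460-day window are hypothetical, chosen only for illustration.

```python
def max_period_for_transits(baseline_days: float, n_transits: int = 3) -> float:
    """Longest orbital period (days) for which n_transits can fit in the window.

    Best case only: assumes the first transit happens at the very start of
    the observing window, so n transits span (n - 1) full periods.
    """
    if n_transits < 2:
        raise ValueError('need at least two transits to span a period')
    return baseline_days / (n_transits - 1)


# Hypothetical four-year observing window.
print(max_period_for_transits(1460.0))
```

With a four-year window, planets with periods longer than about two years could not show three transits, which is consistent with the short periods seen for Transit discoveries in the plot.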

Let's turn our attention to the host stars of these exoplanets.

In [12]:
stars_df = df[['Planet Host', 'Spectral Type', 'Stellar Effective Temperature', 'Stellar Radius', 'Stellar Mass', 'Stellar Metallicity', 'Stellar Metallicity Ratio', 'Stellar Surface Gravity',]].drop_duplicates(ignore_index=True)

In [13]:
stars_df.head()

,Planet Host,Spectral Type,Stellar Effective Temperature,Stellar Radius,Stellar Mass,Stellar Metallicity,Stellar Metallicity Ratio,Stellar Surface Gravity
0,11 Com,G8 III,4742.0,19.0,2.7,-0.35,[Fe/H],2.31
1,11 UMi,,4213.0,29.79,2.78,-0.02,[Fe/H],1.93
2,14 And,K0 III,4813.0,11.0,2.2,-0.24,[Fe/H],2.63
3,14 Her,,5338.0,0.93,0.9,0.41,[Fe/H],4.45
4,16 Cyg B,,5750.0,1.13,1.08,0.06,[Fe/H],4.36


We don't have the right data to make a Hertzsprung-Russell diagram, which plots stellar luminosity against temperature. How about a temperature vs. mass diagram?

In [14]:
from plotly import express

express.scatter(data_frame=stars_df, x='Stellar Effective Temperature', y='Stellar Mass', hover_name='Planet Host', log_x=True, log_y=True, height=800)

Our diagram bears some resemblance to an ideal H-R diagram, except that our temperature axis increases to the right, whereas an H-R diagram conventionally has temperature decreasing to the right. Very large and very small stars are also missing from our sample.
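To get closer to a true H-R diagram, a rough luminosity can be estimated from radius and temperature with the Stefan-Boltzmann relation, L/L_sun = (R/R_sun)^2 (T/T_sun)^4. The sketch below assumes that `Stellar Radius` is given in solar radii and `Stellar Effective Temperature` in kelvin, which the values in the table suggest but the notebook does not confirm. The helper name and the `Luminosity (est.)` column are hypothetical.

```python
import pandas as pd

T_SUN_K = 5772.0  # nominal solar effective temperature in kelvin


def estimate_luminosity(radius_rsun, teff_k):
    """Stefan-Boltzmann estimate of luminosity in solar units.

    Assumes radius is in solar radii and temperature is in kelvin.
    """
    return radius_rsun ** 2 * (teff_k / T_SUN_K) ** 4


# Two hosts copied from the stars_df preview above.
stars = pd.DataFrame({
    'Planet Host': ['11 Com', '16 Cyg B'],
    'Stellar Effective Temperature': [4742.0, 5750.0],
    'Stellar Radius': [19.0, 1.13],
})

stars['Luminosity (est.)'] = estimate_luminosity(
    stars['Stellar Radius'], stars['Stellar Effective Temperature']
)

print(stars)
```

Plotting this estimate against temperature, and flipping the x axis with Plotly's `fig.update_xaxes(autorange='reversed')`, would give a figure in the conventional H-R orientation.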