## Notebook for Data Processing

#### Labels for categories:
| Label & feature  | 1  | 2  | 3  | 4  | 5  | 6  | 7  |
|---|---|---|---|---|---|---|---|
| P. Zone Class  | Cold  | Warm  | Hot  |   |   |   |  
| P. Mass Class  | Jovian  | Neptunian | Superterran  | Terran  | Subterran  | Mercurian  |  
| P. Composition Class | gas  | water-gas |  rocky-water  | rocky-iron  | iron  |   |   
| P. Atmosphere Class  | hydrogen-rich  |  metals-rich | no-atmosphere  |   |   |   |   


The target label is P. Habitable Class. The network produces a three-entry output [c1, c2, c3], corresponding to [non-habitable, mesoplanet, psychroplanet], and we apply a softmax to it. An output such as [0.95, 0.03, 0.02] therefore predicts non-habitable.

The target labels then need to be one-hot vectors, as follows (I think):

non-habitable: [1, 0, 0]

mesoplanet: [0, 1, 0]

psychroplanet: [0, 0, 1]

These targets would work with MSE loss, because it only requires that the prediction $x$ has the same shape as the target $y$.
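To make the encoding concrete, here is a minimal sketch in plain Python of the one-hot targets, a softmax, and the MSE between a prediction and its target. The helper names (`CLASSES`, `one_hot`, `softmax`, `mse`) are illustrative assumptions and not part of the notebook; the prediction is the example vector quoted above.

```python
import math

# Class order follows the text: [non-habitable, mesoplanet, psychroplanet].
CLASSES = ["non-habitable", "mesoplanet", "psychroplanet"]

def one_hot(label):
    """Return the one-hot target vector for a habitable-class label."""
    return [1.0 if name == label else 0.0 for name in CLASSES]

def softmax(logits):
    """Numerically stable softmax over a list of raw network outputs."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mse(prediction, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(prediction)

prediction = [0.95, 0.03, 0.02]      # example softmax output from the text
target = one_hot("non-habitable")    # [1.0, 0.0, 0.0]
loss = mse(prediction, target)       # small, since the prediction is close to the target
predicted_class = CLASSES[prediction.index(max(prediction))]
```

The predicted class is read off as the index of the largest entry, so the same class order must be used for the targets and for the network output.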

In [1]:
# dictionaries to re-label entries, assigning a numeric value to text label

zone_class = {"Cold": 1, "Warm": 2, "Hot": 3}
mass_class = {"Jovian": 1, "Neptunian": 2, "Superterran": 3, "Terran": 4, "Subterran": 5, "Mercurian": 6}
composition_class = {"gas": 1, 'water-gas': 2, 'rocky-water': 3, 'rocky-iron': 4, 'iron': 5}
atmosphere_class = {'hydrogen-rich': 1, 'metals-rich': 2, 'no-atmosphere': 3}

Importing the CSV

In [2]:
import pandas as pd
import os

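# the project root is the parent of the current working directory
cwd = os.getcwd()
project_path = os.path.dirname(cwd)

# build the path from components so it does not depend on the OS path separator
df = pd.read_csv(os.path.join(project_path, 'data', 'unprocessed', 'PHL-EC-Case1.csv'))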

Re-label the text-based values with the corresponding numbers from the table above

In [5]:
# re-label zone class
for label, val in zone_class.items():
    df.loc[df["P. Zone Class"]==label, "P. Zone Class"] = val

# re-label mass class
for label, val in mass_class.items():
    df.loc[df["P. Mass Class"]==label, "P. Mass Class"] = val

# re-label composition class
for label, val in composition_class.items():
    df.loc[df["P. Composition Class"]==label, "P. Composition Class"] = val

# re-label atmosphere class
for label, val in atmosphere_class.items():
    df.loc[df["P. Atmosphere Class"]==label, "P. Atmosphere Class"] = val

df

Unnamed: 0,P. Name,P. Zone Class,P. Mass Class,P. Composition Class,P. Atmosphere Class,P. Habitable Class,P. Min Mass (EU),P. Mass (EU),P. Radius (EU),P. Density (EU),...,S. Size from Planet (deg),S. Hab Zone Min (AU),S. Hab Zone Max (AU),P. HZD,P. HZC,P. HZA,P. HZI,P. ESI,P. Habitable,Unnamed: 46
0,1RXS 1609 b,1,1,1,1,non-habitable,,4451.16,19.04,0.64,...,0.0022,0.540,1.362,800.07,23.51,85.62,0.00,0.05,0,
1,1SWASP J1407 b,1,1,1,1,non-habitable,6358.80,6358.80,10.94,4.86,...,0.1353,0.461,1.143,9.07,15.30,45.41,0.02,0.07,0,
2,2M 0103-55(AB) b,1,1,1,1,non-habitable,4133.22,4133.22,11.40,2.79,...,0.0024,0.136,0.347,793.67,12.57,107.44,0.00,0.06,0,
3,2M 0122-24 b,1,1,1,1,non-habitable,,6358.80,11.20,4.53,...,0.0039,0.136,0.347,490.45,15.72,119.46,0.00,0.08,0,
4,2M 0219-39 b,1,1,1,1,non-habitable,,4419.37,16.13,1.05,...,0.0009,0.062,0.165,3028.82,19.46,133.25,0.00,0.06,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3864,YBP1194 b,3,1,1,2,non-habitable,108.10,108.10,7.97,0.21,...,7.6013,0.743,1.751,-2.33,5.02,0.57,0.15,0.16,0,
3865,YBP1514 b,3,1,1,2,non-habitable,127.18,127.18,8.34,0.22,...,8.9947,0.658,1.552,-2.34,5.23,0.62,0.15,0.15,0,
3866,YZ Cet b,3,4,4,2,non-habitable,0.76,0.76,0.96,0.86,...,5.7526,0.039,0.102,-1.72,-0.17,-0.91,0.34,0.43,0,
3867,YZ Cet c,3,4,4,2,non-habitable,0.99,0.99,1.04,0.88,...,4.2882,0.039,0.102,-1.56,-0.17,-0.77,0.36,0.53,0,
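
The re-labelling loops could also be written with `Series.map`, which makes it easy to catch labels that are missing from a dictionary. The sketch below is an alternative, not the notebook's method. It runs on a small hypothetical frame (`toy`) rather than the real catalogue, and the `mappings` dictionary is an illustrative name.

```python
import pandas as pd

zone_class = {"Cold": 1, "Warm": 2, "Hot": 3}
mass_class = {"Jovian": 1, "Neptunian": 2, "Superterran": 3,
              "Terran": 4, "Subterran": 5, "Mercurian": 6}

# Hypothetical miniature frame standing in for the real data.
toy = pd.DataFrame({
    "P. Zone Class": ["Cold", "Hot", "Warm", None],
    "P. Mass Class": ["Jovian", "Terran", "Neptunian", "Jovian"],
})

# Column name -> label dictionary (illustrative grouping).
mappings = {"P. Zone Class": zone_class, "P. Mass Class": mass_class}

for column, mapping in mappings.items():
    # Labels absent from the mapping, and missing values, become NaN.
    encoded = toy[column].map(mapping)
    # A label is "unmapped" if it was present but has no dictionary entry.
    unmapped = toy.loc[encoded.isna() & toy[column].notna(), column].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped labels in {column!r}: {list(unmapped)}")
    # Nullable integer dtype keeps missing entries as <NA> instead of floats.
    toy[column] = encoded.astype("Int64")
```

Unlike the in-place `df.loc` assignment, this leaves the columns with an integer dtype instead of `object`, and raises an error on an unexpected label rather than silently leaving the text in place.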
