## Notebook for Data Processing

#### Labels for categories:
| Feature / Label | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| P. Zone Class | Cold | Warm | Hot |   |   |   |   |
| P. Mass Class | Jovian | Neptunian | Superterran | Terran | Subterran | Mercurian |   |
| P. Composition Class | gas | water-gas | rocky-water | rocky-iron | iron |   |   |
| P. Atmosphere Class | hydrogen-rich | metals-rich | no-atmosphere |   |   |   |   |


For P. Habitable Class, which is our target label, the network produces a three-entry output [c1, c2, c3] corresponding to [non-habitable, mesoplanet, psychroplanet]. We apply a softmax to it, so the network output is a set of class probabilities such as [0.95, 0.03, 0.02], which predicts non-habitable.
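
As a minimal sketch of the softmax step described above: the raw scores below are made-up values, and `softmax` and `class_names` are illustrative names rather than part of the actual network code.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# hypothetical raw network outputs [c1, c2, c3] for a single planet
logits = np.array([4.0, 0.5, 0.1])
probs = softmax(logits)

# the predicted class is the entry with the largest probability
class_names = ["non-habitable", "mesoplanet", "psychroplanet"]
predicted = class_names[int(np.argmax(probs))]
print(probs, predicted)
```

The probabilities always sum to 1, so the largest entry can be read directly as the predicted class.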

Our target labels then need to be one-hot vectors, as follows (I think):

non-habitable: [1, 0, 0]

mesoplanet: [0, 1, 0]

psychroplanet: [0, 0, 1]

This would work with MSE loss, because MSE only requires that the prediction $x$ has the same shape as the target $y$.
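
To illustrate the shape requirement, here is a small sketch that computes the mean squared error between a softmax-style prediction and a one-hot target. The helper `mse` and the second prediction are assumptions made for illustration; the actual training code may use a framework's built-in loss instead.

```python
import numpy as np

def mse(prediction, target):
    """Mean squared error; prediction and target must have the same shape."""
    prediction = np.asarray(prediction, dtype=float)
    target = np.asarray(target, dtype=float)
    if prediction.shape != target.shape:
        raise ValueError("prediction and target must have the same shape")
    return float(np.mean((prediction - target) ** 2))

target = [1, 0, 0]                    # non-habitable
confident = [0.95, 0.03, 0.02]        # example prediction from the text
wrong = [0.10, 0.80, 0.10]            # hypothetical prediction favouring mesoplanet

loss_confident = mse(confident, target)
loss_wrong = mse(wrong, target)
print(loss_confident, loss_wrong)
```

The prediction that agrees with the target yields the smaller loss, which is the behaviour the training objective relies on.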

In [26]:
# dictionaries to re-label entries, assigning a numeric value to text label

zone_class = {"Cold": 1, "Warm": 2, "Hot": 3}
mass_class = {"Jovian": 1, "Neptunian": 2, "Superterran": 3, "Terran": 4, "Subterran": 5, "Mercurian": 6}
composition_class = {"gas": 1, 'water-gas': 2, 'rocky-water': 3, 'rocky-iron': 4, 'iron': 5}
atmosphere_class = {'hydrogen-rich': 1, 'metals-rich': 2, 'no-atmosphere': 3}

Importing the CSV file

In [68]:
import pandas as pd
import os

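cwd = os.getcwd()
# the project root is the parent of the current working directory
project_path = os.path.dirname(cwd)
case_num = 1  # which feature case to load
# build the path from components so it works on any operating system
df = pd.read_csv(os.path.join(project_path, 'data', 'processed', f'PHL-EC-Case{case_num}.csv'))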


Re-label text-based values with the corresponding numbers from the table above

In [28]:
# # re-label zone class
# try: 
#     for label, val in zone_class.items():
#         df.loc[df["P. Zone Class"]==label, "P. Zone Class"] = val
# except KeyError as exception:
#     print(f'Excepted Key Error: {exception} - not included in this feature case')

# # re-label mass class
# try:
#     for label, val in mass_class.items():
#         df.loc[df["P. Mass Class"]==label, "P. Mass Class"] = val
# except KeyError as exception:
#     print(f'Excepted Key Error: {exception} - not included in this feature case')

# # re-label composition class
# try:
#     for label, val in composition_class.items():
#         df.loc[df["P. Composition Class"]==label, "P. Composition Class"] = val
# except KeyError as exception:
#     print(f'Excepted Key Error: {exception} - not included in this feature case')

# # re-label atmosphere class
# try:
#     for label, val in atmosphere_class.items():
#         df.loc[df["P. Atmosphere Class"]==label, "P. Atmosphere Class"] = val
# except KeyError as exception:
#     print(f'Excepted Key Error: {exception} - not included in this feature case')
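
The same re-labelling can be sketched more compactly with `Series.map`. The toy DataFrame and the `mappings` dictionary below are assumptions for illustration only, not the real PHL-EC data; a column that is missing from the frame is skipped rather than raising a `KeyError`.

```python
import pandas as pd

zone_class = {"Cold": 1, "Warm": 2, "Hot": 3}
mass_class = {"Jovian": 1, "Neptunian": 2, "Superterran": 3,
              "Terran": 4, "Subterran": 5, "Mercurian": 6}
atmosphere_class = {"hydrogen-rich": 1, "metals-rich": 2, "no-atmosphere": 3}

# toy stand-in for the real data (hypothetical values)
toy = pd.DataFrame({
    "P. Zone Class": ["Cold", "Hot", "Warm"],
    "P. Mass Class": ["Jovian", "Terran", "Mercurian"],
})

# column name -> label dictionary
mappings = {
    "P. Zone Class": zone_class,
    "P. Mass Class": mass_class,
    "P. Atmosphere Class": atmosphere_class,  # absent from the toy frame
}

for column, mapping in mappings.items():
    if column not in toy.columns:
        print(f"{column} - not included in this feature case")
        continue
    # replace each text label with its numeric value
    toy[column] = toy[column].map(mapping)

print(toy)
```

Note that `map` turns any text value missing from the dictionary into `NaN`, which makes unexpected labels easy to spot.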



## Bootstrapping 

The next step in data processing is bootstrap aggregation, where we resample the data set to produce an equal number of samples from non-habitable, meso-, and psychro-type planets, so that the model is trained equally on all three types.

We have 17 samples of hab_type 1 and 31 of hab_type 2, with an excess of hab_type 0 (non-habitable). The paper mentions 40 times upsampling, so we'll assume that means:
$40 \times 17 = 680$

That is 680 samples of each type, giving us a total aggregate dataset of $3 \times 680 = 2040$ samples.
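
The resampling arithmetic can be checked with a small sketch that draws 680 indices per class with replacement. The class sizes 17 and 31 come from the text; the size 200 for class 0, the four random feature columns, and the use of `np.random.default_rng` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(12345)  # seeded for reproducible draws
num_samples = 40 * 17               # 680 samples per class

# toy labels: 200 is a made-up count for class 0; 17 and 31 are from the text
labels = np.concatenate([np.zeros(200, int), np.ones(17, int), np.full(31, 2)])
features = rng.normal(size=(labels.size, 4))  # placeholder feature columns

resampled_idx = []
for cls in (0, 1, 2):
    cls_idx = np.flatnonzero(labels == cls)
    # sampling with replacement lets small classes be drawn many times over
    resampled_idx.append(rng.choice(cls_idx, size=num_samples, replace=True))
resampled_idx = np.concatenate(resampled_idx)

balanced_features = features[resampled_idx]
balanced_labels = labels[resampled_idx]
print(balanced_labels.shape, np.bincount(balanced_labels))
```

Every class ends up with exactly 680 rows, so the balanced set has 2040 rows in total.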

In [64]:
import numpy as np

seed = 12345 # seed to have consistent random samples
num_samples = 680 # number of samples of each type of planet

# split data into 3 dataframes based on habitability label
df_0 = df[df["hab_lbl"] == 0]
df_1 = df[df["hab_lbl"] == 1]
df_2 = df[df["hab_lbl"] == 2]

# generate random indices for selecting samples (with replacement)
np.random.seed(seed)
rand_num_0 = np.random.randint(0, df_0.shape[0], size = num_samples)
rand_num_1 = np.random.randint(0, df_1.shape[0], size = num_samples)
rand_num_2 = np.random.randint(0, df_2.shape[0], size = num_samples)

# convert to numpy arrays and sample
agg_0 = df_0.to_numpy()[rand_num_0]
agg_1 = df_1.to_numpy()[rand_num_1]
agg_2 = df_2.to_numpy()[rand_num_2]

# 80/20 train/test split: the first 80% of each class goes to train, the rest to test
split = int(num_samples * 0.8)
train = np.concatenate((agg_0[:split], agg_1[:split], agg_2[:split]))
test = np.concatenate((agg_0[split:], agg_1[split:], agg_2[split:]))

# separating inputs and targets (the target label is taken from column 0)
train_input, train_target1D = train[:, 1:], train[:, 0]
test_input, test_target1D = test[:, 1:], test[:, 0]

# one-hot encode the targets: 0 -> [1, 0, 0], 1 -> [0, 1, 0], 2 -> [0, 0, 1]
# start from zeros so no row is ever left with uninitialized values
train_target = np.zeros((len(train_target1D), 3), dtype=int)
train_target[train_target1D == 0] = [1, 0, 0]
train_target[train_target1D == 1] = [0, 1, 0]
train_target[train_target1D == 2] = [0, 0, 1]

test_target = np.zeros((len(test_target1D), 3), dtype=int)
test_target[test_target1D == 0] = [1, 0, 0]
test_target[test_target1D == 1] = [0, 1, 0]
test_target[test_target1D == 2] = [0, 0, 1]



(408, 3)
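
The one-hot conversion above can also be sketched by indexing into an identity matrix. The `labels_1d` array is a hypothetical example; labels taken from `DataFrame.to_numpy()` may come out as floats or objects, so the sketch casts them to integers first.

```python
import numpy as np

# hypothetical 1D habitability labels (0, 1 or 2)
labels_1d = np.array([0.0, 2.0, 1.0, 0.0])

# row k of the 3x3 identity matrix is the one-hot vector for class k
one_hot = np.eye(3, dtype=int)[labels_1d.astype(int)]

# argmax along each row recovers the original class labels
recovered = one_hot.argmax(axis=1)
print(one_hot)
```

This form fails loudly with an `IndexError` if a label falls outside 0 to 2, instead of silently leaving a row unset.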

In [35]:
arr = np.array([1,2,3,4])
arr2 = arr[[0, 0, 1, 2, 3, 2]]
arr2

array([1, 1, 2, 3, 4, 3])

In [65]:
test = f'case{1}'
test

'case1'