# AMLD 2021

##### Machine Learning in Science: Encoding physical constraints and good development practices


## Example 01 - Basic Reproducibility

In this notebook, we demonstrate some of the fundamental approaches to reproducibility. The models that we will train are simplified versions of the models used in a real astrophysics problem. Later in the workshop, we will improve these models so that they obey physical laws, such as the conservation of mass.

### Workshop Organizers
Dr. Maria Han Veiga (University of Michigan, USA)

Dr. Miles Timpe (University of Zurich, Switzerland)

### Import libraries

In [1]:
from joblib import dump, load
import numpy as np
import os
import pandas as pd
import random
import sklearn
#import tensorflow as tf

# We can define a global seed value to make our lives easier
seed = 42

# Set random seeds
random.seed(seed)
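# Note: this only affects subprocesses started from here. To fix string hashing
# in the running interpreter, PYTHONHASHSEED must be set before Python starts.
os.environ['PYTHONHASHSEED'] = str(seed)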
np.random.seed(seed)  # NumPy
#tf.random.set_seed(seed) # TensorFlow2
#tf.compat.v1.set_random_seed(seed)  # TensorFlow1


from sys import version
print(f"Python version: {version}")
#print(tf.__version__)

Python version: 3.7.9 (default, Aug 31 2020, 12:42:55) 
[GCC 7.3.0]
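
A quick way to confirm that the seeds take effect is to reseed and check that the same random numbers come out again. The following is a minimal sketch; the helper name `set_global_seed` is hypothetical and not part of the workshop code.

```python
import os
import random

import numpy as np


def set_global_seed(seed: int) -> None:
    """Seed Python's and NumPy's global random number generators (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    # Inherited by subprocesses only; the running interpreter's hash seed is fixed at startup
    os.environ["PYTHONHASHSEED"] = str(seed)


# Draw a few numbers, reseed, and draw again
set_global_seed(42)
first_py = [random.random() for _ in range(3)]
first_np = np.random.rand(3)

set_global_seed(42)
second_py = [random.random() for _ in range(3)]
second_np = np.random.rand(3)

print(first_py == second_py)                  # True
print(np.array_equal(first_np, second_np))    # True
```

If either comparison printed `False`, some code between the two draws would be consuming random numbers from a generator that was not reseeded.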


### Load training data

The datasets for this problem are relatively small by machine learning standards, so we provide them as CSV files. The train and test datasets together contain data on 14,856 pairwise collisions between planets. To keep things simple, we will focus on only three targets, which are subject to physical conservation laws: the mass of the largest remnant ("lr_mass"), the mass of the second largest remnant ("slr_mass"), and the mass of the collision debris ("debris_mass"). The masses of these three objects should sum exactly to the total mass that went into the collision ("mtotal").

In [2]:
target = 'lr_mass'  # alternatives: 'slr_mass', 'debris_mass'

features = ['collision_id', 'mtotal', 'gamma', 'b_inf', 'v_inf',
            'targ_core_fraction', 'targ_omega', 'targ_theta', 'targ_phi',
            'proj_core_fraction', 'proj_omega', 'proj_theta', 'proj_phi',
            target]

x_train = pd.read_csv('../datasets/train.csv', usecols=features)
x_test  = pd.read_csv('../datasets/test.csv', usecols=features)

y_train = x_train.pop(target)
y_test  = x_test.pop(target)

ids_train = list(x_train.pop('collision_id'))
ids_test  = list(x_test.pop('collision_id'))
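
The conservation constraint described above (the three remnant masses summing to "mtotal") can be checked directly on a table of collisions. The sketch below uses a small synthetic DataFrame with made-up values rather than the workshop CSV files, and it assumes all three target columns are available in the same frame; the function name `mass_residual` is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the collision data (values are made up for illustration).
# The last row deliberately violates mass conservation.
df = pd.DataFrame({
    "mtotal":      [1.00, 2.50, 0.80, 1.20],
    "lr_mass":     [0.90, 2.00, 0.50, 1.00],
    "slr_mass":    [0.05, 0.40, 0.25, 0.10],
    "debris_mass": [0.05, 0.10, 0.05, 0.00],
})


def mass_residual(frame: pd.DataFrame) -> pd.Series:
    """Relative difference between the summed remnant masses and the total mass."""
    summed = frame["lr_mass"] + frame["slr_mass"] + frame["debris_mass"]
    return (summed - frame["mtotal"]) / frame["mtotal"]


residual = mass_residual(df)

# Allow for floating-point round-off when testing for conservation
conserved = np.isclose(residual, 0.0, atol=1e-8)
print(conserved.tolist())
```

The same residual is a natural diagnostic for model predictions later on: a model that predicts each mass independently has no reason to produce a residual of zero.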

### Scale data and save the scaler

While most of the focus in machine learning is on the models, the scaling methods are an equally important component of reproducibility. Here, we use scikit-learn's StandardScaler. Anyone who wants to reproduce our results will need to know exactly how the data was scaled prior to training. Therefore, once we have fit the scaler, we save it so that it can be loaded at a later date.

In [3]:
from sklearn import preprocessing

# Use sklearn's StandardScaler
x_scaler = preprocessing.StandardScaler()

# Fit scaler to training data
x_scaler.fit(x_train)

# Save the fitted feature scaler
dump(x_scaler, f"x_scaler_{target}.joblib")

# Make sure to apply same scaler to train and test!
scaled_x_train = x_scaler.transform(x_train)
scaled_x_test  = x_scaler.transform(x_test)

scaled_x_train = pd.DataFrame(scaled_x_train, columns=x_train.columns)
scaled_x_test  = pd.DataFrame(scaled_x_test, columns=x_test.columns)
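
To illustrate why saving the scaler matters, the sketch below fits a StandardScaler, writes it to disk with joblib, loads it back, and confirms that the restored scaler produces the same transform. It uses synthetic data and a temporary directory; the file name `x_scaler_example.joblib` is an assumption chosen for this example.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix standing in for the training data
rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

scaler = StandardScaler().fit(x)

# Round-trip the fitted scaler through a file on disk
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "x_scaler_example.joblib")
    dump(scaler, path)
    restored = load(path)

# The restored scaler carries the same fitted mean and scale
same_transform = np.allclose(scaler.transform(x), restored.transform(x))
print(same_transform)  # True
```

Note that joblib files are pickles, so they are generally best loaded with the same scikit-learn version that wrote them; recording library versions alongside the saved scaler helps with that.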

In [4]:
# Scale target
y_scaler = preprocessing.StandardScaler()

y_scaler.fit(y_train.values.reshape(-1, 1))

scaled_y_train = y_scaler.transform(y_train.values.reshape(-1, 1))
scaled_y_test  = y_scaler.transform(y_test.values.reshape(-1, 1))

scaled_y_train = pd.Series(data=np.squeeze(scaled_y_train), name=target)
scaled_y_test  = pd.Series(data=np.squeeze(scaled_y_test), name=target)
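
Because the target is scaled, any model trained on it predicts in scaled units. The fitted target scaler is what maps predictions back to physical masses, as the following minimal sketch with made-up values shows. The "predictions" here are simply a copy of the scaled targets, an assumption made so that the round trip can be verified.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up target values standing in for remnant masses
y = np.array([0.5, 1.0, 1.5, 2.0])

# StandardScaler expects a 2D array, hence the reshape
y_scaler = StandardScaler()
scaled_y = y_scaler.fit_transform(y.reshape(-1, 1))

# Stand-in for model output in scaled units (here: a perfect prediction)
scaled_pred = scaled_y.copy()

# Map the predictions back to the original units
pred = y_scaler.inverse_transform(scaled_pred).ravel()

recovered = np.allclose(pred, y)
print(recovered)  # True
```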

In [5]:
# Add back IDs after scaling
scaled_x_train['collision_id'] = ids_train
scaled_x_test['collision_id']  = ids_test

# Save the target scaler
dump(y_scaler, f"y_scaler_{target}.joblib")

scaled_x_train.head()