<a href="https://colab.research.google.com/github/mhuertascompany/Saas-Fee/blob/main/hands-on/session1/Galaxy_Morphology_RF_ANN_SaasFee.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Galaxy Morphology with Feature-Based ML: Random Forests, ANNs

The goal of this tutorial series is to illustrate a very basic supervised binary classification using different ML algorithms based on manually defined features. We set up an ML algorithm to determine the visual morphological type of nearby galaxies from the Sloan Digital Sky Survey. The first deep learning papers in astronomy addressed this problem at low and high redshift (Dieleman+15, Huertas-Company+15).

![](https://drive.google.com/uc?id=1TaiRB1wxui4AKnhuF4iH4LJkmrlb-D6d)

The notebook first illustrates how to train a Random Forest using catalog parameters (stellar mass and color).

In the second notebook we use ANNs.

As the training set, we use the sample of ~14,000 visually classified galaxies from Nair & Abraham. For illustration purposes and to limit computing time, we use JPEG RGB images as input. However, the same methodology can be applied to FITS images.

---



#### Before we start, make sure to open this Colab notebook in "Playground mode" (top left) and change the runtime type to GPU: in the toolbar, click Runtime -> Change runtime type, then set Hardware accelerator to GPU.

---

In [None]:
import numpy as np
from astropy.io import fits
from astropy.table import Table
import os
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, precision_recall_curve, accuracy_score,auc

%pylab inline

## Mount Drive

Before mounting the drive, click on [this folder](https://drive.google.com/drive/folders/1PcftgBzBySo1Ync-Wdsp9arTCJ_MfEPE?usp=sharing) and add it to your Google Drive by following these steps:

*   Go to your drive 
*   Find shared folder ("Shared with me" link)
*   Right click it
*   Click Add to My Drive



In [None]:
from google.colab import drive
drive.mount('/content/drive')

---
#### The notebook is set up to illustrate 2 different classifications:


#### 1.   Early vs. Late: This is an easy example in which we only try to separate between early-type and late-type galaxies.

#### 2.   E vs. S0: The second example is more challenging. We try to separate ellipticals from S0s.

#### By default, case 1 is turned on. To switch to case 2, set the variable CLASS_EARLY_LATE to False.

---





In [None]:
CLASS_EARLY_LATE=True



## Data Preparation and Visualization

### Load data and prepare data

We will use catalog parameters that correlate with galaxy morphology as input features (color and stellar mass, for illustration). It is well known that early-type galaxies are, on average, redder and more massive than late-type galaxies, so we are going to exploit these correlations to estimate the galaxy morphology with ML.

In [None]:
pathinData="/content/drive/My Drive/EDE2019/morphology"

if CLASS_EARLY_LATE:
  # load feature vector and labels
  X_ML = np.load(pathinData+'/feature_E_S.npy')
  #morphological class
  Y_ML = np.load(pathinData+'/label_E_S.npy') 
  #we also load images (for visualization purposes - not used for training)
  I_ML=np.load(pathinData+'/images_ML.npy') 

  

else:
  # load feature vector and labels
  X_ML = np.load(pathinData+'/feature_E_S0.npy')
  #morphological class
  Y_ML = np.load(pathinData+'/label_E_S0.npy') 
  #we also load images (for visualization purposes - not used for training)
  I_ML=np.load(pathinData+'/images_ML_E_S0.npy') 
  
#split training and test datasets
X_ML_train = X_ML[0:len(X_ML)//5*4,:]   
X_ML_test = X_ML[len(X_ML)//5*4:,:]
Y_ML_train = Y_ML[0:len(Y_ML)//5*4]
Y_ML_test = Y_ML[len(Y_ML)//5*4:]
I_ML_train = I_ML[0:len(I_ML)//5*4,:,:,:]
I_ML_test = I_ML[len(I_ML)//5*4:,:,:,:]
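
The slicing above assigns the first 80% of the rows to training and the last 20% to testing, which is only representative if the catalog is not ordered in some way. As a hedged alternative, the sketch below uses `train_test_split` (already imported at the top of the notebook) on synthetic stand-in arrays. The names `X_demo`, `Y_demo`, `I_demo`, the array sizes and the `random_state` are assumptions for illustration and are not part of the original dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X_ML (features), Y_ML (labels) and I_ML (images).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
Y_demo = rng.integers(0, 2, size=100)
I_demo = rng.integers(0, 256, size=(100, 8, 8, 3))

# Shuffle and split all three arrays consistently: 80% train, 20% test.
# stratify keeps the class fractions similar in both subsets.
X_tr, X_te, Y_tr, Y_te, I_tr, I_te = train_test_split(
    X_demo, Y_demo, I_demo, test_size=0.2, random_state=0, stratify=Y_demo)

print(X_tr.shape, X_te.shape, I_te.shape)
```

Passing the features, labels and images in a single call keeps the rows aligned, so an image in the test set still corresponds to its feature vector and label.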


### Visualize some images for illustration

In [None]:
randomized_inds_train = np.random.permutation(len(I_ML))

fig = plt.figure()
for i,j in zip(randomized_inds_train[0:4],range(4)):
  ax = fig.add_subplot(2, 2, j+1)
  im = ax.imshow(I_ML[i,:,:].astype(int))
  plt.title('$Morph$='+str(Y_ML[i]))
  fig.tight_layout() 
  fig.colorbar(im)


### Visualize the feature space used for classification (Stellar Mass / Color)

For the classical ML classification we use only 2 catalog parameters (stellar mass and color). This means that all the information contained in the images is reduced to 2 parameters (features), which is all the algorithms see and use for classification. The following cell plots these parameters for both classes. The two classes are expected to have different distributions in the feature space, so that the ML algorithm can partition the space. The goal is therefore to build an ML algorithm that separates the red points from the blue ones.

In [None]:
xlabel("$Log(M_*)$", fontsize=20)
ylabel("g-r", fontsize=20)
xlim(8,12)
ylim(0,1.2)
scatter(X_ML[Y_ML==1,1],X_ML[Y_ML==1,0],color='blue',s=1,label='Morph1')
scatter(X_ML[Y_ML==0,1],X_ML[Y_ML==0,0],color='red',s=1,label='Morph0')
legend(fontsize=14)

## Random Forest

### Train RF classifier
The first exercise is to train a Random Forest classifier. The classifier takes the 2 parameters (stellar mass and color) as input and tries to predict the visual morphology. We use a maximum depth of 2 here and default settings otherwise. You can change the parameters and explore the effects.

In [None]:
# first define the classifier object called clf
clf = RandomForestClassifier(max_depth=2)

## YOU CAN CREATE SEVERAL CLASSIFIERS WITH DIFFERENT PARAMETERS SO THAT YOU CAN COMPARE THEM
#clf_2 = 
#clf_3 = 
#...

# then train the RF
clf.fit(X_ML_train,Y_ML_train)
# PROVIDE HERE INPUTS FOR TRAINING - SEE DOCUMENTATION
print("Trained RF Classifier")
print(clf)

# The following allows you to see the relative importance of the different features
print("Importance of each feature")
print(clf.feature_importances_)
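
To make the suggestion of comparing several classifiers concrete, here is a minimal sketch that trains Random Forests with different `max_depth` values and prints the training accuracy, test accuracy and feature importances for each. It runs on synthetic two-feature data (two overlapping Gaussian blobs), not on the galaxy catalog. The blob positions, `n_estimators=50` and the `*_demo` names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-feature data: two overlapping Gaussian blobs (hypothetical data).
rng = np.random.default_rng(1)
n = 500
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n, 2))
X1 = rng.normal(loc=[1.5, 1.5], scale=1.0, size=(n, 2))
X_demo = np.vstack([X0, X1])
Y_demo = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# Shuffle, then hold out the last 20% for testing.
order = rng.permutation(len(X_demo))
X_demo, Y_demo = X_demo[order], Y_demo[order]
n_train = len(X_demo) // 5 * 4

scores = {}
for depth in (2, 5, None):  # None lets each tree grow until its leaves are pure
    clf_demo = RandomForestClassifier(max_depth=depth, n_estimators=50, random_state=0)
    clf_demo.fit(X_demo[:n_train], Y_demo[:n_train])
    scores[depth] = (clf_demo.score(X_demo[:n_train], Y_demo[:n_train]),
                     clf_demo.score(X_demo[n_train:], Y_demo[n_train:]))
    print(depth, scores[depth], clf_demo.feature_importances_)
```

A growing gap between training and test accuracy as the depth increases is the usual sign of overfitting, which is worth checking on the real features as well.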

### Visualize a random Tree
The following cell plots one tree from the trained RF. For an explanation of the different elements in the graph, see this [link](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76). Change the parameters of your RF classifier and explore how the classification tree below changes. What happens if you change the max depth from 2 to 5?

In [None]:
# Extract a single tree - this number can be changed (< n_estimators)
estimator = clf.estimators_[1]

from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='tree.dot', 
                feature_names = ["Color","Mass"],
                class_names = ["Early-Type","Late-Type"],
                rounded = True, proportion = False, 
                precision = 2, filled = True)

# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=50'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')
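
If Graphviz is not available, a plain-text view of a tree can be a useful fallback. The sketch below uses `sklearn.tree.export_text` on a small forest trained on synthetic data; the data, the labeling rule and the `*_demo` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic two-feature data with a simple diagonal decision boundary (hypothetical).
rng = np.random.default_rng(2)
X_demo = rng.uniform(size=(200, 2))
Y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)

clf_demo = RandomForestClassifier(max_depth=2, n_estimators=10, random_state=0)
clf_demo.fit(X_demo, Y_demo)

# Print the split rules of one tree of the forest as indented text.
tree_text = export_text(clf_demo.estimators_[1], feature_names=["Color", "Mass"])
print(tree_text)
```

Each indented line is a threshold on one feature, and each leaf reports the predicted class, which mirrors the boxes of the Graphviz plot.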

### Predictions and evaluation of results
The following cells use the trained model  to predict the morphological class of the test dataset and evaluate the performance of your model. 

In [None]:
print("Predicting...")
print("====================")

# this line calls the trained clf and predicts on the TEST set
Y_pred_RF=clf.predict_proba(X_ML_test)[:,1]
#COMPLETE HERE. WHAT DATASET SHOULD YOU USE TO TEST?
#global accuracy. To define the global accuracy we need to transform the predicted probability into a binary
# label (0/1). We use a threshold of 0.5

Y_pred_RF_class=Y_pred_RF*0
Y_pred_RF_class[Y_pred_RF>0.5]=1

print("Global Accuracy RF:", accuracy_score(Y_ML_test, Y_pred_RF_class))


# ROC curve (False positive rate vs. True positive rate)
fpr_RF, tpr_RF, thresholds_RF = roc_curve(Y_ML_test, Y_pred_RF)

#plot ROC
fig = plt.figure() 
title('ROC curve',fontsize=18)
xlabel("FPR", fontsize=20)
ylabel("TPR", fontsize=20)
xlim(0,1)
ylim(0,1)
print(len(fpr_RF))
scatter(fpr_RF,tpr_RF,c=thresholds_RF,linewidth=3,label='RF')
plt.colorbar()
legend(fontsize=14)
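
Each point of the ROC curve corresponds to one probability threshold. As a rough illustration of what `roc_curve` computes, the sketch below evaluates the false positive rate and true positive rate by hand for a few thresholds on a four-object toy example. The helper `roc_point` and the toy arrays are hypothetical and not part of the notebook.

```python
import numpy as np

def roc_point(y_true, y_score, threshold):
    """Return (FPR, TPR) when scores >= threshold are labeled as class 1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_score) >= threshold
    tp = np.sum(y_pred & (y_true == 1))   # true positives
    fp = np.sum(y_pred & (y_true == 0))   # false positives
    fn = np.sum(~y_pred & (y_true == 1))  # false negatives
    tn = np.sum(~y_pred & (y_true == 0))  # true negatives
    return fp / (fp + tn), tp / (tp + fn)

y_true_demo = np.array([0, 0, 1, 1])
y_score_demo = np.array([0.1, 0.4, 0.35, 0.8])

for t in (0.8, 0.4, 0.35, 0.1):
    fpr, tpr = roc_point(y_true_demo, y_score_demo, t)
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves the point toward the upper right of the plot: more true positives are recovered, at the cost of more false positives.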


### Bad classifications of RFs
The following cell visualizes some random examples of bad classifications in order to explore what the classifier has learned. We also show where the bad classifications fall in the feature space. If you run the cell multiple times, the examples will change. Run it for models with different depths (from 2 to 5, for example) and comment. Can you understand the misclassifications?

In [None]:
# objects classified as early-types by the RF but visually classified as late-types
bad = np.where((Y_pred_RF<0.5)&(Y_ML_test==1))
# np.where returns a tuple; permute the index array itself so the examples change on each run
randomized_inds_train = np.random.permutation(bad[0])

fig = plt.figure()
fig.suptitle("Galaxies visually classified as Class1 but classified as Class0",fontsize=10)
for i,j in zip(randomized_inds_train[0:4],range(4)):
  ax = fig.add_subplot(2, 2, j+1)
  im = ax.imshow(I_ML_test[i,:,:])
  plt.title('$Morph$='+str(Y_ML_test[i]))
  fig.tight_layout() 
  fig.colorbar(im)



# objects classified as late-types by the RF but visually classified as early-types
bad2 = ...
## COMPLETE THIS TO SHOW EXAMPLE IMAGES OF BAD CLASSIFICATIONS USING THE EXAMPLE ABOVE
  
#visualize the feature space
fig = plt.figure()
xlabel("$Log(M_*)$", fontsize=20)
ylabel("g-r", fontsize=20)
xlim(8,12)
ylim(0,1.2)
scatter(X_ML_test[bad[0],1],X_ML_test[bad[0],0],color='pink',s=25,label="S class. as E")
#scatter(X_ML_test[bad2[0],1],X_ML_test[bad2[0],0],color='orange',s=25,label='E class. as S') 
legend(fontsize=14)

## ANN Classifier

### Train ANN classifier
We are now going to train an Artificial Neural Network using the same input features. We use here an architecture with only one hidden layer.
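
To give an idea of what such a network computes, here is a minimal NumPy sketch of the forward pass of a network with 2 inputs, one hidden layer and a single sigmoid output. The hidden-layer width, the ReLU activation and the random (untrained) weights are assumptions for illustration; the Keras model defined later in the notebook is the one actually trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Forward pass: 2 inputs -> hidden layer (ReLU) -> 1 sigmoid output."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # shape (n_samples, n_hidden)
    return sigmoid(hidden @ W2 + b2)        # shape (n_samples, 1)

rng = np.random.default_rng(3)
n_hidden = 8                                # hypothetical hidden-layer width
W1 = rng.normal(scale=0.5, size=(2, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=(n_hidden, 1))
b2 = np.zeros(1)

X_demo = rng.normal(size=(5, 2))            # 5 fake galaxies, 2 features each
p_demo = forward(X_demo, W1, b1, W2, b2)
print(p_demo.shape, p_demo.ravel())
```

The output is one value between 0 and 1 per object, which is why a threshold of 0.5 is later used to turn it into a class label.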

The following cell is used to launch TensorBoard which should enable you to visualize the training. Just run the cell. You should see an orange panel appearing. You will need Google Chrome for this to work. If it does not appear, just continue running the other cells. This will not affect the other cells.

In [None]:
# Load the TensorBoard notebook extension
%load_ext tensorboard
%tensorboard --logdir morphology/models/ann

The following cell trains the ANN. If the previous cell worked, you should see some plots appearing in the window above showing the learning history.

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.models import Sequential
# the keras.layers.core / .normalization / .convolutional submodules are not available in recent Keras versions
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Conv2D, MaxPooling2D

pathout='morphology/models/ann' #output folder to store the results
model_name = '/ann1'  #name of the final model which is saved in pathout


# this is to delete ALL previous runs. Only set to True if you want to remove them.
RESET=True
if RESET:
  print("deleting")
  os.system("rm -r "+pathout)

# some hyperparameters to be changed
nb_epoch=50
batch_size=50

#Define callbacks to stop training once the model has converged
patience_par=10
#earlystopping = EarlyStopping(monitor='val_loss',patience = patience_par,verbose=0,mode='auto' )
#modelcheckpoint = ModelCheckpoint(pathout+model_name+"_best.hd5",monitor='val_loss',verbose=0,save_best_only=True)
tensorboard = TensorBoard(log_dir=pathout)

# definition of neural network. 
# COMPLETE HERE YOUR MODEL. START SIMPLE AND ADD MORE HIDDEN LAYERS. EXPLORE THE RESULTS
#You can add/remove layers and see the performance. Also change the optimizers.
ann = Sequential()
ann.add(Flatten(input_shape=(2,1)))  ## this is your input layer. Do not change.
########
## ADD YOUR LAYERS HERE.
#########
ann.add(Dense(1,activation="sigmoid"))  ## this is your output layer (1 value). Do not change.

# compilation of model. loss definition. we are using binary crossentropy.
ann.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

# train the model. YOU CAN CHANGE BATCH SIZE AND NB_EPOCHS TO EXPLORE THE EFFECTS 
Xexp=np.expand_dims(X_ML_train,2)
print(Xexp.shape)
ann.fit(Xexp,Y_ML_train,epochs=nb_epoch,batch_size=batch_size,callbacks=[tensorboard])

# save model just in case it is needed
ann.save(pathout+model_name+".hd5")
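
The model above is compiled with the `binary_crossentropy` loss. As a rough illustration of what that loss measures, the sketch below computes it by hand in NumPy for three sets of toy predictions. The function name, the toy arrays and the clipping value `eps` are assumptions for illustration and do not reproduce the Keras implementation exactly.

```python
import numpy as np

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)]; p is clipped to avoid log(0)."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_demo = np.array([1, 0, 1, 0])
good = np.array([0.9, 0.1, 0.8, 0.2])     # confident and correct
bad = np.array([0.4, 0.6, 0.3, 0.7])      # leaning toward the wrong class
uninformative = np.full(4, 0.5)           # always predicts 0.5

loss_good = binary_crossentropy(y_demo, good)
loss_bad = binary_crossentropy(y_demo, bad)
loss_flat = binary_crossentropy(y_demo, uninformative)
print(loss_good, loss_flat, loss_bad)
```

A model that always outputs 0.5 has a loss of ln(2), roughly 0.69, so a training loss well below that value indicates that the network has learned something from the two features.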

### Predictions

In [None]:
print("Predicting...")
print("====================")

# set to true if you want to load a specific model. otherwise it will just use 
# the last model trained
LOAD_MODEL=False
if LOAD_MODEL:
  import tensorflow as tf  # needed for tf.keras.models.load_model
  ann = tf.keras.models.load_model(pathout+model_name+".hd5")


## This line uses the trained model to predict
Y_pred_ANN=ann.predict(np.expand_dims(X_ML_test,2))

### Bad classifications of ANNs

The following cell visualizes some random examples of bad classifications in order to explore what the classifier has learned. We also show where the bad classifications fall in the feature space. If you run the cell multiple times, the examples will change. Run it for models with different depths (from 2 to 5, for example) and comment. Can you understand the misclassifications?

In [None]:
# objects classified as early-types by the ANN but visually classified as late-types
bad = np.where((Y_pred_ANN[:,0]<0.5)&(Y_ML_test==1))
# np.where returns a tuple; permute the index array itself so the examples change on each run
randomized_inds_train = np.random.permutation(bad[0])

fig = plt.figure()
fig.suptitle("Galaxies visually classified as Class1 but classified as Class0",fontsize=10)
for i,j in zip(randomized_inds_train[0:4],range(4)):
  ax = fig.add_subplot(2, 2, j+1)
  im = ax.imshow(I_ML_test[i,:,:])
  plt.title('$Morph$='+str(Y_ML_test[i]))
  fig.tight_layout() 
  fig.colorbar(im)



# objects classified as late-types by the ANN but visually classified as early-types
bad2 = np.where((Y_pred_ANN[:,0]>0.5)&(Y_ML_test==0))
# np.where returns a tuple; permute the index array itself so the examples change on each run
randomized_inds_train = np.random.permutation(bad2[0])

fig = plt.figure()
fig.suptitle("Galaxies visually classified as Class0 but classified as Class1",fontsize=10)
for i,j in zip(randomized_inds_train[0:4],range(4)):
  ax = fig.add_subplot(2, 2, j+1)
  im = ax.imshow(I_ML_test[i,:,:])
  plt.title('$Morph$='+str(Y_ML_test[i]))
  fig.tight_layout() 
  fig.colorbar(im)
  
#visualize the feature space
fig = plt.figure()
xlabel("$Log(M_*)$", fontsize=20)
ylabel("g-r", fontsize=20)
xlim(8,12)
ylim(0,1.2)
scatter(X_ML_test[bad[0],1],X_ML_test[bad[0],0],color='pink',s=25,label="S class. as E")
scatter(X_ML_test[bad2[0],1],X_ML_test[bad2[0],0],color='orange',s=25,label='E class. as S') 
legend(fontsize=14)

## Performance Estimation
We now compute the global accuracy as well as ROC and P-R curves. If you are not familiar with these curves please see the lecture slides or click [here](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) 
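
As a complement to the ROC curve, each point of a precision-recall curve also corresponds to one probability threshold. The sketch below computes precision and recall by hand for a few thresholds on a four-object toy example. The helper `precision_recall_point`, the toy arrays and the convention of returning a precision of 1 when nothing is predicted as positive are assumptions for illustration.

```python
import numpy as np

def precision_recall_point(y_true, y_score, threshold):
    """Return (precision, recall) when scores >= threshold are labeled as class 1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_score) >= threshold
    tp = np.sum(y_pred & (y_true == 1))   # true positives
    fp = np.sum(y_pred & (y_true == 0))   # false positives
    fn = np.sum(~y_pred & (y_true == 1))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn)
    return precision, recall

y_true_demo = np.array([0, 0, 1, 1])
y_score_demo = np.array([0.1, 0.4, 0.35, 0.8])

for t in (0.8, 0.4, 0.35, 0.1):
    p, r = precision_recall_point(y_true_demo, y_score_demo, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Recall is the same quantity as the TPR of the ROC curve, whereas precision depends on the class balance, which is why the two curves can tell different stories for unbalanced samples.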

In [None]:
#global accuracy. To define the global accuracy we need to transform the predicted probability into a binary
# label (0/1). We use a threshold of 0.5

Y_pred_RF_class=Y_pred_RF*0
Y_pred_RF_class[Y_pred_RF>0.5]=1

print("Global Accuracy RF:", accuracy_score(Y_ML_test, Y_pred_RF_class))


#global accuracy


Y_pred_ANN_class=Y_pred_ANN*0
Y_pred_ANN_class[Y_pred_ANN>0.5]=1


print("Global Accuracy ANN:", accuracy_score(Y_ML_test, Y_pred_ANN_class))



# ROC curve (False positive rate vs. True positive rate)
fpr_RF, tpr_RF, thresholds_RF = roc_curve(Y_ML_test, Y_pred_RF)
fpr_ANN, tpr_ANN, thresholds_ANN = roc_curve(Y_ML_test, Y_pred_ANN)

print("AUC RF:", auc(fpr_RF, tpr_RF))

#plot ROC
fig = plt.figure() 
title('ROC curve',fontsize=18)
xlabel("FPR", fontsize=20)
ylabel("TPR", fontsize=20)
xlim(0,1)
ylim(0,1)
scatter(fpr_RF,tpr_RF,linewidth=3,c=thresholds_RF,label='RF')
scatter(fpr_ANN,tpr_ANN,linewidth=3,c=thresholds_ANN,label='ANN')
legend(fontsize=14)
plt.colorbar()



# Precision-Recall curve (precision vs. recall)

## PLOT HERE THE PRECISION-RECALL CURVE AS DONE FOR THE ROC CURVE. COMMENT THE DIFFERENCES.

## COMPUTE THESE 3 DIAGNOSTICS FOR DIFFERENT MODELS AND COMPARE THEM IN THE SAME PLOTS. COMMENT.