<a href="https://colab.research.google.com/github/mhuertascompany/Saas-Fee/blob/main/hands-on/chapter1/Galaxy_Morphology_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Galaxy Morphology with "classical ML"

The goal of this tutorial series is to illustrate a very basic supervised binary classification with different ML approaches. We set up an ML algorithm to determine the visual morphological type of nearby galaxies from the Sloan Digital Sky Survey. The first deep learning papers in astronomy addressed this problem at low and high redshift (Dieleman+15, Huertas-Company+15).

![](https://drive.google.com/uc?id=1TaiRB1wxui4AKnhuF4iH4LJkmrlb-D6d)

The notebook illustrates first how to train several Machine Learning Classifiers using catalog parameters (Stellar Mass and Color to start with).

As the training set, we use the sample of ~14,000 visually classified galaxies from Nair & Abraham.



In [None]:
import numpy as np
from astropy.io import fits
from astropy.table import Table
import os
from sklearn import preprocessing
import pdb
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from sklearn.metrics import roc_curve, precision_recall_curve, accuracy_score,auc

%pylab inline

## Mount Drive

Before mounting the drive, open [this folder](https://drive.google.com/drive/folders/1PcftgBzBySo1Ync-Wdsp9arTCJ_MfEPE?usp=sharing) and add it to your Google Drive by following these steps:

*   Go to your drive
*   Find shared folder ("Shared with me" link)
*   Right click it
*   "Add Shortcut to Drive"



In [None]:
from google.colab import drive
drive.mount('/content/drive')

---
#### The notebook is set up to illustrate 2 different classifications:


#### 1.   Early vs. Late: This is an easy example in which we only try to separate between early-type and late-type galaxies.

#### 2.   E vs. S0: The second example is more challenging. We try to separate ellipticals from S0s.

#### By default case 1 is turned on. To switch to case 2, set the variable CLASS_EARLY_LATE to False.

---





In [None]:
CLASS_EARLY_LATE=True

## Ex. 1: Random Forest Classifier Elliptical/Spiral with 2 parameters

### Load data and prepare data

For the classical approaches, the inputs are catalog parameters (color and mass, for illustration) that correlate with galaxy morphology. It is well known that early-type galaxies are redder and more massive than late-type galaxies, so we are going to exploit these correlations to estimate the galaxy morphology.

In [None]:
pathinData="/content/drive/My Drive/ED127_2023/morphology"

if CLASS_EARLY_LATE:
  # load feature vector and labels
  X_ML = np.load(pathinData+'/feature_E_S.npy')
  #morphological class
  Y_ML = np.load(pathinData+'/label_E_S.npy')
  #we also load images (for visualization purposes - not used for training)
  I_ML=np.load(pathinData+'/images_ML.npy')

else:
  # load feature vector and labels
  X_ML = np.load(pathinData+'/feature_E_S0.npy')
  #morphological class
  Y_ML = np.load(pathinData+'/label_E_S0.npy')
  #we also load images (for visualization purposes - not used for training)
  I_ML=np.load(pathinData+'/images_ML_E_S0.npy')

#split training and test datasets (first 80% for training, last 20% for testing)
X_ML_train = X_ML[0:len(X_ML)//5*4,:]
X_ML_test = X_ML[len(X_ML)//5*4:,:]
Y_ML_train = Y_ML[0:len(Y_ML)//5*4]
Y_ML_test = Y_ML[len(Y_ML)//5*4:]
I_ML_train = I_ML[0:len(I_ML)//5*4,:,:,:]
I_ML_test = I_ML[len(I_ML)//5*4:,:,:,:]


### Visualize some images for illustration

In [None]:
randomized_inds_train = np.random.permutation(len(I_ML))

fig = plt.figure()
for i,j in zip(randomized_inds_train[0:4],range(4)):
  ax = fig.add_subplot(2, 2, j+1)
  im = ax.imshow(I_ML[i,:,:].astype(int))
  plt.title('$Morph$='+str(Y_ML[i]))
  fig.tight_layout()
  fig.colorbar(im)


### Visualize the feature space used for classification (Stellar Mass / Color)

For the classical ML classification we are going to use 2 catalog parameters only (stellar mass and color). This means that all the information contained in the images is reduced to 2 parameters (features) which is what the algorithms see and will use for classification. The following cell plots these parameters for both classes. The two different classes are expected to have different distributions in the feature space so that the ML algorithm can partition the space. The goal is therefore to build an ML algorithm to separate the red from the blue.

In [None]:
xlabel("$Log(M_*)$", fontsize=20)
ylabel("g-r", fontsize=20)
xlim(8,12)
ylim(0,1.2)
scatter(X_ML[Y_ML==1,1],X_ML[Y_ML==1,0],color='blue',s=1,label='Morph1')
scatter(X_ML[Y_ML==0,1],X_ML[Y_ML==0,0],color='red',s=1,label='Morph0')
legend(fontsize=14)

### Train RF classifier
The first exercise is to train a Random Forest classifier. The classifier takes as input the 2 parameters (stellar mass and color) and tries to predict the visual morphology. You can change the hyperparameters and explore their effects.
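
As a rough illustration of what changing the hyperparameters does, here is a minimal sketch on synthetic data. The two Gaussian blobs are a hypothetical stand-in for the (color, mass) catalog, and the names `X_demo`, `y_demo`, `clf_demo` and `scores` are introduced only for this example. It trains the same three (`max_depth`, `n_estimators`) combinations and records train and test accuracy for each.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the catalog (assumption): two overlapping 2-D blobs,
# column 0 plays the role of color, column 1 the role of log stellar mass.
rng = np.random.default_rng(0)
n = 1000
X_demo = np.vstack([rng.normal([0.8, 11.0], [0.10, 0.5], size=(n, 2)),   # class 0
                    rng.normal([0.5, 10.0], [0.15, 0.6], size=(n, 2))])  # class 1
y_demo = np.repeat([0, 1], n)

# Shuffle before the 80/20 split so both classes appear in each part.
order = rng.permutation(len(y_demo))
X_demo, y_demo = X_demo[order], y_demo[order]
n_train = len(y_demo) // 5 * 4

scores = {}
for md, nest in zip([2, 10, 100], [10, 100, 500]):
    clf_demo = RandomForestClassifier(max_depth=md, n_estimators=nest, random_state=0)
    clf_demo.fit(X_demo[:n_train], y_demo[:n_train])
    # (train accuracy, test accuracy) for this hyperparameter combination
    scores[(md, nest)] = (clf_demo.score(X_demo[:n_train], y_demo[:n_train]),
                          clf_demo.score(X_demo[n_train:], y_demo[n_train:]))
    print(md, nest, scores[(md, nest)])
```

A large gap between train and test accuracy for the deepest forest is a sign of overfitting; the same comparison can be made on the real catalog.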

In [None]:
# you can add the classifiers into a list to access them later
classifiers = []


# YOU CAN CREATE SEVERAL CLASSIFIERS WITH DIFFERENT PARAMETERS SO THAT YOU CAN COMPARE THEM.
# TRY CHANGING THE MAX_DEPTH (e.g. 2,10,100) AND N_ESTIMATORS PARAMETER (e.g. 10,100,500)
for md, nest in zip([2,10,100],[10,100,500]):

  # first define the classifier object called clf
  clf = RandomForestClassifier(max_depth=md, n_estimators=nest)

  # then train the RF and store it in the list
  clf.fit(X_ML_train,Y_ML_train)
  classifiers.append(clf)

  # The following allows you to see the relative importance of the different features
  print("Importance of each feature")
  print(clf.feature_importances_)



### Visualize a random Tree
The following cell plots one tree from the trained RF. For an explanation of the different elements in the graph, see this [link](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76). Change the parameters of your RF classifier and explore what difference it makes to the classification tree below. What happens if you change the max depth from 2 to 5?

In [None]:

class_number = 0

# Extract single tree - this number can be changed (< n_estimators)
estimator = classifiers[class_number].estimators_[1]

from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='tree.dot',
                feature_names = ["Color","Mass"],
                class_names = ["Early-Type","Late-Type"],
                rounded = True, proportion = False,
                precision = 2, filled = True)

# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=50'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')

### Predictions and evaluation of results
The following cells use the trained model to predict the morphological class of the test dataset and evaluate the performance of your model. It is assumed that the different RF classifiers are in a list. However, feel free to change the implementation.
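
To make the shape of the prediction output concrete, here is a small sketch on toy data (the arrays `X_toy`, `y_toy` and the model `rf_toy` are hypothetical and unrelated to the galaxy catalog). `predict_proba` returns one column per class, ordered as in `classes_`, so the class-1 probability is the second column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy binary problem (assumption): label is 1 when the two features sum to a positive value.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

rf_toy = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0).fit(X_toy, y_toy)

proba = rf_toy.predict_proba(X_toy[:5])
print(rf_toy.classes_)  # column order of `proba`
print(proba.shape)      # (5, 2): one row per object, one column per class

# Keep only the probability of class 1, as done in the cell below.
p_class1 = proba[:, 1]
```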

In [None]:

Y_pred_RF=[[0] * len(X_ML_test) for i in range(len(classifiers))]

for i,rf in enumerate(classifiers):

  print("Predicting...")
  print("====================")

  # this line is used to call the trained clf and predict in the TEST set

  Y_pred_RF[i]=rf.predict_proba("COMPLETE")[:,1] # replace "COMPLETE" with the test features; predict_proba returns one column per class and we keep the class-1 column




### Visualization of decision boundaries
Plot the decision boundaries of the different RF classifiers. You can use the function introduced in the previous session or code a new one. Comment on the results.
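
If you prefer to code a new function, one possible shape for it is sketched below. The helper name `plot_decision_boundary`, its `step` argument and the toy data are assumptions made for this example, not the function from the previous session. The helper evaluates the classifier on a regular grid and shades the predicted class.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

def plot_decision_boundary(clf, X, y, ax, step=0.05):
    """Shade the class predicted by `clf` over a grid spanning the 2-D features X."""
    x0 = np.arange(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, step)
    x1 = np.arange(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, step)
    g0, g1 = np.meshgrid(x0, x1)
    # Grid points must be passed in the same column order used for training.
    Z = clf.predict(np.c_[g0.ravel(), g1.ravel()]).reshape(g0.shape)
    ax.contourf(g0, g1, Z, cmap=plt.cm.coolwarm, alpha=0.8)
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, s=1)
    return Z

# Toy data (assumption): class 1 lies inside the unit circle.
rng = np.random.default_rng(5)
X_toy = rng.normal(size=(400, 2))
y_toy = (X_toy[:, 0] ** 2 + X_toy[:, 1] ** 2 < 1).astype(int)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
Z_by_depth = {}
for ax, depth in zip(axes, [2, 10]):
    rf_toy = RandomForestClassifier(max_depth=depth, n_estimators=50, random_state=0)
    rf_toy.fit(X_toy, y_toy)
    Z_by_depth[depth] = plot_decision_boundary(rf_toy, X_toy, y_toy, ax)
    ax.set_title(f"max_depth={depth}")
```

Note that this sketch puts feature column 0 on the horizontal axis, whereas the cell below swaps the axes so that stellar mass is horizontal.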

In [None]:
# this piece of code creates a mesh grid to cover the parameter space
h = .02  # step size in the mesh
x_min, x_max = X_ML_train[:, 0].min() - 1, X_ML_train[:, 0].max() + 1
y_min, y_max = X_ML_train[:, 1].min() - 1, X_ML_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# title for the plots
titles = ['RF1',
          'RF2',
          'RF3'
          ]

for i, clf in enumerate(classifiers):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max] x[y_min, y_max].
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # call the classifer for each point in the grid

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(yy, xx, Z, cmap=plt.cm.coolwarm, alpha=0.8)

    # Plot also the training points
    plt.scatter(X_ML_train[:, 1], X_ML_train[:, 0], c=Y_ML_train, cmap=plt.cm.coolwarm,s=1)
    plt.xlabel('stellar mass')
    plt.ylabel('color')
    plt.ylim(xx.min(), xx.max())
    plt.xlim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])

plt.show()



We now compute the global accuracy as well as the ROC and P-R curves. If you are not familiar with these curves, please see the lecture slides or click [here](https://en.wikipedia.org/wiki/Receiver_operating_characteristic).
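
As a reminder of how the two curves are obtained with scikit-learn, here is a minimal sketch on made-up scores (`y_true` and `y_score` are synthetic and only stand in for the test labels and the class-1 probabilities).

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, auc

# Synthetic labels and scores (assumption): scores are higher on average for class 1.
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(0.3 * y_true + rng.normal(0.35, 0.2, size=500), 0, 1)

# ROC curve: false positive rate vs. true positive rate at every threshold.
fpr, tpr, _ = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# Precision-recall curve and the area under it.
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)

print("ROC AUC:", roc_auc)
print("P-R AUC:", pr_auc)
```

The arrays can then be drawn with, for example, `plot(fpr, tpr)` and `plot(recall, precision)`.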

In [None]:
#global accuracy. To define the global accuracy we need to transform the predicted probability into a binary
# label (0/1). We use a threshold of 0.5

color=['blue','red','green']

#plot ROC
fig = plt.figure()
title('ROC curve',fontsize=18)
xlabel("FPR", fontsize=20)
ylabel("TPR", fontsize=20)
xlim(0,1)
ylim(0,1)

for i in range(len(classifiers)):
  Y_pred_RF_class=Y_pred_RF[i]*0
  Y_pred_RF_class[np.array(Y_pred_RF[i])>0.5]=1
  print("Global Accuracy RF:"+str(i), accuracy_score(Y_ML_test, Y_pred_RF_class))
  # ROC curve (False positive rate vs. True positive rate)
  ## PLOT ROC curve
  legend(fontsize=14)


# Precision Recall curve

## REPEAT STEPS WITH P-R CURVE. (Look at this function for more details: https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html)


The following cells show some random examples of bad classifications in order to explore what the classifier has learned. We also plot the misclassified objects in the feature space. If you run the cell multiple times, the examples will change. Run it for models with different depths (from 2 to 5, for example) and comment. Can you understand the misclassifications?
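
The selection logic can be sketched on a handful of made-up values (the arrays `y_true` and `p_class1` below are hypothetical). Each type of error is a boolean mask combining the thresholded probability with the true label, and a random subset of the matching indices is drawn for display.

```python
import numpy as np

rng = np.random.default_rng(3)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                         # visual labels
p_class1 = np.array([0.2, 0.1, 0.9, 0.4, 0.7, 0.3, 0.45, 0.8])      # predicted P(class 1)

# True class 1 but predicted class 0
missed_class1 = np.flatnonzero((p_class1 < 0.5) & (y_true == 1))
# True class 0 but predicted class 1
false_class1 = np.flatnonzero((p_class1 >= 0.5) & (y_true == 0))

# Shuffle the index array itself, then keep at most 4 examples to display.
sample = rng.permutation(missed_class1)[:4]
print(missed_class1, false_class1, sample)
```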

### Bad classifications of RFs

In [None]:
class_number = 0 # choose the classifier

# objects classified as early-types by the RF but visually classified as late-types
bad = np.where((Y_pred_RF[class_number]<0.5)&(Y_ML_test==1))
# bad is a 1-element tuple: shuffle the index array itself so the examples change on each run
randomized_inds_train = np.random.permutation(bad[0])
fig = plt.figure()
fig.suptitle("Galaxies visually classified as Class1 but classified as Class0",fontsize=10)
for i,j in zip(randomized_inds_train[0:4],range(4)):
  ax = fig.add_subplot(2, 2, j+1)
  im = ax.imshow(I_ML_test[i,:,:])
  plt.title('$Morph$='+str(Y_ML_test[i]))
  fig.tight_layout()
  fig.colorbar(im)



# objects classified as late-types by the RF but visually classified as early-types
# COMPLETE
##bad2 = ...

#visualize the feature space
fig = plt.figure()
xlabel("$Log(M_*)$", fontsize=20)
ylabel("g-r", fontsize=20)
xlim(8,12)
ylim(0,1.2)
scatter(X_ML_test[bad[0],1],X_ML_test[bad[0],0],color='pink',s=25,label="S class. as E")
scatter(X_ML_test[bad2[0],1],X_ML_test[bad2[0],0],color='orange',s=25,label='E class. as S')
legend(fontsize=14)

## Ex. 2 Can you repeat the steps above with an SVM classifier?

Explore the scikit-learn documentation and try to implement an SVM-based classifier for the binary problem above. Explore the bad classifications and plot the ROC curves of the RF and SVM classifiers in the same figure. Comment on the results.
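
As a starting point, the sketch below fits `svm.SVC` with the three kernels on synthetic blobs (a hypothetical stand-in for the catalog; `X_demo`, `y_demo` and `svm_demo` are names introduced here). Standardizing the features with `StandardScaler` is an assumption of this sketch rather than something the exercise requires; it is a common choice because SVMs are sensitive to feature scales.

```python
import numpy as np
from sklearn import svm
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data (assumption): columns mimic color and log stellar mass.
rng = np.random.default_rng(4)
n = 300
X_demo = np.vstack([rng.normal([0.8, 11.0], [0.10, 0.5], size=(n, 2)),
                    rng.normal([0.5, 10.0], [0.15, 0.6], size=(n, 2))])
y_demo = np.repeat([0, 1], n)
order = rng.permutation(2 * n)
X_demo, y_demo = X_demo[order], y_demo[order]
n_train = 2 * n // 5 * 4

svm_demo = {}
for kernel in ["linear", "rbf", "poly"]:
    # probability=True enables predict_proba, which is needed for ROC and P-R curves
    model = make_pipeline(StandardScaler(),
                          svm.SVC(kernel=kernel, C=1.0, probability=True, random_state=0))
    model.fit(X_demo[:n_train], y_demo[:n_train])
    svm_demo[kernel] = model.score(X_demo[n_train:], y_demo[n_train:])
    print(kernel, svm_demo[kernel])
```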

In [None]:
from sklearn import svm

# TRY 3 DIFFERENT SVM CLASSIFIERS WITH 3 TYPES OF KERNELS (LINEAR, RBF AND POLYNOMIAL)
svm_classifiers =[]
C = 1.0  # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C,probability=True).fit(X_ML_train, Y_ML_train)
svm_classifiers.append(svc)  # list.append returns None, so keep the fit and the append separate
#rbf_svc = ...
#poly_svc = ...


Plot the decision boundaries of the different SVM classifiers

In [None]:
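## ADD CODE TO PLOT DECISION BOUNDARIES OF SVM CLASSIFIERS
for i, clf in enumerate(svm_classifiers):
    plt.subplot(2, 2, i + 1)
    # COMPLETE: predict on the mesh grid (xx, yy) defined above and draw the result with plt.contourf

    # Plot also the training points
    plt.scatter(X_ML_train[:, 1], X_ML_train[:, 0], c=Y_ML_train, cmap=plt.cm.coolwarm,s=1)
    plt.xlabel('stellar mass')
    plt.ylabel('color')
    plt.ylim(xx.min(), xx.max())
    plt.xlim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title('SVM' + str(i + 1))

plt.show()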

Predict and plot ROC and P-R curves

In [None]:
Y_pred_SVM=[[0] * len(X_ML_test) for i in range(len(svm_classifiers))]

for i,clf in enumerate(svm_classifiers):

  print("Predicting...")
  print("====================")

  # call the trained SVM and predict on the TEST set (keep the class-1 probability)
  Y_pred_SVM[i]=clf.predict_proba(X_ML_test)[:,1]


In [None]:
## ADD CODE TO PLOT ROC AND P-R CURVES

## Ex 3: Random Forest Classifier E/S0

Build an RF or an SVM classifier for the E/S0 classification. Plot the ROC and P-R curves for the two cases in the same figure. Explore the bad classifications. Comment on the results.

In [None]:
CLASS_EARLY_LATE=False

In [None]:
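# NOTE: Ex. 1 reads from ".../ED127_2023/morphology"; use the folder name that matches your Drive shortcut
pathinData="/content/drive/My Drive/ED127_2022/morphology"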

if CLASS_EARLY_LATE:
  # load feature vector and labels
  X_ML = np.load(pathinData+'/feature_E_S.npy')
  #morphological class
  Y_ML = np.load(pathinData+'/label_E_S.npy')
  #we also load images (for visualization purposes - not used for training)
  I_ML=np.load(pathinData+'/images_ML.npy')

else:
  # load feature vector and labels
  X_ML = np.load(pathinData+'/feature_E_S0.npy')
  #morphological class
  Y_ML = np.load(pathinData+'/label_E_S0.npy')
  #we also load images (for visualization purposes - not used for training)
  I_ML=np.load(pathinData+'/images_ML_E_S0.npy')

#split training and test datasets (first 80% for training, last 20% for testing)
X_ML_train = X_ML[0:len(X_ML)//5*4,:]
X_ML_test = X_ML[len(X_ML)//5*4:,:]
Y_ML_train = Y_ML[0:len(Y_ML)//5*4]
Y_ML_test = Y_ML[len(Y_ML)//5*4:]
I_ML_train = I_ML[0:len(I_ML)//5*4,:,:,:]
I_ML_test = I_ML[len(I_ML)//5*4:,:,:,:]

In [None]:
randomized_inds_train = np.random.permutation(len(I_ML))

fig = plt.figure()
for i,j in zip(randomized_inds_train[0:4],range(4)):
  ax = fig.add_subplot(2, 2, j+1)
  im = ax.imshow(I_ML[i,:,:].astype(int))
  plt.title('$Morph$='+str(Y_ML[i]))
  fig.tight_layout()
  fig.colorbar(im)

In [None]:
xlabel("$Log(M_*)$", fontsize=20)
ylabel("g-r", fontsize=20)
xlim(8,12)
ylim(0,1.2)
scatter(X_ML[Y_ML==1,1],X_ML[Y_ML==1,0],color='blue',s=1,label='Morph1')
scatter(X_ML[Y_ML==0,1],X_ML[Y_ML==0,0],color='red',s=1,label='Morph0')
legend(fontsize=14)

In [None]:
classifiers_ES0 = []

# first define the classifier object called clf
#clf = ...

## YOU CAN CREATE SEVERAL CLASSIFIERS WITH DIFFERENT PARAMETERS SO THAT YOU CAN COMPARE THEM
#clf_2 = ...
#clf_3 =  ...
#...



## Ex. 4: Increasing the number of features

In the previous sections we used only 2 parameters to perform the classification. We will now try to use more dimensions. The following cell loads a catalog with 5 parameters: stellar mass, color, Sersic index, velocity dispersion and axis ratio. Repeat the steps above for the E/S0 case with different sets of parameters, compare and comment.
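
One way to organize the comparison between sets of parameters is sketched below on synthetic data generated with `make_classification` (the 5 columns are random features, not the real catalog, and `feature_sets`, `acc` and `rf_demo` are names introduced for this example). The same RF is trained on different column subsets and the test accuracy is stored for each.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 5-feature binary problem (assumption), split 80/20 with shuffling.
X5, y5 = make_classification(n_samples=2000, n_features=5, n_informative=5,
                             n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X5, y5, test_size=0.2, random_state=0)

feature_sets = {"first 2 columns": [0, 1], "all 5 columns": [0, 1, 2, 3, 4]}
acc = {}
for name, cols in feature_sets.items():
    rf_demo = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
    rf_demo.fit(X_tr[:, cols], y_tr)
    acc[name] = rf_demo.score(X_te[:, cols], y_te)
    # feature_importances_ has one entry per column actually used for training
    print(name, acc[name], rf_demo.feature_importances_)
```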

In [None]:
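# NOTE: Ex. 1 reads from ".../ED127_2023/morphology"; use the folder name that matches your Drive shortcut
pathinData="/content/drive/My Drive/ED127_2022/morphology"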

if CLASS_EARLY_LATE:
  # load feature vector and labels
  X_ML = np.load(pathinData+'/feature_E_S_large.npy')
  #morphological class
  Y_ML = np.load(pathinData+'/label_E_S_large.npy')
  #we also load images (for visualization purposes - not used for training)
  I_ML=np.load(pathinData+'/images_ML_large.npy')

else:
  # load feature vector and labels
  X_ML = np.load(pathinData+'/feature_E_S0_large.npy')
  #morphological class
  Y_ML = np.load(pathinData+'/label_E_S0_large.npy')
  #we also load images (for visualization purposes - not used for training)
  I_ML=np.load(pathinData+'/images_ML_E_S0_large.npy')

#split training and test datasets (first 80% for training, last 20% for testing)
X_ML_train = X_ML[0:len(X_ML)//5*4,:]
X_ML_test = X_ML[len(X_ML)//5*4:,:]
Y_ML_train = Y_ML[0:len(Y_ML)//5*4]
Y_ML_test = Y_ML[len(Y_ML)//5*4:]
I_ML_train = I_ML[0:len(I_ML)//5*4,:,:,:]
I_ML_test = I_ML[len(I_ML)//5*4:,:,:,:]

In [None]:
classifiers_ES0 = []

# first define the classifier object called clf
#clf = ...

## YOU CAN CREATE SEVERAL CLASSIFIERS WITH DIFFERENT PARAMETERS SO THAT YOU CAN COMPARE THEM
#clf_2 =
#clf_3 =
#...

# then train the RF


## Ex. 5: Boosting algorithms


Can you improve the results using [Boosting Algorithms](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)? You can take any of the previous ML algorithms (RF or SVM) and apply a boosting algorithm such as AdaBoost. Plot ROC and P-R curves and compare.
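
To see what boosting does in isolation, the sketch below compares a single decision stump with AdaBoost over 50 stumps on a toy problem with a diagonal boundary. The choice of stumps as the base learner and the toy data are assumptions of this example; the exercise itself asks you to boost the RF or SVM models. The base learner is passed as the first positional argument because its keyword name differs between scikit-learn versions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc

# Toy data (assumption): a diagonal boundary that a single axis-aligned split cannot follow.
rng = np.random.default_rng(6)
X_toy = rng.normal(size=(1000, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = X_toy[:800], X_toy[800:], y_toy[:800], y_toy[800:]

models = {
    "single stump": DecisionTreeClassifier(max_depth=1, random_state=0),
    "AdaBoost (50 stumps)": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                               n_estimators=50, random_state=0),
}

roc_auc = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # class-1 probability on the held-out part, then the area under the ROC curve
    fpr, tpr, _ = roc_curve(y_te, model.predict_proba(X_te)[:, 1])
    roc_auc[name] = auc(fpr, tpr)
    print(name, roc_auc[name])
```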

In [None]:
from sklearn.ensemble import AdaBoostClassifier
abc = AdaBoostClassifier("complete")

#Training
abc.fit("complete")
Y_pred_boost = abc.predict_proba("complete")[:,1]






Plot ROC and P-R