<a href="https://colab.research.google.com/github/mtsizh/galaxy-morphology-manifold-learning/blob/main/find_best_reduction_parameters.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If you performed the dataset curation yourself, upload `curated_imgs.zip` and skip to the next step. Otherwise, run the following code to download the curated dataset from GitHub.

In [1]:
!wget -q https://raw.githubusercontent.com/mtsizh/galaxy-morphology-manifold-learning/main/curated_dataset/curated_imgs_multipart.zip && echo "HEAD downloaded" || echo "ERROR downloading HEAD"

for i in range(1,8):
  !wget -q https://raw.githubusercontent.com/mtsizh/galaxy-morphology-manifold-learning/main/curated_dataset/curated_imgs_multipart.z0{i}  && echo "PART {i} of 7 OK" || echo "ERROR downloading PART {i}"

print('MERGING PARTS')
!zip -FF curated_imgs_multipart.zip --out curated_imgs.zip > /dev/null && rm curated_imgs_multipart.z* && echo "COMPLETE" || echo "FAILED"


HEAD downloaded
PART 1 of 7 OK
PART 2 of 7 OK
PART 3 of 7 OK
PART 4 of 7 OK
PART 5 of 7 OK
PART 6 of 7 OK
PART 7 of 7 OK
MERGING PARTS
COMPLETE
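
The same download can also be done from plain Python with explicit error reporting. The sketch below uses only the standard library. The helper names `part_urls` and `download_parts` are illustrative and not part of the repository, and the parts still have to be merged with `zip -FF` afterwards.

```python
import urllib.request
from pathlib import Path

BASE = ("https://raw.githubusercontent.com/mtsizh/"
        "galaxy-morphology-manifold-learning/main/curated_dataset/")
STEM = "curated_imgs_multipart"


def part_urls(n_parts=7):
    """Return the URL of the archive head (.zip) followed by parts .z01 ... .z0N."""
    names = [f"{STEM}.zip"] + [f"{STEM}.z{i:02d}" for i in range(1, n_parts + 1)]
    return [BASE + name for name in names]


def download_parts(dest=".", n_parts=7):
    """Download every part into dest and return the list of URLs that failed."""
    failed = []
    for url in part_urls(n_parts):
        target = Path(dest) / url.rsplit("/", 1)[-1]
        try:
            urllib.request.urlretrieve(url, target)
        except OSError as err:  # URLError and HTTPError are subclasses of OSError
            print(f"ERROR downloading {url}: {err}")
            failed.append(url)
    return failed
```

Calling `download_parts()` returns an empty list when all eight files were fetched, so the merge step can be skipped when anything is missing.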


Unzip the curated dataset.

In [2]:
!unzip -q -o curated_imgs.zip && echo "UNZIPPED" || echo "FAIL"

UNZIPPED


A few libraries are not installed by default. The following code installs `optuna` and the RAPIDS libraries, which provide the GPU implementations of t-SNE, UMAP, and PCA in `cuml`.

In [5]:
!pip install optuna
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py

try:
  import optuna
  from cuml.manifold import TSNE
  from cuml.manifold import UMAP
  from cuml.decomposition import PCA
  from google.colab import output
  output.clear()
except ImportError as e:
  # report the failed import instead of printing COMPLETE unconditionally
  print('ERROR', e)
else:
  print('COMPLETE')

COMPLETE
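
If the installation fails, it helps to know which package is missing before rerunning the cell. A minimal sketch, assuming only the top-level package names `optuna`, `cuml`, and `cupy` need to be checked; the helper `missing_packages` is hypothetical and not part of the notebook.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level packages used by the notebook
required = ["optuna", "cuml", "cupy"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

`find_spec` only looks the package up and does not import it, so the check is fast even for large GPU libraries.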


Run the following code to generate a report on the different methods. `optuna` is used to search for the best parameters for each method: t-SNE, UMAP, Isomap, LLE, and PCA. The results are saved as `json` files.

In [4]:
import optuna
from sklearn.model_selection import cross_val_score, train_test_split
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
from PIL import Image
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import json
import pprint
import cupy as cp
from cuml.manifold import TSNE
from cuml.manifold import UMAP
from cuml.decomposition import PCA
import warnings
from sklearn.manifold import LocallyLinearEmbedding, Isomap


# use different class maps to get different estimations
class_map = {1: 'round', 2: 'inbetween', 3: 'cigar'}
#class_map = {4: 'edge on', 5: 'edge off'}
#class_map = {6: 'smooth', 7: 'featured'}
methods = ['LLE', 'Isomap', 'uMap', 't-SNE', 'PCA']
n_bootstrap_samples = 5000
n_parameter_trials = 10


df = pd.read_parquet('curated_dataset.parquet')
regex_filter = '|'.join(class_map.values())
filtered_df = df[df['class'].str.contains(regex_filter, regex=True)]
bootstrapped_df = filtered_df.sample(n=n_bootstrap_samples, random_state=25)
X = np.zeros((len(bootstrapped_df), 120, 120))
y = np.zeros(len(bootstrapped_df))

for key, val in class_map.items():
  y[bootstrapped_df['class'].str.contains(val, regex=True)] = key

print('Dataset balance:')
for k,v in class_map.items():
  print(f'class {v} has {np.sum(y == k)} items')
print('-----------------------------------------')

print('LOAD IMAGES')
paths = bootstrapped_df['png_loc'].str.replace('dr5', 'curated_imgs')
with tqdm(total=len(paths)) as progress:
  for idx, file_path in enumerate(paths):
    with Image.open(file_path) as img:
      X[idx,:,:] = np.array(img)
      progress.update()
X_flattened = X.reshape(X.shape[0], -1)

log_intermediate_results = []

def objective(trial, methods):
  global log_intermediate_results
  dr_method = trial.suggest_categorical('dr_method', methods)

  if dr_method == 't-SNE':
    n_components = 2 # cuML t-SNE supports only 2 components; scikit-learn's t-SNE took about 3 days
    perplexity = trial.suggest_int('perplexity', 5, min(50, X.shape[0]-1)) # perplexity < samples
    # cuML warns when n_neighbors < 3 * perplexity, so the lower bound is 3*(perplexity+1)
    n_neighbors = trial.suggest_int('n_neighbors', 3*(perplexity+1), max(3*(perplexity+1), min(50, X.shape[0]-1)))
    reducer = TSNE(n_components=n_components, perplexity=perplexity,
                   method='fft', n_neighbors=n_neighbors)
    log_intermediate_results.append({'dr_method': 't-SNE', 'perplexity': perplexity, 'n_neighbors': n_neighbors})
  elif dr_method == 'LLE':
    n_components = trial.suggest_int('n_components', 2, min(200, X.shape[0]-1)) # components < samples
    n_neighbors = trial.suggest_int('n_neighbors', min(2, n_components),
                                    max(50, n_components)) # neighbors <= samples
    reducer = LocallyLinearEmbedding(n_components=n_components, n_neighbors=n_neighbors, n_jobs=-1)
    log_intermediate_results.append({'dr_method': 'LLE', 'n_components': n_components, 'n_neighbors': n_neighbors})
  elif dr_method == 'Isomap':
    n_components = trial.suggest_int('n_components', 2, min(200, X.shape[0]-1)) # components < samples
    n_neighbors = trial.suggest_int('n_neighbors', 10, min(50, X.shape[0]//2))
    reducer = Isomap(n_components=n_components, n_neighbors=n_neighbors, n_jobs=-1)
    log_intermediate_results.append({'dr_method': 'Isomap', 'n_components': n_components, 'n_neighbors': n_neighbors})
  elif dr_method == 'PCA':
    n_components = trial.suggest_int('n_components', 2, np.min([200, X.shape[0]-1, X.shape[1]-1]))
    reducer = PCA(n_components=n_components, svd_solver='full')
    log_intermediate_results.append({'dr_method': 'PCA', 'n_components': n_components})
  elif dr_method == 'uMap':
    n_components = trial.suggest_int('n_components', 2, np.min([200, X.shape[0]//2, X.shape[1]-1]))
    n_neighbors = trial.suggest_int('n_neighbors', 5, min(50, X.shape[0]-1)) # neighbors < samples
    reducer = UMAP(n_neighbors=n_neighbors, n_components=n_components)
    log_intermediate_results.append({'dr_method': 'uMap', 'n_components': n_components, 'n_neighbors': n_neighbors})

  try: # Isomap sometimes raises ValueError for no obvious reason
    X_reduced = reducer.fit_transform(X_flattened)
  except ValueError as e:
    print(f"Skipping {dr_method} trial due to error: {e}")
    log_intermediate_results.pop() # drop the log entry of the failed trial
    return -np.inf # so the failed trial cannot become the best one

  clf = make_pipeline(StandardScaler(),
                      LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42))
  quality = np.mean(cross_val_score(clf, X_reduced, y, cv=5))
  log_intermediate_results[-1]['quality'] = quality

  return quality

# ignore noisy FutureWarning and UserWarning messages
warnings.filterwarnings("ignore", message=".*force_all_finite.*", category=FutureWarning)
warnings.filterwarnings("ignore", message=".*default method of TSNE.*", category=UserWarning)

result = []
for method in methods:
  print(f'********************************{method}***************************')
  study = optuna.create_study(direction="maximize")
  study.optimize(lambda T: objective(T, [method]),
                n_trials=n_parameter_trials, show_progress_bar=True, n_jobs=1)
  print("Best parameters:", study.best_params, "Best value:", study.best_value)
  result.append(study.best_params)
  result[-1]['best_vavlue'] = study.best_value


# write the best result per method; json.dump always produces valid JSON
with open("results.json", "w") as outfile:
    json.dump(result, outfile, indent=2, sort_keys=True)

# write the parameters and quality of every successful trial
with open("log_intermediate_results.json", "w") as outfile:
    json.dump(log_intermediate_results, outfile, indent=2, sort_keys=True)

Dataset balance:
class round has 1662 items
class inbetween has 2559 items
class cigar has 779 items
-----------------------------------------
LOAD IMAGES


  0%|          | 0/5000 [00:00<?, ?it/s]

[I 2025-02-17 19:11:43,680] A new study created in memory with name: no-name-73dd1d6d-b77a-419d-80a7-a891978137b9


********************************LLE***************************


  0%|          | 0/10 [00:00<?, ?it/s]

[I 2025-02-17 19:16:44,088] Trial 0 finished with value: 0.915 and parameters: {'dr_method': 'LLE', 'n_components': 177, 'n_neighbors': 174}. Best is trial 0 with value: 0.915.
[I 2025-02-17 19:17:18,013] Trial 1 finished with value: 0.7074 and parameters: {'dr_method': 'LLE', 'n_components': 37, 'n_neighbors': 2}. Best is trial 0 with value: 0.915.
[I 2025-02-17 19:20:50,248] Trial 2 finished with value: 0.9179999999999999 and parameters: {'dr_method': 'LLE', 'n_components': 193, 'n_neighbors': 136}. Best is trial 2 with value: 0.9179999999999999.
[I 2025-02-17 19:21:40,909] Trial 3 finished with value: 0.9259999999999999 and parameters: {'dr_method': 'LLE', 'n_components': 77, 'n_neighbors': 50}. Best is trial 3 with value: 0.9259999999999999.
[I 2025-02-17 19:22:30,841] Trial 4 finished with value: 0.9288000000000001 and parameters: {'dr_method': 'LLE', 'n_components': 121, 'n_neighbors': 34}. Best is trial 4 with value: 0.9288000000000001.
[I 2025-02-17 19:23:40,052] Trial 5 finish

[I 2025-02-17 19:27:26,897] A new study created in memory with name: no-name-53c21635-d7aa-468a-b04d-1ad354005471


[I 2025-02-17 19:27:26,876] Trial 9 finished with value: 0.9206000000000001 and parameters: {'dr_method': 'LLE', 'n_components': 181, 'n_neighbors': 116}. Best is trial 7 with value: 0.9344000000000001.
Best parameters: {'dr_method': 'LLE', 'n_components': 138, 'n_neighbors': 10} Best value: 0.9344000000000001
********************************Isomap***************************


  0%|          | 0/10 [00:00<?, ?it/s]

[I 2025-02-17 19:28:14,664] Trial 0 finished with value: 0.8760000000000001 and parameters: {'dr_method': 'Isomap', 'n_components': 87, 'n_neighbors': 10}. Best is trial 0 with value: 0.8760000000000001.
[I 2025-02-17 19:29:06,711] Trial 1 finished with value: 0.8664 and parameters: {'dr_method': 'Isomap', 'n_components': 74, 'n_neighbors': 22}. Best is trial 0 with value: 0.8760000000000001.
[I 2025-02-17 19:30:09,484] Trial 2 finished with value: 0.8610000000000001 and parameters: {'dr_method': 'Isomap', 'n_components': 180, 'n_neighbors': 50}. Best is trial 0 with value: 0.8760000000000001.
[I 2025-02-17 19:31:04,606] Trial 3 finished with value: 0.8501999999999998 and parameters: {'dr_method': 'Isomap', 'n_components': 49, 'n_neighbors': 34}. Best is trial 0 with value: 0.8760000000000001.
[I 2025-02-17 19:32:03,783] Trial 4 finished with value: 0.8625999999999999 and parameters: {'dr_method': 'Isomap', 'n_components': 153, 'n_neighbors': 42}. Best is trial 0 with value: 0.87600000

[I 2025-02-17 19:36:26,184] A new study created in memory with name: no-name-79029dc3-3058-4fbb-9977-86fbc44449cc


[I 2025-02-17 19:36:26,170] Trial 9 finished with value: 0.8642 and parameters: {'dr_method': 'Isomap', 'n_components': 96, 'n_neighbors': 29}. Best is trial 0 with value: 0.8760000000000001.
Best parameters: {'dr_method': 'Isomap', 'n_components': 87, 'n_neighbors': 10} Best value: 0.8760000000000001
********************************uMap***************************


  0%|          | 0/10 [00:00<?, ?it/s]

[I 2025-02-17 19:36:32,578] Trial 0 finished with value: 0.9112000000000002 and parameters: {'dr_method': 'uMap', 'n_components': 78, 'n_neighbors': 39}. Best is trial 0 with value: 0.9112000000000002.
[I 2025-02-17 19:36:36,978] Trial 1 finished with value: 0.908 and parameters: {'dr_method': 'uMap', 'n_components': 52, 'n_neighbors': 47}. Best is trial 0 with value: 0.9112000000000002.
[I 2025-02-17 19:36:40,904] Trial 2 finished with value: 0.9066000000000001 and parameters: {'dr_method': 'uMap', 'n_components': 68, 'n_neighbors': 44}. Best is trial 0 with value: 0.9112000000000002.
[I 2025-02-17 19:36:44,305] Trial 3 finished with value: 0.9084 and parameters: {'dr_method': 'uMap', 'n_components': 57, 'n_neighbors': 32}. Best is trial 0 with value: 0.9112000000000002.
[I 2025-02-17 19:36:47,409] Trial 4 finished with value: 0.9057999999999999 and parameters: {'dr_method': 'uMap', 'n_components': 50, 'n_neighbors': 28}. Best is trial 0 with value: 0.9112000000000002.
[I 2025-02-17 1

[I 2025-02-17 19:36:59,530] A new study created in memory with name: no-name-8e301f43-19f7-454a-9a97-0d14f2cb394e


[I 2025-02-17 19:36:59,522] Trial 9 finished with value: 0.8870000000000001 and parameters: {'dr_method': 'uMap', 'n_components': 33, 'n_neighbors': 33}. Best is trial 5 with value: 0.916.
Best parameters: {'dr_method': 'uMap', 'n_components': 82, 'n_neighbors': 26} Best value: 0.916
********************************t-SNE***************************


  0%|          | 0/10 [00:00<?, ?it/s]

[I 2025-02-17 19:37:01,297] Trial 0 finished with value: 0.5067999999999999 and parameters: {'dr_method': 't-SNE', 'perplexity': 7, 'n_neighbors': 47}. Best is trial 0 with value: 0.5067999999999999.
[W] [19:37:01.667188] # of Nearest Neighbors should be at least 3 * perplexity. Your results might be a bit strange...
[I 2025-02-17 19:37:02,621] Trial 1 finished with value: 0.5267999999999999 and parameters: {'dr_method': 't-SNE', 'perplexity': 33, 'n_neighbors': 102}. Best is trial 1 with value: 0.5267999999999999.
[W] [19:37:03.034334] # of Nearest Neighbors should be at least 3 * perplexity. Your results might be a bit strange...
[I 2025-02-17 19:37:03,984] Trial 2 finished with value: 0.512 and parameters: {'dr_method': 't-SNE', 'perplexity': 46, 'n_neighbors': 141}. Best is trial 1 with value: 0.5267999999999999.
[I 2025-02-17 19:37:05,295] Trial 3 finished with value: 0.5166000000000001 and parameters: {'dr_method': 't-SNE', 'perplexity': 15, 'n_neighbors': 50}. Best is trial 1 wi

[I 2025-02-17 19:37:13,499] A new study created in memory with name: no-name-c7be7142-b880-4b74-9492-8e61416fb38f


[I 2025-02-17 19:37:13,493] Trial 9 finished with value: 0.5146000000000001 and parameters: {'dr_method': 't-SNE', 'perplexity': 25, 'n_neighbors': 78}. Best is trial 7 with value: 0.5426.
Best parameters: {'dr_method': 't-SNE', 'perplexity': 33, 'n_neighbors': 102} Best value: 0.5426
********************************PCA***************************


  0%|          | 0/10 [00:00<?, ?it/s]

[I 2025-02-17 19:38:21,389] Trial 0 finished with value: 0.6054 and parameters: {'dr_method': 'PCA', 'n_components': 78}. Best is trial 0 with value: 0.6054.
[I 2025-02-17 19:39:30,080] Trial 1 finished with value: 0.6064 and parameters: {'dr_method': 'PCA', 'n_components': 73}. Best is trial 1 with value: 0.6064.
[I 2025-02-17 19:40:38,470] Trial 2 finished with value: 0.5837999999999999 and parameters: {'dr_method': 'PCA', 'n_components': 15}. Best is trial 1 with value: 0.6064.
[I 2025-02-17 19:41:46,909] Trial 3 finished with value: 0.6052000000000001 and parameters: {'dr_method': 'PCA', 'n_components': 54}. Best is trial 1 with value: 0.6064.
[I 2025-02-17 19:42:55,453] Trial 4 finished with value: 0.6049999999999999 and parameters: {'dr_method': 'PCA', 'n_components': 105}. Best is trial 1 with value: 0.6064.
[I 2025-02-17 19:44:03,821] Trial 5 finished with value: 0.5934 and parameters: {'dr_method': 'PCA', 'n_components': 19}. Best is trial 1 with value: 0.6064.
[I 2025-02-17 1
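
The trial lines in the log above follow a fixed pattern, so individual results can be recovered from the printed output even without the `json` files. The following is a rough sketch using only the standard library; `parse_trial_line` is a hypothetical helper, and the pattern is inferred from the lines shown here rather than taken from any Optuna specification.

```python
import ast
import re

# Matches e.g. "Trial 0 finished with value: 0.915 and parameters: {...}."
TRIAL_RE = re.compile(
    r"Trial (?P<number>\d+) finished with value: (?P<value>\S+) "
    r"and parameters: (?P<params>\{.*?\})\."
)


def parse_trial_line(line):
    """Return (trial number, value, params) or None for incomplete or unrelated lines."""
    match = TRIAL_RE.search(line)
    if match is None:
        return None
    # The parameters are printed as a Python dict literal
    params = ast.literal_eval(match.group("params"))
    return int(match.group("number")), float(match.group("value")), params


line = ("[I 2025-02-17 19:16:44,088] Trial 0 finished with value: 0.915 and "
        "parameters: {'dr_method': 'LLE', 'n_components': 177, 'n_neighbors': 174}. "
        "Best is trial 0 with value: 0.915.")
print(parse_trial_line(line))
```

Lines that were cut off in the output, such as a record ending in `Trial 5 finish`, do not match the pattern and yield `None`.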

In [5]:
with open('results.json') as json_file:
  data = json.load(json_file)
print(data)

with open('log_intermediate_results.json') as json_file:
  data = json.load(json_file)
print(data)

[{'best_vavlue': 0.9344000000000001, 'dr_method': 'LLE', 'n_components': 138, 'n_neighbors': 10}, {'best_vavlue': 0.8760000000000001, 'dr_method': 'Isomap', 'n_components': 87, 'n_neighbors': 10}, {'best_vavlue': 0.916, 'dr_method': 'uMap', 'n_components': 82, 'n_neighbors': 26}, {'best_vavlue': 0.5426, 'dr_method': 't-SNE', 'n_neighbors': 102, 'perplexity': 33}, {'best_vavlue': 0.6077999999999999, 'dr_method': 'PCA', 'n_components': 63}]
[{'dr_method': 'LLE', 'n_components': 177, 'n_neighbors': 174, 'quality': 0.915}, {'dr_method': 'LLE', 'n_components': 37, 'n_neighbors': 2, 'quality': 0.7074}, {'dr_method': 'LLE', 'n_components': 193, 'n_neighbors': 136, 'quality': 0.9179999999999999}, {'dr_method': 'LLE', 'n_components': 77, 'n_neighbors': 50, 'quality': 0.9259999999999999}, {'dr_method': 'LLE', 'n_components': 121, 'n_neighbors': 34, 'quality': 0.9288000000000001}, {'dr_method': 'LLE', 'n_components': 105, 'n_neighbors': 66, 'quality': 0.9266}, {'dr_method': 'LLE', 'n_components':
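
To compare the methods at a glance, the entries of `results.json` can be sorted by their best score. A small sketch, assuming the score is stored under the key `best_vavlue` exactly as it is spelled in the file above; `rank_methods` is an illustrative helper, and the values are copied from the printed results.

```python
def rank_methods(results, score_key="best_vavlue"):
    """Sort per-method results from best to worst score."""
    return sorted(results, key=lambda entry: entry[score_key], reverse=True)


# Best values copied from the results printed above
results = [
    {"dr_method": "LLE", "best_vavlue": 0.9344000000000001},
    {"dr_method": "Isomap", "best_vavlue": 0.8760000000000001},
    {"dr_method": "uMap", "best_vavlue": 0.916},
    {"dr_method": "t-SNE", "best_vavlue": 0.5426},
    {"dr_method": "PCA", "best_vavlue": 0.6077999999999999},
]

for entry in rank_methods(results):
    print(f"{entry['dr_method']:>7}: {entry['best_vavlue']:.4f}")
```

With only 10 trials per method, small differences between neighbouring scores should be read with caution.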