<a href="https://colab.research.google.com/github/mtsizh/galaxy-morphology-manifold-learning/blob/main/find_best_reduction_parameters.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If you curated the dataset yourself, upload `curated_imgs.zip` and skip to the next step. Otherwise, run the following code to download the curated dataset from GitHub.

In [2]:
# download the head of the split archive
!wget -q https://raw.githubusercontent.com/mtsizh/galaxy-morphology-manifold-learning/main/curated_dataset/curated_imgs_multipart.zip && echo "HEAD downloaded" || echo "ERROR downloading HEAD"

# download the remaining parts (.z01 to .z07)
for i in range(1,8):
  !wget -q https://raw.githubusercontent.com/mtsizh/galaxy-morphology-manifold-learning/main/curated_dataset/curated_imgs_multipart.z0{i}  && echo "PART {i} of 7 OK" || echo "ERROR downloading PART {i}"

print('MERGING PARTS')
# merge the parts into a single archive, then delete the parts
!zip -FF curated_imgs_multipart.zip --out curated_imgs.zip > /dev/null && rm curated_imgs_multipart.z* && echo "COMPLETE" || echo "FAILED"


HEAD downloaded
PART 1 of 7 OK
PART 2 of 7 OK
PART 3 of 7 OK
PART 4 of 7 OK
PART 5 of 7 OK
PART 6 of 7 OK
PART 7 of 7 OK
MERGING PARTS
COMPLETE


Unzip the curated dataset.

In [3]:
!unzip -q -o curated_imgs.zip && echo "UNZIPPED" || echo "FAIL"

UNZIPPED
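
Before relying on the merged archive, it can be worth confirming that it is readable. The following is a minimal sketch using only Python's standard `zipfile` module; the helper name `check_zip` is hypothetical and not part of this notebook.

```python
import zipfile

def check_zip(path):
    """Return (number_of_entries, first_corrupt_member_or_None) for a zip archive."""
    with zipfile.ZipFile(path) as archive:
        # testzip() reads every member and returns the name of the first one
        # whose CRC check fails, or None if all members are intact.
        first_bad = archive.testzip()
        return len(archive.namelist()), first_bad
```

For an intact archive, `check_zip('curated_imgs.zip')` should return `None` as its second element.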


A few libraries are not installed by default. The following code installs `optuna` and RAPIDS cuML, which provides the GPU implementations of t-SNE, UMAP and PCA used below.

In [4]:
!pip install optuna
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py

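try:
  import optuna
  from cuml.manifold import TSNE
  from cuml.manifold import UMAP
  from cuml.decomposition import PCA
  from google.colab import output
  output.clear() # hide the lengthy installation log
  print('COMPLETE')
except ImportError as e:
  # report the failure instead of printing COMPLETE unconditionally
  print(f'ERROR: {e}')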

COMPLETE


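Run the following code to generate a report comparing the methods. `optuna` searches for the best parameters of each method (t-SNE, UMAP, Isomap, LLE, PCA), scoring every candidate embedding by the mean 5-fold cross-validated accuracy of a logistic regression classifier. The results are saved to a JSON file, `results.json`.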
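
The t-SNE branch of the objective in the next cell ties the `n_neighbors` search range to the sampled `perplexity`. As a rough illustration of how those bounds behave, the sketch below mirrors that arithmetic in plain Python; the helper name `tsne_neighbor_range` and its `cap` argument are hypothetical and exist only for this example.

```python
def tsne_neighbor_range(perplexity, n_samples, cap=50):
    """Mirror the n_neighbors bounds used in the t-SNE branch of the objective."""
    low = 3 * (perplexity + 1)
    high = max(low, min(cap, n_samples - 1))
    return low, high

# With 5000 samples the range collapses to a single value once `low` exceeds the cap.
for p in (5, 15, 16, 34):
    print(p, tsne_neighbor_range(p, 5000))
```

With 5000 samples, a perplexity of 16 or more leaves only one admissible value, `3 * (perplexity + 1)`. This is consistent with the log further down, where a perplexity of 34 is paired with `n_neighbors` of 105.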

In [None]:
import optuna
from sklearn.model_selection import cross_val_score, train_test_split
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
from PIL import Image
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import json
import pprint
import cupy as cp
from cuml.manifold import TSNE
from cuml.manifold import UMAP
from cuml.decomposition import PCA
import warnings
from sklearn.manifold import LocallyLinearEmbedding, Isomap


# use different class maps to get different estimations
class_map = {1: 'round', 2: 'inbetween', 3: 'cigar'}
#class_map = {4: 'edge on', 5: 'edge off'}
#class_map = {6: 'smooth', 7: 'featured'}
methods = ['uMap', 't-SNE', 'PCA', 'LLE', 'Isomap']
n_bootstrap_samples = 5000
n_parameter_trials = 50


df = pd.read_parquet('curated_dataset.parquet')
regex_filter = '|'.join(class_map.values())
filtered_df = df[df['class'].str.contains(regex_filter, regex=True)]
bootstrapped_df = filtered_df.sample(n=n_bootstrap_samples, random_state=25)
X = np.zeros((len(bootstrapped_df), 120, 120))
y = np.zeros(len(bootstrapped_df))

for key, val in class_map.items():
  y[bootstrapped_df['class'].str.contains(val, regex=True)] = key

print('Dataset balance:')
for k,v in class_map.items():
  print(f'class {v} has {np.sum(y == k)} items')
print('-----------------------------------------')

print('LOAD IMAGES')
paths = bootstrapped_df['png_loc'].str.replace('dr5', 'curated_imgs')
with tqdm(total=len(paths)) as progress:
  for idx, file_path in enumerate(paths):
    with Image.open(file_path) as img:
      X[idx,:,:] = np.array(img)
      progress.update()
X_flattened = X.reshape(X.shape[0], -1)



def objective(trial, methods):
  dr_method = trial.suggest_categorical('dr_method', methods)

  if dr_method == 't-SNE':
    n_components = 2 # cuML t-SNE supports only 2 components; sklearn's t-SNE takes about 3 days here
    perplexity = trial.suggest_int('perplexity', 5, min(50, X.shape[0]-1)) # perplexity < samples
    n_neighbors = trial.suggest_int('n_neighbors', 3*(perplexity+1), max(3*(perplexity+1), min(50, X.shape[0]-1)))
    reducer = TSNE(n_components=n_components, perplexity=perplexity,
                   method='fft', n_neighbors=n_neighbors)
  elif dr_method == 'LLE':
    n_components = trial.suggest_int('n_components', 2, min(200, X.shape[0]-1)) # components < samples
    n_neighbors = trial.suggest_int('n_neighbors', 2,
                                    max(50, n_components)) # upper bound grows with n_components
    reducer = LocallyLinearEmbedding(n_components=n_components, n_neighbors=n_neighbors)
  elif dr_method == 'Isomap':
    n_components = trial.suggest_int('n_components', 2, min(200, X.shape[0]-1)) # components < samples
    n_neighbors = trial.suggest_int('n_neighbors', 10, min(50, X.shape[0]//2))
    reducer = Isomap(n_components=n_components, n_neighbors=n_neighbors)
  elif dr_method == 'PCA':
    n_components = trial.suggest_int('n_components', 2, np.min([200, X.shape[0]-1, X.shape[1]-1]))
    reducer = PCA(n_components=n_components, svd_solver='full')
  elif dr_method == 'uMap':
    n_components = trial.suggest_int('n_components', 2, np.min([200, X.shape[0]//2, X.shape[1]-1]))
    n_neighbors = trial.suggest_int('n_neighbors', 5, min(50, X.shape[0]-1)) # neighbors < samples
    reducer = UMAP(n_neighbors=n_neighbors, n_components=n_components)


  try: # Isomap sometimes raises a ValueError for no apparent reason
    X_reduced = reducer.fit_transform(X_flattened)
  except ValueError as e:
    print(f"Skipping {dr_method} trial due to error: {e}")
    return -np.inf # worst possible score, so a failed trial is never selected as best

  #n_estimators = trial.suggest_int('n_estimators', 50, 300)
  #clf = RandomForestClassifier(n_estimators=n_estimators)
  clf = make_pipeline(StandardScaler(),
                      LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42))

  return np.mean(cross_val_score(clf, X_reduced, y, cv=5))

# ignore annoying futurewarnings
warnings.filterwarnings("ignore", message=".*force_all_finite.*", category=FutureWarning)
warnings.filterwarnings("ignore", message=".*default method of TSNE.*", category=UserWarning)

result = []
for method in methods:
  print(f'********************************{method}***************************')
  study = optuna.create_study(direction="maximize")
  study.optimize(lambda T: objective(T, [method]),
                n_trials=n_parameter_trials, show_progress_bar=True, n_jobs=1)
  print("Best parameters:", study.best_params, "Best value:", study.best_value)
  result.append(study.best_params)
  result[-1]['best_value'] = study.best_value


# write the best parameters and score of every method as JSON
with open("results.json", "w") as outfile:
    json.dump(result, outfile, indent=2)

Dataset balance:
class round has 1662 items
class inbetween has 2559 items
class cigar has 779 items
-----------------------------------------
LOAD IMAGES


  0%|          | 0/5000 [00:00<?, ?it/s]

[I 2025-02-16 21:46:49,010] A new study created in memory with name: no-name-4a85727c-ceb9-4781-9290-f6669d66a33a


********************************uMap***************************


  0%|          | 0/50 [00:00<?, ?it/s]

[I 2025-02-16 21:46:51,852] Trial 0 finished with value: 0.9008 and parameters: {'dr_method': 'uMap', 'n_components': 29, 'n_neighbors': 24}. Best is trial 0 with value: 0.9008.
[I 2025-02-16 21:46:55,714] Trial 1 finished with value: 0.9103999999999999 and parameters: {'dr_method': 'uMap', 'n_components': 75, 'n_neighbors': 28}. Best is trial 1 with value: 0.9103999999999999.
[I 2025-02-16 21:47:00,245] Trial 2 finished with value: 0.913 and parameters: {'dr_method': 'uMap', 'n_components': 79, 'n_neighbors': 37}. Best is trial 2 with value: 0.913.
[I 2025-02-16 21:47:03,337] Trial 3 finished with value: 0.9108 and parameters: {'dr_method': 'uMap', 'n_components': 51, 'n_neighbors': 21}. Best is trial 2 with value: 0.913.
[I 2025-02-16 21:47:05,136] Trial 4 finished with value: 0.9046 and parameters: {'dr_method': 'uMap', 'n_components': 22, 'n_neighbors': 21}. Best is trial 2 with value: 0.913.
[I 2025-02-16 21:47:11,580] Trial 5 finished with value: 0.9092 and parameters: {'dr_metho

[I 2025-02-16 21:51:04,724] A new study created in memory with name: no-name-8a352b1f-2935-4702-ad50-88bf412bd631


[I 2025-02-16 21:51:04,715] Trial 49 finished with value: 0.9100000000000001 and parameters: {'dr_method': 'uMap', 'n_components': 80, 'n_neighbors': 13}. Best is trial 35 with value: 0.9188000000000001.
Best parameters: {'dr_method': 'uMap', 'n_components': 99, 'n_neighbors': 20} Best value: 0.9188000000000001
********************************t-SNE***************************


  0%|          | 0/50 [00:00<?, ?it/s]

[W] [21:51:05.127499] # of Nearest Neighbors should be at least 3 * perplexity. Your results might be a bit strange...
[I 2025-02-16 21:51:06,191] Trial 0 finished with value: 0.5236000000000001 and parameters: {'dr_method': 't-SNE', 'perplexity': 34, 'n_neighbors': 105}. Best is trial 0 with value: 0.5236000000000001.
[W] [21:51:06.544336] # of Nearest Neighbors should be at least 3 * perplexity. Your results might be a bit strange...
[I 2025-02-16 21:51:07,534] Trial 1 finished with value: 0.509 and parameters: {'dr_method': 't-SNE', 'perplexity': 38, 'n_neighbors': 117}. Best is trial 0 with value: 0.5236000000000001.
[I 2025-02-16 21:51:08,932] Trial 2 finished with value: 0.5294 and parameters: {'dr_method': 't-SNE', 'perplexity': 17, 'n_neighbors': 54}. Best is trial 2 with value: 0.5294.
[W] [21:51:09.317712] # of Nearest Neighbors should be at least 3 * perplexity. Your results might be a bit strange...
[I 2025-02-16 21:51:10,364] Trial 3 finished with value: 0.5112 and paramet

[I 2025-02-16 21:52:14,526] A new study created in memory with name: no-name-99e712b2-78e5-4430-a290-6bee54809562


[I 2025-02-16 21:52:14,518] Trial 49 finished with value: 0.5338 and parameters: {'dr_method': 't-SNE', 'perplexity': 30, 'n_neighbors': 93}. Best is trial 22 with value: 0.5352.
Best parameters: {'dr_method': 't-SNE', 'perplexity': 27, 'n_neighbors': 84} Best value: 0.5352
********************************PCA***************************


  0%|          | 0/50 [00:00<?, ?it/s]

[I 2025-02-16 21:53:25,726] Trial 0 finished with value: 0.6032 and parameters: {'dr_method': 'PCA', 'n_components': 57}. Best is trial 0 with value: 0.6032.
[I 2025-02-16 21:54:36,206] Trial 1 finished with value: 0.6078 and parameters: {'dr_method': 'PCA', 'n_components': 61}. Best is trial 1 with value: 0.6078.
[I 2025-02-16 21:55:46,635] Trial 2 finished with value: 0.6046 and parameters: {'dr_method': 'PCA', 'n_components': 43}. Best is trial 1 with value: 0.6078.
[I 2025-02-16 21:56:57,235] Trial 3 finished with value: 0.6066 and parameters: {'dr_method': 'PCA', 'n_components': 94}. Best is trial 1 with value: 0.6078.
[I 2025-02-16 21:58:07,914] Trial 4 finished with value: 0.601 and parameters: {'dr_method': 'PCA', 'n_components': 118}. Best is trial 1 with value: 0.6078.
[I 2025-02-16 21:59:18,458] Trial 5 finished with value: 0.6082 and parameters: {'dr_method': 'PCA', 'n_components': 65}. Best is trial 5 with value: 0.6082.
[I 2025-02-16 22:00:28,855] Trial 6 finished with va

[I 2025-02-16 22:51:03,042] A new study created in memory with name: no-name-f8023e9a-1024-4af9-b58f-41883e3c1990


[I 2025-02-16 22:51:03,037] Trial 49 finished with value: 0.6078 and parameters: {'dr_method': 'PCA', 'n_components': 61}. Best is trial 15 with value: 0.6102.
Best parameters: {'dr_method': 'PCA', 'n_components': 60} Best value: 0.6102
********************************LLE***************************


  0%|          | 0/50 [00:00<?, ?it/s]

[I 2025-02-16 22:51:44,230] Trial 0 finished with value: 0.9269999999999999 and parameters: {'dr_method': 'LLE', 'n_components': 34, 'n_neighbors': 16}. Best is trial 0 with value: 0.9269999999999999.
[I 2025-02-16 22:52:44,576] Trial 1 finished with value: 0.9314 and parameters: {'dr_method': 'LLE', 'n_components': 178, 'n_neighbors': 56}. Best is trial 1 with value: 0.9314.
[I 2025-02-16 22:54:12,019] Trial 2 finished with value: 0.9268000000000001 and parameters: {'dr_method': 'LLE', 'n_components': 133, 'n_neighbors': 83}. Best is trial 1 with value: 0.9314.
[I 2025-02-16 22:56:01,863] Trial 3 finished with value: 0.9152000000000001 and parameters: {'dr_method': 'LLE', 'n_components': 98, 'n_neighbors': 97}. Best is trial 1 with value: 0.9314.
[I 2025-02-16 22:58:03,961] Trial 4 finished with value: 0.9222000000000001 and parameters: {'dr_method': 'LLE', 'n_components': 152, 'n_neighbors': 118}. Best is trial 1 with value: 0.9314.
[I 2025-02-16 22:59:04,543] Trial 5 finished with v

[I 2025-02-16 23:36:14,504] A new study created in memory with name: no-name-981e9e96-f5d4-4efa-85c6-32f04e4ff982


[I 2025-02-16 23:36:14,493] Trial 49 finished with value: 0.9269999999999999 and parameters: {'dr_method': 'LLE', 'n_components': 51, 'n_neighbors': 24}. Best is trial 37 with value: 0.9368000000000001.
Best parameters: {'dr_method': 'LLE', 'n_components': 45, 'n_neighbors': 12} Best value: 0.9368000000000001
********************************Isomap***************************


  0%|          | 0/50 [00:00<?, ?it/s]

[I 2025-02-16 23:37:38,247] Trial 0 finished with value: 0.8628 and parameters: {'dr_method': 'Isomap', 'n_components': 169, 'n_neighbors': 48}. Best is trial 0 with value: 0.8628.
[I 2025-02-16 23:38:59,512] Trial 1 finished with value: 0.8555999999999999 and parameters: {'dr_method': 'Isomap', 'n_components': 73, 'n_neighbors': 38}. Best is trial 0 with value: 0.8628.
[I 2025-02-16 23:40:22,845] Trial 2 finished with value: 0.8625999999999999 and parameters: {'dr_method': 'Isomap', 'n_components': 172, 'n_neighbors': 42}. Best is trial 0 with value: 0.8628.
[I 2025-02-16 23:41:40,922] Trial 3 finished with value: 0.8480000000000001 and parameters: {'dr_method': 'Isomap', 'n_components': 34, 'n_neighbors': 50}. Best is trial 0 with value: 0.8628.
[I 2025-02-16 23:42:55,480] Trial 4 finished with value: 0.8476000000000001 and parameters: {'dr_method': 'Isomap', 'n_components': 48, 'n_neighbors': 33}. Best is trial 0 with value: 0.8628.
[I 2025-02-16 23:44:23,382] Trial 5 finished with 

In [None]:
with open('results.json') as json_file:
  data = json.load(json_file)
data
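
Once the results are loaded, one simple way to compare the methods is to sort them by score. The sketch below is only an illustration: the helper `rank_methods` is hypothetical, and the key name `best_value` is an assumption, so pass a `score_key` that matches whatever key your `results.json` actually contains. The example entries are copied from the log above; Isomap is omitted because its final value is not shown in the captured output.

```python
def rank_methods(results, score_key="best_value"):
    """Sort per-method result dictionaries from highest to lowest score."""
    # Entries without the score key are skipped rather than raising KeyError.
    scored = [r for r in results if score_key in r]
    return sorted(scored, key=lambda r: r[score_key], reverse=True)

# Best parameters and values as reported in the log above (rounded).
example = [
    {"dr_method": "uMap", "n_components": 99, "n_neighbors": 20, "best_value": 0.9188},
    {"dr_method": "t-SNE", "perplexity": 27, "n_neighbors": 84, "best_value": 0.5352},
    {"dr_method": "PCA", "n_components": 60, "best_value": 0.6102},
    {"dr_method": "LLE", "n_components": 45, "n_neighbors": 12, "best_value": 0.9368},
]

for entry in rank_methods(example):
    print(f"{entry['dr_method']:>6}: {entry['best_value']}")
```

With the loaded file, the call would be `rank_methods(data)`.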