<a href="https://colab.research.google.com/github/mtsizh/galaxy-morphology-manifold-learning/blob/main/curate_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The code below downloads the `gz_decals_volunteers_5.parquet` table and curates it. We keep only the galaxies for which a question received enough votes and the volunteers answered with high confidence.

The classes are defined as follows.

|class|conditions|
|---|---|
|has spiral arms|(`has-spiral-arms_yes_fraction` > 0.8) **and** (`has-spiral-arms_total-votes` > 15)|
|no spiral arms|(`has-spiral-arms_yes_fraction` < 0.2) **and** (`has-spiral-arms_total-votes` > 15)|
|||
|round|(`how-rounded_round_fraction` > 0.8) **and** (`how-rounded_total-votes` > 10)|
|inbetween|(`how-rounded_in-between_fraction` > 0.8) **and** (`how-rounded_total-votes` > 10)|
|cigar|(`how-rounded_cigar-shaped_fraction` > 0.8) **and** (`how-rounded_total-votes` > 10)|
|||
|edge on|(`disk-edge-on_yes_fraction` > 0.8) **and** (`disk-edge-on_total-votes` > 10)|
|edge off|(`disk-edge-on_no_fraction` > 0.8) **and** (`disk-edge-on_total-votes` > 10)|
|||
|smooth|(`smooth-or-featured_smooth_fraction` > 0.8) **and** (`smooth-or-featured_total-votes` > 10)|
|featured|(`smooth-or-featured_smooth_fraction` < 0.2) **and** (`smooth-or-featured_total-votes` > 10)|

Each galaxy may belong to several classes, but to at most one class from each group.
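As a minimal sketch of how the rules in the table turn into labels, the snippet below applies two of them (`round` and `edge on`) to a toy table. The vote values are made up for illustration and are not taken from the catalogue, and `label_rows` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

# Toy rows with made-up vote statistics (not real catalogue values).
toy = pd.DataFrame({
    'how-rounded_round_fraction': [0.9, 0.5, 0.95],
    'how-rounded_total-votes':    [20, 30, 5],
    'disk-edge-on_yes_fraction':  [0.1, 0.85, float('nan')],
    'disk-edge-on_total-votes':   [12, 11, 0],
})

# Each class is a list of (column, threshold) pairs; every threshold must be exceeded.
rules = {
    'round':   [('how-rounded_round_fraction', 0.8), ('how-rounded_total-votes', 10)],
    'edge on': [('disk-edge-on_yes_fraction', 0.8), ('disk-edge-on_total-votes', 10)],
}

def label_rows(df, rules):
    """Return one comma-separated label string per row (hypothetical helper)."""
    labels = [[] for _ in range(len(df))]
    for name, conds in rules.items():
        mask = pd.Series(True, index=df.index)
        for col, threshold in conds:
            mask &= df[col] > threshold  # a NaN value compares as False
        for pos in mask.to_numpy().nonzero()[0]:
            labels[pos].append(name)
    return [', '.join(row) for row in labels]

toy_labels = label_rows(toy, rules)
print(toy_labels)
```

The third row gets no label: its `round` fraction is high, but it has too few votes, and its missing edge-on fraction fails the comparison.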

In [2]:
#@title Download and curate table

from functools import partial
import operator
import numpy as np
import pandas as pd

print('DOWNLOADING TABLE')
!if [ ! -f "gz_decals_volunteers_5.parquet" ]; then wget https://zenodo.org/records/4573248/files/gz_decals_volunteers_5.parquet; fi

df = pd.read_parquet('gz_decals_volunteers_5.parquet')
working_columns = ['iauname',
                   'png_loc',
                   'smooth-or-featured_smooth_fraction',
                   'smooth-or-featured_total-votes',
                   'has-spiral-arms_yes_fraction',
                   'has-spiral-arms_total-votes',
                   'how-rounded_total-votes',
                   'how-rounded_round_fraction',
                   'how-rounded_in-between_fraction',
                   'how-rounded_cigar-shaped_fraction',
                   'disk-edge-on_total-votes',
                   'disk-edge-on_yes_fraction',
                   'disk-edge-on_no_fraction']


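df = df[working_columns].copy() # keep only the needed columns (copy avoids SettingWithCopyWarning)
df['class'] = "" # add class column

# lt(x) builds a test for `value < x`, gt(x) a test for `value > x`.
# Comparisons with NaN evaluate to False, so rows with missing values never match.
lt = lambda x : partial(operator.gt, x)
gt = lambda x : partial(operator.lt, x)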

conditions = {
    'has spiral arms': {'has-spiral-arms_yes_fraction': gt(0.8),
                        'has-spiral-arms_total-votes': gt(15)},
    'no spiral arms': {'has-spiral-arms_yes_fraction': lt(0.2),
                       'has-spiral-arms_total-votes': gt(15)},
    'round': {'how-rounded_round_fraction': gt(0.8),
              'how-rounded_total-votes': gt(10)},
    'inbetween': {'how-rounded_in-between_fraction': gt(0.8),
                  'how-rounded_total-votes': gt(10)},
    'cigar': {'how-rounded_cigar-shaped_fraction': gt(0.8),
              'how-rounded_total-votes': gt(10)},
    'edge on': {'disk-edge-on_yes_fraction': gt(0.8),
                'disk-edge-on_total-votes': gt(10)},
    'edge off': {'disk-edge-on_no_fraction': gt(0.8),
                 'disk-edge-on_total-votes': gt(10)},
    'smooth': {'smooth-or-featured_smooth_fraction': gt(0.8),
               'smooth-or-featured_total-votes': gt(10)},
    'featured': {'smooth-or-featured_smooth_fraction': lt(0.2),
                 'smooth-or-featured_total-votes': gt(10)}
}

for class_name, cond_rules in conditions.items():
  bool_arr = np.ones(len(df), dtype=bool)
  for col, condition_func in cond_rules.items():
    bool_arr &= condition_func(df[col])
  df.loc[bool_arr, 'class'] += f', {class_name}'

df['class'] = df['class'].str.lstrip(', ') # strip the leading ", " separator
df = df[df['class'] != ""] # remove unclassified

print('SAVING CURATED TABLE')
df.to_parquet('curated_dataset.parquet', index=False)

DOWNLOADING TABLE
--2025-02-25 12:10:16--  https://zenodo.org/records/4573248/files/gz_decals_volunteers_5.parquet
Resolving zenodo.org (zenodo.org)... 188.185.48.194, 188.185.45.92, 188.185.43.25, ...
Connecting to zenodo.org (zenodo.org)|188.185.48.194|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 40528939 (39M) [application/octet-stream]
Saving to: ‘gz_decals_volunteers_5.parquet’


2025-02-25 12:10:44 (1.45 MB/s) - ‘gz_decals_volunteers_5.parquet’ saved [40528939/40528939]

SAVING CURATED TABLE


The next cell downloads the images listed in the curated table and preprocesses them. Please note that the procedure is slow: the original dataset comes in 4 parts of roughly 20 to 30 GB each, and in the run logged below each part took between about 40 and 90 minutes to download. Each image is converted to grayscale and cropped to its central 120 x 120 pixels. The results are archived in case you want to download them (the archive of cropped images is approximately 400 MB).
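The central crop described above can be sketched on a synthetic image. This is only an illustration under the assumption that the input is a 424 x 424 RGB image, as in the original dataset; the uniform colour stands in for a real galaxy cutout.

```python
from PIL import Image

# Synthetic 424x424 RGB image standing in for one galaxy cutout.
img = Image.new('RGB', (424, 424), color=(10, 20, 30))

crop = 120
left = (img.width - crop) // 2
top = (img.height - crop) // 2

# convert('L') yields 8-bit grayscale; crop() takes a (left, top, right, bottom) box.
small = img.convert('L').crop((left, top, left + crop, top + crop))

print(small.size, small.mode)
```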

Two archives should be generated:

|Archive|Image size|Color|Used for|Expected size|Preprocessed version available|
|---|---|---|---|---|---|
|curated_imgs.zip|120x120|grayscale|dimensionality reduction|386 MB|✅|
|curated_imgs_large.zip|424x424|RGB|Petrosian radii calculation|15 GB|❌|

You can connect Google Drive and store the archives there for later use, which avoids repeating the whole download and curation process.
The preprocessed `curated_imgs.zip` archive may also be downloaded from our GitHub.
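If an archive was stored earlier, it can be unpacked instead of rerunning the download. The sketch below shows one way to do this with the standard library; `restore_from_archive` is a hypothetical helper, and the temporary directory and placeholder file exist only to make the example self-contained. In practice the archive path would point to wherever you saved `curated_imgs.zip`.

```python
import os
import tempfile
import zipfile

def restore_from_archive(archive_path, target_dir='.'):
    """Extract a previously saved archive; return False if it is missing."""
    if not os.path.isfile(archive_path):
        return False
    with zipfile.ZipFile(archive_path) as z:
        z.extractall(target_dir)
    return True

# Self-contained demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, 'curated_imgs.zip')
    missing = restore_from_archive(archive, tmp)  # no archive yet
    with zipfile.ZipFile(archive, 'w') as z:
        z.writestr('curated_imgs/example.png', b'placeholder bytes')
    restored = restore_from_archive(archive, tmp)
    extracted = os.path.isfile(os.path.join(tmp, 'curated_imgs', 'example.png'))

print(missing, restored, extracted)
```

A `False` return value signals that the archive is not there and the full download and curation cell still has to be run.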

In [3]:
#@title Download and curate images
import zipfile
import pandas as pd
import os
from tqdm.auto import tqdm
import pathlib

df = pd.read_parquet('curated_dataset.parquet')
all_files = df['png_loc'].tolist()
# drop the leading path component so the names can be matched against the zip entries
all_files = [str(pathlib.Path(*pathlib.Path(f).parts[1:])) for f in all_files]

save_to_dir = 'curated_imgs'
save_large_dir = 'curated_imgs_large'
# exist_ok=True makes a separate existence check unnecessary
os.makedirs(save_to_dir, exist_ok=True)
os.makedirs(save_large_dir, exist_ok=True)

parts_urls = [
    'https://zenodo.org/records/4573248/files/gz_decals_dr5_png_part1.zip',
    'https://zenodo.org/records/4573248/files/gz_decals_dr5_png_part2.zip',
    'https://zenodo.org/records/4573248/files/gz_decals_dr5_png_part3.zip',
    'https://zenodo.org/records/4573248/files/gz_decals_dr5_png_part4.zip'
]

for part_idx, url in enumerate(parts_urls):
  print(f'PART {part_idx+1} of {len(parts_urls)}')
  _, filename = os.path.split(url)
  if not os.path.isfile(filename):
    !wget {url}
  if not os.path.isfile(filename):
    print(f'ERROR LOADING PART {part_idx+1}')
    continue
  print(f'UNZIPPING {filename}')
  with zipfile.ZipFile(filename) as z:
    files_in_part = list(set(all_files) & set(z.namelist()))
    with tqdm(total=len(files_in_part)) as progress:
      for filename1 in files_in_part:
        full_path = os.path.join(save_large_dir, filename1)
        path = pathlib.Path(full_path)
        path.parent.mkdir(parents=True, exist_ok=True)
        with open(full_path, 'wb') as f:
          f.write(z.read(filename1))
        progress.update()
  print('FILES EXTRACTED')
  !rm {filename}

from PIL import Image

# Define the cropping function
def crop_center(image, crop_width, crop_height):
    width, height = image.size
    left = (width - crop_width) // 2
    top = (height - crop_height) // 2
    right = left + crop_width
    bottom = top + crop_height
    return image.crop((left, top, right, bottom))

def count_files(folder):
    count = 0
    for root, _, files in os.walk(folder):
        count += len(files)
    return count

print('CROPPING IMAGES to 120x120')
with tqdm(total=count_files(save_large_dir)) as progress:
  for root, _, files in os.walk(save_large_dir):
      for file in files:
          file_path = os.path.join(root, file)
          relative_path = os.path.relpath(file_path, save_large_dir)
          save_to = os.path.join(save_to_dir, relative_path)
          os.makedirs(os.path.dirname(save_to), exist_ok=True)
          with Image.open(file_path) as img:
            gray_img = img.convert("L")
            cropped_img = crop_center(gray_img, 120, 120)
            cropped_img.save(save_to)
            progress.update()

print('ZIPPING just in case')
!zip -q -r curated_imgs.zip curated_imgs curated_dataset.parquet && echo "[ZIPPED] cropped images" || echo "[ERROR] zipping cropped images"
!zip -q -r curated_imgs_large.zip curated_imgs_large curated_dataset.parquet && echo "[ZIPPED] large images" || echo "[ERROR] zipping large images"


print('PROCESSING COMPLETE!')

PART 1 of 4
--2025-02-25 12:11:22--  https://zenodo.org/records/4573248/files/gz_decals_dr5_png_part1.zip
Resolving zenodo.org (zenodo.org)... 188.185.43.25, 188.185.48.194, 188.185.45.92, ...
Connecting to zenodo.org (zenodo.org)|188.185.43.25|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 27789905885 (26G) [application/octet-stream]
Saving to: ‘gz_decals_dr5_png_part1.zip’


2025-02-25 13:38:07 (5.09 MB/s) - ‘gz_decals_dr5_png_part1.zip’ saved [27789905885/27789905885]

UNZIPPING gz_decals_dr5_png_part1.zip


  0%|          | 0/17675 [00:00<?, ?it/s]

FILES EXTRACTED
PART 2 of 4
--2025-02-25 13:40:11--  https://zenodo.org/records/4573248/files/gz_decals_dr5_png_part2.zip
Resolving zenodo.org (zenodo.org)... 188.185.48.194, 188.185.45.92, 188.185.43.25, ...
Connecting to zenodo.org (zenodo.org)|188.185.48.194|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22281798528 (21G) [application/octet-stream]
Saving to: ‘gz_decals_dr5_png_part2.zip’


2025-02-25 14:54:11 (4.79 MB/s) - ‘gz_decals_dr5_png_part2.zip’ saved [22281798528/22281798528]

UNZIPPING gz_decals_dr5_png_part2.zip


  0%|          | 0/14644 [00:00<?, ?it/s]

FILES EXTRACTED
PART 3 of 4
--2025-02-25 14:55:49--  https://zenodo.org/records/4573248/files/gz_decals_dr5_png_part3.zip
Resolving zenodo.org (zenodo.org)... 188.185.45.92, 188.185.48.194, 188.185.43.25, ...
Connecting to zenodo.org (zenodo.org)|188.185.45.92|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32510336192 (30G) [application/octet-stream]
Saving to: ‘gz_decals_dr5_png_part3.zip’


2025-02-25 16:01:25 (7.88 MB/s) - ‘gz_decals_dr5_png_part3.zip’ saved [32510336192/32510336192]

UNZIPPING gz_decals_dr5_png_part3.zip


  0%|          | 0/17418 [00:00<?, ?it/s]

FILES EXTRACTED
PART 4 of 4
--2025-02-25 16:03:41--  https://zenodo.org/records/4573248/files/gz_decals_dr5_png_part4.zip
Resolving zenodo.org (zenodo.org)... 188.185.48.194, 188.185.45.92, 188.185.43.25, ...
Connecting to zenodo.org (zenodo.org)|188.185.48.194|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21155375722 (20G) [application/octet-stream]
Saving to: ‘gz_decals_dr5_png_part4.zip’


2025-02-25 16:41:50 (8.82 MB/s) - ‘gz_decals_dr5_png_part4.zip’ saved [21155375722/21155375722]

UNZIPPING gz_decals_dr5_png_part4.zip


  0%|          | 0/5400 [00:00<?, ?it/s]

FILES EXTRACTED
CROPPING IMAGES to 120x120


  0%|          | 0/55137 [00:00<?, ?it/s]

ZIPPING just in case
[ZIPPED] cropped images
[ZIPPED] large images
PROCESSING COMPLETE!


In [None]:
# Optional: copy the archives to Google Drive (Drive must already be mounted at ./drive)
!cp curated_imgs.zip ./drive/MyDrive/
!cp curated_imgs_large.zip ./drive/MyDrive/

Alternatively, you can download the archives directly to your machine.

In [None]:
from google.colab import files

files.download('curated_imgs.zip')
files.download('curated_imgs_large.zip')