# Carnegie-Irvine Galaxy Survey Image Classifier Project

### Goal: Build a binary image classifier that can separate images into either normal or barred spiral classes.

## By Kyle K.M. Kabasares

<figure>
    <img src="https://cgs.obs.carnegiescience.edu/CGS/data/images/NGC1433_clean_color.jpg" width="500" /> 
</figure> 

(Image of NGC 1433 from the Carnegie-Irvine Galaxy Survey)

Galaxies are among the most magnificent celestial objects in our universe. From giant ellipticals to beautiful grand design spirals, to even the most irregular of dwarfs, galaxies come in a variety of flavors and have been classified into several different types and subtypes for centuries. Nowadays, large galaxy surveys such as the [Sloan Digital Sky Survey](https://en.wikipedia.org/wiki/Sloan_Digital_Sky_Survey) are used to understand a multitude of galaxy properties across a broad range of galaxy types. One such survey, which I am going to study in this project, is the [Carnegie-Irvine Galaxy Survey](https://cgs.obs.carnegiescience.edu/CGS/Home.html). This survey's goal was to obtain high-quality optical and near-infrared imaging of several hundred galaxies in the Southern Hemisphere with Las Campanas Observatory's 2.5-meter Du Pont Telescope.

I've chosen this dataset for a number of reasons:

1. The data has already been cleaned and preprocessed, and the properties of each galaxy are already in well-organized tables.

2. Many galaxy images have been star-cleaned, meaning that the physical features of a given galaxy are emphasized and more readily apparent for a machine-learning algorithm to observe and use to differentiate between the two classes.

In [1]:
import requests
from bs4 import BeautifulSoup
from pprint import pprint
import numpy as np 
import os
import skimage
from skimage.io import imread
from skimage.transform import resize
import csv

The first step in this project is to identify which galaxies in the sample are classified as "Normal Spirals" and which are "Barred Spirals". The CGS survey has over 600 galaxies in the sample, but only a subset of them falls cleanly into one of these two galaxy types.

After we identify which galaxies are of the two binary classes that the machine learning algorithm will try to differentiate, we then need to download each star-cleaned .jpg image into two separate directories to train the model on.

In [4]:
# Make a request to the Carnegie Irvine Galaxy Survey Table WebPage
url = 'https://cgs.obs.carnegiescience.edu/CGS/database_tables/sample1.html'
response = requests.get(url)

In [5]:
# Examine the contents of the page to get a better idea of where we can extract the names 
print(response.text)

<html><head><title>The Carnegie-Irvine Galaxy Survey (CGS)</title></head>
<body bgcolor="white" text="003300" link="993333" alink="993333" vlink="0000CC">
<font face="Helvetica Neue">

<h1 align="center"> The Carnegie-Irvine Galaxy Survey (CGS)</h1>
<hr>

<h2 align="center"> The Sample and Basic Properties </h2>

<p>

<table border="2" cellspacing="0" cellpadding="5" width="1050" align="center">

<thead>
<tr>
   <th colspan="1"><font size="+1">Index  <br><br></font> (1)</th>
   <th colspan="1"><font size="+1">Name   <br><br></font> (2)</th>
   <th colspan="1"><font size="+1"><math>&alpha;</math>(J2000) <br>(<i>h m s</i>)</font></font> <br>(3)</th>
   <th colspan="1"><font size="+1"><math>&beta;</math>(J2000) <br>(&deg; &prime; &Prime;)</font> <br>(4)</th>
   <th colspan="1"><font size="+1"><math> <msub> <mi>B</mi> <mn><i>T</i></mn> </msub> <br>(mag)</font> <br>(5)</th>
   <th colspan="1"><font size="+1"><math> <mi>T</mi> </math> <br>Type </font> <br>(6)</th>
   <th colspan="1"><font si

I see that each galaxy's name appears consistently inside a link in the format objname=NAME& (where NAME is the galaxy name). I will use BeautifulSoup to find those links and then extract the galaxy name from between 'objname=' and the '&' symbol.

In [6]:
response = requests.get(url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Find all links with objname= in their href attribute
links = soup.find_all('a', href=lambda href: href and 'objname=' in href)

# Extract the object names from the links, replace '-' with '_' and remove 'G's if the object name contains 'ESO'
objects = []
for link in links:
    objname = link['href'].split('objname=')[1].split('&')[0].replace('-', '_')
    if 'ESO' in objname:
        objname = objname.replace('G', '')
    objects.append(objname)
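
As an alternative to splitting the string by hand, the same extraction can be sketched with the standard library's URL parser, which reads the `objname` query parameter directly. This is only a minimal sketch: the helper name `extract_objname` and the example href are hypothetical, and the real links on the CGS page may be laid out differently.

```python
from urllib.parse import parse_qs, urlsplit

def extract_objname(href):
    """Return the cleaned objname query parameter of a link, or None if absent."""
    query = parse_qs(urlsplit(href).query)
    values = query.get('objname')
    if not values:
        return None
    objname = values[0].replace('-', '_')
    # Mirror the notebook's rule: strip every 'G' from ESO names
    if 'ESO' in objname:
        objname = objname.replace('G', '')
    return objname

# Hypothetical href, for illustration only
example_href = 'object.html?objname=ESO009-G010&band=B'
print(extract_objname(example_href))  # ESO009_010
```

Parsing the query string this way does not depend on where `objname` sits among the other parameters, and a link without that parameter returns `None` instead of raising an `IndexError`.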

In [7]:
# objects2 refers to the full list of objects (same list, not a copy)
objects2 = objects

In [8]:
objects2

['ESO009_010',
 'ESO027_001',
 'ESO027_008',
 'ESO056_115',
 'ESO060_019',
 'ESO091_003',
 'ESO097_013',
 'ESO121_006',
 'ESO121_026',
 'ESO136_012',
 'ESO137_018',
 'ESO137_034',
 'ESO137_038',
 'ESO138_005',
 'ESO138_010',
 'ESO138_029',
 'ESO183_030',
 'ESO185_054',
 'ESO186_062',
 'ESO208_021',
 'ESO209_009',
 'ESO213_011',
 'ESO219_021',
 'ESO221_026',
 'ESO221_032',
 'ESO265_007',
 'ESO269_057',
 'ESO270_017',
 'ESO271_010',
 'ESO273_014',
 'ESO274_001',
 'ESO311_012',
 'ESO320_026',
 'ESO321_025',
 'ESO351_030',
 'ESO356_004',
 'ESO358_063',
 'ESO362_011',
 'ESO373_008',
 'ESO380_001',
 'ESO380_006',
 'ESO383_087',
 'ESO384_002',
 'ESO436_027',
 'ESO440_011',
 'ESO442_026',
 'ESO445_089',
 'ESO479_004',
 'ESO494_026',
 'ESO495_021',
 'ESO506_004',
 'ESO507_025',
 'ESO556_015',
 'ESO582_012',
 'IC438',
 'IC764',
 'IC1459',
 'IC1633',
 'IC1953',
 'IC1954',
 'IC1993',
 'IC2000',
 'IC2006',
 'IC2035',
 'IC2051',
 'IC2056',
 'IC2150',
 'IC2163',
 'IC2311',
 'IC2367',
 'IC2469',
 'IC2

Next, I import a CSV file that contains the galaxy classification (type). These will serve as the labels for the machine-learning algorithm to learn from.

In [96]:
filename = 'CGS_Data_Table_1_Paper1.csv'
names = []
leda_type = []
with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
    
    # Get the name and galaxy type from each row of the file
    for row in reader:
        name = row[1]
        galaxytype = row[2]
        names.append(name)
        leda_type.append(galaxytype)

In [145]:
# Show the first set of galaxy names and galaxy-types 
names[0:10],leda_type[0:10]

(['ESO 027-G001',
  'ESO 027-G008',
  'ESO 056-G115',
  'ESO 060-G019',
  'ESO 091-G003',
  'ESO 097-G013',
  'ESO 121-G006',
  'ESO 121-G026',
  'ESO 136-G012',
  'ESO 137-G018'],
 ['Sbc', 'SBc', 'SBc', 'SBm', 'SBcd', 'Sab', 'Sb', 'Sc', 'SBbc', 'SBc'])

In [98]:
# The first galaxy's entry is missing, most likely because next(reader) above
# consumed the first data row as if it were a header row,
# so we will manually insert the first galaxy type
leda_type.insert(0,'Sbc')

In [144]:
# Display the first 10 leda-types 
leda_type[0:10]

['Sbc', 'SBc', 'SBc', 'SBm', 'SBcd', 'Sab', 'Sb', 'Sc', 'SBbc', 'SBc']
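
A likely explanation for the missing first entry, assuming the CSV file has no header line, is that `next(reader)` discards whatever the first line contains. The sketch below reproduces the effect on a small in-memory table; the three rows and the helper `read_types` are illustrative stand-ins, not the contents of the real file.

```python
import csv
import io

# Hypothetical three-row table with no header line (illustrative values)
sample = "1,ESO 009-G010,Sbc\n2,ESO 027-G001,SBc\n3,ESO 027-G008,SBc\n"

def read_types(text, has_header):
    """Read the type column (index 2), optionally skipping a header line."""
    reader = csv.reader(io.StringIO(text))
    if has_header:
        next(reader)  # discards the first line, whatever it contains
    return [row[2] for row in reader]

print(read_types(sample, has_header=True))   # ['SBc', 'SBc']
print(read_types(sample, has_header=False))  # ['Sbc', 'SBc', 'SBc']
```

If this is indeed the cause, skipping the `next(reader)` call would keep both the name and the type of the first galaxy, so that the two lists stay the same length.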

I have the object names stored in 'objects2' and the LEDA types stored in 'leda_type'. I want to separate the galaxies into normal spirals and barred spirals. There are also galaxies classified as Intermediate Spirals (SAB galaxies), a type that falls between traditional normal spirals and barred spirals, and they will not be included in the list. I want the image classifier to first learn the properties of clearly identified normal spirals and clearly identified barred spirals, and only then see whether it can apply what it has learned from these two classes to SAB galaxies.

In [131]:
# Sort the galaxies into the two classes by their morphological type
# Normal Spirals: Sa, Sab, Sb, Sbc, Sc, Scd, Sd, Sm
# Barred Spirals: SBa, SBab, SBb, SBbc, SBc, SBcd, SBd, SBm

galaxy_and_type = []
n = len(leda_type)
# Loop through objects2 and leda_type together, index by index
for i in range(0,n):
    if (leda_type[i]  in ['Sa', 'Sab', 'Sb', 'Sbc', 'Sc', 'Scd', 'Sd', 'Sm']):
        galaxy_and_type.append([objects2[i],'Normal Spiral'])
    elif (leda_type[i] in ['SBa', 'SBab', 'SBb', 'SBbc', 'SBc', 'SBcd', 'SBd', 'SBm']):
        galaxy_and_type.append([objects2[i],'Barred Spiral'])
    else:
        # Any other type (for example SAB) is skipped
        pass
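
The same selection can be sketched more compactly with sets and `zip`. The names `NORMAL_TYPES`, `BARRED_TYPES`, `label_for`, and `pair_with_labels` are hypothetical helpers, and the short input lists are for illustration only (the 'SABb' entry is made up to show that intermediate types are dropped).

```python
NORMAL_TYPES = {'Sa', 'Sab', 'Sb', 'Sbc', 'Sc', 'Scd', 'Sd', 'Sm'}
BARRED_TYPES = {'SBa', 'SBab', 'SBb', 'SBbc', 'SBc', 'SBcd', 'SBd', 'SBm'}

def label_for(galaxy_type):
    """Map a morphological type to a class label, or None if it is neither class."""
    if galaxy_type in NORMAL_TYPES:
        return 'Normal Spiral'
    if galaxy_type in BARRED_TYPES:
        return 'Barred Spiral'
    return None

def pair_with_labels(names, types):
    """Pair each name with its label, skipping galaxies outside the two classes."""
    pairs = []
    for name, galaxy_type in zip(names, types):
        label = label_for(galaxy_type)
        if label is not None:
            pairs.append([name, label])
    return pairs

# Illustrative input; 'EXAMPLE_SAB' and its type are made up
demo_names = ['ESO009_010', 'ESO027_001', 'EXAMPLE_SAB', 'ESO027_008']
demo_types = ['Sbc', 'SBc', 'SABb', 'SBc']
print(pair_with_labels(demo_names, demo_types))
```

Note that `zip` stops at the shorter of the two lists, so this version silently ignores trailing entries if the name and type lists differ in length; checking `len(names) == len(types)` beforehand would make such a mismatch visible.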

In [139]:
galaxy_and_type[0][0]

'ESO009_010'

Now, with the galaxy names and labels stored in this list, I use these two pieces of information to download each star-cleaned galaxy .jpg image from the CGS database into one of two directories: one for normal spirals and one for barred spirals.

In [154]:
# The first index is the element in the list and the second represents either 
# 0: Galaxy Name
# 1: Galaxy Type
galaxy_and_type[0][0],galaxy_and_type[0][1]

('ESO009_010', 'Normal Spiral')

In [159]:
# Count how many normal spirals and barred spirals there are with a loop
normal_counter = 0
barred_counter = 0

for i in range(0,len(galaxy_and_type)):
    if galaxy_and_type[i][1] == 'Normal Spiral':
        normal_counter += 1
    elif galaxy_and_type[i][1] == 'Barred Spiral':
        barred_counter += 1
        
print("The number of normal spirals is",normal_counter)
print("The number of barred spirals is", barred_counter)

The number of normal spirals is 201
The number of barred spirals is 104
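
For reference, the same tally can be sketched with `collections.Counter`. The list `sample_pairs` below is a hypothetical stand-in for `galaxy_and_type`; the ratio at the end uses the two counts printed above.

```python
from collections import Counter

# Hypothetical stand-in for galaxy_and_type: [name, label] pairs
sample_pairs = [
    ['A', 'Normal Spiral'],
    ['B', 'Barred Spiral'],
    ['C', 'Normal Spiral'],
]

# Count how often each label occurs
counts = Counter(label for _, label in sample_pairs)
print(counts['Normal Spiral'], counts['Barred Spiral'])  # 2 1

# Ratio of normal to barred spirals, using the counts reported above
ratio = 201 / 104
print(round(ratio, 2))  # 1.93
```

A ratio close to 2 means the normal spirals outnumber the barred spirals by almost two to one, which is worth keeping in mind when evaluating the classifier.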


Finally, we use a for-loop to iterate through the CGS webpages and download the star-cleaned images for each galaxy into different directories.

In [167]:
# Iterate over each object name, construct the image URL, and download the image
# Create a new folder called "Normal_Spirals" and "Barred_Spirals" to store the images
if not os.path.exists("Normal_Spirals"):
    os.makedirs("Normal_Spirals")

if not os.path.exists("Barred_Spirals"):
    os.makedirs("Barred_Spirals")

# Use the loop to download the images into the different directories 
for k in range(0,len(galaxy_and_type)):
    if galaxy_and_type[k][1] == 'Normal Spiral':
        image_url = f"https://cgs.obs.carnegiescience.edu/CGS/data/images/{galaxy_and_type[k][0]}_clean_color.jpg"
        response = requests.get(image_url)
        with open(f"Normal_Spirals/{galaxy_and_type[k][0]}.jpg", "wb") as f:
            f.write(response.content)
            
    # Barred Spirals
    elif galaxy_and_type[k][1] == 'Barred Spiral':
        image_url = f"https://cgs.obs.carnegiescience.edu/CGS/data/images/{galaxy_and_type[k][0]}_clean_color.jpg"
        response = requests.get(image_url)
        with open(f"Barred_Spirals/{galaxy_and_type[k][0]}.jpg", "wb") as f:
            f.write(response.content)
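
Because the loop above writes `response.content` to disk without checking whether the request succeeded, a failed request can leave behind a .jpg file that is not actually an image. One way to spot such files, sketched here with a hypothetical helper `find_non_jpeg_files`, is to check whether each file starts with the bytes FF D8 FF that open a JPEG file.

```python
from pathlib import Path

# A JPEG file opens with the start-of-image marker FF D8, followed by FF
JPEG_MAGIC = b'\xff\xd8\xff'

def find_non_jpeg_files(directory):
    """Return the sorted names of .jpg files that do not start with a JPEG header."""
    bad = []
    for path in sorted(Path(directory).glob('*.jpg')):
        with open(path, 'rb') as f:
            if f.read(3) != JPEG_MAGIC:
                bad.append(path.name)
    return bad
```

Calling `find_non_jpeg_files("Normal_Spirals")` and `find_non_jpeg_files("Barred_Spirals")` would list the files that need a second look. This only inspects the header, so a truncated image could still pass the check.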

With these two directories now populated, I notice that there are nearly twice as many normal galaxies in our sample as there are barred galaxies, which leads to a class imbalance. Additionally, I noticed that not all of the images downloaded cleanly: some .jpg files are corrupted, which probably means that some of the constructed URLs were incorrect because the format of the names did not match the one used on the CGS website. The affected galaxies are:

**Barred Galaxies**
1. **ESO 056-G115** (the Large Magellanic Cloud) is a satellite galaxy of the Milky Way rather than a prototypical spiral galaxy, and its star-cleaned image shows as much. Given this, I will exclude it from the dataset.

2. **ESO 209-G009** Manually downloaded and added to the dataset.

3. **IC 4946**  Manually downloaded and added to the dataset.

4. **NGC 625** Manually downloaded and added to the dataset.

5. **PGC 14879** Manually downloaded and added to the dataset. (I believe that it is because the 14879 is preceded by a 0 in the URL)

**Normal Galaxies**

1. **ESO 097-G013** This is the Circinus galaxy and, interestingly, it was an object that the Hubble Space Telescope viewed on my 4th birthday. Upon inspecting the CGS website, it appears there is no star-cleaned image for it (might have to ask my PhD advisor about this).

2. **ESO 138-G010** Unlike some of the other ESO galaxy URL links, this one actually includes the "G" in the link, which is why my script didn't catch this one. I had seen that the first set of ESO galaxies typically did not include the "G" in the URL link. Weird. I went ahead and manually downloaded it into the directory.

3. **ESO 380-G006** Same issue as the previous galaxy.

4. **NGC 24** Apparently the URL link has '0024' in it.

5. **NGC 255** Similar issue as NGC 24 where the URL has "0255".

6. **NGC 578** Similar issue to NGC 255.

7. **NGC 895** Similar issue to NGC 255.

8. **NGC 986** Similar issue to NGC 255.

9. **NGC 988** Similar issue to NGC 255.
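
The naming mismatches listed above suggest that a script could try a few spellings of each name before giving up. The sketch below is only a guess at such variants, based on the cases I ran into: NGC numbers zero-padded to four digits, and ESO names with the "G" kept. The helper `candidate_names` is hypothetical, and the exact spelling the CGS website expects is an assumption that would need to be checked against the real URLs.

```python
import re

def candidate_names(objname):
    """Return the original name followed by alternative spellings to try."""
    candidates = [objname]

    # NGC numbers with fewer than four digits: try zero-padding, e.g. NGC24 -> NGC0024
    ngc = re.fullmatch(r'NGC(\d{1,3})', objname)
    if ngc:
        candidates.append(f"NGC{int(ngc.group(1)):04d}")

    # ESO names: try the form that keeps the 'G', e.g. ESO138_010 -> ESO138_G010
    eso = re.fullmatch(r'ESO(\d+)_(\d+)', objname)
    if eso:
        candidates.append(f"ESO{eso.group(1)}_G{eso.group(2)}")

    return candidates

print(candidate_names('NGC24'))       # ['NGC24', 'NGC0024']
print(candidate_names('ESO138_010'))  # ['ESO138_010', 'ESO138_G010']
print(candidate_names('NGC1433'))     # ['NGC1433']
```

Each candidate could then be substituted into the image URL in turn until one request returns a valid image.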

With these missing .jpg files now included, we are almost ready to proceed with the training. However, we still need to extract the pixel values from these images and then assign the correct label to each one, in a format that we can use to train an algorithm in Python.
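
As a first step toward that format, the labels can be read off from the directory each image sits in. The sketch below pairs every .jpg path with an integer label; the helper `list_labeled_images` and the encoding (0 for normal, 1 for barred) are assumptions made for illustration, not a fixed choice.

```python
from pathlib import Path

# Assumed integer encoding of the two classes, keyed by directory name
LABELS = {'Normal_Spirals': 0, 'Barred_Spirals': 1}

def list_labeled_images(root='.'):
    """Return (path, label) pairs for every .jpg file in the two class directories."""
    pairs = []
    for folder, label in LABELS.items():
        directory = Path(root, folder)
        if not directory.is_dir():
            continue  # skip a class directory that does not exist
        for path in sorted(directory.glob('*.jpg')):
            pairs.append((str(path), label))
    return pairs
```

Each path in the returned list could then be passed to an image reader such as `imread` to obtain the pixel values, with the paired integer serving as the training label.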