# Galaxy Zoo Dataset Classification using Convolutional Neural Networks
## Authors: Nerea Losada and Iñigo Ortega

## Description of the project

In this notebook, the Galaxy Zoo dataset is used to classify galaxy images into their respective classes with a convolutional neural network (CNN).

Convolutional neural networks are among the most widely used deep neural network (DNN) architectures. They are particularly effective for computer vision tasks such as image classification, which makes them a good fit for our task.


Galaxy classification consists of predicting, for a given image, the probability that it belongs to a particular galaxy class (generally determined by its morphology).

## Objectives

The goal of the project is to design a convolutional network that outputs the probability that a given galaxy image belongs to each of the possible categories. This is a supervised classification problem. The dataset comes from the Kaggle Galaxy Zoo challenge.

However, the full dataset contains a very large number of images to train and test on, so only a subset of them is used. We use 10000 images, which we consider sufficient.


In order to do that, we will:

1) Preprocess the images

2) Design the network architecture and train it

3) Validate the network


## What is done in the notebook:

In this notebook we can find:

1) 

2) 

3) 

...

The database:

https://drive.google.com/open?id=15yza7bXOm0VF63zlbPJAe3_NA3nw8ZL8

## Importing the libraries
We start by importing all relevant libraries to be used in the notebook.

**pandas**: Used to retrieve the data from the CSV files provided by the team responsible for the Galaxy Zoo Challenge at Kaggle.

**tensorflow.keras**: This notebook uses **Keras**, specifically the version bundled with **TensorFlow**.

**matplotlib.pyplot**: For plotting.

**keras_preprocessing.***: Used for image manipulation, for preprocessing or demonstration purposes.

**os**, **random** and **shutil**: Used when reading files from the file system into Python.

In [1]:
%matplotlib inline
import pandas as pd
import os, random, shutil

import tensorflow as tf
import tensorflow.keras as keras
import keras_preprocessing
from keras_preprocessing import image
from keras_preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np
from PIL import Image



## 1. Preprocessing

The "Galaxy Zoo" dataset consists of a total of 141,553 images. These are split into 61,578 images for training, each with its respective probability distribution over the classifications, and 79,975 images for testing. As a crowd-sourced volunteer effort, the images of the dataset were classified across 11 different categories. Each category has attributes that volunteers can rank; there are 37 attributes in total.

The votes on these volunteer categorizations are normalized to a floating point number between 0 and 1 inclusive. A number close to 1 indicates that many users identified this category for the galaxy image with a high level of confidence, while numbers close to 0 indicate otherwise. These numbers represent the overall morphology of a galaxy in 37 attributes.

All images in the dataset are of size 424×424 and the object of interest is always centered. One way to reduce the dimensionality of the images is to crop them to 212×212, half their original size, and then downsample them by half again to 106×106, discarding unnecessary information in each image that could impair the network. Downsampling can help the CNN learn which regions are related to each specific class, and it also speeds up training. In this notebook, the images are instead resized directly to 150×150 (see section 1.3).
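
The crop-then-downsample scheme described above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the helper names `center_crop` and `downsample_by_2` are hypothetical, the input is a random stand-in array rather than a real galaxy image, and downsampling is done by averaging 2×2 pixel blocks, which is one of several possible choices.

```python
import numpy as np

def center_crop(img, size):
    """Return the central size x size window of an H x W x C array."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def downsample_by_2(img):
    """Halve height and width by averaging each 2x2 block of pixels."""
    h, w, c = img.shape
    blocks = img[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2, c)
    return blocks.mean(axis=(1, 3))

# Random stand-in for one 424x424 RGB galaxy image (not real data)
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(424, 424, 3), dtype=np.uint8)

cropped = center_crop(fake_image, 212)   # keep the centered object
small = downsample_by_2(cropped)         # halve the resolution again
print(cropped.shape, small.shape)
```

Because the galaxy is always centered, cropping the central window mostly removes background sky before the resolution is reduced.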

### 1.1 Reading the data

This variable defines the working directory where all the files of the dataset are stored.

In [2]:
base_path = r'.'

We load the training set's images from _images_training_rev1_ and their labels from the CSV file, both provided at Kaggle in the Galaxy Zoo Challenge (https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge).

In [3]:
training_solutions = os.path.join(base_path, 'training_solutions_rev1.csv')
training_images    = os.path.join(base_path, 'images_training_rev1')

Getting the values of the CSV file into a table.

In [4]:
df = pd.read_csv(training_solutions)
df.shape

(61578, 38)

Let us see the solutions for all the images.

In [5]:
df

Unnamed: 0,GalaxyID,Class1.1,Class1.2,Class1.3,Class2.1,Class2.2,Class3.1,Class3.2,Class4.1,Class4.2,...,Class9.3,Class10.1,Class10.2,Class10.3,Class11.1,Class11.2,Class11.3,Class11.4,Class11.5,Class11.6
0,100008,0.383147,0.616853,0.000000,0.000000,0.616853,0.038452,0.578401,0.418398,0.198455,...,0.000000,0.279952,0.138445,0.000000,0.000000,0.092886,0.000000,0.000000,0.0,0.325512
1,100023,0.327001,0.663777,0.009222,0.031178,0.632599,0.467370,0.165229,0.591328,0.041271,...,0.018764,0.000000,0.131378,0.459950,0.000000,0.591328,0.000000,0.000000,0.0,0.000000
2,100053,0.765717,0.177352,0.056931,0.000000,0.177352,0.000000,0.177352,0.000000,0.177352,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
3,100078,0.693377,0.238564,0.068059,0.000000,0.238564,0.109493,0.129071,0.189098,0.049466,...,0.000000,0.094549,0.000000,0.094549,0.189098,0.000000,0.000000,0.000000,0.0,0.000000
4,100090,0.933839,0.000000,0.066161,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61573,999948,0.510379,0.489621,0.000000,0.059207,0.430414,0.000000,0.430414,0.226257,0.204157,...,0.000000,0.226257,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.226257
61574,999950,0.901216,0.098784,0.000000,0.000000,0.098784,0.000000,0.098784,0.000000,0.098784,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
61575,999958,0.202841,0.777376,0.019783,0.116962,0.660414,0.067245,0.593168,0.140022,0.520391,...,0.000000,0.000000,0.090673,0.049349,0.000000,0.067726,0.000000,0.000000,0.0,0.072296
61576,999964,0.091000,0.909000,0.000000,0.045450,0.863550,0.022452,0.841098,0.795330,0.068220,...,0.000000,0.068398,0.318132,0.408799,0.227464,0.408799,0.090668,0.023065,0.0,0.045334
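
The values in the first row hint at how the labels are structured: the answers to the first question sum to 1, while the answers to later questions sum to the value of an earlier answer. The sketch below checks this with numbers copied from row 0 of the table. Which earlier answer each later question depends on is read off from these numbers, so treat that mapping as an assumption; `question_total` is a hypothetical helper.

```python
import math

# Values copied from row 0 (GalaxyID 100008) of the table above
row = {
    "Class1.1": 0.383147, "Class1.2": 0.616853, "Class1.3": 0.000000,
    "Class2.1": 0.000000, "Class2.2": 0.616853,
    "Class3.1": 0.038452, "Class3.2": 0.578401,
    "Class4.1": 0.418398, "Class4.2": 0.198455,
}

def question_total(row, question):
    """Sum all answers whose column name starts with e.g. 'Class2.'."""
    prefix = f"Class{question}."
    return sum(v for k, v in row.items() if k.startswith(prefix))

q1, q2, q3, q4 = (question_total(row, q) for q in (1, 2, 3, 4))

# The answers to question 1 sum to 1 for this galaxy
print(math.isclose(q1, 1.0, abs_tol=1e-6))
# Question 2 totals the value of Class1.2; questions 3 and 4 total Class2.2
print(math.isclose(q2, row["Class1.2"], abs_tol=1e-6))
print(math.isclose(q3, row["Class2.2"], abs_tol=1e-6))
print(math.isclose(q4, row["Class2.2"], abs_tol=1e-6))
```

One consequence is that the 37 values of a row, taken together, do not sum to 1.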


As we have said, there are lots of images, so we are going to work with a subset of them.

### 1.2 Reducing the amount of images

Since the full training set is very large, we keep only the first 10000 images, which we consider sufficient.

In [6]:
n_images = 10000
df_reduced = df[:n_images]
df_reduced

Unnamed: 0,GalaxyID,Class1.1,Class1.2,Class1.3,Class2.1,Class2.2,Class3.1,Class3.2,Class4.1,Class4.2,...,Class9.3,Class10.1,Class10.2,Class10.3,Class11.1,Class11.2,Class11.3,Class11.4,Class11.5,Class11.6
0,100008,0.383147,0.616853,0.000000,0.000000,0.616853,0.038452,0.578401,0.418398,0.198455,...,0.000000,0.279952,0.138445,0.000000,0.000000,0.092886,0.000000,0.000000,0.000000,0.325512
1,100023,0.327001,0.663777,0.009222,0.031178,0.632599,0.467370,0.165229,0.591328,0.041271,...,0.018764,0.000000,0.131378,0.459950,0.000000,0.591328,0.000000,0.000000,0.000000,0.000000
2,100053,0.765717,0.177352,0.056931,0.000000,0.177352,0.000000,0.177352,0.000000,0.177352,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,100078,0.693377,0.238564,0.068059,0.000000,0.238564,0.109493,0.129071,0.189098,0.049466,...,0.000000,0.094549,0.000000,0.094549,0.189098,0.000000,0.000000,0.000000,0.000000,0.000000
4,100090,0.933839,0.000000,0.066161,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,248461,0.083000,0.889000,0.028000,0.000000,0.889000,0.055118,0.833882,0.861441,0.027559,...,0.000000,0.305812,0.444504,0.111126,0.000000,0.027566,0.000000,0.222252,0.305812,0.305812
9996,248466,0.439049,0.527396,0.033555,0.000000,0.527396,0.000000,0.527396,0.298742,0.228654,...,0.000000,0.298742,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.298742
9997,248470,0.256421,0.737000,0.006579,0.038667,0.698333,0.000000,0.698333,0.217351,0.480982,...,0.000000,0.172379,0.044973,0.000000,0.000000,0.000000,0.077471,0.000000,0.000000,0.139880
9998,248471,0.796456,0.203544,0.000000,0.000000,0.203544,0.000000,0.203544,0.000000,0.203544,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


### 1.3 Resizing the images

Reducing the size of each image also makes the data more memory efficient, which is beneficial for training and for the size of the network.
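
To give a rough idea of the memory saving, the sketch below compares the size of 10000 images stored as a dense array at the original 424×424 resolution and at 150×150. It assumes 8-bit RGB pixels (1 byte per value); `array_bytes` is a hypothetical helper, not part of the notebook.

```python
def array_bytes(n_images, height, width, channels=3, bytes_per_value=1):
    """Size in bytes of a dense image array (uint8 means 1 byte per value)."""
    return n_images * height * width * channels * bytes_per_value

original = array_bytes(10000, 424, 424)
resized = array_bytes(10000, 150, 150)

print(f"424x424: {original / 1e9:.2f} GB")
print(f"150x150: {resized / 1e9:.3f} GB")
print(f"reduction factor: {original / resized:.2f}")
```

Under these assumptions the array shrinks from roughly 5.39 GB to 0.675 GB, a factor of about 8.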

In [7]:
# Initial shape of the images: 424x424

# Desired shape of input
shape_x, shape_y = 150, 150

# Number of channels of the images
channels = 3

In [8]:
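def resize_image(image_path, target_width=150, target_height=150):
    # Open the image file and resize it to the target size.
    # The parameter is called image_path so that it does not shadow
    # the keras_preprocessing `image` module imported above.
    with Image.open(image_path) as im:
        out = im.resize((target_width, target_height))

    # Return the pixels as a NumPy array
    return np.array(out)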

In [9]:
names = df_reduced['GalaxyID']
names = names.map(str)

In [10]:
# Create two lists
# x: list of 3-dimensional arrays of shape 150x150x3,
# containing the pixels of the images
# y: list of 1-dimensional arrays containing the GalaxyID
# followed by the 37 labels of each image
x, y = [], []
print("Reading files...")
for i in range(len(names)):
    image_path = os.path.join(training_images, names[i] + ".jpg")
    x.append(resize_image(image_path))
    y.append(df_reduced.iloc[i].values)

x = np.asarray(x)
y = np.asarray(y)
print("Finished")

Reading files...
Finished


In [11]:
# Drop the first column (GalaxyID) so that only the 37 labels remain
y = y[:,1:]

In [12]:
x.shape

(10000, 150, 150, 3)

### 1.4 Training and validation sets

In [13]:
n = x.shape[0]
n_samples = int(n*0.7)
X_train, X_val = (x[:n_samples], x[n_samples:])
y_train, y_val = (y[:n_samples], y[n_samples:])
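
The cell above takes the first 70% of the samples for training and the remaining 30% for validation, in file order. A shuffled variant, shown here only as an optional alternative, could look like the sketch below. The helper `shuffled_split`, its `seed` parameter, and the small demo arrays are all hypothetical.

```python
import numpy as np

def shuffled_split(x, y, train_fraction=0.7, seed=0):
    """Shuffle samples and labels together, then split them."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_train = int(len(x) * train_fraction)
    train_idx, val_idx = order[:n_train], order[n_train:]
    return x[train_idx], x[val_idx], y[train_idx], y[val_idx]

# Small stand-in arrays with the same layout as x and y (not real data)
x_demo = np.arange(10 * 4 * 4 * 3).reshape(10, 4, 4, 3)
y_demo = np.arange(10 * 37).reshape(10, 37)

X_tr, X_va, y_tr, y_va = shuffled_split(x_demo, y_demo)
print(X_tr.shape, X_va.shape, y_tr.shape, y_va.shape)
```

Indexing `x` and `y` with the same permutation keeps each image paired with its own labels.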

In [14]:
# Show an example image

im2 = Image.open("./images_training_rev1/281938.jpg")
size=(150,150)
out = im2.resize(size)
type(out)

PIL.Image.Image

## 2. Designing the network and training

In [15]:
keras.backend.clear_session()
model = tf.keras.models.Sequential([
    # first convolution layer, input is a 150x150 image with 3 color channels
    tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(150, 150, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    # second convolution layer
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    # third convolution layer
    tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    # fourth convolution layer
    tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    # flatten the image pixels
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),
    # 512 neuron fully connected hidden layer
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(37, activation='softmax')
])



In [16]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 148, 148, 64)      1792      
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 74, 74, 64)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 72, 72, 64)        36928     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 36, 36, 64)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 34, 34, 128)       73856     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 17, 17, 128)       0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 15, 15, 128)       147584    
__________
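
The output shapes and parameter counts listed in the summary can be reproduced by hand. The sketch below assumes the Keras defaults used in the model definition: unpadded ("valid") 3×3 convolutions with stride 1, and 2×2 max pooling. The helpers `conv2d_valid` and `max_pool` are hypothetical names for this illustration.

```python
def conv2d_valid(height, width, in_channels, filters, kernel=3):
    """Output shape and parameter count of a stride-1, unpadded Conv2D."""
    out_h, out_w = height - kernel + 1, width - kernel + 1
    params = (kernel * kernel * in_channels + 1) * filters  # +1 for the bias
    return (out_h, out_w, filters), params

def max_pool(height, width, channels, pool=2):
    """Output shape of a MaxPooling2D layer with pool size equal to stride."""
    return (height // pool, width // pool, channels)

shape = (150, 150, 3)
conv_params = []
for filters in (64, 64, 128, 128):
    shape, params = conv2d_valid(*shape, filters)
    conv_params.append(params)
    print("conv", shape, params)
    shape = max_pool(*shape)
    print("pool", shape)
```

Each convolution removes 2 pixels from the height and the width, and each pooling layer halves them (rounding down), which matches the layers listed in the summary.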

In [17]:
model.compile(loss=tf.keras.losses.categorical_crossentropy,
              optimizer='adam',
              metrics=['categorical_accuracy'])
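
One point worth checking with this setup is that a softmax output always sums to 1 across the 37 units, whereas the label vectors in this dataset do not. The NumPy sketch below illustrates this with the first five label values of row 0 from section 1.1; the logits are arbitrary made-up numbers, not outputs of the model.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    """Element-wise logistic function; each output is independent."""
    return 1.0 / (1.0 + np.exp(-z))

# First five label values of row 0 in the table of section 1.1
target = np.array([0.383147, 0.616853, 0.0, 0.0, 0.616853])

# Arbitrary example logits (hypothetical pre-activation network outputs)
logits = np.array([0.5, 1.2, -3.0, -3.0, 1.2])

print(target.sum())            # greater than 1
print(softmax(logits).sum())   # always 1 (up to rounding)
print(sigmoid(logits).sum())   # not constrained to 1
```

A possible alternative, which is not what this notebook does, is to use `activation='sigmoid'` in the last `Dense` layer together with a loss such as `'mean_squared_error'`, so that each of the 37 outputs can be fitted independently.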

In [18]:
y_train.shape

(7000, 37)

In [19]:
model.fit(X_train, y_train, epochs=79, shuffle=True, batch_size=32, verbose=1)

Instructions for updating:
Use tf.cast instead.
Epoch 1/79

KeyboardInterrupt: 

In [None]:
# model.save('galaxy-convnet-mlnn.h5')

In [None]:
# score = model.evaluate(X_val, y_val, batch_size=32)
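
The Kaggle challenge ranked submissions by root mean squared error (RMSE) over all predicted values, so it may be useful to compute it alongside the Keras metrics. A minimal sketch follows; the function `rmse` is a hypothetical helper and the two small arrays are made-up numbers, not results from this notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over all samples and all attributes."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Tiny made-up example with 2 samples and 3 attributes (not real results)
y_true_demo = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]])
y_pred_demo = np.array([[0.5, 0.0, 0.5], [0.0, 0.5, 0.5]])
print(rmse(y_true_demo, y_pred_demo))
```

With a trained model, `rmse(y_val, model.predict(X_val))` would give the corresponding value on the validation set.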