# Galaxy Zoo Dataset Classification using Convolutional Neural Networks
## Authors: Nerea Losada and Iñigo Ortega

## Description of the project

On this notebook, the Galaxy Zoo dataset is used with the aim of classifying each provided galaxy with its respective class, using CNN.

Convolutional neural networks are ones of the most employed DNN architectures. They are particularly efficient for computer vision tasks such as image classification, so it is efficient for our task.


Galaxy classification consists of, given an image, predict the probability that it belongs in a particular galaxy class (generally determined by its morphology).

## Objectives

The goal of the project is to design a convolutional network that outputs the probability that a given galaxy image belongs to one of the possible categories. This is a supervised classification problem. The dataset was used for one of the Kaggle challenges.

However, because of having a huge amount of images to train and test, the dataset was reduced in such a way that only a subset of the total amount of the images are used. This way, we used 10000, which we consider to be sufficient.


In order to do that, we will:

1) Preprocess the images

2) Design the network architecture and train it

3) Validate the network


## What is done in the notebook:

In this notebook we can find:

1) 

2) 

3) 

...

The database:

https://drive.google.com/open?id=15yza7bXOm0VF63zlbPJAe3_NA3nw8ZL8

## Importing the libraries
We start by importing all relevant libraries to be used in the notebook.

**pandas**: In order to retrive the data from the csv files provided by the team responsible of Galaxy Zoo Challenge at Kaggle.

**tensorflow.keras**: On this notebook **Keras** is used, especifically, the version bundled with **TensorFlow**.

**matplotlib.pyplot**: For plotting.

**keras_preprocessing.***: It is used for image manipulation, for preprocessing or demonstration porpouses.

**os**, **random** and **shutil**: They are used at the moment of reading files from the system into python.

In [2]:
%matplotlib inline
import pandas as pd
import os, random, shutil

import tensorflow as tf
import tensorflow.keras as keras
import keras_preprocessing
from keras_preprocessing import image
from keras_preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np

## 1. Preprocessing

The  “Galaxy  Zoo”  dataset  consists  of  a  total  141 553  images.   These  are  split  into  61578  images for training –each with their respective probability distributions for the classifications for each of the inputs– and 79975 images for testing. As a crowd-sourced volunteer effort, images of the dataset were classified across 11 different categories. Each  of  categories  have  attributes  which  volunteers  can  rank,  there  are  37  attributes  in  total.

The votes on these volunteer categorizations are normalized to a floating point number between 0 and 1 inclusive. A number close to 1 indicates many users identifiedthis  category  for  the  galaxy  image  with  a  high  level  of  confidence,  while  numbers  close  to  0  indicateotherwise. These numbers represent the overall morphology of a galaxy in 37 attributes.

All images in the dataset are of size 424×424 and the object of interest is always centered. In order to reduce the dimensionality of the images, during preprocessing images are cropped to 212×212, half their original size, images are also down sampled to half size again 106×106, discarding unnecessary information  in  each  image  which  could  impair  the  network.   Down  sampling  can  help  the  CNN  learn which regions are related to each specific expression as well as improve performance when training.

### 1.1 Reading the data

The variable defining where is the working directory storing all the files of the dataset.

In [3]:
base_path = r'.'

Importing into Python the training set's images from _images_training_rev1_ and their labels from the CSV file, both provided at Kaggle on the Galaxy Zoo Challenge (https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge).

In [4]:
training_solutions = os.path.join(base_path, 'training_solutions_rev1.csv')
training_images    = os.path.join(base_path, 'images_training_rev1')

Getting the values of the CSV file into a table.

In [5]:
df = pd.read_csv(training_solutions)
df.shape

(61578, 38)

Let us see the solutions for all the images.

In [6]:
df

Unnamed: 0,GalaxyID,Class1.1,Class1.2,Class1.3,Class2.1,Class2.2,Class3.1,Class3.2,Class4.1,Class4.2,...,Class9.3,Class10.1,Class10.2,Class10.3,Class11.1,Class11.2,Class11.3,Class11.4,Class11.5,Class11.6
0,100008,0.383147,0.616853,0.000000,0.000000,0.616853,0.038452,0.578401,0.418398,0.198455,...,0.000000,0.279952,0.138445,0.000000,0.000000,0.092886,0.000000,0.000000,0.0,0.325512
1,100023,0.327001,0.663777,0.009222,0.031178,0.632599,0.467370,0.165229,0.591328,0.041271,...,0.018764,0.000000,0.131378,0.459950,0.000000,0.591328,0.000000,0.000000,0.0,0.000000
2,100053,0.765717,0.177352,0.056931,0.000000,0.177352,0.000000,0.177352,0.000000,0.177352,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
3,100078,0.693377,0.238564,0.068059,0.000000,0.238564,0.109493,0.129071,0.189098,0.049466,...,0.000000,0.094549,0.000000,0.094549,0.189098,0.000000,0.000000,0.000000,0.0,0.000000
4,100090,0.933839,0.000000,0.066161,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61573,999948,0.510379,0.489621,0.000000,0.059207,0.430414,0.000000,0.430414,0.226257,0.204157,...,0.000000,0.226257,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.226257
61574,999950,0.901216,0.098784,0.000000,0.000000,0.098784,0.000000,0.098784,0.000000,0.098784,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
61575,999958,0.202841,0.777376,0.019783,0.116962,0.660414,0.067245,0.593168,0.140022,0.520391,...,0.000000,0.000000,0.090673,0.049349,0.000000,0.067726,0.000000,0.000000,0.0,0.072296
61576,999964,0.091000,0.909000,0.000000,0.045450,0.863550,0.022452,0.841098,0.795330,0.068220,...,0.000000,0.068398,0.318132,0.408799,0.227464,0.408799,0.090668,0.023065,0.0,0.045334


As we have said, there are lots of images, so we are going to work with a subset of them.

### 1.2 Reducing the amount of images

Since the amount of images is huge, we use 10000 of them, which are enough.

In [7]:
n_images = 10000
df_reduced = df[:n_images]
df_reduced

Unnamed: 0,GalaxyID,Class1.1,Class1.2,Class1.3,Class2.1,Class2.2,Class3.1,Class3.2,Class4.1,Class4.2,...,Class9.3,Class10.1,Class10.2,Class10.3,Class11.1,Class11.2,Class11.3,Class11.4,Class11.5,Class11.6
0,100008,0.383147,0.616853,0.000000,0.000000,0.616853,0.038452,0.578401,0.418398,0.198455,...,0.000000,0.279952,0.138445,0.000000,0.000000,0.092886,0.000000,0.000000,0.000000,0.325512
1,100023,0.327001,0.663777,0.009222,0.031178,0.632599,0.467370,0.165229,0.591328,0.041271,...,0.018764,0.000000,0.131378,0.459950,0.000000,0.591328,0.000000,0.000000,0.000000,0.000000
2,100053,0.765717,0.177352,0.056931,0.000000,0.177352,0.000000,0.177352,0.000000,0.177352,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,100078,0.693377,0.238564,0.068059,0.000000,0.238564,0.109493,0.129071,0.189098,0.049466,...,0.000000,0.094549,0.000000,0.094549,0.189098,0.000000,0.000000,0.000000,0.000000,0.000000
4,100090,0.933839,0.000000,0.066161,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,248461,0.083000,0.889000,0.028000,0.000000,0.889000,0.055118,0.833882,0.861441,0.027559,...,0.000000,0.305812,0.444504,0.111126,0.000000,0.027566,0.000000,0.222252,0.305812,0.305812
9996,248466,0.439049,0.527396,0.033555,0.000000,0.527396,0.000000,0.527396,0.298742,0.228654,...,0.000000,0.298742,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.298742
9997,248470,0.256421,0.737000,0.006579,0.038667,0.698333,0.000000,0.698333,0.217351,0.480982,...,0.000000,0.172379,0.044973,0.000000,0.000000,0.000000,0.077471,0.000000,0.000000,0.139880
9998,248471,0.796456,0.203544,0.000000,0.000000,0.203544,0.000000,0.203544,0.000000,0.203544,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


### 1.3 Resizing the images

Reducing  the  size  of  each  image  has  the  secondary  quality  of  being  more  memory  efficient  which  is beneficial for training and network size.

In [8]:
# Initial shape of the images: 424x424

# Desired shape of input
shape_x, shape_y = 212, 212

# Number of channels of the images
channels = 3

In [9]:
def resize_image(image, target_width=shape_x, target_height=shape_y, max_zoom=0.2):
    """Zooms and crops the image randomly for data augmentation."""

    # First, let's find the largest bounding box with the target size ratio that fits within the image
    height = image.shape[0]
    width = image.shape[1]
    image_ratio = width / height
    target_image_ratio = target_width / target_height
    crop_vertically = image_ratio < target_image_ratio
    crop_width = width if crop_vertically else int(height * target_image_ratio)
    crop_height = int(width / target_image_ratio) if crop_vertically else height

    # Now let's shrink this bounding box by a random factor (dividing the dimensions by a random number
    # between 1.0 and 1.0 + 'max_zoom'.
    resize_factor = np.random.rand() * max_zoom + 1.0
    crop_width = int(crop_width / resize_factor)
    crop_height = int(crop_height / resize_factor)

    # Next, we can select a random location on the image for this bounding box.
    x0 = np.random.randint(0, width - crop_width)
    y0 = np.random.randint(0, height - crop_height)
    x1 = x0 + crop_width
    y1 = y0 + crop_height

    # Let's crop the image using the random bounding box we built.
    image = image[y0:y1, x0:x1, :]

    # Let's also flip the image horizontally with 50% probability:
    if np.random.rand() < 0.5:
        image = np.fliplr(image)

    # Now, let's resize the image to the target dimensions.
    original_dim = (424, 424, 3)
    target_size = (target_width, target_height)
    input = keras.layers.Input(original_dim)
    x = tf.keras.layers.Lambda(lambda image: tf.image.resize(image, target_size))(input)

    # Finally, let's ensure that the colors are represented as
    # 32-bit floats ranging from 0.0 to 1.0 (for now):
    return image.astype(np.uint8)

In [10]:
names = df_reduced['GalaxyID']
names = names.map(str)

In [11]:
# Get the length of the dataset
total = len(df.index.values)

# Create two lists
# X: List formed by arrays of 3 dimensions of 212x212x3,
# containing the pixels of the images
# Y: List formed by arrays of 1 dimension of 37,
# containing the labaels of each image
x, y = [], []
i = 0
print("Reading files...")
for name in names:
    print(name)
    image = mpimg.imread(os.path.join(training_images, name + ".jpg"))[:, :, :channels]
    image = resize_image(image)
    x.append(image)
    y.append(df.loc[name].values)

x = np.asarray(x)
y = np.asarray(y)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

#esto no funciona :)

Reading files...
100008


AttributeError: module 'tensorflow.image' has no attribute 'resize'