# Galaxy Zoo Dataset Classification using Convolutional Neural Networks
## Authors: Nerea Losada and Iñigo Ortega

## Description of the project

In this notebook, we use the Galaxy Zoo dataset to assign each provided galaxy image to its class with a convolutional neural network (CNN).

Convolutional neural networks are among the most widely used deep neural network (DNN) architectures. They are particularly effective for computer vision tasks such as image classification, which makes them well suited to our task.


Galaxy classification consists of predicting, for a given image, the probability that it belongs to a particular galaxy class (generally determined by its morphology).

## Objectives

The goal of the project is to design a convolutional network that outputs the probability that a given galaxy image belongs to one of the possible categories. This is a supervised classification problem. The dataset was used for one of the Kaggle challenges.
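
To make "outputs the probability" concrete, here is a minimal sketch of how raw network scores are commonly turned into class probabilities with a softmax. It assumes a softmax output layer, which this section does not confirm; the `softmax` helper and the example scores are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the maximum score for numerical stability.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three galaxy classes (not real model output).
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

# The predicted class is the position of the largest probability.
predicted = max(range(len(probs)), key=probs.__getitem__)
```

The largest score always receives the largest probability, so picking the most probable class is the same as picking the highest raw score.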

However, since the number of images available for training and testing is very large, we reduced the dataset and use only a subset of 10000 images, which we consider sufficient.


In order to do that, we will:

1) Preprocess the images

2) Design the network architecture and train it

3) Validate the network


## What is done in the notebook:

In this notebook we can find:

1) 

2) 

3) 

...

The database:

https://drive.google.com/open?id=15yza7bXOm0VF63zlbPJAe3_NA3nw8ZL8

## Importing the libraries
We start by importing all relevant libraries to be used in the notebook.

**pandas**: Used to retrieve the data from the CSV files provided by the team responsible for the Galaxy Zoo Challenge at Kaggle.

**tensorflow.keras**: This notebook uses **Keras**, specifically the version bundled with **TensorFlow**.

**matplotlib.pyplot**: For plotting.

**keras_preprocessing.***: Used for image manipulation, for preprocessing or demonstration purposes.

**os**, **random** and **shutil**: Used when reading files from the file system into Python.

In [7]:
%matplotlib inline
import pandas as pd
import os, random, shutil

import tensorflow.keras as keras
import keras_preprocessing
from keras_preprocessing import image
from keras_preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt
import numpy as np

## 1. Preprocessing

### 1.1 Reading the data

The variable below defines the working directory that stores all the files of the dataset.

In [8]:
base_path = r'.'

We define the paths to the training set's images in _images_training_rev1_ and to their labels in the CSV file, both provided at Kaggle for the Galaxy Zoo Challenge (https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge).

In [9]:
training_solutions = os.path.join(base_path, 'training_solutions_rev1.csv')
training_images    = os.path.join(base_path, 'images_training_rev1')
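
As a minimal sketch of how a row of the CSV can be linked to its image file, the helper below builds an image path from a `GalaxyID`. It assumes that each image is stored as `<GalaxyID>.jpg` inside the images directory; the `image_path` helper is hypothetical and not part of the notebook.

```python
import os

def image_path(images_dir, galaxy_id):
    """Build the path of one galaxy image (assumed naming: <GalaxyID>.jpg)."""
    return os.path.join(images_dir, f"{galaxy_id}.jpg")

# Same directory layout as the paths defined above.
training_images = os.path.join('.', 'images_training_rev1')

# 100008 is the first GalaxyID listed in the solutions table.
example_path = image_path(training_images, 100008)
```

Calling `os.path.exists(example_path)` before loading is a cheap way to check that the naming assumption holds for the local copy of the dataset.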

We read the values of the CSV file into a table.

In [10]:
df = pd.read_csv(training_solutions)
df.shape

(61578, 38)

Let us see the solutions for all the images.

In [11]:
df

Unnamed: 0,GalaxyID,Class1.1,Class1.2,Class1.3,Class2.1,Class2.2,Class3.1,Class3.2,Class4.1,Class4.2,...,Class9.3,Class10.1,Class10.2,Class10.3,Class11.1,Class11.2,Class11.3,Class11.4,Class11.5,Class11.6
0,100008,0.383147,0.616853,0.000000,0.000000,0.616853,0.038452,0.578401,0.418398,0.198455,...,0.000000,0.279952,0.138445,0.000000,0.000000,0.092886,0.000000,0.000000,0.0,0.325512
1,100023,0.327001,0.663777,0.009222,0.031178,0.632599,0.467370,0.165229,0.591328,0.041271,...,0.018764,0.000000,0.131378,0.459950,0.000000,0.591328,0.000000,0.000000,0.0,0.000000
2,100053,0.765717,0.177352,0.056931,0.000000,0.177352,0.000000,0.177352,0.000000,0.177352,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
3,100078,0.693377,0.238564,0.068059,0.000000,0.238564,0.109493,0.129071,0.189098,0.049466,...,0.000000,0.094549,0.000000,0.094549,0.189098,0.000000,0.000000,0.000000,0.0,0.000000
4,100090,0.933839,0.000000,0.066161,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61573,999948,0.510379,0.489621,0.000000,0.059207,0.430414,0.000000,0.430414,0.226257,0.204157,...,0.000000,0.226257,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.226257
61574,999950,0.901216,0.098784,0.000000,0.000000,0.098784,0.000000,0.098784,0.000000,0.098784,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000
61575,999958,0.202841,0.777376,0.019783,0.116962,0.660414,0.067245,0.593168,0.140022,0.520391,...,0.000000,0.000000,0.090673,0.049349,0.000000,0.067726,0.000000,0.000000,0.0,0.072296
61576,999964,0.091000,0.909000,0.000000,0.045450,0.863550,0.022452,0.841098,0.795330,0.068220,...,0.000000,0.068398,0.318132,0.408799,0.227464,0.408799,0.090668,0.023065,0.0,0.045334


As mentioned above, there are many images, so we will work with a subset of them.

### 1.2 Reducing the amount of images

Since the number of images is very large, we use only the first 10000 of them, which we consider sufficient.

In [12]:
n_images = 10000
df_reduced = df[:n_images]
df_reduced

Unnamed: 0,GalaxyID,Class1.1,Class1.2,Class1.3,Class2.1,Class2.2,Class3.1,Class3.2,Class4.1,Class4.2,...,Class9.3,Class10.1,Class10.2,Class10.3,Class11.1,Class11.2,Class11.3,Class11.4,Class11.5,Class11.6
0,100008,0.383147,0.616853,0.000000,0.000000,0.616853,0.038452,0.578401,0.418398,0.198455,...,0.000000,0.279952,0.138445,0.000000,0.000000,0.092886,0.000000,0.000000,0.000000,0.325512
1,100023,0.327001,0.663777,0.009222,0.031178,0.632599,0.467370,0.165229,0.591328,0.041271,...,0.018764,0.000000,0.131378,0.459950,0.000000,0.591328,0.000000,0.000000,0.000000,0.000000
2,100053,0.765717,0.177352,0.056931,0.000000,0.177352,0.000000,0.177352,0.000000,0.177352,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,100078,0.693377,0.238564,0.068059,0.000000,0.238564,0.109493,0.129071,0.189098,0.049466,...,0.000000,0.094549,0.000000,0.094549,0.189098,0.000000,0.000000,0.000000,0.000000,0.000000
4,100090,0.933839,0.000000,0.066161,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,248461,0.083000,0.889000,0.028000,0.000000,0.889000,0.055118,0.833882,0.861441,0.027559,...,0.000000,0.305812,0.444504,0.111126,0.000000,0.027566,0.000000,0.222252,0.305812,0.305812
9996,248466,0.439049,0.527396,0.033555,0.000000,0.527396,0.000000,0.527396,0.298742,0.228654,...,0.000000,0.298742,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.298742
9997,248470,0.256421,0.737000,0.006579,0.038667,0.698333,0.000000,0.698333,0.217351,0.480982,...,0.000000,0.172379,0.044973,0.000000,0.000000,0.000000,0.077471,0.000000,0.000000,0.139880
9998,248471,0.796456,0.203544,0.000000,0.000000,0.203544,0.000000,0.203544,0.000000,0.203544,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
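
Taking the first rows is the simplest option. If the row order happened to correlate with some property of the galaxies, a reproducible random subset would be an alternative. The sketch below only picks row positions; the `sample_row_positions` helper and the seed are hypothetical.

```python
import random

def sample_row_positions(n_total, n_keep, seed=0):
    """Pick n_keep distinct row positions out of n_total, reproducibly."""
    # A local generator keeps the global random state untouched.
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), n_keep))

# 61578 rows in the full table, 10000 rows kept.
positions = sample_row_positions(n_total=61578, n_keep=10000)

# With the notebook's table, this could be applied as:
#   df_reduced = df.iloc[positions].reset_index(drop=True)
```

The `reset_index(drop=True)` in the comment matters because later cells rely on the rows being labelled 0 to 9999.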


### 1.3 Getting the classes of the images

The goal of the project is to classify the images into 37 classes, so let us extract them from the dataset. To do that, we assign to each image the class with the highest probability.

In [13]:
# Keep only the probability columns (everything from 'Class1.1' onwards).
aux = df_reduced.loc[:, 'Class1.1':]
class_names = list(aux.columns)
# For each image, take the name of the column with the highest probability
# (ties resolve to the first such column).
classes = aux.idxmax(axis=1).tolist()
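
The per-row selection can be illustrated on a toy table in plain Python. This is only a sketch: the `most_likely_class` helper is hypothetical and the values are made up, not taken from the dataset.

```python
def most_likely_class(row):
    """Return the column name with the largest value; ties go to the first column."""
    best_name, best_val = None, float("-inf")
    for name, value in row.items():
        # Strict comparison keeps the first column when values are equal.
        if value > best_val:
            best_name, best_val = name, value
    return best_name

# Toy rows with made-up probabilities (not real Galaxy Zoo values).
rows = [
    {"Class1.1": 0.2, "Class1.2": 0.7, "Class1.3": 0.1},
    {"Class1.1": 0.5, "Class1.2": 0.5, "Class1.3": 0.0},
]
toy_classes = [most_likely_class(r) for r in rows]
```

On a pandas table, `DataFrame.idxmax(axis=1)` performs the same row-wise selection without an explicit Python loop.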

There are 37 possible classes in this problem. We convert each class name that occurs in the data to an integer between 1 and 37, numbered in order of first appearance.
That is what we do in the next cell.

In [24]:
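cl = {'class': classes}
cl_df = pd.DataFrame(cl)
# Distinct class names, in order of first appearance.
diff_classes = cl_df['class'].unique()
num_classes = len(diff_classes)
# Map each class name to an integer label starting at 1.
class_to_id = {name: i + 1 for i, name in enumerate(diff_classes)}
cl_df['class'] = cl_df['class'].map(class_to_id)
cl_df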

Unnamed: 0,class
0,1
1,1
2,1
3,2
4,1
...,...
9995,4
9996,1
9997,4
9998,2
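
The numbering scheme (integers assigned in order of first appearance) can be sketched in plain Python as follows. The `encode_labels` helper and the toy labels are hypothetical and serve only as an illustration.

```python
def encode_labels(labels, start=1):
    """Map each distinct label to an integer, numbered in order of first appearance."""
    mapping = {}
    for label in labels:
        if label not in mapping:
            # The next free integer is start plus the number of labels seen so far.
            mapping[label] = start + len(mapping)
    return [mapping[label] for label in labels], mapping

# Made-up class names, chosen only to show the ordering behaviour.
toy_labels = ["Class6.2", "Class6.2", "Class1.1", "Class6.2", "Class1.2"]
codes, mapping = encode_labels(toy_labels)
```

Keeping the `mapping` dictionary makes it possible to translate predicted integers back to class names. If the labels are later fed to a Keras loss that expects integer targets, such as sparse categorical cross-entropy, `start=0` may be the more convenient choice, since those losses expect labels from 0 to the number of classes minus 1.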


We then combine the galaxy IDs with their numeric class labels in a single table.

In [25]:
ids = df_reduced.loc[:,:'GalaxyID']
all_classes = pd.concat([ids,cl_df],axis=1,join='inner')
all_classes

Unnamed: 0,GalaxyID,class
0,100008,1
1,100023,1
2,100053,1
3,100078,2
4,100090,1
...,...,...
9995,248461,4
9996,248466,1
9997,248470,4
9998,248471,2
