# Data science in Python III. - Neural Networks (NNs) and Deep Learning (DL)

<center>
  <img width="545" height="300" src="./images/ann.png"/>
  <img width="500" height="300" src="./images/alphafold-net.png"/>
</center>
<p style="text-align:center; font-size:24px;">
  <b>Fig. 1. A simple Deep Neural Network (DNN) vs the structure of AlphaFold2</b>
</p>

### Glossary
<p>
  Just like other terms related to different fields in machine learning and artificial intelligence, the terms related to neural networks are also not well defined.
</p>
<ul>
<li><b>Neural Network (NN)</b> (<i>hu: neurális háló(zat)</i> ): An umbrella term for NNs and NN-based algorithms in machine learning.</li>

<li><b>Artificial Neural Network (ANN)</b> (<i>hu: mesterséges neurális háló(zat)</i> ): The "correct" term describing the most basic computational model of biological neural networks. An ANN consists of a set of connected nodes called <i>artificial neurons</i>, each of which takes multiple input signals, processes them, then outputs a single signal. In computing, all of these signals are represented by real numbers (complex-valued NNs do exist, but they never really progressed beyond the conceptual/experimental level). The output value of a single neuron is calculated by sending the weighted sum of all input signals through a non-linear function. Typically, neurons are aggregated into layers, where the vector of the input data is simply called the <i>input layer</i> and the output value (in case of regression) or values (in case of classification) are called the <i>output layer</i>. The layers of neurons between these two are referred to as <b>hidden (neuron) layers</b>.</li>

<li><b>Deep Neural Network (DNN)</b> (<i>hu: mély neurális háló(zat)</i> ): Artificial Neural Networks with more than one hidden neuron layer are usually referred to as DNNs, although this distinction is purely informative. In the scientific literature, neural networks with more than one neuron layer are also widely referred to simply as ANNs.</li>
</ul>

## Very short history and definitions
Understanding <i>how</i> neural networks work is simple and only requires some basic concepts in linear algebra (operations on vectors, matrices and tensors) and calculus (differentiation). However, understanding <i>why</i> neural networks work requires a much broader knowledge of mathematics, especially logic, real analysis and category theory.

Warren McCulloch and Walter Pitts (the former a neurophysiologist, the latter a self-taught mathematician working in computational neurophysiology) were the ones who devised the idea of describing brain activity in terms of propositional calculus. Their 1943 paper, <i>A logical calculus of the ideas immanent in nervous activity</i>, is considered to be the very first article that details the basics of the mathematical description of biological (and also artificial) neural networks, or rather of the neurons themselves. (That's why artificial neurons are sometimes also referred to as <i>McCulloch-Pitts neurons</i>.)

In principle, a biological neuron can be modelled as a computational unit that processes input signals by first calculating their weighted sum (the <b>linear</b> part), then passing this sum through a so-called <b>activation</b> or <b>activation function</b> (the <b>non-linear</b> part). Optionally, a constant value (called the <b>bias</b>, denoted by $b$) can also be added to the result of the summation. Artificial neurons try to mimic this exact behaviour. In mathematical terms, the behaviour of a neuron for $N$ input signals can be described as

$$
y
=
f \left(
  \sum_{i=1}^{N} w_{i} x_{i} + b
\right),
$$

where $x_{i}$ are the input values, $w_{i}$ are the weights that the neuron assigns to them in the weighted sum, $b$ is the bias and $f$ is the non-linear activation function.
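The formula above translates almost directly into NumPy. The following is only a minimal sketch: the function name `neuron`, the choice of `tanh` as activation and the numerical values are assumptions made for illustration, not part of the original material.

```python
import numpy as np

def neuron(x, w, b, f=np.tanh):
    """Output of one artificial neuron: f(sum_i w_i * x_i + b)."""
    return f(np.dot(w, x) + b)

# Illustrative values (assumptions, not taken from the text)
x = np.array([1.0, 2.0, 3.0])    # N = 3 input signals
w = np.array([0.5, -0.25, 0.0])  # one weight per input
b = 0.1                          # bias

y = neuron(x, w, b)
print(y)
```

Here the weighted sum is $0.5 - 0.5 + 0 = 0$, so the neuron outputs $\tanh(0.1)$.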

<center>
  <img width=70% src="./images/artificial-neuron.png"/>
</center>
<p style="text-align:center; font-size:24px;">
  <b>Fig. 2. A biological neuron and its mathematical model, an artificial neuron</b>
</p>

The first real artificial neuron was constructed between 1957 and 1958 by <a href="https://en.wikipedia.org/wiki/Frank_Rosenblatt">Frank Rosenblatt</a> and his research team at the Cornell Aeronautical Laboratory [[1]](https://blogs.umass.edu/brain-wars/files/2016/03/rosenblatt-1957.pdf). It was an actual, wardrobe-sized machine called the <i>Mark I Perceptron</i>, which implemented the <i>perceptron</i> algorithm, a rudimentary binary classifier containing a single artificial neuron. In this setup, all the inputs are fed directly into the output layer through a single weighted summation and activation function (that is, a single neuron). The "binary" nature of the perceptron algorithm arises from the fact that it uses a Heaviside step function as its activation function, mapping every input vector to the set $\{0,\,1\}$, which means that it classifies any input into one of two separate classes (denoted by $0$ and $1$).
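To make the idea concrete, here is a minimal sketch of a single-neuron perceptron with a Heaviside step activation, trained with the classic perceptron update rule. The function names, the toy data (the logical AND function) and the hyperparameters are assumptions chosen for illustration; this is not a reconstruction of the Mark I hardware.

```python
import numpy as np

def heaviside(z):
    """Heaviside step function: 1 where z >= 0, otherwise 0."""
    return np.where(z >= 0, 1, 0)

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Classic perceptron update rule on a single neuron."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = heaviside(np.dot(w, xi) + b)
            # The weights change only when the prediction is wrong
            w += lr * (yi - pred) * xi
            b += lr * (yi - pred)
    return w, b

# Toy, linearly separable data: the logical AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, y)
pred = heaviside(X @ w + b)
print(pred)   # [0 0 0 1]
```

Because the AND function is linearly separable, the update rule settles on a separating line after a handful of passes over the four samples.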

Adding more neurons to the system, where the input goes into all of the available neurons and from them to the output layer, creates a single-layer neural network. Similarly, adding more layers of neurons, where the inputs of each new layer are the outputs of the previous one, makes a multi-layer neural network. These types of networks, just like the Mark I Perceptron, are referred to as <b>feedforward</b> neural networks, which means that signals propagate through the network in a single direction, without any loops. However, after the success of the Perceptron machine, interest in neural networks declined sharply. The main reason for this was the limited versatility of feedforward networks.

After a short period of slow research in the 1960s, the final nail in the coffin of neural networks came in 1969, when the book <i>Perceptrons</i> by Marvin Minsky and Seymour Papert showed that it was impossible for single-layer feedforward networks to learn the XOR function. This is also referred to as the <i>XOR problem</i> and can be seen in Fig. 3. Unfortunately, human laziness already existed at that time, and without carefully reading the book, people often believed that Minsky and Papert had proved this limitation for multi-layer neural networks too. Although this wasn't true, and they explicitly conjectured that the XOR problem can be solved using multi-layer networks, this misunderstanding completely halted neural network research for the next $\approx 25$ years. (Yes, this really happened.)
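That a second layer is enough for XOR can be checked by hand. The sketch below uses step-activated neurons with hand-picked weights (an assumption made for illustration, nothing is trained here): one hidden neuron behaves like OR, the other like AND, and the output neuron fires when OR is on but AND is off.

```python
import numpy as np

def step(z):
    """Heaviside step function: 1 where z >= 0, otherwise 0."""
    return np.where(z >= 0, 1, 0)

# All four possible inputs of the XOR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer (hand-picked weights): neuron 1 acts as OR, neuron 2 as AND
W1 = np.array([[1, 1],
               [1, 1]])
b1 = np.array([-0.5, -1.5])

# Output neuron: fires when OR is on but AND is off
W2 = np.array([1, -1])
b2 = -0.5

hidden = step(X @ W1.T + b1)
output = step(hidden @ W2 + b2)
print(output)   # [0 1 1 0]
```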

## Modelling an Artificial Neural Network
The linear and non-linear steps performed in parallel by the neurons of consecutive layers, as well as the whole of backward propagation, can be described using basic tools of linear algebra. Consider $N$ (upper case) input signals, denoted by the $N$-dimensional vector $\boldsymbol{x}$, and a neuron layer with $n$ (lower case) neurons. This layer has $N \times n$ weights ($1$ weight for every input in every neuron) and $n$ biases ($1$ bias in every neuron). The weights can be collected in the $n \times N$ matrix $W$ (one row per neuron, so that the weighted sums of the whole layer are given by the product $W \boldsymbol{x}$), and the biases in the $n$-dimensional vector $\boldsymbol{b}$.
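The bookkeeping of shapes is easy to verify numerically. The sketch below stores the weights with one row per neuron, i.e. with shape `(n, N)`, which is the layout the code later in this notebook uses so that $Z = W \cdot a + b$ is a plain matrix product. The concrete sizes and the number of samples `M` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

N, n, M = 3, 4, 5             # input dimension, neurons in the layer, samples
X = rng.normal(size=(M, N))   # one input vector per row

W = rng.normal(size=(n, N)) * 0.01   # one row of N weights per neuron
b = np.zeros((n, 1))                 # one bias per neuron

# (n, N) @ (N, M) + (n, 1) -> (n, M); the bias is broadcast over the samples
Z = W @ X.T + b
print(W.size, b.size, Z.shape)   # 12 4 (4, 5)
```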
<center>
  <img width=95% src="./images/nn_forward.png"/>
</center>
<p style="text-align:center; font-size:24px;">
  <b>Fig. 4. The forward propagation section of an ANN with $L$ layers</b>
</p>
<center>
  <img width=95% src="./images/nn_backward.png"/>
</center>
<p style="text-align:center; font-size:24px;">
  <b>Fig. 5. The backward propagation section of an ANN with $L$ layers</b>
</p>

## The 2nd main part - Programming

In [None]:
import numpy as np
import pandas as pd
from tqdm import tqdm

import seaborn as sns
import matplotlib.cm as cm
import matplotlib.pyplot as plt

from sklearn.datasets import make_regression, make_classification, \
                             make_blobs, make_moons, make_circles
from sklearn.model_selection import train_test_split
# ...
# ...

## Using NumPy

The NumPy library offers a wide range of tools for numerical calculations. Because NumPy provides vectorized operations and a versatile array class (`numpy.ndarray`), neither of which is available in standard Python, it has become one of the most popular and most widely used Python libraries. These core features make NumPy a natural choice for calculations that primarily involve linear algebra. Since the calculations in an ANN are almost entirely linear algebra, NumPy is well suited for implementing ANNs from scratch.
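As a small illustration of what vectorization means in practice, the sketch below computes the same weighted sum twice: once with an explicit Python loop and once with a single `np.dot` call. The array sizes and variable names are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
x = rng.normal(size=1000)

# Pure-Python loop over the elements
loop_sum = 0.0
for wi, xi in zip(w, x):
    loop_sum += wi * xi

# Vectorized equivalent: a single call into NumPy's compiled code
vec_sum = np.dot(w, x)

print(np.isclose(loop_sum, vec_sum))   # True
```

The two results agree up to floating-point rounding, but the vectorized form is shorter and typically much faster for large arrays.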

### Constructing a basic ANN from scratch

```python
def init_weights():
    '''Initializes neuron weights and biases at the start of the training.'''
    ...

def nn_forward():
    '''Implements the forward propagation steps.'''
    ...

def nn_backward():
    '''Implements the backward propagation steps.'''
    ...
```

The steps above require the help of other functions to work, such as calculating the activation and the cost/loss function, and updating the weights and biases during backward propagation:
```python
def activation():
    ...
  
def loss_function():
    ...

def update_weights():
    '''
    Updates the weights and biases of the neurons in the neural network.
    Also referred to as "optimization".

    '''
    ...
```

In [None]:
def init_weights(ld):
    '''
    Initializes neuron weights and biases at the start of the training.
    Weights are set to be a random, small number, while biases are all
    set to 0.

    Parameters
    ----------
    ld : numpy.ndarray or array-like of shape (N+1,)
        Defines the dimensionality of the input data and the number of
        neurons in each of the ANN layers. Starts at the input side and
        ends at the output side.
        
        The first element is the dimensionality of the input data, while
        its other elements define the number of neurons in each layer.

    Returns
    -------
    params : dict
        The initial weights and biases of each neuron layer.
    '''
    # Initial checks whether input is OK or not
    if not isinstance(ld, np.ndarray):
        ld = np.array(ld)
    assert ld.ndim == 1, "Input should be 1D array of shape (N,)!"
    assert ld.size > 1, "The ANN should contain at least a single layer!"
    assert ld[ld == 0].size == 0, "All layers should contain at least 1 neuron!"

    params = {}

    for li in range(1, len(ld)):
        params[f'W{li}'] = np.random.randn(ld[li], ld[li-1]) * 0.01
        params[f'b{li}'] = np.zeros((ld[li], 1))

    return params

In [None]:
# Unit test
params = init_weights([3, 4, 8, 16, 1])
print(f"keys : {params.keys()}\n")
print(f"W1 =\n{params['W1']}")
print(f"b1 =\n{params['b1']}")

### Activation function
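A quick way to sanity-check the derivative of an activation function is to compare its analytic form with a central finite difference. The sketch below does this for the sigmoid and tanh functions; the helper `numerical_derivative` and the sample points are assumptions made for illustration and are independent of the `activation` function defined in this section.

```python
import numpy as np

def numerical_derivative(f, z, eps=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

z = np.linspace(-3, 3, 13)

# Analytic derivatives, written in terms of the pre-activation value z
d_sigmoid = sigmoid(z) * (1 - sigmoid(z))
d_tanh = 1 - np.tanh(z)**2

print(np.allclose(d_sigmoid, numerical_derivative(sigmoid, z)))   # True
print(np.allclose(d_tanh, numerical_derivative(np.tanh, z)))      # True
```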

In [None]:
def activation(Z, *, func=None, deriv=False):
    '''
    Implements the activation node for both the forward and backward
    propagation section.

    Parameters
    ----------
    Z : numpy.ndarray
        Outputs of a neuron layer.
    func : str
        The chosen activation function. Possible values are 'linear', 
        'relu', 'sigmoid', 'tanh' or 'softmax'.
    deriv : bool
        Changes between the activation function and its derivative.
        During forward propagation the regular functions, while during
        the backward propagation, the derivative should be used.
    '''
    available = ['linear', 'relu', 'tanh', 'sigmoid', 'softmax']
    assert (func in available) or (func is None), \
        "Chosen activation is not implemented!\n" \
        + f"Available options are: {available}"
    
    if func == 'linear':
        if deriv:
            return np.ones_like(Z)
        else:
            return Z
    elif func == 'relu':
        if deriv:
            return 1 * (Z > 0)
        return Z * (Z > 0)
    elif func == 'sigmoid':
        s = lambda x: 1 / (1 + np.exp(-x))
        if deriv:
            return s(Z) * (1 - s(Z))
        return s(Z)
    elif func == 'tanh':
        if deriv:
            # d/dZ tanh(Z) = 1 - tanh(Z)^2
            return 1 - np.tanh(Z)**2
        return np.tanh(Z)
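    elif func == 'softmax':
        # Subtracting the row-wise maximum does not change the result,
        # but avoids overflow in the exponential. The `deriv` flag is
        # ignored here: no separate derivative is implemented.
        e = np.exp(Z - np.max(Z, axis=1, keepdims=True))
        return e / np.sum(e, axis=1, keepdims=True)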

### Forward propagation

The two functions below implement the forward propagation section of an ANN. The `nn_forward_step` function calculates the linear output of a single neuron layer, while the `nn_forward` function links the whole section together. The steps implemented by these functions are the following:

1. `nn_forward` is the main function that is called at the beginning of the forward propagation during the training of the neural network. Its argument `X` is the input training data, `params` contains the weights and biases of the neurons in the network, layer by layer, while the `actv` argument defines which activation function is used in the hidden layers. The possible activation functions are defined in the function named `activation` above.

2. As was already mentioned at the beginning of this NumPy section, the library possesses a versatile tool: the `numpy.ndarray` class. It helps a lot in our calculations, so we first convert any input data to a `numpy.ndarray`. At the beginning of the `nn_forward` function, the following test is evaluated:

    ```python
        if not isinstance(X, np.ndarray):
            X = np.array(X)
    ```
    \
    If our array-like input data is already a `numpy.ndarray` object, then nothing happens. If it isn't, it is converted to a `numpy.ndarray` first.

3. A list named `cache` is created at the line

    ```python
        cache = []
    ```
    \
    This `cache` list will contain all the values needed by the backward propagation section, where the derivatives are calculated from the `a` and `Z` values computed during the forward propagation, and where the existing weights `W` and biases `b` are updated. Strictly speaking, the `W` and `b` values are already available in the `params` dictionary, and the `a` values could be recalculated from the corresponding `Z` values as long as those are saved into `cache`. However, to make everything easier (and our code a bit more transparent), it's best to just save every necessary value to this `cache` list and go on with our lives.

4. The number of layers is calculated at the line

    ```python
        L = len(params) // 2
    ```
    \
    Since the `params` dictionary contains exactly one `W` and one `b` entry for every layer, the length of `params` is twice the number of layers in the ANN.

5. The actual forward propagation section begins here. As the computation graph in Fig. 4 shows, the input data is transposed before it's passed to the very first neuron layer. After this, the for loop propagates the values through the network, layer by layer: it saves the input $a^{\left[ i - 1 \right]}$ of the layer, the output $Z^{\left[ i \right]}$, the weights $W^{\left[ i \right]}$ and the biases $b^{\left[ i \right]}$ to the cache, then applies the selected activation function (specified by the `actv` parameter) to the $Z$ output.

    Calculation of the linear part that is

    $$
    Z^{\left[ i \right]}
    =
    W^{\left[ i \right]} \cdot a^{\left[ i - 1 \right]} + b^{\left[ i \right]}
    $$
    is performed by the `nn_forward_step` function, which takes the input `a`, the weights `W` and the biases `b` as its arguments. The reason why the input is denoted by `a` can be understood by looking at the computation graph above. Except for the very first layer, where the input is the transpose of the training data, every layer uses the output of the activation function of the previous layer (as the equation above also states).
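The steps above can be sketched in a few lines that only track the shapes of the arrays. This is a simplified stand-in under stated assumptions, not the implementation that follows: the weights are generated on the fly, ReLU is used as the activation, and the special treatment of the output layer is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

ld = [3, 4, 2]                  # input dimension, hidden neurons, output neurons
X = rng.normal(size=(5, 3))     # 5 input entries, one per row

a = X.T                         # the first "activation" is the transposed input
cache = []
for i in range(1, len(ld)):
    W = rng.normal(size=(ld[i], ld[i - 1])) * 0.01
    b = np.zeros((ld[i], 1))
    Z = W @ a + b               # linear part of layer i
    cache.append((a, Z, W, b))  # keep everything backpropagation will need
    print(f"layer {i}: a {a.shape} -> Z {Z.shape}")
    a = np.maximum(Z, 0)        # ReLU, standing in for the chosen activation
```

Each cache entry holds the input of a layer together with its linear output, weights and biases, so the backward pass can later walk through the list in reverse order.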

In [None]:
def nn_forward_step(a, W, b):
    '''
    Implements the linear forward propagation step in an ANN.
    
    Parameters
    ----------
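    a : array-like of shape (N, M)
        The output of the previous activation (or the transposed input
        data in the first layer), with one column for each of the M
        input entries.
    W : array-like of shape (n, N)
        Weights of the n neurons in the layer.
    b : array-like of shape (n, 1)
        Biases of the n neurons in the layer.
    
    Returns
    -------
    Z : numpy.ndarray of shape (n, M)
        The linear output of the layer.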
    '''
    Z = np.matmul(W, a) + b
    
    return Z

In [None]:
# Unit test
X = np.random.random((10, 30))
params = init_weights([30, 4])
Z = nn_forward_step(X.T, params['W1'], params['b1'])
a = activation(Z.T, func='softmax')
print(f"{Z = }")
print(f"{a = }")

In [None]:
def nn_forward(X, *, params, actv: str = 'relu'):
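    '''
    Implements consecutive forward propagation steps in an ANN.
    Input values are propagated through the neural network layer by
    layer.
    
    Parameters
    ----------
    X : numpy.ndarray or array-like of shape (M, N)
        An input entry in an ANN is in the form of a vector. Multiple
        input entries can be arranged into a table of size M by N,
        where M is the number of rows that represent the individual
        input entries, while N is the number of columns that represent
        the dimensionality of the input data.
    params : dict
        The weights and biases of each neuron layer, as returned by
        `init_weights`.
    actv : str
        The activation function used in the hidden layers.

    Returns
    -------
    a : numpy.ndarray
        The output of the last layer, with one row per input entry.
    cache : list of tuple
        The (a, Z, W, b) values of every layer, needed by the backward
        propagation.
    '''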
    # Initial checks whether input is OK or not
    if not isinstance(X, np.ndarray):
        X = np.array(X)
        
    # Store forward step results (Z), weights (W) and biases (b) for
    # the backward propagation steps
    cache = []
    
    # Number of layers is defined by the size of the `params` dict
    L = len(params) // 2
    
    # I. Layers 0 -> L-1
    ## In the very first layer the input is transposed
    a = X.T
    for li in range(1, L):
        ## Take a single forward step
        Wi = params[f'W{li}']
        bi = params[f'b{li}']
        Z = nn_forward_step(a, Wi, bi)

        ## Add relevant values to the `cache` list
        cache.append((a, Z, Wi, bi))

        ## Apply the activation function
        a = activation(Z, func=actv)
    
    # II. Last layer
    ## Take the last forward step
    WL = params[f'W{L}']
    bL = params[f'b{L}']
    Z = nn_forward_step(a, WL, bL)
    
    ## Add relevant values to the `cache` list
    cache.append((a, Z, WL, bL))
    
    ## At the end of the last layer, the output is transposed before
    ## going through the last activation function
    Z = Z.T

    ## Apply the last activation function
    if bL.size == 1:
        ## [REGRESSION] Single neuron in the last layer
        ##
        ## In case of regression, the last neuron layer contains only a
        ## single neuron.
        a = activation(Z, func='linear')
    else:
        ## [CLASSIFICATION] Multiple neurons in the last layer
        ##
        ## In case of classification, the last neuron layer represents
        ## the number of classes we want to predict.
        a = activation(Z, func='softmax')
    
    return a, cache

In [None]:
# Unit test
X = np.random.randn(5, 30)
params = init_weights([30, 4, 1])
a, cache = nn_forward(X, params=params, actv='sigmoid')
print(f"{a = }")

### Loss function

In [None]:
def binary_crossentropy(P, y, *, deriv=False):
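    '''
    Calculates the cross-entropy between the predicted probabilities `P`
    and the one-hot encoded labels `y`, both of shape (M, n_classes).

    With `deriv=True` it returns the derivative of the loss with respect
    to the linear output Z of the last layer (assuming a softmax output
    layer), with shape (n_classes, M).
    '''
    M = y.shape[0]
    
    if deriv:
        return 1 / M * (P - y).T
    # Clip the probabilities to avoid evaluating log(0)
    logP = np.log(np.clip(P, 1e-12, 1.0))
    return - 1 / M * np.sum(np.multiply(y, logP))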


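def root_mean_square(a, y, *, deriv=False):
    '''
    Calculates the root mean square error (RMSE) between the predictions
    `a` and the targets `y`, or its derivative with respect to `a`.
    '''
    a, y = np.ravel(a), np.ravel(y)
    rmse = np.sqrt(np.mean((a - y)**2))
    if deriv:
        # d(RMSE)/da, returned with shape (1, M) like the layer output Z
        return ((a - y) / (len(a) * rmse + 1e-12)).reshape(1, -1)
    return rmse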

In [None]:
# Unit test (regression)
y = np.random.random((1,5))
loss = root_mean_square(a=a, y=y)
print(f"{y = }")
print(f"{loss = }")

In [None]:
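# Unit test (classification)
# Use a network with 2 output neurons, so that the predictions are
# softmax probabilities
params = init_weights([30, 4, 2])
P, _ = nn_forward(X, params=params, actv='sigmoid')
y = np.eye(2)[np.random.randint(0, 2, 5)]
loss = binary_crossentropy(P=P, y=y)
print(f"{y = }")
print(f"{loss = }")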

### Backward propagation

<center>
  <img width=95% src="./images/nn_backward.png"/>
</center>
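
Given the forward step $Z^{\left[ i \right]} = W^{\left[ i \right]} \cdot a^{\left[ i - 1 \right]} + b^{\left[ i \right]}$ from above, applying the chain rule layer by layer yields the quantities that the backward propagation has to compute. As a sketch, with $\mathcal{L}$ denoting the loss, $f'$ the derivative of the activation function, $\odot$ element-wise multiplication and $m$ indexing the $M$ input entries (notation introduced here for illustration):

$$
\frac{\partial \mathcal{L}}{\partial W^{\left[ i \right]}}
=
\frac{\partial \mathcal{L}}{\partial Z^{\left[ i \right]}} \cdot \left( a^{\left[ i - 1 \right]} \right)^{T},
\qquad
\frac{\partial \mathcal{L}}{\partial b^{\left[ i \right]}}
=
\sum_{m=1}^{M} \left( \frac{\partial \mathcal{L}}{\partial Z^{\left[ i \right]}} \right)_{:,\,m},
\qquad
\frac{\partial \mathcal{L}}{\partial a^{\left[ i - 1 \right]}}
=
\left( W^{\left[ i \right]} \right)^{T} \cdot \frac{\partial \mathcal{L}}{\partial Z^{\left[ i \right]}}
$$

$$
\frac{\partial \mathcal{L}}{\partial Z^{\left[ i - 1 \right]}}
=
\frac{\partial \mathcal{L}}{\partial a^{\left[ i - 1 \right]}} \odot f' \left( Z^{\left[ i - 1 \right]} \right)
$$

The first three expressions correspond to one linear backward step, while the last one carries the derivative through the activation function to the previous layer.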

In [None]:
def nn_backward_step(dLdZ, cache):
    '''
    Implements a single backward step in the ANN.
    '''
    # cache = (a, Z, W, b)
    dLda = np.dot(cache[2].T, dLdZ)             # dLda = W.T * dLdZ
    dLdW = np.dot(dLdZ, cache[0].T)             # dLdW = dLdZ * a.T
    dLdb = np.sum(dLdZ, axis=1, keepdims=True)  # dLdb = sum(dLdZ)
    
    return dLda, dLdW, dLdb

In [None]:
# Unit test
np.random.seed(1)
params = init_weights([3, 2, 2])
X = np.random.randn(3, 3)
y = np.eye(2)[np.random.randint(0, 2, 3)]
a, cache = nn_forward(X, params=params, actv='sigmoid')

dLdZ = binary_crossentropy(a, y, deriv=True)

dLda, dLdW, dLdb = nn_backward_step(dLdZ, cache[-1])
print("dLda=",dLda)
print("dLdW=",dLdW)
print("dLdb=",dLdb)

In [None]:
def nn_backward(a, y, cache, actv: str = 'relu'):
    '''
    Implements consecutive backward propagation steps in an ANN.
    '''
    # Dictionary to store the derivatives of every layer
    derivatives = {}
    
    # Number of layers is defined as the length of the cache list
    L = len(cache)
    
    # Calculate the backward loss function to get dLdZ
    if cache[-1][-1].size == 1:
        dLdZ = root_mean_square(a=a, y=y, deriv=True)
    else:
        dLdZ = binary_crossentropy(P=a, y=y, deriv=True)
    
    for i in range(L, 1, -1):
        # Calculate and save relevant derivatives from the linear
        # backward propagation section in hidden layers
        dLda, dLdW, dLdb = nn_backward_step(dLdZ, cache[i - 1])
        derivatives[f'dLdW{i}'] = dLdW
        derivatives[f'dLdb{i}'] = dLdb
        
        # a' = f'(Z)
        a_prime = activation(cache[i - 2][1], func=actv, deriv=True)
        # dLdZ = dLda * a'
        dLdZ = np.multiply(dLda, a_prime)

    # Calculate and save relevant derivatives from the linear
    # backward propagation section in the first layer
    dLda, dLdW, dLdb = nn_backward_step(dLdZ, cache[0])
    derivatives[f'dLdW1'] = dLdW
    derivatives[f'dLdb1'] = dLdb

    return derivatives

In [None]:
np.random.seed(1)
params = init_weights([3, 2, 2])
X = np.random.randn(4, 3)
y = np.eye(2)[np.random.randint(0, 2, 4)]
a, cache = nn_forward(X, params=params, actv='sigmoid')
derivatives = nn_backward(a, y, cache, actv='sigmoid')

In [None]:
derivatives

### Connecting and training the whole network

In [None]:
def nn_train(X, y, *, ld=None, epochs=50, lr=0.01, actv='sigmoid'):
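    '''
    Trains the ANN with full-batch gradient descent.

    Parameters
    ----------
    X : array-like of shape (M, N)
        Training inputs.
    y : array-like
        Training targets (one-hot encoded labels for classification).
    ld : array-like
        Layer dimensions, as expected by `init_weights`.
    epochs : int
        Number of gradient descent iterations.
    lr : float
        Learning rate.
    actv : str
        Activation function of the hidden layers.

    Returns
    -------
    params : dict
        The trained weights and biases.
    losses : list
        The loss value recorded in every epoch.
    '''
    # Loss value of every epoch
    losses = []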
    
    # Initialize the weights and biases of the network
    params = init_weights(ld)
    
    # Number of layers is defined by the size of the `params` dict
    L = len(params) // 2
    
    # Each epoch is one iteration of (full-batch) gradient descent
    for _ in tqdm(range(epochs)):
        
        # Propagate the data through the network in a forward direction
        a, cache = nn_forward(X, params=params, actv=actv)
        
        # Calculate the loss and save it into the `losses` list for
        # later use
        if ld[-1] == 1:
            loss = root_mean_square(a, y)
        else:
            loss = binary_crossentropy(a, y)
        losses.append(loss)
        
        # Calculate the derivatives by doing a backward pass in the network
        derivatives = nn_backward(a, y, cache, actv=actv)
        
        # Update the weights and biases of every layer, including the
        # last one (hence `L + 1`)
        for i in range(1, L + 1):
            params[f"W{i}"] = params[f"W{i}"] - lr * derivatives[f"dLdW{i}"]
            params[f"b{i}"] = params[f"b{i}"] - lr * derivatives[f"dLdb{i}"]

    return params, losses

In [None]:
X, y = make_classification(
    n_samples=1280,
    n_features=3,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    random_state=57
)

y = np.eye(2)[y]

# Randomly select a test set
p_test = 0.33
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=p_test)

In [None]:
pd.DataFrame(X)

In [None]:
# This will take a while
params, losses = nn_train(
    X_train, y_train,
    ld=[X.shape[1], 4, 8, 16, 32, 2],
    epochs=500,
    lr=100,
    actv='sigmoid'
)

In [None]:
fig, ax = plt.subplots(figsize=(10, 7))
ax.grid(True, ls='--', alpha=0.6)

ax.plot(losses, lw=3)
ax.set_xlabel('Epochs', fontsize=20, fontweight='bold')
ax.set_ylabel('Binary cross-entropy loss', fontsize=20, fontweight='bold')
ax.tick_params(axis='both', which='major', labelsize=16)

plt.show()

In [None]:
P, _ = nn_forward(X_test, params=params, actv='sigmoid')

In [None]:
y_pred = np.argmax(P, axis=1)

In [None]:
# Final evaluation of the model
from sklearn.metrics import ConfusionMatrixDisplay

fig, ax = plt.subplots(figsize=(10, 10))
ConfusionMatrixDisplay.from_predictions(np.argmax(y_test, axis=1), y_pred,
                                        ax=ax)
ax.set_xlabel('Predicted label', fontsize=20, fontweight='bold')
ax.set_ylabel('True label', fontsize=20, fontweight='bold')
ax.tick_params(axis='both', which='major', labelsize=16)

plt.show()

## Using TensorFlow and Keras

TensorFlow is an open-source machine learning and AI library with APIs for Python, C++ and Java, originally developed by Google, primarily for deep learning. It provides an almost block-based approach that lets anyone construct deep learning models of considerable complexity with little effort. While it is focused on deep neural networks, it can also be used for a variety of other tasks. Since TensorFlow 2.0, Keras has been tightly integrated into TensorFlow as its official high-level API (`tf.keras`). Keras contains implementations of neural network modules and layers, optimization methods, activation functions, loss functions, metrics and more.

In [None]:
import tensorflow as tf
from tensorflow.keras import layers as kl
from tensorflow.keras import models as km
from tensorflow.keras import backend as K
from tensorflow.keras import callbacks as kc
from tensorflow.keras import optimizers as ko
from tensorflow.keras import regularizers as kr

In [None]:
# Check TensorFlow version
print(f"TF version : {tf.__version__}")
# Test if GPU is available for TensorFlow
# This should show the available GPUs, listed in an array
print(f"GPU : {tf.config.list_physical_devices('GPU')}")

In [None]:
def ann_tf(ld,
           activation='relu'):
    '''
    Implements an N-layer ANN in TensorFlow-Keras.

    Parameters
    ----------
    ld : numpy.ndarray or array-like of shape (N+1,)
        Defines the dimensionality of the input data and the number of
        neurons (N) in each of the ANN layers. Starts at the input side
        and ends at the output side.
        
        The first element is the dimensionality of the input data, while
        its other elements define the number of neurons in each layer.
    
    activation : str
        Specifies the activation function for the hidden layers.
    '''
    # Initial checks whether input is OK or not
    if not isinstance(ld, np.ndarray):
        ld = np.array(ld)
    assert ld.ndim == 1, "Input should be 1D array of shape (N,)!"
    assert ld.size > 1, "The ANN should contain at least a single layer!"
    assert ld[ld == 0].size == 0, "All layers should contain at least 1 neuron!"
    
    # Tensorflow placeholder for inputs
    inp = kl.Input(shape=(ld[0],))
    x = inp
    
    # Define hidden layers of the ANN
    for ni in ld[1:-1]:
        x = kl.Dense(ni, activation=activation)(x)
    
    # Define last layer of the ANN
    # Name the last layer without relying on the global `gpu` variable
    name = f"final_dense_n{ld[-1]}"
    if ld[-1] == 1:
        x = kl.Dense(ld[-1], activation='linear', name=name)(x)
    else:
        x = kl.Dense(ld[-1], activation='softmax', name=name)(x)
    
    model = km.Model(inputs=inp, outputs=x)

    return model

### Regression

In [None]:
gpu = '0'
GPU = [f"GPU:{i}" for i in gpu.split(',')]

if len(gpu.split(',')) > 1:
    strategy = tf.distribute.MirroredStrategy(GPU)
else:
    strategy = tf.distribute.OneDeviceStrategy(GPU[0])

with strategy.scope():
    model = ann_tf(ld=np.array([10, 4, 8, 16, 1]), activation='relu')
    model.compile(loss='mean_squared_error',
                  optimizer=tf.keras.optimizers.Adam(learning_rate=0.01))

In [None]:
model.summary()

In [None]:
X, y = make_regression(
    n_samples=12800,
    n_features=10,
    n_informative=4,
    random_state=57
)

# Randomly select a test set
p_test = 0.33
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=p_test,
                     random_state=57)
# Randomly select a validation set
p_valid = 0.25
X_train, X_valid, y_train, y_valid = \
    train_test_split(X_train, y_train, test_size=p_valid / (1 - p_test),
                     random_state=57)

#
# TENSORFLOW PURGATORY IN EARLY 2022
#

# Wrap data in Dataset objects
train_data = tf.data.Dataset.from_tensor_slices((X_train, y_train))
valid_data = tf.data.Dataset.from_tensor_slices((X_valid, y_valid))

# The batch size must now be set on the Dataset objects
batch_size = 128
train_data = train_data.batch(batch_size)
valid_data = valid_data.batch(batch_size)

# Disable AUTO sharding policy
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = \
                        tf.data.experimental.AutoShardPolicy.OFF
train_data = train_data.with_options(options)
valid_data = valid_data.with_options(options)

In [None]:
epochs = 50

# Fit the model
history = model.fit(train_data, validation_data=valid_data, 
                    epochs=epochs)

In [None]:
from sklearn.metrics import r2_score

In [None]:
y_pred = model.predict(X_test)

In [None]:
# Final evaluation of the model
# (the coefficient of determination R^2 is not an accuracy)
score = r2_score(y_test, y_pred)
print(f"R^2 score: {score:.3f}")

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
ax.grid(True, ls='--', alpha=0.6)

ax.scatter(y_test, y_pred)
ax.set_xlabel('Groundtruth', fontsize=20, fontweight='bold')
ax.set_ylabel('Prediction', fontsize=20, fontweight='bold')
ax.tick_params(axis='both', which='major', labelsize=16)

plt.show()

### Classification

In [None]:
gpu = '0'
GPU = [f"GPU:{i}" for i in gpu.split(',')]

if len(gpu.split(',')) > 1:
    strategy = tf.distribute.MirroredStrategy(GPU)
else:
    strategy = tf.distribute.OneDeviceStrategy(GPU[0])

with strategy.scope():
    model = ann_tf(ld=np.array([10, 4, 8, 16, 2]), activation='sigmoid')
    model.compile(loss='binary_crossentropy',
                  optimizer=tf.keras.optimizers.Adam(learning_rate=0.01))

In [None]:
model.summary()

In [None]:
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split

In [None]:
X, y = make_classification(
    n_samples=12800,
    n_features=10,
    n_informative=4,
    n_redundant=2,
    n_classes=2,
    random_state=57
)

# Randomly select a test set
p_test = 0.33
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=p_test)
# Randomly select a validation set
p_valid = 0.25
X_train, X_valid, y_train, y_valid = \
    train_test_split(X_train, y_train, test_size=p_valid / (1 - p_test))

# One-hot encode labels
y_train = np.eye(2)[y_train]
y_valid = np.eye(2)[y_valid]

#
# TENSORFLOW PURGATORY IN EARLY 2022
#
# Wrap data in Dataset objects
train_data = tf.data.Dataset.from_tensor_slices((X_train, y_train))
valid_data = tf.data.Dataset.from_tensor_slices((X_valid, y_valid))

# The batch size must now be set on the Dataset objects
batch_size = 128
train_data = train_data.batch(batch_size)
valid_data = valid_data.batch(batch_size)

# Disable AUTO sharding policy
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = \
                        tf.data.experimental.AutoShardPolicy.OFF
train_data = train_data.with_options(options)
valid_data = valid_data.with_options(options)

In [None]:
epochs = 50

# Fit the model
history = model.fit(train_data,
                    validation_data=valid_data, 
                    epochs=epochs)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
y_pred = model.predict(X_test, verbose=0)
y_pred = np.argmax(y_pred, axis=1)

In [None]:
# Final evaluation of the model
fig, ax = plt.subplots(figsize=(10, 10))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=ax)
ax.set_xlabel('Predicted label', fontsize=20, fontweight='bold')
ax.set_ylabel('True label', fontsize=20, fontweight='bold')
ax.tick_params(axis='both', which='major', labelsize=16)

plt.show()

## Something bigger?

In [None]:
import os
import natsort

In [None]:
X = []
Z = []

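img_dir = '/home/masterdesky/data/SDSS/images/'
for f in natsort.os_sorted(os.listdir(img_dir)):
    # Keep only the first three (RGB) channels of each image
    X.append(plt.imread(os.path.join(img_dir, f))[:,:,:3])
    # The label is parsed from the file name: the part after the last
    # 'z', without the 4-character file extension
    Z.append(float(f.split('z')[-1][:-4]))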

X = np.array(X)
Z = np.array(Z)

In [None]:
nrows = 3
ncols = 8
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*2, nrows*2),
                         facecolor='black', subplot_kw={'facecolor' : 'black'})
fig.subplots_adjust(hspace=0.5)

rand_idx = np.random.randint(0, len(X), size=nrows*ncols)
images = X[rand_idx]
labels = Z[rand_idx]

for i, ax in enumerate(axes.flat):
    ax.imshow(images[i], cmap='Greys_r')
    ax.set_title(f'z : {labels[i]}', fontweight='bold',
                 color='white', pad=0)
    ax.axis('off')
    ax.grid(False)

plt.suptitle('Fig. 4. Sample data along with their labels of the SDSS dataset.',
             color='white', fontsize=20, y=0.05)
    
plt.show()

In [None]:
test_size = 0.33
valid_size = 0.2
X_train, X_test, y_train, y_test = train_test_split(X, Z,
                          test_size=test_size, random_state=57)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,
                          test_size=valid_size/(1-test_size), random_state=57)
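
The second call to `train_test_split` uses `valid_size/(1-test_size)` instead of `valid_size`, because it only sees the data that is left after the test set has been removed. The arithmetic can be checked with a short sketch (the variable names `remaining`, `second_split`, `valid_overall` and `train_overall` are introduced here for illustration only):

```python
test_size = 0.33
valid_size = 0.2

# Share of the full dataset that is left after the test split
remaining = 1 - test_size

# Fraction passed to the second split, relative to the remaining data
second_split = valid_size / remaining

# Converting back to fractions of the full dataset
valid_overall = second_split * remaining
train_overall = (1 - second_split) * remaining

print(f"second split fraction: {second_split:.4f}")
print(f"train / valid / test : {train_overall:.2f} / {valid_overall:.2f} / {test_size:.2f}")
```

With these settings the validation set ends up as roughly 20% of the full dataset, as intended; the exact sample counts differ slightly because `train_test_split` rounds to whole samples.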

In [None]:
print('Train :', X_train.shape)
print('Valid :', X_valid.shape)
print('Test :', X_test.shape)

In [None]:
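def CNN(imsize, n_target, stride, kernelsize,
        n_channels=1, num_of_filters=32,
        padding='same', activation='relu', 
        reg=5e-5, gpu='0,1,2'):
    # Each Conv2D below is followed by BatchNormalization and the activation;
    # the block diagrams in the comments omit the BatchNormalization step.
    # `gpu` is only used to build the name of the final Dense layer.

    # Symbolic Keras input tensor
    inputs = kl.Input(shape=(imsize, imsize, n_channels))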

    #
    # Convolutional block 1.
    # 3x3CONVx32 -> ReLU -> 3x3CONVx32 -> ReLU -> MAXPOOLx2
    #
    x = kl.Conv2D(filters=num_of_filters,                   # 3x3CONVx32
                kernel_size=(kernelsize, kernelsize),
                padding=padding,
                strides=(stride, stride),
                kernel_regularizer=kr.l2(reg))(inputs)
    x = kl.Activation(activation)(kl.BatchNormalization()(x))   # ReLU

    x = kl.Conv2D(filters=num_of_filters,                   # 3x3CONVx32
                kernel_size=(kernelsize, kernelsize),
                padding=padding,
                strides=(stride, stride),
                kernel_regularizer=kr.l2(reg))(x)
    x = kl.Activation(activation)(kl.BatchNormalization()(x))   # ReLU

    x = kl.MaxPooling2D(strides=(2, 2))(x)                  # MAXPOOLx2


    #
    # Convolutional block 2.
    # 3x3CONVx64 -> ReLU -> 3x3CONVx64 -> ReLU -> MAXPOOLx2
    #
    x = kl.Conv2D(filters=2*num_of_filters,                 # 3x3CONVx64
                kernel_size=(kernelsize, kernelsize),
                padding=padding,
                strides=(stride, stride),
                kernel_regularizer=kr.l2(reg))(x)
    x = kl.Activation(activation)(kl.BatchNormalization()(x))   # ReLU

    x = kl.Conv2D(filters=2*num_of_filters,                 # 3x3CONVx64
                kernel_size=(kernelsize, kernelsize),
                padding=padding,
                strides=(stride, stride),
                kernel_regularizer=kr.l2(reg))(x)
    x = kl.Activation(activation)(kl.BatchNormalization()(x))   # ReLU

    x = kl.MaxPooling2D(strides=(2, 2))(x)                  # MAXPOOLx2


    #
    # Convolutional block 3.
    # 3x3CONVx128 -> ReLU -> 1x1CONVx64 -> ReLU -> 3x3CONVx128 -> ReLU -> MAXPOOLx2
    #
    x = kl.Conv2D(filters=4*num_of_filters,                 # 3x3CONVx128
                kernel_size=(kernelsize, kernelsize),
                padding=padding,
                strides=(stride, stride),
                kernel_regularizer=kr.l2(reg))(x)
    x = kl.Activation(activation)(kl.BatchNormalization()(x))   # ReLU

    x = kl.Conv2D(filters=2*num_of_filters,                 # 1x1CONVx64
                kernel_size=(1, 1),
                padding=padding,
                kernel_regularizer=kr.l2(reg))(x)
    x = kl.Activation(activation)(kl.BatchNormalization()(x))   # ReLU

    x = kl.Conv2D(filters=4*num_of_filters,                 # 3x3CONVx128
                kernel_size=(kernelsize, kernelsize),
                padding=padding,
                strides=(stride, stride),
                kernel_regularizer=kr.l2(reg))(x)
    x = kl.Activation(activation)(kl.BatchNormalization()(x))   # ReLU

    x = kl.MaxPooling2D(strides=(2, 2))(x)                  # MAXPOOLx2


    #
    # Convolutional block 4.
    # 3x3CONVx256 -> ReLU -> 1x1CONVx128 -> ReLU -> 3x3CONVx256 -> ReLU -> MAXPOOLx2
    #
    x = kl.Conv2D(filters=8*num_of_filters,                 # 3x3CONVx256
                kernel_size=(kernelsize, kernelsize),
                padding=padding,
                strides=(stride, stride),
                kernel_regularizer=kr.l2(reg))(x)
    x = kl.Activation(activation)(kl.BatchNormalization()(x))   # ReLU

    x = kl.Conv2D(filters=4*num_of_filters,                 # 1x1CONVx128
                kernel_size=(1, 1),
                padding=padding,
                kernel_regularizer=kr.l2(reg))(x)
    x = kl.Activation(activation)(kl.BatchNormalization()(x))   # ReLU

    x = kl.Conv2D(filters=8*num_of_filters,                 # 3x3CONVx256
                kernel_size=(kernelsize, kernelsize),
                padding=padding,
                strides=(stride, stride),
                kernel_regularizer=kr.l2(reg))(x)
    x = kl.Activation(activation)(kl.BatchNormalization()(x))   # ReLU

    x = kl.MaxPooling2D(strides=(2, 2))(x)                  # MAXPOOLx2


    #
    # Convolutional block 5.
    # 3x3CONVx512 -> ReLU -> 1x1CONVx256 -> ReLU -> 3x3CONVx512 -> ReLU -> AVGPOOL ||
    #
    x = kl.Conv2D(filters=16*num_of_filters,                # 3x3CONVx512
                kernel_size=(kernelsize, kernelsize),
                padding=padding,
                strides=(stride, stride),
                kernel_regularizer=kr.l2(reg))(x)
    x = kl.Activation(activation)(kl.BatchNormalization()(x))   # ReLU

    x = kl.Conv2D(filters=8*num_of_filters,                 # 1x1CONVx256
                kernel_size=(1, 1),
                padding=padding,
                kernel_regularizer=kr.l2(reg))(x)
    x = kl.Activation(activation)(kl.BatchNormalization()(x))   # ReLU

    x = kl.Conv2D(filters=16*num_of_filters,                # 3x3CONVx512
                kernel_size=(kernelsize, kernelsize),
                padding=padding,
                strides=(stride, stride),
                kernel_regularizer=kr.l2(reg))(x)
    x = kl.Activation(activation)(kl.BatchNormalization()(x))   # ReLU


    # End of convolution
    x = kl.GlobalAveragePooling2D()(x)                      # AVGPOOL

    x = kl.Dense(units=n_target,
                 name = f"final_dense_n{n_target}_ngpu{len(gpu.split(','))}")(x)

    model = km.Model(inputs=inputs, outputs=x)

    return model
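
To get a feel for how the spatial resolution shrinks inside this network, the sketch below tracks the feature map size through the four max-pooling steps. It assumes the settings used later in this notebook (`stride=1`, `padding='same'`, so the convolutions keep the size) and the Keras default `pool_size=(2, 2)` with `'valid'` padding for `MaxPooling2D`. The helper `feature_map_sizes` and the 64 pixel input are hypothetical and only serve as an illustration.

```python
def feature_map_sizes(imsize, n_pools=4):
    """Spatial size before and after each 2x2, stride-2 max pooling.

    Assumes size-preserving convolutions (stride 1, 'same' padding),
    so only the pooling layers change the resolution: size -> size // 2.
    """
    sizes = [imsize]
    for _ in range(n_pools):
        sizes.append(sizes[-1] // 2)
    return sizes


# Hypothetical 64x64 input image
sizes = feature_map_sizes(64)
print(sizes)   # [64, 32, 16, 8, 4]

# The last block uses 16*num_of_filters filters, so global average
# pooling returns one value per filter
num_of_filters = 32
n_features = 16 * num_of_filters
print(n_features)   # 512
```

Under these assumptions the final `Dense` layer receives a vector of 512 values regardless of the input image size, since `GlobalAveragePooling2D` averages over whatever spatial extent is left.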

In [None]:
gpu = '0'
GPU = [f"GPU:{i}" for i in gpu.split(',')]

if len(gpu.split(',')) > 1:
    strategy = tf.distribute.MirroredStrategy(GPU)
else:
    strategy = tf.distribute.OneDeviceStrategy(GPU[0])

with strategy.scope():
    best_model = kc.ModelCheckpoint('./best_model.hdf5',
                                    save_best_only=True, verbose=1)
    model = CNN(
        imsize=X[0].shape[1], n_target=1, stride=1, kernelsize=3,
        n_channels=3, num_of_filters=32,
        padding='same', activation='relu', 
        reg=5e-5, gpu=gpu
    )
    model.compile(optimizer=ko.Adam(learning_rate=5e-5),
                  loss='MeanSquaredError',
                  metrics=['MeanAbsoluteError'])

In [None]:
model.summary()

In [None]:
# Fit the model
epochs = 100
batch_size = 128
history = model.fit(x=X_train, y=y_train,
                    validation_data=(X_valid, y_valid),
                    epochs=epochs, batch_size=batch_size,
                    callbacks=[best_model])

In [None]:
# Plot the training and validation loss per epoch
fig, ax = plt.subplots(1, 1, figsize=(8,8))

x = np.arange(epochs)+1
ax.plot(x, history.history['loss'], label='Train loss',
        color='tab:blue', lw=4, alpha=0.9)
ax.plot(x, history.history['val_loss'], label='Valid. loss',
        color='tab:orange', lw=4, alpha=0.9)

ax.set_ylabel('Loss', fontsize=15, fontweight='bold')
ax.set_xlabel('Epoch', fontsize=15, fontweight='bold')
ax.tick_params(labelsize=14)

ax.legend(loc='upper right', fontsize=14, ncol=2)
ax.grid(ls='--', color='0.7')

plt.show()

In [None]:
y_pred = model.predict(X_test, verbose=0)

In [None]:
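# Final evaluation of the model: coefficient of determination (R^2).
# R^2 is not an accuracy; it is 1 for a perfect fit and can be negative.
score = r2_score(y_test, y_pred)
print(f"R^2 score: {score:.3f}")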
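
The value returned by `r2_score` is the coefficient of determination, $R^2 = 1 - SS_\mathrm{res} / SS_\mathrm{tot}$. A minimal NumPy sketch of the same formula is shown below; `r2_manual` and the toy arrays are hypothetical names used for illustration, not part of scikit-learn or of this notebook's data.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    # Flatten so that (N,) targets and (N, 1) network outputs line up
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true_demo = np.array([0.1, 0.2, 0.3, 0.4])

# A perfect prediction gives exactly 1
print(r2_manual(y_true_demo, y_true_demo))

# Always predicting the mean of the targets gives 0
print(r2_manual(y_true_demo, np.full(4, y_true_demo.mean())))

# A model worse than the mean predictor gives a negative value
print(r2_manual(y_true_demo, y_true_demo[::-1]))
```

Because the value is unbounded from below, it should be read as the fraction of variance explained rather than as a percentage of correct predictions.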

In [None]:
fig, ax = plt.subplots(figsize=(8,8))
ax.set_aspect('equal')

ax.scatter(y_test, y_pred,
           fc='k', ec='none', lw=1, alpha=0.4,
           s=6**2)
ax.plot([0,1],[0,1],
        color='tab:red', lw=4, alpha=0.8)

ax.set_xlim(0,1)
ax.set_ylim(0,1)

ax.set_ylabel('Predicted z', fontsize=15, fontweight='bold')
ax.set_xlabel('True z', fontsize=15, fontweight='bold')
ax.tick_params(labelsize=14)

plt.show()