<a href="https://colab.research.google.com/github/mhuertascompany/Saas-Fee/blob/main/hands-on/chapter2/hello_ANN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib

from matplotlib import patches
from matplotlib import lines
import matplotlib.pyplot as plt
import pickle
from astropy.wcs import WCS
from astropy.coordinates import SkyCoord
from astropy.nddata.utils import Cutout2D
import scipy.stats as stats
import sys
from scipy.ndimage import uniform_filter
from astropy.table import Table
from astropy.cosmology import Planck13

import tensorflow as tf
import tensorflow_probability as tfp

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten

%pylab inline

# Let's first generate some data...

In [None]:
x = np.random.uniform(-1,1,100)
y = 0.1*x+np.random.normal(0,0.025,100)
plt.scatter(x,y,label='data')
plt.legend()
plt.show()

# The usual way to deal with this is through linear regression

In [None]:
res = np.polyfit(x,y,1)
print(res)
plt.scatter(x,y,label='data')
plt.plot(np.linspace(-1,1),np.linspace(-1,1)*res[0]+res[1],color='green',label='polyfit')
plt.legend()
plt.show()

# Now, let's try to write the linear regression in a different (more complicated) way

In [None]:
tfk = tf.keras
tfkl = tf.keras.layers
tfpl = tfp.layers
tfd = tfp.distributions

ann = tfk.Sequential([
tf.keras.layers.Flatten(input_shape=(1,1)),
tfkl.Dense(1, activation=None)])


The Dense command here only says that the input is multiplied by a parameter $w$ and shifted by a bias $b$. We are effectively writing a simple model for our data: $y = w \cdot x + b$, where $w$ and $b$ are unknown.
![alt](https://drive.google.com/uc?id=1Rt2bNPCxaHXdjzmVS7TCw_u_Ur-WIqlW)
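As a rough sketch of what this single `Dense(1)` unit computes, the NumPy snippet below applies one weight and one bias to a batch of inputs. The function name `dense_1unit` and the values of `w_demo` and `b_demo` are made up for illustration; in the Keras model the parameters are learned during training.

```python
import numpy as np

def dense_1unit(x, w, b):
    """Forward pass of a single linear unit: y = w * x + b."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)  # (batch, 1), like the flattened input
    return x * w + b                               # (batch, 1)

# Hypothetical parameter values, chosen only for illustration
w_demo, b_demo = 0.1, 0.0
x_demo = np.array([-1.0, 0.0, 1.0])
y_demo = dense_1unit(x_demo, w_demo, b_demo)
print(y_demo.ravel())
```

With one input and one output, the layer has exactly two trainable parameters, which should be the count reported by `ann.summary()`.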

We can visualize the model we just created.

In [None]:
ann.summary()

# We then compile

In [None]:
ann.compile(optimizer=tf.optimizers.Adam(),loss='mse')

We are simply stating that we want to minimize the mean squared error (mse) between the model prediction and the target $y$. We call this the "loss function". So we are looking for the values of $w$ and $b$ that minimize the following expression: $$ \frac{1}{N}\sum_{i=1}^{N}\left(y_i-(w \cdot x_i+b)\right)^2$$
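To make the loss concrete, here is a minimal NumPy sketch that minimizes the same mean squared error with plain gradient descent. This is an assumption made for simplicity: the notebook itself uses the Adam optimizer, and the names `x_demo`, `y_demo`, `lr` and the number of steps are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.uniform(-1, 1, 100)
y_demo = 0.1 * x_demo + rng.normal(0, 0.025, 100)

def mse(w, b, x, y):
    """Mean squared error between targets y and predictions w*x + b."""
    return np.mean((y - (w * x + b)) ** 2)

# Plain gradient descent (the notebook uses Adam instead)
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    err = (w * x_demo + b) - y_demo
    w -= lr * 2 * np.mean(err * x_demo)  # d(mse)/dw
    b -= lr * 2 * np.mean(err)           # d(mse)/db

w_ls, b_ls = np.polyfit(x_demo, y_demo, 1)  # closed-form least squares
print(w, b, mse(w, b, x_demo, y_demo))
print(w_ls, b_ls, mse(w_ls, b_ls, x_demo, y_demo))
```

Both approaches minimize the same quantity, so the two printed lines should agree closely.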

# And fit the model ...

In [None]:
# you might need to run this cell a couple of times if it does not work directly
ann.fit(x,y,batch_size=1,epochs=20)

# Let's see what we got here...

In [None]:
xp = np.linspace(-1,1)
yp = ann.predict(xp)
plt.plot(xp,yp,color='red',label='ANN')
plt.scatter(x,y,label='data')
plt.plot(np.linspace(-1,1),np.linspace(-1,1)*res[0]+res[1],color='green',label='polyfit')
plt.legend()

We have performed a linear regression with an artificial neural network! So, yes, linear regression IS also Machine Learning...

# But why is this useful?

Let's suppose we have a more complex dataset...

In [None]:
x = np.random.uniform(-1,1,100)
y = 0.1*x+np.sin(5*x)+np.random.normal(0,0.2,100)

In [None]:
plt.scatter(x,y)
plt.show()

# I can try simple linear regression again ...

In [None]:
res = np.polyfit(x,y,1)
print(res)
plt.scatter(x,y,label='data')
plt.plot(np.linspace(-1,1),np.linspace(-1,1)*res[0]+res[1],color='green',label='polyfit')
plt.legend()
plt.show()

But, as expected, that does not work very well...

# Let's go back to our complicated ANN ...

In [None]:
tfd = tfp.distributions
tfpl = tfp.layers
tfk = tf.keras
tfkl = tf.keras.layers

ann = tfk.Sequential([
tf.keras.layers.Flatten(input_shape=(1,1)),
tfkl.Dense(1, activation=None)])
ann.compile(optimizer=tf.optimizers.Adam(),loss='mse')
ann.fit(x,y,batch_size=1,epochs=50)

In [None]:
xp = np.linspace(-1,1)
yp = ann.predict(xp)
plt.plot(xp,yp,color='red',label='ANN')
plt.plot(np.linspace(-1,1),np.linspace(-1,1)*res[0]+res[1],color='green',label='polyfit')
plt.scatter(x,y)
plt.legend()

If I do not change anything, I will obtain the same result. My model is simply linear...
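The same holds even if more layers are stacked without an activation function: a composition of linear layers is still a single straight line. The NumPy sketch below checks this with hypothetical random weights (`W1`, `b1`, `W2`, `b2` are not taken from the trained model).

```python
import numpy as np

rng = np.random.default_rng(1)

# Two stacked linear layers with hypothetical random weights (no activation)
W1, b1 = rng.normal(size=(1, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 1)), rng.normal(size=1)

def two_linear_layers(x):
    h = x @ W1 + b1      # first Dense, activation=None
    return h @ W2 + b2   # second Dense, activation=None

# The same map written as a single line: y = w_eff * x + b_eff
w_eff = W1 @ W2          # shape (1, 1)
b_eff = b1 @ W2 + b2     # shape (1,)

x_demo = np.linspace(-1, 1, 5).reshape(-1, 1)
y_stacked = two_linear_layers(x_demo)
y_single = x_demo @ w_eff + b_eff
print(np.allclose(y_stacked, y_single))
```

This is why depth alone does not help here: some non-linearity between the layers is needed.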

# Let's add a bit of non-linearity ...

In [None]:
tfd = tfp.distributions
tfpl = tfp.layers
tfk = tf.keras
tfkl = tf.keras.layers

ann = tfk.Sequential([
tf.keras.layers.Flatten(input_shape=(1,1)),
tfkl.Dense(1, activation='sigmoid')])


The sigmoid function is given by this expression: $$ \sigma(x) = \frac{1}{1+e^{-x}}$$
Our model now looks like this: ![alt](https://drive.google.com/uc?id=1-2VbatzRnqGJMKCga-tppiTo6iPRBr9s)
This is what we call a perceptron. The non-linear function applied after the linear combination is also called the activation function, because "it fires the unit".
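A minimal NumPy sketch of this unit follows. The weight and bias values are hypothetical and only serve to show the shape of the output.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b):
    """Linear combination followed by the sigmoid activation."""
    return sigmoid(w * x + b)

# Hypothetical weight and bias, for illustration only
x_demo = np.linspace(-1, 1, 5)
out = perceptron(x_demo, w=5.0, b=0.0)
print(out)  # every value lies strictly between 0 and 1
```

Because the output is confined to the interval (0, 1), a single sigmoid unit is limited in how well it can follow data that also takes negative values.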

In [None]:
ann.summary()

In [None]:
ann.compile(optimizer=tf.optimizers.Adam(),loss='mse')
ann.fit(x,y,batch_size=1,epochs=50)

In [None]:
xp = np.linspace(-1,1)
yp = ann.predict(xp)
plt.plot(xp,yp,color='red',label='ANN')
plt.plot(np.linspace(-1,1),np.linspace(-1,1)*res[0]+res[1],color='green',label='polyfit')
plt.scatter(x,y)
plt.legend()

Still not great, but there is some potential!

# We are going to work a bit more on the model

In [None]:
tfd = tfp.distributions
tfpl = tfp.layers
tfk = tf.keras
tfkl = tf.keras.layers

ann = tfk.Sequential([
tf.keras.layers.Flatten(input_shape=(1,1)),
tfkl.Dense(1, activation='sigmoid'),
tfkl.Dense(1, activation=None)])


We have added "a layer". Since each Dense layer also carries a bias by default, our model is now: $$ y = w_2 \cdot \frac{1}{1+e^{-(w_1 \cdot x + b_1)}} + b_2$$
![alt](https://drive.google.com/uc?id=1E0iobni7jhUI2jfGKPb081OM_QDB5Hjg)
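Written out in NumPy, the forward pass of this two-layer model might look like the sketch below. The parameter values are hypothetical; Keras would learn them during `fit`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tiny_ann(x, w1, b1, w2, b2):
    """Hidden sigmoid unit followed by a linear output unit."""
    h = sigmoid(w1 * x + b1)   # Dense(1, activation='sigmoid')
    return w2 * h + b2         # Dense(1, activation=None)

# Hypothetical parameters, for illustration only
params = dict(w1=5.0, b1=0.0, w2=2.0, b2=-1.0)
x_demo = np.linspace(-1, 1, 5)
y_demo = tiny_ann(x_demo, **params)
print(y_demo)  # outputs lie in (b2, b2 + w2) = (-1, 1) instead of (0, 1)
```

The linear output layer rescales and shifts the sigmoid, so the model can now reach negative values. It has four trainable parameters in total, which should match what `ann.summary()` reports.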

In [None]:
ann.summary()

In [None]:
ann.compile(optimizer=tf.optimizers.Adam(),loss='mse')
ann.fit(x,y,batch_size=1,epochs=50)

In [None]:
xp = np.linspace(-1,1)
yp = ann.predict(xp)
plt.plot(xp,yp,color='red',label='ANN')
plt.plot(np.linspace(-1,1),np.linspace(-1,1)*res[0]+res[1],color='green',label='polyfit')
plt.scatter(x,y)
plt.legend()

Not fantastic, but you get the idea... You have just created your first ANN for regression!

In fact, it turns out that there exists a mathematical theorem that proves that NNs are universal approximators:


FOR ANY CONTINUOUS FUNCTION FROM A HYPERCUBE $[0,1]^d$ TO THE REAL NUMBERS, AND EVERY POSITIVE EPSILON, THERE EXISTS A SIGMOID-BASED 1-HIDDEN-LAYER NEURAL NETWORK THAT OBTAINS AT MOST EPSILON ERROR IN FUNCTIONAL SPACE - Cybenko+89

“BIG ENOUGH NETWORK CAN APPROXIMATE, BUT NOT REPRESENT, ANY SMOOTH FUNCTION. THE MATH DEMONSTRATION IMPLIES SHOWING THAT NETWORKS ARE DENSE IN THE SPACE OF TARGET FUNCTIONS”

So, the approximation theorem tells me that there exists a NN that can approximate any continuous function. It does not tell me which one: this is the alchemy of ML. It does not tell me how to find it (that is, how to minimize the loss) either!
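The flavor of the theorem can be illustrated numerically. The sketch below builds a one-hidden-layer sigmoid network whose hidden weights are random and fixed, and fits only the output layer by least squares. This is a shortcut chosen for illustration, not how the networks in this notebook are trained, and it measures the RMS error rather than the worst-case error used in the theorem. All names and scales (`w_hidden`, `b_hidden`, `n_max`) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x_grid = np.linspace(-1, 1, 200)
target = 0.1 * x_grid + np.sin(5 * x_grid)   # noiseless version of the data model

# Hidden layer: random, fixed weights and biases (an assumption, not trained)
n_max = 50
w_hidden = rng.normal(0, 10, n_max)
b_hidden = rng.normal(0, 5, n_max)
H = sigmoid(np.outer(x_grid, w_hidden) + b_hidden)   # (200, n_max)

def rms_error(n_units):
    """Fit only the output layer by least squares, using the first n_units."""
    A = np.column_stack([H[:, :n_units], np.ones_like(x_grid)])  # add output bias
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return np.sqrt(np.mean((A @ coef - target) ** 2))

errors = {n: rms_error(n) for n in (1, 5, 20, 50)}
print(errors)
```

The printed errors show how the quality of the approximation changes as hidden units are added, even though nothing here says how to find good hidden weights in general.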

# Let's try to improve it anyway...

In [None]:
tfd = tfp.distributions
tfpl = tfp.layers
tfk = tf.keras
tfkl = tf.keras.layers

ann = tfk.Sequential([
tf.keras.layers.Flatten(input_shape=(1,1)),
tfkl.Dense(30, activation='relu'),
tfkl.Dense(20, activation='relu'),
tfkl.Dense(10, activation='relu'),
tfkl.Dense(5, activation='relu'),
tfkl.Dense(1, activation=None)])
ann.compile(optimizer=tf.optimizers.Adam(),loss='mse')
ann.fit(x,y,batch_size=1,epochs=50)

In [None]:
xp = np.linspace(-1,1)
yp = ann.predict(xp)
plt.plot(xp,yp,color='red',label='ANN')
plt.plot(np.linspace(-1,1),np.linspace(-1,1)*res[0]+res[1],color='green',label='polyfit')
plt.scatter(x,y,label='data')
plt.legend()

Which is not that far from the real underlying model...

In [None]:
xp = np.linspace(-1,1)
yp = ann.predict(xp)
plt.plot(xp,yp,color='red',label='ANN')
plt.plot(np.linspace(-1,1),np.linspace(-1,1)*res[0]+res[1],color='green',label='polyfit')
plt.scatter(x,y,label='data')
plt.plot(np.linspace(-1,1),0.1*np.linspace(-1,1)+np.sin(5*np.linspace(-1,1)),label='model',color='black')
plt.legend()

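Does it mean I can build an arbitrarily complex network and still be able to optimize it? Yes! The answer is the backpropagation algorithm (Rumelhart et al., 1986a), which computes the gradient of the loss with respect to every weight by applying the chain rule layer by layer.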
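As a small illustration of the idea, the sketch below writes the chain rule by hand for the two-layer sigmoid model used earlier and checks the result against finite differences. It is a toy NumPy version under assumed parameter values, not what TensorFlow does internally (which relies on automatic differentiation).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(params, x, y):
    w1, b1, w2, b2 = params
    pred = w2 * sigmoid(w1 * x + b1) + b2
    return np.mean((pred - y) ** 2)

def backprop(params, x, y):
    """Gradient of the MSE loss via the chain rule, layer by layer."""
    w1, b1, w2, b2 = params
    # forward pass, keeping intermediate values
    h = sigmoid(w1 * x + b1)
    pred = w2 * h + b2
    # backward pass
    d_pred = 2 * (pred - y) / x.size      # dL/dpred
    d_w2, d_b2 = np.sum(d_pred * h), np.sum(d_pred)
    d_z = d_pred * w2 * h * (1 - h)       # sigmoid derivative: s(z) * (1 - s(z))
    d_w1, d_b1 = np.sum(d_z * x), np.sum(d_z)
    return np.array([d_w1, d_b1, d_w2, d_b2])

rng = np.random.default_rng(0)
x_demo = rng.uniform(-1, 1, 20)
y_demo = 0.1 * x_demo + np.sin(5 * x_demo)
params = np.array([0.5, -0.3, 1.2, 0.1])   # hypothetical starting values

# Check against central finite differences
eps = 1e-6
numeric = np.array([
    (loss(params + eps * e, x_demo, y_demo) - loss(params - eps * e, x_demo, y_demo)) / (2 * eps)
    for e in np.eye(4)
])
analytic = backprop(params, x_demo, y_demo)
print(np.max(np.abs(analytic - numeric)))
```

The backward pass reuses the values stored during the forward pass, so its cost stays comparable to one forward evaluation regardless of how many parameters there are.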

So this is it? If I add a lot of different layers, am I doing Deep Learning?

Well, not yet... deep learning also implies feature learning, which we have not touched here. However, the framework is the same.

# What about errors? Can we capture the uncertainties in the data?

In [None]:

ann = tfk.Sequential([
tf.keras.layers.Flatten(input_shape=(1,1)),
tfkl.Dense(30, activation='relu'),
tfkl.Dense(20, activation='relu'),
tfkl.Dense(10, activation='relu'),
tfkl.Dense(5, activation='relu'),
tfkl.Dense(tfpl.IndependentNormal.params_size(1),activation=None),
tfpl.IndependentNormal(1, tfd.Normal.sample)])


Wow! What's that? We are transforming our model into a probabilistic model. Our model now predicts a Normal pdf at every point. We are going to learn the mean and the standard deviation of the pdf. That way, we let the model capture not only the mean but also the uncertainty.

So now, let's compile this model. Since the output of the network is now a distribution, we are going to maximize the likelihood, or minimize the negative log likelihood.
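For a Normal distribution, the negative log-likelihood has a simple closed form, sketched below in NumPy. The function name `gaussian_nll` and the numbers used are illustrative assumptions.

```python
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under Normal(mu, sigma), per data point."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2)

y_obs = 0.0
# Same prediction error, different claimed uncertainties (hypothetical numbers)
nll = {sigma: gaussian_nll(y_obs, mu=0.5, sigma=sigma) for sigma in (0.05, 0.5, 5.0)}
for sigma, value in nll.items():
    print(sigma, value)
```

The loss penalizes both overconfident predictions (too small a standard deviation) and overly vague ones (too large), which is what pushes the network to learn a meaningful uncertainty.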

In [None]:
negloglik = lambda y, rv_y: -rv_y.log_prob(y)
ann.compile(optimizer=tf.optimizers.Adam(learning_rate=0.001),loss=negloglik)
ann.fit(x,y,batch_size=1,epochs=400)

# Let's plot the results ...

In [None]:
xp = np.linspace(-1,1)
yp = ann(xp).mean()
yp_std = ann(xp).stddev()
plt.plot(xp,yp,color='red',label='ANN (mean)')
plt.plot(xp,yp+yp_std,color='red',label='ANN (mean ± std)',ls='--')
plt.plot(xp,yp-yp_std,color='red',ls='--')  # no label, to avoid a duplicate legend entry
plt.plot(np.linspace(-1,1),np.linspace(-1,1)*res[0]+res[1],color='green',label='polyfit')
plt.scatter(x,y,label='data')
plt.plot(np.linspace(-1,1),0.1*np.linspace(-1,1)+np.sin(5*np.linspace(-1,1)),label='model',color='black')
plt.legend()


The model now captures that it is more uncertain towards the edges of the distribution...