### Deep learning and neural networks: 

* Deep learning is a subfield of machine learning that focuses on neural networks with many layers. 
* A neural network is a computational model inspired by the way the human brain processes information. It consists of interconnected nodes (neurons) organized in layers. Each neuron receives input from previous neurons, processes it, and passes the output to the next neurons.

In [1]:
from numpy import asarray
from sklearn.datasets import make_regression
from keras.models import Sequential
from keras.layers import Dense
 

The code imports the required libraries: numpy for numerical operations, sklearn for generating a regression dataset, and keras for building and training a neural network model.

In [2]:
def get_dataset():
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=3, random_state=2)
    return X, y

The get_dataset() function generates a regression dataset using make_regression() from sklearn. It creates 1000 samples with 10 features, out of which 5 are informative. There are 3 target variables.

In [3]:
def get_model(n_inputs, n_outputs):
    model = Sequential()
    model.add(Dense(20, input_dim=n_inputs, kernel_initializer='he_uniform', activation='relu'))
    model.add(Dense(n_outputs, kernel_initializer='he_uniform'))
    model.compile(loss='mae', optimizer='adam')
    return model

The get_model() function defines a sequential neural network model using Sequential() from keras. It has a single hidden layer with 20 neurons, which uses the ReLU activation function. The output layer has n_outputs neurons. The model is compiled with the mean absolute error (MAE) loss function and the Adam optimizer.
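As a quick sanity check on this architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes the dataset from `get_dataset()` (10 inputs, 3 outputs); the helper name `dense_params` is made up for illustration and is not part of Keras.

```python
def dense_params(n_in, n_out):
    """Parameters in a fully connected layer (hypothetical helper):
    one weight per input-output pair plus one bias per output."""
    return n_in * n_out + n_out

n_inputs, n_hidden, n_outputs = 10, 20, 3

hidden = dense_params(n_inputs, n_hidden)    # 10*20 + 20 = 220
output = dense_params(n_hidden, n_outputs)   # 20*3 + 3 = 63
total = hidden + output                      # 220 + 63 = 283

print(hidden, output, total)
```

If Keras is available, `model.summary()` should report the same total for this model.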

### Feedforward neural network: 
This code implements a feedforward neural network: information flows in one direction, from the input to the output, without any loops. The network has two layers of weights: a hidden layer of 20 ReLU neurons that receives the input features, and a linear output layer.

### Activation functions:
Neurons in a neural network use activation functions to introduce non-linearity into the model. In this case, the hidden layer uses the ReLU (Rectified Linear Unit) activation function, defined as ReLU(x) = max(0, x). If the input is positive, the function returns the input value; if the input is negative, it returns 0. The output layer has no activation function, so it can produce any real value, which is what a regression target needs.
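A minimal NumPy sketch of ReLU, to make the definition concrete. The function name `relu` and the sample values are chosen here for illustration only.

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: keeps positive values, replaces negatives with 0."""
    return np.maximum(0, x)

values = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
activated = relu(values)

# Negative entries become 0; positive entries pass through unchanged.
print(activated)
```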

### Loss function and optimization: 
To train a neural network, we need a loss function that measures the difference between the predicted output and the actual output (ground truth). In this case, the Mean Absolute Error (MAE) is used as the loss function. The goal of training is to minimize the loss function by adjusting the weights and biases of the network. The Adam optimizer is used in this code to update the weights and biases during training.

Now, let's revisit the mathematical representation of the neural network:

Hidden layer: `h1 = ReLU(W1 * x + b1)`
Output layer: `y_pred = W2 * h1 + b2`

Here, x is the input feature vector, W1 and W2 are the weight matrices, b1 and b2 are the bias vectors, and h1 is the output of the first layer (ReLU activation).
During training, the weights and biases (W1, W2, b1, and b2) are updated to minimize the loss function (MAE). 
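To see the two equations above at work, here is a small worked example in NumPy. All of the numbers are made up for illustration (2 inputs, 2 hidden neurons, 3 outputs) so the arithmetic can be checked by hand; they are not the trained weights of the model in this notebook.

```python
import numpy as np

x = np.array([1.0, 2.0])               # input feature vector (2 features)

W1 = np.array([[1.0, -1.0],
               [0.5,  2.0]])           # first layer: 2 neurons x 2 inputs
b1 = np.array([0.5, -1.0])

W2 = np.array([[ 2.0,  1.0],
               [-1.0,  0.5],
               [ 0.0, -2.0]])          # output layer: 3 outputs x 2 neurons
b2 = np.array([0.1, 0.2, 0.3])

z1 = W1 @ x + b1                       # [-0.5, 3.5]
h1 = np.maximum(0, z1)                 # ReLU zeroes the negative entry: [0.0, 3.5]
y_pred = W2 @ h1 + b2                  # [3.6, 1.95, -6.7]

print(h1, y_pred)
```

Because the first neuron's pre-activation is negative, ReLU switches it off and only the second neuron contributes to the three outputs.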

In [4]:
# load dataset
X, y = get_dataset()
n_inputs, n_outputs = X.shape[1], y.shape[1]
# get model
model = get_model(n_inputs, n_outputs)
# fit the model on all data
model.fit(X, y, verbose=0, epochs=100)

<keras.callbacks.History at 0x167e9524820>

In [5]:
# make a prediction for new data
row = [-0.99859353,2.19284309,-0.42632569,-0.21043258,-1.13655612,-0.55671602,-0.63169045,-0.87625098,-0.99445578,-0.3677487]
newX = asarray([row])
yhat = model.predict(newX)
print('Predicted: %s' % yhat[0])

Predicted: [-142.8951    -83.09769   -92.076965]


Once the model is trained, you can use it to make predictions for new input samples using the same mathematical equation:

- `yhat = W2 * ReLU(W1 * x + b1) + b2`

#### The equation above shows how a prediction is computed mathematically. But how exactly does it produce 3 outputs from a single set of features?

- The output layer of the neural network has a number of neurons equal to the number of target variables (3 in this case). 

- Each neuron in the output layer is responsible for predicting one of the target variables. The weight matrix W2 and bias vector b2 in the output layer are adjusted accordingly to accommodate multiple outputs.
- Let's denote the output layer's weight matrix as W2 with dimensions (20, 3) and the bias vector b2 with dimensions (1, 3). The output of the first layer, h1, has dimensions (1, 20). 
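- Written with row vectors, which matches these dimensions, the order of the product is reversed relative to the column-vector form `W2 * h1 + b2` used earlier, and the equation for the output layer becomes: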

    `y_pred = h1 * W2 + b2`

Here, y_pred is a (1, 3) matrix, where each element corresponds to the prediction of one of the target variables.

- To break it down further, let's represent W2 as three column vectors w2_1, w2_2, and w2_3, and b2 as three scalar values b2_1, b2_2, and b2_3. The equation for the output layer can be rewritten as:

    `y_pred = [h1 * w2_1 + b2_1, h1 * w2_2 + b2_2, h1 * w2_3 + b2_3]`

- Each element in the y_pred vector represents the prediction for a specific target variable. The neural network learns to adjust the weights and biases during training to minimize the loss function (MAE) for all target variables simultaneously.

In summary, multi-output prediction in this neural network is achieved by having multiple neurons in the output layer, each responsible for predicting one target variable. The weight matrix and bias vector in the output layer are adjusted to accommodate multiple outputs, and the final prediction is a vector containing the predicted values for all target variables.
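The column-by-column view can be checked numerically. The sketch below uses random stand-in values for `h1`, `W2`, and `b2` (not the trained weights) with the shapes given above, and confirms that computing each output neuron separately gives the same result as the single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in values with the shapes used in the text; not the trained weights.
h1 = np.maximum(0, rng.normal(size=(1, 20)))   # hidden activations for one sample
W2 = rng.normal(size=(20, 3))                  # one column per output neuron
b2 = rng.normal(size=(1, 3))                   # one bias per output neuron

# All three outputs at once: (1, 20) @ (20, 3) + (1, 3) -> (1, 3)
y_pred = h1 @ W2 + b2

# The same outputs, one neuron at a time: h1 * w2_j + b2_j
per_neuron = np.array([[
    (h1 @ W2[:, j] + b2[0, j]).item()
    for j in range(3)
]])

print(y_pred.shape, np.allclose(y_pred, per_neuron))
```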

In [7]:
model.get_weights()

[array([[ 3.02841210e+00, -2.93569851e+00,  2.81300354e+00,
          3.35917091e+00, -2.15088391e+00,  2.52842236e+00,
         -1.66626668e+00, -3.49653006e+00, -2.04469609e+00,
          2.97652483e+00, -3.07689285e+00,  2.19050264e+00,
          1.33807743e+00, -3.60908389e+00, -2.94842243e+00,
          2.95717764e+00,  3.30104089e+00,  7.15663493e-01,
         -1.08380914e+00,  2.44521070e+00],
        [ 1.25329703e-01, -2.04313368e-01,  2.18125079e-02,
         -7.02505708e-02,  6.52799785e-01, -6.97129190e-01,
         -1.88534975e-01, -4.38837614e-03, -8.32859874e-01,
          2.76842743e-01,  2.95069158e-01, -2.70908892e-01,
         -2.51155883e-01,  7.41884634e-02, -1.65855169e-01,
          4.29592788e-01,  1.80095017e-01,  9.38812047e-02,
          9.68291983e-02, -3.90892267e-01],
        [ 1.23813558e+00,  4.15536389e-02,  7.50992894e-01,
          2.16997519e-01, -2.81887919e-01,  9.49331284e-01,
         -1.35934126e+00, -1.28497660e+00, -1.61139274e+00,
          3.

### So, how exactly do the loss function and the optimizer come into the picture?

#### Loss function: 
The loss function measures the difference between the predicted output and the actual output (ground truth). It quantifies how well the neural network is performing. 
- In this case, the Mean Absolute Error (MAE) is used as the loss function. The MAE is calculated as the average of the absolute differences between the predicted and actual values for each target variable.
- Mathematically, for a single data point, the MAE is defined as:

    `MAE = (1/n) * Σ|y_pred_i - y_true_i|`
    
where n is the number of target variables, y_pred_i is the predicted value for the i-th target variable, and y_true_i is the actual value for the i-th target variable. During training, this per-sample value is averaged over the samples in each batch.
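A short worked example of the formula for one data point with three targets. The values of `y_true` and `y_pred` are invented for illustration.

```python
import numpy as np

y_true = np.array([10.0, -4.0, 2.5])   # made-up ground truth for 3 targets
y_pred = np.array([12.0, -5.0, 2.0])   # made-up predictions

abs_errors = np.abs(y_pred - y_true)   # [2.0, 1.0, 0.5]
mae = abs_errors.mean()                # (2.0 + 1.0 + 0.5) / 3

print(abs_errors, mae)
```

Each target contributes equally to the loss, so a large error on one target is not hidden by small errors on the others, but it is also not penalized more than proportionally, as it would be with a squared-error loss.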

#### Optimizer: 
The optimizer is responsible for updating the weights and biases of the neural network to minimize the loss function. It determines how the model learns from the data. 
- In this case, the Adam optimizer is used. Adam is an adaptive learning rate optimization algorithm that combines the advantages of two other popular optimizers, AdaGrad and RMSProp. It adjusts the learning rate for each weight and bias individually, making it suitable for a wide range of problems.
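To illustrate what "adjusting the learning rate for each weight individually" means, here is a minimal sketch of a single Adam update in NumPy. It follows the update rule from the original Adam paper with that paper's suggested defaults; it is not Keras's implementation, whose default settings may differ slightly. The function name `adam_step` and the example gradient are assumptions made for illustration.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (illustrative sketch, not the Keras implementation).

    m and v are running averages of the gradient and the squared gradient;
    t is the 1-based step counter used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)            # correct the bias toward zero at early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

param = np.array([0.0, 0.0])
grad = np.array([2.0, -0.5])                # two gradients of very different size
m = np.zeros_like(param)
v = np.zeros_like(param)

param, m, v = adam_step(param, grad, m, v, t=1)
print(param)
```

On this first step both parameters move by about `lr` (0.001) in the direction opposite to their gradient, even though one gradient is four times larger than the other. Dividing by the square root of the squared-gradient average is what gives each parameter its own effective step size.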



#### To summarize the overall training process:

1. Initialize the network's weights randomly (`he_uniform` in this model). Keras initializes the biases of `Dense` layers to zero by default.

2. For each epoch (one pass over the entire dataset), split the data into mini-batches (Keras uses 32 samples per batch by default) and, for each batch:

    a. Forward pass: calculate the predicted output y_pred for each input sample in the batch using the current weights and biases.

    b. Calculate the loss: compute the MAE between the predicted output y_pred and the actual output y_true.

    c. Backward pass: compute the gradients of the loss function with respect to the weights and biases using backpropagation.

    d. Update: adjust the weights and biases using the Adam optimizer, based on the computed gradients and its adaptive learning rates.

Steps 2a-2d repeat for the specified number of epochs (100 in this example). Stopping earlier, once the loss has converged, requires an early-stopping callback.

After training, the neural network can be used to make predictions for new input samples using the learned weights and biases.
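The training steps above can be sketched end to end in plain NumPy. This is an illustration under several assumptions, not what Keras does internally: the data is a small synthetic stand-in for the `make_regression` dataset, the gradients are derived by hand for this specific two-layer network, and plain mini-batch gradient descent is swapped in for Adam to keep the code short. All names (`he_uniform`, `forward`, `mae`) are local to this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 samples, 10 features, 3 targets.
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=(10, 3))

def he_uniform(fan_in, fan_out):
    """He-uniform initialization: uniform in [-limit, limit], limit = sqrt(6 / fan_in)."""
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Step 1: random weights, zero biases.
W1, b1 = he_uniform(10, 20), np.zeros(20)
W2, b2 = he_uniform(20, 3), np.zeros(3)

def forward(inputs):
    z1 = inputs @ W1 + b1          # pre-activation of the hidden layer
    h1 = np.maximum(0, z1)         # ReLU
    return z1, h1, h1 @ W2 + b2    # linear output layer

def mae(y_pred, y_true):
    return np.abs(y_pred - y_true).mean()

lr, batch_size, epochs = 0.05, 32, 100
losses = [mae(forward(X)[2], y)]   # loss before any training

for epoch in range(epochs):
    order = rng.permutation(len(X))            # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]

        # 2a. Forward pass
        z1, h1, y_pred = forward(xb)

        # 2b. Loss is mae(y_pred, yb); its gradient with respect to y_pred:
        d_out = np.sign(y_pred - yb) / y_pred.size

        # 2c. Backward pass (chain rule through both layers)
        dW2 = h1.T @ d_out
        db2 = d_out.sum(axis=0)
        d_z1 = (d_out @ W2.T) * (z1 > 0)       # ReLU passes gradient only where z1 > 0
        dW1 = xb.T @ d_z1
        db1 = d_z1.sum(axis=0)

        # 2d. Update (plain gradient descent here, standing in for Adam)
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    losses.append(mae(forward(X)[2], y))       # full-dataset loss after this epoch

print(f"MAE before training: {losses[0]:.3f}, after: {losses[-1]:.3f}")
```

The loss recorded after the last epoch should be lower than the loss before training. The exact values depend on the random seed and the chosen learning rate.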