<div >
<img src = "figures_notebook/banner.png" />
</div>


Intro
=====

### Deep Learning: Intro

-   Neural networks are simple models.

-   Their strength lies in their simplicity.

-   The model has linear combinations of inputs that are passed through
    nonlinear activation functions called nodes (or, in reference to the
    human brain, neurons).

-   Let's start with a familiar and simple model, the linear model

$$\begin{aligned}
y &= f(X) + u \\
y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + u
\end{aligned}$$

<div >
<img src = "figures_notebook/red1.png" />
</div>


Single Layer Neural Networks
============================

### Single Layer Neural Networks

-   Linear Models may miss the nonlinearities that best approximate
    $f^*(x)$

-   We can overcome these limitations of linear models and handle a more
    general class of functions by incorporating hidden layers.

-   Neural Networks are also called deep feedforward networks,
    feedforward neural networks, or multilayer perceptrons (MLPs), and
    are the quintessential deep learning models

-   A neural network takes an input vector of $p$ variables
    $$\begin{aligned}
        X = (X_1, X_2, \dots, X_p)
        \end{aligned}$$

-   and builds a nonlinear function $f(X)$ to predict the response $y$ .
    $$\begin{aligned}
        y = f(X) +u 
        \end{aligned}$$

-   What distinguishes neural networks from previous methods is the
    particular structure of the model.

<div >
<img src = "figures_notebook/red2.png" />
</div>


-   The NN model has the form $$\begin{aligned}
            f(X)    &= f \left(\beta_0 + \sum_{k=1}^K \beta_k h_k(X)\right) \\
                &= f \left(\beta_0 + \sum_{k=1}^K \beta_k g\left(w_{k0} + \sum_{j=1}^p w_{kj} X_j \right)\right)
        \end{aligned}$$

-   where 
    - $f(\cdot)$ is the output function
    - $g(\cdot)$ is an activation function specified in advance; the nonlinearity of $g(\cdot)$ is **essential**
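The formula above can be sketched in base R as a single forward pass. This is only an illustration: the helper name `single_layer_nn`, the choice of `tanh` as $g$, and the random weights are assumptions made for the example, not part of the original material.

```r
# Hypothetical helper: forward pass of a single-hidden-layer network.
# X: n x p inputs, W: p x K weights (column k belongs to hidden unit k),
# w0: K hidden intercepts, beta: K output weights, beta0: output intercept.
single_layer_nn <- function(X, W, w0, beta, beta0, g, f = identity) {
  Z <- sweep(X %*% W, 2, w0, "+")    # w_k0 + sum_j w_kj X_j for every unit k
  H <- g(Z)                          # hidden activations h_k(X)
  as.vector(f(beta0 + H %*% beta))   # output function of the linear combination
}

set.seed(123)
p <- 3; K <- 4; n <- 5               # illustrative sizes
X     <- matrix(rnorm(n * p), nrow = n)
W     <- matrix(rnorm(p * K), nrow = p)
w0    <- rnorm(K)
beta  <- rnorm(K)
beta0 <- rnorm(1)

y_hat <- single_layer_nn(X, W, w0, beta, beta0, g = tanh)
length(y_hat)                        # one prediction per row of X
```

Swapping `g` for another function (for example `function(z) pmax(z, 0)`) changes the activation without touching the rest of the structure.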



### Worked Example I: Single Layer Neural Networks

-   $p=2$, $X=(X_1,X_2)$

-   $K=2$, $h_1(X)$ and $h_2(X)$

-   $g(z)=z^2$

-   $f(x)=x$ 



$$\begin{aligned}
  f(X) &= \beta_0 + \sum_{k=1}^2 \beta_k g\left(w_{k0} + \sum_{j=1}^2 w_{kj} X_j \right)
  \end{aligned}$$
  

-   Suppose we get

    $$\begin{aligned}
    \hat{\beta}_0 &= 0, & \hat{\beta}_1 &= \frac{1}{4}, & \hat{\beta}_2 &= -\frac{1}{4} \\
    \hat{w}_{10} &= 0, & \hat{w}_{11} &= 1, & \hat{w}_{12} &= 1 \\
    \hat{w}_{20} &= 0, & \hat{w}_{21} &= 1, & \hat{w}_{22} &= -1
    \end{aligned}$$
      

- Then


    \begin{align}
      h_1(X) &= \left(0 + X_1 + X_2\right)^2 \\
      h_2(X) &= \left(0 + X_1 - X_2\right)^2 
  \end{align}

-  and plugging in 

    \begin{align}
      f(X) &=  0 + \frac{1}{4} \left(0 + X_1 + X_2\right)^2   - \frac{1}{4} \left(0 + X_1 - X_2\right)^2 \\
            &= \frac{1}{4} \left(\left(X_1 + X_2\right)^2   -  \left( X_1 - X_2\right)^2 \right) \\
            &= X_1X_2
    \end{align}
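As a quick numerical check of the algebra, the sketch below evaluates the fitted network at a few random points and compares it with the product $X_1 X_2$. The function names `h1`, `h2`, and `f_hat` are made up for this illustration.

```r
# Hidden units with g(z) = z^2 and the weights from the worked example
h1 <- function(x1, x2) (0 + x1 + x2)^2
h2 <- function(x1, x2) (0 + x1 - x2)^2

# Output with f(x) = x, beta_0 = 0, beta_1 = 1/4, beta_2 = -1/4
f_hat <- function(x1, x2) 0 + 0.25 * h1(x1, x2) - 0.25 * h2(x1, x2)

set.seed(1)
x1 <- runif(5, -3, 3)   # arbitrary test points
x2 <- runif(5, -3, 3)

all.equal(f_hat(x1, x2), x1 * x2)   # the network reproduces the interaction
```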



### NN Minimalist Theory

-   Why not a linear activation function?

-   Let's go back to our example

    -   $p=2$, $X=(X_1,X_2)$

    -   $K=2$, $h_1(X)$ and $h_2(X)$

    -   Now $g(z)=z$

-   Then 
                

\begin{align}
f(X)  &= \beta_0 + \sum_{k=1}^2 \beta_k h_k(X) \\
   &= \beta_0 + \sum_{k=1}^2 \beta_k \left(w_{k0} + \sum_{j=1}^2 w_{kj} X_j \right)
\end{align}

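- Expanding the sums

\begin{align}
  f(X) &= \beta_0 +  \beta_1 \left(w_{10} + w_{11} X_1 + w_{12} X_2 \right) +  \beta_2 \left(w_{20} + w_{21} X_1 + w_{22} X_2 \right)
\end{align}

- Collecting terms, with $\theta_0 = \beta_0 + \beta_1 w_{10} + \beta_2 w_{20}$, $\theta_1 = \beta_1 w_{11} + \beta_2 w_{21}$, and $\theta_2 = \beta_1 w_{12} + \beta_2 w_{22}$, the network collapses to a linear model

\begin{align}
  f(X) &= \theta_0 +  \theta_1  X_1 + \theta_2 X_2
\end{align}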
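The collapse to a linear model can also be checked numerically. The sketch below draws arbitrary weights (an assumption made purely for illustration), evaluates the network with $g(z)=z$, and compares it with the linear model whose coefficients $\theta$ are built from the same weights.

```r
set.seed(2)
beta0 <- rnorm(1)
beta  <- rnorm(2)
w0    <- rnorm(2)                      # w_10, w_20
W     <- matrix(rnorm(4), 2, 2)        # row k holds (w_k1, w_k2)
X     <- matrix(rnorm(10), ncol = 2)   # five arbitrary observations

# Network with the identity activation g(z) = z
f_net <- beta0 +
  beta[1] * (w0[1] + X %*% W[1, ]) +
  beta[2] * (w0[2] + X %*% W[2, ])

# Equivalent linear model: theta_0 and theta_j = beta_1 w_1j + beta_2 w_2j
theta0 <- beta0 + sum(beta * w0)
theta  <- as.vector(t(W) %*% beta)
f_lin  <- theta0 + X %*% theta

all.equal(f_net, f_lin)                # identical up to floating-point error
```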


### Worked Example II: The "Exclusive OR (XOR)" Function

-   The exclusive disjunction of a pair of propositions, (p, q), is
    supposed to mean that p is true or q is true, but not both

-   Its truth table is:

q | p | $q \oplus p$ |
---|---|------|
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 0 |

-   When exactly one of these binary values is equal to 1, the XOR
    function returns 1. Otherwise, it returns 0



-   Let's use a linear model

    $$\begin{aligned}
    y = \beta_0 + \beta_1 q + \beta_2 p +u\end{aligned}$$

    $$\begin{aligned}
     y=\left(\begin{array}{c}
    0\\
    1\\
    1\\
    0
    \end{array}\right), \quad X=\left(\begin{array}{cc}
    0 & 0\\
    0 & 1\\
    1 & 0\\
    1 & 1
    \end{array}\right), \quad \mathbf{1}=\left(\begin{array}{c}
    1\\
    1\\
    1\\
    1
    \end{array}\right)
     \end{aligned}$$

-   Least-squares solution: $\beta_0=\frac{1}{2}$, $\beta_1=0$, $\beta_2 =0$

-   Prediction $\hat{y}=\left(\begin{array}{c}
    \frac{1}{2}\\
    \frac{1}{2}\\
    \frac{1}{2}\\
    \frac{1}{2}
    \end{array}\right)$
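The least-squares fit can be reproduced with the normal equations in base R. This is a minimal sketch; the object names `Xd`, `beta_hat`, and `y_hat_lin` are chosen here for illustration.

```r
# XOR data: columns of X are q and p
X <- matrix(c(0, 0,
              0, 1,
              1, 0,
              1, 1), nrow = 4, byrow = TRUE)
y <- c(0, 1, 1, 0)

Xd <- cbind(1, X)                                  # add the column of ones
beta_hat <- solve(t(Xd) %*% Xd, t(Xd) %*% y)       # normal equations
y_hat_lin <- as.vector(Xd %*% beta_hat)

round(as.vector(beta_hat), 10)   # intercept 0.5, both slopes 0
y_hat_lin                        # the same prediction for every row
```

The linear model predicts the same value for all four inputs, so it carries no information about which rows have $y=1$.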



-   Let's use a single-layer NN containing two hidden units

-   Activation function: ReLU, $g(z)=\max\{0,z\}$

-   NN $$\begin{aligned}
        f(X)  &= \beta_0 + \sum_{k=1}^2 \beta_k g\left(w_{k0} + \sum_{j=1}^2 w_{kj} X_j \right)\end{aligned}$$


-   Suppose this is the solution to the XOR problem, where each row of $W_0$ repeats the hidden-unit intercepts $(0, -1)$ for one of the four observations

$$f(x)=\max\{0,XW+W_0\}\,\beta+\beta_0$$

$$W=\left(\begin{array}{cc}
1 & 1\\
1 & 1
\end{array}\right)$$

$$W_0=\left(\begin{array}{cc}
0 & -1\\
0 & -1\\
0 & -1\\
0 & -1
\end{array}\right)$$

$$\beta=\left(\begin{array}{c}
1 \\ -2\end{array}\right)$$

$$\beta_0 = 0$$


-   Let's work out the example step by step

\begin{align}
f(x)=\max\{0,XW+W_0\}\,\beta+\beta_0
\end{align}

$$
XW=\left(\begin{array}{cc}
0 & 0\\
0 &1\\
1 & 0\\
1 & 1
\end{array}\right)\left(\begin{array}{cc}
1 & 1\\
1 & 1
\end{array}\right)=\left(\begin{array}{cc}
0 & 0\\
1 & 1\\
1 & 1\\
2 & 2
\end{array}\right)
$$

$$
XW+W_0=\left(\begin{array}{cc}
0 & -1\\
1 & 0\\
1 & 0\\
2 & 1
\end{array}\right)
$$


$$
\max\{0,XW+W_0\}=\left(\begin{array}{cc}
0 & 0\\
1 & 0\\
1 & 0\\
2 & 1
\end{array}\right)
$$

$$
\hat{y}=\max\{0,XW+W_0\}\,\beta + \beta_0=\left(\begin{array}{cc}
0 & 0\\
1 & 0\\
1 & 0\\
2 & 1
\end{array}\right)\left(\begin{array}{c}
1 \\ -2\end{array}\right)+ 0 =\left(\begin{array}{c}
0\\
1\\
1\\
0
\end{array}\right)
$$
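The same steps can be traced in base R. In this sketch the intercepts are stored as a length-two vector `w0` and added to every row with `sweep()`, which is equivalent to adding the $4 \times 2$ matrix $W_0$; the variable names are chosen for illustration.

```r
X <- matrix(c(0, 0,
              0, 1,
              1, 0,
              1, 1), nrow = 4, byrow = TRUE)
W     <- matrix(1, nrow = 2, ncol = 2)   # all weights equal to 1
w0    <- c(0, -1)                        # one row of W_0
beta  <- c(1, -2)
beta0 <- 0

Z <- sweep(X %*% W, 2, w0, "+")          # XW + W_0
H <- pmax(Z, 0)                          # ReLU applied element by element
y_hat <- as.vector(H %*% beta + beta0)

y_hat                                    # 0 1 1 0, matching the XOR column
```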




-   In this example, we simply specified the solution and then showed that
    it achieves zero error.

-   In a real application, we cannot guess the solution.

-   Instead, we estimate the parameters with optimization methods.

-   All of the gain comes from using a nonlinear activation function.




# XOR




q | p | $q \oplus p$ |
---|---|------|
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 0 |


## Loading the libraries

In [None]:
# install.packages("pacman") #run this line if you use Google Colab

In [1]:
require('pacman')
p_load("tidyverse","keras",'caret')

Loading required package: pacman



## Creating the variables

In [2]:
X<-matrix(c(0,0,0,1,1,0,1,1),nrow=4,byrow=TRUE)
X

0,0
0,1
1,0
1,1


In [3]:
y<-matrix(c(0,1,1,0),nrow=4,byrow=TRUE)
y

0
1
1
0


## Training the network

In [4]:
model <- keras_model_sequential() 

model %>% 
  layer_dense(units = 16, activation = 'relu', input_shape = c(2)) %>% # hidden layer with ReLU activation
  layer_dense(units = 1, activation = 'sigmoid') # output layer

model %>% compile(
  optimizer = 'adam',
  loss = 'binary_crossentropy',
  metrics = c('binary_accuracy')
)

history <- model %>% fit(
  X, y, 
  epochs = 1000
)

In [5]:
summary(model)

Model: "sequential"
________________________________________________________________________________
Layer (type)                        Output Shape                    Param #     
dense_1 (Dense)                     (None, 16)                      48          
________________________________________________________________________________
dense (Dense)                       (None, 1)                       17          
Total params: 65
Trainable params: 65
Non-trainable params: 0
________________________________________________________________________________
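The parameter counts in the summary follow directly from the layer sizes: each dense unit has one weight per input plus an intercept. A short arithmetic sketch (variable names are illustrative):

```r
p <- 2    # inputs
K <- 16   # hidden units

hidden_params <- (p + 1) * K   # weights plus intercept for each hidden unit
output_params <- (K + 1) * 1   # weights plus intercept for the single output
total_params  <- hidden_params + output_params

c(hidden = hidden_params, output = output_params, total = total_params)
```

This gives 48, 17, and 65, the values reported by `summary(model)`.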


## Results

In [6]:
y_hat <- model  %>% predict(X) 
y_hat <- ifelse(y_hat>0.5,1,0)
y_hat

0
1
1
0


In [7]:
confusionMatrix(data = factor(as.numeric(y_hat), levels = 0:1), 
  reference = factor(y, levels = 0:1))

Confusion Matrix and Statistics

          Reference
Prediction 0 1
         0 2 0
         1 0 2
                                     
               Accuracy : 1          
                 95% CI : (0.3976, 1)
    No Information Rate : 0.5        
    P-Value [Acc > NIR] : 0.0625     
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0        
            Specificity : 1.0        
         Pos Pred Value : 1.0        
         Neg Pred Value : 1.0        
             Prevalence : 0.5        
         Detection Rate : 0.5        
   Detection Prevalence : 0.5        
      Balanced Accuracy : 1.0        
                                     
       'Positive' Class : 0          
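With only four observations, the interval and p-value in this output are wide even though accuracy is perfect. As a sketch of where those two numbers come from, assuming they are based on an exact binomial test as described in the caret documentation, base R's `binom.test()` gives the same values:

```r
n_correct <- 4   # all four XOR rows classified correctly
n_total   <- 4

# Exact (Clopper-Pearson) 95% interval for the accuracy
ci <- binom.test(x = n_correct, n = n_total)$conf.int

# One-sided test of accuracy against the no-information rate of 0.5
p_val <- binom.test(x = n_correct, n = n_total, p = 0.5,
                    alternative = "greater")$p.value

round(ci, 4)   # lower bound about 0.3976, upper bound 1
p_val          # 0.5^4 = 0.0625
```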
                                     