<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/marco-canas/didactica_ciencia_datos/blob/main/referentes/chollet/c1/c1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

# Appendix D. Autodiff

This appendix explains how TensorFlow's automatic differentiation (autodiff) feature works and how it compares to other solutions.

Suppose you define a function $f(x, y) = x^{2}y + y + 2$, and you need its partial derivatives $\partial f/\partial x$ and $\partial f/\partial y$, typically to perform Gradient Descent (or some other optimization algorithm).

Your main options are:

* manual differentiation,
* finite difference approximation,
* forward-mode autodiff, and
* reverse-mode autodiff.

TensorFlow implements reverse-mode autodiff, but to understand it, it’s useful to look at the other options first. 

So let’s go through each of them, starting with manual differentiation.

## Manual Differentiation

The first approach to computing derivatives is to pick up a pencil and a piece of paper and use your calculus knowledge to derive the appropriate equation.

For the function $f(x, y)$ just defined, it is not too hard; you just need to use five rules:  

* The derivative of a constant is 0.
* The derivative of $\lambda x$ is $\lambda$ (where $\lambda$ is a constant).
* The derivative of $x^{\lambda}$ is $\lambda x^{\lambda - 1}$, so the derivative of $x^{2}$ is $2x$.
* The derivative of a sum of functions is the sum of these functions’ derivatives.
* The derivative of $\lambda$ times a function is $\lambda$ times its derivative.  

From these rules, you can derive Equation D-1.

Equation D-1. Partial derivatives of $f(x, y)$

\begin{align*}
\frac{\partial f}{\partial x} & = \frac{\partial(x^{2}y)}{\partial x} + \frac{\partial y}{ \partial x} + \frac{\partial 2}{\partial x} = y\frac{\partial x^{2}}{\partial x} + 0 + 0 = 2xy  \\
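\frac{\partial f}{\partial y} & = \frac{\partial(x^{2}y)}{\partial y} + \frac{\partial y}{\partial y} + \frac{\partial 2}{\partial y} = x^{2} + 1 + 0 = x^{2} + 1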
\end{align*}
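As a quick sanity check, the results of applying these rules, $\partial f/\partial x = 2xy$ and $\partial f/\partial y = x^{2} + 1$, can be written as plain Python functions and evaluated at a point. This is only an illustrative sketch; the names `df_dx_manual` and `df_dy_manual` are not from the original text.

```python
def df_dx_manual(x, y):
    # d(x^2 y)/dx = 2xy; the terms y and 2 do not depend on x
    return 2 * x * y

def df_dy_manual(x, y):
    # d(x^2 y)/dy = x^2, d(y)/dy = 1, d(2)/dy = 0
    return x**2 + 1

print(df_dx_manual(3, 4))  # 24
print(df_dy_manual(3, 4))  # 10
```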


This approach can become very tedious for more complex functions, and you run the risk of making mistakes.

Fortunately, there are other options.

Let’s look at finite difference approximation now.

## Finite Difference Approximation

Recall that the derivative $h'(x_{0})$ of a function $h(x)$ at a point $x_{0}$ is the slope of the function at that point.

More precisely, the derivative is defined as the limit of the slope of a straight line going through this point $x_{0}$ and another point $x$ on the function, as $x$ gets infinitely close to $x_{0}$ (see Equation D-2).

\begin{align*}
h'(x_{0}) & = \lim_{x \to x_{0}} \frac{h(x) - h(x_{0})}{x - x_{0}} \\
& = \lim_{\epsilon \to 0} \frac{h(x_{0} + \epsilon) - h(x_{0})}{\epsilon}
\end{align*}


So, if we wanted to calculate the partial derivative of $f(x, y)$ with regard to $x$ at $x = 3$ and $y = 4$, we could compute $f(3 + \epsilon, 4) - f(3, 4)$ and divide the result by $\epsilon$, using a very small value for $\epsilon$.

This type of numerical approximation of the derivative is called a finite difference approximation, and this specific equation is called Newton’s difference quotient. 

That’s exactly what the following code does:

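```python
def f(x, y):
    return x**2 * y + y + 2

def derivative(f, x, y, x_eps, y_eps):
    # Newton's difference quotient; pass a nonzero step for exactly one of x or y
    return (f(x + x_eps, y + y_eps) - f(x, y)) / (x_eps + y_eps)

df_dx = derivative(f, 3, 4, 0.00001, 0)  # partial derivative w.r.t. x at (3, 4)
df_dy = derivative(f, 3, 4, 0, 0.00001)  # partial derivative w.r.t. y at (3, 4)
```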

Unfortunately, the result is imprecise (and it gets worse for more complicated functions). 

The correct results are respectively 24 and 10, but instead we get:

```python
print(df_dx)  # 24.000039999805264
print(df_dy)  # 10.000000000331966
```
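The size of the error in the first result is easy to explain for this particular function. Expanding the difference quotient gives $(f(x + \epsilon, y) - f(x, y))/\epsilon = 2xy + \epsilon y$, so at $y = 4$ and $\epsilon = 10^{-5}$ the approximation is off by about $4 \times 10^{-5}$. The sketch below (the helper name `forward_difference_dx` is just for illustration) shows the error shrinking in proportion to $\epsilon$:

```python
def f(x, y):
    return x**2 * y + y + 2

def forward_difference_dx(f, x, y, eps):
    # Newton's difference quotient with respect to x only
    return (f(x + eps, y) - f(x, y)) / eps

exact = 2 * 3 * 4  # manual derivative 2xy at (3, 4)
errors = {}
for eps in (1e-1, 1e-3, 1e-5):
    approx = forward_difference_dx(f, 3, 4, eps)
    errors[eps] = approx - exact
    # For this f, the error is approximately eps * y = 4 * eps
    print(f"eps={eps:g}  approx={approx:.10f}  error={errors[eps]:.3e}")
```

Making $\epsilon$ arbitrarily small does not help indefinitely: below some point, floating-point round-off in the subtraction starts to dominate the result.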


Notice that to compute both partial derivatives, we have to call `f()` at least three times (we called it four times in the preceding code, but it could be optimized).

If there were 1,000 parameters, we would need to call `f()` at least 1,001 times. When you are dealing with large neural networks, this makes finite difference approximation way too inefficient.
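To make the call count concrete, here is a minimal sketch of a finite-difference gradient for a function of $n$ parameters. The names `numerical_gradient`, `counting_f`, and `call_count` are hypothetical, introduced only for this illustration. The function is evaluated once at the base point and once per parameter, so $n + 1$ times in total:

```python
def numerical_gradient(func, params, eps=1e-5):
    """Forward-difference gradient: 1 call at the base point + 1 per parameter."""
    base = func(params)
    grad = []
    for i in range(len(params)):
        shifted = list(params)
        shifted[i] += eps  # nudge only the i-th parameter
        grad.append((func(shifted) - base) / eps)
    return grad

call_count = {"n": 0}

def counting_f(params):
    # f(x, y) = x^2 y + y + 2, taking a list of parameters and counting calls
    call_count["n"] += 1
    x, y = params
    return x**2 * y + y + 2

grad = numerical_gradient(counting_f, [3.0, 4.0])
print(grad)             # approximately [24, 10]
print(call_count["n"])  # 3 calls for 2 parameters
```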

However, this method is so simple to implement that it is a great tool to check that the other methods are implemented correctly. 

For example, if it
disagrees with your manually derived function, then your function probably
contains a mistake.
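A minimal sketch of such a check, assuming the manually derived results $\partial f/\partial x = 2xy$ and $\partial f/\partial y = x^{2} + 1$ obtained with the differentiation rules in the previous section (the function names and the tolerance are illustrative choices):

```python
import math

def f(x, y):
    return x**2 * y + y + 2

def manual_gradient(x, y):
    # Hand-derived partial derivatives of f
    return 2 * x * y, x**2 + 1

def finite_difference_gradient(f, x, y, eps=1e-5):
    fxy = f(x, y)  # reuse the base value, so f is called three times in total
    return (f(x + eps, y) - fxy) / eps, (f(x, y + eps) - fxy) / eps

manual = manual_gradient(3, 4)
approx = finite_difference_gradient(f, 3, 4)
# A loose tolerance, since the finite difference is itself only approximate
ok = all(math.isclose(m, a, rel_tol=1e-4) for m, a in zip(manual, approx))
print(ok)  # True
```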

So far, we have considered two ways to compute gradients: using manual differentiation and using finite difference approximation. 

Unfortunately, both are fatally flawed when it comes to training a large-scale neural network.

So let’s turn to autodiff, starting with forward mode.

## Forward-Mode Autodiff

Figure D-1 shows how forward-mode autodiff works on an even simpler function, $g(x, y) = 5 + xy$.

The graph for that function is represented on the left. 

After forward-mode autodiff, we get the graph on the right, which represents the partial derivative $\partial g/\partial x = 0 + (0 \times x + y \times 1) = y$ (we could similarly obtain the partial derivative with regard to $y$).

Figure D-1. Forward-mode autodiff
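One common way to implement forward-mode autodiff in code is with dual numbers: each value carries its derivative with respect to one chosen input, and every operation updates both. The following is a minimal sketch, not TensorFlow's implementation; the class name `Dual` and its fields are hypothetical. It reproduces $\partial g/\partial x = y$ for $g(x, y) = 5 + xy$:

```python
class Dual:
    """A value paired with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(other):
        # Plain numbers are constants, so their derivative is 0
        return other if isinstance(other, Dual) else Dual(other, 0.0)

    def __add__(self, other):
        other = Dual._lift(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def g(x, y):
    return 5 + x * y

# Seed x with derivative 1 and y with derivative 0 to get dg/dx
dg_dx = g(Dual(3.0, 1.0), Dual(4.0, 0.0)).deriv
# Seed y instead to get dg/dy, which takes a second forward pass
dg_dy = g(Dual(3.0, 0.0), Dual(4.0, 1.0)).deriv
print(dg_dx, dg_dy)  # 4.0 3.0
```

Note that one forward pass yields the derivative with respect to a single input, so computing all partial derivatives this way takes one pass per input.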