### Introduction

There is just one more rule that we need to learn.  It's called the chain rule.  

### Reviewing our notation

Before getting into the chain rule, let's review how we write functions. When we write a function as $f(x)$, we are saying that the output of the function depends on the value of $x$. This means that different values of $x$ can produce different outputs. For example, we would write $f(x) = x^2$, as the output of $x^2$ depends on the value of $x$. Similarly, we would write $f(y) = y^3$, as the output depends on the value of $y$.

So whatever is in the parentheses just means "is dependent on." Writing $f(y) = x$ would be wrong: the output is not a function of $y$, it is a function of $x$, meaning its output changes with different values of $x$. The output of a function can even be dependent on other functions. For example, consider the following:

$$g = x + 1 $$ and 
$$f = g + 1$$

To indicate that $g$ depends on the value of $x$, we write this function as $g(x) = x + 1$. As the value of $x$ changes, the output of that function changes. The function $f$, in turn, varies with the output of $g$. When $x = 3$, $g(x) = 3 + 1 = 4$, and so $f = 4 + 1 = 5$. So the output of $f$ depends on $g$, and the output of $g$ depends on $x$. Rewriting our two functions to express this, we would say:

$$g(x) = x + 1 $$ and 
$$f(g(x)) = g(x) + 1$$

That last expression is read "$f$ of $g$ of $x$", which means $f$ depends on the function $g$, which in turn depends on the variable $x$.

Now we can plug in different values of $x$. For example, with $x = 3$ we get $g(3) = 4$ and $f(g(3)) = 5$.
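
To make the idea of one function depending on another concrete, here is a minimal Python sketch of the two functions above. The names `g`, `f`, and `f_of_g` are just illustrative choices, not part of any library.

```python
def g(x):
    # g depends directly on x
    return x + 1

def f(g_value):
    # f depends on the output of g, not directly on x
    return g_value + 1

def f_of_g(x):
    # Composition: evaluate g first, then feed its output to f
    return f(g(x))

print(g(3))       # 4
print(f_of_g(3))  # 5
```

Changing the argument passed to `f_of_g` changes the output of `g`, which in turn changes the output of `f`.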

We know that $\frac{d}{dx}x^2 = 2x$, and $\frac{d}{dy}y^2 = 2y$. That is, it does not matter what the variable is called. We can still take the derivative.

We read $\frac{d}{dx}x^2$ as the derivative of $x^2$ "with respect to $x$", and $\frac{d}{dy}y^2$ as the derivative of $y^2$ "with respect to $y$". Think of "with respect to $x$" as meaning the change in the output of the function from budging $x$ a little bit.
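
One way to see what "budging $x$ a little bit" means is to approximate the derivative numerically. The sketch below is only an illustration: the helper name `approx_derivative` and the step size `delta` are assumptions made for this example.

```python
def approx_derivative(func, x, delta=1e-6):
    # Change in output divided by the small change in input
    return (func(x + delta) - func(x)) / delta

def square(x):
    return x ** 2

slope = approx_derivative(square, 3.0)
print(slope)  # approximately 2 * 3 = 6
```

The smaller the budge, the closer the result gets to the exact derivative $2x$, up to floating-point rounding.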

#### The chain rule

Now imagine the following function:  

$$f = g^2$$
$$g = x$$

What is the derivative $\frac{df}{dx}$ when $f = g^2$?

As you can see, the derivative $\frac{df}{dx}$ of $f = g^2$ is a little tricky to think about. Remember that in taking the derivative $\frac{df}{dx}$, we are really asking what happens to the output of the function $f$ when we change $x$ a little bit. But our function $f$ does not directly depend on $x$. Well, the chain rule tells us how to solve just this. The chain rule answers the question: how do we take the derivative $\frac{df}{dx}$ when the function $f$ does not directly depend on $x$, but does depend on another function $g$ that depends on $x$?

The chain rule states that given a function $f$ that depends on a function $g$, which in turn depends on $x$, $\frac{df}{dx} = \frac{df}{dg} * \frac{dg}{dx}$. Yes, it's a mouthful, but it's not so bad in practice.

Let's apply our chain rule step by step by taking the derivative $\frac{df}{dx}$ where:

$$f = g^2 $$
$$g = x $$

* First, we take the derivative of $f$ with respect to $g$: $\frac{df}{dg} = \frac{d}{dg}(g^2) = 2*g$.
* Then we take the derivative of $g$ with respect to $x$: $\frac{dg}{dx} = \frac{d}{dx}x = 1$.

Now, plugging these derivatives back into our equation $\frac{df}{dx} = \frac{df}{dg} * \frac{dg}{dx}$, we have:

$\frac{df}{dx} = 2*g*1 = 2*g$. Finally, we know that $g = x$, so we have $\frac{df}{dx} = 2*g = 2*x$.
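
The same steps can be sketched in Python. This is only an illustration under the definitions above ($f = g^2$, $g = x$), and the function names `df_dg`, `dg_dx`, and `df_dx` are hypothetical names chosen for this example.

```python
def g(x):
    return x            # inner function: g = x

def df_dg(g_value):
    return 2 * g_value  # derivative of g^2 with respect to g

def dg_dx(x):
    return 1            # derivative of x with respect to x

def df_dx(x):
    # Chain rule: df/dx = df/dg * dg/dx, with df/dg evaluated at g(x)
    return df_dg(g(x)) * dg_dx(x)

print(df_dx(5))  # 10, matching 2 * x
```

Note that `df_dg` is evaluated at the output of `g`, not at `x` itself. Here the two happen to be equal because $g = x$.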

Now let's try a more involved example. Let $f(z) = z^2$, where $z(x) = 4x^3 + x^2 + 3x$. Remember, the chain rule says that the derivative $\frac{df}{dx}$ of $f(z)$, where $z$ is a function that depends on $x$, is $\frac{df}{dz}*\frac{dz}{dx}$. So let's solve for $\frac{df}{dz}$ and $\frac{dz}{dx}$ and plug them into our chain rule.

We know from above that
$$\frac{df}{dz} = \frac{d}{dz}z^2 = 2z$$
And as we let $z(x) = 4x^3 + x^2 + 3x$,
$$\frac{dz}{dx} = \frac{d}{dx}(4x^3 + x^2 + 3x) = 12x^2 + 2x + 3$$

**Now let's just plug these two formulas into our chain rule**, writing the inner function as $g(x)$ in place of $z(x)$.

$$f(g(x)) = g(x)^2 $$

$$f'(g(x)) = \frac{d}{dg}g(x)^2 = 2*g(x)$$

And we let $g(x) = 4x^3 + x^2 + 3x$, so using our above rules, $g'(x) = 12x^2 + 2x + 3$.

So now, plugging into our chain rule, we have:

$\frac{d}{dx}f(g(x)) = 2*(4x^3 + x^2 + 3x)*(12x^2 + 2x + 3) = (8x^3 + 2x^2 + 6x)(12x^2 + 2x + 3)$
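
As a sanity check, we can compare the chain rule result against a numerical approximation. The sketch below assumes the definitions above, $f(g(x)) = g(x)^2$ with $g(x) = 4x^3 + x^2 + 3x$. The helper names and the step size `delta` are assumptions made for illustration.

```python
def g(x):
    return 4 * x**3 + x**2 + 3 * x

def g_prime(x):
    return 12 * x**2 + 2 * x + 3

def f_of_g(x):
    return g(x) ** 2

def chain_rule_derivative(x):
    # df/dx = 2 * g(x) * g'(x)
    return 2 * g(x) * g_prime(x)

def numeric_derivative(func, x, delta=1e-6):
    # Symmetric budge of x in both directions
    return (func(x + delta) - func(x - delta)) / (2 * delta)

x = 1.0
analytic = chain_rule_derivative(x)      # 2 * 8 * 17 = 272.0
numeric = numeric_derivative(f_of_g, x)  # should be very close to 272
print(analytic, numeric)
```

At $x = 1$ we have $g(1) = 8$ and $g'(1) = 17$, so the chain rule gives $2 * 8 * 17 = 272$, and the numerical estimate should agree to several decimal places.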

And let's leave it there for now.


### Summary

***
This section is a little different in that we did not intuitively prove the rules we will apply.  But hopefully, you still have an intuition for the derivative of a function.  The derivative of a function at a given point is simply the rate of change of that function at that point.  And we calculate it with the formul

In this section we saw that we can find the minimum error by following the line tangent to a graph. And we can move along by following the line tangent to the spot where we are currently located. We then saw how this holds for a two-dimensional graph, by considering how our error changes with respect to a change in $b$. We identified this change in output from an infinitesimally small change in input as our derivative.

Then we considered three rules that allow us to calculate our derivative. The trickiest of these is the power rule, which says that if $f(b) = b^n$, then $f'(b) = n * b^{n - 1}$. We still haven't seen how derivatives give us a way to understand gradient descent, but we will shortly when we consider how to take derivatives of functions with multiple variables, like an error function that is dependent on both $m$ and $b$.

But first, let's practice what we know about derivatives in a lab.