## Scalars 

A scalar has zero dimensions: it is a single value, such as a person's height.

## Vectors

There are two types: row vectors and column vectors.

A row vector would be $\left[\begin{array}{ccc}
    1 & 2 & 3\\
\end{array}
\right]$, while a column vector would be $\left[\begin{array}{c}
    1\\
    2\\
    3
\end{array}
\right]$.

Both can store the same values, and either way a vector has one dimension.

Elements are indexed by row first, then column.

## Matrices
$\left[\begin{matrix}
    1 & 2 & 3\\
    4 & 5 & 6\\
    7 & 8 & 9
\end{matrix}
\right]$

The most common way to work with numbers in NumPy is through an `ndarray` object. It is similar to a Python list, but can have any number of dimensions. It also supports fast math operations, because NumPy is written in C. An `ndarray` can hold data in any form: scalars, vectors, matrices, or tensors. **But every item in the array must have the same type.**

In [8]:
import numpy as np

# Scalar
s = np.array(18)
print(s)

# Vector 
v = np.array([1,2,3,4])
print(v)



# Matrix
m = np.array([[1,2,3], [4,5,6], [7,8,9]])

# Tensor
t = np.array([[[[1],[2]],[[3],[4]],[[5],[6]]],[[[7],[8]],
    [[9],[10]],[[11],[12]]],[[[13],[14]],[[15],[16]],[[17],[18]]]])

print(s.shape, 'Scalar Shape')
print(v.shape, 'Vector Shape')
print(m.shape, 'Matrix Shape')
print(t.shape, 'Tensor Shape')


18
[1 2 3 4]
() Scalar Shape
(4,) Vector Shape
(3, 3) Matrix Shape
(3, 3, 2, 1) Tensor Shape
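
The "same type" rule is easy to see in practice. The following is a small sketch (the variable names are made up for illustration) of what NumPy does when a list mixes types: it converts every item to one common type rather than keeping them separate.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_numbers = np.array([1, 2.5, 3])     # the ints are converted to floats
mixed_text = np.array([1, "two", 3])      # everything becomes a string

print(ints.dtype.kind)           # 'i' means an integer type
print(mixed_numbers.dtype)       # a float type
print(mixed_text.dtype.kind)     # 'U' means a unicode string type
print(mixed_text)
```

Checking `dtype` after building an array is a quick way to catch an accidental conversion like the last one.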


### Changing Shapes

Sometimes you need to change the shape of your data without changing its contents.

We can do this with `x.reshape()`.


In [11]:
x = v.reshape(4,1)
print(x.shape)

(4, 1)


One more thing should be noted. Experienced NumPy users often use a special slicing syntax instead of calling `reshape`. Indexing with `None` inserts a new axis of length 1 at that position, so `v[None, :]` has shape `(1, 4)` and `v[:, None]` gives the same result as `v.reshape(4, 1)`:

`x = v[:, None]`

In [16]:
x = v[None, :]
print(x.shape)

# or

x_reshape = v[:, None]
print(x_reshape.shape)



(1, 4)
(4, 1)
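
As a small aside, the `None` in that slicing syntax is the same object as `np.newaxis`, which some people find easier to read. The sketch below rebuilds the vector `v` from above and also shows `reshape` with `-1`, which asks NumPy to infer that dimension. The names `col_a`, `col_b` and `col_c` are only for illustration.

```python
import numpy as np

v = np.array([1, 2, 3, 4])

# np.newaxis is an alias for None, so these two lines are equivalent.
col_a = v[:, None]
col_b = v[:, np.newaxis]

# -1 asks reshape to infer that dimension from the array's size.
col_c = v.reshape(-1, 1)

print(col_a.shape, col_b.shape, col_c.shape)
print(np.array_equal(col_a, col_c))
```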


### Element-wise Operations

To apply an operation to every element, use the ordinary arithmetic operators or the matching NumPy function:

In [24]:
# Addition
add = v + 5
scalar_to_vector = np.multiply(v, 5)

# Matrix element wise

b = m + 1

matrix_add = b + m
print(matrix_add)



[[ 3  5  7]
 [ 9 11 13]
 [15 17 19]]


If you try element-wise math on matrices whose shapes are incompatible, NumPy raises a `ValueError`.

In [23]:
a = np.array([[1,3],[5,7]])
a
# displays the following result:
# array([[1, 3],
#        [5, 7]])
c = np.array([[2,3,6],[4,5,9],[1,8,7]])
c
# displays the following result:
# array([[2, 3, 6],
#        [4, 5, 9],
#        [1, 8, 7]])

a.shape
# displays the following result:
#  (2, 2)

c.shape
# displays the following result:
#  (3, 3)

a + c

ValueError: operands could not be broadcast together with shapes (2,2) (3,3) 
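
For contrast, here is a short sketch of one pair of shapes that does combine and one that does not. It relies on NumPy's broadcasting rule, under which two dimensions are compatible when they are equal or one of them is 1. The array names are illustrative.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])   # shape (3, 3)
row = np.array([10, 20, 30])                      # shape (3,)
a = np.array([[1, 3], [5, 7]])                    # shape (2, 2)

# (3, 3) + (3,) works: the row is added to every row of m.
broadcast_sum = m + row
print(broadcast_sum)

# (2, 2) + (3, 3) cannot be broadcast, so NumPy raises ValueError.
try:
    a + m
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)
```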

### Matrix Multiplication

Matrix Multiplication vs Matrix Product

The dot product of two vectors is their element-wise product, added up into a single number.

Whenever you multiply two matrices, you are working with the rows of the first matrix and the columns of the second matrix. The dot product of the first row with the first column becomes the first entry of the new matrix, and in general entry $(i, j)$ is the dot product of row $i$ of the first matrix with column $j$ of the second. The result has as many rows as the first matrix and as many columns as the second matrix.

**This means the rows of the first matrix need to be the same length as the columns of the second matrix.**

**The number of columns in the left matrix must equal the number of rows in the right matrix.**

Order matters: in general, $AB$ is different from $BA$.
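
A minimal sketch of both points, using small matrices invented for this illustration: the shape rule decides the shape of the result, and swapping the order changes the answer.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])        # shape (2, 3)
B = np.array([[1, 0], [0, 1], [2, 2]])      # shape (3, 2)

AB = np.matmul(A, B)   # (2, 3) x (3, 2) -> (2, 2)
BA = np.matmul(B, A)   # (3, 2) x (2, 3) -> (3, 3)
print(AB.shape, BA.shape)

# Entry (0, 0) of AB is the dot product of row 0 of A with column 0 of B.
entry = np.dot(A[0, :], B[:, 0])
print(entry, AB[0, 0])

# Even for square matrices of the same shape, the order changes the result.
S = np.array([[1, 2], [3, 4]])
T = np.array([[0, 1], [1, 0]])
print(np.matmul(S, T))
print(np.matmul(T, S))
```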

In [28]:
a = np.array([[1,2],[3,4]])

np.dot(a,a) # can do this
a.dot(a) # or call it directly on the array

np.matmul(a, a)

array([[ 7, 10],
       [15, 22]])

You may sometimes see NumPy's `dot` function in places where you would expect `matmul`. The results of `dot` and `matmul` are the same when both arrays are two-dimensional.
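
With more than two dimensions the two functions stop agreeing. The sketch below uses arrays of ones, chosen only so the shapes are easy to read, to show the difference in the result shapes.

```python
import numpy as np

x = np.ones((2, 3, 4))
y = np.ones((2, 4, 5))

# matmul treats the leading axis as a batch of matrices.
matmul_shape = np.matmul(x, y).shape
print(matmul_shape)

# dot sums over the last axis of x and the second-to-last axis of y
# for every combination of the remaining indices.
dot_shape = np.dot(x, y).shape
print(dot_shape)
```

For stacks of matrices, `matmul` is usually the one that matches the intent.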

## Matrix Transpose

The transpose of a matrix contains the same values, but with the rows and columns switched.

In [34]:
m = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print('Original:\n ', m)
print()
print('Transpose:\n', m.T)


Original:
  [[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]

Transpose:
 [[ 1  5  9]
 [ 2  6 10]
 [ 3  7 11]
 [ 4  8 12]]


### Real World Use Case



In [39]:
inputs = np.array([[-0.27,  0.45,  0.64, 0.31]])
inputs
# displays the following result:
# array([[-0.27,  0.45,  0.64,  0.31]])

inputs.shape
# displays the following result:
# (1, 4)

weights = np.array([[0.02, 0.001, -0.03, 0.036], \
    [0.04, -0.003, 0.025, 0.009], [0.012, -0.045, 0.28, -0.067]])

weights
# displays the following result:
# array([[ 0.02 ,  0.001, -0.03 ,  0.036],
#        [ 0.04 , -0.003,  0.025,  0.009],
#        [ 0.012, -0.045,  0.28 , -0.067]])

weights.shape
# displays the following result:
# (3, 4)

# np.matmul(inputs, weights) would raise a ValueError: shapes (1, 4) and (3, 4) do not line up
print(np.matmul(inputs, weights.T))

# or swapping the order, and taking the transpose of inputs
print(np.matmul(weights, inputs.T))

[[-0.01299  0.00664  0.13494]]
[[-0.01299]
 [ 0.00664]
 [ 0.13494]]


In [45]:
vec_t = np.array([[1,2,3,4,5],[6,7,8,9,10]])

print(np.squeeze(vec_t))

help(np.squeeze)

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]
Help on function squeeze in module numpy.core.fromnumeric:

squeeze(a, axis=None)
    Remove single-dimensional entries from the shape of an array.
    
    Parameters
    ----------
    a : array_like
        Input data.
    axis : None or int or tuple of ints, optional
        .. versionadded:: 1.7.0
    
        Selects a subset of the single-dimensional entries in the
        shape. If an axis is selected with shape entry greater than
        one, an error is raised.
    
    Returns
    -------
    squeezed : ndarray
        The input array, but with all or a subset of the
        dimensions of length 1 removed. This is always `a` itself
        or a view into `a`.
    
    Raises
    ------
    ValueError
        If `axis` is not `None`, and an axis being squeezed is not of length 1
    
    See Also
    --------
    expand_dims : The inverse operation, adding singleton dimensions
    reshape : Insert, remove, and combine dimensions, and resize existing ones
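
The array in the cell above has no axes of length 1, so `np.squeeze` returns it unchanged. The sketch below uses a hypothetical column array to show a case where `squeeze` does change the shape, and how `expand_dims` undoes it.

```python
import numpy as np

col = np.array([[1], [2], [3]])       # shape (3, 1)
flat = np.squeeze(col)                # the length-1 axis is removed
print(col.shape, flat.shape)

# expand_dims is the inverse: it adds a length-1 axis back.
restored = np.expand_dims(flat, axis=1)
print(restored.shape)
```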

# Derivatives

A derivative can be defined in two ways:
1) Instantaneous rate of change (Physics)
2) Slope of a line at a specific point (Geometry)

Both represent the same principle, but for our purposes it's easier to explain using the geometric definition.

**Definition**
Slope represents the steepness of a line. It answers the question, how much does $y$ or $f(x)$ change given a specific change in $x$?

### Steps:
1. Given the function $f(x) = x^2$

2. Increment $x$ by a very small value $h$ (where $h = \Delta x$): $f(x+h) = (x+h)^2$

3. Apply the slope formula: $\frac{f(x+h) - f(x)}{h}$
4. Simplify the equation: $\frac{x^2 + 2xh + h^2 - x^2}{h} = \frac{2xh + h^2}{h} = 2x + h$
5. Let $h$ go to 0: $\lim_{h\to0} (2x + h) = 2x$

### Result
So what does this mean? It means for $f(x) = x^2$, the slope at any point can be calculated using the function $2x$.
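
The steps above can be checked numerically. This is a rough sketch, with `slope` as an illustrative helper name, that evaluates the slope formula for smaller and smaller $h$ and shows it approaching $2x$.

```python
def f(x):
    return x ** 2

def slope(x, h):
    # The slope formula from step 3: (f(x + h) - f(x)) / h
    return (f(x + h) - f(x)) / h

x = 3.0
for h in (1.0, 0.1, 0.001, 1e-6):
    # Each value equals 2x + h up to rounding, so it approaches 2x = 6.
    print(h, slope(x, h))
```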






### Derivative of a constant function

The derivative of $f(x) = c$, where $c$ is a constant, is given by

$ f^\prime (x) = 0 $

**Example**

$ f(x) = -10 $, then $ f^\prime (x) = 0 $

### Derivative of a power function (Power Rule)

The derivative of $f(x) = x^r$, where $r$ is a constant real number, is

$ f^\prime(x) = r x^{r-1}$

**Example**

$ f(x) = x^{-2}$, then $ f^\prime(x) = -2 x ^{-3} = \frac{-2}{x^{3}} $

### Derivative of a function multiplied by a constant

The derivative of $f(x) = cg(x)$ is given by

$ f^\prime(x) = cg^{\prime}(x) $

**Example**

$ f(x) = 3x^3$

let $c = 3$ and $g(x) = x^3$, then $ f^\prime(x) = cg^\prime(x) $

$ = 3(3x^2) = 9x^2 $

### Derivative of the sum of functions (Sum Rule) 

The derivative of $f(x) = g(x) + h(x)$ is given by

$ f^\prime(x) = g^\prime(x) + h^\prime(x) $

**Example**

$ f(x) = x^2 + 4 $

let $g(x) = x^2$ and $h(x) = 4$, then $f^\prime(x) = g^\prime(x) + h^\prime(x) = 2(x) + 0 = 2x$

### Derivative of the difference of functions

The derivative of $f(x) = g(x) - h(x) $ is given by

$ f^\prime(x) = g^\prime(x) - h^\prime(x) $

**Example**

$f(x) = x^3 - x^{-2} $

let $g(x) = x^3$ and $h(x) = x^{-2}$, then

$ f^\prime(x) = g^\prime(x) - h^\prime(x) = 3x^2 - (-2x^{-3}) = 3x^2 + 2x^{-3} $

### Derivative of the product of two functions (Product Rule)

The derivative of $f(x) = g(x) h(x)$ is given by

$ f^\prime(x) = g(x) h^\prime(x) + h(x) g^\prime(x) $

**Example**

$ f(x) = (x^2 - 2x) (x-2) $ 

let $g(x) = (x^2 - 2x)$ and $h(x) = (x - 2)$, then

$f^\prime(x) = g(x)h^\prime(x) + h(x)g^\prime(x) = (x^2 - 2x)(1) + (x-2)(2x-2) $

$= x^2 - 2x + 2x^2 - 6x + 4 = 3x^2 - 8x + 4$

### Derivative of the quotient of two functions (Quotient Rule)

The derivative of $f(x) = \frac{g(x)}{h(x)}$ is given by

$$ f^\prime(x) = \frac{(h(x)g^\prime(x) - g(x)h^\prime(x))}{h(x)^2} $$

**Example**

$f(x) = \frac{(x-2)}{(x+1)} $

let $g(x) = (x-2) $ and $h(x) = (x+1)$, then

$$f^\prime(x) = \frac{(h(x)g^\prime(x) - g(x)h^\prime(x))}{h(x)^2}$$

$$ = \frac{(x+1)(1) - (x-2)(1)}{(x+1)^2} $$

$$ = \frac{3}{(x+1)^2} $$
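
One way to gain confidence in results like these is to compare them with a numerical estimate. The sketch below checks the product rule and quotient rule examples at an arbitrary point. The helper `numeric_derivative` and the central difference it uses are choices made for this illustration.

```python
def numeric_derivative(func, x, h=1e-6):
    # Central difference approximation of func'(x).
    return (func(x + h) - func(x - h)) / (2 * h)

def product(x):
    return (x**2 - 2*x) * (x - 2)

def product_rule_result(x):
    return 3*x**2 - 8*x + 4

def quotient(x):
    return (x - 2) / (x + 1)

def quotient_rule_result(x):
    return 3 / (x + 1)**2

x0 = 1.5
print(numeric_derivative(product, x0), product_rule_result(x0))
print(numeric_derivative(quotient, x0), quotient_rule_result(x0))
```

Each printed pair should agree to several decimal places.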

## Chain Rule


The chain rule says: if you have a variable $x$ and a function $f$ that you apply to $x$ to get $f(x)$, which we are going to call

$A = f(x)$

and then another function $g$ which you apply to $f(x)$ to get

$ B = g \circ f(x) $

then the partial derivative of $B$ with respect to $x$ is just the partial derivative of $B$ with respect to $A$ times the partial derivative of $A$ with respect to $x$.

$$ \frac{\partial B}{\partial x} = \frac{\partial B}{\partial A} \frac{\partial A}{\partial x}$$

When composing functions, the derivatives just multiply. This is going to be super useful for us, because feedforward is literally composing a bunch of functions and backpropagation is literally taking the derivative at each piece. Since the derivative of a composition is the product of the partial derivatives, all we are going to do is multiply a bunch of partial derivatives to get what we want.
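
To make the "derivatives just multiply" idea concrete, here is a toy sketch with two made-up functions, $f(x) = 3x + 1$ and $g(A) = A^2$. It is not a real network, only the same bookkeeping in miniature.

```python
def f(x):          # inner function: A = f(x)
    return 3 * x + 1

def g(a):          # outer function: B = g(A)
    return a ** 2

x = 2.0
A = f(x)           # forward pass, step 1
B = g(A)           # forward pass, step 2

dA_dx = 3          # derivative of 3x + 1 with respect to x
dB_dA = 2 * A      # derivative of A^2 with respect to A
dB_dx = dB_dA * dA_dx   # chain rule: multiply the local derivatives
print(dB_dx)

# Numerical check on the composed function g(f(x)).
h = 1e-6
numeric = (g(f(x + h)) - g(f(x - h))) / (2 * h)
print(numeric)
```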

## Popular Derivatives

$ \frac{d}{dx} \sqrt x = \frac{1}{2\sqrt{x}}$ 

$ \frac{d}{dx} \ln x = \frac{1}{x}$

$ \frac{d}{dx} \log_b x = \frac{1}{x \ln b}$

$ \frac{d}{dx} \sin x = \cos x$

$ \frac{d}{dx} \cos x = - \sin x$

$ \frac{d}{dx} \tan x = \sec^2 x$



## Examples

**Chain Rule**:

This can be used whenever your function is a composition of more than one function, for example

$ h(x) = (\sin x)^2$

If we take the derivative of $x^2$ with respect to $x$, it would be

$ \frac{d}{dx} [x^2] = 2x$

Another example:

$ \frac{d}{da} [a^2] = 2a$

Now we are only replacing $x$ (or $a$) with $(\sin x)$:

$\frac{d}{d(\sin x)} [(\sin x)^2] = 2\sin x$

This is the derivative of the outer function with respect to $\sin x$ (the inner function). It still has to be multiplied by the derivative of $\sin x$ with respect to $x$, which is

$ \frac{d}{dx} (\sin x) = \cos x$

so $ h^\prime(x) = \frac{d}{dx} [(\sin x)^2] = 2\sin x \cos x$

In words: the derivative of the outer function with respect to the inner function, multiplied by the derivative of the inner function with respect to $x$.

Or you can think of it as

$ \frac{d}{dx}[f(g(x))] = f^\prime (g(x)) \times g^\prime (x) $

"f prime of g(x), times g prime of x"



## Common Misconceptions

$ \frac{d}{dx}[f(g(x))] $

"Transcendental function" is just a fancy term for functions such as the trigonometric function $\sin x$ or the logarithmic function $\ln x$, which cannot be built from standard algebraic operations. When people see transcendental functions or compositions of them, many confuse the composition with a product of functions.

This is a composition:

$ \frac{d}{dx}[\ln (\sin(x))] $

So to differentiate it we need to take the derivative of the outer function with respect to the inner function, and multiply by the derivative of the inner function.

$f(x) = \ln x$

$g(x) = \sin x$

$f^\prime (x) = 1 / x $

$g^\prime (x) = \cos x $

$ f^\prime (g(x)) = \frac{1}{\sin x}$

which gives us

$ = \frac{1}{\sin x} \times \cos x$

$ = \frac{\cos x}{\sin x} $
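
A quick numerical sketch can show why the distinction matters. It compares the chain rule result with what you would get by wrongly treating $\ln(\sin x)$ as the product $\ln x \cdot \sin x$. The function names and the evaluation point are arbitrary choices for this illustration.

```python
import math

def h(x):
    return math.log(math.sin(x))

def h_prime(x):
    # Chain rule result: f'(g(x)) * g'(x) = (1 / sin x) * cos x
    return math.cos(x) / math.sin(x)

def product_mistake(x):
    # Product rule applied to ln(x) * sin(x), which is a different function.
    return math.sin(x) / x + math.log(x) * math.cos(x)

x0 = 1.0
step = 1e-6
numeric = (h(x0 + step) - h(x0 - step)) / (2 * step)
print(numeric, h_prime(x0), product_mistake(x0))
```

The first two numbers should agree closely, while the third is noticeably different.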



**Question**

Is $h(x)=\cos^2(x)$ a composite function? If so, what are the "inner" and "outer" functions?

**Answer**

Yes, $h(x)$ is composite, because $\cos^2(x) = (\cos x)^2$. The "inner" function is $\cos(x)$ and the "outer" function is $x^2$.

## Gradient Math

The gradient points in the direction of steepest ascent.

#### https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/gradient-and-directional-derivatives/v/gradient-and-graphs

$ f(x, y) = x^2 + y^2 $

$ \nabla f(x, y) = \left[
\begin{array}{c}
    \frac{\partial f}{\partial x}\\
    \frac{\partial f}{\partial y}
\end{array}
\right]$

$ = \left[
\begin{array}{c}
    2x\\
    2y
\end{array}
\right]$
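
As a sanity check, the analytic gradient $[2x, 2y]$ can be compared with a numerical estimate that nudges one input at a time. This is a sketch only, and `numeric_grad` is a helper written for this illustration.

```python
import numpy as np

def f(x, y):
    return x**2 + y**2

def grad_f(x, y):
    # Analytic gradient from above: [2x, 2y]
    return np.array([2 * x, 2 * y])

def numeric_grad(func, x, y, h=1e-6):
    # Central differences, changing one input at a time.
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return np.array([df_dx, df_dy])

print(grad_f(1.0, 2.0))
print(numeric_grad(f, 1.0, 2.0))
```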

What does the length mean?

What's the contour map?

If a gradient vector crosses a contour line, it is always perpendicular to it. The gradient is always perpendicular to contour lines.

**The Directional Derivative**

The partial derivative can be thought of by looking at the input space (the x, y plane) and remembering that the function somehow maps each point of that plane to the number line.

When you take the partial derivative of $f$ with respect to $x$ at a point like $(1, 2)$, you nudge the input slightly in the x-direction only and ask how the output changes.

The directional derivative asks the same question for an arbitrary vector: when you take a slight nudge in the direction of that vector, what is the resulting change in the output?

$ \vec{V} = \left[
\begin{array}{c}
    -1\\
    2
\end{array}
\right]$

$h \vec{V} = \left[
\begin{array}{c}
    -h\\
    2h
\end{array}
\right]$

So it is as if you took one negative nudge in the x-direction and two nudges in the y-direction: for every step of size $h$ along $\vec{V}$, you move $-h$ in $x$ and $2h$ in $y$.

**Notation**

$\nabla_{\vec{v}} \ f(x, y) = -\frac{\partial f}{\partial x} + 2 \frac{\partial f}{\partial y}$

**Abstract**

$ \vec{w} = \left[
\begin{array}{c}
    a\\
    b
\end{array}
\right]$


$\nabla_{\vec{w}} \ f(x, y) = a \frac{\partial f}{\partial x} + b \frac{\partial f}{\partial y}$

**Continued**

This is the dot product of the vector $(a, b)$ with the vector that has the partial derivatives in it:

$ \left[
\begin{array}{c}
    a\\
    b
\end{array}
\right] \cdot \left[
\begin{array}{c}
    \frac{\partial f}{\partial x}\\
    \frac{\partial f}{\partial y}
\end{array}
\right]$

Here $(a, b)$ is just $\vec{w}$, and the vector of partial derivatives is just the gradient $\nabla{f}$.

Thus, the directional derivative can be found by
$\vec{w} \cdot \nabla{f}$

You can think of it as moving along that vector by a tiny nudge (a small value multiplied by the vector) and asking, "How does that change the output, and what is the ratio of the resulting change to the size of the nudge?"
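
Both views can be compared numerically. The sketch below reuses $f(x, y) = x^2 + y^2$ with the vector $(-1, 2)$ at the point $(1, 2)$. The point and the step size `h` are arbitrary choices for this illustration.

```python
import numpy as np

def f(x, y):
    return x**2 + y**2

def grad_f(x, y):
    return np.array([2 * x, 2 * y])

point = np.array([1.0, 2.0])
v = np.array([-1.0, 2.0])

# Dot-product form: v . grad f
dot_form = np.dot(v, grad_f(*point))

# Nudge form with a small but finite h: (f(a + h v) - f(a)) / h
h = 1e-6
nudged = point + h * v
nudge_form = (f(*nudged) - f(*point)) / h

print(dot_form, nudge_form)
```

The two numbers agree up to an error that shrinks with `h`.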

#### Formal Definition: Directional Derivative

F(1) below is the formal definition of the partial derivative of a two-variable function with respect to $x$. We are going to build up from it to the formal definition of the directional derivative of that same function in the direction of some vector $\vec{v}$ in the input space.

F(1) $$ \frac{\partial f}{\partial x} (a, b) = \lim_{h \to 0} \frac{f(a + h, b) - f(a, b)}{h} $$

F(2)
$$ \nabla_{\vec{v}}f = ??$$

F(3)
$$ \hat{i} = \left[
\begin{array}{c}
    1\\
    0
\end{array}
\right]$$

The unit vector in the x direction

The way to read this formal definition is to think of the variable $h$ as the change in your input space.

Imagine an input space on an x, y coordinate plane that is mapped to the real number line, which is where your output $f$ lives. When you take the partial derivative at a point $(a, b)$, you want to know what nudge a little step in the x-direction produces in the output. You can think of the size of that step as $\partial x$ and the size of the resulting nudge in the output space as $\partial f$.

The $h$ in F(1) is that change in the input space (the slight nudge). You look at how it influences the function when you change only the $x$ component: what is the resulting change in $f$, that is, what is $\partial f$?

So, rewriting F(1) in vector notation, with $\vec{a}$ as the input point:

F(4) $$ \frac{\partial f}{\partial x} (\vec{a}) = \lim_{h \to 0} \frac{f(\vec{a} + h\hat{i}) - f(\vec{a})}{h} $$

For a general direction, the change we are looking for is $f$ evaluated at the initial input vector $\vec{a}$ plus $h$ (the scaling value, that little nudge) multiplied by the vector whose direction we care about, $\vec{v}$, minus the value of $f$ at the original input, $f(\vec{a})$.

This gives the formal definition of the directional derivative. It is much easier to write in vector notation, because you think of the input as a vector that is nudged by $h\vec{v}$.

F(5) $$ \nabla_{\vec{v}} f(\vec{a}) = \lim_{h \to 0} \frac{f(\vec{a} + h\vec{v}) - f(\vec{a})}{h} $$

So with F(3), when we multiply $h$ by the unit vector $\hat{i}$, the nudge is only in the x-direction, and F(5) reduces to the partial derivative in F(4).


## Why is the gradient the direction of steepest ascent?

From the directional derivative you can tell the rate at which the function changes as you move in a given direction $\vec{v}$. At a point $(a, b)$, you evaluate it by just dotting the gradient of $f$ at $(a, b)$ with that vector.

$ \nabla_{\vec{v}} f(a,b) = \nabla f (a,b) \cdot \vec{v} $

This is how you tell the rate of change. When we first saw the directional derivative, we saw that it amounts to dotting the gradient with a vector. If the vector is
$\left[
\begin{array}{c}
    1\\
    2
\end{array}
\right]$, that is one step in the x-direction and two steps in the y-direction, then the amount the output changes should be one times the change caused by a pure step in the x-direction, plus two times the change caused by a pure step in the y-direction.

This gives us the key to finding the direction of steepest ascent. When we ask which direction changes things the most, we are really asking: over all unit vectors $\vec{v}$ (vectors whose length is one), which one maximizes the dot product of the gradient of $f$ with $\vec{v}$?

$ \max_{\parallel \vec{v} \parallel = 1} \nabla f(a,b) \cdot \vec{v} $

Now let's think about what the dot product represents: you project $\vec{v}$ perpendicularly onto the gradient vector and ask what the length of that projection is, then multiply by the length of the gradient. So you are searching for the unit vector that maximizes that projection. The answer is the vector that points in the same direction as the gradient itself.

It is the gradient itself, but normalized, because we are only considering unit vectors: we just divide it by its magnitude, $\frac{\nabla f(a,b)}{\parallel \nabla f(a,b) \parallel}$

The most notable thing here is that the gradient is a tool for computing directional derivatives: it is something that loves to be dotted against. As a consequence, **the direction of steepest ascent is that vector itself**, because what maximizes the dot product with a vector is the vector that points in the same direction. This also gives us an interpretation for the length of the gradient. We know the direction of steepest ascent, but what does the length mean? Let's give the normalized gradient above a name and call it $\vec{w}$, so $\vec{w}$ is the unit vector that points in the direction of the gradient. The directional derivative of $f$ in the direction of $\vec{w}$ is then

$\nabla_{\vec{w}}f = \nabla f \cdot \vec{w} = \frac{\nabla f \cdot \nabla f}{\parallel \nabla f \parallel} = \frac{\parallel \nabla f \parallel ^2}{\parallel \nabla f \parallel} = \  \parallel \nabla f \parallel $ 

That is, you are taking the gradient dotted with itself, but because the direction is $\vec{w}$ and not the gradient, it is normalized by the magnitude of the gradient of $f$ (everything evaluated at $(a, b)$). The gradient dotted with itself is the square of its magnitude, and the whole thing is divided by the magnitude, so the directional derivative in the direction of the gradient has a value equal to the magnitude of the gradient. In other words, when you move in the direction of the gradient, the rate at which the function changes is given by the magnitude of the gradient. That is what makes it such a remarkable vector: its direction is the direction of steepest ascent, and its magnitude tells you the rate of change while you move in that direction.
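
The argument can also be checked by brute force. The sketch below samples unit vectors around the circle for the gradient of $x^2 + y^2$ at $(1, 2)$, and confirms that the largest directional derivative occurs in the direction of the gradient and equals its magnitude. The sampling resolution is an arbitrary choice for this illustration.

```python
import numpy as np

grad = np.array([2.0, 4.0])   # gradient of x^2 + y^2 at (1, 2)

# Unit vectors at evenly spaced angles around the circle.
angles = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False)
units = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Directional derivative for each unit vector: grad . v
rates = units @ grad

best = units[np.argmax(rates)]
print(best, grad / np.linalg.norm(grad))
print(rates.max(), np.linalg.norm(grad))
```

Both printed pairs should match up to the resolution of the sampled angles.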




## Gradients in Deep Learning

The gradient is just a derivative generalized to functions with more than one variable. We can use calculus to find the gradient at any point of our error function, which depends on the input weights.

At each st



[Gradient][~/Documents/Data_Science/data_science/deep_learning/gradient.png]

# Appendix:

## MathJax

Nabla $\nabla$

cdots $\cdots$

Big Left (and the direction) $\Biggl($
Big Right (and the direction) $\Biggr)$

Epsilon $\epsilon$

Left arrow $\leftarrow$

Sin $\sin$

Prod $\prod$

Sum $\sum$

alpha to omega $\alpha \beta \omega$
Gamma to Omega $\Gamma \Omega$

Int $\int$

Infinity $\infty$

Square Root $\sqrt{x} $

Limit $\lim_{x\to 0}$

Space $ a\ b$

hat $\hat{y}$

bar $\bar{x}$

vec $\vec{x}$

arrays $ \nabla f(x, y) = \left[
\begin{array}{c}
    \frac{\partial f}{\partial x}\\
    \frac{\partial f}{\partial y}
\end{array}
\right]$

In [None]:
import numpy as np

def derivative(func, x, h=None):
    """Approximate func'(x) with a forward difference."""
    if h is None:
        # Note the hard coded value found here is the square root of the
        # floating point precision, which can be found from the function
        # call np.sqrt(np.finfo(float).eps).
        h = 1.49011611938477e-08
    xph = x + h
    # Use the step as it is actually represented in floating point.
    dx = xph - x
    return (func(xph) - func(x)) / dx

print(derivative(np.cos, 1))  # approximately -sin(1)
np.sqrt(np.finfo(float).eps)