In [1]:
%load_ext nb_black


- a deep neural network is just a term that describes a big multi-layer neural network
- a neural network is a machine learning algorithm that you can train using input like camera images or sensor readings to generate output like what steering angle the car should take or how fast it should go
- the idea is that the neural network learns from observing the world
  - you don't have to teach it anything specific
- deep learning is just another term for using deep neural networks to solve problems and it's become really important for self-driving cars
- deep learning is relatively new; until the last few years, computers simply weren't fast enough to train deep neural networks effectively
  - now, however, automotive manufacturers can apply deep learning techniques to drive cars in real time
- because deep learning is so new, automotive engineers and researchers are still experimenting with just how far it can take us, but deep learning has already revolutionized segments of autonomous driving like computer vision and it has the potential to entirely change the way we develop self-driving cars

- machine learning is a field of artificial intelligence that relies on computers to learn about the environment from data, instead of relying on rules set by computer programmers
- deep learning is an approach to machine learning that uses deep neural networks
  - deep learning uses this one tool to accomplish an amazing array of objectives, from speech recognition, to driving a car

## Quiz: Housing Prices

- let's say we're studying the housing market and our task is to predict the price of a house given its size
- so we have a small house that costs \\$70000 and a big house that costs \\$160000
- we'd like to estimate the price of this medium-sized house over here; so how do we do it?
- first we put them in a grid where the *x-axis* represents the size of the house in square feet and the *y-axis* represents the price of the house
- and to help us out, we have collected some previous data in the form of these blue dots
  - these are other houses that we've looked at and we've recorded their prices with respect to their size
  - and here we can see the small house is priced at \\$70000 and the big one at \\$160000

<img src="resources/quiz_housing_prices.png"/>

- to help us out, we can see that these points can form a line
- and we can draw the line that best fits this data
- now on this line, we can see that our best guess for the price of the house is this point here over the line which corresponds to \\$120000
- this method is known as **linear regression**
- we can think of linear regression as a painter who would look at your data and draw the best fitting line through it
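As a rough illustration of drawing the best-fitting line, here is a minimal least-squares sketch in plain Python. The house sizes and the two intermediate prices are hypothetical (the notes only give the prices of the small and the big house), and `fit_line` and `predict_price` are illustrative helpers, not part of any library.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums used by ordinary least squares
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: sizes in square feet, prices in dollars
sizes = [1000, 1500, 2000, 2500]
prices = [70000, 100000, 130000, 160000]

slope, intercept = fit_line(sizes, prices)


def predict_price(size):
    """Read the price off the fitted line for a given size."""
    return slope * size + intercept


print(predict_price(1800))
```

With real data the points would not lie exactly on a line, and the fitted line would be the one that minimizes the sum of squared vertical distances to the points.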

# Linear to Logistic Regression

- linear regression helps predict values on a continuous spectrum, like predicting what the price of a house will be
- how about classifying data among discrete classes?
- here are examples of classification tasks:
  - determining whether a patient has cancer
  - identifying the species of a fish
  - figuring out who's talking on a conference call
- classification problems are important for self-driving cars
  - self-driving cars might need to classify whether an object crossing the road is a car, a pedestrian, or a bicycle
  - or they might need to identify which type of traffic sign is coming up, or what a stop light is indicating
- linear regression leads to logistic regression and ultimately neural networks, a more advanced classification tool

## Classification Problems

- let's say we are the admissions office at a university and our job is to accept or reject students
- in order to evaluate students, we have two pieces of information, the results of a test and their grades in school
- we'll start with Student 1 who got 9/10 in the test and 8/10 in the grades; that student did quite well and got accepted
- then we have Student 2 who got 3/10 in the test and 4/10 in the grades; that student got rejected
- we also have a new Student 3 who got 7/10 in the test and 6/10 in the grades, and we're wondering if the student gets accepted or not
- our first way to find this out is to plot students in a graph with the horizontal axis corresponding to the score on the test and the vertical axis corresponding to the grades
- now we'll do what we do in most of our algorithms, which is to look at the previous data

<img src="resources/student_classification_quiz.png" style="width: 60%;"/>

- this is how the previous data looks; these are all the previous students who got accepted or rejected
- the blue points correspond to students that got accepted, and the red points to students that got rejected
- we can see in this diagram that the students who did well in the test and grades are more likely to get accepted, and the students who did poorly in both are more likely to get rejected

- it seems that this data can be nicely separated by a line, which is this line over here, and it seems that most students over the line get accepted and most students under the line get rejected
- so this line is going to be our model
- the model makes a couple of mistakes since there are a few blue points that are under the line and a few red points over the line but we're not going to care about those
- we can say that it's safe to predict that if a point is over the line the student gets accepted and if it's under the line then the student gets rejected
- so based on this model, we look at the new student and see that they are over here at the point (7, 6), which is above the line
- we can assume with some confidence that the student gets accepted
- the question is, how do we find this line?
  - we'll dedicate the rest of the session to show you algorithms that will find this line, not only for this example, but for much more general and complicated cases

<img src="resources/student_classification_solution.png" style="width: 60%;"/>

## Linear Boundaries

- we're going to label the horizontal axis corresponding to the test by the variable *x<sub>1</sub>*, and the vertical axis corresponding to the grades by the variable *x<sub>2</sub>*

<img src="resources/student_classification_boundary_line_1.png" style="width: 80%;"/>

- this boundary line that separates the blue and the red points is going to have a linear equation
- the one drawn has equation $2x_1+x_2-18=0$
  - this means that our method for accepting or rejecting students simply says the following: take this equation as our score; the score is $2 \cdot Test + Grades - 18$
- when the student comes in, we check their score
  - if their score is a positive number, then we accept the student
  - if the score is a negative number then we reject the student
- this is called a prediction
- we can say by convention that if the score is $0$, we'll accept a student although this won't matter much at the end
- and that's it; that linear equation is our model
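The scoring rule above can be sketched in a few lines of Python. The function names `score` and `predict` are illustrative choices; the three students are the ones from the admissions example.

```python
def score(test, grades):
    """Score from the boundary line 2*x1 + x2 - 18 = 0."""
    return 2 * test + grades - 18


def predict(test, grades):
    """Accept (1) if the score is zero or positive, reject (0) otherwise."""
    return 1 if score(test, grades) >= 0 else 0


students = {"Student 1": (9, 8), "Student 2": (3, 4), "Student 3": (7, 6)}
for name, (test, grades) in students.items():
    print(name, "score:", score(test, grades), "prediction:", predict(test, grades))
```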

- in the more general case, our boundary will be an equation of the form $w_1x_1 + w_2x_2 + b = 0$
- we'll abbreviate this equation in vector notation as $Wx + b = 0$, where $W$ is the vector $(w_1,w_2)$ and $x$ is the vector $(x_1, x_2)$
  - we simply take the product of the two vectors
- we'll refer to $x$ as the input, $W$ as the weights and $b$ as the bias
- for a student with coordinates $(x_1, x_2)$, we'll denote the label as $y$; the label is what we're trying to predict
- if the student gets accepted, namely the point is blue, then the label is $y = 1$
- if the student gets rejected, namely the point is red, then the label is $y = 0$
- thus, each point is of the form $(x_1, x_2, y)$, where $y$ is $1$ for the blue points and $0$ for the red points
- finally, our prediction is going to be called y-hat ($\hat y$) and it will be what the algorithm predicts that the label will be
  - in this case, $\hat y$ is $1$ if the algorithm predicts that the student gets accepted, which means the point lies over the line
  - $\hat y$ is $0$ if the algorithm predicts that the student gets rejected, which means the point is under the line
- in math terms, this means that the prediction $\hat y$ is $1$ if $Wx+b \geq 0$ and $0$ if $Wx+b < 0$

<img src="resources/student_classification_boundary_line_2.png" style="width: 80%;"/>

- to summarize, the points above the line have $\hat y = 1$ and the points below the line have $\hat y = 0$
  - and, the blue points have $y = 1$ and the red points have $y = 0$
- the goal of the algorithm is to have $\hat y$ resembling $y$ as closely as possible, which is exactly equivalent to finding the boundary line that keeps most of the blue points above it and most of the red points below it

- https://www.statisticshowto.datasciencecentral.com/y-hat-definition/

***Q:*** Now that you know the equation for the line ($2x_1 + x_2 - 18 = 0$), and similarly the “score” ($2x_1 + x_2 - 18$), what is the score of the student who got $7$ in the test and $6$ for grades?
<br/>
***A:*** $2$

## Higher Dimensions

- what happens if we have more data columns, so not just test and grades, but maybe something else like the ranking of the student in the class?
  - how do we fit three columns of data?
- the only difference is that now, we won't be working in two dimensions, we'll be working in three
- so now, we have three axes: $x_1$ for the test, $x_2$ for the grades and $x_3$ for the class ranking
  - our data will look like a bunch of blue and red points flying around in 3D
- our equation won't be a line in two dimensions, but a plane in three dimensions with a similar equation as before
  - now, the equation would be $w_1x_1 + w_2x_2 + w_3x_3 + b = 0$, which will separate this space into two regions
  - this equation can still be abbreviated by $Wx + b = 0$, except our vectors will now have three entries instead of two
  - and our prediction $\hat y$ will still be $1$ if $Wx+b \geq 0$ and $0$ if $Wx+b < 0$
  
<img src="resources/student_classification_higher_dimensions_1.png" style="width: 80%;"/>

- what if we have many columns like say n of them?
- well, it's the same thing; now, our data just lives in *n-dimensional space*
- if we can imagine that the points are just things with $n$ coordinates called $x_1, x_2, x_3$ all the way up to $x_n$ with our labels being $y$, then our boundary is just an $(n-1)$-dimensional hyperplane, which is the high-dimensional equivalent of a line in 2D or a plane in 3D
- the equation of this $(n-1)$-dimensional hyperplane is going to be $w_1x_1 + w_2x_2 + \dots + w_nx_n + b = 0$, which we can still abbreviate to $Wx + b = 0$, where our vectors now have $n$ entries
- our prediction is still the same as before
  - $\hat y$ will still be $1$ if $Wx+b \geq 0$ and $0$ if $Wx+b < 0$

<img src="resources/student_classification_higher_dimensions_2.png" style="width: 80%;"/>

***Q:*** Given the table above, what would the dimensions be for input features (x), the weights (W), and the bias (b) to satisfy (Wx + b)?
<br/>
***A:*** $W$: $(1 \times n)$, $x$: $(n \times 1)$, $b$: $(1 \times 1)$
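To make the shapes concrete, here is a small NumPy sketch for $n = 3$. The weight, input and bias values are hypothetical and only serve to show how a $(1 \times n)$ row of weights multiplies an $(n \times 1)$ column of inputs to give a single score.

```python
import numpy as np

n = 3  # e.g. test, grades, class ranking

# Hypothetical values, chosen only to illustrate the shapes
W = np.array([[2.0, 1.0, 0.5]])      # weights, shape (1, n)
x = np.array([[7.0], [6.0], [4.0]])  # one student, shape (n, 1)
b = -18.0                            # bias, a single number

score = W @ x + b                    # matrix product, shape (1, 1)
y_hat = int(score.item() >= 0)       # 1 if Wx + b >= 0, else 0

print(W.shape, x.shape, score.shape, y_hat)
```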

# Perceptrons

- a perceptron is the building block of neural networks, and it's just an encoding of our equation into a small graph

<img src="resources/student_classification_perceptron_1.png" style="width: 80%;"/>

- here we have our data and our boundary line and we fit it inside a node
- now we add small nodes for the inputs which, in this case, are the test and the grades
- here we can see an example where test equals seven and grades equals six
- what the perceptron does is it plots the point $(7, 6)$ and checks if the point is in the positive or negative area
  - if the point is in the positive area, then it returns a yes
  - if it is in the negative area, it returns a no

<img src="resources/student_classification_perceptron_2.png" style="width: 80%;"/>

- these weights: $2$, $1$ and $-18$ are what define the linear equation, so we'll use them as labels in the graph
- $2$ and $1$ will label the edges coming from $x_1$ and $x_2$ respectively, and the bias unit $-18$ will label the node
- thus, when we see a node with these labels, we can think of the linear equation they generate

<img src="resources/student_classification_perceptron_3.png" style="width: 80%;"/>

- another way to draw this node is to consider the bias as part of the input
- now since $w_1$ gets multiplied by $x_1$ and $w_2$ by $x_2$, it's natural to think that $b$ gets multiplied by $1$
- so we'll have the $b$ labeling an edge coming from $1$
- then what the node does is it multiplies the values coming from the incoming nodes by the values on the corresponding edges
- then it adds them and finally, it checks if the result is greater than or equal to zero
- if it is, then the node returns a yes or a value of one, and if it isn't then the node returns a no or a value of zero

<img src="resources/student_classification_perceptron_4.png" style="width: 80%;"/>

- in the general case, this is how the nodes look
- we will have our node over here, then $n$ inputs coming in with values $x_1$ up to $x_n$ and $1$, and edges with weights $w_1$ up to $w_n$, and $b$ corresponding to the bias unit
- and then the node calculates the linear equation $Wx + b$, which is $\sum_{i=1}^{n} w_ix_i + b$
- this node then checks if the value is zero or bigger, and if it is, then the node returns a value of one for yes and if not, then it returns a value of zero for no

<img src="resources/step_function_perceptrons.png" style="width: 80%;"/>

- note that we're using an implicit function here, which is called a **step function**
- what the step function does is it returns a one if the input is positive or zero, and a zero if the input is negative
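A minimal sketch of such a node in Python might look like the following. The names `step` and `perceptron` are illustrative; the weights $2$, $1$ and bias $-18$ are the ones from the admissions example.

```python
def step(value):
    """Step function: 1 if the input is zero or positive, 0 if it is negative."""
    return 1 if value >= 0 else 0


def perceptron(inputs, weights, bias):
    """Compute Wx + b as a weighted sum, then apply the step function."""
    linear = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(linear)


# Student with test = 7 and grades = 6, boundary 2*x1 + x2 - 18 = 0
print(perceptron([7, 6], [2, 1], -18))
```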

<img src="resources/perceptrons_combinations_of_nodes_1.png" style="width: 80%;"/>

- in reality, these perceptrons can be seen as a combination of nodes, where the first node calculates a linear equation on the inputs and the weights, and the second node applies the step function to the result

- these can be graphed as follows:

<img src="resources/perceptrons_combinations_of_nodes_2.png" style="width: 80%;"/>

- the summation sign represents a linear function in the first node, and the drawing represents a step function in the second node
- in the future, we will use other functions in place of the step function
- so this is why it's useful to specify it in the node

<img src="resources/perceptrons_combinations_of_nodes_3.png" style="width: 80%;"/>

- as we've seen, there are two ways to represent perceptrons
- the one on the left has a bias unit coming from an input node with a value of one
- the one on the right has the bias inside the node
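The two representations compute the same thing, which a short sketch can check. Both function names are hypothetical, and the test points are arbitrary values picked for illustration.

```python
def step(value):
    return 1 if value >= 0 else 0


def perceptron_bias_inside(inputs, weights, bias):
    """Bias stored inside the node and added after the weighted sum."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def perceptron_bias_as_input(inputs, weights, bias):
    """Bias treated as the weight of an extra input that is always 1."""
    extended_inputs = list(inputs) + [1]
    extended_weights = list(weights) + [bias]
    return step(sum(w * x for w, x in zip(extended_weights, extended_inputs)))


# Arbitrary points, checked against the line 2*x1 + x2 - 18 = 0
points = [(7, 6), (3, 4), (9, 0), (0, 18)]
for point in points:
    a = perceptron_bias_inside(point, [2, 1], -18)
    b = perceptron_bias_as_input(point, [2, 1], -18)
    print(point, a, b)
```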

# Perceptrons II

- now you've seen how a simple neural network makes decisions: by taking in input data, processing that information, and finally, producing an output in the form of a decision
- data, like test scores and grades, are fed into a network of interconnected nodes
  - these individual nodes are called **perceptrons**, or **artificial neurons**, and they are the basic unit of a neural network
  - each one looks at input data and decides how to categorize that data
  - https://en.wikipedia.org/wiki/Perceptron

<img src="resources/university_admission_perceptrons_1.png" style="width: 80%;"/>

- in the example above, the input either passes a threshold for grades and test scores or doesn't, and so the two categories are: yes (passed the threshold) and no (didn't pass the threshold)
- these categories then combine to form a decision; for example, if both nodes produce a "yes" output, then this student gains admission into the university

- the perceptron above is one of the two perceptrons from the video that help determine whether or not a student is accepted to a university
- it decides whether a student's grades are high enough to be accepted to the university
- you might be wondering: "How does it know whether grades or test scores are more important in making this acceptance decision?"
  - well, when we initialize a neural network, we don't know what information will be most important in making a decision
  - it's up to the neural network to learn for itself which data is most important and adjust how it considers that data
  - it does this with something called **weights**

## Weights

- when input comes into a perceptron, it gets multiplied by a weight value that is assigned to this particular input
- for example, the perceptron above has two inputs, *tests* for test scores and *grades*, so it has two associated weights that can be adjusted individually
- these weights start out as random values, and as the neural network learns more about what kind of input data leads to a student being accepted into a university, the network adjusts the weights based on any errors in categorization that result from the previous weights
- this is called *training the neural network*
- a higher weight means the neural network considers that input more important than other inputs, and a lower weight means that the data is considered less important
- an extreme example would be if test scores had no effect at all on university acceptance; then the weight of the test score input would be zero and it would have no effect on the output of the perceptron

## Summing the Input Data

- each input to a perceptron has an associated weight that represents its importance
- these weights are determined during the learning process of a neural network, called training
- in the next step, the weighted input data are summed to produce a single value, that will help determine the final output - whether a student is accepted to a university or not

<img src="resources/summing_the_input_data_perceptrons.png" style="width: 70%;"/>

- in this example we weight $x_{test}$ by $w_{test}$ and add it to $x_{grades}$ weighted by $w_{grades}$

- when writing equations related to neural networks, the weights will always be represented by some type of the letter **w**
- it will usually look like a $W$ when it represents a matrix of weights or a $w$ when it represents an individual weight, and it may include some additional information in the form of a subscript to specify which weights
- but remember, when you see the letter **w**, think **weights**

- in this example, we'll use $w_{grades}$ for the weight of *grades* and $w_{test}$ for the weight of *test*
- for the image above, let's say that the weights are: $w_{grades} = -1$, $w_{test} = -0.2$
- you don't have to be concerned with the actual values, but their relative values are important
- $w_{grades}$ is 5 times larger in magnitude than $w_{test}$, which means the neural network considers the grades input 5 times more important than the test input in determining whether a student will be accepted into a university

- the perceptron applies these weights to the inputs and sums them in a process known as **linear combination**
- in our case, this looks like $w_{grades} \cdot x_{grades} + w_{test} \cdot x_{test} = -1 \cdot x_{grades} - 0.2 \cdot x_{test}$
- to make our equation less wordy, let's replace the explicit names with numbers
  - let's use $1$ for $grades$ and $2$ for $tests$
  - so now our equation becomes $w_1 \cdot x_1 + w_2 \cdot x_2$
- in this example, we just have 2 simple inputs: grades and tests
- let's imagine we instead had $m$ different inputs and we labeled them $x_1, x_2, ... , x_m$
  - let's also say that the weight corresponding to $x_1$ is $w_1$ and so on
  - in that case, we would express the linear combination succinctly as: $\sum_{i=1}^{m} w_i \cdot x_i$

- one last thing: you'll see equations written many different ways, both here and when reading on your own
- for example, you will often just see $\sum_{i}$ instead of $\sum_{i=1}^{m}$ 
- the first is simply a shorter way of writing the second
- that is, if you see a summation without a starting number or a defined end value, it just means perform the sum for all of them
- and sometimes, if the value to iterate over can be inferred, you'll see it as just $\sum$
- just remember they're all the same thing: $\sum_{i=1}^{m} w_i \cdot x_i = \sum_{i} w_i \cdot x_i = \sum w_i \cdot x_i$
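In code, the linear combination $\sum_{i=1}^{m} w_i \cdot x_i$ is just a sum over paired weights and inputs. A minimal sketch, where `linear_combination` is an illustrative name and the input values are hypothetical:

```python
def linear_combination(weights, inputs):
    """Return the sum of w_i * x_i over all m inputs."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return sum(w * x for w, x in zip(weights, inputs))


# w_1 = w_grades and w_2 = w_test, using the example weights from above
weights = [-1.0, -0.2]
# Hypothetical student: grades = 8, test = 9
inputs = [8.0, 9.0]

print(linear_combination(weights, inputs))
```

The same function works unchanged for any number of inputs $m$, which is the point of the summation notation.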

## Calculating the Output with an Activation Function

- finally, the result of the perceptron's summation is turned into an output signal!
- this is done by feeding the linear combination into an **activation function**

- activation functions are functions that decide, given the inputs into the node, what should be the node's output
- because it's the activation function that decides the actual output, we often refer to the outputs of a layer as its "activations"
- one of the simplest activation functions is the **Heaviside step function**
  - this function returns a **0** if the linear combination is **less than 0**
  - it returns a **1** if the linear combination is **positive or equal to zero**
  - https://en.wikipedia.org/wiki/Heaviside_step_function

- the Heaviside step function is shown below, where *h* is the calculated linear combination

<img src="resources/heaviside_step_function_1.png" style="width: 60%;"/>

- in the university acceptance example above, we used the weights $w_{grades} = -1$, $w_{test} = -0.2$
- since $w_{grades}$ and $w_{test}$ are negative values and scores cannot be negative, the activation function will only return a $1$ if grades and test are both $0$!
  - this is because the range of values from the linear combination using these weights and inputs is $(-\infty,0]$ (i.e. negative infinity to 0, including 0 itself)

- it's easiest to see this with an example in two dimensions
- in the following graph, imagine any points along the line or in the shaded area represent all the possible inputs to our node
- also imagine that the value along the y-axis is the result of performing the linear combination on these inputs and the appropriate weights
- it's this result that gets passed to the activation function
- now remember that the step activation function returns $1$ for any inputs greater than or equal to zero
- as you can see in the image, only one point has a y-value greater than or equal to zero – the point right at the origin, $(0,0)$

<img src="resources/heaviside_step_function_2.png" style="width: 40%;"/>

- now, we certainly want more than one possible grade/test combination to result in acceptance, so we need to adjust the results passed to our activation function so it activates – that is, returns $1$ - for more inputs
- specifically, we need to find a way so all the scores we’d like to consider acceptable for admissions produce values greater than or equal to zero when linearly combined with the weights into our node
- one way to get our function to return $1$ for more inputs is to add a value to the results of our linear combination, called a **bias**
- bias, represented in equations as $b$, lets us move values in one direction or another

- for example, the following diagram shows the previous hypothetical function with an added bias of $3$
- the blue shaded area shows all the values that now activate the function
- but notice that these are produced with the same inputs as the values shown shaded in grey – just adjusted higher by adding the bias term

<img src="resources/heaviside_step_function_3.png" style="width: 40%;"/>
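The effect of the bias can be checked numerically with a small sketch. It uses the example weights $w_{grades} = -1$ and $w_{test} = -0.2$ and the bias of $3$ from the text; the function names and the specific input values are hypothetical.

```python
def heaviside(h):
    """Heaviside step function: 0 if h < 0, 1 if h >= 0."""
    return 1 if h >= 0 else 0


def perceptron(grades, test, bias=0.0, w_grades=-1.0, w_test=-0.2):
    """Linear combination of the inputs plus a bias, fed into the step function."""
    h = w_grades * grades + w_test * test + bias
    return heaviside(h)


# Without a bias, only the origin activates the node
print(perceptron(0, 0))            # h = 0
print(perceptron(1, 1))            # h = -1.2

# With a bias of 3, some non-zero inputs now activate the node as well
print(perceptron(1, 1, bias=3.0))  # h = 1.8
print(perceptron(3, 5, bias=3.0))  # h = -1.0
```

The inputs are the same in both cases; only the value passed to the step function is shifted by the bias.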

- of course, with neural networks we won't know in advance what values to pick for biases
- that’s ok, because just like the weights, the bias can also be updated and changed by the neural network during training
- so after adding a bias, we now have a complete perceptron formula:

<img src="resources/perceptron-equation.gif" style="width: 60%;"/>

- this formula returns $1$ if the input $(x_1, x_2,...,x_m)$ belongs to the accepted-to-university category or returns $0$ if it doesn't
- the input is made up of one or more real numbers, each one represented by $x_i$ where $m$ is the number of inputs
- then the neural network starts to learn!
- initially, the weights ($w_i$) and bias ($b$) are assigned a random value, and then they are updated using a learning algorithm like gradient descent
- the weights and biases change so that the next training example is more accurately categorized, and patterns in data are "learned" by the neural network
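Written out from the pieces described above (a weighted sum, a bias, and the Heaviside step function), the perceptron formula can be expressed as follows. This is a reconstruction from the text, so its notation may differ slightly from the image:

$$
f(x_1, x_2, \dots, x_m) =
\begin{cases}
0 & \text{if } b + \sum_{i=1}^{m} w_i \cdot x_i < 0 \\
1 & \text{if } b + \sum_{i=1}^{m} w_i \cdot x_i \geq 0
\end{cases}
$$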

# Why "Neural Networks"?

- so you may be wondering why are these objects called neural networks
  - the reason why they're called neural networks is because perceptrons kind of look like neurons in the brain
  
<img src="resources/perceptrons_vs_brain.png" style="width: 70%;"/>

- on the left we have a perceptron with four inputs
  - what the perceptron does is calculate some equation on the inputs and decide to return a one or a zero
- in a similar way neurons in the brain take inputs coming from the dendrites
  - these inputs are nervous impulses
  - so what the neuron does is it does something with the nervous impulses and then it decides if it outputs a nervous impulse or not through the axon
- the way we'll create neural networks later in this lesson is by concatenating these perceptrons, so we'll be mimicking the way the brain connects neurons by taking the output from one and turning it into the input for another one

# Perceptrons as Logical Operators

- here's something very interesting about perceptrons: some logical operators can be represented as perceptrons

- https://medium.com/@stanleydukor/neural-representation-of-and-or-not-xor-and-xnor-logic-gates-perceptron-algorithm-b0275375fea1

## AND Perceptron

- the AND operator takes two inputs and it returns an output
- the inputs can be true or false but the output is only true if both of the inputs are true
- how do we turn this into a perceptron?
  - the first step is to turn this table of true/false into a table of zeros and ones where the one corresponds to true and the zero corresponds to false
  - and now we draw this perceptron over here which works just as before: it has a line defined by weights and a bias and it has a positive area which is colored blue and a negative area which is colored red
  - what this perceptron is going to do is it will plot each point and if the point falls in the positive area then it returns a one and if it falls in the negative area then it returns a zero

<img src="resources/and_perceptron.png" style="width: 90%;"/>

### What are the weights and bias for the AND perceptron?

In [2]:
# Set the weights (weight1, weight2) and bias
# to values that will correctly determine the AND operation as shown above
# More than one set of values will work!

# - https://stackoverflow.com/a/53822383

import pandas as pd

# TODO: Set weight1, weight2, and bias
weight1 = 0.5
weight2 = 0.5
bias = -1.0


# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, False, False, True]
outputs = []

# Generate and check output
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = "Yes" if output == correct_output else "No"
    outputs.append(
        [test_input[0], test_input[1], linear_combination, output, is_correct_string]
    )

# Print output
num_wrong = len([output[4] for output in outputs if output[4] == "No"])
output_frame = pd.DataFrame(
    outputs,
    columns=[
        "Input 1",
        "  Input 2",
        "  Linear Combination",
        "  Activation Output",
        "  Is Correct",
    ],
)
if not num_wrong:
    print("Nice!  You got it all correct.\n")
else:
    print("You got {} wrong.  Keep trying!\n".format(num_wrong))
print(output_frame.to_string(index=False))

Nice!  You got it all correct.

 Input 1    Input 2    Linear Combination    Activation Output   Is Correct
       0          0                  -1.0                    0          Yes
       0          1                  -0.5                    0          Yes
       1          0                  -0.5                    0          Yes
       1          1                   0.0                    1          Yes



## OR Perceptron

- the OR operator returns true if either of its two inputs is true
- that gets turned into this table, which gets turned into this perceptron, which is very similar to the one before except the line has different weights and a different bias

<img src="resources/or_perceptron.png" style="width: 90%;"/>

- the OR perceptron is very similar to an AND perceptron
- in the image below, the OR perceptron has the same line as the AND perceptron, except the line is shifted down

<img src="resources/and_to_or_perceptron.png" style="width: 90%;"/>

***Q:*** What are two ways to go from an AND perceptron to an OR perceptron?
<br/>
***A:*** Increase the weights, or decrease the magnitude of the bias.
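Both routes can be tried out with a short sketch. It starts from the AND weights used in the cell above (`0.5`, `0.5`, bias `-1.0`); the modified values are just one possible choice, and `truth_table` is an illustrative helper.

```python
def perceptron(x1, x2, weight1, weight2, bias):
    """Return 1 if the linear combination is zero or positive, else 0."""
    return int(weight1 * x1 + weight2 * x2 + bias >= 0)


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]


def truth_table(weight1, weight2, bias):
    """Outputs of the perceptron for all four input combinations."""
    return [perceptron(x1, x2, weight1, weight2, bias) for x1, x2 in inputs]


and_table = truth_table(0.5, 0.5, -1.0)      # AND perceptron
or_by_weights = truth_table(1.0, 1.0, -1.0)  # same bias, larger weights
or_by_bias = truth_table(0.5, 0.5, -0.5)     # same weights, smaller bias magnitude

print(and_table, or_by_weights, or_by_bias)
```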

## NOT Perceptron

- unlike the other perceptrons we looked at, the NOT operation only cares about one input
- the operation returns a `0` if the input is `1` and a `1` if it's a `0`
- the other inputs to the perceptron are ignored

In [3]:
# Set the weights (weight1, weight2) and bias
# to values that calculate the NOT operation on the second input and ignore the first input

import pandas as pd

# TODO: Set weight1, weight2, and bias
weight1 = 0.0
weight2 = -1.0
bias = 0.0


# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [True, False, True, False]
outputs = []

# Generate and check output
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = "Yes" if output == correct_output else "No"
    outputs.append(
        [test_input[0], test_input[1], linear_combination, output, is_correct_string]
    )

# Print output
num_wrong = len([output[4] for output in outputs if output[4] == "No"])
output_frame = pd.DataFrame(
    outputs,
    columns=[
        "Input 1",
        "  Input 2",
        "  Linear Combination",
        "  Activation Output",
        "  Is Correct",
    ],
)
if not num_wrong:
    print("Nice!  You got it all correct.\n")
else:
    print("You got {} wrong.  Keep trying!\n".format(num_wrong))
print(output_frame.to_string(index=False))

Nice!  You got it all correct.

 Input 1    Input 2    Linear Combination    Activation Output   Is Correct
       0          0                   0.0                    1          Yes
       0          1                  -1.0                    0          Yes
       1          0                   0.0                    1          Yes
       1          1                  -1.0                    0          Yes



## XOR Perceptron

- the XOR operator is very similar to the other two, except it returns true if exactly one of its inputs is true and the other one is false

<img src="resources/xor_perceptron.png" style="width: 90%;"/>

- if we introduce the NAND operator as the combination of AND and NOT, then we get the following two-layer perceptron that will model XOR

<img src="resources/xor_multi_layer_perceptron.png" style="width: 40%;"/>
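One way to sketch such a two-layer network in Python is to feed the outputs of a NAND perceptron and an OR perceptron into an AND perceptron. The weights and biases below are assumed values that happen to work; many other choices would do, and the function names are illustrative.

```python
def perceptron(x1, x2, weight1, weight2, bias):
    """Return 1 if the linear combination is zero or positive, else 0."""
    return int(weight1 * x1 + weight2 * x2 + bias >= 0)


def and_gate(a, b):
    return perceptron(a, b, 1.0, 1.0, -2.0)


def or_gate(a, b):
    return perceptron(a, b, 1.0, 1.0, -1.0)


def nand_gate(a, b):
    # AND followed by NOT, folded into a single perceptron with negative weights
    return perceptron(a, b, -1.0, -1.0, 1.5)


def xor_gate(a, b):
    # First layer: NAND and OR; second layer: AND of the two results
    return and_gate(nand_gate(a, b), or_gate(a, b))


for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_gate(a, b))
```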

# Perceptron Trick

- how do we find this line that separates the blue points from the red points in the best possible way?
- the computer doesn't know where to start
- it might as well start at a random place by picking a random linear equation
- this equation will define a line and a positive and negative area given in blue and red respectively
- what we're going to do is to look at how badly this line is doing and then move it around to try to get better and better
- now the question is, how do we find how badly this line is doing?
  - let's ask all the points
  - here we have four points that are correctly classified
  - they are these two blue points in the blue area and these two red points in the red area
  - then we have these two points that are incorrectly classified
    - that's this red point in the blue area and this blue point in the red area
    - we want to get as much information from them as possible, so we want them to tell us something that helps us improve this line

<img src="resources/split_data_perceptron.png" style="width: 40%;"/>

- so what is it that incorrectly classified points can tell us?

<img src="resources/split_data_perceptron_quiz.png" style="width: 40%;"/>

- here we have a misclassified point, this red point in the blue area
- if you're in the wrong area, you would like the line to go over you, in order to be in the right area
- thus, the point says come closer so the line can move towards it and eventually classify it correctly

- let's see how to make a line go closer to a point
- let's say we have our linear equation, for example $3x_1 + 4x_2 - 10 = 0$
  - that linear equation gives us a line, which is the set of points where $3x_1 + 4x_2 - 10$ is zero, and two regions
  - the positive region, drawn in blue, where $3x_1 + 4x_2 - 10$ is positive
  - the negative region, drawn in red, where $3x_1 + 4x_2 - 10$ is negative

<img src="resources/split_data_perceptron_positive.png" style="width: 70%;"/>

- here we have our lonely misclassified point $(4, 5)$, which is a red point in the blue area, and the point tells the line to come closer
- so how do we get the line to come closer to that point?
- the idea is we're going to take the $4$ and $5$ and use them to modify the equation of the line in order to get the line to move closer to the point
- we have the parameters of the line: $3$, $4$ and $-10$ 
- the coordinates of the point are $4$ and $5$, and let's also add $1$ here for the bias unit

- what we'll do is subtract these numbers from the parameters of the line to get $3 - 4$, $4 - 5$, and $-10 -1$
- the new line will have parameters $-1$, $-1$, $-11$
  - this line will move drastically towards the point, possibly even going over it and placing it in the correct area
- since we have a lot of other points, we don't want to make any drastic moves since we may accidentally misclassify all our other points
- we want the line to make a small move towards that point and for this, we need to take small steps towards the point
- here's where we introduce the **learning rate**
  - the learning rate is a small number for example, $0.1$
- instead of subtracting $4$, $5$ and $1$ from the parameters of the line, we'll multiply these numbers by $0.1$ and then subtract them from the parameters of the line
  - this means we'll be subtracting $0.4$, $0.5$, and $0.1$ from the parameters of the line
  - obtaining a new equation of $2.6x_1 + 3.5x_2 - 10.1 = 0$
  - this new line will actually move closer to the point

<img src="resources/split_data_perceptron_negative.png" style="width: 70%;"/>

- in the same way, suppose we have a blue point in the red area; for example, the point $(1, 1)$ is a positively labeled point in the negative area
- this point is also misclassified and it tells the line to come closer
- what we do here is the same thing, except that instead of subtracting the coordinates from the parameters of the line, we add them
- again, we multiply by the learning rate in order to make small steps
- so here we take the coordinates of the point, $1$ and $1$, put an extra $1$ for the constant term, and multiply them by the learning rate $0.1$
- now, we add them to the parameters of the line and we get a new line with equation $3.1x_1 + 4.1x_2 - 9.9 = 0$
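
- the two updates above can be reproduced with a short sketch; the function name `perceptron_trick` and its signature are hypothetical, and it assumes the point passed in is already known to be misclassified

```python
import numpy as np


def perceptron_trick(params, point, label, learn_rate=0.1):
    """Move the line (w1, w2, b) a small step toward a misclassified point.

    label is the true label of the point: 1 (blue/positive) or 0 (red/negative).
    """
    params = np.asarray(params, dtype=float)
    # Append 1 to the coordinates so the bias is updated like the weights.
    step = learn_rate * np.append(point, 1.0)
    # Positive point in the negative area: add the step.
    # Negative point in the positive area: subtract the step.
    return params + step if label == 1 else params - step


line = [3.0, 4.0, -10.0]
print(perceptron_trick(line, (4, 5), label=0))  # red point in the blue area
print(perceptron_trick(line, (1, 1), label=1))  # blue point in the red area
```

- up to floating-point rounding, the two printed lines should match the parameters $2.6, 3.5, -10.1$ and $3.1, 4.1, -9.9$ obtained above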

# Perceptron Algorithm

- now we finally have all the tools for describing the perceptron algorithm
- we start with a random equation, which determines a line and two regions, the positive and the negative region
- now, we'll move this line around to get a better and better fit
- so, we ask all the points how they're doing
- correctly classified points say, "I'm good."
- incorrectly classified points say, "Come closer."
- so, let's actually write the pseudocode for this perceptron algorithm

- 1. Start with random weights:&nbsp;&nbsp;$w_1, ..., w_n, b$
    - this gives us the equation $Wx + b = 0$, the line, and the positive and negative areas
<br>
<br>
- 2. For every misclassified point with coordinates $(x_1, ..., x_n)$:
    - 2.1. If prediction = 0 (which means the point is a positive point in the negative area):
      - For $i = 1 ... n$
        - Change $w_i$ to $w_i + \alpha x_i$
      - Change $b$ to $b + \alpha$&nbsp;&nbsp;(this moves the line closer to the misclassified point)
<br>
<br>
    - 2.2. If prediction = 1 (which means the point is a negative point in the positive area):
      - For $i = 1 ... n$
        - Change $w_i$ to $w_i - \alpha x_i$
      - Change $b$ to $b - \alpha$&nbsp;&nbsp;(this moves the line closer to the misclassified point)

- $\alpha$ is the learning rate


- we just repeat this step until we get no errors, or until the number of errors is small
- or we can simply say: do the step a thousand times and stop

## Coding the Perceptron Algorithm

- in this quiz, you'll have the chance to implement the perceptron algorithm to separate the following data (given in the file data.csv)

<img src="resources/coding_perceptron_algorithm_quiz.png" style="width: 60%;"/>

- recall that the perceptron step works as follows
  - for a point with coordinates $(p, q)$, label $y$, and prediction given by the equation $\hat y = step(w_1 p + w_2 q + b)$:
    - if the point is correctly classified, do nothing
    - if the point is classified positive, but it has a negative label, subtract $\alpha p$, $\alpha q$, and $\alpha$ from $w_1$, $w_2$, and $b$ respectively
    - if the point is classified negative, but it has a positive label, add $\alpha p$, $\alpha q$, and $\alpha$ to $w_1$, $w_2$, and $b$ respectively

```python
# data is in resources/data.csv

import numpy as np

# Setting the random seed, feel free to change it and see different solutions.
np.random.seed(42)


def stepFunction(t):
    if t >= 0:
        return 1
    return 0


def prediction(X, W, b):
    return stepFunction((np.matmul(X, W) + b)[0])


# TODO: Fill in the code below to implement the perceptron trick.
# The function should receive as inputs the data X, the labels y,
# the weights W (as an array), and the bias b,
# update the weights and bias W, b, according to the perceptron algorithm,
# and return W and b.
def perceptronStep(X, y, W, b, learn_rate=0.01):
    for i in range(len(X)):
        y_hat = prediction(X[i], W, b)
        if y[i] - y_hat == 1:
            W[0] += X[i][0] * learn_rate
            W[1] += X[i][1] * learn_rate
            b += learn_rate
        elif y[i] - y_hat == -1:
            W[0] -= X[i][0] * learn_rate
            W[1] -= X[i][1] * learn_rate
            b -= learn_rate
    return W, b


# This function runs the perceptron algorithm repeatedly on the dataset,
# and returns a few of the boundary lines obtained in the iterations,
# for plotting purposes.
# Feel free to play with the learning rate and the num_epochs,
# and see your results plotted below.
def trainPerceptronAlgorithm(X, y, learn_rate=0.01, num_epochs=25):
    x_min, x_max = min(X.T[0]), max(X.T[0])
    y_min, y_max = min(X.T[1]), max(X.T[1])
    W = np.array(np.random.rand(2, 1))
    b = np.random.rand(1)[0] + x_max
    # These are the solution lines that get plotted below.
    boundary_lines = []
    for i in range(num_epochs):
        # In each epoch, we apply the perceptron step.
        W, b = perceptronStep(X, y, W, b, learn_rate)
        boundary_lines.append((-W[0] / W[1], -b / W[1]))
    return boundary_lines
```

# Error Functions

- the way we'll solve our problems from now on is with the help of an error function
- an error function is simply something that tells us how far we are from the solution
- for example, if I'm here and my goal is to get to this plant, an error function will just tell me the distance from the plant
- my approach would then be to look around myself, check in which direction I can take a step to get closer to the plant, take that step and then repeat
- here the error is simply the distance from the plant

# Log-loss Error Function

- here is an intuitive realization of the error function
- let's say we're standing on top of a mountain, Mount Errorest, and we want to descend, but it's not that easy because it's cloudy and the mountain is very big, so we can't really see the big picture
- to go down, we look around us and consider all the possible directions in which we can walk
- then we pick the direction that makes us descend the most
- so we take a step in that direction; thus, we've decreased the height
- once we've taken the step, we start the process again, and repeat it, always decreasing the height, until we go all the way down the mountain, minimizing the height
- in this case the key metric that we use to solve the problem is the height
  - we'll call the height the error
  - the error is what's telling us how badly we're doing at the moment and how far we are from an ideal solution
- if we constantly take steps to decrease the error then we'll eventually solve our problem, descending from Mt. Errorest
- this method, which we'll study in more detail later, is called gradient descent


- some of you may be thinking, wait, that doesn't necessarily solve the problem
  - what if I get stuck in a valley, a local minimum, that's not the bottom of the mountain?
  - this happens a lot in machine learning and we'll see different ways to solve it later in this Nanodegree
  - it's also worth noting that many times a local minimum will give us a pretty good solution to a problem

- so let's try that approach to solve a problem (2 areas, with 2 errors): what would be a good error function here?
- what would be a good way to tell the computer how badly it's doing?
- here's our line with our positive and negative area
- and the question is how do we tell the computer how far it is from a perfect solution?
- maybe we can count the number of mistakes
  - there are two mistakes here
  - so that's our height; that's our error
- so just as we did to descend from the mountain, we look around all the directions in which we can move the line in order to decrease our error
- so let's say we move in this direction
  - we'll decrease the number of errors to one, and then, if we keep moving in that direction, we'll decrease the number of errors to zero
- there's a small problem with that approach
  - in our algorithms we'll be taking very small steps and the reason for that is calculus, because our tiny steps will be calculated by derivatives
- so what happens if we take very small steps here?
  - we start with two errors and then move a tiny amount and we're still at two errors
  - then move a tiny amount again and we're still at two errors
  - another tiny amount and we're still at two and again and again; so not much we can do here

- this is equivalent to using gradient descent to try to descend from an Aztec pyramid with flat steps
- if we're standing here on the second floor, with two errors, and we look around ourselves, we'll always see two errors and we'll get confused and not know what to do
- on the other hand, on Mt. Errorest we can detect very small variations in height and we can figure out in what direction the height decreases the most
- in math terms, this means that in order for us to do gradient descent, our error function cannot be discrete; it should be continuous
- Mt. Errorest is continuous, since small variations in our position translate to small variations in the height, but the Aztec pyramid is not, since the height jumps from two to one and then from one to zero
- as a matter of fact, our error function needs to be differentiable, but we'll see that later

<img src="resources/log_loss_error_aztec_errorest.png"/>

- so, what we need to do here is to construct an error function that is continuous and we'll do this as follows
  - so here are six points with four of them correctly classified, that's two blue and two red, and two of them incorrectly classified, that is this red point at the very left and this blue point at the very right
- the error function is going to assign a large penalty to the two incorrectly classified points and small penalties to the four correctly classified points
- here we are representing the size of the point as the penalty
  - the penalty is roughly the distance from the boundary when the point is misclassified and almost zero when the point is correctly classified
  
<img src="resources/log_loss_error_penalties.png" width="80%"/>

- we'll learn the formula for the error later in the class
- now we obtain the total error by adding all the errors from the corresponding points
- here we have a large number, since the two misclassified points add a large amount to the error
- the idea now is to move the line around in order to decrease these errors
- but now we can do it because we can make very tiny changes to the parameters of the line which will amount to very tiny changes in the error function
- so, if we move the line, say, in this direction, we can see that some errors decrease and some slightly increase, but overall the sum gets smaller, and we can see that because we've now correctly classified the two points that were misclassified before
- so once we are able to build an error function with this property, we can now use gradient descent to solve our problem

- so here's the full picture
- here we are at the summit of Mt. Errorest
- we're quite high up because our error is large
- as you can see the error is the height which is the sum of the blue and red areas
- we explore around to see what direction brings us down the most, or equivalently, what direction can we move the line to reduce the error the most, and we take a step in that direction.
- so in the mountain we go down one step and in the graph we've reduced the error a bit by correctly classifying one of the points
- and now we do it again
  - we calculate the error
  - we look around ourselves to see in what direction we descend the most
  - we take a step in that direction and that brings us down the mountain
- so on the left we have reduced the height and successfully descended from the mountain and on the right we have reduced the error to its minimum possible value and successfully classified our points

# Discrete vs Continuous

- in the last section we pointed out the difference between a discrete and a continuous error function and discovered that in order for us to use gradient descent we need a continuous error function
- in order to do this we also need to move from discrete predictions to continuous predictions

- the prediction is basically the answer we get from the algorithm
- a discrete answer will be of the form yes/no
- a continuous answer will be a number, normally between zero and one, which we'll consider a probability


- in the running example, here we have our students where blue is accepted and red is rejected
- the discrete algorithm will tell us if a student is accepted or rejected by outputting a zero for rejected students and a one for accepted students

<img src="resources/discrete_vs_continuous_percentage.png" width="80%"/>


- the continuous algorithm, on the other hand, gives each student a probability of being accepted; the farther a point is from the black line, the more extreme this probability is
- points that are well into the blue area get very high probabilities, such as this point with an 85% probability of being blue
- points that are well into the red region are given very low probabilities, such as this point on the bottom that is given a 20% probability of being blue
- the points on the line are all given a 50% probability of being blue


- as you can see the probability is a function of the distance from the line

## Activation Functions

- the way we move from discrete predictions to continuous ones is simply to change our activation function from the step function on the left to the sigmoid function on the right
- the **sigmoid function** is a function that gives values very close to $1$ for large positive numbers
- for large negative numbers, it gives values very close to $0$
- for numbers that are close to $0$, it gives values that are close to $0.5$
- the formula for the sigmoid function is: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$

<img src="resources/discrete_vs_continuous_activation_functions.png" width="80%"/>
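
- the formula can be checked numerically with a minimal sketch using NumPy; the function name `sigmoid` and the sample inputs are just choices made here

```python
import numpy as np


def sigmoid(x):
    """Sigmoid activation: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))


# Large negative inputs give values near 0, large positive inputs near 1,
# and an input of 0 gives exactly 0.5.
for x in (-10, -1, 0, 1, 10):
    print(x, round(float(sigmoid(x)), 4))
```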


- before our model consisted of a line with a positive region and a negative region
- now it consists of an entire probability space, where for each point in the plane we are given the probability that the label of the point is $1$ (blue) rather than $0$ (red)
- the way we obtain this probability space is very simple
  - we just combine the linear function $Wx + b$ with the sigmoid function
  - on the left we have the lines that represent the points for which $Wx + b$ is $0, 1, 2, -1, -2, ...$
  - and once we apply the sigmoid function to each of these values in the plane, we then obtain numbers from $0$ to $1$ for each point
    - these numbers are just the probabilities of the point being blue
    - the probability of the point being blue is a prediction of the model $\hat y = \sigma(Wx + b)$ 
- we can see the lines for which the prediction is $0.5, 0.6, 0.7, 0.4, 0.3, ...$
- as we get more into the blue area, $\sigma(Wx + b)$ gets closer and closer to $1$
- as we move into the red area, $\sigma(Wx + b)$ gets closer and closer to $0$
- when we're on the main line, $Wx + b$ is $0$, which means the sigmoid of $Wx + b$ is exactly $0.5$

<img src="resources/discrete_vs_continuous_probabillity_space.png" width="80%"/>

- here on the left we have our old perceptron with the activation function as a step function
- on the right we have our new perceptron, where the activation function is the sigmoid function
  - our new perceptron takes the inputs, multiplies them by the weights on the edges, adds the results, and then applies the sigmoid function
  - so instead of returning $1$ or $0$ like before, it returns values between $0$ and $1$, such as $0.99$ or $0.67$
- before, it used to say whether the student got accepted or not; now it says how probable it is that the student got accepted

<img src="resources/discrete_vs_continuous_perceptrons.png" width="80%"/>
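
- to make the comparison concrete, here is a small sketch of both perceptrons side by side; the function names, the weights and the sample points are hypothetical and chosen only for illustration

```python
import numpy as np


def linear_score(x, W, b):
    """Linear combination Wx + b for a single point x."""
    return float(np.dot(W, x) + b)


def step_perceptron(x, W, b):
    """Discrete prediction: 1 if Wx + b >= 0, else 0."""
    return 1 if linear_score(x, W, b) >= 0 else 0


def sigmoid_perceptron(x, W, b):
    """Continuous prediction: sigmoid(Wx + b), read as P(label = 1)."""
    return 1.0 / (1.0 + np.exp(-linear_score(x, W, b)))


# Hypothetical weights and points, chosen only for illustration.
W, b = np.array([3.0, 4.0]), -10.0
for point in [(2.0, 1.0), (2.0, 1.5), (1.0, 1.0)]:
    print(point, step_perceptron(point, W, b), round(sigmoid_perceptron(point, W, b), 3))
```

- the first point lies exactly on the line, so the sigmoid perceptron gives it a probability of $0.5$, while the step perceptron can only answer $1$ or $0$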

# Softmax

- so far we have models that give us an answer of yes/no or the probability of a label being positive or negative
- what if we have more classes?
- what if we want our model to tell us if something is red, blue, yellow or dog, cat, bird?


- in this section, we'll learn about the **softmax function**, which is the equivalent of the sigmoid activation function, but when the problem has 3 or more classes

- let's switch to a different example for a moment
- let's say we have a model that will predict if you receive a gift or not
- the model makes predictions in the following way:
  - it says the probability that you get a gift is $0.8$, which automatically implies that the probability that you don't receive a gift is $0.2$
  - what the model does is take some inputs
    - for example, is it your birthday, or have you been good all year?
  - and based on those inputs, it calculates a linear model, which gives the score
  - the probability that you get the gift is simply the sigmoid function applied to that score

- now, what if you had more options than just getting a gift or not a gift?
- let's say we have a model that tells us what animal we just saw, and the options are a duck, a beaver and a walrus
- we want a model that gives an answer along the lines of: the probability of a duck is $0.67$, the probability of a beaver is $0.24$, and the probability of a walrus is $0.09$
  - notice that the probabilities need to add to $1$


- let's say we have a linear model based on some inputs
- the inputs could be, does it have a beak or not, number of teeth, number of feathers, hair, no hair, does it live in the water, does it fly, etc.
- we calculate a linear function based on those inputs, and let's say we get some scores
  - so, the duck gets a score of $2$, and the beaver gets a score of $1$, and the walrus gets a score of $0$
- now the question is, how do we turn these scores into probabilities?
- the first thing we need to satisfy with probabilities is as we said, they need to add to $1$
  - so the $2$, the $1$, and the $0$ do not add to $1$
- the second thing we need to satisfy is, since the duck had a higher score than the beaver and the beaver had a higher score than the walrus, then we want the probability of the duck to be higher than the probability of the beaver, and the probability of the beaver to be higher than the probability of the walrus
- here's a simple way of doing it
  - let's take each score and divide it by the sum of all the scores
    - the $2$ becomes $\frac{2}{2 + 1 + 0}$
    - the $1$ becomes $\frac{1}{2 + 1 + 0}$
    - the $0$ becomes $\frac{0}{2 + 1 + 0}$
  - this kind of works because the probabilities we obtain are $\frac{2}{3}$ for the duck, $\frac{1}{3}$ for the beaver, and $0$ for the walrus
  - the problem is: what happens if our scores are negative?
    - this is completely plausible since the scores come from a linear function, which can give negative values
    - what if we had, say, scores of $1$, $0$ and $-1$?
      - then one of the probabilities would turn into $\frac{1}{1 + 0 + (-1)}$, whose denominator is $0$, and we know very well that we cannot divide by zero


- how can we turn this idea into one that works all the time even for negative numbers?
  - well, it's almost like we need to turn these scores into positive scores
- we can use the exponential function (`exp`) to turn every number into a positive number
  - this is a function that returns a positive number for every input
  - $e^x$ is always a positive number
  
  
- what we're going to do is exactly what we did before, except, applying $e^x$ to the scores
  - now we have:
    - $\frac{e^2}{e^2 + e^1 + e^0} \approx 0.67$
    - $\frac{e^1}{e^2 + e^1 + e^0} \approx 0.24$
    - $\frac{e^0}{e^2 + e^1 + e^0} \approx 0.09$
  - this clearly adds to 1
  - also, notice that since the exponential function is increasing, then the duck has a higher probability than the beaver, and this one has a higher probability than the walrus; $P(duck) > P(beaver) > P(walrus)$
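
- the two attempts can be compared with a short sketch in plain Python; the function names `normalize_scores` and `normalize_exp_scores` are made up for this illustration

```python
import math


def normalize_scores(scores):
    """Naive attempt: divide each score by the sum of all scores."""
    total = sum(scores)
    return [s / total for s in scores]


def normalize_exp_scores(scores):
    """Apply e^x to every score first, then divide by the sum."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(normalize_scores([2, 1, 0]))  # fine for these scores
try:
    normalize_scores([1, 0, -1])  # the scores sum to zero
except ZeroDivisionError as err:
    print("naive normalization failed:", err)

# With the exponential, negative scores are no longer a problem.
print([round(p, 2) for p in normalize_exp_scores([2, 1, 0])])
print([round(p, 2) for p in normalize_exp_scores([1, 0, -1])])
```

- both calls to `normalize_exp_scores` give the same probabilities, because the scores $1, 0, -1$ are just the scores $2, 1, 0$ shifted by one, and the common factor $e^{-1}$ cancels in the ratio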


- this function is called the **softmax function** and it's defined formally like this:
- let's say we have $n$ classes and a linear model that gives us the following scores: $Z_1, Z_2,\ ...\ , Z_n$
  - one score for each of the classes
- what we do to turn them into probabilities is to say the probability that the object is in class $i$ is going to be: $P(class\ i) = \frac{e^{Z_i}}{e^{Z_1} +\ ... \ + e^{Z_n}}$
- that's how we turn scores into probabilities

***Q:*** When we had two classes, we applied the sigmoid function to the scores.
<br>
Now that we have more classes, we apply the softmax function to the scores.
<br>
The question is, is the softmax function for $n=2$ the same as the sigmoid function?
<br>
***A:*** The answer is actually yes, but it's not super trivial why. And, it's a nice thing to remember.

```python
## Let's code the formula for the Softmax function in Python.

import numpy as np

# Write a function that takes as input a list of numbers, and returns
# the list of values given by the softmax function.
def softmax(L):
    expL = np.exp(L)
    sumExpL = sum(expL)
    result = []
    for i in expL:
        result.append(i * 1.0 / sumExpL)
    return result

    # Note: The function np.divide can also be used here, as follows:
    # def softmax(L):
    #     expL = np.exp(L)
    #     return np.divide (expL, expL.sum())
```
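
- returning to the question about $n=2$: dividing the numerator and denominator by $e^{Z_1}$ gives $\frac{e^{Z_1}}{e^{Z_1} + e^{Z_2}} = \frac{1}{1 + e^{-(Z_1 - Z_2)}} = \sigma(Z_1 - Z_2)$, so the two-class softmax is the sigmoid applied to the difference of the scores
- the sketch below checks this numerically on a few arbitrary score pairs; `softmax` here is a compact NumPy variant of the function above, redefined so the snippet runs on its own

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(scores):
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


# Arbitrary pairs of scores, chosen only for illustration.
for z1, z2 in [(2.0, 1.0), (0.0, 0.0), (-1.5, 3.0)]:
    two_class = softmax(np.array([z1, z2]))
    # Probability of the first class vs. sigmoid of the score difference.
    print(two_class[0], sigmoid(z1 - z2))
```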

# One-Hot Encoding

- as we've seen so far, all our algorithms are numerical
- this means we need to input numbers, such as a score in a test or the grades, but the input data will not always look like numbers
- let's say the model receives as an input the fact that you got a gift or didn't get a gift
  - how do we turn that into numbers? well, that's easy
    - if you've got a gift, we'll just say that the input variable is 1
    - if you didn't get a gift, we'll just say that the input variable is 0
  - but, what if we have more classes as before or, let's say, our classes are Duck, Beaver and Walrus?
  - what variable do we input in the algorithm?
    - maybe we could input a 0, a 1 or a 2, but that would not work because it would assume dependencies between the classes that we can't have
    - what we do is, we come up with one variable for each of the classes
    - so, our table becomes like this:
      - that's one variable for Duck, one for Beaver and one for Walrus
      - each one has its corresponding column
      - now, if the input is a duck then the variable for duck is 1 and the variables for beaver and walrus are 0
      - similarly for the beaver and the walrus
    - we may have more columns of data but at least there are no unnecessary dependencies
    - this process is called **one-hot encoding** and it will be used a lot for processing data
    
<img src="resources/one_hot_encoding_variables.png" width="70%"/>
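
- a minimal sketch of the table above in code; the function name `one_hot` and the sample labels are hypothetical

```python
import numpy as np

classes = ["Duck", "Beaver", "Walrus"]


def one_hot(labels, classes):
    """Return one row per label, with a 1 in the column of its class and 0 elsewhere."""
    encoded = np.zeros((len(labels), len(classes)), dtype=int)
    for row, label in enumerate(labels):
        encoded[row, classes.index(label)] = 1
    return encoded


# Columns are in the order Duck, Beaver, Walrus.
print(one_hot(["Duck", "Walrus", "Beaver", "Duck"], classes))
```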

# Maximum Likelihood

- we're still in our quest for an algorithm that will help us pick the best model that separates our data
- since we're dealing with probabilities then let's use them in our favor
- let's say I'm a student and I have two models
  - one that tells me that my probability of getting accepted is 80% and one that tells me the probability is 55%
  - which model looks more accurate?
    - if I got accepted then I'd say the better model is probably the one that says 80%
    - if I didn't get accepted then the more accurate model is more likely the one that says 55%
    - but I'm just one person; what if it was me and a friend?
- the best model would more likely be the one that gives the higher probabilities to the events that happened to us, whether it's acceptance or rejection
  - the method is called **maximum likelihood**
    - what we do is we pick the model that gives the existing labels the highest probability
    - thus, by maximizing the probability, we can pick the best possible model

- let's look at the following four points: two blue and two red and two models that classify them; the one on the left and the one on the right
  - the model on the right is much better since it classifies the four points correctly, whereas the model on the left gets two points right and two points wrong

<img src="resources/maximum_likelihood_models_comparison.png" width="70%"/>

- let's see why the model on the right is better from the probability perspective
  - by that, we'll show you that the arrangement on the right is much more likely to happen than the one on the left
  - let's recall that our prediction is $\hat y = \sigma(Wx+b)$ and that this is precisely the probability of a point being labeled positive, which means blue; $P(blue) = \sigma(Wx+b)$
  - for the points in the left figure, let's say the model tells you that the probabilities of being blue are $0.9$, $0.6$, $0.3$, and $0.2$
    - notice that the points in the blue region are much more likely to be blue and the points in the red region are much less likely to be blue
  - obviously, the probability of being red is one minus the probability of being blue
    - in this case, the probabilities of the points being red in the left model are $0.1$, $0.4$, $0.7$ and $0.8$
    - for the right model, let's say that the probabilities of the two points in the right being blue are $0.7$ and $0.9$ and of the two points in the left being red are $0.8$ and $0.6$
  - we want to calculate the probability that the four points are of the colors that they actually are
    - this means the probability that the two red points are red and that the two blue points are blue
  - if we assume that the colors of the points are independent events then the probability for the whole arrangement is the product of the probabilities of the four points
    - for the left model, this is equal to $0.1 \times 0.6 \times 0.7 \times 0.2 = 0.0084$; this is very small - it's less than $1\%$
      - what we mean by this is that if the model is given by these probability spaces, then the probability that the points are of these colors is $0.0084$
    - for the right model we get $0.7 \times 0.9 \times 0.8 \times 0.6 = 0.3024$, which is around $30\%$ - this is much higher than $0.0084$
- thus, we confirm that the model on the right is better because it makes the arrangement of the points much more likely to have those colors
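
- the arithmetic above can be reproduced with a few lines; the function name `likelihood` is a label chosen here, and the lists hold the probability each model gives to the color every point actually has

```python
import numpy as np

# Probability each model assigns to the color every point actually has
# (values taken from the example above).
left_model = [0.1, 0.6, 0.7, 0.2]
right_model = [0.7, 0.9, 0.8, 0.6]


def likelihood(probabilities):
    """Probability of the whole arrangement, assuming independent points."""
    return float(np.prod(probabilities))


print(round(likelihood(left_model), 4))
print(round(likelihood(right_model), 4))
```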


- so now, what we do is the following
  - we start from the bad model, calculate the probability that the points are those colors, multiply them, and obtain a total probability of $0.0084$
  - if we just had a way to maximize this probability, we could increase it all the way to $0.3024$
  - thus, our new goal becomes precisely that, to maximize this probability
  - this method, as we stated before, is called **maximum likelihood**

# Maximizing Probabilities

- we've concluded that the probability is important
- the better model will give us a better probability
- now the question is, how do we maximize the probability?
- if you remember correctly, we talked about an error function and how minimizing this error function will take us to the best possible solution
  - could these two things be connected?
  - could we obtain an error function from the probability?
  - could it be that maximizing the probability is equivalent to minimizing the error function? - maybe



- probability is a product of numbers and products are hard
- maybe this product of four numbers from previous example doesn't look so scary, but what if we have thousands of datapoints?
  - that would correspond to a product of thousands of numbers, all of them between zero and one
  - this product would be very tiny, something like $0.0000$ something and we definitely want to stay away from those numbers
- if I have a product of thousands of numbers and I change one of them, the product will change drastically
- in summary, we really want to stay away from products
- we need to find a function that will help us turn products into sums


- the logarithm function (`log`) turns products into sums
  - $\log(ab) = \log(a) + \log(b)$
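
- a small sketch of why sums of logarithms are easier to work with than products; the probabilities used here are hypothetical (2000 points, each with probability $0.5$)

```python
import math

import numpy as np

# log(ab) = log(a) + log(b), checked on two arbitrary probabilities.
a, b = 0.8, 0.55
print(math.log(a * b), math.log(a) + math.log(b))

# A product of many probabilities underflows to 0.0 in floating point,
# while the sum of their logarithms stays an ordinary number.
probabilities = np.full(2000, 0.5)
print(np.prod(probabilities))
print(np.sum(np.log(probabilities)))
```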

# Cross-Entropy

- from now until the end of class, we'll be taking the natural logarithm (`ln`) which is base $e$ instead of $10$
  - nothing different happens with base $10$, everything works the same as everything gets scaled by the same factor
  - it's just more for convention
- the logarithm of a number between $0$ and $1$ is always a negative number since the logarithm of $1$ is $0$
- it therefore makes sense to work with the negatives of the logarithms of the probabilities, which are positive numbers


- we'll call the sum of the negatives of the logarithms of the probabilities the **cross entropy**, which is a very important concept in the class
- if we calculate the cross entropies, we see that the bad model on the left from the previous example has a cross entropy of $4.8$, which is high, whereas the good model on the right has a cross entropy of $1.2$, which is low
- a good model will give us a low cross entropy and a bad model will give us a high cross entropy
  - the reason for this is simply that a good model gives us a high probability, and the negative of the logarithm of a probability close to $1$ is a small number, and vice versa
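
- the two cross entropies can be recomputed from the probabilities of the maximum likelihood example; the function name `cross_entropy` is a label chosen for this sketch

```python
import numpy as np

# Probability each model assigns to the actual color of each point
# (same values as in the maximum likelihood example).
left_model = [0.1, 0.6, 0.7, 0.2]
right_model = [0.7, 0.9, 0.8, 0.6]


def cross_entropy(probabilities):
    """Sum of the negative natural logarithms of the probabilities."""
    return float(-np.sum(np.log(probabilities)))


print(round(cross_entropy(left_model), 2))   # bad model
print(round(cross_entropy(right_model), 2))  # good model
# Per-point errors for the bad model: low probabilities give the large values.
print(np.round(-np.log(left_model), 2))
```

- rounded to one decimal, the results agree with the $4.8$ and $1.2$ quoted above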


- this method is actually much more powerful than we think
- if we calculate the probabilities and pair the points with the corresponding logarithms, we actually get an error for each point


- so again, here we have probabilities for both models and the products of them
- we take the negatives of the logarithms, which gives us a sum of negative logarithms, and if we pair each term with the point it came from, we get a value for each point
- if we calculate the values, we get this:

<img src="resources/cross_entropy_models_comparison.png" width="70%"/>

- correction: at 2:18 in the lecture video, the top right point should be labelled $-\log(0.7)$ instead of $-\log(0.2)$

- if we look carefully at the values, we can see that the points that are misclassified have large values like $2.3$ or $1.6$, whereas the points that are correctly classified have small values
- the reason for this, again, is that a correctly classified point will have a probability that is close to $1$, and when we take the negative of its logarithm, we get a small value
- thus we can think of the negatives of these logarithms as errors at each point
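
- the per-point errors can be sketched in a few lines; the probabilities below are assumed for illustration (the exact values live in the figure), chosen to be consistent with the totals of about $4.8$ and $1.2$ quoted earlier

```python
import numpy as np

# Hypothetical per-point probabilities (assumed for illustration; they are
# consistent with the totals of about 4.8 and 1.2 quoted in the notes).
bad_model = np.array([0.6, 0.2, 0.1, 0.7])
good_model = np.array([0.7, 0.9, 0.8, 0.6])


def point_errors(probabilities):
    """Negative natural log of each point's probability of having its true color."""
    return -np.log(probabilities)


for name, probs in [("bad", bad_model), ("good", good_model)]:
    errors = point_errors(probs)
    # The sum of the per-point errors equals -ln of the product of the probabilities.
    print(name, np.round(errors, 2), round(float(errors.sum()), 2))
```

- in this sketch the two largest errors in the bad model come from the points with probabilities $0.1$ and $0.2$, that is, the misclassified ones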


- points that are correctly classified will have small errors and points that are mis-classified will have large errors
- our cross entropy will tell us if a model is good or bad
- now our goal has changed from maximizing a probability to minimizing a cross entropy in order to get from the model on the left to the model on the right
  - that error function that we're looking for, that was precisely the cross entropy

- cross entropy really says the following
  - if I have a bunch of events and a bunch of probabilities, how likely is it that those events happen based on the probabilities?
  - if it's very likely, then we have a small cross entropy
  - if it's unlikely, then we have a large cross entropy

- let's look a bit closer into Cross-Entropy by switching to a different example
- let's say we have three doors (no, this is not the Monty Hall problem)
  - we have the green door, the red door, and the blue door
  - behind each door we could have a gift or not have a gift
  - the probability of there being a gift behind each door is $0.8$ for the first one, $0.7$ for the second one, and $0.1$ for the third one
    - so for example behind the green door there is an $80\%$ probability of there being a gift, and a $20\%$ probability of there not being a gift
  - let's say we want to make a bet on the outcomes; we want to try to figure out what is the most likely scenario here
  - for that we'll assume they're independent events
  - in this case, the most likely scenario is just obtained by picking the largest probability for each door
    - the first door is more likely to have a gift than not have a gift - so we'll say there's a gift behind the first door
    - for the second door, it's also more likely that there's a gift - so we'll say there's a gift behind the second door
    - for the third door it's much more likely that there's no gift - so we'll say there's no gift behind the third door
  - as the events are independent, the probability for this whole arrangement is the product of the three probabilities which is $0.8 * 0.7 * 0.9 = 0.504$ which is roughly $50\%$
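
- the reasoning above can be sketched as follows, using the three gift probabilities from the example and assuming, as stated, that the doors are independent

```python
import numpy as np

# Probability of a gift behind the green, red, and blue doors (from the notes).
p_gift = np.array([0.8, 0.7, 0.1])

# For independent doors, the most likely scenario picks the likelier outcome per door.
most_likely = p_gift > 0.5  # True means "gift"
per_door = np.where(most_likely, p_gift, 1 - p_gift)

print(most_likely.tolist())  # [True, True, False]
print(round(float(per_door.prod()), 3))  # 0.504
```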


- let's look at a table with all the possible scenarios for the three doors
  - there are eight scenarios since each door gives us two possibilities each, and there are three doors

<img src="resources/cross_entropy_doors_example_table.png" width="70%"/>

- we do as before to obtain the probability of each arrangement by multiplying the three independent probabilities to get these numbers
  - these numbers add to $1$
- notice that the events with high probability have low cross-entropy and the events with low probability have high cross-entropy
  - for example, the second row, which has a probability of $0.504$, gives a small cross-entropy of $0.69$, and the second-to-last row, which is very unlikely with a probability of $0.006$, gives a cross-entropy of $5.12$
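
- a table like this can be rebuilt with a short sketch; the row order produced by `itertools.product` is an assumption and may differ from the order in the image

```python
from itertools import product
import math

p_gift = [0.8, 0.7, 0.1]  # gift probabilities per door, from the notes

scenarios = []
for outcome in product([1, 0], repeat=3):  # 1 = gift, 0 = no gift
    # Independent doors: multiply the probability of each door's outcome.
    prob = math.prod(p if y == 1 else 1 - p for y, p in zip(outcome, p_gift))
    scenarios.append((outcome, prob, -math.log(prob)))

for outcome, prob, ce in scenarios:
    print(outcome, round(prob, 3), round(ce, 2))

# The eight scenarios cover every possibility, so their probabilities add to 1.
total = sum(prob for _, prob, _ in scenarios)
print(round(total, 6))
```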


- let's actually calculate a formula for the cross-entropy
- here we have our three doors, and our sample scenario said that there is a gift behind the first and second doors, and no gift behind the third door
- recall that the probabilities of these events happening are $0.8$ for a gift behind the first door, $0.7$ for a gift behind the second door, and $0.9$ for no gift behind the third door
- when we calculate the cross-entropy, we get the negative of the logarithm of the product, which is a sum of the negatives of the logarithms of the factors, which is $-\ln(0.8) - \ln(0.7) - \ln(0.9)$

<img src="resources/cross_entropy_example_and_formula.png" width="70%"/>

- in order to derive the formula we'll need some variables
- let's call $p_1$ the probability that there's a gift behind the first door, $p_2$ the probability that there's a gift behind the second door, and $p_3$ the probability that there's a gift behind the third door
  - the probability of there not being a gift is $1$ minus the probability of there being a gift
- let's have another variable called $y_i$, which will be $1$ if there's a present behind the $i$-th door, and $0$ if there's no present
  - $y_i$ is technically the number of presents behind the $i$-th door
  - in this case $y_1 = 1$, $y_2 = 1$ and $y_3 = 0$
- we can put all this together and derive a formula for the cross-entropy and it's this sum (here $m = 3$ doors): $CrossEntropy=-\sum_{i=1}^{m} \left[ y_i\ln(p_i) + (1-y_i)\ln(1-p_i) \right]$


- let's look at the formula inside the summation
  - notice that if there is a present behind the $i$-th door, then $y_i = 1$; so the first term is $\ln(p_i)$ and the second term is $0$
  - likewise, if there is no present behind the $i$-th door, then $y_i = 0$; so the first term is $0$ and the second term is precisely $\ln(1-p_i)$


- this formula really encompasses the sums of the negative of logarithms which is precisely the cross-entropy
- the cross-entropy really tells us when two vectors are similar or different
- for example, if we calculate the cross entropy of the first pair from the image above, the outcomes $(1, 1, 0)$ with the probabilities $(0.8, 0.7, 0.1)$, we get $0.69$
  - that is low because the vectors are similar, which means that the arrangement of gifts given by the first set of numbers is likely to happen based on the probabilities given by the second set of numbers
- on the other hand, if we calculate the cross-entropy of the second pair, the outcomes $(0, 0, 1)$ with the same probabilities, we get $5.12$, which is very high
  - this is because the arrangement of gifts being given by the first set of numbers is very unlikely to happen from the probabilities given by the second set of numbers

```python
# Let's code the formula for cross-entropy in Python.
# As in the video, Y in the quiz is for the category, and P is the probability.

import numpy as np

# Write a function that takes as input two lists Y, P,
# and returns the float corresponding to their cross-entropy.
def cross_entropy(Y, P):
    Y = np.asarray(Y, dtype=float)  # np.float_ was removed in NumPy 2.0
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))
```
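
- as a usage sketch, the function can be checked against the two door scenarios discussed above; the function is repeated here so the snippet runs on its own, and the variable names `likely` and `unlikely` are just illustrative

```python
import numpy as np


def cross_entropy(Y, P):
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))


P = [0.8, 0.7, 0.1]  # gift probabilities per door
likely = cross_entropy([1, 1, 0], P)  # gift, gift, no gift
unlikely = cross_entropy([0, 0, 1], P)  # no gift, no gift, gift
print(round(float(likely), 2), round(float(unlikely), 2))
```

- note that this simple version assumes every probability is strictly between $0$ and $1$; a probability of exactly $0$ or $1$ would lead to `log(0)`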

# Multi-Class Cross Entropy

- previous example was for when we had two classes; namely receiving a gift or not receiving a gift
- what happens if we have more classes?
- we have a similar problem - we still have three doors (this problem is still not the Monty Hall problem)
  - behind each door there can be an animal, and the animal can be of three types - it can be a duck, it can be a beaver, or it can be a walrus
  - we also have a table of probabilities
    - according to the first column of the table, behind the first door, the probability of finding a duck is $0.7$, the probability of finding a beaver is $0.2$, and the probability of finding a walrus is $0.1$
    - the numbers in each column need to add to $1$ because there is some animal behind each door
    - the numbers in the rows do not need to add to $1$
      - it could easily be that we have a duck behind every door and that's okay

- let's look at a sample scenario
  - let's say we have our three doors, and behind the first door, there's a duck, behind the second door there's a walrus, and behind the third door there's also a walrus.
  - recall that the probabilities are given by the table
    - a duck behind the first door is $0.7$ likely, a walrus behind the second door is $0.3$ likely, and a walrus behind the third door is $0.4$ likely
  - the probability of obtaining these three animals is the product of the probabilities of the three events, since they are independent events, which in this case is $0.7 \cdot 0.3 \cdot 0.4 = 0.084$
- as we learned, the cross entropy (CE) here is given by the sum of the negatives of the logarithms of the probabilities
  - the first one is $-\ln(0.7)$, the second one is $-\ln(0.3)$, the third one is $-\ln(0.4)$
  - the cross entropy is the sum of these three, which is $2.48$


- but we want a formula, so let's put some variables here
- so $p_{11}$ is the probability of finding a duck behind door one, $p_{12}$ is the probability of finding a duck behind door two etc.
- let's have the indicator variables
  - $y_{1j} = 1$ if there's a duck behind door $j$
  - $y_{2j} = 1$ if there's a beaver behind door $j$
  - $y_{3j} = 1$ if there's a walrus behind door $j$
  - these variables are $0$ otherwise
- the formula for the cross entropy with more classes: $CrossEntropy=-\sum_{i=1}^{n} \sum_{j=1}^{m} y_{ij}\ln(p_{ij})$
  - with the indices as defined above, $i$ runs over the $n$ classes (animals) and $j$ runs over the $m$ doors; in this example both are $3$
  - this formula works because $y_{ij} = 0$ or $y_{ij} = 1$ makes sure that we're only adding the logarithms of the probabilities of the events that actually have occurred
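
- here is a minimal sketch of the multi-class formula for the duck/walrus/walrus scenario; only the entries $0.7$, $0.2$, $0.1$ (door one), $0.3$ (walrus, door two) and $0.4$ (walrus, door three) come from the notes, and the remaining table entries are assumed so that each column adds to $1$

```python
import numpy as np

# Rows are animals (duck, beaver, walrus); columns are doors 1 to 3.
# Entries not quoted in the notes are assumptions chosen so columns sum to 1.
P = np.array(
    [
        [0.7, 0.3, 0.1],  # duck
        [0.2, 0.4, 0.5],  # beaver
        [0.1, 0.3, 0.4],  # walrus
    ]
)

# Indicators y_ij: duck behind door 1, walrus behind doors 2 and 3.
Y = np.array(
    [
        [1, 0, 0],
        [0, 0, 0],
        [0, 1, 1],
    ]
)


def multiclass_cross_entropy(Y, P):
    """-sum_ij y_ij * ln(p_ij); only entries with y_ij = 1 contribute."""
    return -np.sum(Y * np.log(P))


ce = multiclass_cross_entropy(Y, P)
print(round(float(ce), 2))
```

- because the indicators zero out every other entry, the assumed values do not affect the result for this scenario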

# Logistic Regression

## Calculating the Error Function

- this is a good time for a quick recap of the last couple of lessons
- we have two models; the bad model on the left and the good model on the right
  - for each one of those we calculate the cross entropy, which is the sum of the negatives of the logarithms of the probabilities of the points being their colors
  - we conclude that the one on the right is better because its cross entropy is much smaller
- let's actually calculate the formula for the error function
- let's split into two cases (on left model)
  - the first case being when $y=1$
    - when the point is blue to begin with, the model tells us that the probability of being blue is the prediction $P(blue) = \hat y$
    - as we can see the point in the blue area has more probability of being blue than the point in the red area
    - our error is simply the negative logarithm of this probability: $Error = -\ln(\hat y)$
  - the second case being when $y=0$, so when the point is red, we need to calculate the probability of the point being red
    - the probability of the point being red: $P(red) = 1 - P(blue) = 1- \hat y$
    - the error is precisely the negative logarithm of this probability which is: $Error = -\ln(1- \hat y)$
  - we can summarize these two formulas into: $Error = -(1-y)(\ln(1-\hat y)) - y\ln(\hat y)$
    - this formula works because if the point is blue, then $y=1$, which means $1-y=0$, which makes the first term $0$, and the second term is simply the negative logarithm of $\hat y$
    - similarly, if the point is red, then $y=0$, so the second term of the formula is $0$ and the first one is the negative logarithm of $1 - \hat y$


- the formula for the error function is simply the sum over all the error functions of the points, which is precisely the summation here: $ErrorFunction = -\dfrac{1}{m}\sum_{i=1}^{m} \left[ (1-y_i)\ln(1-\hat y_i) + y_i\ln(\hat y_i) \right]$
  - by convention we'll actually consider the average, not the sum which is where we are dividing by $m$
  - from now on we'll use this formula as our error function



- since $\hat y$ is given by the sigmoid of the linear function $Wx + b$, the total formula for the error is actually in terms of $W$ and $b$, which are the weights of the model
- it's simply the summation we see here: $E(W,b) = -\dfrac{1}{m}\sum_{i=1}^{m} \left[ (1-y_i)\ln(1-\sigma(Wx^{(i)} + b)) + y_i\ln(\sigma(Wx^{(i)} + b)) \right]$
  - in this case $y_i$ is just the label of the point $x^{(i)}$
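
- the error $E(W,b)$ can be sketched as a function of the weights; the toy data and the two weight vectors below are hypothetical, chosen only to show that a model that separates the points gets a lower error than one that has them flipped

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def error_function(W, b, X, y):
    """Mean binary cross entropy of the predictions sigmoid(X @ W + b)."""
    y_hat = sigmoid(X @ W + b)
    return -np.mean((1 - y) * np.log(1 - y_hat) + y * np.log(y_hat))


# Hypothetical toy data: four 2-D points with labels (1 = blue, 0 = red).
X = np.array([[1.0, 2.0], [2.0, 1.5], [-1.0, -0.5], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])

good = error_function(np.array([1.0, 1.0]), 0.0, X, y)  # separates the data
bad = error_function(np.array([-1.0, -1.0]), 0.0, X, y)  # labels flipped
print(round(float(good), 3), round(float(bad), 3))
```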


- just a small aside, what we did is for binary classification problems
- if we have a multiclass classification problem then the error is now given by the multiclass cross entropy
- this formula is given here: $ErrorFunction = -\dfrac{1}{m}\sum_{i=1}^{m} \sum_{j=1}^{n} y_{ij}\ln(\hat y_{ij})$
  - for every data point, we take the product of the label times the logarithm of the prediction and then we average all these values

## Minimizing the error function

- we start with some random weights, which will give us the predictions $\sigma(Wx+b)$
- that also gives us an error function given by this formula: $E(W,b) = -\dfrac{1}{m}\sum_{i=1}^{m} \left[ (1-y_i)\ln(1-\sigma(Wx^{(i)} + b)) + y_i\ln(\sigma(Wx^{(i)} + b)) \right]$
  - remember that the summands are also error functions for each point, so each point will give us a larger value if it's misclassified and a smaller one if it's correctly classified
- the way we're going to minimize this function is to use gradient descent


- here's Mt. Errorest and this is us, and we're going to try to jiggle the line around to see how we can decrease the error function
- the error function is the height which is $E(W,b)$, where $W$ and $b$ are the weights
- what we'll do is use gradient descent in order to get to the bottom of the mountain at a much smaller height, which gives us a smaller error function $E(W',b')$
  - this will give rise to new weights, $W'$ and $b'$ which will give us a much better prediction: $\sigma(W'x+b')$

# Gradient Descent

- error function is a function of the weights
- it's got a mathematical structure so it's not Mt. Everest anymore, it's more of a mount Math-Er-Horn
- we're standing somewhere in Mount Math-Er-Horn and we need to go down
- the inputs of the functions are $w_1$ and $w_2$ and the error function is given by $E$
- then the gradient of $E$ is given by the vector sum of the partial derivatives of $E$ with respect to $w_1$ and $w_2$

<img src="resources/gradient_descent_vector_sum.png" width="70%"/>

- this gradient actually tells us the direction we want to move if we want to increase the error function the most
- thus, if we take the negative of the gradient, this will tell us how to decrease the error function the most
  - this is precisely what we'll do
- at the point we're standing, we'll take the negative of the gradient of the error function at that point
- then we take a step in that direction
- once we take a step, we'll be in a lower position, and we do it again and again until we are able to get to the bottom of the mountain


- this is how we calculate the gradient
- we start with our initial prediction $\hat y = \sigma(Wx + b)$ - let's say this prediction is bad because the error is large since we're high up in the mountain
- the prediction looks like this: $\hat y = \sigma(w_1x_1 + ... + w_nx_n + b)$
- the error function is given by the formula we saw before, but what matters here is the gradient of the error function
  - the gradient of the error function is precisely the vector formed by the partial derivative of the error function with respect to the weights and the bias: $\triangledown E = (\frac{\partial E}{\partial w_1}, ..., \frac{\partial E}{\partial w_n}, \frac{\partial E}{\partial b})$
- now, we take a step in the direction of the negative of the gradient
  - as before, we don't want to make any dramatic changes, so we'll introduce a small learning rate $\alpha$; for example, $\alpha = 0.1$
    - we'll multiply the gradient by that number
- now taking the step is exactly the same thing as updating the weights and the bias as follows
  - the weight $w_i$ will now become $w_i^{\prime}$
    - $w_i^{\prime} \leftarrow w_i - \alpha * \frac{\partial E}{\partial w_i}$
  - the bias will now become $b^{\prime}$
    - $b^{\prime} \leftarrow b - \alpha * \frac{\partial E}{\partial b}$
- this will take us to a prediction with a lower error function
- we can conclude that the prediction we have now: $\hat y = \sigma(W^{\prime}x + b^{\prime})$ is better than the one we had before with weights $W$ and $b$
- this is precisely the gradient descent step

<img src="resources/gradient_descent_prediction_mt_matherhorn.png" width="70%"/>

## Gradient Calculation

- the sigmoid function has a really nice derivative, namely $\sigma^{\prime}(x)=\sigma(x)(1−\sigma(x))$
- the reason for this is the following; we can calculate it using the quotient rule:

<img src="resources/gradient_descent_sigmoid_quotient_formula.gif"/>
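
- the identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ can also be sanity-checked numerically; this sketch compares it with a central finite difference, where the step `h` and the sample points `xs` are arbitrary choices

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))


xs = np.linspace(-5, 5, 11)
h = 1e-6
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2 * h)  # central difference
max_gap = np.max(np.abs(numeric - sigmoid_prime(xs)))
print(max_gap < 1e-8)
```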

- and now, let's recall that if we have $m$ points labelled $x^{(1)}, x^{(2)},\ldots, x^{(m)}$, the error formula is: $E = -\dfrac{1}{m}\sum_{i=1}^{m} \left[ (1-y_i)\ln(1-\hat y_i) + y_i\ln(\hat y_i) \right]$ where the prediction is given by $\hat y_i = \sigma(Wx^{(i)} + b)$
- our goal is to calculate the gradient of $E$, at a point $x = (x_1, \ldots, x_n)$, given by the partial derivatives  $\triangledown E = (\frac{\partial E}{\partial w_1}, ..., \frac{\partial E}{\partial w_n}, \frac{\partial E}{\partial b})$
- to simplify our calculations, we'll actually think of the error that each point produces, and calculate the derivative of this error
- the total error is the average of the errors at all the points
- the error produced by each point is $E = −y\ln(\hat y) − (1 − y)\ln(1 − \hat y)$


- in order to calculate the derivative of this error with respect to the weights, we'll first calculate $\frac{\partial \hat y}{\partial w_j}$
- recall that $\hat y = \sigma(Wx+b)$, so:

<img src="resources/gradient_descent_partial_yhat.gif"/>

- the last equality is because the only term in the sum which is not a constant with respect to $w_j$ is precisely $w_jx_j$, which clearly has derivative $x_j$


- now, we can go ahead and calculate the derivative of the error $E$ at a point $x$ with respect to the weight $w_j$

<img src="resources/gradient_descent_partial_error.png"/>

- a similar calculation will show us that $\frac{\partial}{\partial b} E = -(y - \hat y)$
- this actually tells us something very important
  - for a point with coordinates $(x_1, \ldots, x_n)$, label $y$ and prediction $\hat y$, the gradient of the error function at that point is $(−(y−\hat y)x_1,\dots, −(y−\hat y)x_n,−(y−\hat y))$
  - in summary, the gradient is $\triangledown E = −(y−\hat y)(x_1,\dots,x_n,1)$
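
- as a hedged check of this result, the sketch below compares the closed-form gradient with finite differences of the per-point error; the point, label, and parameters are hypothetical values used only for the comparison

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def point_error(params, x, y):
    """Cross-entropy error of one point; params = (w_1, ..., w_n, b)."""
    y_hat = sigmoid(np.dot(params[:-1], x) + params[-1])
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)


def analytic_gradient(params, x, y):
    """Closed form from the notes: -(y - y_hat) * (x_1, ..., x_n, 1)."""
    y_hat = sigmoid(np.dot(params[:-1], x) + params[-1])
    return -(y - y_hat) * np.append(x, 1.0)


# Hypothetical point, label, and parameters used only for the check.
x, y = np.array([0.5, -1.2]), 1.0
params = np.array([0.3, -0.8, 0.1])

h = 1e-6
numeric = np.array(
    [
        (point_error(params + h * e, x, y) - point_error(params - h * e, x, y))
        / (2 * h)
        for e in np.eye(len(params))
    ]
)
print(np.allclose(numeric, analytic_gradient(params, x, y), atol=1e-6))
```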


- the gradient is actually a scalar times the coordinates of the point!
- and what is the scalar? nothing less than a multiple of the difference between the label and the prediction


- the scalar we obtained above means:
  - the closer the label is to the prediction, the smaller the gradient
  - the farther the label is from the prediction, the larger the gradient
- so, a small gradient means we'll change our coordinates by a little bit, and a large gradient means we'll change our coordinates by a lot

## Gradient Descent Step

- since the gradient descent step simply consists in subtracting a multiple of the gradient of the error function at every point, then this updates the weights in the following way: $w_i^{\prime} \leftarrow w_i - \alpha [-(y - \hat y) x_i]$, which is equivalent to $w_i^{\prime} \leftarrow w_i + \alpha (y - \hat y) x_i$

- similarly, it updates the bias in the following way: $b^{\prime} \leftarrow b + \alpha (y - \hat y)$

**Note:**
- since we've taken the average of the errors, the term we are adding should be $\frac{1}{m} \cdot \alpha$ instead of $\alpha$, but as $\alpha$ is a constant, then in order to simplify calculations, we'll just take $\frac{1}{m} \cdot \alpha$ to be our learning rate, and abuse the notation by just calling it $\alpha$
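
- put together, the update rules above can be sketched as a small training loop; the toy data, the function name `train`, and the hyperparameters (`alpha`, `epochs`, `seed`) are assumptions made for illustration, and the loop updates after every point rather than after averaging over all points

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def train(X, y, alpha=0.1, epochs=100, seed=0):
    """Per-point gradient descent for logistic regression."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=X.shape[1])  # small random initial weights
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = sigmoid(np.dot(W, x_i) + b)
            W = W + alpha * (y_i - y_hat) * x_i  # w_i <- w_i + alpha (y - y_hat) x_i
            b = b + alpha * (y_i - y_hat)  # b <- b + alpha (y - y_hat)
    return W, b


# Hypothetical linearly separable toy data (1 = blue, 0 = red).
X = np.array([[1.0, 2.0], [2.0, 1.5], [-1.0, -0.5], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])

W, b = train(X, y)
predictions = (sigmoid(X @ W + b) > 0.5).astype(int)
print(predictions.tolist())
```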

## Gradient Descent: The Code

- from before we saw that one weight update can be calculated as: $\Delta w_i = \alpha * \delta * x_i$ where $\alpha$ is the learning rate and $\delta$ is the error term
- previously, we utilized the loss (error) function for logistic regression, because we were performing a binary classification task
- this time we'll try to get the function to learn a value instead of a class
- therefore, we'll use a simpler loss function, whose error term is defined as: $\delta = (y - \hat y) f'(h) = (y - \hat y) f'(\sum w_i x_i)$
  - note that $f'(h)$ is the derivative of the activation function $f(h)$, and $h$ is the input to the activation function, which in the case of a neural network is the sum of the weights times the inputs


- now I'll write this out in code for the case of only one output unit
- we'll also be using the sigmoid as the activation function $f(h)$

```python
import numpy as np


# Defining the sigmoid function for activations
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


# Derivative of the sigmoid function
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))


# Input data
x = np.array([0.1, 0.3])
# Target
y = 0.2
# Input to output weights
weights = np.array([-0.8, 0.5])

# The learning rate, alpha in the weight step equation
learnrate = 0.5

# The neural network output (y-hat)
nn_output = sigmoid(x[0] * weights[0] + x[1] * weights[1])
# or nn_output = sigmoid(np.dot(x, weights))

# output error (y - y-hat)
error = y - nn_output

# error term (lowercase delta)
error_term = error * sigmoid_prime(np.dot(x, weights))

# Gradient descent step
del_w = [learnrate * error_term * x[0], learnrate * error_term * x[1]]
# or del_w = learnrate * error_term * x
```

In [4]:
import numpy as np


def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))


learnrate = 0.5
x = np.array([1, 2])
y = np.array(0.5)

# Initial weights
w = np.array([0.5, -0.5])

# Calculate one gradient descent step for each weight
# TODO: Calculate output of neural network
nn_output = sigmoid(np.dot(x, w))

# TODO: Calculate error of neural network
error = y - nn_output

# TODO: Calculate change in weights
del_w = learnrate * error * nn_output * (1 - nn_output) * x

print("Neural Network output:")
print(nn_output)
print("Amount of Error:")
print(error)
print("Change in Weights:")
print(del_w)

Neural Network output:
0.3775406687981454
Amount of Error:
0.1224593312018546
Change in Weights:
[0.0143892 0.0287784]



# Perceptron vs Gradient Descent

- gradient descent algorithm
  - we take the weights and change them from $w_i$ to $w_i + \alpha(y - \hat y)x_i$


- perceptron algorithm
  - not every point changes weights, only the misclassified ones
  - if $x$ is misclassified
    - change $w_i$ to:
      - $w_i + \alpha x_i$ if positive
      - $w_i - \alpha x_i$ if negative
  - in the perceptron algorithm, the labels are $1$ and $0$ and the predictions $\hat y$ are also $1$ and $0$
  - if the point is correctly classified: $y - \hat y = 0$ because $y = \hat y$
  - if the point is misclassified
    - $y - \hat y = 1$ if positive
    - $y - \hat y = -1$ if negative
    - if the point is labeled blue, then $y = 1$, and if it's misclassified, then the prediction must be $\hat y = 0$, so $y - \hat y = 1$
    - similarly, for a misclassified point labeled red, $y = 0$ and $\hat y = 1$, so $y - \hat y = -1$


- both algorithms are exactly the same thing
- the only difference is that in gradient descent algorithm $\hat y$ can take any number between $0$ and $1$, whereas in the perceptron algorithm, $\hat y$ can take only the values $0$ or $1$


- both in the perceptron algorithm and the gradient descent algorithm, a point that is misclassified tells the line to come closer because eventually, it wants the line to surpass it so it can be on the correct side
- what happens if the point is correctly classified?
  - the perceptron algorithm says do absolutely nothing
  - in the gradient descent algorithm, you are changing the weights
    - if we look carefully, what the point is telling the line, is to go farther away
      - this makes sense because if you're correctly classified, say, if you're a blue point in the blue region, you'd like to be even more into the blue region, so your prediction is even closer to one, and your error is even smaller
      - similarly, for a red point in the red region
    - misclassified points ask the line to come closer and correctly classified points ask the line to go farther away
      - the line listens to all the points and takes steps in such a way that it eventually arrives to a pretty good solution
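
- the comparison can be sketched with one shared update rule and two activations; the weights, the point, and the convention that the step function outputs $1$ for a non-negative score are assumptions for illustration

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def step(z):
    return 1.0 if z >= 0 else 0.0


def update(W, b, x, y, alpha, activation):
    """Shared rule: w_i <- w_i + alpha (y - y_hat) x_i, b <- b + alpha (y - y_hat)."""
    y_hat = activation(np.dot(W, x) + b)
    return W + alpha * (y - y_hat) * x, b + alpha * (y - y_hat)


# Hypothetical weights and a correctly classified blue point (y = 1).
W, b, alpha = np.array([1.0, 1.0]), 0.0, 0.1
x, y = np.array([1.0, 2.0]), 1.0

W_perc, b_perc = update(W, b, x, y, alpha, step)  # y_hat = 1, so nothing changes
W_gd, b_gd = update(W, b, x, y, alpha, sigmoid)  # y_hat < 1, so a small change
print(W_perc, W_gd)
```

- in this sketch the perceptron leaves the weights untouched, while gradient descent nudges them so the prediction for the blue point moves a little closer to $1$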

# Non-Linear Models

- we've been dealing a lot with data sets that can be separated by a line but as you can imagine the real world is much more complex than that
- this is where neural networks can show their full potential
- in the next few videos we'll see how to deal with more complicated data sets that require highly non-linear boundaries

- let's go back to this example of where we saw some data that is not linearly separable
- a line cannot divide these red and blue points; we looked at some solutions and, if you remember, the one we considered more seriously was this curve over here
- what I'll teach you now is how to find this curve, and it's very similar to before
  - we'll still use gradient descent
  - in a nutshell, what we're going to do is for these data which is not separable with a line, we're going to create a probability function where the points in the blue region are more likely to be blue and the points in the red region are more likely to be red
  - and this curve here that separates them is a set of points which are equally likely to be blue or red
  - everything will be the same as before except this equation won't be linear and that's where neural networks come into play

# Neural Network Architecture

- we create nonlinear models by combining linear models into a nonlinear model
- visually it looks like the models superimposed, creating the resulting model
  - it's almost like we're doing arithmetic on models
  - it's like saying "This line plus this line equals that curve."


- a linear model as we know is a whole probability space
- this means that for every point it gives us the probability of the point being blue (in this example)

<img src="resources/model_combination_with_sigmoid.png" width="70%"/>


- for example, this point in upper model is in the blue region so its probability of being blue is $0.7$
- the same point given by the second probability space is also in the blue region so its probability of being blue is $0.8$
- the simplest way to combine two numbers is to add them, right?
  - so $0.8 + 0.7 = 1.5$
  - but now, this doesn't look like a probability anymore since it's bigger than $1$ and probabilities need to be between $0$ and $1$
  - we can use sigmoid function which turns every number into something between $0$ and $1$
  - we applied the sigmoid function to $1.5$ to get the value $0.82$ and that's the probability of this point being blue in the resulting probability space
- so now we've managed to create a probability function for every single point in the plane and that's how we combined two models
- we calculate the probability for one of them, the probability for the other, then add them and then we apply the sigmoid function


- now, what if we wanted to weight this sum?
- what if, say, we wanted the model on the top to have more of a say in the resulting probability than the model on the bottom?
- well, we can add weights
- for example, I can say "Seven times the first one plus five times the second one."

<img src="resources/model_combination_with_weights.png" width="70%"/>

- what I do to combine the models is I take the first probability, multiply it by $7$, then take the second one and multiply it by $5$ and I can even add a bias if I want, say, the bias is $-6$, then we add it to the whole equation and after all we apply sigmoid function
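
- both ways of combining the models can be sketched for a single point; applying the weights $7$, $5$ and the bias $-6$ to the same point with probabilities $0.7$ and $0.8$ is an assumption made for illustration

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


p_top, p_bottom = 0.7, 0.8  # probabilities of "blue" from the two linear models

# Plain sum followed by a sigmoid.
combined = sigmoid(p_top + p_bottom)

# Weighted sum with a bias: 7 * first + 5 * second - 6, then a sigmoid.
weighted = sigmoid(7 * p_top + 5 * p_bottom - 6)

print(round(combined, 2), round(weighted, 2))
```

- the unweighted result rounds to the $0.82$ computed earlier; the weighted one is just another number between $0$ and $1$, so it can still be read as a probability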


- before we had a line that is a linear combination of the input values times the weights plus a bias
- now we have that this model is a linear combination of the two previous models times the weights plus some bias
- so it's almost the same thing
- it's almost like this curved model on the right
- it's a linear combination of the two linear models before or we can even think of it as the line between the two models
  - this is no coincidence
  - this is at the heart of how neural networks get built
- we can imagine that we can keep doing this always obtaining more new complex models out of linear combinations of the existing ones
  - this is what we're going to do to build our neural networks

- so we can add two linear models to obtain a third model
- as a matter of fact, we can take a linear combination of two models
- so, the first model times a constant plus the second model times a constant plus a bias and that gives us a non-linear model
  - that looks a lot like perceptrons where we can take a value times a constant plus another value times a constant plus a bias and get a new value
  - that's no coincidence
  - that's actually the building block of Neural Networks


- let's look at an example
  - let's say we have this linear model where the linear equation is $5x_1 - 2x_2 + 8$ and we have another linear model with equation $7x_1 - 3x_2 - 1$
  - let's use another perceptron to combine these two models using the linear equation: seven times the first model plus five times the second model minus six
  
<img src="resources/linear_models_combination_perceptrons.png" width="70%"/>

  - now the magic happens when we join these together and we get a Neural Network
  - we clean it up a bit and we obtain this
    - all the weights are there
      - the weights on the left tell us what equations the linear models have
      - the weights on the right tell us what the linear combination is of the two models to obtain the curved non-linear model on the right
      
<img src="resources/linear_models_combination_perceptrons_cleaned.png" width="70%"/>

  - so, whenever you see a Neural Network like the one on the left, think of what could be the nonlinear boundary defined by the Neural Network
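
- here is a minimal sketch of this small network as a forward pass; the weights come from the equations in the example above, while the input point and the function name `forward` are hypothetical, and it assumes each linear model is passed through a sigmoid before being combined

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


# Hidden layer: one row per linear model, using the equations from the notes.
W_hidden = np.array(
    [
        [5.0, -2.0],  # 5*x1 - 2*x2 + 8
        [7.0, -3.0],  # 7*x1 - 3*x2 - 1
    ]
)
b_hidden = np.array([8.0, -1.0])

# Output layer: 7 * (first model) + 5 * (second model) - 6.
W_out = np.array([7.0, 5.0])
b_out = -6.0


def forward(x):
    hidden = sigmoid(W_hidden @ x + b_hidden)  # probabilities from the two models
    return sigmoid(W_out @ hidden + b_out)  # combined non-linear model


x = np.array([1.0, 1.0])  # hypothetical input point
print(forward(x))
```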


- note that the perceptron in the previous example was drawn using the notation that puts the bias inside the node
- this can also be drawn using the notation that keeps the bias as a separate node

<img src="resources/linear_models_combination_perceptrons_notation.png" width="70%"/>

- here, what we do is, in every layer we have a bias unit coming from a node with a one on it
  - so for example, $-8$ on the top node becomes an edge labelled $-8$ coming from the bias node
- we can see that this neural network uses a sigmoid activation function in the perceptrons

## Multiple layers

- not all neural networks look like the one above
- they can be way more complicated!
- in particular, we can do the following things:
  - add more nodes to the input, hidden, and output layers
  - add more layers


- neural networks have a certain special architecture with layers
  - the first layer is called the input layer, which contains the inputs, in this case, $x_1$ and $x_2$
  - the next layer is called the hidden layer, which is a set of linear models created with this first input layer
  - the final layer is called the output layer, where the linear models get combined to obtain a nonlinear model

<img src="resources/neural_networks_layers.png" width="70%"/>


- we can have different architectures
  - for example, we can have a larger hidden layer where we're combining three linear models to obtain the triangular boundary in the output layer


- what happens if the input layer has more nodes?
  - for example, this neural network has three nodes in its input layer
  - that just means we're not living in two-dimensional space anymore
  - we're living in three-dimensional space, and now our hidden layer, the one with the linear models, just gives us a bunch of planes in three space, and the output layer bounds a nonlinear region in three space
  
<img src="resources/neural_networks_layers_three_input_nodes.png" width="70%"/>

- in general, if we have n nodes in our input layer, then we're thinking of data living in n-dimensional space


- what if our output layer has more nodes?
  - then we just have more outputs
  - in that case, we just have a multiclass classification model
  - if our model is telling us if an image is a cat or dog or a bird, then we simply have each node in the output layer output a score for each one of the classes: one for the cat, one for the dog, and one for the bird

<img src="resources/neural_networks_layers_three_output_nodes.png" width="70%"/>


- what if we have more layers?
  - then we have what's called a deep neural network
  - our linear models combine to create nonlinear models and then these combine to create even more nonlinear models
  
<img src="resources/neural_networks_multiple_layers.png" width="70%"/>


- in general, we can do this many times and obtain highly complex models with lots of hidden layers
  - this is where the magic of neural networks happens
  - many of the models in real life, for self-driving cars or for game-playing agents, have many, many hidden layers
    - that neural network will just split the n-dimensional space with a highly nonlinear boundary

<img src="resources/neural_networks_multiple_layers_nonlinear_boundary.png" width="70%"/>

## Multi-Class Classification

- it seems that neural networks work really well when the problem consists of classifying two classes
  - for example, if the model predicts a probability of receiving a gift or not then the answer just comes as the output of the neural network


- what happens if we have more classes?
  - say, we want the model to tell us if an image is a duck, a beaver, or a walrus
  - one thing we can do is create a neural network to predict if the image is a duck, then another neural network to predict if the image is a beaver, and a third neural network to predict if the image is a walrus
    - then we can just use SoftMax or pick the answer that gives us the highest probability
    - but this seems like overkill
  - the first layers of the neural network should be enough to tell us things about the image and maybe just the last layer should tell us which animal it is
    - as a matter of fact, as you'll see in the CNN section, this is exactly the case
    - so what we need here is to add more nodes in the output layer and each one of the nodes will give us the probability that the image is each of the animals
    - we take the scores and apply the SoftMax function that was previously defined to obtain well-defined probabilities
    - this is how we get neural networks to do multi-class classification

<img src="resources/neural_networks_multiclass_classification.png" width="70%"/> 
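
- as a rough sketch of that last step, the snippet below turns three output-layer scores into probabilities with SoftMax; the scores and the class order are made-up values for illustration, not the output of a trained network

```python
import numpy as np


def softmax(scores):
    """Turn a 1D array of scores into probabilities that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow in exp
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()


# Hypothetical scores from the three output nodes (assumed values)
classes = ["duck", "beaver", "walrus"]
scores = np.array([2.0, 1.0, 0.1])

probabilities = softmax(scores)
predicted_class = classes[int(np.argmax(probabilities))]

print(probabilities)
print(predicted_class)
```

- the node with the highest score also gets the highest probability, so picking the largest probability and picking the largest score give the same class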

# Feedforward

- now that we have defined what neural networks are, we need to learn how to train them
  - training them really means finding the parameters they should have on the edges in order to model our data well
- in order to learn how to train them, we need to look carefully at how they process the input to obtain an output


- let's look at our simplest neural network, a perceptron

<img src="resources/feedforward_simple_nn_perceptron.png" width="70%"/>

  - this perceptron receives a data point of the form $(x_1, x_2)$ where the label is $y = 1$
    - this means that the point is blue
  - the perceptron is defined by a linear equation, say, $w_1x_1 + w_2x_2 + b$, where $w_1$ and $w_2$ are the weights on the edges and $b$ is the bias in the node
    - here, $w_1$ is bigger than $w_2$, so we'll denote that by drawing the edge labelled $w_1$ much thicker than the edge labelled $w_2$
  - what the perceptron does is it plots the point $(x_1, x_2)$ and it outputs the probability that the point is blue
  - here, since the point is in the red area, the output is a small number: the point is not very likely to be blue
  - this process is known as feedforward
- we can see that this is a bad model because the point is actually blue, as its label $y = 1$ tells us


- if we have a more complicated neural network, then the process is the same

<img src="resources/feedforward_more_complex_nn.png" width="70%"/>

- here, we have thick edges corresponding to large weights and thin edges corresponding to small weights
- the neural network plots the point in the top graph and also in the bottom graph
  - the output from the top model is a small number, since the point lies in the red area, which means it has a small probability of being blue
  - the output from the bottom model is a large number, since the point lies in the blue area, which means it has a large probability of being blue
- the two models get combined into this nonlinear model, and the output layer just plots the point and tells us the probability that the point is blue
- as you can see, this is a bad model because it puts the point in the red area and the point is blue
- again, this process is called feedforward and we'll look at it more carefully


- here, we have our neural network again in a different notation, with the bias drawn on the outside

<img src="resources/feedforward_nn_matrices.png" width="70%"/>

- now we have a matrix of weights
- the matrix $W_1$ denotes the first layer, and its entries are the weights $W_{11}$ up to $W_{32}$
  - notice that the biases have now been written as $W_{31}$ and $W_{32}$; this is just for convenience
- in the next layer, we also have a matrix; this one is $W_2$ for the second layer  
  - this layer contains the weights that tell us how to combine the linear models in the first layer to obtain the nonlinear model in the second layer
- now what happens is some math
  - we have the input in the form $(x_1, x_2, 1)$ where the $1$ comes from the bias unit
  - we multiply it by the matrix $W_1$ to get these outputs
  - then, we apply the sigmoid function to turn the outputs into values between $0$ and $1$
  - then the vector formed by these values gets a $1$ attached for the bias unit and is multiplied by the second matrix
  - this returns an output that now gets thrown into a sigmoid function to obtain the final output which is $\hat y$
    - $\hat y$ is the prediction or the probability that the point is labeled blue

- so this is what neural networks do
  - they take the input vector and then apply a sequence of linear models and sigmoid functions
  - these maps when combined become a highly non-linear map
  - the final formula is simply $\hat y = \sigma \circ W_2 \circ \sigma \circ W_1(x)$, that is, the sigmoid of $W_2$ combined with the sigmoid of $W_1$ applied to $x$
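
- a minimal sketch of these steps in NumPy, assuming a network with two inputs, two hidden units and one output; the weight values are made up, and storing the biases in the last row of each matrix is one possible layout rather than the only one

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


# Hypothetical weights; the last row of each matrix holds the biases
W1 = np.array([[0.5, -0.3],
               [0.2, 0.8],
               [-0.1, 0.4]])   # shape (3, 2): rows are x_1, x_2, bias unit
W2 = np.array([[0.7],
               [-0.6],
               [0.1]])         # shape (3, 1): rows are hidden 1, hidden 2, bias unit

x = np.array([1.0, 2.0])

# Append 1 for the bias unit, multiply by W1, apply the sigmoid
hidden = sigmoid(np.dot(np.append(x, 1.0), W1))
# Append 1 again, multiply by W2, apply the sigmoid to get the prediction
y_hat = sigmoid(np.dot(np.append(hidden, 1.0), W2))[0]

print(hidden)
print(y_hat)
```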


- just to reinforce this, we do it again on a multi-layer perceptron or neural network

<img src="resources/feedforward_multilayer_nn_matrices.png" width="70%"/>

- to calculate our prediction $\hat y$, we start with the input vector $x$, then we apply the first matrix and a sigmoid function to get the values in the second layer
- then, we apply the second matrix and another sigmoid function to get the values on the third layer and so on and so forth until we get our final prediction, $\hat y$
- this is the feedforward process that the neural networks use to obtain the prediction from the input vector
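
- for any number of layers, the process can be sketched as a loop; the helper name `feedforward`, the layer sizes and the random weights below are assumptions made for illustration

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def feedforward(x, weight_matrices):
    """Apply each layer in turn: append the bias unit, multiply, apply sigmoid."""
    activation = x
    for W in weight_matrices:
        activation = sigmoid(np.dot(np.append(activation, 1.0), W))
    return activation


rng = np.random.default_rng(0)
# Hypothetical layer sizes: 3 inputs -> 4 -> 3 -> 1 output;
# each matrix gets one extra row for the bias unit
layer_sizes = [3, 4, 3, 1]
weight_matrices = [
    rng.normal(0, 0.5, size=(n_in + 1, n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

y_hat = feedforward(np.array([0.2, -1.0, 0.5]), weight_matrices)
print(y_hat)
```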

## Error Function

- just as before, neural networks will produce an error function, which at the end, is what we'll be minimizing
- our goal is to train our neural network
- in order to do this, we have to define the error function
- let's look again at what the error function was for perceptrons


- here's our perceptron

<img src="resources/feedforward_error_function_perceptron_1.png" width="70%"/>

- on the left, we have our input vector with entries $x_1$ up to $x_n$, and $1$ for the bias unit
- the edges have weights $W_1$ up to $W_n$, and $b$ for the bias unit
- finally, we can see that this perceptron uses a sigmoid function
- the prediction is defined as $\hat y = \sigma(Wx+b)$
- the error function is $E(W) = -\dfrac{1}{m}\sum_{i=1}^{m} \left[ y_i\ln(\hat y_i) + (1-y_i)\ln(1-\hat y_i) \right]$
  - this function gives us a measure of the error of how badly each point is being classified
    - roughly, this is a very small number if the point is correctly classified, and a measure of how far the point is from the line if the point is incorrectly classified

<img src="resources/feedforward_error_function_perceptron_2.png" width="70%"/>
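
- the error function can be sketched in a few lines of NumPy; the labels and the two sets of predictions below are invented to show that confident, correct predictions give a smaller error than wrong ones

```python
import numpy as np


def cross_entropy(y, y_hat):
    """Average of -[y ln(y_hat) + (1 - y) ln(1 - y_hat)] over all points."""
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


y = np.array([1, 0, 1, 0])

# Hypothetical predictions: mostly right vs. mostly wrong
good_predictions = np.array([0.9, 0.1, 0.8, 0.2])
bad_predictions = np.array([0.2, 0.7, 0.3, 0.9])

error_good = cross_entropy(y, good_predictions)
error_bad = cross_entropy(y, bad_predictions)

print(error_good, error_bad)
```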


- what are we going to do to define the error function in a multilayer perceptron?

<img src="resources/feedforward_error_function_multilayer_perceptron_1.png" width="70%"/>

- as we saw, our prediction is simply a combination of matrix multiplications and sigmoid functions
- but the error function can be the exact same thing, right?
  - it can be the exact same formula, except now, $\hat y$ is just a bit more complicated
  - and still, this function will tell us how badly a point gets misclassified
  - except now, it's looking at a more complicated boundary
  
<img src="resources/feedforward_error_function_multilayer_perceptron_2.png" width="70%"/>

# Multilayer Perceptrons

- we saw before with the XOR perceptron that adding a second layer of units allows the model to find solutions to linearly inseparable problems


- imagine an example of a multilayer perceptron, with three input units, one output unit, and two units in the middle
  - this middle layer is called the hidden layer
  - calculating the output of this network is the same as before, except that now, the activations of the hidden layer are used as the input to the output layer
  - the input to the hidden layer is the same as before
    - it's these weights times the input values plus some bias term
  - and as before, again, you use an activation function such as a sigmoid to calculate the output of the hidden layer
  - the hidden layer activations are passed to the output layer through the second set of weights and again use an activation function to get the output of the network


- stacking more and more layers like this helps the network learn more complex patterns
- this is where deep learning gets its name from, and what makes it so powerful: deep stacks of hidden layers

## Implementing the hidden layer

- Khan Academy's introduction to vectors: https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/vectors/v/vector-introduction-linear-algebra
- Khan Academy's introduction to matrices: https://www.khanacademy.org/math/precalculus-2018/precalc-matrices

### Derivation

- before, we were dealing with only one output node which made the code straightforward
- however now that we have multiple input units and multiple hidden units, the weights between them will require two indices: $w_{ij}$ where $i$ denotes input units and $j$ are the hidden units


- for example, the following image shows our network, with its input units labeled $x_1$, $x_2$, and $x_3$ and its hidden nodes labeled $h_1$ and $h_2$

<img src="resources/network-with-labeled-nodes.png" width="30%"/>

- the lines indicating the weights leading to $h_1$ have been colored differently from those leading to $h_2$ just to make it easier to read
- to index the weights, we take the input unit number for the $_i$ and the hidden unit number for the $_j$ 
  - that gives us $w_{11}$ for the weight leading from $x_1$ to $h_1$, and $w_{12}$ for the weight leading from $x_1$ to $h_2$
- the following image includes all of the weights between the input layer and the hidden layer, labeled with their appropriate $w_{ij}$ indices:

<img src="resources/network-with-labeled-weights.png" width="30%"/>

- before, we were able to write the weights as an array, indexed as $w_i$
- but now, the weights need to be stored in a matrix, indexed as $w_{ij}$
  - each row in the matrix will correspond to the weights leading out of a single input unit
  - each column will correspond to the weights leading in to a single hidden unit
- for our three input units and two hidden units, the weights matrix looks like this:

<img src="resources/multilayer-diagram-weights.png" width="40%"/>

- be sure to compare the matrix above with the diagram shown before it so you can see where the different weights in the network end up in the matrix

- to initialize these weights in NumPy, we have to provide the shape of the matrix
- if features is a 2D array containing the input data:

```python
# Number of records and input units
n_records, n_inputs = features.shape
# Number of hidden units
n_hidden = 2
weights_input_to_hidden = np.random.normal(0, n_inputs**-0.5, size=(n_inputs, n_hidden))
```

- this creates a 2D array (i.e. a matrix) named `weights_input_to_hidden` with dimensions `n_inputs` by `n_hidden`

- remember how the input to a hidden unit is the sum of all the inputs multiplied by the hidden unit's weights
- so for each hidden layer unit, $h_j$, we need to calculate the following: $h_j = \sum_{i} w_{ij}x_i$
  - to do that, we now need to use matrix multiplication
  

- in this case, we're multiplying the inputs (a row vector here) by the weights
- to do this, you take the dot (inner) product of the inputs with each column in the weights matrix
- for example, to calculate the input to the first hidden unit, $j = 1$, you'd take the dot product of the inputs with the first column of the weights matrix, like so:

<img src="resources/input-times-weights.png" width="40%"/>

- calculating the input to the first hidden unit with the first column of the weights matrix: $h_1 = x_1w_{11} + x_2w_{21} + x_3w_{31}$
- and for the second hidden layer input, you calculate the dot product of the inputs with the second column, and so on and so forth

- in NumPy, you can do this for all the inputs and all the outputs at once using `np.dot`

```python
hidden_inputs = np.dot(inputs, weights_input_to_hidden)
```
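
- to see that `np.dot` really computes $h_j = \sum_{i} w_{ij}x_i$, the sketch below compares it with explicit loops; the input vector and the weights are random values chosen only for this comparison

```python
import numpy as np

rng = np.random.default_rng(42)
n_inputs, n_hidden = 3, 2

# Hypothetical input vector and weights, just to compare the two computations
inputs = rng.normal(size=n_inputs)
weights_input_to_hidden = rng.normal(0, n_inputs**-0.5, size=(n_inputs, n_hidden))

# h_j = sum_i w_ij * x_i, written out with explicit loops
hidden_inputs_loop = np.zeros(n_hidden)
for j in range(n_hidden):
    for i in range(n_inputs):
        hidden_inputs_loop[j] += weights_input_to_hidden[i, j] * inputs[i]

# The same computation in one call
hidden_inputs = np.dot(inputs, weights_input_to_hidden)

print(np.allclose(hidden_inputs_loop, hidden_inputs))
```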

- you could also define your weights matrix such that it has dimensions `n_hidden` by `n_inputs` then multiply like so where the inputs form a column vector:

<img src="resources/inputs-matrix.png" width="40%"/>

**Note**
- the weight indices have changed in the above image and no longer match up with the labels used in the earlier diagrams
- that's because, in matrix notation, the row index always precedes the column index, so it would be misleading to label them the way we did in the neural net diagram
- just keep in mind that this is the same weight matrix as before, but rotated so the first column is now the first row, and the second column is now the second row
- if we were to use the labels from the earlier diagram, the weights would fit into the matrix in the following locations:

<img src="resources/weight-label-reference.gif" width="20%"/>

- remember, the above is not a correct view of the indices, but it uses the labels from the earlier neural net diagrams to show you where each weight ends up in the matrix

- the important thing with matrix multiplication is that the dimensions match
- for matrix multiplication to work, there has to be the same number of elements in the dot products
- in the first example, there are three columns in the input vector, and three rows in the weights matrix
- in the second example, there are three columns in the weights matrix and three rows in the input vector
- if the dimensions don't match, you'll get this:

```python
# Same weights and features as above, but swapped the order
hidden_inputs = np.dot(weights_input_to_hidden, features)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-1bfa0f615c45> in <module>()
----> 1 hidden_in = np.dot(weights_input_to_hidden, X)

ValueError: shapes (3,2) and (3,) not aligned: 2 (dim 1) != 3 (dim 0)
```

- the dot product can't be computed for a 3x2 matrix and 3-element array
- that's because the 2 columns in the matrix don't match the number of elements in the array
- some of the dimensions that could work would be the following:

<img src="resources/matrix-mult-3.png" width="30%"/>

- the rule is that if you're multiplying an array from the left, the array must have the same number of elements as there are rows in the matrix
- if you're multiplying the matrix from the left, the number of columns in the matrix must equal the number of elements in the array on the right

### Making a column vector

- you see above that sometimes you'll want a column vector, even though by default NumPy arrays work like row vectors
- it's possible to get the transpose of an array with `arr.T`, but for a 1D array the transpose returns the same 1D array, which still behaves like a row vector
- instead, use `arr[:,None]` to create a column vector:

```python
print(features)
> array([ 0.49671415, -0.1382643 ,  0.64768854])

print(features.T)
> array([ 0.49671415, -0.1382643 ,  0.64768854])

print(features[:, None])
> array([[ 0.49671415],
       [-0.1382643 ],
       [ 0.64768854]])
```

- alternatively, you can create arrays with two dimensions
- then, you can use `arr.T` to get the column vector

```python
np.array(features, ndmin=2)
> array([[ 0.49671415, -0.1382643 ,  0.64768854]])

np.array(features, ndmin=2).T
> array([[ 0.49671415],
       [-0.1382643 ],
       [ 0.64768854]])
```
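
- with a column vector in hand, the two layouts described earlier (inputs times an `n_inputs` by `n_hidden` matrix, or an `n_hidden` by `n_inputs` matrix times a column of inputs) can be compared directly; this is a small sketch with random values, and the name `weights_hidden_by_input` is made up for the transposed matrix

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=3)                         # 1D array with 3 elements
weights_input_to_hidden = rng.normal(size=(3, 2))   # n_inputs by n_hidden

# Layout 1: row vector times a (3, 2) matrix gives 2 values
row_result = np.dot(inputs, weights_input_to_hidden)

# Layout 2: (2, 3) matrix times a (3, 1) column vector gives a (2, 1) column
weights_hidden_by_input = weights_input_to_hidden.T
column_result = np.dot(weights_hidden_by_input, inputs[:, None])

print(row_result.shape, column_result.shape)
print(np.allclose(row_result, column_result.ravel()))
```

- both layouts produce the same two hidden-unit inputs; only the shape of the result differs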

### Programming quiz

- below, implement a forward pass through a 4x3x2 network, with sigmoid activation functions for both layers
- things to do:
  - calculate the input to the hidden layer
  - calculate the hidden layer output
  - calculate the input to the output layer
  - calculate the output of the network

In [5]:
import numpy as np


def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))


# Network size
N_input = 4
N_hidden = 3
N_output = 2

np.random.seed(42)
# Make some fake data
X = np.random.randn(4)

weights_input_to_hidden = np.random.normal(0, scale=0.1, size=(N_input, N_hidden))
weights_hidden_to_output = np.random.normal(0, scale=0.1, size=(N_hidden, N_output))


# TODO: Make a forward pass through the network

hidden_layer_in = np.dot(X, weights_input_to_hidden)
hidden_layer_out = sigmoid(hidden_layer_in)

print("Hidden-layer Output:")
print(hidden_layer_out)

output_layer_in = np.dot(hidden_layer_out, weights_hidden_to_output)
output_layer_out = sigmoid(output_layer_in)

print("Output-layer Output:")
print(output_layer_out)

Hidden-layer Output:
[0.41492192 0.42604313 0.5002434 ]
Output-layer Output:
[0.49815196 0.48539772]



# Backpropagation

- we're ready to get our hands into training a neural network
- for this, we'll use the method known as **backpropagation**
- in a nutshell, backpropagation will consist of:
  - doing a feedforward operation
  - comparing the output of the model with the desired output
  - calculating the error
  - running the feedforward operation backwards (backpropagation) to spread the error to each of the weights
  - using this to update the weights and get a better model
  - continuing this until we have a model that is good

- before that, let's quickly recall feedforward
  - we have our perceptron with a point coming in labeled positive
  - and our equation $w_1x_1 + w_2x_2 + b$, where $w_1$ and $w_2$ are the weights and $b$ is the bias
  - what the perceptron does is, it plots a point and returns a probability that the point is blue
    - which in this case is small since the point is in the red area
    - thus, this is a bad perceptron since it predicts that the point is red when the point is really blue


- let's recall what we did in the gradient descent algorithm
  - we did this thing called Backpropagation
    - we went in the opposite direction
    - we asked the point, "What do you want the model to do for you?"
    - and the point says, "Well, I am misclassified so I want this boundary to come closer to me."
    - and we saw that the line got closer to it by updating the weights
  - namely, in this case, let's say that it tells the weight $w_1$ to go lower and the weight $w_2$ to go higher
    - this is just an illustration, it's not meant to be exact
    - we obtain new weights, $w_1'$ and $w_2'$ which define a new line which is now closer to the point
    - what we're doing is like descending from Mt. Errorest, right?
    - the height is going to be the error function $E(W)$, and we calculate the negative gradient of the error function, $-\nabla E$, which is exactly like asking the point what it wants the model to do
    - as we take the step down the direction of the negative of the gradient, we decrease the error to come down the mountain
    - this gives us a new error, $E(W')$ and a new model $W'$ with a smaller error, which means we get a new line closer to the point
    - we continue doing this process in order to minimize the error
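
- one such step can be sketched for a single sigmoid perceptron; the point, the starting weights and the learning rate are made-up values, and the gradient formula used is the standard result for a sigmoid unit with the cross-entropy error

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def error(w, b, x, y):
    y_hat = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


# Hypothetical setup: a blue point (y = 1) that the model gives a low probability
x = np.array([1.0, 2.0])
y = 1
w = np.array([-0.5, -0.3])
b = 0.0
learning_rate = 0.1

y_hat = sigmoid(np.dot(w, x) + b)

# For a sigmoid unit with cross-entropy error:
# dE/dw_i = (y_hat - y) * x_i and dE/db = (y_hat - y)
grad_w = (y_hat - y) * x
grad_b = y_hat - y

# Step in the direction of the negative gradient
w_new = w - learning_rate * grad_w
b_new = b - learning_rate * grad_b

error_before = error(w, b, x, y)
error_after = error(w_new, b_new, x, y)
print(error_before, error_after)
```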


- that was for a single perceptron; what do we do for multi-layer perceptrons?
  - we still do the same process of reducing the error by descending from the mountain, except now, since the error function is more complicated, it's not Mt. Errorest anymore; it's Mt. Kilimanjerror
  - but same thing, we calculate the error function and its gradient
  - we then walk in the direction of the negative of the gradient $-\nabla E$ in order to find a new model $W'$ with a smaller error $E(W')$ which will give us a better prediction
  - we continue doing this process in order to minimize the error


- let's look again at what feedforward does in a multi-layer perceptron
  - the point comes in with coordinates $(x_1, x_2)$ and label $y = 1$
  - it gets plotted in the linear models corresponding to the hidden layer
  - then, as this layer gets combined the point gets plotted in the resulting non-linear model in the output layer
  - the probability that the point is blue is obtained by the position of this point in the final model


- now, pay close attention because this is the key for training neural networks, it's **backpropagation**
  - we'll do as before, we'll check the error
  - this model is not good because it predicts that the point will be red when in reality the point is blue
  - so we'll ask the point, "What do you want this model to do in order for you to be better classified?"
    - the point says, "I kind of want this blue region to come closer to me."
  - what does it mean for the region to come closer to it?
    - let's look at the two linear models in the hidden layer
    - it seems like the top one is badly misclassifying the point whereas the bottom one is classifying it correctly
    - we kind of want to listen to the bottom one more and to the top one less
    - what we want to do is to reduce the weight coming from the top model and increase the weight coming from the bottom model
    - now our final model will look a lot more like the bottom model than like the top model
  - we can do even more; we can actually go to the linear models and ask the point, "What can these models do to classify you better?"
    - the point will say, "Well, the top model is misclassifying me, so I kind of want this line to move closer to me
    - the second model is correctly classifying me, so I want this line to move farther away from me."
    - this change in the model will actually update the weights
    - let's say, it'll increase these two and decrease these two
  - now after we update all the weights we have better predictions at all the models in the hidden layer and also a better prediction at the model in the output layer
- notice that in this example we intentionally left out the bias unit for clarity
- in reality, when we update the weights, we're also updating the bias unit

<img src="resources/backpropagation_model_example.png" width="70%"/>

## Backpropagation Math

- now we'll do the same thing as we did before, updating our weights in the neural network to better classify our points
- but we're going to do it formally

<img src="resources/backpropagation_math_simple_perceptron.png" width="70%"/>

- on your left, you have a single perceptron with the input vector, the weights and the bias and the sigmoid function inside the node
- on the right, we have a formula for the prediction, which is the sigmoid function of the linear function of the input
- below, we have a formula for the error, which is the average over all points of the blue term for the blue points and the red term for the red points
- in order to descend from Mount Errorest, we calculate the gradient, which is simply the vector formed by all the partial derivatives of the error function with respect to the weights $w_1$ up to $w_n$ and the bias $b$


- what do we do in a multilayer perceptron?

<img src="resources/backpropagation_math_multilayer_perceptron.png" width="70%"/>

**Note**
- in the image, the edges of the last layer should point to the sigmoid function and not to the bias; as drawn, they point to the bias, which is incorrect


- this time it's a little more complicated but it's pretty much the same thing
- we have our prediction, which is simply a composition of functions namely matrix multiplications and sigmoids
- the error function is pretty much the same, except the $\hat y$ is a bit more complicated
- the gradient is pretty much the same thing, it's just much, much longer
  - it's a huge vector where each entry is a partial derivative of the error with respect to each of the weights
    - these just correspond to all the edges


- if we want to write this more formally, we recall that the prediction is a composition of sigmoids and matrix multiplications, where these are the matrices and the gradient is just going to be formed by all these partial derivatives

<img src="resources/backpropagation_math_multilayer_perceptron_more_formally.png" width="70%"/>

- here it looks like a matrix but in reality, it's just a long vector
- gradient descent is going to do the following:
  - we take each weight $W_{ij}^{(k)}$ and update it by subtracting a small number: the learning rate $\alpha$ times the partial derivative of $E$ with respect to that same weight; this is the gradient descent step, and it gives us the new updated weight $W_{ij}'^{(k)} \leftarrow W_{ij}^{(k)} - \alpha \dfrac{\partial E}{\partial W_{ij}^{(k)}}$
- that step is going to give us a whole new model with new weights that will classify the point much better
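
- to make the update rule concrete, the sketch below estimates every partial derivative with finite differences and then applies the step to all the weight matrices; finite differences are used here only as a simple stand-in for backpropagation, and the network size, weights, data point and learning rate are all assumptions

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def predict(x, weights):
    a = x
    for W in weights:
        a = sigmoid(np.dot(np.append(a, 1.0), W))
    return a[0]


def error(x, y, weights):
    y_hat = predict(x, weights)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


def numerical_gradient(x, y, weights, eps=1e-6):
    """Estimate dE/dW_ij^(k) for every weight by central differences."""
    grads = [np.zeros_like(W) for W in weights]
    for k, W in enumerate(weights):
        for idx in np.ndindex(W.shape):
            original = W[idx]
            W[idx] = original + eps
            e_plus = error(x, y, weights)
            W[idx] = original - eps
            e_minus = error(x, y, weights)
            W[idx] = original  # restore the weight
            grads[k][idx] = (e_plus - e_minus) / (2 * eps)
    return grads


rng = np.random.default_rng(1)
# Hypothetical 2-2-1 network; the last row of each matrix is the bias
weights = [rng.normal(0, 0.5, size=(3, 2)), rng.normal(0, 0.5, size=(3, 1))]
x, y = np.array([1.0, 2.0]), 1
alpha = 0.1

error_before = error(x, y, weights)
grads = numerical_gradient(x, y, weights)
# W_ij^(k) <- W_ij^(k) - alpha * dE/dW_ij^(k), for every layer k
weights = [W - alpha * g for W, g in zip(weights, grads)]
error_after = error(x, y, weights)

print(error_before, error_after)
```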

### Chain Rule

- before we start calculating derivatives, let's do a refresher on the chain rule which is the main technique we'll use to calculate them
- say you have a variable $x$ and a function $f$ that you apply to $x$ to get $f(x)$, which we're gonna call $A$, and then another function $g$, which you apply to $f(x)$ to get $g \circ f(x)$, which we're gonna call $B$
- the chain rule says, if you want to find the partial derivative of $B$ with respect to $x$, that's just the partial derivative of $B$ with respect to $A$ times the partial derivative of $A$ with respect to $x$
  - $ \dfrac{\partial B}{\partial x} = \dfrac{\partial B}{\partial A} \dfrac{\partial A}{\partial x} $
  - it literally says, when composing functions, that derivatives just multiply
    - that's gonna be super useful for us because feedforward is literally composing a bunch of functions
    - and backpropagation is literally taking the derivative at each piece
  - since taking the derivative of a composition is the same as multiplying the partial derivatives, then all we're gonna do is multiply a bunch of partial derivatives to get what we want
    
<img src="resources/chain_rule.png" width="40%"/>    
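
- the chain rule is easy to check numerically; in this sketch $f$ and $g$ are arbitrary choices (a square and a sigmoid) made only for illustration

```python
import numpy as np


def f(x):
    return x ** 2                  # A = f(x)


def g(a):
    return 1 / (1 + np.exp(-a))    # B = g(A), here a sigmoid


x = 0.7
A = f(x)
B = g(A)

dA_dx = 2 * x                      # derivative of x^2 with respect to x
dB_dA = B * (1 - B)                # derivative of the sigmoid, evaluated at A
chain_rule = dB_dA * dA_dx         # dB/dx = dB/dA * dA/dx

# Finite-difference estimate of dB/dx for comparison
eps = 1e-6
numerical = (g(f(x + eps)) - g(f(x - eps))) / (2 * eps)

print(chain_rule, numerical)
```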

- let's go back to our neural network with our weights and our input and recall that the weights with superscript 1 belong to the first layer, and the weights with superscript 2 belong to the second layer
- also, recall that the bias is not called $b$ anymore; now, it is called $W_{31}$, $W_{32}$ etc. for convenience, so that we can have everything in matrix notation


- now what happens with the input?
- let us do the feedforward process
- in the first layer, we take the input and multiply it by the weights and that gives us $h_1$, which is a linear function of the input and the weights
  - same thing with $h_2$, given by this formula over here
- in the second layer, we would take this $h_1$ and $h_2$ and the new bias, apply the sigmoid function, and then apply a linear function to them by multiplying them by the weights and adding them to get a value of $h$
- and finally, in the third layer, we just take a sigmoid function of $h$ to get our prediction or probability between $0$ and $1$, which is $\hat y$

<img src="resources/backpropagation_feedforward.png" width="80%"/>    

- we can read this in more condensed notation by saying that the matrix corresponding to the first layer is $W_1$, the matrix corresponding to the second layer is $W_2$ and then the prediction we had is just going to be the sigmoid of $W_2$ combined with the sigmoid of $W_1$ applied to the input $x$ and that is feedforward


- now, we are going to develop backpropagation, which is precisely the reverse of feedforward

<img src="resources/backpropagation_backpropagation.png" width="80%"/>    

- we are going to calculate the derivative of this error function with respect to each of the weights in the network by using the chain rule
- let's recall that our error function is this formula over here, which is a function of the prediction $\hat y$
- but, since the prediction is a function of all the weights $W_{ij}$, the error function can be seen as a function of all the $W_{ij}$
- therefore, the gradient is simply the vector formed by all the partial derivatives of the error function $E$ with respect to each of the weights
  - let's calculate one of these derivatives
  - let's calculate derivative of $E$ with respect to $W_{11}^{(1)}$
  - since the prediction is simply a composition of functions and by the chain rule, we know that the derivative with respect to this is the product of all the partial derivatives


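- in this case, the derivative of $E$ with respect to $W_{11}^{(1)}$ is the derivative of $E$ with respect to $\hat y$ times the derivative of $\hat y$ with respect to $h$ times the derivative of $h$ with respect to $h_1$ times the derivative of $h_1$ with respect to $W_{11}^{(1)}$: $\dfrac{\partial E}{\partial W_{11}^{(1)}} = \dfrac{\partial E}{\partial \hat y} \dfrac{\partial \hat y}{\partial h} \dfrac{\partial h}{\partial h_1} \dfrac{\partial h_1}{\partial W_{11}^{(1)}}$
  - this may seem complicated, but the fact that we can calculate the derivative of such a complicated composition of functions by just multiplying 4 partial derivatives is remarkable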
- we have already handled the first part: for a sigmoid output with this error function, the derivative of $E$ with respect to $\hat y$ times the derivative of $\hat y$ with respect to $h$ simplifies to $\hat y - y$
- let's calculate the other ones

<img src="resources/backpropagation_error_derivative_subset.png" width="80%"/>

- let's zoom in a bit and look at just one piece of our multi-layer perceptron
- the inputs are some values $h_1$ and $h_2$, which are values coming in from before
- once we apply the sigmoid and a linear function on $h_1$ and $h_2$ and $1$ corresponding to the bias unit, we get a result $h$
  - what is the derivative of $h$ with respect to $h_1$?
  - $h$ is a sum of three things and only one of them contains $h_1$
  - so, the second and the third summands just give us a derivative of $0$
  - the first summand gives us $W_{11}^{(2)}$, because that is a constant, times the derivative of the sigmoid function with respect to $h_1$
  - this is something that we calculate below: the sigmoid function has a beautiful derivative, namely $\sigma'(h) = \sigma(h)(1 - \sigma(h))$, so $\dfrac{\partial h}{\partial h_1} = W_{11}^{(2)} \sigma(h_1)(1 - \sigma(h_1))$
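
- the product of the four partial derivatives can be checked against a finite-difference estimate; this is a sketch for a 2-2-1 network with made-up weights, where `W1[0, 0]` plays the role of $W_{11}^{(1)}$ and the last row of each weight array holds the bias weights

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def forward(W1, W2, x):
    h1, h2 = np.dot(np.append(x, 1.0), W1)
    h = np.dot(np.array([sigmoid(h1), sigmoid(h2), 1.0]), W2)
    return h1, h2, h, sigmoid(h)


def error(W1, W2, x, y):
    y_hat = forward(W1, W2, x)[3]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


# Hypothetical weights and data point (assumed values)
W1 = np.array([[0.5, -0.3], [0.2, 0.8], [-0.1, 0.4]])
W2 = np.array([0.7, -0.6, 0.1])
x, y = np.array([1.0, 2.0]), 1

h1, h2, h, y_hat = forward(W1, W2, x)

# The four factors of the chain rule
dE_dyhat = (y_hat - y) / (y_hat * (1 - y_hat))
dyhat_dh = y_hat * (1 - y_hat)           # first two factors multiply to y_hat - y
dh_dh1 = W2[0] * sigmoid(h1) * (1 - sigmoid(h1))
dh1_dW11 = x[0]
chain_rule = dE_dyhat * dyhat_dh * dh_dh1 * dh1_dW11

# Finite-difference check on W_11^(1)
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numerical = (error(W1_plus, W2, x, y) - error(W1_minus, W2, x, y)) / (2 * eps)

print(chain_rule, numerical)
```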

### Calculation of the derivative of the sigmoid function


- recall that the sigmoid function has a beautiful derivative, which we can see in the following calculation
- this will make our backpropagation step much cleaner

<img src="resources/gradient_descent_sigmoid_quotient_formula.gif"/>
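
- the identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ can also be sanity-checked numerically; the helper name `sigmoid_prime` and the sample points are choices made for this sketch

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoid_prime(x):
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    return sigmoid(x) * (1 - sigmoid(x))


# Compare the formula with a central finite-difference estimate
xs = np.linspace(-5, 5, 11)
eps = 1e-6
numerical = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2 * eps)

print(np.allclose(sigmoid_prime(xs), numerical))
print(sigmoid_prime(0.0))
```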

## Further reading

- backpropagation is fundamental to deep learning
- TensorFlow and other libraries will perform the backprop for you, but you should really understand the algorithm
- we'll be going over backprop again, but here are some extra resources for you:
  - from Andrej Karpathy: Yes, you should understand backprop: https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b#.vt3ax2kg9
  - also from Andrej Karpathy, a lecture from Stanford's CS231n course: https://www.youtube.com/watch?v=59Hbtz7XgjM