# Feedforward data
Let's talk about the data we will work with and how to handle it.

#### Recap, aka what we should know:
* multi-class error function:
    * using cross-entropy: $-\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^m y_{ij} \ln(\hat y_{ij})$
    * squared errors (SSE): $E = \frac{1}{2}(y -\hat y)^2$
* error function (using cross-entropy): $-\frac{1}{n}\sum_{i=1}^n \left[y_i\ln(\hat y_i) + (1 - y_i)\ln(1 -\hat y_i)\right]$
* error value: $-(y\ln(\hat y) + (1 - y)\ln(1 -\hat y))$
* expected output: $y_i = 0$ or $1$
* predicted output: probability $\hat y_j = \sigma(h_j)$
* activation function (may be):
    * sigmoid (popular): $\sigma(h_j) = \frac{1}{1 + e^{-h_j}}$
    * softmax: $\frac{e^{h_j}}{\sum_{i=1}^n e^{h_i}}$
    * other.
* Linear function scores: $h_j = \sum_i W_{ji}\cdot x_i + b_j$
* Input: $X = (x_1, \dots, x_n)$
* Model:
    * Weights: $W = (W_1, \dots, W_n)$ and $W_j = (w_{j1}, \dots, w_{jn})$
    * Bias: $b = (b_1, \dots, b_n)$
* The symbol for the learning rate is $\alpha$

Let's get it all together now!

$E(W,b) = -\frac{1}{n}\sum_{i=1}^n \left[y_i\ln(\sigma(Wx^{(i)}+b)) + (1 - y_i)\ln(1 - \sigma(Wx^{(i)}+b))\right]$
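
As a rough illustration, the binary cross-entropy $E(W,b)$ can be evaluated with NumPy. This is only a sketch: the function names and the toy data are assumptions made for the example, not part of the original notebook.

```python
import numpy as np

def sigmoid(h):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-h))

def binary_cross_entropy(W, b, X, y):
    """Mean binary cross-entropy E(W, b) over the n rows of X."""
    y_hat = sigmoid(X @ W + b)  # predicted probabilities, shape (n,)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical toy data: 4 points with 2 features each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 0.0, 1.0])

# With all-zero weights and bias every prediction is 0.5, so E = ln(2).
E_zero = binary_cross_entropy(np.zeros(2), 0.0, X, y)
print(E_zero)
```

An untrained model that always predicts 0.5 therefore starts at an error of $\ln 2$, which is a handy baseline when checking whether training reduces the error.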

Reminder! For multiple classes the error function looks like this:

$E(W,b) = -\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^m y_{ij} \ln(\hat y_{ij})$, where $\hat y_{ij}$ is the $j$-th component of $\sigma(Wx^{(i)}+b)$
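
The multi-class version can be sketched in the same way. The example below assumes softmax as the activation (one of the options listed in the recap) and one-hot encoded targets; the scores and labels are made up for illustration.

```python
import numpy as np

def softmax(H):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = H - H.max(axis=1, keepdims=True)
    exp_H = np.exp(shifted)
    return exp_H / exp_H.sum(axis=1, keepdims=True)

def multiclass_cross_entropy(Y, Y_hat):
    """Y is one-hot with shape (n, m); Y_hat holds predicted probabilities."""
    return -np.mean(np.sum(Y * np.log(Y_hat), axis=1))

# Hypothetical linear scores for n = 2 points and m = 3 classes.
H = np.array([[2.0, 1.0, 0.1],
              [0.0, 0.0, 0.0]])
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

Y_hat = softmax(H)                       # each row sums to 1
E_multi = multiclass_cross_entropy(Y, Y_hat)
print(Y_hat)
print(E_multi)
```

Because the targets are one-hot, only the predicted probability of the correct class contributes to the error of each point.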

## Non-linear data
Until now we have dealt with linear data (regions). In other words, we can draw a straight line that correctly divides the data into groups (regions). Look at the image below:

![title](linear_regions.png)

The classic (Rosenblatt) perceptron works just fine for that kind of data. Unfortunately, it has problems with non-linear data.

### What are non-linear data?
It is data that cannot be linearly separated into regions. In other words, we cannot draw a straight line that correctly divides the data into groups (regions). Look at the image below:

![title](non-linear_regions.png)

### How to handle non-linear data?
We can take the non-linear data and treat it as linear data several times! Let's say we apply our perceptron for linear data twice, and we get:

![title](linear_non-linear_regions.png)

As we can see, these two models give us two different outputs.

#### How to combine two numbers?

The simplest way is to add them.

![title](non-linear_regions_join_process_start.png)

But the sum is higher than 1, and a probability cannot be higher than 1! Nothing simpler: just apply the sigmoid function!

![title](non-linear_regions_join_process_end.png)

* Note: we apply all of these steps to each point: calculate the initial probabilities, add them, and apply the sigmoid function to the sum.

#### We can also add weights to these inputs!
![title](non-linear_regions_join_process_weights.png)
* Note: (I heard you like perceptrons...) as we can see, the perceptrons turn out to be the inputs to a bigger perceptron. Perceptrons in a perceptron!
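
The combination step can be sketched in a few lines. The two probabilities, the weights, and the bias below are hypothetical values chosen for illustration; they are not read from the figures.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

# Hypothetical outputs of two linear perceptrons for the same point.
p1, p2 = 0.7, 0.8

# The plain sum (1.5) is larger than 1, so we squash it with the sigmoid.
combined = sigmoid(p1 + p2)

# Weighted version: hypothetical weights and bias of the "bigger" perceptron.
w1, w2, bias = 7.0, 5.0, -6.0
combined_weighted = sigmoid(w1 * p1 + w2 * p2 + bias)

print(combined, combined_weighted)
```

Both results land strictly between 0 and 1, so they can again be read as probabilities.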

We may represent it in a different way:
![title](neural_network_example_idea.png)

#### And this is what we call deep learning!

## What is Feedforward
OK, now that we know how to handle non-linear data, let's talk about feedforward.

Feedforward is the process neural networks use to turn the input into an output. Let's study it more carefully, before we dive into how to train the networks.

It is the same process as the one we described above. The trick is to use many perceptrons to create the input for the perceptrons in the next layer. And that's it!

Mathematically speaking, it looks like this:

$\hat y = \sigma(W_2\cdot(\sigma (W_1\cdot X + b_1))+ b_2)$

where:
* $\hat y$ is the predicted output,
* $\sigma$ is the activation function,
* $W_1$ is the matrix of weights for the first layer,
* $b_1$ is the bias for the first layer,
* $W_2$ is the matrix of weights for the second layer,
* $b_2$ is the bias for the second layer,
* $X$ is the vector of our inputs.

To simplify, let's fold the weights and bias of each layer into a single letter $W$ (with a constant 1 appended to each layer's input):

$\hat y = \sigma(W_2\cdot(\sigma (W_1\cdot X)))$
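
A small NumPy sketch can make the two notations concrete. The layer sizes and the random weights below are assumptions made for the example; the point is only that the explicit-bias form and the folded form give the same prediction.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

rng = np.random.default_rng(0)

# Hypothetical sizes: 2 inputs, 2 hidden units, 1 output.
W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=2)
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)
x = np.array([0.5, -1.0])

# Explicit-bias form: y_hat = sigma(W2 . sigma(W1 . x + b1) + b2)
hidden = sigmoid(W1 @ x + b1)
y_hat = sigmoid(W2 @ hidden + b2)

# Folded form: the bias becomes an extra column of the weight matrix
# and a constant 1 is appended to the input of each layer.
W1_aug = np.hstack([W1, b1[:, None]])  # shape (2, 3)
W2_aug = np.hstack([W2, b2[:, None]])  # shape (1, 3)
hidden_aug = sigmoid(W1_aug @ np.append(x, 1.0))
y_hat_aug = sigmoid(W2_aug @ np.append(hidden_aug, 1.0))

print(y_hat, y_hat_aug)
```

Folding the bias in this way is purely a notational convenience; the network computes exactly the same function.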

## Feedforward Error Function
It is the same as usual:

$-\frac{1}{n}\sum_{i=1}^n \left[y_i\ln(\hat y_i) + (1 - y_i)\ln(1 -\hat y_i)\right]$

We only need to remember that the prediction $\hat y$ here looks like this:

$\hat y = \sigma(W_2\cdot(\sigma (W_1\cdot X)))$

and not like this:

$\hat y_i = \sigma(h_i)$

In both cases, $\sigma$ and $h_i$ look like this:

* $\sigma(h_i) = \frac{1}{1 + e^{-h_i}}$
* $h_i = W_i\cdot x$
* $W_i = (w_{i0}, \dots, w_{in}, bias_i)$

#### Recap:
* prediction: $\hat y = \sigma(W_2\cdot(\sigma (W_1\cdot X)))$
    * activation function: $\sigma(h_j) = \frac{1}{1 + e^{-h_j}}$
    * Linear function scores: $h_j = \sum_i W_{ji}\cdot x_i + b_j$
    * Input: $X = (x_1, \dots, x_n)$
    * Model:
        * Weights: $W = (W_1, \dots, W_n)$ and $W_j = (w_{j1}, \dots, w_{jn})$
        * Bias: $b = (b_1, \dots, b_n)$
    * The symbol for the learning rate is $\alpha$
* multi-class error function (using cross-entropy): $-\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^m y_{ij} \ln(\hat y_{ij})$
    * error function (using cross-entropy): $-\frac{1}{n}\sum_{i=1}^n \left[y_i\ln(\hat y_i) + (1 - y_i)\ln(1 -\hat y_i)\right]$
* Gradient of error function: $\nabla E =(\dots, \frac{\partial E}{\partial w_j^{(i)}}, \dots)$
    * $\frac{\partial}{\partial w_j} E = -(y - \hat y)x_j$

Under these assumptions, feedforward may look like this:
![title](Feedforward.png)

## Backpropagation
Now we're ready to get our hands dirty training a neural network. For this, we'll use the method known as backpropagation. In a nutshell, backpropagation consists of:
1. Doing a feedforward operation.
2. Comparing the output of the model with the desired output.
3. Calculating the error.
4. Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
5. Using this to update the weights and get a better model.
6. Continuing this until we have a model that is good.

It sounds more complicated than it actually is. The first step is to look at the conceptual interpretation of backpropagation.

### Running the feedforward operation backwards (backpropagation)
After we feedforward (run) our neural network and get an output (and decide it's not good enough), we go back (aka run feedforward backwards or, more simply, run backpropagation).

We go to each layer and see what we can do to increase the accuracy of the output (tip: change the weights and biases). We calculated the error during feedforward, so we can use it to change the weights and biases! We repeat the process until we reach the input layer. And that's it!

More mathematically speaking, we need to calculate the derivative of the error function, which in turn requires the derivative of the sigmoid function. For a one-layer perceptron we had:
* prediction: $\hat y_j = \sigma(W_j\cdot x +b_j)$
* error function: $-\frac{1}{n}\sum_{i=1}^n \left[y_i\ln(\hat y_i) + (1 - y_i)\ln(1 -\hat y_i)\right]$
* Gradient of error function: $\nabla E =\left(\frac{\partial}{\partial w_1}E, \cdots, \frac{\partial}{\partial w_n}E, \frac{\partial}{\partial b}E \right)$
    * $\frac{\partial}{\partial w_j} E = -(y - \hat y)x_j$
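
The formula $\frac{\partial}{\partial w_j} E = -(y - \hat y)x_j$ for the one-layer case can be checked numerically. The sketch below compares it with a central finite difference for a single data point; the weights, input, and target are hypothetical values.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def error(w, b, x, y):
    """Cross-entropy error of a one-layer perceptron for a single point."""
    y_hat = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical weights, bias, input and target.
w = np.array([0.3, -0.2])
b = 0.1
x = np.array([1.5, -0.5])
y = 1.0

# Analytic gradient from the formula above.
y_hat = sigmoid(np.dot(w, x) + b)
analytic_grad = -(y - y_hat) * x

# Numerical gradient: nudge one weight at a time by a small eps.
eps = 1e-6
numeric_grad = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    numeric_grad[j] = (error(w + step, b, x, y) - error(w - step, b, x, y)) / (2 * eps)

print(analytic_grad, numeric_grad)
```

The two gradients should agree to several decimal places, which is a quick sanity check worth running whenever a gradient is derived by hand.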
    
For a multi-layer perceptron we will have:
* prediction: $\hat y = \sigma(W_2\cdot(\sigma (W_1\cdot X)))$
    * $\sigma(h_i) = \frac{1}{1 + e^{-h_i}}$
    * $h_j = W_j\cdot x$
    * $W_j = (w_{j0}, \dots, w_{jn}, bias_j)$
* error function: $-\frac{1}{n}\sum_{i=1}^n \left[y_i\ln(\hat y_i) + (1 - y_i)\ln(1 -\hat y_i)\right]$
* Gradient of error function: $\nabla E =(\dots, \frac{\partial E}{\partial w_j^{(i)}}, \dots)$
    * $\frac{\partial}{\partial w_j} E = -(y - \hat y)x_j$ for the weights of the output layer, where $x_j$ is the $j$-th input to that layer; for earlier layers the chain rule adds further factors
    
Example:
* if $W_1 = \begin{bmatrix}w_{11}^{(1)} & w_{12}^{(1)} \\w_{21}^{(1)} & w_{22}^{(1)} \\w_{31}^{(1)} & w_{32}^{(1)} \end{bmatrix}$ and $W_2 = \begin{bmatrix}w_{11}^{(2)} \\w_{21}^{(2)} \\w_{31}^{(2)} \end{bmatrix}$, then $\nabla E = \begin{bmatrix}\frac{\partial E}{\partial w_{11}^{(1)}} & \frac{\partial E}{\partial w_{12}^{(1)}} & \frac{\partial E}{\partial w_{11}^{(2)}} \\ \frac{\partial E}{\partial w_{21}^{(1)}} & \frac{\partial E}{\partial w_{22}^{(1)}} & \frac{\partial E}{\partial w_{21}^{(2)}} \\ \frac{\partial E}{\partial w_{31}^{(1)}} & \frac{\partial E}{\partial w_{32}^{(1)}} & \frac{\partial E}{\partial w_{31}^{(2)}}\end{bmatrix}$, and each weight is updated as: ${w'}_{ij}^{(k)} \leftarrow w_{ij}^{(k)} - \alpha\cdot\frac{\partial E}{\partial w_{ij}^{(k)}}$
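
To make the example more tangible, here is a minimal backpropagation sketch for a network with these shapes: $W_1$ is 3x2 and $W_2$ is 3x1. It assumes the last row of each matrix holds the bias weights, a sigmoid activation in both layers, and the cross-entropy error for a single point. The function names and the random numbers are hypothetical.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def forward(W1, W2, x):
    """Feedforward pass; the last row of each matrix acts as the bias."""
    x_aug = np.append(x, 1.0)        # inputs plus constant 1, shape (3,)
    a = sigmoid(x_aug @ W1)          # hidden activations, shape (2,)
    a_aug = np.append(a, 1.0)        # hidden activations plus 1, shape (3,)
    y_hat = sigmoid(a_aug @ W2)[0]   # scalar prediction
    return x_aug, a, a_aug, y_hat

def loss(W1, W2, x, y):
    """Cross-entropy error for a single point."""
    y_hat = forward(W1, W2, x)[3]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def backprop(W1, W2, x, y):
    """Gradients of the error with respect to W1 and W2 (chain rule)."""
    x_aug, a, a_aug, y_hat = forward(W1, W2, x)
    delta_out = -(y - y_hat)                            # dE/dh at the output
    grad_W2 = (a_aug * delta_out)[:, None]              # shape (3, 1)
    delta_hidden = W2[:2, 0] * delta_out * a * (1 - a)  # dE/dh_1, dE/dh_2
    grad_W1 = np.outer(x_aug, delta_hidden)             # shape (3, 2)
    return grad_W1, grad_W2

rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(3, 1))
x, y, alpha = np.array([0.4, -0.7]), 1.0, 0.1

# One gradient descent step: w' <- w - alpha * dE/dw for every weight.
grad_W1, grad_W2 = backprop(W1, W2, x, y)
W1_new = W1 - alpha * grad_W1
W2_new = W2 - alpha * grad_W2
```

Each gradient has the same shape as the weight matrix it belongs to, so the update is a simple element-wise subtraction.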

Under these assumptions, backpropagation may look like this:
![title](Backpropagation.png)

Just remember:
* $h = W_{11}^{(2)}\sigma(h_1) + W_{21}^{(2)}\sigma(h_2) + W_{31}^{(2)}$
* $\frac{\partial h}{\partial h_1} = W_{11}^{(2)}\sigma(h_1)(1 - \sigma(h_1))$
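
The derivative $\frac{\partial h}{\partial h_1}$ can also be verified with a quick finite-difference check. The weights and the values of $h_1$ and $h_2$ below are hypothetical; the third weight plays the role of the bias.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

# Hypothetical second-layer weights; w31 acts as the bias.
w11, w21, w31 = 0.8, -0.4, 0.2
h1, h2 = 0.5, -1.2

def h(h1, h2):
    """Linear score of the output unit as a function of h1 and h2."""
    return w11 * sigmoid(h1) + w21 * sigmoid(h2) + w31

# Analytic derivative from the formula above.
analytic = w11 * sigmoid(h1) * (1 - sigmoid(h1))

# Central finite difference with respect to h1.
eps = 1e-6
numeric = (h(h1 + eps, h2) - h(h1 - eps, h2)) / (2 * eps)

print(analytic, numeric)
```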

Time for...

## Implementation example
### Predicting Student Admissions with Neural Networks

In this notebook, we predict student admissions to graduate school at UCLA based on three pieces of data:
* GRE Scores (Test)
* GPA Scores (Grades)
* Class rank (1-4)

The dataset originally came from here: http://www.ats.ucla.edu/

#### Loading the data
To load the data and format it nicely, we will use two very useful packages called Pandas and NumPy. You can read the documentation here:
* https://pandas.pydata.org/pandas-docs/stable/
* https://docs.scipy.org/

In [1]:
# Importing pandas and numpy
import pandas as pd
import numpy as np

# Reading the csv file into a pandas DataFrame
data = pd.read_csv('student_data.csv')

# Printing out the first 10 rows of our data
data[:10]


#### Plotting the data
First, let's make a plot of our data to see how it looks. In order to have a 2D plot, let's ignore the rank.

In [2]:
# Importing matplotlib
import matplotlib.pyplot as plt

# Function to help us plot
def plot_points(data):
    X = np.array(data[["gre","gpa"]])
    y = np.array(data["admit"])
    admitted = X[np.argwhere(y==1)]
    rejected = X[np.argwhere(y==0)]
    plt.scatter([s[0][0] for s in rejected], [s[0][1] for s in rejected], s = 25, color = 'red', edgecolor = 'k')
    plt.scatter([s[0][0] for s in admitted], [s[0][1] for s in admitted], s = 25, color = 'cyan', edgecolor = 'k')
    plt.xlabel('Test (GRE)')
    plt.ylabel('Grades (GPA)')
    
# Plotting the points
plot_points(data)
plt.show()


Roughly, it looks like the students with high scores in the grades and test passed, while the ones with low scores didn't, but the data is not as nicely separable as we hoped it would be. Maybe it would help to take the rank into account? Let's make 4 plots, one for each rank.

In [3]:
# Separating the ranks
data_rank1 = data[data["rank"]==1]
data_rank2 = data[data["rank"]==2]
data_rank3 = data[data["rank"]==3]
data_rank4 = data[data["rank"]==4]

# Plotting the graphs
plot_points(data_rank1)
plt.title("Rank 1")
plt.show()
plot_points(data_rank2)
plt.title("Rank 2")
plt.show()
plot_points(data_rank3)
plt.title("Rank 3")
plt.show()
plot_points(data_rank4)
plt.title("Rank 4")
plt.show()


This looks more promising, as it seems that the lower the rank, the higher the acceptance rate. Let's use the rank as one of our inputs. In order to do this, we should one-hot encode it.

#### One-hot encoding the rank
Use the get_dummies function in Pandas in order to one-hot encode the data.

In [4]:
# Make dummy variables for rank
one_hot_data = pd.concat([data, pd.get_dummies(data['rank'], prefix='rank')], axis=1)

# Drop the previous rank column
one_hot_data = one_hot_data.drop('rank', axis=1)

# Print the first 10 rows of our data
one_hot_data[:10]

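
Since the result depends on the CSV file, a tiny stand-alone example may help to see what one-hot encoding does. The four rows below are hypothetical stand-ins for `student_data.csv`; only `pd.get_dummies`, `pd.concat`, and `drop` are used, as in the cell above.

```python
import pandas as pd

# Hypothetical rows standing in for student_data.csv.
toy = pd.DataFrame({
    "admit": [0, 1, 1, 0],
    "gre": [380, 660, 800, 640],
    "gpa": [3.61, 3.67, 4.00, 3.19],
    "rank": [3, 3, 1, 4],
})

# One column per rank value that occurs in the data.
dummies = pd.get_dummies(toy["rank"], prefix="rank")

# Replace the original rank column with the one-hot columns.
toy_one_hot = pd.concat([toy.drop("rank", axis=1), dummies], axis=1)
print(toy_one_hot.columns.tolist())
```

Note that `get_dummies` only creates columns for values that actually appear, so this toy frame has no `rank_2` column. In each row exactly one of the rank columns is set.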

#### Scaling the data
The next step is to scale the data. We notice that the range for grades is 1.0-4.0, whereas the range for test scores is roughly 200-800, which is much larger. This means our features are on very different scales, which makes it hard for a neural network to handle. Let's fit our two features into a range of 0-1 by dividing the grades by 4.0 and the test scores by 800.

In [5]:
# Making a copy of our data
processed_data = one_hot_data.copy()

# Scale the columns
processed_data['gre'] = processed_data['gre']/800
processed_data['gpa'] = processed_data['gpa']/4.0

# Printing the first 10 rows of our processed data
processed_data[:10]


#### Splitting the data into Training and Testing
In order to test our algorithm, we'll split the data into a Training and a Testing set. The size of the testing set will be 10% of the total data.

In [6]:
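# Randomly pick 90% of the row labels for the training set
sample = np.random.choice(processed_data.index, size=int(len(processed_data)*0.9), replace=False)
# Select by label with .loc so this also works when the index is not 0..n-1
train_data, test_data = processed_data.loc[sample], processed_data.drop(sample)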

print("Number of training samples is", len(train_data))
print("Number of testing samples is", len(test_data))
print(train_data[:10])
print(test_data[:10])

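
The split itself can be tried on a small made-up frame. The sketch below is an illustration under assumptions: the column values are random, the index deliberately does not start at 0, and NumPy's `default_rng` is used instead of `np.random.choice` so the example is reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical frame with 20 rows and an index that does not start at 0.
frame = pd.DataFrame(
    {"gre": rng.integers(200, 801, size=20),
     "admit": rng.integers(0, 2, size=20)},
    index=np.arange(100, 120),
)

# Pick 90% of the index labels for training; the rest is the test set.
train_labels = rng.choice(frame.index.to_numpy(),
                          size=int(len(frame) * 0.9), replace=False)
train = frame.loc[train_labels]   # select rows by label
test = frame.drop(train_labels)   # everything that was not sampled

print(len(train), len(test))
```

Sampling without replacement and dropping the sampled labels guarantees that no row ends up in both sets.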

#### Splitting the data into features and targets (labels)
Now, as a final step before the training, we'll split the data into features (X) and targets (y).

In [7]:
features = train_data.drop('admit', axis=1)
targets = train_data['admit']
features_test = test_data.drop('admit', axis=1)
targets_test = test_data['admit']

print(features[:10])
print(targets[:10])


#### Training the 2-layer Neural Network
The following function trains the 2-layer neural network (an input layer and an output layer, with no hidden layer). First, we'll write some helper functions.

In [8]:
# Activation (sigmoid) function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative of the sigmoid function
def sigmoid_prime(x):
    return sigmoid(x) * (1-sigmoid(x))

# Cross-entropy error for a single prediction
def error_formula(y, output):
    return - y*np.log(output) - (1 - y) * np.log(1-output)

##### Backpropagate the error
Now it's your turn to shine. Write the error term. Remember that it is given by the equation: $(y - \hat y)\sigma'(x)$

In [9]:
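# Error term formula: (y - output) * sigmoid'(h), written in terms of the
# output because sigmoid'(h) = output * (1 - output); x is not used here
def error_term_formula(x, y, output):
    return (y-output) * output * (1 - output)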
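
The error term uses `output * (1 - output)` in place of $\sigma'(x)$. A short check, with a hypothetical score `h` and target `y`, shows that both ways of writing it give the same number.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Hypothetical linear score and target for a single record.
h = 0.3
y = 1.0
output = sigmoid(h)

# Error term written with the output only, as in the cell above.
term_from_output = (y - output) * output * (1 - output)

# Error term written with the sigmoid derivative evaluated at h.
term_from_h = (y - output) * sigmoid_prime(h)

print(term_from_output, term_from_h)
```

Writing the term with the output avoids evaluating the sigmoid a second time.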

In [10]:
# Neural Network hyperparameters
epochs = 1000
learnrate = 0.5

# Training function
def train_nn(features, targets, epochs, learnrate):
    
    # Use the same seed to make debugging easier
    np.random.seed(42)

    n_records, n_features = features.shape
    last_loss = None

    # Initialize weights
    weights = np.random.normal(scale=1 / n_features**.5, size=n_features)

    for e in range(epochs):
        del_w = np.zeros(weights.shape)
        for x, y in zip(features.values, targets):
            # Loop through all records, x is the input, y is the target

            # Activation of the output unit
            #   Notice we multiply the inputs and the weights here 
            #   rather than storing h as a separate variable 
            output = sigmoid(np.dot(x, weights))

            # The cross-entropy error for this record (kept for reference;
            # it is not used in the weight update below)
            error = error_formula(y, output)

            # The error term
            error_term = error_term_formula(x, y, output)

            # The gradient descent step, the error times the gradient times the inputs
            del_w += error_term * x

        # Update the weights here. The learning rate times the 
        # change in weights, divided by the number of records to average
        weights += learnrate * del_w / n_records

        # Printing out the mean square error on the training set
        if e % (epochs / 10) == 0:
            out = sigmoid(np.dot(features, weights))
            loss = np.mean((out - targets) ** 2)
            print("Epoch:", e)
            if last_loss and last_loss < loss:
                print("Train loss: ", loss, "  WARNING - Loss Increasing")
            else:
                print("Train loss: ", loss)
            last_loss = loss
            print("=========")
    print("Finished training!")
    return weights
    
weights = train_nn(features, targets, epochs, learnrate)


#### Calculating the Accuracy on the Test Data

In [11]:
# Calculate accuracy on test data
test_out = sigmoid(np.dot(features_test, weights))
predictions = test_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))
