# Introduction to Neural Networks

In these notes, we go over some basic examples in order to understand how artificial neural networks can be viewed as an extension or variant of existing classification and linear regression techniques. We also provide an example of how to train a neural network using a common library.

## Gradient descent: simplest possible examples

**Example**: Suppose we have a data set consisting of a single point $(x, y) = (2, 6)$. We wish to find a model for this data from the space of linear functions of the form $y = \beta \cdot x$. Thus, we need to find the parameter $\beta$.

We can proceed by setting up the equation in the usual way by plugging the data point into the equation (if we had more data, we would have a system of equations):

$$y = \beta \cdot x$$
$$6 = \beta \cdot 2$$

Rather than solve the above directly using algebra, suppose we instead wish to use calculus. We can define an **error** function $\varepsilon(\beta)$ in terms of the unknown parameter $\beta$:


$$\varepsilon(\beta) = (\beta \cdot x - y)^2$$
$$\varepsilon(\beta) = (\beta \cdot 2 - 6)^2$$

We can expand the above:

\begin{eqnarray*}
\varepsilon(\beta) & = & (x^2 \beta^2) - 2\cdot (y \cdot x \cdot \beta) + (y^2) \\
\varepsilon(\beta) &=& 4 \beta^2 - 6 \cdot 2 \cdot \beta - 6 \cdot 2 \cdot \beta + 6^2 \\
                   &=& 4 \beta^2 - 24 \beta + 36
\end{eqnarray*}

Note that the function $\varepsilon(\beta)$ actually represents a parabola. If we want to **minimize** the error that our choice of parameter $\beta$ introduces into the model (or, equivalently, if we want to minimize the value of $\varepsilon(\beta)$), we can find the minimum point on the parabola. This can be done by setting the derivative of $\varepsilon$ with respect to $\beta$ to $0$, since the bottom of the parabola has a slope of $0$.

\begin{eqnarray*}
\varepsilon'(\beta) = \frac{d\varepsilon}{d\beta} &=& 2 \cdot x^2 \cdot \beta - 2 \cdot y \cdot x \\
                    &=& 2 \cdot 4 \cdot \beta - 24 \\
                    &=& 8\beta - 24
\end{eqnarray*}

Thus, we solve for the $\beta$ that minimizes the error.

\begin{eqnarray*}
0 &=& \varepsilon'(\beta) \\
0 &=& 8\beta - 24 \\
24 &=& 8 \beta \\
\beta &=& 3
\end{eqnarray*}

We could take yet another approach instead of directly solving $\varepsilon'(\beta) = 0$. Suppose we **guess** some $\beta^\ast$ and then adjust it depending on the slope of the error at our chosen $\beta^\ast$.

As long as we know that $\varepsilon$ is a parabola (or, more generally, that it is convex), we can do so by computing $\varepsilon'(\beta^\ast)$ and then updating our guess based on the slope. Suppose $\beta^\ast = 4$. Then we have:

\begin{eqnarray*}
\varepsilon'(\beta^\ast) &=& \varepsilon'(4) \\
                         &=& 8 \cdot 4 - 24 \\
  &=& 32 - 24 \\
  &=& 8
\end{eqnarray*}

Now suppose $\beta^\ast = 2$. Then we have:

\begin{eqnarray*}
\varepsilon'(\beta^\ast) &=& \varepsilon'(2) \\
                         &=& 8 \cdot 2 - 24 \\
  &=& 16 - 24 \\
  &=& -8
\end{eqnarray*}

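The slope is positive when our guess is too large and negative when it is too small. So one approach we could take is to compute $\varepsilon'(\beta^\ast)$ and then move our guess $\beta^\ast$ in the opposite direction, by an amount proportional to this slope (e.g., weighted by an update coefficient $a$):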

$$\beta_{j+1}^\ast = \beta_j^\ast - a \cdot \varepsilon'(\beta_j^\ast)$$

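You will notice that for $a = \frac{1}{8}$, we would converge from a guess of $2$ or $4$ onto the correct value $3$ with only one update step. On the other hand, for a poorly chosen update coefficient $a$, our guess would actually become progressively worse: since $\varepsilon'(\beta) = 8(\beta - 3)$, each update multiplies the distance from $3$ by $|1 - 8a|$, which exceeds $1$ whenever $a > \frac{1}{4}$.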
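The update rule can be tried out directly. The following is a minimal sketch in plain Python for the single data point $(2, 6)$; the function names `error_slope` and `gradient_descent` and the coefficients $0.05$ and $0.3$ are illustrative choices, not part of the original notes.

```python
def error_slope(beta, x=2.0, y=6.0):
    """Derivative of (beta*x - y)^2 with respect to beta: 2*x^2*beta - 2*y*x."""
    return 2 * x * x * beta - 2 * y * x


def gradient_descent(beta, a, steps):
    """Apply beta <- beta - a * slope repeatedly and return every guess made."""
    guesses = [beta]
    for _ in range(steps):
        beta = beta - a * error_slope(beta)
        guesses.append(beta)
    return guesses


# With a = 1/8, a single step lands exactly on the minimizer.
one_step = gradient_descent(4.0, a=1 / 8, steps=1)

# A smaller coefficient still converges, but only gradually.
slow = gradient_descent(4.0, a=0.05, steps=20)

# A coefficient that is too large overshoots further on every step.
diverging = gradient_descent(4.0, a=0.3, steps=5)

print(one_step)
print(slow[-1])
print(diverging)
```

Printing `diverging` shows the guesses alternating around $3$ while moving further away from it each time.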


**Example**: Suppose we have a data set consisting of two points $(x_1, y_1)$ and $(x_2, y_2)$. We can extend the approach from the previous example. The system of equations would be:

$$y_1 = \beta \cdot x_1$$
$$y_2 = \beta \cdot x_2$$

The error function could be the typical sum of squares:

\begin{eqnarray*}
\varepsilon(\beta) & = & (\beta x_1 - y_1)^2 + (\beta x_2 - y_2)^2 \\
                   & = & \sum_{i = 1}^{2} (\beta x_i - y_i)^2
\end{eqnarray*}

We can expand this as we did in the previous example:

\begin{eqnarray*}
\varepsilon(\beta) &=& \sum_{i = 1}^{2} (x_i^2 \beta^2 - 2 y_i x_i \beta + y_i^2) \\
\varepsilon(\beta) &=& \sum_{i = 1}^{2} x_i^2 \beta^2 - \sum_{i = 1}^{2} 2 y_i x_i \beta + \sum_{i = 1}^{2} y_i^2 \\
                   &=& \beta^2 \sum_{i = 1}^{2} x_i^2 - \beta \sum_{i = 1}^{2} 2 y_i x_i + \sum_{i = 1}^{2} y_i^2 \\
                   &=& \beta^2 (...) - \beta (...) + (...)
\end{eqnarray*}

Notice that in the above, the terms $\beta^2$ and $\beta$ have been isolated using algebra, and their coefficients are a function of the data itself. Thus, from this point forward the problem can be solved in exactly the same way as was done in the previous example.
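As a small illustration of how the coefficients depend only on the data, the sketch below computes them for a list of points and then solves $\varepsilon'(\beta) = 0$ directly. The helper names `quadratic_coefficients` and `best_beta` and the two-point data set are hypothetical; the single-point call reuses $(2, 6)$ from the first example.

```python
def quadratic_coefficients(points):
    """Return (A, B, C) such that error(beta) = A*beta^2 - B*beta + C."""
    A = sum(x * x for x, _ in points)
    B = sum(2 * y * x for x, y in points)
    C = sum(y * y for _, y in points)
    return A, B, C


def best_beta(points):
    """Minimizer of the error: solve 2*A*beta - B = 0 (requires some x != 0)."""
    A, B, _ = quadratic_coefficients(points)
    return B / (2 * A)


# The single point (2, 6) reproduces 4*beta^2 - 24*beta + 36 and beta = 3.
print(quadratic_coefficients([(2.0, 6.0)]))
print(best_beta([(2.0, 6.0)]))

# A hypothetical two-point data set is handled by the same code.
print(best_beta([(1.0, 2.0), (2.0, 5.0)]))
```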

**Example**: Suppose we have a situation with some data set $(x_1,y_1),...,(x_n,y_n)$ and two model parameters $\alpha$ and $\beta$. Then our system of equations would consist of equations of the form:

$$y_i = \beta x_i + \alpha$$

We can define an error function in this case, as well, and then isolate the $\alpha$ and $\beta$ factors from the terms that contain the actual data values:

\begin{eqnarray*}
\varepsilon(\beta, \alpha) & = & \sum_{i} ((\beta x_i + \alpha) - y_i)^2 \\
                   & = & \sum_{i} ((\beta x_i + \alpha)^2 - 2 y_i (\beta x_i + \alpha) + y_i^2) \\
                   & = & \sum_{i} ((\beta^2 x_i^2 + 2 \alpha \beta x_i + \alpha^2) - 2 y_i (\beta x_i + \alpha) + y_i^2) \\
                   & = & \sum_{i} (\beta^2 x_i^2 + 2 \alpha \beta x_i + \alpha^2 - 2 y_i \beta x_i - 2 y_i \alpha + y_i^2) \\
                   & = & \beta^2 (\sum_{i} x_i^2) + 2 \alpha \beta (\sum_{i} x_i) + \alpha^2 (\sum_{i} 1)  - 2 \beta (\sum_{i} y_i x_i) - 2 \alpha (\sum_{i} y_i) + (\sum_i y_i^2)  
\end{eqnarray*}

We can now compute the derivatives of $\varepsilon$ with respect to both $\alpha$ and $\beta$:

\begin{eqnarray*}
\frac{\partial\varepsilon(\beta, \alpha)}{\partial\beta} & = & 2 \beta (\sum_{i} x_i^2) + 2 \alpha (\sum_{i} x_i) - 2 (\sum_{i} y_i x_i)\\
\frac{\partial\varepsilon(\beta, \alpha)}{\partial\alpha} & = & 2 \beta (\sum_{i} x_i) + 2 \alpha (\sum_{i} 1) - 2 (\sum_{i} y_i)
\end{eqnarray*}

We sometimes write $\nabla\varepsilon(\beta, \alpha) = (\frac{\partial\varepsilon(\beta, \alpha)}{\partial\beta}, \frac{\partial\varepsilon(\beta, \alpha)}{\partial\alpha})$ or $\nabla\varepsilon = (\frac{\partial\varepsilon}{\partial\beta}, \frac{\partial\varepsilon}{\partial\alpha})$.

If we start with guesses $(\beta^\ast, \alpha^\ast)$ for our parameters, we can then compute $\nabla \varepsilon(\beta^\ast, \alpha^\ast)$ to obtain a gradient vector. We can then subtract a weighted version of this vector from our original guess using vector arithmetic. This yields an update rule for each iteration $j$ of a gradient descent algorithm:

$$(\beta_{j+1}^\ast, \alpha_{j+1}^\ast) = (\beta_{j}^\ast, \alpha_{j}^\ast) - a \cdot \nabla \varepsilon(\beta_j^\ast, \alpha_j^\ast)$$
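The two-parameter update can be sketched in plain Python using the partial derivatives derived above. The data set (generated from $y = 2x + 1$), the coefficient $a = 0.02$, the starting guess $(0, 0)$, and the names `gradient` and `descend` are assumptions made for illustration.

```python
def gradient(beta, alpha, points):
    """Partial derivatives of the summed squared error: (d/dbeta, d/dalpha)."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xx = sum(x * x for x, _ in points)
    sum_xy = sum(x * y for x, y in points)
    d_beta = 2 * beta * sum_xx + 2 * alpha * sum_x - 2 * sum_xy
    d_alpha = 2 * beta * sum_x + 2 * alpha * n - 2 * sum_y
    return d_beta, d_alpha


def descend(points, a, steps, beta=0.0, alpha=0.0):
    """Repeat (beta, alpha) <- (beta, alpha) - a * gradient for a fixed number of steps."""
    for _ in range(steps):
        d_beta, d_alpha = gradient(beta, alpha, points)
        beta, alpha = beta - a * d_beta, alpha - a * d_alpha
    return beta, alpha


# Hypothetical noise-free data lying on the line y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
beta, alpha = descend(points, a=0.02, steps=2000)
print(beta, alpha)
```

For this data the guesses approach $\beta = 2$ and $\alpha = 1$; as in the one-parameter case, a coefficient $a$ that is too large makes the iteration diverge instead.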

Finally, notice that in our examples the summation has always ranged over all data points. However, this is not necessary. We could instead perform each of the above iterations by computing $\nabla\varepsilon(\beta_j^\ast, \alpha_j^\ast)$ over only **some** of the data, using a new subset of the data for each iteration. This variant is commonly called stochastic (or mini-batch) gradient descent.
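The idea of using only part of the data in each iteration can be sketched as follows. This is an illustration under assumptions: the data is hypothetical and noise-free (so every subset agrees on the same best parameters), subsets are drawn uniformly at random with a fixed seed, and the names `batch_gradient` and `minibatch_descent` are our own.

```python
import random


def batch_gradient(beta, alpha, batch):
    """Gradient of the squared error summed over the points in `batch` only."""
    d_beta = sum(2 * x * (beta * x + alpha - y) for x, y in batch)
    d_alpha = sum(2 * (beta * x + alpha - y) for x, y in batch)
    return d_beta, d_alpha


def minibatch_descent(points, a, steps, batch_size, seed=0):
    """Gradient descent in which every step sees a fresh random subset of the data."""
    rng = random.Random(seed)
    beta, alpha = 0.0, 0.0
    for _ in range(steps):
        batch = rng.sample(points, batch_size)
        d_beta, d_alpha = batch_gradient(beta, alpha, batch)
        beta, alpha = beta - a * d_beta, alpha - a * d_alpha
    return beta, alpha


# Hypothetical noise-free data lying on the line y = 2x + 1.
points = [(float(x), 2.0 * x + 1.0) for x in range(4)]
beta, alpha = minibatch_descent(points, a=0.02, steps=5000, batch_size=2)
print(beta, alpha)
```

With noisy data the subsets would disagree slightly about the best parameters, so the guesses would hover near the minimizer rather than settle on it exactly.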

## TensorFlow example: classification

In [51]:

# This notebook modified by Adam Smith

# Original version copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0

"""An Example of a DNNClassifier for the Iris dataset."""

import argparse
import tensorflow as tf

import iris_data


In [52]:
# This code can be modified to read arguments from the command line, when appropriate. 
parser = argparse.ArgumentParser()
parser.add_argument('--batch_size', default=100, type=int, help='batch size')
parser.add_argument('--train_steps', default=1000, type=int,
                    help='number of training steps')
args = parser.parse_args([])

First, we load the data into dataframes.

In [53]:
# Fetch the data.
(train_x, train_y), (test_x, test_y) = iris_data.load_data()


We can examine the data.

In [54]:
type(train_x)

pandas.core.frame.DataFrame

In [55]:
train_x.head()

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth
0,6.4,2.8,5.6,2.2
1,5.0,2.3,3.3,1.0
2,4.9,2.5,4.5,1.7
3,4.9,3.1,1.5,0.1
4,5.7,3.8,1.7,0.3


In [56]:
type(train_y)

pandas.core.series.Series

In [57]:
train_y.head()

0    2
1    1
2    2
3    0
4    0
Name: Species, dtype: int64

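The train/test split is 80/20. Note that `size` counts every entry in a dataframe (rows times columns), so with 4 feature columns the values below correspond to 120 training rows and 30 test rows.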

In [58]:
(train_x.size, test_x.size)

(480, 120)

We now instantiate the classifier. This object is specific to discrete classification and can be customized in terms of:
- features,
- network structure,
- number of classes, and
- (optionally) many other options (activation function, optimization methods, and so on).

Defaults worth knowing: 
- the activation function is ReLU, and
- "dropout" regularization is not used.

In [59]:

# Feature columns describe how to use the input.
# We are adding one numeric feature for each column of the training data.
my_feature_columns = []
for key in train_x.keys():
    fc = tf.feature_column.numeric_column(key=key)
    print(key, fc)
    my_feature_columns.append(fc)

print()

# Build 2 hidden layer DNN with 10, 10 units respectively.
classifier = tf.estimator.DNNClassifier(
    feature_columns = my_feature_columns,
    
    # Two hidden layers of 10 nodes each.
    hidden_units = [10, 10],
    
    # The model must choose between 3 classes.
    n_classes=3,
    
    ## We can also set the directory where model information will be saved.
    ##model_dir='models/iris'
    )

SepalLength _NumericColumn(key='SepalLength', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)
SepalWidth _NumericColumn(key='SepalWidth', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)
PetalLength _NumericColumn(key='PetalLength', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)
PetalWidth _NumericColumn(key='PetalWidth', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/h_/s381b4dj4cx2ywclhcp_3v4m0000gn/T/tmpbz5rnm50', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_pro

In [60]:
type(classifier)

tensorflow.python.estimator.canned.dnn.DNNClassifier
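To make the network structure concrete, here is a minimal sketch of the computation that a network of this shape performs: 4 input features, two hidden layers of 10 ReLU units each, and 3 output classes. It assumes standard fully connected layers followed by a softmax, which is our reading of `DNNClassifier` rather than something shown in this notebook. The weights are random placeholders, not trained values, and `dense`, `random_layer`, and `forward` are hypothetical helper names.

```python
import math
import random


def relu(values):
    """ReLU activation: negative values are replaced by zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, kernel, bias):
    """Fully connected layer; kernel[i][j] is the weight from input i to unit j."""
    return [
        sum(inputs[i] * kernel[i][j] for i in range(len(inputs))) + bias[j]
        for j in range(len(bias))
    ]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(rng, n_in, n_out):
    """Placeholder weights (random kernel, zero bias); NOT trained values."""
    kernel = [[rng.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_in)]
    return kernel, [0.0] * n_out


rng = random.Random(0)
layers = [random_layer(rng, 4, 10), random_layer(rng, 10, 10), random_layer(rng, 10, 3)]


def forward(features):
    """Hidden layers use ReLU; the final layer produces one score per class."""
    hidden = features
    for kernel, bias in layers[:-1]:
        hidden = relu(dense(hidden, kernel, bias))
    kernel, bias = layers[-1]
    return softmax(dense(hidden, kernel, bias))


# First row of train_x.head(): SepalLength, SepalWidth, PetalLength, PetalWidth.
probabilities = forward([6.4, 2.8, 5.6, 2.2])
print(probabilities)
```

Training amounts to adjusting the kernels and biases by gradient descent so that the probability assigned to the correct class increases.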

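We are now ready to train. We pass the input to the classifier as a function. That function must take no arguments and return a `tf.data.Dataset` object. Because the helper defined below does take arguments, we wrap it in a `lambda` when calling `train`.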

In [61]:
# This code is copied from iris_data.py.

def train_input_fn(features, labels, batch_size):
    """An input function for training"""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(1000).repeat().batch(batch_size)

    # Return the dataset.
    return dataset
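The effect of chaining shuffle, repeat, and batch can be imitated in plain Python. This sketch is a model of the behavior under our assumptions, not the `tf.data` implementation: it assumes the shuffle buffer (1000) is at least as large as the data set, so each pass over the data is a complete reshuffle, and it uses integers as stand-ins for the 120 training rows. The names `shuffled_forever` and `batches` are hypothetical.

```python
import itertools
import random


def shuffled_forever(examples, rng):
    """Yield the examples in a fresh random order, pass after pass, without end."""
    while True:
        order = list(examples)
        rng.shuffle(order)
        yield from order


def batches(examples, batch_size, seed=0):
    """Cut the endless shuffled stream into consecutive groups of batch_size."""
    stream = shuffled_forever(examples, random.Random(seed))
    while True:
        yield list(itertools.islice(stream, batch_size))


examples = list(range(120))  # stand-ins for the 120 training rows
batch_iter = batches(examples, batch_size=100)
first, second = next(batch_iter), next(batch_iter)
print(len(first), len(second))
```

Because the stream never ends, a batch can straddle two passes over the data: here the second batch holds the last 20 examples of the first pass followed by 80 from the second. This is also why training must be stopped by a step count.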

In [62]:
classifier.train(
    input_fn = lambda:iris_data.train_input_fn(train_x, train_y,
                                                 args.batch_size),
    steps = args.train_steps)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /var/folders/h_/s381b4dj4cx2ywclhcp_3v4m0000gn/T/tmpbz5rnm50/model.ckpt.
INFO:tensorflow:loss = 124.93739, step = 1
INFO:tensorflow:global_step/sec: 281.448
INFO:tensorflow:loss = 21.743896, step = 101 (0.357 sec)
INFO:tensorflow:global_step/sec: 441.261
INFO:tensorflow:loss = 13.786755, step = 201 (0.228 sec)
INFO:tensorflow:global_step/sec: 388.666
INFO:tensorflow:loss = 10.441718, step = 301 (0.256 sec)
INFO:tensorflow:global_step/sec: 467.135
INFO:tensorflow:loss = 6.570615, step = 401 (0.214 sec)
INFO:tensorflow:global_step/sec: 527.335
INFO:tensorflow:loss = 10.371038, step = 501 (0.189 sec)
INFO:tensorflow:global_step/sec: 540.657
INFO:tensorflow:loss = 5.057136, step = 601 (0.185 sec)
INFO:tensorflow

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x1253eca20>

The most straightforward way to construct an input function is directly from a dataframe (this can also be done from a NumPy array):

```python
import pandas as pd
# pandas input_fn.
my_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=pd.DataFrame({"x": x_data}),
    y=pd.Series(y_data),
    ...)
```

You can see this and other examples here:
https://www.tensorflow.org/versions/r1.3/get_started/input_fn

In [63]:
classifier.get_variable_names()

['dnn/hiddenlayer_0/bias',
 'dnn/hiddenlayer_0/bias/t_0/Adagrad',
 'dnn/hiddenlayer_0/kernel',
 'dnn/hiddenlayer_0/kernel/t_0/Adagrad',
 'dnn/hiddenlayer_1/bias',
 'dnn/hiddenlayer_1/bias/t_0/Adagrad',
 'dnn/hiddenlayer_1/kernel',
 'dnn/hiddenlayer_1/kernel/t_0/Adagrad',
 'dnn/logits/bias',
 'dnn/logits/bias/t_0/Adagrad',
 'dnn/logits/kernel',
 'dnn/logits/kernel/t_0/Adagrad',
 'global_step']

In [64]:
# We can inspect the weights and biases of the resulting model.
classifier.get_variable_value('dnn/hiddenlayer_0/kernel')

array([[-0.44063115, -0.23042291,  0.13243337,  0.8365632 ,  0.36168227,
         0.00820287,  0.35834   ,  0.17601633,  0.29611564,  0.234959  ],
       [-0.30628192, -1.0811728 , -0.5592357 ,  1.0111309 , -0.7359063 ,
        -0.85464525,  0.7836679 , -0.5366037 ,  0.20144102,  0.41184592],
       [ 0.57232094, -0.0753327 ,  0.3342247 ,  0.01109344,  0.01686647,
         0.4699366 ,  0.46981207, -0.3487868 , -0.5416549 , -0.4551163 ],
       [-0.00952396,  0.91267526,  0.48889056, -0.5507718 ,  0.5058314 ,
         0.4080208 , -0.44290552,  0.08135599,  0.3890145 ,  0.11176193]],
      dtype=float32)

In [65]:
eval_result = classifier.evaluate(
        input_fn = lambda:iris_data.eval_input_fn(test_x, test_y,
                                                  args.batch_size))

print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-11-06-22:20:31
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/h_/s381b4dj4cx2ywclhcp_3v4m0000gn/T/tmpbz5rnm50/model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-11-06-22:20:31
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.96666664, average_loss = 0.06147208, global_step = 1000, loss = 1.8441625
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1000: /var/folders/h_/s381b4dj4cx2ywclhcp_3v4m0000gn/T/tmpbz5rnm50/model.ckpt-1000

Test set accuracy: 0.967



In [66]:
# eval_result is a dictionary with a few basic statistics
for key in eval_result.keys():
    print(key, ": ", eval_result[key])

accuracy :  0.96666664
average_loss :  0.06147208
loss :  1.8441625
global_step :  1000
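The statistics above can be related to one another with a little arithmetic. The reading below is inferred from the reported numbers rather than stated in the notebook: with 30 test rows (120 entries across 4 feature columns), the accuracy corresponds to 29 correct predictions, and `loss` appears to be `average_loss` summed over those 30 rows.

```python
# Values copied from the evaluation output above.
accuracy = 0.96666664
average_loss = 0.06147208
loss = 1.8441625

# test_x.size is 120 and there are 4 feature columns, so there are 30 test rows.
n_test = 120 // 4

# Number of correctly classified test rows implied by the accuracy.
correct = round(accuracy * n_test)

# Total loss implied by the per-example average.
summed = average_loss * n_test

print(n_test, correct, summed)
```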
