# Solution to my first NN

Here we follow the pipeline suggested in the [assignment](https://www.coursera.org/learn/intro-to-deep-learning/peer/0AgYP/my1stnn).

In [None]:
%matplotlib notebook

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

from preprocessed_mnist import load_dataset

* Begin with logistic regression from the previous assignment to classify some number against others (e.g. zero vs nonzero)

The data returned by `preprocessed_mnist` has already been:

1. Normalized (note that the images have only one channel)
2. Split into train, validation and test sets

## Logistic regression separating zeros from non-zeros
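
The labels returned by `load_dataset` are digit classes, so a zero-vs-non-zero classifier first needs binary targets. A minimal sketch of that step, assuming the labels are integers in 0-9; the helper name `to_binary_labels` is hypothetical and not part of the assignment:

```python
import numpy as np


def to_binary_labels(y, positive_digit=0):
    """Return 1.0 where y equals positive_digit and 0.0 elsewhere, as a column vector."""
    y = np.asarray(y)
    return (y == positive_digit).astype(np.float32).reshape(-1, 1)


# Toy labels standing in for y_train
y_toy = np.array([0, 3, 0, 7, 1])
y_toy_binary = to_binary_labels(y_toy)
print(y_toy_binary.ravel())
```

The column shape `(samples, 1)` matches a placeholder declared with `shape=(None, 1)`.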

In [None]:
# Load the dataset
X_train, y_train, X_val, y_val, X_test, y_test = load_dataset()

In [None]:
# Parameters
small_number = 1e-3
n_iter = 10
# Maybe not needed
batch = 4

In [None]:
def reshaper(var):
    """
    Reshapes a 3-d array to a 2-d array by collapsing the last two dimensions
    
    Parameters
    ----------
    var : array, shape (samples, image-rows, image-columns)
        The variable to reshape
        
    Returns
    -------
    reshaped : array, shape (samples, image-rows * image-columns)
        The reshaped variable
    """
    
    reshaped = var.reshape(var.shape[0], var.shape[1]*var.shape[2])
    
    return reshaped
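
As a quick check of the flattening, the sketch below applies the same reshape to a random array standing in for a batch of 28 x 28 images (the image size is an assumption about the MNIST data). Passing `-1` lets NumPy infer the flattened size:

```python
import numpy as np

# Three fake single-channel images of 28 x 28 pixels
images = np.random.rand(3, 28, 28)

flat_explicit = images.reshape(images.shape[0], images.shape[1] * images.shape[2])
flat_inferred = images.reshape(images.shape[0], -1)

print(flat_explicit.shape)
# Each row holds the pixels of one image in row-major order
print(np.array_equal(flat_explicit[0], images[0].ravel()))
```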

In [None]:
X_train_r = reshaper(X_train)
X_val_r = reshaper(X_val)
X_test_r = reshaper(X_test)

In [None]:
n_training_ex = X_train_r.shape[0]
n_features = X_train_r.shape[1]

In [None]:
# The first dimension is None, as we would like to vary the number of input examples

# NOTE: How can a network take in all training examples at once?
#       When predicting one example, we are essentially sending in a row vector (a 1 x n matrix)
#       When training on several examples, we are sending in several row vectors stacked into an m x n matrix
#       The loss will still be a scalar, as input_y holds m labels and predicted_y holds m predictions,
#       which are combined element-wise and then reduced to a single number

input_X = tf.placeholder("float32", shape=(None, n_features), name="input_x")
input_y = tf.placeholder("float32", shape=(None, 1), name="input_y")
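
The shape bookkeeping in the note above can be checked without TensorFlow. The NumPy sketch below is illustrative only: the sizes `m` and `n` and the random data are made up. It runs a forward pass on `m` examples at once, reduces the per-example losses to a scalar, and shows why labels of shape `(m, 1)` should not be combined directly with predictions of shape `(m,)`:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 4                                   # examples, features (toy sizes)

X = rng.normal(size=(m, n))                   # one row per example
W = rng.normal(size=(n, 1)) * 1e-3            # one weight per feature
b = 0.0
y = rng.integers(0, 2, size=(m, 1)).astype(float)   # labels as a column vector

z = X @ W + b                                 # shape (m, 1)
p = 1.0 / (1.0 + np.exp(-z))                  # sigmoid, still (m, 1)
p_flat = np.squeeze(p, axis=1)                # shape (m,)

# Pitfall: (m, 1) * (m,) broadcasts to an (m, m) matrix
wrong = y * np.log(p_flat)

# With matching shapes, the mean over examples is a single number
y_flat = y.ravel()
loss = np.mean(-y_flat * np.log(p_flat) - (1 - y_flat) * np.log(1 - p_flat))
print(wrong.shape, loss)
```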

In [None]:
def get_w_and_b(rows, cols):
    """
    Returns weights and biases based on the input dimensions
    
    Parameters
    ----------
    rows : int
        Number of rows in the weights matrix
        This corresponds to features in the input layer and number of nodes in the previous layer for hidden layers
    cols : int
        Number of columns in the weights matrix and the bias matrix
        This corresponds to the number of nodes in the current layer (1 for binary logistic regression)
        
    Returns
    -------
    W : Variable, shape (rows, cols)
        The weights variable
    b : Variable, shape (1, cols)
        The bias variable
    """
     
    # We initialize with random weights to break symmetry
    W = tf.Variable(initial_value=np.random.randn(rows, cols)*small_number,
                    name="weights",
                    dtype='float32')

    # One bias per output node, broadcast over the examples in the batch
    b = tf.Variable(initial_value=np.random.randn(1, cols)*small_number,
                    name="bias",
                    dtype='float32')
    
    return W, b

In [None]:
# The weights must not depend on the number of training examples:
# input_X is (m, n_features), so W must be (n_features, 1) for tf.matmul(input_X, W)
W, b = get_w_and_b(n_features, 1)

In [None]:
# The model code

# Compute a vector of predictions, resulting shape should be [input_X.shape[0],]
# This is 1D, if you have extra dimensions, you can get rid of them with tf.squeeze.
# Don't forget the sigmoid.

# predicted_y = <predicted probabilities for input_X>
# NOTE: Predicted y will have the same number of rows as the number of input examples
# NOTE: Squeezing gets rid of the extra "bracket" (that is the 1 in (dim, 1)). This is needed for the scaffold
#       axis=1 ensures that only that dimension is removed, also when a single example is fed
predicted_y = tf.squeeze(tf.nn.sigmoid(tf.matmul(input_X, W) + b), axis=1)

# NOTE: input_y is flattened to the same 1-D shape as predicted_y. Multiplying an (m, 1) tensor
#       with an (m,) tensor would otherwise broadcast to an (m, m) matrix
y_true = tf.reshape(input_y, [-1])

# Loss. Should be a scalar number - average loss over all the objects
# tf.reduce_mean is your friend here
# loss = <logistic loss (scalar, mean over sample)>
# NOTE: We are not using tf.matmul(input_y , tf.log(predicted_y)) as matmul requires tensors with rank > 1
# NOTE: When optimizing, the 1/m factor from reduce_mean (as opposed to the plain sum from matmul) will not
#       change the location of the minima
loss = tf.reduce_mean(- y_true * tf.log(predicted_y) - (1 - y_true) * tf.log(1 - predicted_y))

# See above for an example. tf.train.*Optimizer
# optimizer = <optimizer that minimizes loss>
# NOTE: No var_list is needed here, as minimize() by default updates all trainable variables
#       (here W and b). The placeholders are fed, not optimized
optimizer = tf.train.MomentumOptimizer(0.01, 0.5).minimize(loss)
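
For intuition about what the momentum optimizer does, here is a NumPy sketch of the same kind of update on a tiny synthetic problem. It assumes the classic momentum rule (`accumulation = momentum * accumulation + gradient`, then `variable -= learning_rate * accumulation`); the data, the step count and the larger learning rate are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, linearly separable data: the label is 1 when the first feature is positive
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(float)

W = rng.normal(size=3) * 1e-3     # small random weights
b = 0.0
acc_W = np.zeros_like(W)          # momentum accumulators
acc_b = 0.0
learning_rate, momentum = 0.5, 0.5


def mean_log_loss(W, b):
    p = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    p = np.clip(p, 1e-7, 1 - 1e-7)            # guard against log(0)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))


initial_loss = mean_log_loss(W, b)
for _ in range(100):
    p = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    grad_W = X.T @ (p - y) / len(y)           # gradient of the mean log loss w.r.t. W
    grad_b = np.mean(p - y)
    acc_W = momentum * acc_W + grad_W
    acc_b = momentum * acc_b + grad_b
    W -= learning_rate * acc_W
    b -= learning_rate * acc_b
final_loss = mean_log_loss(W, b)

print(initial_loss, final_loss)
```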

* Generalize it to multiclass logistic regression. Either try to remember the week 1 lectures or google it.

* Instead of a weights vector you'll have to use a matrix with `shape=(features, classes)`

* softmax (exp over sum of exps) can be implemented manually or as `tf.nn.softmax`

* probably better to use STOCHASTIC gradient descent (minibatch)

* in which case sample should probably be shuffled (or use random subsamples on each iteration)

* Add a hidden layer. Now your logistic regression uses hidden neurons instead of inputs.

* Hidden layer uses the same math as output layer (ex-logistic regression), but uses some nonlinearity (sigmoid) instead of softmax

* You need to train both layers, not just output layer :)

* Do not initialize layers with zeros (due to symmetry effects). Gaussian noise with a small sigma will do.

* 50 hidden neurons and a sigmoid nonlinearity will do for a start. Many ways to improve here.

* In an ideal case this totals to 2 .dot's, 1 softmax and 1 sigmoid
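
The steps in the list above can be sketched in plain NumPy. This is a forward pass only, under made-up sizes (784 features, 50 hidden neurons, 10 classes) and with hypothetical helper names (`softmax`, `forward`, `iterate_minibatches`); training would additionally need gradients for both layers:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 784, 50, 10

# Small Gaussian noise instead of zeros for the weights, to break symmetry.
# The biases may start at zero, since the random weights already break it.
W1 = rng.normal(scale=0.01, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.01, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)


def softmax(z):
    """Exp over sum of exps per row; the row max is subtracted for numerical stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def forward(X):
    """Two dot products, one sigmoid and one softmax."""
    hidden = 1.0 / (1.0 + np.exp(-(X.dot(W1) + b1)))
    return softmax(hidden.dot(W2) + b2)


def iterate_minibatches(X, y, batch_size):
    """Yield shuffled (X, y) minibatches that together cover the data once."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]


# Toy data standing in for the flattened images and their labels
X_toy = rng.random((10, n_features))
y_toy = rng.integers(0, n_classes, size=10)

for X_batch, y_batch in iterate_minibatches(X_toy, y_toy, batch_size=4):
    probs = forward(X_batch)
    print(X_batch.shape, probs.shape, probs.sum(axis=1))
```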