*Notebook created by Jacob Kreider*

Notes from 'Deep Learning with Python' by Francois Chollet, Manning Press
<br/>
*Anything in quotes in the markdown sections is a direct quote from the book*

### 3.1.1 - 3.3.4
Notes to be transcribed from handwritten later

### 3.4.1 The IMDB Dataset
Working with IMDB data to classify reviews as positive or negative

In [0]:
# Downloading/loading the built-in imdb data
from keras.datasets import imdb

#Setting up train and test data
(trainData, trainLabels), (testData, testLabels) = imdb.load_data(
    num_words = 10000 #Only keep top 10K words
)


In [0]:
#Decoding one of the reviews back to English, just to see how it's done
wordIndex = imdb.get_word_index()
reverseWordIndex = dict(
    [(value, key) for (key, value) in wordIndex.items()])
decodedReview = ' '.join(
    [reverseWordIndex.get(i - 3, '?') for i in trainData[0]])
print(decodedReview)


### 3.4.2 Preparing the data
"You can't feed a list of integers into a neural network. You have to turn them
into tensors. There are two ways to do that:<br/><br/>
1. Pad your lists so they all have the same length, turn them into an integer
tensor of shape (samples, word_indices), and then use as the first layer in
your network a layer capable of handling such integer tensors (the 'embedding'
layer, covered later in the book).<br/><br/>
2. One-hot encode your lists to turn them into vectors of 0s and 1s. This would mean,
for instance, turning the sequence [3, 5] into a 10K-dimensional vector
that would be all zeroes except for indices 3 and 5, which would be ones. Then you
could use as the first layer in your network a 'Dense' layer, capable of handling
floating-point vector data.

Let's go with the latter solution to vectorize the data, which you'll do manually for
maximum clarity."
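
To make the second option concrete, here is a minimal sketch in plain Python (no NumPy) that encodes the sequence [3, 5] from the quote. The helper name `multi_hot` and the tiny dimension of 10 are my own choices for illustration; the notebook itself uses 10,000 dimensions.

```python
def multi_hot(sequence, dimension):
    """Return a list of floats with 1.0 at every index that appears in `sequence`."""
    vector = [0.0] * dimension
    for index in sequence:
        vector[index] = 1.0  # a repeated index just sets the same slot again
    return vector

encoded = multi_hot([3, 5], dimension=10)
print(encoded)
```

Note that this encoding throws away word order and word counts: a review is reduced to the set of words it contains.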

In [0]:
import numpy as np 

# Create an all-zero matrix of shape(len(sequences), dimension)
def vectorizeSequences(sequences, dimension = 10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        #Set specific indices of results[i] to 1s
        results[i, sequence] = 1.
    return results

xTrain = vectorizeSequences(trainData) #Vectorized training data
xTest = vectorizeSequences(testData) #Vectorized test data

#Vectorize the labels, as well
yTrain = np.asarray(trainLabels).astype('float32')
yTest = np.asarray(testLabels).astype('float32')


    

### 3.4.3 Building Your Network
"The input data are vectors, and the labels are scalars."
This type of data works well with a simple stack of fully connected ('Dense')
layers with 'relu' activations : Dense(16, activation = 'relu')<br/>
The above line passes 16 to the Dense layer because that's the
number of 'hidden units' in the layer. <br/><br/>
Hidden Units = dimension in the representation space
of the layer. A way of thinking about hidden units is that they
represent "how much freedom you're allowing the representation to
have when learning internal representations." As hidden units (dimensional
representation space) increases, your model can handle higher-complexity
problems, but computational complexity goes up, as does the potential
for overfitting.
### Two Key Architecture Decisions: how many layers to use and how many hidden units per layer
(We'll cover how to do this in the next chapter. For now, the author chooses for us.)<br/><br/>
For this chapter, we'll use this architecture:
* Two intermediate layers with 16 hidden units each
* A third, output layer that will return the scalar prediction

Relu activation will be used on the intermediate layers, and we'll use sigmoid
activation on the output layer so that we get probability scores.<br/><br/>
*Note: Relu (rectified linear unit) zeroes out negative numbers.*
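
As a rough sketch of what the two activations do to a single number (the function names here are mine, and Keras applies them element-wise to whole tensors rather than to scalars):

```python
import math

def relu(x):
    """Rectified linear unit: negative inputs become 0, positive inputs pass through."""
    return max(0.0, x)

def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print([relu(v) for v in (-2.0, 0.0, 3.5)])
print(sigmoid(-5.0), sigmoid(0.0), sigmoid(5.0))
```

Because the sigmoid output always lies between 0 and 1, it can be read as a probability, which is why it sits on the output layer.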
### The Model Definition in Keras

In [0]:
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation = 'relu', input_shape = (10000, )))
model.add(layers.Dense(16, activation = 'relu'))
model.add(layers.Dense(1, activation = 'sigmoid'))




### What are activation functions and why are they necessary?
Activation functions like 'relu' provide the ability to deal with non-linearity.
Without them, the 'Dense' layer would only consist of linear operations-- dot product
and addition (output = dot(W, input) + b)<br/><br/>
If this were the case, we could only handle linear transformations: "The hypothesis
space of the layer would be the set of all possible linear transformations of the
input data into a 16-dimensional space." Therefore, adding extra layers would not add
any extra benefit, as each successive stack would still just be implementing linear
operations.<br/><br/>
relu is the most common activation function, but there are many others.
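
The claim that stacking purely linear layers adds nothing can be checked by hand. The sketch below uses made-up 2x2 weights (not the notebook's 16-unit layers) and plain Python lists: two linear layers give exactly the same output as one layer with W = W2·W1 and b = W2·b1 + b2, while putting relu between them breaks that equivalence.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu_vec(v):
    return [max(0.0, a) for a in v]

# Hypothetical weights and input, chosen so the hidden layer has a negative entry
W1, b1 = [[1.0, -2.0], [0.5, 1.0]], [0.0, 1.0]
W2, b2 = [[2.0, 1.0], [-1.0, 3.0]], [1.0, -1.0]
x = [1.0, 2.0]

hidden = add(matvec(W1, x), b1)
two_linear_layers = add(matvec(W2, hidden), b2)

# The same mapping collapsed into a single linear layer
W, b = matmul(W2, W1), add(matvec(W2, b1), b2)
one_linear_layer = add(matvec(W, x), b)

# With relu in between, the result no longer matches any such collapse
with_relu = add(matvec(W2, relu_vec(hidden)), b2)

print(two_linear_layers, one_linear_layer, with_relu)
```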

### Choosing a loss function and an optimizer
In this problem, we are performing a binary classification with probability
as the output, so we'll be using *binary_crossentropy* as our loss function.
(We could also use something like *mean_squared_error*, but binary_crossentropy
is a better choice when we're dealing with output probabilities.)<br/><br/>
#### *Crossentropy* measures the distance between probability distributions. In this example, it measures the distance between the actual and predicted values.
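
A minimal hand-rolled version of binary crossentropy, to show what "distance between actual and predicted values" means numerically. This is a sketch, not Keras's implementation; the clipping constant `eps` and the example predictions are my own assumptions.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0 (eps is an assumption)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]   # hypothetical predictions
mostly_wrong = [0.1, 0.9, 0.2, 0.8]   # hypothetical predictions

print(binary_crossentropy(labels, mostly_right))
print(binary_crossentropy(labels, mostly_wrong))
```

Confident wrong answers are punished much harder than confident right answers are rewarded, so the second value is far larger than the first.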

In [0]:
# We configure the model in this step
model.compile(optimizer = 'rmsprop',
              loss = 'binary_crossentropy',
              metrics = ['accuracy'])

# This is not the only option. We're passing the optimizer, loss, and metrics as strings
# because they are packaged in keras. If we wanted to configure the parameters of
# the optimizer, we could pass it as a class instance, seen here:

# from keras import optimizers
# model.compile(optimizer = optimizers.RMSprop(lr = 0.001),
#               loss = 'binary_crossentropy',
#               metrics = ['accuracy'])
# (newer Keras versions name this argument learning_rate instead of lr)

# If we wanted to pass custom loss functions or metrics, we could create them as
# functions, then pass them as the loss or metrics arguments:

# from keras import losses, metrics
# model.compile(optimizer = optimizers.RMSprop(lr = 0.001),
#               loss = losses.binary_crossentropy,
#               metrics = [metrics.binary_accuracy])




### 3.4.4 Validating your approach

In [0]:
# First, we'll split off a validation set from the training data
xVal = xTrain[:10000]
partialXtrain = xTrain[10000:]
yVal = yTrain[:10000]
partialYtrain = yTrain[10000:]


In [0]:
# Next, we'll train the model for 20 *epochs* (which just means we'll iterate
# over the partialXtrain and partialYtrain tensors 20 times). We'll use *mini-batches* of 512
# samples. Loss and accuracy will be monitored on our validation set.
# This is achieved by passing xVal and yVal to the 'validation_data' argument

history = model.fit(partialXtrain,
                    partialYtrain,
                    epochs = 20,
                    batch_size = 512,
                    validation_data = (xVal, yVal))


In [0]:
# The *model.fit* call above returns a History object. It has a member *history*,
# which is a dict containing data about everything that happened during training.

# Examine the history:
historyDict = history.history
historyDict.keys()


### Examining the learning history
There are four entries in our example-- one per metric in the training and validation.
We can use Matplotlib to plot the loss and accuracy

In [0]:
# Plotting the training and validation loss
import matplotlib.pyplot as plt 

historyDict = history.history
lossValues = historyDict['loss']
valLossValues = historyDict['val_loss']

epochs = range(1, len(lossValues) + 1)

plt.plot(epochs, lossValues, 'bo', label = 'Training Loss')
plt.plot(epochs, valLossValues, 'b', label = 'Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()


In [0]:
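# Plotting the training and validation accuracy
# (these keys are 'acc'/'val_acc' in older Keras; newer versions use
# 'accuracy'/'val_accuracy', so check historyDict.keys() if you get a KeyError)
plt.clf()
accValues = historyDict['acc']
valAccValues = historyDict['val_acc']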

plt.plot(epochs, accValues, 'bo', label = 'Training Accuracy')
plt.plot(epochs, valAccValues, 'b', label = 'Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()


In the above charts, we see the telltale signs of *overfitting*. While accuracy increased
with each successive epoch on the training set, the validation accuracy peaked at about 4
or 5 epochs into the training. After just the second epoch, we were learning
representations that only really apply to the training data.<br/>
Next, we'll retrain the model from scratch, but use only four epochs.
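
Reading the best epoch off a chart can also be done programmatically. A small sketch, using made-up validation losses in place of `historyDict['val_loss']` (the numbers below are hypothetical, not this notebook's results):

```python
# Hypothetical per-epoch validation losses; epoch 1 is the first entry
hypothetical_val_loss = [0.40, 0.31, 0.28, 0.27, 0.29, 0.32, 0.36]

# Index of the smallest loss, shifted by one because epochs are counted from 1
best_epoch = min(range(len(hypothetical_val_loss)),
                 key=lambda i: hypothetical_val_loss[i]) + 1
print(best_epoch)
```

Everything trained after that epoch lowers training loss while validation loss climbs, which is the overfitting pattern described above.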

In [0]:
model = models.Sequential()
model.add(layers.Dense(16, activation = 'relu', input_shape = (10000, )))
model.add(layers.Dense(16, activation = 'relu'))
model.add(layers.Dense(1, activation = 'sigmoid'))

model.compile(optimizer = 'rmsprop',
              loss = 'binary_crossentropy',
              metrics = ['accuracy'])

model.fit(xTrain, yTrain, epochs = 4, batch_size = 512)
results = model.evaluate(xTest, yTest)


In the above model, we achieved slightly higher accuracy than our first model (88% vs 85%)
using a simpler, naive method that was far less computationally expensive.

### 3.4.5 Using a trained network to generate predictions on new data
Now that the model is trained, we can call *predict* to have it tell us the
likelihood that each review is positive.
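
The values returned by *predict* are probabilities of the positive class, one row per review. A sketch of turning such scores into hard labels, using hypothetical scores shaped like the (samples, 1) output and an assumed cut-off of 0.5:

```python
# Hypothetical scores, shaped like the (samples, 1) output of model.predict
scores = [[0.02], [0.97], [0.51], [0.49]]

# Assumed decision rule: above 0.5 means positive (1), otherwise negative (0)
labels = [1 if row[0] > 0.5 else 0 for row in scores]

# Distance from 0.5 is a crude measure of how confident the model is
confidence = [abs(row[0] - 0.5) for row in scores]
print(labels)
```

Scores near 0.02 or 0.97 are confident calls, while scores near 0.5 mean the model is effectively undecided.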

In [0]:
model.predict(xTest)


### 3.4.6 Further Experiments
The following experiments will help convince you that the architecture choices
you’ve made are all fairly reasonable, although there’s still room for improvement:
* You used two hidden layers. Try using one or three hidden layers, and see how doing so affects validation and test accuracy.
* Try using layers with more hidden units or fewer hidden units: 32 units, 64 units, and so on.
* Try using the mse loss function instead of binary_crossentropy.
* Try using the tanh activation (an activation that was popular in the early days of neural networks) instead of relu.

In [0]:
# Using 1 hidden layer:

model = models.Sequential()
model.add(layers.Dense(16, activation = 'relu', input_shape = (10000, )))
model.add(layers.Dense(1, activation = 'sigmoid'))

model.compile(optimizer = 'rmsprop',
              loss = 'binary_crossentropy',
              metrics = ['accuracy'])

history = model.fit(partialXtrain,
                    partialYtrain,
                    epochs=4,
                    batch_size=512,
                    validation_data=(xVal, yVal))

historyDict = history.history

# Plotting the training and validation accuracy
plt.clf()
accValues = historyDict['acc']
valAccValues = historyDict['val_acc']

epochs = range(1, len(accValues) + 1)

plt.plot(epochs, accValues, 'bo', label = 'Training Accuracy')
plt.plot(epochs, valAccValues, 'b', label = 'Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

# Evaluate test data
model.evaluate(xTest, yTest)


In [0]:
# Using 3 hidden layers:

model = models.Sequential()
model.add(layers.Dense(16, activation = 'relu', input_shape = (10000, )))
model.add(layers.Dense(16, activation = 'relu'))
model.add(layers.Dense(16, activation = 'relu'))
model.add(layers.Dense(1, activation = 'sigmoid'))

model.compile(optimizer = 'rmsprop',
              loss = 'binary_crossentropy',
              metrics = ['accuracy'])

history = model.fit(partialXtrain,
                    partialYtrain,
                    epochs=4,
                    batch_size=512,
                    validation_data=(xVal, yVal))

historyDict = history.history

# Plotting the training and validation accuracy
plt.clf()
accValues = historyDict['acc']
valAccValues = historyDict['val_acc']

epochs = range(1, len(accValues) + 1)

plt.plot(epochs, accValues, 'bo', label = 'Training Accuracy')
plt.plot(epochs, valAccValues, 'b', label = 'Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

# Evaluate test data
model.evaluate(xTest, yTest)


Decreasing and increasing layers caused slight changes to loss and accuracy.
Interestingly, the single layer model performed (marginally) better than either
the 2 or 3 layer models.

In [0]:
# Using MSE instead of binary crossentropy
model = models.Sequential()
model.add(layers.Dense(16, activation = 'relu', input_shape = (10000, )))
model.add(layers.Dense(16, activation = 'relu'))
model.add(layers.Dense(1, activation = 'sigmoid'))

model.compile(optimizer = 'rmsprop',
              loss = 'mse',
              metrics = ['accuracy'])

history = model.fit(partialXtrain,
                    partialYtrain,
                    epochs=4,
                    batch_size=512,
                    validation_data=(xVal, yVal))

historyDict = history.history

# Plotting the training and validation accuracy
plt.clf()
accValues = historyDict['acc']
valAccValues = historyDict['val_acc']

epochs = range(1, len(accValues) + 1)

plt.plot(epochs, accValues, 'bo', label = 'Training Accuracy')
plt.plot(epochs, valAccValues, 'b', label = 'Validation Accuracy')
plt.title('Training and Validation Accuracy with MSE')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

# Evaluate test data
model.evaluate(xTest, yTest)


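With MSE as the loss function, the reported loss dropped significantly, but that is largely
because MSE and crossentropy produce values on different scales (MSE on sigmoid outputs can never
exceed 1, while crossentropy is unbounded), so the two loss values aren't directly comparable.
Test set accuracy was slightly lower than with crossentropy.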
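
To see why loss values from an MSE run and a crossentropy run can't be compared directly, both can be computed on the same predictions. This is a plain-Python sketch with hypothetical predictions; the helper names are mine.

```python
import math

def mean_squared_error(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def binary_crossentropy(y_true, y_pred):
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / len(y_true)

labels = [1, 0, 1, 0]
preds = [0.9, 0.2, 0.6, 0.4]  # hypothetical model outputs

mse_value = mean_squared_error(labels, preds)
bce_value = binary_crossentropy(labels, preds)
print(mse_value, bce_value)
```

For a probability p of the true class, (1 - p)^2 <= 1 - p <= -log(p), so on identical predictions MSE is never larger than crossentropy. A lower MSE number therefore says nothing about which model is better; accuracy is the comparable metric.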

In [0]:
# Try using the tanh activation
model = models.Sequential()
model.add(layers.Dense(16, activation = 'tanh', input_shape = (10000, )))
model.add(layers.Dense(16, activation = 'tanh'))
model.add(layers.Dense(1, activation = 'sigmoid'))

model.compile(optimizer = 'rmsprop',
              loss = 'binary_crossentropy',
              metrics = ['accuracy'])

history = model.fit(partialXtrain,
                    partialYtrain,
                    epochs=4,
                    batch_size=512,
                    validation_data=(xVal, yVal))

historyDict = history.history

# Plotting the training and validation accuracy
plt.clf()
accValues = historyDict['acc']
valAccValues = historyDict['val_acc']

epochs = range(1, len(accValues) + 1)

plt.plot(epochs, accValues, 'bo', label = 'Training Accuracy')
plt.plot(epochs, valAccValues, 'b', label = 'Validation Accuracy')
plt.title('Training and Validation Accuracy with tanh')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

# Evaluate test data
model.evaluate(xTest, yTest)


The tanh activation returned results similar to those of the MSE loss experiment.

In [0]:
#Try using layers with more hidden units or fewer hidden units

# 8 hidden units (note: this cell keeps the tanh activation from the previous experiment)
model = models.Sequential()
model.add(layers.Dense(8, activation = 'tanh', input_shape = (10000, )))
model.add(layers.Dense(8, activation = 'tanh'))
model.add(layers.Dense(1, activation = 'sigmoid'))

model.compile(optimizer = 'rmsprop',
              loss = 'binary_crossentropy',
              metrics = ['accuracy'])

history = model.fit(partialXtrain,
                    partialYtrain,
                    epochs=4,
                    batch_size=512,
                    validation_data=(xVal, yVal))

historyDict = history.history

# Plotting the training and validation accuracy
plt.clf()
accValues = historyDict['acc']
valAccValues = historyDict['val_acc']

epochs = range(1, len(accValues) + 1)

plt.plot(epochs, accValues, 'bo', label = 'Training Accuracy')
plt.plot(epochs, valAccValues, 'b', label = 'Validation Accuracy')
plt.title('Training and Validation Accuracy with 8 Hidden Units')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

# Evaluate test data
model.evaluate(xTest, yTest)

# 32 hidden units

model = models.Sequential()
model.add(layers.Dense(32, activation = 'tanh', input_shape = (10000, )))
model.add(layers.Dense(32, activation = 'tanh'))
model.add(layers.Dense(1, activation = 'sigmoid'))

model.compile(optimizer = 'rmsprop',
              loss = 'binary_crossentropy',
              metrics = ['accuracy'])

history = model.fit(partialXtrain,
                    partialYtrain,
                    epochs=4,
                    batch_size=512,
                    validation_data=(xVal, yVal))

historyDict = history.history

# Plotting the training and validation accuracy
plt.clf()
accValues = historyDict['acc']
valAccValues = historyDict['val_acc']

epochs = range(1, len(accValues) + 1)

plt.plot(epochs, accValues, 'bo', label = 'Training Accuracy')
plt.plot(epochs, valAccValues, 'b', label = 'Validation Accuracy')
plt.title('Training and Validation Accuracy with 32 Hidden Units')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

# Evaluate test data
model.evaluate(xTest, yTest)


8 hidden units produced a slight increase in accuracy, 32 a slight drop.
### Note that none of these changes in accuracy tell us any general truths about these choices. They all depend on the data.

## 3.5 Classifying newswires: a multiclass classification example
"In this section, you’ll build a network to classify Reuters newswires into 46 mutually exclusive topics. Because you have
many classes, this problem is an instance of multiclass classification; and because each data point should be classified
into only one category, the problem is more specifically an instance of single-label, multiclass classification.
If each data point could belong to multiple categories (in this case, topics), you’d be facing a multilabel, multiclass
classification problem."
#### *Single-label, multiclass classification*: each data point gets thrown into a single category, of which there are many
#### *Multilabel, multiclass classification* : each data point can belong to multiple categories

### 3.5.1 The Reuters Dataset
The Reuters dataset contains short newswires and their topics, of which there are 46. Each topic has at least 10 examples in
the training set.

In [0]:
# Load the dataset
from keras.datasets import reuters

# Create train and test data
(trainData, trainLabels), (testData, testLabels) = reuters.load_data(
    num_words = 10000) #restricts the data to the 10K most frequently used words


In [0]:
# Check the number of examples in the train and test sets
print(len(trainData))
len(testData)


As with our earlier example, each example in the training set is a list of integers
that map back to an index of words.
The label associated with each example is an integer between 0-45 that maps back to
an index of topics.
### 3.5.2 Preparing the Data

In [0]:
# Now, we'll vectorize the data using the same code we used in the last exercise

import numpy as np

def vectorizeSequences(sequences, dimension = 10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1
    return results

xTrain = vectorizeSequences(trainData)
xTest = vectorizeSequences(testData)


We have a couple of options for vectorizing the labels: cast the list as an integer tensor,
or use one-hot encoding (which is discussed further in chapter 6).
In this case, one-hot encoding is implemented the same way that the vectorization was above:

In [0]:
def oneHot(labels, dimension = 46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1
    return results

oneHotTrainLabels = oneHot(trainLabels)
oneHotTestLabels = oneHot(testLabels)


In [0]:
#Keras can do this for us using to_categorical

from keras.utils import to_categorical

oneHotTrainLabels = to_categorical(trainLabels)
oneHotTestLabels = to_categorical(testLabels)


### 3.5.3 Building Your Network
While this problem is similar to our movie review classifications, the dimensionality
is much higher-- we've gone from two classification groups to 46.
<br/>
A 16-dimensional space (hidden units) likely won't work here as it did in the last problem.
As information passes through stacks of Dense layers, the layer might drop some of that
information. When it does, it can't be recovered by deeper layers. This can create an
"information bottleneck", where relevant information for the output is permanently dropped.
To avoid that, we'll increase the dimensionality of our hidden layers by increasing the
hidden units to 64.

In [0]:
# Model Definition

from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(64, activation = 'relu', input_shape = (10000, )))
model.add(layers.Dense(64, activation = 'relu'))
model.add(layers.Dense(46, activation = 'softmax'))



#### Some notes about the above architecture:
* The output layer has dimensionality of 46 to match the topic list
* The *softmax* activation in the final layer returns a probability distribution across the 46 classes.
* The softmax probability distribution will sum to 1 and give the likelihood that the input belongs to each class.

The best loss function in this case is *categorical_crossentropy*. This measures the distance between two probability
distributions. By minimizing it, we train the network to output something as close to the true labels as possible.
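
A small sketch of softmax and categorical crossentropy on three classes instead of 46. The raw scores, the function names, and the `eps` guard are assumptions for illustration; this is not how Keras implements them internally.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs, eps=1e-7):
    """-sum(t * log(p)); only the true class contributes for one-hot targets."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

probs = softmax([2.0, 1.0, 0.1])  # hypothetical raw outputs for 3 classes
print(probs, sum(probs))
print(categorical_crossentropy([1, 0, 0], probs))  # true class is the likeliest
print(categorical_crossentropy([0, 0, 1], probs))  # true class is the least likely
```

The loss is small when the probability assigned to the true class is high, and grows as that probability shrinks.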

In [0]:
# Compile the model

model.compile(optimizer = 'rmsprop',
              loss = 'categorical_crossentropy',
              metrics = ['accuracy'])


### 3.5.4 Validating Your Approach
We'll create a validation set and train the model for 20 epochs:

In [0]:
#Create the validation data
xVal = xTrain[:1000]
partialXtrain = xTrain[1000:]

yVal = oneHotTrainLabels[:1000]
partialYtrain = oneHotTrainLabels[1000:]

# Train the model

history = model.fit(partialXtrain,
                    partialYtrain,
                    epochs=20,
                    batch_size=512,
                    validation_data=(xVal, yVal))


In [0]:
# Display the loss and accuracy curves for the model

import matplotlib.pyplot as plt

loss = history.history['loss']
valLoss = history.history['val_loss']

epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'bo', label = "Training Loss")
plt.plot(epochs, valLoss, 'b', label = "Validation Loss")
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()


In [0]:
#The Accuracy Curve:

plt.clf()

acc = history.history['acc']
valAcc = history.history['val_acc']

plt.plot(epochs, acc, 'bo', label = 'Training Accuracy')
plt.plot(epochs, valAcc, 'b', label = "Validation Accuracy")
plt.title("Training and Validation Accuracy")
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()


At 9 epochs, the model begins to overfit (there is a slight drop in accuracy at that
point before it increases throughout the remaining iterations)
<br/><br/>
We'll rebuild the model from scratch using only 9 epochs and evaluate it against our test set

In [0]:
# Retrain the model from scratch

model = models.Sequential()
model.add(layers.Dense(64, activation = 'relu', input_shape = (10000, )))
model.add(layers.Dense(64, activation = 'relu'))
model.add(layers.Dense(46, activation = 'softmax'))

model.compile(optimizer = 'rmsprop',
              loss = 'categorical_crossentropy',
              metrics = ['accuracy'])

model.fit(partialXtrain,
          partialYtrain,
          epochs=9,
          batch_size=512,
          validation_data=(xVal, yVal))

results = model.evaluate(xTest, oneHotTestLabels)


The model results of 77% accuracy far outperform the random baseline of ~19%
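
A ~19% baseline is far above the 1/46 (about 2%) that uniform guessing over 46 topics would give, which suggests the classes are imbalanced. A hedged sketch of where such a number can come from, assuming the baseline compares the labels against a shuffled copy of themselves: the expected accuracy is then the sum of squared class frequencies. The class counts below are invented for illustration and are not the Reuters counts.

```python
def shuffled_baseline_accuracy(class_counts):
    """Expected accuracy when labels are compared with a random shuffle of themselves."""
    total = sum(class_counts)
    return sum((count / total) ** 2 for count in class_counts)

hypothetical_counts = [40, 30, 20, 10]  # made-up, imbalanced 4-class dataset
imbalanced = shuffled_baseline_accuracy(hypothetical_counts)
balanced = shuffled_baseline_accuracy([25, 25, 25, 25])
print(imbalanced, balanced)
```

The more skewed the class distribution, the higher this baseline, so it is the fairer number to beat.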
### 3.5.5 Generating predictions on new data

In [0]:
#Generate prediction on the test data
predictions = model.predict(xTest)

# Each entry should be a vector with the same length as the number of topics (46)
print(predictions[0].shape)

# and the coefficients of each of those vectors should sum to one
print(np.sum(predictions[0]))

#Whatever class in each vector has the highest value is the predicted class
print(np.argmax(predictions[0]))


### 3.5.6 A different way to handle labels and loss
We could have cast the labels as an integer tensor instead of one-hot encoding them. To do
this, you just call yTrain = np.array(trainLabels).
<br/><br/>
Not much would change by doing this, except we wouldn't be able to use categorical_crossentropy
for our loss function. That loss requires labels to follow categorical (one-hot) encoding.
<br/><br/>
Instead, we would use *sparse_categorical_crossentropy*, which is mathematically the same loss function; it just
expects integer labels.
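
A quick plain-Python check that the two losses agree when given the same label in their respective encodings. The function names mirror the Keras loss names, but these are my own minimal versions, and the probabilities are hypothetical.

```python
import math

def categorical_crossentropy(one_hot, probs):
    """Expects the target as a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(label, probs):
    """Expects the target as an integer class index."""
    return -math.log(probs[label])

probs = [0.7, 0.2, 0.1]  # hypothetical softmax output for 3 classes
dense_loss = categorical_crossentropy([1, 0, 0], probs)
sparse_loss = sparse_categorical_crossentropy(0, probs)
print(dense_loss, sparse_loss)
```

The integer version simply looks up the probability of the true class instead of multiplying by a vector that is zero everywhere else.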
### 3.5.7 The importance of having sufficiently large intermediate layers
What would happen if we had layers with dimensionality smaller than our final output? As mentioned
earlier, it would create an "information bottleneck". Let's see what that would do to our
model's performance:

In [0]:
# A model with an information bottleneck

model = models.Sequential()
model.add(layers.Dense(64, activation = 'relu', input_shape = (10000, )))
model.add(layers.Dense(4, activation = 'relu'))
model.add(layers.Dense(46, activation = 'softmax'))

model.compile(optimizer = 'rmsprop',
              loss = 'categorical_crossentropy',
              metrics = ['accuracy'])

model.fit(partialXtrain,
          partialYtrain,
          epochs=20,
          batch_size=512,
          validation_data=(xVal, yVal))


We get a 70.2% accuracy on the validation data, a nearly 10% absolute drop from our initial model.
### 3.5.8 Further Experiments
* Try using larger or smaller layers
* Try a single or three hidden layers
## Key Takeaways from This Example
* When you have N classes to categorize into, your output layer should be a Dense layer of size N
* If the problem is single-label, multiclass, then the output layer should use *softmax* activation
* Categorical crossentropy is nearly always the correct loss function for this type of problem
* Labels can either be cast as integers (loss function becomes sparse categorical crossentropy) or one-hot encoded
* Avoid creating information bottlenecks: make sure the hidden layers are big enough so that info isn't dropped
## 3.6 Predicting house prices: a regression example
### 3.6.1 The Boston Housing Prices dataset
We'll attempt to predict the median price of homes in Boston suburbs in the mid-70s. Unlike the previous
examples, this dataset has relatively few data points: 506, split between 404 for training and 102 for testing.
<br/><br/>
Also, each feature in the dataset (e.g. crime rate) has a different scale.

In [0]:
# Load the data
from keras.datasets import boston_housing

#Create train and test data
(trainData, trainTargets), (testData, testTargets) = boston_housing.load_data()


### 3.6.2 Preparing the Data
Neural networks want data that is homogeneous-- not in different scales. As such, we'll perform some
*feature-wise normalization*: for each feature, we'll subtract the mean of the feature and divide by
the standard deviation. This way the feature is centered around zero and has a *unit standard deviation*.
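
A toy version of feature-wise normalization using only the standard library, with two invented features on very different scales. `statistics.pstdev` is the population standard deviation, which matches NumPy's default `std`. All data values here are hypothetical.

```python
import statistics

# Hypothetical samples: [small-scale feature, large-scale feature]
train = [[0.1, 300.0], [0.3, 500.0], [0.5, 400.0]]
test = [[0.2, 450.0]]

# Statistics are computed per feature (per column) on the training data only
columns = list(zip(*train))
means = [statistics.mean(col) for col in columns]
stds = [statistics.pstdev(col) for col in columns]

def normalize(rows):
    """Subtract the training mean and divide by the training std, feature by feature."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

normalized_train = normalize(train)
normalized_test = normalize(test)  # reuses the training statistics
print(normalized_train)
print(normalized_test)
```

After this step both training features have mean 0 and standard deviation 1, so neither dominates just because of its units.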

In [0]:
# Using numpy to normalize the data
import numpy as np

mean = trainData.mean(axis = 0)
trainData -= mean
std = trainData.std(axis = 0)
trainData /= std

testData -= mean
testData /= std



#### *Note: The training mean and std were used on the test data. Never compute **anything** on the test data!*
<br/><br/>
### 3.6.3 Building your network
Since we have so few examples in our data, we'll create a small network: 2 hidden layers, 64 hidden units each.
Overfitting is always an issue with small training sets, so we can try to mitigate that a bit by
training a small network.

In [0]:
from keras import models
from keras import layers

def buildModel(): #defining the model as a function that we can call
    model = models.Sequential()
    model.add(layers.Dense(64, activation = 'relu', input_shape = (trainData.shape[1], )))
    model.add(layers.Dense(64, activation = 'relu'))
    model.add(layers.Dense(1))
    
    model.compile(optimizer = 'rmsprop',
                  loss = 'mse',
                  metrics = ['mae'])
    return model


### Some notes on the above model specification:
* The output layer has no activation because we are predicting continuous values. An activation function would limit the output's range.
* We're using mean squared error as our loss function.
* Our metric is *mean absolute error*: the absolute value of the difference between the predictions and the targets.
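
For a feel of how the loss and the metric differ, here is a plain-Python sketch computing both on three invented predictions. The targets in this dataset are expressed in thousands of dollars, so the numbers below follow that convention; the values themselves are hypothetical.

```python
def mean_squared_error(targets, preds):
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)

def mean_absolute_error(targets, preds):
    return sum(abs(t - p) for t, p in zip(targets, preds)) / len(targets)

targets = [20.0, 15.0, 30.0]      # hypothetical prices, in thousands of dollars
predictions = [22.0, 14.0, 27.0]  # hypothetical model outputs

mse = mean_squared_error(targets, predictions)
mae = mean_absolute_error(targets, predictions)
print(mse, mae)
```

An MAE of 2.0 here reads directly as "off by $2,000 on average", which is why MAE is the easier number to interpret, while MSE penalizes the largest miss more heavily.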
### 3.6.4 Validating your approach using K-fold cross-validation
Since we have so few data points, simply splitting into a train and validation set can cause problems. Namely, our
validation scores might not be a good predictor of performance on our test data. There can be a high variance between
validation scores depending on which part of the small dataset gets set aside.
<br/><br/>
To deal with this, we'll use K-fold cross-validation. We'll split the available data into K partitions,
train K identical models, each on K-1 partitions while evaluating on the remaining one, and then average the K validation scores.
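
The partitioning logic can be sketched on index lists alone, before any model is involved. The generator name `k_fold_indices` is hypothetical, and the slicing follows a contiguous, unshuffled layout.

```python
def k_fold_indices(num_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    fold_size = num_samples // k
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        val_idx = list(range(start, stop))
        # Everything outside the validation slice is used for training
        train_idx = list(range(0, start)) + list(range(stop, num_samples))
        yield train_idx, val_idx

for train_idx, val_idx in k_fold_indices(8, 4):
    print('validate on', val_idx, 'train on', train_idx)
```

With integer division, any leftover samples (when the count is not divisible by K) never land in a validation fold and are always used for training. Every other sample is validated on exactly once.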

In [0]:
# Running the model with K-fold validation

k = 4
numValSamples = len(trainData) // k #Determining our fold size
numEpochs = 100

allScores = [] #We'll place our score after each iteration of the k-fold here

for i in range(k): #for each of our k-fold iterations
    print('processing fold #', i)
    valData = trainData[i * numValSamples: (i + 1) * numValSamples] #creating our data partition
    valTargets = trainTargets[i * numValSamples : (i + 1) * numValSamples]

    partialTrainData = np.concatenate(
        [trainData[:i * numValSamples],
        trainData[(i + 1) * numValSamples:]],
        axis = 0)
    partialTrainTargets = np.concatenate(
        [trainTargets[:i * numValSamples],
        trainTargets[(i + 1) * numValSamples:]],
        axis = 0)
    
    model = buildModel()
    model.fit(partialTrainData, partialTrainTargets,
              epochs = numEpochs, batch_size=1, verbose=0)
    valMSE, valMAE = model.evaluate(valData, valTargets, verbose=0)
    allScores.append(valMAE)





In [0]:
# Checking MAE scores and finding the mean
print(allScores)

print(np.mean(allScores))


As seen above, we're off by about $2,400 on each prediction. Since most house prices are < $50K, that's a pretty big gap.
<br/><br/>
Let's modify our code slightly so that we train across 500 epochs instead of 100, and track performance of each epoch
by saving the per-epoch validation score log:

In [0]:
# Saving the validation logs at each fold
numEpochs = 500
allMAEhistories = []

for i in range(k):
    print('processing fold #', i)
    valData = trainData[i * numValSamples: (i + 1) * numValSamples]
    valTargets = trainTargets[i * numValSamples: (i + 1) * numValSamples]
    partialTrainData = np.concatenate(
        [trainData[:i * numValSamples],
        trainData[(i + 1) * numValSamples:]],
        axis = 0)
    partialTrainTargets = np.concatenate(
        [trainTargets[:i * numValSamples],
        trainTargets[(i + 1) * numValSamples:]],
        axis = 0)
    
    model = buildModel()
    history = model.fit(partialTrainData, partialTrainTargets,
                        validation_data=(valData, valTargets),
                        epochs=numEpochs, batch_size=1, verbose=0)
    # (older Keras logs this key as 'val_mean_absolute_error'; newer versions use 'val_mae')
    maeHistory = history.history['val_mean_absolute_error']
    allMAEhistories.append(maeHistory)


In [0]:
# Now, we'll compute the history of successive mean K-fold scores

averageMaeHistory = [
    np.mean([x[i] for x in allMAEhistories]) for i in range(numEpochs)]

# and plot the results

import matplotlib.pyplot as plt

plt.plot(range(1, len(averageMaeHistory) + 1), averageMaeHistory)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()


In [0]:
# That chart was a bit tough to read, so we'll do a few things:
# * Omit the first 10 data points, to get the chart's scale a bit more readable 
# * Replace the points with an exponential moving average to smooth out the line

def smoothCurve(points, factor = 0.9):
    smoothedPoints = []
    for point in points:
        if smoothedPoints:
            previous = smoothedPoints[-1]
            smoothedPoints.append(previous * factor + point * (1 - factor))
        else:
            smoothedPoints.append(point)
    return smoothedPoints

smoothMAEhistory = smoothCurve(averageMaeHistory[10:])

plt.plot(range(1, len(smoothMAEhistory) + 1), smoothMAEhistory)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()


We can see that the model stops improving after ~80 epochs. So, we can tune our
parameters to reflect that in our final model. *Note: we'll also change our batch size to 16,
reflecting the author's final model in the book.*

In [0]:

model = buildModel()
model.fit(trainData, trainTargets,
          epochs=80, batch_size=16, verbose=0)
testMseScore, testMaeScore = model.evaluate(testData, testTargets)


We're off by ~$2,900 on average
## Key Takeaways from This Example
* Regression uses different loss functions (MSE being a popular one)
* Evaluation metrics are also different for regression. Mean absolute error (MAE) is common
* When features (columns) in the input data have different scales, each feature should be normalized in preprocessing
* This can be done by subtracting the feature mean and dividing by the feature std
* When the dataset has few observations, K-fold validation can increase evaluation reliability
* When observations are few, use a small network with only one or two hidden layers