# Introduction to Keras

We will make extensive use of the numpy, pandas, and matplotlib libraries throughout the lectures.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### 1. Recap of the previous notebook

In the previous notebook, we implemented our own logistic regression classifier using the negative log likelihood loss function and a stochastic gradient descent optimizer.

In [None]:
from sklearn.metrics import log_loss


class LogisticRegression:
    
    def __init__(self, learning_rate=0.1, max_iter=100, tol=1e-3,
                 batch_size=20):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.tol = tol
        self.batch_size = batch_size
    
    def _sigmoid(self, X):
        return 1 / (1 + np.exp(-X))
    
    def _decision_function(self, X):
        return np.dot(X, self.coefs_)
    
    def decision_function(self, X):
        X = self._add_intercept(X)
        return self._decision_function(X).ravel()
    
    def _grad_nll(self, X, y):
        grad = (self._predict_proba(X) - y)
        return np.dot(X.T, grad)
    
    def _add_intercept(self, X):
        return np.hstack((X, np.ones(shape=(X.shape[0], 1))))

    def fit(self, X, y):
        X = self._add_intercept(X)
        # Make y a column vector for later operations
        y = np.atleast_2d(y).T
        # Randomly initialize the weights
        self.coefs_ = np.random.rand(X.shape[1], 1)
        
        it = 0
        loss = np.inf
        while it < self.max_iter and loss > self.tol:
            # select a minibatch
            idx = np.random.choice(np.arange(X.shape[0]),
                                   size=self.batch_size)
            X_subset, y_subset = X[idx], y[idx]
            # compute the gradient
            dnll = self._grad_nll(X_subset, y_subset)
            # update the parameters
            self.coefs_ -= (self.learning_rate / X_subset.shape[0]) * dnll
            # update the loss and the number of iterations
            loss = log_loss(y, self._predict_proba(X))
            it += 1
        return self
    
    def _predict_proba(self, X):
        return self._sigmoid(self._decision_function(X))
    
    def predict_proba(self, X):
        X = self._add_intercept(X)
        return self._predict_proba(X)

    def predict(self, X):
        prob = self.predict_proba(X)
        prob[prob < 0.5] = 0
        prob[prob >= 0.5] = 1
        return prob.astype(int).ravel()
    
    def score(self, X, y):
        return np.mean(y == self.predict(X))
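
As a sanity check on the gradient used in `_grad_nll`, the sketch below compares the analytic expression `X.T @ (sigmoid(X @ w) - y)` with a central finite-difference approximation of the negative log likelihood. The toy data and the helper names (`nll`, `grad_nll`, `X_toy`, `w_toy`) are assumptions made for illustration and are not part of the class above.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def nll(w, X, y):
    # Negative log likelihood summed over the samples
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))


def grad_nll(w, X, y):
    # Analytic gradient, same expression as in `_grad_nll`
    return X.T @ (sigmoid(X @ w) - y)


rng = np.random.RandomState(0)
X_toy = rng.randn(20, 3)
y_toy = rng.randint(0, 2, size=(20, 1)).astype(float)
w_toy = rng.randn(3, 1)

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(w_toy)
for j in range(w_toy.shape[0]):
    step = np.zeros_like(w_toy)
    step[j] = eps
    numeric[j] = (nll(w_toy + step, X_toy, y_toy)
                  - nll(w_toy - step, X_toy, y_toy)) / (2 * eps)

analytic = grad_nll(w_toy, X_toy, y_toy)
max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

Note that `fit` divides this summed gradient by the mini-batch size, so each update uses the mean gradient over the mini-batch.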

We showed that it worked quite well on a very small dataset.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target

We will keep only the first two features.

In [None]:
X = X[:, :2]

In addition, we will use only the samples corresponding to the class `0` and `1`.

In [None]:
mask_class_0_1 = np.bitwise_or(y == 0, y == 1)

In [None]:
X = X[mask_class_0_1]
y = y[mask_class_0_1]

In [None]:
clf = LogisticRegression(learning_rate=0.1)
clf.fit(X, y).score(X, y)

### 2. What is Keras?

Keras is an open source neural network library written in Python. So what is the relationship between our logistic regression and a neural network? A logistic regression is equivalent to a neural network without any hidden layer and with a single sigmoid output unit. Therefore, we can use Keras to implement our logistic regression and get used to the Keras API along the way.

Keras gives us all the tools we need to create our logistic regression.

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation

In [None]:
model = Sequential()
model.add(Dense(1, input_shape=(X.shape[1],)))
model.add(Activation("sigmoid"))
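
To make the equivalence concrete, here is a minimal NumPy sketch, independent of Keras, of what a one-unit dense layer followed by a sigmoid computes. The kernel `W` and bias `b` are random values chosen for illustration, not weights read from the model. Appending a column of ones to the data and stacking the bias under the kernel gives the same probabilities as the `_add_intercept` approach of our own class.

```python
import numpy as np

rng = np.random.RandomState(0)
X_toy = rng.randn(5, 2)   # 5 samples, 2 features
W = rng.randn(2, 1)       # kernel of a one-unit dense layer (illustrative)
b = rng.randn(1)          # bias of the same layer (illustrative)


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


# Dense layer with one unit followed by a sigmoid activation
proba_network = sigmoid(X_toy @ W + b)

# Logistic regression with an intercept column, as in our own class
X_aug = np.hstack((X_toy, np.ones((X_toy.shape[0], 1))))
coefs = np.vstack((W, b.reshape(1, 1)))
proba_logreg = sigmoid(X_aug @ coefs)

same = np.allclose(proba_network, proba_logreg)
print(same)
```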

In [None]:
from keras import optimizers

In [None]:
model.compile(optimizer=optimizers.SGD(learning_rate=0.1),
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
model.fit(X, y, epochs=10, batch_size=20)
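
An epoch is one full pass over the training set, whereas `max_iter` in our own class counted individual mini-batch updates. The short computation below works out how many weight updates the call above performs. It assumes the 100 samples left after keeping classes `0` and `1` of iris (50 each).

```python
import math

n_samples = 100   # 50 samples of class 0 and 50 of class 1
batch_size = 20
epochs = 10

# One update per mini-batch, and ceil(n_samples / batch_size) mini-batches per epoch
steps_per_epoch = math.ceil(n_samples / batch_size)
total_updates = epochs * steps_per_epoch
print(steps_per_epoch, total_updates)  # 5 50
```

Another difference is that our class draws each mini-batch at random with replacement, while Keras by default shuffles the data and goes through every sample once per epoch.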

In [None]:
from sklearn.metrics import accuracy_score

# A sigmoid output above 0.5 is assigned to class 1
y_pred = (model.predict(X) > 0.5).astype(int).ravel()
print('The mean accuracy is: ', accuracy_score(y, y_pred))

We can see that Keras lets us define the architecture of a neural network in a simple way and takes care of computing the gradients used to optimize the weights of the network. Now that we know the different components required by a supervised classifier, we can start to learn more about Keras.

### 3. Training Neural Networks with Keras

Now, we will use the `digits` dataset to train a neural network using Keras.

In [None]:
from sklearn.datasets import load_digits

digits = load_digits()

#### 3.1 Preprocessing

Before feeding the data to a classifier, we preprocess it. In the following code, we convert the data to 32-bit precision and standardize each feature to have zero mean and unit standard deviation.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

data = np.asarray(digits.data, dtype='float32')
target = np.asarray(digits.target, dtype='int32')

X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.15, random_state=37)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.mean(axis=0))
print(X_train.std(axis=0))
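
The standardization itself can be sketched in a few lines of NumPy. This is a simplified illustration on a toy array, not the scikit-learn implementation. It assumes the same two conventions as `StandardScaler`: the standard deviation is the population one (`ddof=0`), and constant features are left unscaled to avoid dividing by zero. The second point explains why the printed standard deviations can contain zeros for pixels that never change.

```python
import numpy as np

X_toy = np.array([[0., 1., 10.],
                  [0., 3., 20.],
                  [0., 5., 60.]], dtype='float32')

mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)
# Constant features have a zero standard deviation: divide by 1 instead
scale = np.where(std == 0, 1.0, std)

X_scaled = (X_toy - mean) / scale
# Inverting the transformation recovers the original values
X_back = X_scaled * scale + mean

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```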

Let's display one of the transformed samples (after feature standardization):

In [None]:
sample_index = 45
plt.figure(figsize=(3, 3))
plt.imshow(X_train[sample_index].reshape(8, 8),
           cmap=plt.cm.gray_r, interpolation='nearest')
plt.axis('off')
plt.title("transformed sample\n(standardization)");

The scaler object makes it possible to recover the original sample:

In [None]:
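# inverse_transform expects a 2D array, so keep the sample as a single row
original = scaler.inverse_transform(X_train[sample_index:sample_index + 1])
plt.figure(figsize=(3, 3))
plt.imshow(original.reshape(8, 8),
           cmap=plt.cm.gray_r, interpolation='nearest')
plt.title("original sample");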

In [None]:
print(X_train.shape, y_train.shape)

In [None]:
print(X_test.shape, y_test.shape)

#### 3.2 Feed-forward neural network with Keras

Objectives of this section:

- Build and train a first feedforward network using `Keras`
    - https://keras.io/getting-started/sequential-model-guide/
- Experiment with different optimizers, activations, size of layers, initializations

#### 3.2.1 Keras Workflow

To build a first neural network, we need to turn the target variable into a one-hot encoded vector representation. Here are the labels of the first samples in the training set, encoded as integers:

In [None]:
y_train[:3]

Keras provides a utility function to convert integer-encoded categorical variables into one-hot encoded values:

In [None]:
import keras
from keras.utils import to_categorical

Y_train = to_categorical(y_train)
Y_train[:3]
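
For intuition, one-hot encoding can be sketched in plain NumPy by indexing the rows of an identity matrix with the integer labels. The toy labels and the names `y_toy`, `Y_toy` and `n_classes` are made up for illustration.

```python
import numpy as np

y_toy = np.array([2, 0, 1, 2])
n_classes = 3

# Row i of the identity matrix is the one-hot vector of class i
Y_toy = np.eye(n_classes, dtype='float32')[y_toy]
print(Y_toy)

# Taking the argmax along the class axis recovers the integer labels
recovered = Y_toy.argmax(axis=1)
print(recovered)
```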

We can now build and train our first feed-forward neural network using the high-level API from Keras:

- first we define the model by stacking layers with the right dimensions
- then we define a loss function and plug in the SGD optimizer
- then we feed the model the training data for a fixed number of epochs

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras import optimizers

N = X_train.shape[1]
H = 100
K = 10

model = Sequential()
model.add(Dense(H, input_dim=N))
model.add(Activation("tanh"))
model.add(Dense(K))
model.add(Activation("softmax"))

model.compile(optimizer=optimizers.SGD(learning_rate=10),
              loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(X_train, Y_train, epochs=15, batch_size=32);
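
To see what this architecture computes, the following NumPy sketch reproduces the forward pass (dense layer, `tanh`, dense layer, softmax), the categorical cross-entropy loss, and the number of trainable parameters. The weights are random placeholders rather than values taken from the trained model, and the helper names are hypothetical. The sizes match the code above: the 8x8 digit images give 64 inputs, with 100 hidden units and 10 classes.

```python
import numpy as np

N, H, K = 64, 100, 10   # input, hidden and output sizes used above
rng = np.random.RandomState(0)

# Random placeholder parameters standing in for the trained weights
W_h, b_h = 0.1 * rng.randn(N, H), np.zeros(H)
W_o, b_o = 0.1 * rng.randn(H, K), np.zeros(K)


def softmax(z):
    # Subtract the row-wise maximum for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def forward(X):
    hidden = np.tanh(X @ W_h + b_h)      # dense layer with H units + tanh
    return softmax(hidden @ W_o + b_o)   # dense layer with K units + softmax


def categorical_crossentropy(Y_true, Y_pred):
    # Mean over the samples of -log(probability given to the true class)
    return -np.mean(np.sum(Y_true * np.log(Y_pred), axis=1))


X_toy = rng.randn(4, N)
Y_toy = np.eye(K)[[3, 1, 4, 1]]   # one-hot targets for 4 toy samples

proba = forward(X_toy)
loss = categorical_crossentropy(Y_toy, proba)
n_params = (N * H + H) + (H * K + K)
print(proba.shape, n_params, loss)
```

Each row of `proba` sums to one, and the parameter count is 64 * 100 + 100 for the hidden layer plus 100 * 10 + 10 for the output layer, that is 7510 in total.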

#### 3.2.2 Exercises: Impact of the Optimizer

- Try to decrease the learning rate value by a factor of 10 or 100. What do you observe?

- Try to increase the learning rate value to make the optimization diverge.

- Configure the SGD optimizer to enable a Nesterov momentum of 0.9
  
Note that the keras API documentation is available at:

https://keras.io/

It is also possible to learn more about the parameters of a class by using the question mark. Type and evaluate:

```python
optimizers.SGD?
```

in a jupyter notebook cell.
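
As background for the momentum exercise, here is a minimal NumPy sketch of SGD with classical and Nesterov momentum on a toy quadratic objective. The update rule follows the formulation commonly given for momentum SGD, which is assumed here to be the one Keras applies. The function names and the objective are made up for illustration, and the sketch does not call the Keras API.

```python
import numpy as np


def grad(w):
    # Gradient of the toy objective f(w) = 0.5 * ||w||^2
    return w


def sgd_momentum(w0, learning_rate=0.1, momentum=0.9, nesterov=False,
                 n_steps=300):
    w = np.array(w0, dtype=float)
    velocity = np.zeros_like(w)
    for _ in range(n_steps):
        g = grad(w)
        velocity = momentum * velocity - learning_rate * g
        if nesterov:
            # Look-ahead variant: apply the momentum once more on top of
            # the plain gradient step
            w = w + momentum * velocity - learning_rate * g
        else:
            w = w + velocity
    return w


w_classic = sgd_momentum([1.0, -2.0], nesterov=False)
w_nesterov = sgd_momentum([1.0, -2.0], nesterov=True)
print(np.abs(w_classic).max(), np.abs(w_nesterov).max())
```

Both variants converge to the minimum at zero on this objective.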

In [None]:
# %load solutions/02_01.py

- Replace the SGD optimizer with the Adam optimizer from Keras and run it
  with the default parameters.

- Add another hidden layer and use the "Rectified Linear Unit" for each
  hidden layer. Can you still train the model with Adam with its default global
  learning rate?

- Bonus: try the Adadelta optimizer (no learning rate to set).

Hint: use `optimizers.<TAB>` to tab-complete the list of implemented optimizers in Keras.

In [None]:
# %load solutions/02_02.py

#### 3.2.3 Exercises: forward pass and generalization

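- Compute predictions on the test set using `model.predict(...)` and take the `argmax` over the class axis (older versions of Keras also provide `model.predict_classes(...)`)
- Compute the average accuracy of the model on the test set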
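
The mechanics of going from class probabilities to an accuracy can be sketched on a toy array, independently of Keras. The probabilities and labels below are invented for illustration.

```python
import numpy as np

# Made-up class probabilities for 4 samples and 3 classes
proba = np.array([[0.70, 0.20, 0.10],
                  [0.10, 0.30, 0.60],
                  [0.20, 0.50, 0.30],
                  [0.40, 0.35, 0.25]])
y_true = np.array([0, 2, 1, 1])

# The predicted class is the one with the highest probability
y_pred = proba.argmax(axis=1)
accuracy = np.mean(y_pred == y_true)
print(y_pred, accuracy)  # [0 2 1 0] 0.75
```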

In [None]:
# %load solutions/02_03.py