<img src="imgs/keras-logo-small.jpg" width="20%" />

## Keras: The Python Deep Learning library

## Agenda

1. Reading in the Kaggle data, prepare BoW features as inputs and one-hot vector as labels
2. Using Keras for Neural Network
3. Add more layers in Keras

* Keras is a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow, CNTK or Theano. 

* It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.
ref: https://keras.io/

* TensorFlow is an open-source deep learning library for numerical computation. It represents a computation as a data flow graph: nodes are mathematical operations and edges are the tensors that flow between them. Working with it directly can be quite technical.

* In contrast, Keras makes deep neural network coding simple. It also runs seamlessly on CPU and GPU machines.

## Part 1: Reading in the Kaggle data, prepare BoW features

- Our goal is to predict the **cuisine** of a recipe, given its **ingredients**.
- **Feature engineering** is the process through which you create features that don't natively exist in the dataset.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# define a function that accepts a DataFrame and adds new features
def make_features(df):
    # string representation of the ingredient list
    df['ingredients_str'] = df.ingredients.astype(str)
    return df

In [4]:
# create the same features in the training data and the new data
train = make_features(pd.read_json('C:/Users/Administrator/Desktop/MSBA/BT5153 TOPICS IN BUSINESS ANALYTICS/BT5153_data/week6_train.json'))
new = make_features(pd.read_json('C:/Users/Administrator/Desktop/MSBA/BT5153 TOPICS IN BUSINESS ANALYTICS/BT5153_data/week6_test.json'))

In [5]:
# replace the regex pattern that is used for tokenization
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(token_pattern=r"'([a-z ]+)'")
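
To see what this token pattern does, the sketch below applies the same regular expression with Python's `re.findall`, which is how `CountVectorizer` uses a `token_pattern` that contains a capture group. The two sample strings and the names `sample_docs`, `tokenize`, `vocab` and `dtm` are made up for illustration; the real vectorizer also lowercases the text and returns a sparse matrix.

```python
import re
from collections import Counter

# Same pattern as above: capture lowercase letters and spaces between single quotes
TOKEN_PATTERN = re.compile(r"'([a-z ]+)'")

# Hypothetical ingredient strings in the same format as `ingredients_str`
sample_docs = [
    "['romaine lettuce', 'black olives', 'salt']",
    "['salt', 'black olives', 'olive oil']",
]

def tokenize(doc):
    # Each whole ingredient becomes one token, spaces included
    return TOKEN_PATTERN.findall(doc)

tokens_per_doc = [tokenize(doc) for doc in sample_docs]

# Vocabulary: sorted unique tokens (CountVectorizer also sorts its vocabulary)
vocab = sorted({tok for toks in tokens_per_doc for tok in toks})

# Document-term matrix as a list of count rows, one row per document
dtm = [[Counter(toks)[term] for term in vocab] for toks in tokens_per_doc]

print(tokens_per_doc[0])
print(vocab)
print(dtm)
```

Because the pattern keeps the spaces, `'black olives'` stays a single feature instead of being split into `black` and `olives`.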

In [6]:
# define X and y
X = train.ingredients_str
y = train.cuisine

In [7]:
# X is just a Series of strings
X.head()

0    ['romaine lettuce', 'black olives', 'grape tom...
1    ['plain flour', 'ground pepper', 'salt', 'toma...
2    ['eggs', 'pepper', 'salt', 'mayonaise', 'cooki...
3          ['water', 'vegetable oil', 'wheat', 'salt']
4    ['black pepper', 'shallots', 'cornflour', 'cay...
Name: ingredients_str, dtype: object

In [8]:
X_dtm = vect.fit_transform(X)

In [9]:
# to avoid the following warning: `DeprecationWarning: The truth value of an empty array is ambiguous`,
# update your sklearn version above 0.20.0
from sklearn.preprocessing import LabelEncoder  
from keras.utils import to_categorical

Using TensorFlow backend.


#### One-hot Encoding

   1. LabelEncoder: encodes each label as an integer between 0 and n_classes-1, where n_classes is the number of distinct labels. Repeated labels always map to the same integer.

`For categorical variables with no ordinal relationship, the integer encoding from LabelEncoder is not enough, because the integers imply an order that does not exist.`
   2. One-hot encoding (here via `to_categorical`): converts each distinct integer into a vector that is all 0s except for a single 1 at the position of that class.
   
`In a neural network for multi-class classification, the last layer is typically a softmax layer, and the categorical cross-entropy loss used with it expects one-hot label vectors.`
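
The two encoding steps can be sketched with NumPy alone. This is not the notebook's implementation (which uses `LabelEncoder` and `to_categorical`); the `labels` array and the variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical string labels
labels = np.array(['italian', 'greek', 'italian', 'thai'])

# Integer encoding: the sorted unique classes get indices 0..n_classes-1
classes, y_int = np.unique(labels, return_inverse=True)

# One-hot encoding: row i of the identity matrix is the one-hot vector for class i
y_onehot_demo = np.eye(len(classes))[y_int]

# Decoding: argmax recovers the index, which maps back to the class name
decoded = classes[np.argmax(y_onehot_demo, axis=1)]

print(classes)
print(y_int)
print(y_onehot_demo)
print(decoded)
```

Every one-hot vector has the same length as the number of classes, so no class is numerically "larger" than another.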


In [9]:
# integer encoding alone is not enough: it would imply an ordering among the cuisines
# so we one-hot encode; y is a Series of strings
le = LabelEncoder()
le.fit(y)

LabelEncoder()

In [10]:
le.classes_

array(['brazilian', 'british', 'cajun_creole', 'chinese', 'filipino',
       'french', 'greek', 'indian', 'irish', 'italian', 'jamaican',
       'japanese', 'korean', 'mexican', 'moroccan', 'russian',
       'southern_us', 'spanish', 'thai', 'vietnamese'], dtype=object)

In [11]:
# convert string labels to one-hot vectors
def encode(y_str):
    # from string to its unique number
    y_numeric = le.transform(y_str)
    # from unique number to one-hot vector
    y_onehot = to_categorical(y_numeric)
    return y_onehot

In [12]:
y_onehot = encode(y)

In [13]:
# inverse transform
def decode(vec_onehot):
    idx = np.argmax(vec_onehot)
    return le.inverse_transform([idx])[0]

In [14]:
print(y_onehot[10])
print(y[10])

[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
italian


---

## Part 2: Standard Deep Learning Pipeline in Keras

1. Model Construction (two ways)
    * Sequential 
    * Functional API
2. Model Compiling
    * loss function
    * optimization functions
3. Model Training
4. Model Predictions
5. Model save and re-load

In [15]:
from keras.models import Sequential
from keras.layers import Dense, Activation

In [16]:
print(type(X_dtm))
X_train = X_dtm.todense()
print(type(X_train))

<class 'scipy.sparse.csr.csr_matrix'>
<class 'numpy.matrixlib.defmatrix.matrix'>


In [17]:
print('sparse vector format:')
print(X_dtm[0])
print('dense vector format:')
print(X_train[0])

sparse vector format:
  (0, 2033)	1
  (0, 2361)	1
  (0, 4943)	1
  (0, 4450)	1
  (0, 4100)	1
  (0, 2367)	1
  (0, 2511)	1
  (0, 464)	1
  (0, 4755)	1
dense vector format:
[[0 0 0 ... 0 0 0]]
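
The relationship between the two formats can be reproduced by hand. The sketch below rebuilds the dense row from the column indices printed in the sparse output; `n_features = 6250` is the feature count the notebook reports in the next cell, and the variable names are illustrative.

```python
import numpy as np

n_features = 6250  # vocabulary size reported by the notebook
# Column indices of the non-zero entries printed for row 0 in the sparse format
cols = [2033, 2361, 4943, 4450, 4100, 2367, 2511, 464, 4755]

# The dense format stores every position, including all the zeros
dense_row = np.zeros(n_features, dtype=int)
dense_row[cols] = 1

n_nonzero = int(dense_row.sum())
n_zero = dense_row.size - n_nonzero
print(n_nonzero, n_zero)
```

The sparse format only stores the nine non-zero entries, while the dense row spends memory on thousands of zeros. That is why the dense conversion is done only when the model needs it.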


In [18]:
dims = X_train.shape[1]
print(dims, 'dims')
print("Building model...")

Y_train = y_onehot
nb_classes = Y_train.shape[1]
print(nb_classes, 'classes')

6250 dims
Building model...
20 classes


#### Build Model in the Sequential Mode

In [19]:
model = Sequential()
# add layers to model
model.add(Dense(400, input_shape=(dims,), activation='relu'))  # input_shape is needed only for the first layer
model.add(Dense(nb_classes, activation='softmax'))

Instructions for updating:
Colocations handled automatically by placer.
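
To make the two `Dense` layers less of a black box, here is a rough NumPy sketch of the forward pass they compute: a matrix multiply plus bias followed by ReLU, then another matrix multiply plus bias followed by softmax. The layer sizes are deliberately tiny and the weights are random, so all names and numbers here are assumptions, not values from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    # subtract the row-wise max for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Small hypothetical sizes so the example runs instantly
n_samples, n_in, n_hidden, n_out = 4, 30, 8, 5

x = rng.integers(0, 2, size=(n_samples, n_in)).astype(float)  # fake BoW rows
W1, b1 = rng.normal(scale=0.1, size=(n_in, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(scale=0.1, size=(n_hidden, n_out)), np.zeros(n_out)

hidden = relu(x @ W1 + b1)          # what Dense(n_hidden, activation='relu') computes
probs = softmax(hidden @ W2 + b2)   # what Dense(n_out, activation='softmax') computes

print(probs.shape)
print(probs.sum(axis=1))
```

Each output row is a probability distribution over the classes, which is why the rows sum to 1.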


#### Compile the model
    1. loss and optimizer are two required arguments for compiling a Keras model
    2. different optimizers can give different performance; try 'sgd' and compare the results

>**Activation** Supported : [https://keras.io/activations/] 
Advanced: [https://keras.io/layers/advanced-activations/]

>**Optimizer**
If you need to, you can further configure your optimizer. A core principle of Keras is to make things reasonably simple, while allowing the user to be fully in control when they need to (the ultimate control being the easy extensibility of the source code).
Here we used <b>adam</b> (Adaptive Moment Estimation) as an optimization algorithm for our trainable weights.  

<img src="http://sebastianruder.com/content/images/2016/09/saddle_point_evaluation_optimizers.gif" width="40%">

Source & Reference: http://sebastianruder.com/content/images/2016/09/saddle_point_evaluation_optimizers.gif
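
As a rough illustration of how the update rules differ, the sketch below runs plain gradient descent and Adam on the toy loss f(w) = w². It is a hand-written approximation of the two update rules, not Keras's implementation; the learning rate of 0.1 and the starting point are chosen only to suit this toy problem.

```python
import math

def grad(w):
    # gradient of the toy loss f(w) = w**2
    return 2.0 * w

def run_sgd(w, lr=0.1, steps=100):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def run_adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w0 = 5.0
w_sgd = run_sgd(w0)
w_adam = run_adam(w0)
print(w_sgd, w_adam)
```

SGD scales each step by the raw gradient, while Adam normalizes the step by a running estimate of the gradient magnitude, so its step size depends much less on the scale of the loss.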

In [20]:
# Selection of optimization is quite important, you may try 'sgd' to replace 'adam' and compare the performances 
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
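
For intuition about the `categorical_crossentropy` loss, a minimal NumPy version is sketched below. It assumes one-hot `y_true` rows and probability rows in `y_pred`; the function name, the clipping constant and the sample values are illustrative rather than taken from Keras.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # clip to avoid log(0), then average the per-sample losses
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

# Hypothetical one-hot labels and predicted probabilities for two samples
y_true = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
y_pred = np.array([[0.1, 0.8, 0.1], [0.5, 0.3, 0.2]])

loss = categorical_crossentropy(y_true, y_pred)
print(loss)
```

Only the probability assigned to the true class contributes to each sample's loss, so a confident correct prediction (0.8) costs less than a hesitant one (0.5).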

For the `fit` function of a neural network, the key terms are batch size, epoch and iteration:

1. one epoch: one forward pass and one backward pass of all the training examples
2. batch size: the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
3. number of iterations: number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).
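
The relationship between these three terms is simple arithmetic. In the sketch below, `n_samples` is a hypothetical value (in the notebook it would be `X_train.shape[0]`); the batch size and number of epochs match the `fit` call used here.

```python
import math

n_samples = 10_000   # hypothetical; use X_train.shape[0] in the notebook
batch_size = 1024
epochs = 5

# One iteration processes one batch; the last batch of an epoch may be smaller
iterations_per_epoch = math.ceil(n_samples / batch_size)
total_iterations = iterations_per_epoch * epochs
last_batch_size = n_samples - (iterations_per_epoch - 1) * batch_size

print(iterations_per_epoch, total_iterations, last_batch_size)
# prints: 10 50 784
```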


In [21]:
model.fit(X_train, Y_train, batch_size=1024, epochs=5)
#the model will train very fast

Instructions for updating:
Use tf.cast instead.
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x124a0c8d0>

#### Model Prediction

In [22]:
X_strtest = new.ingredients_str
X_dtmtest = vect.transform(X_strtest)
X_test    = X_dtmtest.todense()

In [23]:
# check the predicted results
Y_testpred = model.predict(X_test)

print(Y_testpred[10])
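# note: Y_train[10] belongs to the training set, not to test sample 10;
# it is printed only to compare the one-hot format with the predicted probabilities
print(Y_train[10])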
print(decode(Y_testpred[10]))

[9.36937577e-05 1.25375800e-05 4.15923190e-04 2.24857722e-05
 2.43239101e-05 1.69380556e-03 4.42192395e-04 1.47851842e-05
 1.63494533e-05 9.94349658e-01 8.11868267e-06 1.00494626e-05
 1.80442366e-05 1.01241772e-03 3.20408508e-05 1.41383325e-05
 1.40296924e-03 2.85323535e-04 1.17317082e-04 1.39473213e-05]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
italian


In [24]:
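# another way: get the predicted class index directly instead of the probabilities
# (predict_classes was removed in TensorFlow 2.6; np.argmax(model.predict(X_test), axis=1) is the equivalent)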
Y_testlabel = model.predict_classes(X_test)

# check the predicted results
print(Y_testlabel[10])
print(le.inverse_transform([Y_testlabel[10]])[0]) 

9
italian
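
Going from predicted probabilities to class labels is just an argmax over each row. The sketch below shows this on a small made-up probability matrix; `probs` and `class_names` are hypothetical stand-ins for `Y_testpred` and `le.classes_`.

```python
import numpy as np

# Hypothetical softmax outputs for three samples and four classes
probs = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.60, 0.20, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.80],
])
class_names = np.array(['chinese', 'indian', 'italian', 'thai'])  # hypothetical

pred_idx = np.argmax(probs, axis=1)   # index of the most probable class per row
pred_labels = class_names[pred_idx]   # map indices back to names
confidence = probs.max(axis=1)        # probability assigned to the chosen class

print(pred_idx)
print(pred_labels)
print(confidence)
```

Applying the argmax to the whole matrix at once avoids calling a decode function row by row.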


In [25]:
new_pred_label = [decode(vec) for vec in Y_testpred]
# create a submission file
pd.DataFrame({'id':new.id, 'cuisine':new_pred_label}).set_index('id').to_csv('sub5.csv')

#### Model save and re-load

>The model has a `save` method, which saves all the details necessary to reconstitute the model. You can check the following example:

https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model

In [26]:
from keras.models import load_model

model.save('my_model.h5')  # creates a HDF5 file 'my_model.h5'
del model  # deletes the existing model

# returns a compiled model
# identical to the previous one
model = load_model('my_model.h5')
model.predict_classes(X_test[0])

array([1])

## Part 3: How to add more layers in Keras

The simplicity is pretty impressive, right? :) Keras is based on object-oriented design principles. Its author, `François Chollet`, said:
<pre>Another important decision was to use an <b>object-oriented design</b>. Deep learning models can be understood as <b>chains</b> of functions, thus making a functional approach look potentially interesting. However, these functions are heavily parameterized, mostly by their weight tensors, and manipulating these parameters in a functional way would just be impractical. So in Keras, <b>everything is an object</b>: layers, models, optimizers, etc. All parameters of a model can be accessed as object properties: e.g. `model.layers[3].output` is the output tensor of the 3rd layer in the model, `model.layers[3].weights` is the list of symbolic weight tensors of the layer, and so on.</pre>

Now let's understand:
<pre>The core data structure of Keras is a <b>model</b>, a way to organize layers. The main type of model is the <b>Sequential</b> model, a linear stack of layers. Sequential() can return a kind of container which can call the add function to add layers. </pre>


<img src="imgs/MLP.png" width="45%">

**Q:** _How hard can it be to build a Multi-Layer Fully-Connected Network with keras?_

**A:** _It is basically the same, just add more layers!_

In [27]:
model = Sequential()
# add layers to model
model.add(Dense(400, input_shape=(dims,), activation='relu'))  # input_shape is needed only for the first layer
model.add(Dense(600, activation='relu'))  # second hidden layer; its input size is inferred
model.add(Dense(nb_classes, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_3 (Dense)              (None, 400)               2500400   
_________________________________________________________________
dense_4 (Dense)              (None, 600)               240600    
_________________________________________________________________
dense_5 (Dense)              (None, 20)                12020     
Total params: 2,753,020
Trainable params: 2,753,020
Non-trainable params: 0
_________________________________________________________________
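
The parameter counts in the summary can be checked by hand: a `Dense` layer has one weight per input-output pair plus one bias per output unit. The helper name `dense_params` below is made up for this check.

```python
# input dims, two hidden layers, output classes (sizes taken from the notebook)
layer_sizes = [6250, 400, 600, 20]

def dense_params(n_in, n_out):
    # one weight per input-output pair, plus one bias per output unit
    return n_in * n_out + n_out

params = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(params)

print(params, total)
# prints: [2500400, 240600, 12020] 2753020
```

Almost all of the parameters sit in the first layer, because the bag-of-words input is so wide.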


In [28]:
model.fit(X_train, Y_train, batch_size=1024, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1ab7527048>

---

#### Theoretical Motivations for depth

>Much has been studied about the depth of neural nets. It has been shown mathematically [1] and empirically that convolutional neural networks benefit from depth!

[1] - On the Expressive Power of Deep Learning: A Tensor Analysis - Cohen, et al 2015

#### Theoretical Motivations for depth

One much-quoted theorem about neural networks states that:

>The universal approximation theorem states [1] that a feed-forward network with a single hidden layer containing a finite number of neurons (i.e., a multilayer perceptron) can approximate continuous functions on compact subsets of $\mathbb{R}^n$, under mild assumptions on the activation function. The theorem thus states that simple neural networks can represent a wide variety of interesting functions when given appropriate parameters; however, it does not touch upon the algorithmic learnability of those parameters.

[1] - Approximation Capabilities of Multilayer Feedforward Networks - Kurt Hornik 1991
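
In symbols, one common way to write the single-hidden-layer statement is sketched below. The notation is an assumption, not taken from the cited paper: $\sigma$ is the activation function, $N$ the number of hidden neurons, $K \subset \mathbb{R}^n$ a compact set, and $v_i$, $w_i$, $b_i$ the output weights, input weights and biases. For every continuous $f$ on $K$ and every $\varepsilon > 0$, there exist $N$ and parameters such that

$$
F(x) = \sum_{i=1}^{N} v_i \, \sigma\!\left(w_i^{\top} x + b_i\right),
\qquad
\sup_{x \in K} \left| F(x) - f(x) \right| < \varepsilon ,
$$

provided $\sigma$ satisfies the mild assumptions the theorem requires. Note that $N$ may need to be very large, and the statement says nothing about how to find the parameters by training.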