# A beginner's guide to Keras Functional API

I am sure many of you agree that the Sequential model API in TensorFlow Keras is a great tool. Sequential models are handy, but recently I have been playing with the Keras Functional API and found it even better.

***The Keras Functional API*** is a way to create models that are more flexible than the Sequential API. The Functional API can handle models with non-linear topology, shared layers, and even multiple inputs or outputs.

A Sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor. But what if your model has multiple inputs? You might even need to design different models for different types of inputs. It turns out that Sequential models are not a smart choice in such cases.
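As a rough illustration of what "multiple inputs" looks like in code, here is a minimal sketch of a two-input model built with the Functional API. The input sizes (10 and 3), the layer widths, and names such as `toy_model` are arbitrary placeholders chosen for this sketch; they are not part of the wine example that follows.

```python
import numpy as np
from tensorflow import keras

layers = keras.layers

# Two independent inputs (sizes are arbitrary, for illustration only)
text_in = layers.Input(shape=(10,), name="text_features")
cat_in = layers.Input(shape=(3,), name="category_features")

# Merge both inputs, then stack ordinary layers on the merged tensor
merged = layers.concatenate([text_in, cat_in])
hidden = layers.Dense(8, activation="relu")(merged)
output = layers.Dense(1)(hidden)

toy_model = keras.Model(inputs=[text_in, cat_in], outputs=output)

# One forward pass on dummy data: 4 samples in, 4 predictions out
dummy_pred = toy_model.predict([np.zeros((4, 10)), np.zeros((4, 3))], verbose=0)
print(dummy_pred.shape)
```

A Sequential model cannot express this graph, because the `concatenate` step needs two incoming tensors.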

Take, for example, the Wine Reviews dataset from Kaggle. The original problem posted on kaggle.com was to build a model that identifies the variety of a wine from parameters such as description, price, region, and country of origin.

You can download the wine review dataset from here.

As you can see from this dataset, we have multiple inputs for this data. Also, the relationship between the input and output is not always clear. For example, the relationship between the parameters *country of origin* and the *variety* of wine is quite simple but the relationship between the parameters *description* and the *variety* of wine is not that simple. 

You may need to build a separate model to use the relationship between parameters *description* and *variety* of wine and a separate model for the relationship between parameters *country of origin* and *variety* of wine.

But what if you could write a model that combines both models? The Keras Functional API helps us achieve that goal.
For the problem above, it seems that the best results can be obtained by using ***Wide and Deep Networks***. There are tons of resources available on this subject, and the best one I know of is [here](https://ai.googleblog.com/2016/06/wide-deep-learning-better-together-with.html).

## Wide and Deep Learning Networks

If you have read the [link](https://ai.googleblog.com/2016/06/wide-deep-learning-better-together-with.html) I posted above, you now know how *Wide and Deep Learning* works. But why is it a better choice for this problem?

It turns out that it's useful for generic large-scale regression and classification problems with sparse inputs (categorical features with a large number of possible feature values). Don't worry if this long sentence does not sink in right away. We will see it in action in the code. What I am trying to say is that when a categorical feature has a large number of possible values, in this case the variety of wine (there are 619 varieties in our dataset), and we are doing large-scale regression (you will see in a while why that is the case), *Wide & Deep Learning* is the weapon of choice.

## Keras Functional API comes to the rescue

We now agree that the best way to approach this problem is to use two models with two different types of inputs. Keras Sequential models cannot give you that flexibility. That's where the Keras Functional API comes to the rescue.

Let's see that in action. Let's rephrase our problem statement to simplify.

**Let's say we want to predict the price of the wine based on the features - *description* and the *variety*.**

So our inputs are 
 - **Description:**   
 This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.
 - **Variety:**   
 Cabernet Sauvignon
 
and our prediction should be 
 - **Price:**   
  235.0
     

I am going to walk you through the code. You can find the complete code in my GitHub repository; I will post the link in the comment section.

First, let's start by importing necessary libraries.

In [1]:
# import necessary libraries

import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn.preprocessing import LabelEncoder
from tensorflow import keras
layers = keras.layers

### Obtain data and preprocess the inputs

I have already downloaded the dataset from [here](https://www.kaggle.com/zynicide/wine-reviews?select=winemag-data_first150k.csv) and stored it on my local file system.

In [2]:
data = pd.read_csv("./datasets/winemag-data_first150k.csv")
print(data.shape)

(150930, 11)


As you can see, we have close to 150k records and 11 columns. We may not be interested in all of them. Let's look at the data.

In [3]:
data.head()

| Unnamed: 0.1 | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | variety | winery |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | US | This tremendous 100% varietal wine hails from ... | Martha's Vineyard | 96 | 235.0 | California | Napa Valley | Napa | Cabernet Sauvignon | Heitz |
| 1 | 1 | Spain | Ripe aromas of fig, blackberry and cassis are ... | Carodorum Selección Especial Reserva | 96 | 110.0 | Northern Spain | Toro |  | Tinta de Toro | Bodega Carmen Rodríguez |
| 2 | 2 | US | Mac Watson honors the memory of a wine once ma... | Special Selected Late Harvest | 96 | 90.0 | California | Knights Valley | Sonoma | Sauvignon Blanc | Macauley |
| 3 | 3 | US | This spent 20 months in 30% new French oak, an... | Reserve | 96 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Pinot Noir | Ponzi |
| 4 | 4 | France | This is the top wine from La Bégude, named aft... | La Brûlade | 95 | 66.0 | Provence | Bandol |  | Provence red blend | Domaine de la Bégude |


We are interested in only a few features here: the description, the variety, and the price. As this is a supervised learning problem, we will use the price as the output label for our training and test datasets.

I also noticed we have some null values. Also if you look at the Kaggle dataset description, you will find there are some invalid values in the data. We can remove the records that have null values.

### Data Preprocessing ###

In [4]:
data = data[pd.notnull(data['price'])]  # obviously, we want only records that have a price
data = data[pd.notnull(data['variety'])]

# We don't need to check the description column for null values: according to the Kaggle dataset page, every record has a description.
# Let's print the shape of the data

data.shape

(137235, 11)

As you can see we removed quite a few records from our dataset. We removed close to 14k rows that are inconsistent with the inputs we need.
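The same filtering can also be written with `DataFrame.dropna`. The sketch below uses a tiny made-up frame (`toy`, with invented values) purely to show the behaviour; it is an alternative spelling, not the code used in this walkthrough.

```python
import pandas as pd

# Made-up rows: one is missing a variety, one is missing a price
toy = pd.DataFrame({
    "description": ["fruity", "oaky", "crisp", "earthy"],
    "variety": ["Pinot Noir", None, "Riesling", "Malbec"],
    "price": [20.0, 35.0, None, 15.0],
})

# Keep only rows where both price and variety are present
clean = toy.dropna(subset=["price", "variety"])
print(clean.shape)  # (2, 3)
```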


### Split the data into train and test sets ###

Let's split the data into a training set and a test set. The code below uses an 80:20 train:test split. You can modify this hyperparameter: try different split ratios and see whether they improve the model.

In [5]:
train_size = int(len(data) * .8)

# It's always a good idea to shuffle the data before the split. You can also improve this code by following best practices for the split

data = data.sample(frac=1)

# Train features and labels
description_train = data['description'][:train_size]
variety_train = data['variety'][:train_size]
labels_train = data['price'][:train_size]

# Test features and labels
description_test = data['description'][train_size:]
variety_test = data['variety'][train_size:]
labels_test = data['price'][train_size:]

print("Train features and labels")
print("description_train", description_train.shape)
print("variety_train", variety_train.shape)
print("labels_train", labels_train.shape)

print("Test features and labels")
print("description_test", description_test.shape)
print("variety_test", variety_test.shape)
print("labels_test", labels_test.shape)

Train features and labels
description_train (109788,)
variety_train (109788,)
labels_train (109788,)
Test features and labels
description_test (27447,)
variety_test (27447,)
labels_test (27447,)


   
***Great***, we now have 109,788 records for training and 27,447 records for testing.
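If you prefer a ready-made helper for the shuffle-and-split step, scikit-learn's `train_test_split` does both in one call and keeps features and labels aligned. This is a sketch on made-up lists (`toy_descriptions`, `toy_prices`), with an arbitrary `random_state`; it is not the split used for the results in this post.

```python
from sklearn.model_selection import train_test_split

# Made-up data: 10 descriptions with matching prices
toy_descriptions = [f"description {i}" for i in range(10)]
toy_prices = [float(10 + i) for i in range(10)]

# Shuffle and split 80:20; features and labels stay aligned
desc_tr, desc_te, price_tr, price_te = train_test_split(
    toy_descriptions, toy_prices, test_size=0.2, shuffle=True, random_state=42
)
print(len(desc_tr), len(desc_te))  # 8 2
```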

### Preparing input features for our model

In [6]:
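# Let's check a sample of our features and the label.
# Note: after the shuffle, [0] looks a row up by its index label (the row originally
# labelled 0), not by position, and raises a KeyError if that row is in the test split.
# Use .iloc[0] to get the first row of the shuffled training split.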

print("description:\n", description_train[0])
print("\nvariety:\n", variety_train[0])
print("\nprice:\n", labels_train[0])

description:
 This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.

variety:
 Cabernet Sauvignon

price:
 235.0


### Feature 1: Variety

As you can see, the variety is not a complicated feature. It's just a class label. A common practice is to convert each variety to an integer (here with scikit-learn's `LabelEncoder`) and then to a one-hot vector (with the Keras utility `to_categorical`). Remember our earlier discussion on why a *Wide & Deep network* is useful? The one-hot vector representing the variety of each record is 619 elements long, and it is sparse.

It is 619 elements long because there are 619 distinct varieties in our dataset. It is *sparse* because, for each record, every element is 0 except the one corresponding to that record's variety. And as you might have read in the link on *Wide & Deep Networks*, such networks are useful for problems with sparse inputs, so this representation is a good fit.

So let's go ahead and prepare the one-hot vectors for the classes.

In [7]:
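encoder = LabelEncoder()
# Fit on the full variety column so that varieties appearing only in the test split are known too
encoder.fit(data['variety'])
variety_train_enc = encoder.transform(variety_train)
variety_test_enc = encoder.transform(variety_test)
# Count the classes from the encoder, not from the largest label seen in the training split
num_classes = len(encoder.classes_)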

# Convert labels to one hot vectors
variety_train_onehot = keras.utils.to_categorical(variety_train_enc, num_classes)
variety_test_onehot = keras.utils.to_categorical(variety_test_enc, num_classes)
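To make the two encoding steps concrete, here is a small NumPy-only sketch on a made-up list of four wines. It mimics the idea of integer encoding followed by one-hot encoding; the names `toy_varieties`, `index_of`, and `onehot` are invented for this illustration.

```python
import numpy as np

toy_varieties = ["Malbec", "Pinot Noir", "Malbec", "Riesling"]

# Integer-encode: sorted unique names get indices 0..n-1
classes = sorted(set(toy_varieties))
index_of = {name: i for i, name in enumerate(classes)}
encoded = np.array([index_of[name] for name in toy_varieties])

# One-hot: row i of the identity matrix is the vector for class i
onehot = np.eye(len(classes))[encoded]
print(encoded)
print(onehot)
```

Each row of `onehot` contains a single 1, which is exactly the sparsity discussed above, only with 3 classes instead of 619.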

### Feature 2: Wine description

For complicated features like the *description*, which is free text, a simple and widely used representation is the Bag-of-Words model (more information on that [here](https://en.wikipedia.org/wiki/Bag-of-words_model)).

A quick recap: a *bag of words* model looks for the presence of words in each input to our model. You can think of each input as a bag of Scrabble tiles, where each tile contains a word instead of a letter. The model doesn’t take into account the order of words in a description, just the presence or absence of a word.
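The idea fits in a few lines of plain Python. This is a simplified sketch with a made-up four-word vocabulary and a hypothetical helper `bag_of_words`; the Keras `Tokenizer` used below additionally strips punctuation and builds the vocabulary for you.

```python
def bag_of_words(text, vocabulary):
    """Return a 0/1 list marking which vocabulary words occur in text."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

vocabulary = ["cherry", "oak", "tannins", "citrus"]
a = bag_of_words("juicy cherry fruit and fine tannins", vocabulary)
b = bag_of_words("fine tannins and juicy cherry fruit", vocabulary)
print(a)       # [1, 0, 1, 0]
print(a == b)  # True: word order does not matter
```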

We will start by preparing a vocabulary, or bag of words, from our dataset. We will limit the size of the vocabulary to 15,000 words. This parameter is ***tunable***. I chose 15,000 because I assume wine descriptions draw on more or less the same limited set of words, so a vocabulary of that size should be large enough. We take the descriptions from the training records, split them into words, and keep the most frequent ones as our *Bag of Words*. The Keras `Tokenizer` makes this easy.

We can feed this representation of the *description* feature to our *Wide Model*, because each description becomes a 15,000-element vector of 1s and 0s indicating which vocabulary words are present in it.

In [8]:
voc_size = 15000
tokenizer = keras.preprocessing.text.Tokenizer(num_words=voc_size, char_level=False)
tokenizer.fit_on_texts(description_train)

#convert each description to a bag of words vector

desc_bow_train = tokenizer.texts_to_matrix(description_train)
desc_bow_test = tokenizer.texts_to_matrix(description_test)

In [9]:
#Let's check the sample

print(description_train[0])
print(desc_bow_train[0])

This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.
[0. 1. 1. ... 0. 0. 0.]


As you can see above, the description is plain text, while each row of `desc_bow_train` is a vector of 0s and 1s. Element *i* of the vector is 1 if the word with index *i* in our vocabulary appears in the description and 0 otherwise. One caveat: because we shuffled the data, `description_train[0]` looks the record up by its index label, whereas `desc_bow_train[0]` is the first row of the shuffled training split, so the two printed items may belong to different records. Use `description_train.iloc[0]` to compare like with like.

Once you convert words into 0s and 1s, the sentence loses all grammatical context. **We may not be able to get good results if we use this only with the Wide Model**. That's why we need the Deep Model as well. As I said above, the *Description* is a complex feature. So while I am using the simpler version of it with the *Wide Network*, I want to use the *Deep Network* to figure out the complex relationship between the *Description* and the *Output* without losing the context.

A good way to use the *Description* feature in the Deep Network is ***Word Embedding***. You can learn more about word embedding [here](https://en.wikipedia.org/wiki/Word_embedding), but the short version is that it maps each word to a vector so that similar words are closer together in vector space.
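Mechanically, an embedding layer is a lookup table with one row per vocabulary index. The NumPy sketch below uses random numbers in place of learned weights, and the sizes (6 words, 3 dimensions) are arbitrary; it only illustrates the lookup, not how the vectors are trained.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_embed_dim = 6, 3

# One row per word index; in a real model these values are learned during training
embedding_table = rng.normal(size=(toy_vocab_size, toy_embed_dim))

# A "sentence" given as word indices; index 1 appears twice
word_ids = np.array([4, 1, 1, 0])
vectors = embedding_table[word_ids]
print(vectors.shape)  # (4, 3)
```

The same word index always maps to the same vector, and the order of the words is preserved in the rows of `vectors`, which is what the bag-of-words representation throws away.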

We will start by converting each description to a vector of integers corresponding to each word in our bag of words.


In [10]:
train_embed = tokenizer.texts_to_sequences(description_train)
test_embed = tokenizer.texts_to_sequences(description_test)

Now let's see what this has done. Let's print the first description in words and its sequence form.

In [11]:
print(description_train[0])
print("\n",train_embed[0])

This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.

 [19, 4, 882, 13, 452, 1, 120, 302, 2, 47, 2, 104, 24, 7, 159, 324, 1159, 14, 1240, 155, 27, 20, 545, 27, 80, 360, 1, 163, 405, 501, 1932, 23]


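As you can see above, `texts_to_sequences` replaces each word with the index of that word in our vocabulary. Note that `train_embed[0]` is the first row of the shuffled training split, while `description_train[0]` looks the record up by its index label, so the two printed items may not belong to the same record.

One issue with this transformation is that the descriptions have different lengths. We need to make sure all input sequences have the same length. `pad_sequences` pads the shorter sequences with 0s (at the front, by default) and truncates the longer ones. Let's say the maximum length of the *description* we want to allow is 200.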

In [12]:
seq_length = 200
train_embed = keras.preprocessing.sequence.pad_sequences(train_embed, maxlen=seq_length)
test_embed = keras.preprocessing.sequence.pad_sequences(test_embed, maxlen=seq_length)
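By default `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`). The plain-Python helper below, `pad_pre`, is a hypothetical stand-in written only to show that behaviour on two short lists.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

short = pad_pre([19, 4, 882], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 19, 4, 882]
print(long)   # [3, 4, 5, 6, 7]
```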

## Build models using Functional API ##

Now that the features are prepared, it is time to build the models using the Functional API. As discussed earlier, the benefit of the Functional API is that we can build separate models for different inputs and then combine them. Here the *Description* and *Variety* are two separate inputs, and we can still pass both of them to a single *Wide Model*.

### Model 1: Wide Model

We will define two input layers, one for *description* and one for *variety*, and merge them.

The first input to our *Wide Model* is the bag-of-words version of the *description* feature. Recall that each description is a vector of length 15,000 whose elements are 0 or 1.

The second input is the *variety*, a one-hot vector of size `num_classes` as defined above, which in our case is 619.

We will define these two inputs, concatenate them, pass them through a dense hidden layer, and finish with a dense output layer that predicts the price of the wine.


In [13]:
desc_inputs = layers.Input(shape=(voc_size,))
var_inputs = layers.Input(shape=(num_classes,))
merged_layer = layers.concatenate([desc_inputs, var_inputs])
merged_layer = layers.Dense(256, activation='relu')(merged_layer)
predictions = layers.Dense(1)(merged_layer)
wide_model = keras.Model(inputs=[desc_inputs, var_inputs], outputs=predictions)

# compile the model

wide_model.compile(loss='mse', optimizer='adam', metrics=['mse'])
wide_model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 15000)]      0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 619)]        0                                            
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 15619)        0           input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
dense (Dense)                   (None, 256)          3998720     concatenate[0][0]            

**Some explanation of the summary of Wide Model**

As we discussed, the first input is a vector of 15,000 elements of 0s and 1s. That is the description feature. It appears in the first line of the summary as an `InputLayer` with shape `(None, 15000)`.

The second input is a one-hot vector of *varieties*. There are 619 varieties. So to describe a variety, we use one hot vector of length 619, with 618 elements 0 and one element as 1 at the index of that variety of the wine. 
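The shapes and the parameter count in the summary can be checked by hand. The short calculation below reuses the numbers from this post (15,000 vocabulary words, 619 varieties, 256 hidden units) and the usual rule that a dense layer has one weight per input-unit pair plus one bias per unit.

```python
voc_size = 15000
num_classes = 619
hidden_units = 256

# Concatenating the two inputs gives one long vector
concat_width = voc_size + num_classes

# Dense layer parameters: weights (inputs x units) plus one bias per unit
dense_params = concat_width * hidden_units + hidden_units
print(concat_width, dense_params)  # 15619 3998720
```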

We are not training the model yet. We want to build another model, the *Deep Network Model*, merge it with this *Wide Model*, and then train the hybrid model.

So with this model ready, let's build the *Deep Network Model*

### Model 2: Deep Model

Just like we did with the Wide Model, here too we need to define the layers and add them to our model.

As we are using **word embeddings** for the *Deep Network*, we need an **embedding layer**. One option is to use pre-trained embeddings, which you need to download and load into the layer.

The second option is to learn the embeddings during training. I am using the second option because this is a non-production problem and we can easily train our network to learn them from our vocabulary.



In [14]:
# first we define our input layer
deep_inputs = layers.Input(shape=(seq_length,))

# We feed the inputs to the embedding layer; this is where the embeddings are learned.
# The output of the Embedding layer is a 3D tensor with shape
# [batch size, sequence length (200), embedding dimension (8 in this case)]
embedding = layers.Embedding(voc_size, 8, input_length=seq_length)(deep_inputs)

#Now we need to flatten it
embedding = layers.Flatten()(embedding)

#Let's build the model with these layers

_out = layers.Dense(1, activation='linear')(embedding)

deep_model = keras.Model(inputs=deep_inputs, outputs=_out)
# Accuracy is not meaningful for a regression target such as price, so we track MAE instead
deep_model.compile(loss='mse', optimizer='adam', metrics=['mae'])
print(deep_model.summary())

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 200)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 200, 8)            120000    
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 1601      
Total params: 121,601
Trainable params: 121,601
Non-trainable params: 0
_________________________________________________________________
None


**Some explanation of the summary of the Deep Model.**

The input is a vector of 200 integers, each the index of a word in the vocabulary. That is the description feature. It appears in the first line of the summary as an `InputLayer` with shape `(None, 200)`.

The second layer is the embedding layer. As mentioned above, its output is a 3D tensor with shape `(None, 200, 8)`: the batch size (`None`), the sequence length of 200, and the embedding dimension of 8.

In the flatten layer, we take the `200 x 8` output and flatten it into a vector of length `200 x 8 = 1600`.
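The parameter counts in the Deep Model summary follow from the same kind of arithmetic. This check uses only the numbers already shown above (vocabulary of 15,000, embedding dimension 8, sequence length 200, one output unit).

```python
voc_size, embed_dim, seq_length = 15000, 8, 200

# Embedding: one vector of size embed_dim per vocabulary index
embedding_params = voc_size * embed_dim

# Flatten has no parameters; it only reshapes (200, 8) into 1600 values
flattened = seq_length * embed_dim

# Final dense layer with a single output: 1600 weights plus 1 bias
dense_params = flattened * 1 + 1

total_params = embedding_params + dense_params
print(embedding_params, flattened, dense_params, total_params)  # 120000 1600 1601 121601
```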

## Putting Wide and Deep Networks together

Now it is time for us to put both networks together and train them for our data.

We will do this by creating a layer that concatenates the outputs from each model. Then, we merge them into a fully connected dense layer. And finally, define a combined model that combines the input and output from each one.



In [17]:
merged_out = layers.concatenate([wide_model.output, deep_model.output])
merged_out = layers.Dense(1)(merged_out)
combined_model = keras.Model(wide_model.input + [deep_model.input], merged_out)
print(combined_model.summary())

combined_model.compile(loss='mse', optimizer='adam', metrics=['mae'])

Model: "model_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 15000)]      0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 619)]        0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            [(None, 200)]        0                                            
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 15619)        0           input_1[0][0]                    
                                                                 input_2[0][0]              

## Moment of truth

It’s time to run training and evaluation. It will tell us how our model is doing with previously unseen data.

In [18]:
#Training - this might take a while
combined_model.fit([desc_bow_train, variety_train_onehot] + [train_embed], labels_train, epochs=20, batch_size=128)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f51afb83100>

With each epoch, the training loss (MSE) goes down, which is always a good sign that we are doing something right.

Next, let's evaluate our model on previously unseen test data.

In [19]:
combined_model.evaluate([desc_bow_test, variety_test_onehot] + [test_embed], labels_test, batch_size=128)



[883.3397216796875, 9.229154586791992]

The two numbers are the test loss (MSE) and the test MAE. Our test MSE is much higher than our training MSE, which is not a good sign, but I am going to leave it up to the reader to optimize the code.
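One way to read these two numbers together is to take the square root of the MSE, which puts it back in dollars. The sketch below only does arithmetic on the evaluation output shown above; the interpretation in the comments is a plausible reading, not something verified on the data.

```python
import math

test_mse = 883.3397216796875
test_mae = 9.229154586791992

# RMSE is in the same unit as the price (dollars)
test_rmse = math.sqrt(test_mse)
print(round(test_rmse, 2))  # 29.72

# RMSE is never smaller than MAE; a large gap like this one usually means
# a small number of very large errors (for example, very expensive wines)
print(round(test_rmse / test_mae, 1))
```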

Finally, we have trained our model to take a description and a variety and predict the price. Let's see how it is doing. With the code below, we run the model on the entire test set, which it has not seen before, and look at what it predicts.

In [20]:
predictions = combined_model.predict([desc_bow_test, variety_test_onehot] + [test_embed])

In [21]:
num_predictions = 5
diff = 0

for i in range(num_predictions):
    val = predictions[i]
    print(description_test.iloc[i])
    print('Predicted: ', val[0], 'Actual: ', labels_test.iloc[i])
    diff += abs(val[0] - labels_test.iloc[i])
    print('Difference: ', abs(val[0] - labels_test.iloc[i]), '\n')

# Print the average once, after the loop, so that it covers all sampled predictions
print('Average prediction difference: ', diff / num_predictions)

This bold, spicy wine will appeal to those who love modern reds with thick concentration and lingering oak-driven flavors. Black cherry, prune and ripe blueberry do come into view, but the protagonist here is the oak. The wine has chewy tannins, good length on the finish and would pair with grilled meats or aged cheeses.
Predicted:  50.215633 Actual:  49.0
Difference:  1.2156333923339844 

A musky perfume note is sultry and exotic in this rich, textured Pinot Grigio, which was fermented on its skins and matured in oak. The fruity palate offers penetrating dried-pear, nut oil and apple flavors, which are layered with citrusy acids and soft, tea-leaf-like tannins that linger on the finish.
Predicted:  33.513668 Actual:  24.0
Difference:  9.513668060302734 

Fresh and healthy for Toro Malvasia, with soft apple and sweet lime flavors. The feel and approach are good, and the wine is acidic 

   
      
         
As you can see, our model is decently accurate. Barring a few examples, its price prediction is very close.

For the first example, based on the variety and description, we predicted that the price of the wine should be about 50, and the actual price is 49.

The average prediction difference for the 5 examples is approximately $4, which is not bad.
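For reference, this is how an average absolute difference is computed once all the individual differences are known. The sketch uses only the two predictions that are fully visible in the output above, so its result is not the five-sample average quoted in the text.

```python
# The two (predicted, actual) pairs fully visible in the output above
predicted = [50.215633, 33.513668]
actual = [49.0, 24.0]

# Absolute difference per wine, then the mean over all of them
abs_diffs = [abs(p - a) for p, a in zip(predicted, actual)]
mean_abs_diff = sum(abs_diffs) / len(abs_diffs)
print(mean_abs_diff)
```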

## What’s next?

I want to improve this model by applying some optimization techniques, and I also want to try a few other features such as region and country of origin.

Hope you guys enjoyed the code
