# Deep Learning for Text

In [1]:
import sys
import os

import pandas as pd

# FOLDERS
package_path = os.path.dirname(os.getcwd())
data_path = package_path + '/data/'

# LOAD DATA
input_name = 'train.json'
input_file = data_path + input_name

df = pd.read_json(input_file)

df.head()

| | cuisine | id | ingredients |
|---|---|---|---|
| 0 | greek | 10259 | [romaine lettuce, black olives, grape tomatoes... |
| 1 | southern_us | 25693 | [plain flour, ground pepper, salt, tomatoes, g... |
| 2 | filipino | 20130 | [eggs, pepper, salt, mayonaise, cooking oil, g... |
| 3 | indian | 22213 | [water, vegetable oil, wheat, salt] |
| 4 | indian | 13162 | [black pepper, shallots, cornflour, cayenne pe... |


Each entry of the `ingredients` column is a **list**, and each element of that list is an individual ingredient of the recipe. To make the tokenization process easier, we will convert each recipe into a single **string** and store it in a new dataframe column called `ingredients_str`.

In [2]:
df['ingredients_str'] = [', '.join(i).strip() for i in df['ingredients']] 

df['ingredients_str'].head()

0    romaine lettuce, black olives, grape tomatoes,...
1    plain flour, ground pepper, salt, tomatoes, gr...
2    eggs, pepper, salt, mayonaise, cooking oil, gr...
3                    water, vegetable oil, wheat, salt
4    black pepper, shallots, cornflour, cayenne pep...
Name: ingredients_str, dtype: object

Each recipe is now a **string** with its ingredients separated by commas, which should make the tokenization process easier.

## Using *Keras* for word level representations.

The `Keras` library has a class to deal with the tokenization of text documents. Next, we are going to use the `Tokenizer` class to create three types of **word level representations**:

* One-Hot Encoding
* Frequency Document
* TF-IDF

Note that for this problem the expression **word level representation** does not mean that we are going to tokenize each word separately. Instead, we are going to treat each ingredient of a recipe as a single token, even if it has more than one word. The example below clarifies this.

**Task:** Tokenize the following recipe: *ground black pepper, cold water*

A conventional word-level tokenizer would produce:

* ['ground', 'black', 'pepper', 'cold', 'water']

But for this problem we will tokenize the recipe in the following way:

* ['ground black pepper', 'cold water']
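
The difference between the two tokenizations can be sketched in plain Python. This is only an illustration of the idea and does not use the Keras `Tokenizer`; the variable names are made up for this example.

```python
recipe = "ground black pepper, cold water"

# Word-level tokenization: drop the commas and split on whitespace
word_tokens = recipe.replace(",", "").split()

# Ingredient-level tokenization: split on the comma separator only,
# so multi-word ingredients stay together as a single token
ingredient_tokens = [token.strip() for token in recipe.split(",")]

print(word_tokens)
print(ingredient_tokens)
```

Keeping multi-word ingredients together means that `black pepper` and `ground black pepper` become separate features instead of sharing the token `pepper`.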

In [3]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [4]:
# create a Tokenizer that keeps every token in the vocabulary (num_words=None)
tokenizer = Tokenizer(num_words=None, split=', ', lower=True)

# build the word index
# note: df['ingredients'] holds lists, which the Tokenizer should treat as
# already tokenized recipes, so each ingredient becomes one token
tokenizer.fit_on_texts(df['ingredients'])

# recover the word index
word_index = tokenizer.word_index

# turn each recipe into a list of integer indices
sequences = tokenizer.texts_to_sequences(df['ingredients'])

In [5]:
for key in list(word_index.keys())[:5]:
    print('word_index[%s] = %s' % (key, word_index[key]))

word_index[salt] = 1
word_index[onions] = 2
word_index[olive oil] = 3
word_index[water] = 4
word_index[garlic] = 5


As we can see above, `word_index` is a dictionary that maps each word/ingredient in the dataset to an integer index. The indices are assigned in order of frequency, so the most common ingredients get the lowest indices.
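
A frequency-ordered index of this kind can be sketched with `collections.Counter`. The toy recipes and the name `toy_word_index` are assumptions made for this illustration; the snippet mirrors the idea, not the internals of the Keras `Tokenizer`.

```python
from collections import Counter

# Three toy recipes, each already split into ingredients
recipes = [
    ["salt", "olive oil", "garlic"],
    ["salt", "water"],
    ["salt", "olive oil"],
]

# Count how often each ingredient appears across all recipes
counts = Counter(ingredient for recipe in recipes for ingredient in recipe)

# Assign indices by descending frequency; start at 1 because
# index 0 is reserved and never assigned to a token
toy_word_index = {
    ingredient: index
    for index, (ingredient, _) in enumerate(counts.most_common(), start=1)
}

print(toy_word_index)
```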

In [6]:
# directly get the representations
one_hot_data = tokenizer.texts_to_matrix(df['ingredients'], mode='binary')
freq_data = tokenizer.texts_to_matrix(df['ingredients'], mode='freq')
tfidf_data = tokenizer.texts_to_matrix(df['ingredients'], mode='tfidf')
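
To make the three `mode` values more concrete, here is a small pure-Python sketch of what each one stores for a single recipe. The TF-IDF weighting shown (`tf = 1 + log(count)`, `idf = log(1 + n_docs / (1 + doc_freq))`) is my understanding of the scheme Keras uses and should be read as an assumption; the toy data and the `vectorize` helper are hypothetical.

```python
import math

# Toy corpus: each recipe is a list of ingredient tokens
docs = [["salt", "water"], ["salt", "olive oil", "salt"]]
vocab = {"salt": 1, "water": 2, "olive oil": 3}  # column 0 stays unused

n_docs = len(docs)
# Number of recipes in which each ingredient appears
doc_freq = {ing: sum(ing in doc for doc in docs) for ing in vocab}


def vectorize(doc, mode):
    """Return one row of the document-term matrix for `doc`."""
    row = [0.0] * (len(vocab) + 1)
    for ing in set(doc):
        count = doc.count(ing)
        col = vocab[ing]
        if mode == "binary":
            row[col] = 1.0                      # present or not
        elif mode == "freq":
            row[col] = count / len(doc)         # share of the recipe
        elif mode == "tfidf":
            tf = 1 + math.log(count)
            idf = math.log(1 + n_docs / (1 + doc_freq[ing]))
            row[col] = tf * idf                 # rarer tokens weigh more
        else:
            raise ValueError("unknown mode: %s" % mode)
    return row


for mode in ("binary", "freq", "tfidf"):
    print(mode, vectorize(docs[1], mode))
```

Note that each row has one more column than there are tokens in the vocabulary, because column 0 is never used.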

In [7]:
n_samples = tfidf_data.shape[0]
n_features = tfidf_data.shape[1]

print('The training dataset with the new representation have:')
print('  - %i entries/recipes' % n_samples)
print('  - %i features/ingredients' % n_features)

The training dataset with the new representation have:
  - 39774 entries/recipes
  - 6715 features/ingredients


## Create Cross Validation Partitions

As we know from the EDA notebook of this dataset, the classes are imbalanced. So, we are going to use the `StratifiedKFold` class from the **scikit-learn** library to create folds that preserve the percentage of samples of each class.
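
The idea behind stratification can be sketched by hand: group the sample indices by class, then deal each group evenly across the folds. This is a simplified illustration with made-up labels, not the algorithm scikit-learn implements (there is no shuffling here, for instance).

```python
from collections import Counter, defaultdict

# Toy imbalanced labels: twice as many "italian" as "greek" recipes
labels = ["italian"] * 10 + ["greek"] * 5
k = 5

# Group sample indices by class
indices_by_class = defaultdict(list)
for index, label in enumerate(labels):
    indices_by_class[label].append(index)

# Deal the indices of each class round-robin into the k validation folds
val_folds = [[] for _ in range(k)]
for indices in indices_by_class.values():
    for position, index in enumerate(indices):
        val_folds[position % k].append(index)

# Every fold keeps the 2:1 class ratio of the full dataset
fold_counts = [Counter(labels[i] for i in fold) for fold in val_folds]
for fold_number, counts in enumerate(fold_counts, start=1):
    print("Fold #%i: %s" % (fold_number, dict(counts)))
```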

In [8]:
# construct the target vector
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# categorical target (one-hot encoded)
lb = LabelBinarizer()
target_cat = lb.fit_transform(df['cuisine']) 

# integer target, used by the StratifiedKFold class
# so that each fold preserves the class proportions
le = LabelEncoder()
target = le.fit_transform(df['cuisine']) 

n_classes = len(np.unique(target))
print('The dataset has %i unique classes.' % n_classes)

The dataset has 20 unique classes.
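
The two target encodings built above differ only in shape: one integer per sample versus one indicator column per class. A minimal pure-Python sketch with made-up labels, assuming classes are ordered alphabetically:

```python
cuisines = ["greek", "indian", "greek", "filipino"]

# Sorted unique classes define the integer codes
classes = sorted(set(cuisines))
class_to_int = {name: code for code, name in enumerate(classes)}

# Integer target: one code per sample
int_target = [class_to_int[name] for name in cuisines]

# One-hot target: one row per sample, one column per class
onehot_target = [
    [1 if code == target else 0 for code in range(len(classes))]
    for target in int_target
]

print(classes)
print(int_target)
print(onehot_target)
```

The integer form is what the fold splitter needs to stratify on, while the one-hot form matches a softmax output trained with `categorical_crossentropy`.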


In [9]:
from sklearn.model_selection import StratifiedKFold

n_splits = 5
seed = 2018
folds = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed).split(tfidf_data, target))

for i, fold in enumerate(folds):
    percentage_trn = (len(fold[0]) / n_samples) * 100
    percentage_val = (len(fold[1]) / n_samples) * 100
    print('Fold #%i has %1.0f%% events for training and %1.0f%% for validation.' % (i+1, percentage_trn, percentage_val))

Fold #1 has 80% events for training and 20% for validation.
Fold #2 has 80% events for training and 20% for validation.
Fold #3 has 80% events for training and 20% for validation.
Fold #4 has 80% events for training and 20% for validation.
Fold #5 has 80% events for training and 20% for validation.


We can see that the proportion of events in the training and validation sets is the same for every fold.

## Create a Model

Now it's time to create a Deep Neural Network model to train on our dataset.

In [10]:
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout, BatchNormalization
from keras.optimizers import Adam
import keras.backend as K

In [14]:
def load_model():
    """
    Function to create the Neural Network Model
    """
    K.clear_session()
    
    # creating the Deep Neural Net Model
    model = Sequential()

    # layer 1
    model.add(Dense(units=512, 
                    activation='relu', 
                    input_shape=(tfidf_data.shape[1], )))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))

    # layer 2
    model.add(Dense(units=128, 
                    activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))

    # output layer
    model.add(Dense(units=n_classes,
                    activation='softmax'))

    # compile model
    model.compile(loss='categorical_crossentropy', 
                  optimizer=Adam(lr=0.005), 
                  metrics=['acc'])
    
    return model
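
To get a feel for the size of this network, the parameter count can be worked out by hand. The sketch below assumes the 6715 input features and 20 classes reported earlier, that a `Dense` layer holds one weight per input/output pair plus one bias per unit, and that `BatchNormalization` keeps four values per unit. The helper names are made up for this example.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_units):
    """Scale and shift (trainable) plus moving mean and variance."""
    return 4 * n_units


n_features = 6715
n_classes = 20
layer_sizes = [n_features, 512, 128, n_classes]

# Sum over the three Dense layers: input->512, 512->128, 128->output
dense_total = sum(
    dense_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
)
# One BatchNormalization layer after each hidden Dense layer
bn_total = batchnorm_params(512) + batchnorm_params(128)

print("Dense parameters:     %i" % dense_total)
print("BatchNorm parameters: %i" % bn_total)
```

Almost all of the parameters sit in the first layer, which is a direct consequence of the wide, sparse input representation.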

In [16]:
from keras.callbacks import EarlyStopping

# add callbacks to the model
early_stop = EarlyStopping(monitor='val_loss', patience=3)

callbacks = [early_stop]
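
The effect of `patience=3` can be illustrated with a small stand-alone function: training stops once the monitored value has failed to improve for three epochs in a row. This is a simplified sketch of the patience logic with made-up loss values, not the Keras implementation.

```python
def stopping_epoch(val_losses, patience=3):
    """Return the (1-based) epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # No improvement: stop once patience is exhausted
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation losses: the best value is reached at epoch 3
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.6]
print(stopping_epoch(losses, patience=3))
```

With these values the later improvement at epoch 7 is never reached, which is the trade-off of a small patience value.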

In [None]:
cv_scores = []
cv_hist = []

# train the model
for fold, (trn_idx, val_idx) in enumerate(folds):
    print('>> Fold %i# <<' % int(fold+1))
    
    # get training and validation data folds
    X_trn = tfidf_data[trn_idx, :]
    y_trn = target_cat[trn_idx, :]
    X_val = tfidf_data[val_idx, :]
    y_val = target_cat[val_idx, :]
    
    print('  Training on %i examples.' % X_trn.shape[0])
    print('  Validating on %i examples.' % X_val.shape[0])
    
    model = load_model()
    
    hist = model.fit(X_trn, y_trn, 
                     validation_data=(X_val, y_val), 
                     batch_size=64, 
                     epochs=100, 
                     callbacks=callbacks,
                     verbose=0)
    
    
    scores = model.evaluate(X_val, y_val)
    print('  This model has %1.2f validation accuraccy.\n' % scores[1])
    
    cv_scores.append(scores)
    cv_hist.append(hist)

>> Fold 1# <<
  Training on 31812 examples.
  Validating on 7962 examples.
  This model has 0.77 validation accuraccy.

>> Fold 2# <<
  Training on 31816 examples.
  Validating on 7958 examples.
