## Overview

In this jupyter notebook, we showcase a possible product of our sub-task in building language-vision model.
- First, we create a mock data from visual model, consists of label names, x-y coordinates of the grabbing point, in the real-time scene. We also import our trained language models, its dictionary and tokenizer. 
- Then, we build a function that takes textual description, the model, and data from visual models. Through the function `max_prob`, we outputs the coordinates ranked and accompanied by their predicted probability. This is only one of the simplest way of combining the results of our language and vision model. In further research, we could further explore other possible combining methods. We send this output to the central control system where the robot arm automatically grab the piece.
- In the last part, we simulate the real-time word-by-word input (incremental feeding), which our group would actually receive from ASR system.

## Create mock coordinate data from visual model

In [1]:
mock_visual = [('blue v', {'x': 50, 'y': 63}),
 ('wooden v', {'x': 73, 'y': 56}),
 ('green w', {'x': 32, 'y': 98}),
 ('green t', {'x': 129, 'y': 69}),
 ('purple n', {'x': 58, 'y': 139}),
 ('wooden z', {'x': 40, 'y': 105}),
 ('blue z', {'x': 137, 'y': 163}),
 ('wooden w', {'x': 57, 'y': 24}),
# ('blue i', {'x': 74, 'y': 133}),
# ('yellow t', {'x': 48, 'y': 26}),
# ('wooden i', {'x': 145, 'y': 29}),
# ('yellow u', {'x': 67, 'y': 27})
              ]

# Import tokenizer and language model

In [2]:
import numpy as np
import json
from keras_preprocessing.text import tokenizer_from_json
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model

In [3]:
# import tokenizer
with open('tokenizer.json') as f:
    data = json.load(f)
    tokenizer = tokenizer_from_json(data)
    
tokenizer

<keras_preprocessing.text.Tokenizer at 0x24b4d50e0a0>

In [4]:
# get label dictionary
# Read data from file:
all_label_dict = json.load( open( "all_label_dict.json" ) )
all_label = list(all_label_dict)

In [7]:
#load model
model = load_model("blstm_aug.h5")
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 287, 100)          20000     
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 287, 100)          0         
_________________________________________________________________
bidirectional (Bidirectional (None, 256)               234496    
_________________________________________________________________
dense (Dense)                (None, 23)                5911      
Total params: 260,407
Trainable params: 260,407
Non-trainable params: 0
_________________________________________________________________
None


In [8]:
model_shape = []
for l in model.layers:
    model_shape.append(l.output_shape)
model_shape

[(None, 287, 100), (None, 287, 100), (None, 256), (None, 23)]

### Test the model

In [97]:
def text2pred(text, model):
    seq = tokenizer.texts_to_sequences(text)
    padded = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)    
    return model.predict(padded)

In [98]:
# check if model is predicting
MAX_SEQUENCE_LENGTH = model_shape[1][1] # shape[1] from embedding_layer of the model
MAX_NB_WORDS = 200

# test with new data
new_text = ['we are looking for the dark wooden piece that shape like l']
new_text = ['yellow u']
new_text = ['Wshaped green Center Above']
pred = text2pred(new_text,model)

print(np.argmax(pred))
print(all_label[np.argmax(pred)]) #the list of all label here MUST match the list of label that the model is trained on as well

5
green w


# Predicting coordinates

We input textual description, model, and visual coordinates to the function `max_prob`. The function outputs probabilities of each label, then filters those only detected in the visual scene. Then, it returns pairs of `n` most probable coordinates of the description.

In [102]:
# with neural model
def max_prob(text, model, visual_dict, n):
    
    visual_list = [i for (i,j) in visual_dict]

    #get label idx from the list
    idx_list = [all_label_dict.get(k) for k in all_label if k in visual_list]
    
    # make prediction
    pred = text2pred(text,model)

    # map proba with labels
    pred_list = list(pred.reshape(-1))
    proba_dict = {i:pred_list[idx] for idx,i in enumerate(all_label)}
    
    # get only prediction of the visible piece
    pick_from_result = {d:proba_dict.get(d) for d in visual_list}
    max_prob = sorted(pick_from_result.items(), key=lambda x:x[1], reverse=True)[:n]
    
    res_dict = {}
    
    for i in max_prob:
        for j in visual_dict:
            if i[0] == j[0]:
                res_dict.update({i[1]:j[1]})
            
    return res_dict

In [100]:
# sample texts
text = 'the green one'
text = 'give me the straight piece please'
text = 'the yellowish one'
text = 'give me the pink piece'

In [103]:
max_prob(text, model, mock_visual,3)

{0.62383866: {'x': 32, 'y': 98},
 0.58254784: {'x': 129, 'y': 69},
 0.45683467: {'x': 73, 'y': 56}}

# Incrememtal feeding

In [14]:
def feed_one(text,k):
    """
    input: one input string
    output: a list, consist of strings where 1 word is added at a time
    """
    feed_one = []
    tokenized = text.split()
    for i, t in enumerate(tokenized):
        if i+1 >= k:
            feed_one.append(' '.join(tokenized[:i+1]))
    return feed_one

In [15]:
# test 'feed_one' function
feed_one("top center Lshaped stone blue", 3)

['top center Lshaped',
 'top center Lshaped stone',
 'top center Lshaped stone blue']

In [106]:
def increm_feed_predict(model, text, top_proba, threshold):
    """
    input: text string
    output: list of dictionary of top_proba

    *the function also print the input, its probs, the prediction
    
    """
    list_of_proba = []
    
    for t in feed_one(text,3):
        x = list()
        x.append(t)
        print(t)
        
        print (max_prob([t], model, mock_visual,3))

In [107]:
# use random made-up sentence to get prediction and prob
# comment out when not need
test_text = "looking for brown piece with weird Lshape with elogated part jutted out from the longer side"
test_text = "give me the yellow one there on your right?"
test_text = "I want that something weirdly angular shape that's in pink"
test_text = 'the n as a piece consisting of two rectangles'
test_text = 'the green t shape on the top left corner'
test_text = 'I want the pink one oh no I actually want the purple one'

increm_feed_predict(model,test_text,5,0.1) # input = text, max no. of proba to output, threshold of proba

I want the
{0.99552584: {'x': 137, 'y': 163}, 0.3412792: {'x': 57, 'y': 24}, 0.2778092: {'x': 58, 'y': 139}}
I want the pink
{0.99927044: {'x': 137, 'y': 163}, 0.99729013: {'x': 73, 'y': 56}, 0.9928199: {'x': 57, 'y': 24}}
I want the pink one
{0.99952453: {'x': 137, 'y': 163}, 0.99868923: {'x': 73, 'y': 56}, 0.9970174: {'x': 57, 'y': 24}}
I want the pink one oh
{0.99952453: {'x': 137, 'y': 163}, 0.99868923: {'x': 73, 'y': 56}, 0.9970174: {'x': 57, 'y': 24}}
I want the pink one oh no
{0.99952453: {'x': 137, 'y': 163}, 0.99868923: {'x': 73, 'y': 56}, 0.9970174: {'x': 57, 'y': 24}}
I want the pink one oh no I
{0.9999925: {'x': 137, 'y': 163}, 0.99315476: {'x': 73, 'y': 56}, 0.9867716: {'x': 57, 'y': 24}}
I want the pink one oh no I actually
{0.9999925: {'x': 137, 'y': 163}, 0.99315476: {'x': 73, 'y': 56}, 0.9867716: {'x': 57, 'y': 24}}
I want the pink one oh no I actually want
{0.9999925: {'x': 137, 'y': 163}, 0.99315476: {'x': 73, 'y': 56}, 0.9867716: {'x': 57, 'y': 24}}
I want the pink 