    
## GoldenRetriever Demo

#### High level example
When given a question, GoldenRetriever will predict the most appropriate response to the question.  

    
This is done in two steps:
1. GoldenRetriever takes in a knowledge base in the form of a list of strings.
2. Given a question, it returns the `top_k` most similar responses from its knowledge base.

In [2]:
# gain access to src and data folders
import sys
sys.path.append('..')

from src.model import GoldenRetriever
import tf_sentencepiece

# 1. Init 
gr = GoldenRetriever()

# 2. Load knowledge base
gr.load_kb(text_list=['I love my chew toy!', 'I hate Mondays.'])

# 3. Given a question, it will predict top response
gr.make_query('what do you not love?', top_k=1)

Instructions for updating:
Colocations handled automatically by placer.




INFO:tensorflow:Saver not created because there are no variables in the graph to restore




Instructions for updating:
Use tf.cast instead.




model initiated!
knowledge base lock and loaded!


(['I hate Mondays.'], array([[0.20694011]], dtype=float32))

 
    
#### 1. Initializing GoldenRetriever
The model is specified with TensorFlow's Graph API:
<br>
1. GoldenRetriever initializes two encoders: one for questions and one for responses.
<br>
2. The cosine similarity between the encoded question and each encoded response is then calculated, and the candidate responses are ranked by this similarity.
<img src='img/Golden Retriever 2 Embeddings.png'>
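
The ranking idea can be illustrated without the encoders. The following is a minimal sketch that assumes the question and responses have already been turned into vectors; the three-dimensional vectors and the names `response A` / `response B` are made up for illustration and are not produced by the real model.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings (hypothetical values, not real encoder outputs)
question_vec = np.array([1.0, 0.0, 1.0])
response_vecs = {
    "response A": np.array([1.0, 0.0, 0.9]),
    "response B": np.array([0.0, 1.0, 0.1]),
}

# Score every response against the question, then sort from best to worst
scores = {name: cosine_sim(question_vec, vec) for name, vec in response_vecs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Because cosine similarity only depends on the angle between vectors, the response pointing in nearly the same direction as the question is ranked first.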

The two encoders are initialized with TensorFlow's Graph API below. 

In [3]:
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import tf_sentencepiece
from src.metric_learning import triplet_loss, contrastive_loss
from tensorflow.train import Saver
from src.utils import split_txt, read_txt, clean_txt, read_kb_csv
from sklearn.metrics.pairwise import cosine_similarity

# Set up graph.
g = tf.Graph()
with g.as_default():

    # load USE as a 'module', which is tf_hub's interfacable transfer learning component
    embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/1", trainable=True)
    
    # put variable placeholders
    question = tf.placeholder(dtype=tf.string, shape=[None]) 
    
    response = tf.placeholder(dtype=tf.string, shape=[None]) 
    response_context = tf.placeholder(dtype=tf.string, shape=[None])

    # init the embeddings
    question_embeddings = embed(dict(input=question),
                                signature="question_encoder", as_dict=True)
    response_embeddings = embed(dict(input=response,
                                     context=response_context),
                                signature="response_encoder", as_dict=True)
    
    # op that initializes the variables and lookup tables
    init_op = tf.group([tf.global_variables_initializer(), tf.tables_initializer()])
    
# Initialize session
session = tf.Session(graph=g, config=tf.ConfigProto(log_device_placement=False))
session.run(init_op)
print('model initiated!')

INFO:tensorflow:Saver not created because there are no variables in the graph to restore




model initiated!



    
#### 2. Load knowledge bases
A knowledge base counts as 'loaded' once the model has stored its encoded responses and can answer a relevant question.  
This is done by:
1. taking the list of candidate responses, `list_of_possible_responses`, and
2. passing it through the response encoder and saving the resulting array of encoded responses, `array_of_encoded_responses`. In the cell below each response also serves as its own context. The shape of this array is (number of responses, embedding size).

In [4]:
list_of_possible_responses = ['I love my chew toy!', 
                              'I hate Mondays.']

array_of_encoded_responses = session.run(response_embeddings, 
                                         feed_dict={response:list_of_possible_responses,
                                                    response_context:list_of_possible_responses
                                                   })['outputs']

print("encoded responses shape: {}".format(array_of_encoded_responses.shape))
array_of_encoded_responses

encoded responses shape: (2, 512)


array([[ 0.07049897,  0.06177055,  0.02428828, ..., -0.0170604 ,
         0.04470512, -0.0412127 ],
       [-0.04005127,  0.04781166,  0.01393019, ..., -0.02000583,
         0.04993735,  0.02181664]], dtype=float32)

  
    
#### 3. Rank the sentences and return the best answer
The question, `question_string`, is passed through the question encoder, and the resulting vector representation of the question, `encoded_ques`, is kept. 

In [5]:
question_string = ['what do you not love?']

encoded_ques= session.run(question_embeddings, 
                          feed_dict={question:question_string,
                                    })['outputs']
print("encoded question shape: {}".format(encoded_ques.shape))

encoded question shape: (1, 512)



    
Rank the appropriateness of saved responses to the question by cosine similarity.

In [6]:
similarity_score=cosine_similarity(array_of_encoded_responses, encoded_ques)
print("Similarity score:")
print(similarity_score)

print('')

print("Best answer: {}".format(list_of_possible_responses[similarity_score.argmax()]))

Similarity score:
[[0.16681367]
 [0.20694011]]

Best answer: I hate Mondays.
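
The cell above returns only the single best answer via `argmax`. To return the `top_k` responses mentioned at the start, the scores can be sorted instead. This is a minimal sketch using the two similarity scores printed above; the helper name `top_k_responses` is hypothetical and this is not necessarily how `GoldenRetriever.make_query` is implemented.

```python
import numpy as np

def top_k_responses(responses, scores, top_k=1):
    """Return the top_k responses and their scores, best first."""
    scores = np.asarray(scores).ravel()          # (n, 1) -> (n,)
    order = np.argsort(scores)[::-1][:top_k]     # indices of the highest scores
    return [responses[i] for i in order], scores[order]

responses = ['I love my chew toy!', 'I hate Mondays.']
scores = [[0.16681367], [0.20694011]]            # similarity scores from the output above

best, best_scores = top_k_responses(responses, scores, top_k=2)
print(best)
```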



    
### 4. Loading databases
Just before the data is fed into the model, it needs to be in question and answer pairs.   
This can be done either by (a) loading data that is already saved in such a format or (b) splitting long texts into question and answer pairs. 
    
#### 4a. Loading PDPA QnA
Scraped PDPA data is conveniently saved as question and answer pairs, so only simple cleaning is applied.

In [7]:
import pandas as pd
df = pd.read_csv('../data/pdpa.csv')
df['kb'] = df['meta']+df['answer']

def clean_1_txt(text):
    """Strips formatting"""
    text=text.replace('\n', '. ') # not sure how newlines are tokenized
    text=text.replace('.. ', '. ').rstrip() # remove artifact
    return text

answers_, ques_ = df['kb'].apply(clean_1_txt), df['question'].apply(clean_1_txt)
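
To see what the cleaning step does, the function can be applied to a short string. This sketch repeats `clean_1_txt` so that it runs on its own; the sample text is made up for illustration and does not come from `pdpa.csv`.

```python
def clean_1_txt(text):
    """Strips formatting"""
    text = text.replace('\n', '. ')           # turn newlines into sentence breaks
    text = text.replace('.. ', '. ').rstrip() # collapse the doubled period this creates
    return text

raw = "Consent is required.\nExceptions apply\n"   # hypothetical sample text
cleaned = clean_1_txt(raw)
print(cleaned)
```

A line that already ends in a period would otherwise become `.. ` after the first replacement, which is the artifact the second replacement removes.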


    
#### 4b. Reading AIAP QnA
The AIAP QnA text, however, requires splitting the text into question and answer pairs. 

In [8]:
def split_txt(text, qa=False):
    """
    Splits a text document into clauses separated by blank lines. 
    If qa=True, additionally treats the first line of each clause 
    as the question and the remaining lines as the answer.
    """
    
    # condition_terms collects the finished clauses
    # stringg accumulates the lines of the clause currently being read
    condition_terms = []
    stringg=''
    
    # text is a list of lines (see read_txt); a line that is just '\n'
    # is a blank line, which closes the current clause.
    # note: a clause is only saved when a blank line follows it
    for tex in text:
        if (tex=='\n'):
            if (stringg != ''):
                condition_terms.append(stringg)
                stringg=''
            else: pass
        else: stringg+=tex
          
    # now that we have the list of condition_terms
    # we may need to split each clause into a question and an answer:
    # the first line is taken as the question and the rest as the answer
    if qa:
        condition_context = [x.split('\n')[0] for x in condition_terms]
        condition_terms = ['\n'.join(x.split('\n')[1:]) for x in condition_terms]
        return condition_terms, condition_context
    else: return condition_terms

def read_txt(path):
    """Used with split_txt() to read and split kb into clauses"""
    with open(path, 'r', encoding="utf-8") as f:
        text = f.readlines()
    return text
    
text, questions = split_txt(read_txt('../data/aiap.txt'), qa=True)

print("{} questions".format(len(questions)))
print('')
print(questions[0])
print('')
print(text[0])

10 questions

Q1. WHAT SORT OF CANDIDATES ARE YOU LOOKING FOR?

We are looking for candidates who possess a keen interest in the area of machine learning and data science. We believe that candidates can come from any area of specialisation, and our requirements are as follow:
i)   Singaporean with a polytechnic diploma or university degree,
ii) Proficient in Python or R and iii) Is able to implement Machine Learning Algorithms or have a background in Mathematics / Statistics / Computer Science. 
Beyond that, demonstrated statistical fundamentals and programming ability will be helpful for the technical tests, but a keen learning attitude will be the most important to carry you through the programme. 
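
The blank-line grouping used by `split_txt` can be tried on an in-memory list of lines instead of a file. The following is a compact sketch of the same idea, not the function above: the name `split_qa` and the sample lines are hypothetical, and unlike `split_txt` it also keeps a final group that is not followed by a blank line and strips the trailing newline from each line.

```python
def split_qa(lines):
    """Group lines separated by blank lines; the first line of each group is the question."""
    groups, current = [], []
    for line in lines:
        if line.strip() == "":
            if current:                  # a blank line closes the current group
                groups.append(current)
                current = []
        else:
            current.append(line.rstrip("\n"))
    if current:                          # keep the last group even without a trailing blank line
        groups.append(current)

    questions = [g[0] for g in groups]
    answers = ["\n".join(g[1:]) for g in groups]
    return answers, questions

# Hypothetical FAQ text, as it would be returned by f.readlines()
lines = ["Q1. What is this?\n", "A demo.\n", "Second line.\n", "\n",
         "Q2. Why?\n", "Because."]
answers, questions = split_qa(lines)
print(questions)
print(answers)
```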



    
### 5. Finetune and evaluate model performances

To finetune the model, two features are added so that the model can update its parameters via a (triplet) loss function:  
1. Placeholders for negative responses and their respective contexts.
2. An optimizer. Importantly, `var_finetune` restricts the update to selected parameters.
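
For intuition, a triplet loss pushes the question embedding closer to the correct response than to the negative response by at least a margin. The sketch below shows one common formulation in NumPy, using cosine distance and a hinge. It is an assumption for illustration only: the actual `src.metric_learning.triplet_loss` may use a different distance or reduction, and the names `cosine_distance` and `triplet_loss_sketch` are hypothetical.

```python
import numpy as np

def cosine_distance(a, b):
    """Row-wise cosine distance between two (n, d) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - np.sum(a * b, axis=1)

def triplet_loss_sketch(anchor, positive, negative, margin=0.3):
    """Mean of max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

# Toy 2-D embeddings (hypothetical values)
question_vec = np.array([[1.0, 0.0]])
good_response = np.array([[1.0, 0.0]])   # same direction as the question
bad_response = np.array([[0.0, 1.0]])    # orthogonal to the question

easy = triplet_loss_sketch(question_vec, good_response, bad_response)
hard = triplet_loss_sketch(question_vec, bad_response, good_response)
print(easy, hard)
```

When the correct response is already closer than the negative by more than the margin, the loss is zero and no update is driven by that triplet.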

In [9]:
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import tf_sentencepiece
from src.metric_learning import triplet_loss, contrastive_loss
from tensorflow.train import Saver
from src.utils import split_txt, read_txt, clean_txt, read_kb_csv
from sklearn.metrics.pairwise import cosine_similarity

# Set up graph.
tf.reset_default_graph() # finetune
g = tf.Graph()
with g.as_default():

    # load USE as a 'module', which is tf_hub's interfacable transfer learning component
    embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/1", trainable=True)
    
    # put variable placeholders
    question = tf.placeholder(dtype=tf.string, shape=[None])  
    response = tf.placeholder(dtype=tf.string, shape=[None])  
    response_context = tf.placeholder(dtype=tf.string, shape=[None])


    # QnA embeddings
    question_embeddings = embed(dict(input=question),
                                signature="question_encoder", as_dict=True)

    response_embeddings = embed(dict(input=response,
                                    context=response_context),
                                signature="response_encoder", as_dict=True)
    
    # negative response placeholder and embeddings
    # triplet loss requires a negative example
    neg_response = tf.placeholder(dtype=tf.string, shape=[None])  
    neg_response_context = tf.placeholder(dtype=tf.string, shape=[None]) 
    neg_response_embeddings = embed(dict(input=neg_response,
                                        context=neg_response_context),
                                    signature="response_encoder", as_dict=True)

    # However, we may instead choose to have a contrastive loss
    # this requires a label rather than a negative example
    label = tf.placeholder(tf.int32, [None], name='label')
    
    # either triplet or contrastive loss
    cost = triplet_loss(question_embeddings['outputs'], response_embeddings['outputs'], neg_response_embeddings['outputs'], margin=0.3)

    # init operation
    init_op = tf.group([tf.global_variables_initializer(), tf.tables_initializer()])
    opt = tf.train.GradientDescentOptimizer(learning_rate=0.6)
    
    # get the weights we want to finetune.
    # the embed object has parameters and the parameters are named
    # we list the variables that we want to tune in v
    # and put them into var_list so tensorflow will only change 
    # these specific params
    v=['module/QA/Final/Response_tuning/ResidualHidden_1/AdjustDepth/projection/kernel']
    var_finetune=[x for x in embed.variables for vv in v if vv in x.name] 
    opt_op = opt.minimize(cost, var_list=var_finetune)

# Initialize session
session = tf.Session(graph=g, config=tf.ConfigProto(log_device_placement=False))
session.run(init_op)
print('model initiated!')

INFO:tensorflow:Saver not created because there are no variables in the graph to restore




model initiated!
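
The selection of `var_finetune` is a filter on variable names. The following sketch reproduces that filter on plain Python objects so it can be run without TensorFlow. `Var` is a stand-in for a TensorFlow variable, and the two `hypothetical_layer` names are invented for illustration; only the first name comes from the cell above.

```python
from collections import namedtuple

Var = namedtuple("Var", ["name"])  # stand-in for a TensorFlow variable

all_vars = [
    Var("module/QA/Final/Response_tuning/ResidualHidden_1/AdjustDepth/projection/kernel:0"),
    Var("module/hypothetical_layer/kernel:0"),  # hypothetical name
    Var("module/hypothetical_layer/bias:0"),    # hypothetical name
]

# Substrings of the variable names that should be finetuned
patterns = ["module/QA/Final/Response_tuning/ResidualHidden_1/AdjustDepth/projection/kernel"]

# Keep a variable if any pattern occurs in its name; everything else stays frozen
selected = [v for v in all_vars if any(p in v.name for p in patterns)]
frozen = [v for v in all_vars if v not in selected]
print([v.name for v in selected])
```

Using `any(...)` keeps each variable at most once, even if its name matches more than one pattern.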



#### A simple finetuning loop carries out 3 steps in each iteration:
1. Record loss
2. Run optimizer
3. Record accuracy

A simple train-test split is built for the finetuning: the first 200 shuffled pairs are used for training and the remainder for testing.

In [16]:
randomized_idx = np.random.permutation(np.arange(len(answers_)))
train_idx = randomized_idx[:200].tolist()
test_idx = randomized_idx[200:].tolist()

train_ans, train_ques = answers_[train_idx].tolist(), ques_[train_idx].tolist()
test_ans, test_ques = answers_[test_idx].tolist(), ques_[test_idx].tolist()

train_neg_ans = answers_[np.random.permutation(train_idx)].tolist()
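
In the cell above the negatives are drawn by permuting `train_idx`, and a random permutation can leave an element in place, so an answer can occasionally end up as its own negative. One possible alternative, sketched here on made-up data, is to shift the shuffled order by one position so that every question is paired with the answer of a different question. This assumes the answers are unique and there is more than one of them.

```python
import numpy as np

answers = ["answer A", "answer B", "answer C", "answer D"]  # toy data

rng = np.random.RandomState(0)
order = rng.permutation(len(answers))

positives = [answers[i] for i in order]
# Shift the order by one: position i now holds the answer that belongs to position i-1
negatives = [answers[i] for i in np.roll(order, 1)]

for pos, neg in zip(positives, negatives):
    print(pos, "|", neg)
```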

##### 1. Record loss

In [17]:
current_loss = session.run(cost, 
                           feed_dict={
                                    question:train_ques,
                                    response:train_ans,
                                    response_context:train_ans,
                                    neg_response:train_neg_ans,
                                    neg_response_context:train_neg_ans,
                                    #label:label_
                                    })
current_loss

0.026533306

    
##### 2. Update params (optimize)

In [18]:
session.run(opt_op, 
           feed_dict={
                    question:train_ques,
                    response:train_ans,
                    response_context:train_ans,
                    neg_response:train_neg_ans,
                    neg_response_context:train_neg_ans,
                    #label:label_
                    })

    
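##### 3. Record accuracy
Accuracy for the QA model is computed as follows:
1. Rank the candidate answers for each question.
2. Add 1 if the correct answer is among the top `k` results (`k` defaults to 3 in `scorer`).
3. Divide the total by the number of questions.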

In [19]:
def ranker(session, question_list, answer_list):
    """For model evaluation on InsuranceQA dataset. Returns the answers ranked per question."""
    # the candidate answers do not depend on the question, so encode them once
    doc_vectors = session.run(response_embeddings, 
                             feed_dict={response:answer_list,
                                        response_context:answer_list
                                       })['outputs']

    predictions=[]
    for question_str in question_list:
        # NOTE: the question is passed through the response encoder here;
        # question_embeddings (fed via the `question` placeholder) is the
        # encoder used for questions in section 3
        ques_vectors = session.run(response_embeddings, 
                                 feed_dict={response:[question_str],
                                            response_context:[question_str]
                                           })['outputs']
        
        # sort answers from most to least similar to the question
        cossim = cosine_similarity(doc_vectors, ques_vectors.reshape(1, -1))
        sortargs=np.flip(cossim.argsort(axis=0), axis=0)
        returnedans = [answer_list[jj[0]] for jj in sortargs]
        predictions.append(returnedans)
        
    return predictions

def scorer(predictions, gts, k=3):
    """For model evaluation on InsuranceQA dataset. Returns score@k."""
    score=0
    total=0
    for gt, prediction in zip(gts, predictions):
        # count a hit when the ground truth is among the top k predictions
        if gt in prediction[:k]:
            score+=1
        total+=1
    return score/total

# may take a while
predictionss = ranker(session, test_ques, test_ans)

# check accuracy score
acc_score = scorer( predictionss, test_ans , k=1)
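
The ranking and score@k computation can be checked on toy vectors without a TensorFlow session. The sketch below is an illustration under assumptions: the embeddings are made up, and `rank_all` and `score_at_k` are hypothetical helpers that work on answer indices rather than answer strings. It also computes all similarities in a single matrix product instead of looping over the questions.

```python
import numpy as np

def rank_all(question_vecs, answer_vecs):
    """For each question, return answer indices sorted from most to least similar."""
    q = question_vecs / np.linalg.norm(question_vecs, axis=1, keepdims=True)
    a = answer_vecs / np.linalg.norm(answer_vecs, axis=1, keepdims=True)
    sims = q @ a.T                       # (n_questions, n_answers) cosine similarities
    return np.argsort(-sims, axis=1)

def score_at_k(ranked_idx, gt_idx, k):
    """Fraction of questions whose correct answer index is among the top k."""
    hits = [bool(gt in row[:k]) for row, gt in zip(ranked_idx, gt_idx)]
    return sum(hits) / len(hits)

# Toy embeddings (hypothetical values)
answer_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
question_vecs = np.array([[0.9, 0.1], [0.6, 0.8]])
gt_idx = [0, 1]                          # index of the correct answer per question

ranked = rank_all(question_vecs, answer_vecs)
score_1 = score_at_k(ranked, gt_idx, k=1)
score_2 = score_at_k(ranked, gt_idx, k=2)
print(score_1, score_2)
```

Here the first question finds its answer at rank 1, while the second only finds it at rank 2, so the score rises from 0.5 at `k=1` to 1.0 at `k=2`.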


#### Run all 3 steps in each iteration

In [20]:
for i in range(3):
        
    # keep track of current loss
    current_loss = session.run(cost, 
                               feed_dict={  question:train_ques,
                                            response:train_ans,
                                            response_context:train_ans,
                                            neg_response:train_neg_ans,
                                            neg_response_context:train_neg_ans,
                                            #label:label_
                                            })
    print("Epoch {}".format(i))
    print("Loss : {}".format(current_loss))

    
    # update params
    session.run(opt_op, 
                feed_dict={ question:train_ques,
                            response:train_ans,
                            response_context:train_ans,
                            neg_response:train_neg_ans,
                            neg_response_context:train_neg_ans,
                            #label:label_
                            })

    # create predictions to test accuracy
    # may take a while
    predictionss = ranker(session, test_ques, test_ans)
    # check accuracy score
    for k_ in [1,2,3,4]:
        acc_score = scorer( predictionss, test_ans , k=k_)
        print("Score @{} : {}".format(k_, acc_score))
    
    print('')

Epoch 0
Loss : 0.02144646644592285
Score @1 : 0.5806451612903226
Score @2 : 0.7580645161290323
Score @3 : 0.8387096774193549
Score @4 : 0.8709677419354839

Epoch 1
Loss : 0.016312837600708008
Score @1 : 0.5806451612903226
Score @2 : 0.7580645161290323
Score @3 : 0.8387096774193549
Score @4 : 0.8709677419354839

Epoch 2
Loss : 0.011141598224639893
Score @1 : 0.5806451612903226
Score @2 : 0.7580645161290323
Score @3 : 0.8387096774193549
Score @4 : 0.8709677419354839

