# Chatbot test

## How does it work?

The chatbot uses a TSV file with two fields: question and answer.
When a user enters a question, we use GloVe vectors and the cosine similarity measure to find the corpus question closest to the user's sentence, and return the answer paired with it.

In [1]:
import numpy as np
import scipy.spatial.distance

# compare two tokenized sentences: each one is the weighted average of its GloVe word vectors
# despite its name, this returns the cosine similarity (1 - cosine distance)
# glove_vectors is assumed to be loaded beforehand (word -> vector mapping)
def cosine_distance_sentence(sent1, sent2, weights1, weights2):
  vector_1 = np.average([glove_vectors[word] for word in sent1 if word in glove_vectors], axis=0, weights=weights1)
  vector_2 = np.average([glove_vectors[word] for word in sent2 if word in glove_vectors], axis=0, weights=weights2)
  cosine = scipy.spatial.distance.cosine(vector_1, vector_2)
  return 1 - cosine
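
To make the weighted average concrete, here is a minimal sketch on toy data. The 3-dimensional vectors in `toy_vectors` are made up for illustration and stand in for real GloVe vectors; `sentence_vector` and `cosine_similarity` are hypothetical helper names, and the similarity is computed directly with NumPy, which should give the same value as `1 - scipy.spatial.distance.cosine(u, v)`.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for real GloVe vectors.
toy_vectors = {
    "how": np.array([0.1, 0.3, 0.5]),
    "are": np.array([0.2, 0.1, 0.4]),
    "you": np.array([0.7, 0.2, 0.1]),
    "plane": np.array([0.0, 0.9, 0.1]),
}

def sentence_vector(tokens, weights):
    # Keep only tokens that have an embedding; weights must line up with them.
    vectors = [toy_vectors[t] for t in tokens if t in toy_vectors]
    return np.average(vectors, axis=0, weights=weights)

def cosine_similarity(u, v):
    # Dot product divided by the product of the norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

greeting = sentence_vector(["how", "are", "you"], [1.0, 1.0, 1.0])
same = cosine_similarity(greeting, sentence_vector(["how", "are", "you"], [1.0, 1.0, 1.0]))
different = cosine_similarity(greeting, sentence_vector(["plane"], [1.0]))
print(same, different)
```

A sentence compared with itself scores (numerically) 1, while the unrelated sentence scores lower. Note that the weight list must have exactly one entry per token that has an embedding, otherwise `np.average` raises an error.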

Each word has a weight taken from the TF-IDF matrix. For example, the word 'it' is less important than the word 'plane' because 'plane' is less common in the whole corpus. The weight is also specific to each question in that corpus.

When computing the cosine similarity, we compare the user's question to one of the questions of the corpus. For the user's question, we compute the weight of each word as the average of that word's TF-IDF value over the documents in which it appears.


In [2]:
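# fit the TF-IDF vectorizer on the questions and return the weights as a DataFrame
# (one row per question, one column per vocabulary term)
# vectorizer is assumed to be an already created scikit-learn TfidfVectorizer
def get_tfidf_dense_matrix(questions):
  vectors = vectorizer.fit_transform(questions)
  # get_feature_names_out() requires scikit-learn >= 1.0 (older versions use get_feature_names())
  feature_names = vectorizer.get_feature_names_out()
  dense = vectors.todense()
  denselist = dense.tolist()
  df = pd.DataFrame(denselist, columns=feature_names)
  return df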

In [3]:
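# for each corpus question, collect the TF-IDF weight of every token that has a GloVe vector
# (tokens missing from the TF-IDF vocabulary get a default weight of 1.0)
def get_weight_questions_token(questions_tokens, dense_tfidf_matrix):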
  weight_matrix = []
  for index, tokens in enumerate(questions_tokens):
    weight_vector = []
    for token in tokens:
      if token in dense_tfidf_matrix.columns and token in glove_vectors:
        weight_vector.append(dense_tfidf_matrix[token][index])
      elif token in glove_vectors:
        weight_vector.append(1.0)
    weight_matrix.append(weight_vector)
  return weight_matrix

# compute the weight vector of a new (user) sentence: each token's weight is
# the average of its non-zero TF-IDF values in dense_tfidf_matrix
def get_weights_new_sentences(sentence_tokens, dense_tfidf_matrix):
  weight_vector = []
  for token in sentence_tokens:
    if token in dense_tfidf_matrix.columns and token in glove_vectors:
      not_null_vector_token = dense_tfidf_matrix[token][dense_tfidf_matrix[token] > 0]
      weight_vector.append(np.average(not_null_vector_token))
    elif token in glove_vectors:
      # the token has a GloVe vector but no TF-IDF column: use a default weight so
      # the weight vector has the same length as the vector list passed to np.average
      weight_vector.append(1.0)
  return weight_vector
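
The final retrieval step (done by `predict_answer`, imported from `chatbot_model` in the example below) is not shown in this notebook. The following is only a sketch of what such a step might look like, assuming each corpus question has already been reduced to one weighted sentence vector; `closest_question`, the toy vectors and the pairing of vectors to answers are hypothetical.

```python
import numpy as np

def closest_question(user_vector, corpus_vectors):
    """Return (index, similarity) of the corpus vector most similar to user_vector."""
    corpus = np.asarray(corpus_vectors, dtype=float)
    user = np.asarray(user_vector, dtype=float)
    # Cosine similarity between the user vector and every corpus vector at once.
    similarities = corpus @ user / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(user))
    best = int(np.argmax(similarities))
    return best, float(similarities[best])

# Hypothetical sentence vectors, one per corpus question, with their answers.
toy_question_vectors = [[0.9, 0.1, 0.2], [0.1, 0.8, 0.3]]
toy_answers = ["i'm doing great. what about you?", "i go to pcc."]

index, similarity = closest_question([0.8, 0.2, 0.2], toy_question_vectors)
print(toy_answers[index], similarity)
```

Taking the argmax over all corpus questions is the simplest choice; the real `predict_answer` may differ, for example by applying a minimum similarity threshold.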

## An example

In [4]:
from chatbot_model import weight_matrix
from chatbot_model import questions_tokens
from chatbot_model import dense_tfidf_matrix
from chatbot_model import predict_answer

In [6]:
question = 'how are you today?'

answer, similarity = predict_answer(question, questions_tokens, dense_tfidf_matrix, weight_matrix, verbose=True)
print("    --> {}".format(answer))
print("    (similarity is {})".format(similarity))

The question closest to the sentence provided is "how are you doing today?"
    --> i'm doing great. what about you?
    (similarity is 0.9961597526733927)


In [10]:
question = 'What school are you going to?'

answer, similarity = predict_answer(question, questions_tokens, dense_tfidf_matrix, weight_matrix, verbose=True)
print("    --> {}".format(answer))
print("    (similarity is {})".format(similarity))

The question closest to the sentence provided is "what school do you go to?"
    --> i go to pcc.
    (similarity is 0.9958053079883509)
