# Utterance generation prototype notebook
Author: Matthew Stachyra <br>
Date: 16 June 2022 <br>
Version: 0.1 - prototyping 3 approaches

## *Approach 1:* replacement with similar words
Note: This approach generates new sentences that closely follow the wording and structure of the original utterance. The generated sentences may also serve as input to the machine learning in approaches 2 and 3 below.

### Subproblems
1. generate possible synonyms
2. filter synonyms using similarity measure
3. identify which words to replace in an utterance
4. replace words in utterance one at a time and generate new set

In [201]:
import re
import itertools
import numpy as np
from numpy.linalg import norm
import warnings
warnings.filterwarnings('ignore')

# nltk
import nltk
from nltk.corpus import wordnet as wn
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('omw-1.4')

# spacy
import spacy
nlp = spacy.load("en_core_web_sm")

# gensim
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/matthewstachyra/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/matthewstachyra/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


### 1. generate similar words

##### using `nltk.corpus.wordnet.synsets` and `spacy` part of speech tagging
nltk doc: https://www.nltk.org/howto/wordnet.html <br>
spacy doc: https://spacy.io/usage/linguistic-features#pos-tagging <br>
pos tags used in spacy: https://universaldependencies.org/u/pos/ <br>

In [255]:
posmap = {'VERB':'v', 'NOUN':'n', 'PRON':'n', 'PROPN':'n', 'ADJ':'a', 'ADV':'r'}

def preprocess(utterance):
    '''return the utterance as a single cleaned string: lower case, with HTML-style
    tags, digits, and punctuation removed, and tokens joined by single spaces.
    '''
    utterance = utterance.lower()
    cleantext = re.sub(r'<.*?>', '', utterance)   # strip HTML-style tags
    rem_num   = re.sub(r'[0-9]+', '', cleantext)  # strip digits
    tokenizer = RegexpTokenizer(r'\w+')           # keep runs of word characters only
    
    return " ".join(tokenizer.tokenize(rem_num))


def get_pos(word, utterance):
    '''return the spaCy part-of-speech tag (e.g. 'VERB') of the first token in the
    utterance that matches the word, or None if no token matches.
    '''
    if not utterance or not word: raise ValueError("Error: Input is empty string.")
    if word not in utterance:     raise ValueError("Error: The word is not in the utterance.")
        
    for w in nlp(utterance):
        if w.text==word: return w.pos_
    
    return None
        

def get_synonyms(word, utterance):
    '''return the set of WordNet lemma names from synsets with the same part of speech
    as the word in the utterance, or None if the part of speech is not in posmap.
    
    Note: it is necessary to pass in the pos to get the relevant kind of synonym.
    '''
    pos = get_pos(word, utterance)
    
    if pos not in posmap: return None
    if len(word) <= 1:    return set()   # skip single-character words such as 'i'
    
    return {synonym
            for synset in wn.synsets(word, pos=posmap[pos])
            for synonym in synset.lemma_names()}

In [256]:
sometext = "Will I need to go to the doctor again next Monday?"

In [257]:
clean = preprocess(sometext)
print(clean)

will i need to go to the doctor again next monday


In [258]:
get_synonyms('need', clean)

{'ask',
 'call_for',
 'demand',
 'involve',
 'necessitate',
 'need',
 'postulate',
 'require',
 'take',
 'want'}
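
WordNet lemma names join multi-word expressions with underscores, as `call_for` in the output above shows. A minimal sketch of separating single-word lemmas from multi-word ones before substitution; `split_lemmas` is a hypothetical helper (not part of nltk), and the input set is copied from the output above:

```python
def split_lemmas(lemmas):
    """Split lemma names into (single_words, multiword_phrases).

    Multi-word lemmas use '_' as a separator; they are returned with spaces.
    """
    single, multi = [], []
    for lemma in sorted(lemmas):
        if "_" in lemma:
            multi.append(lemma.replace("_", " "))
        else:
            single.append(lemma)
    return single, multi

# Lemma names copied from the get_synonyms('need', clean) output.
lemmas = {"ask", "call_for", "demand", "involve", "necessitate",
          "need", "postulate", "require", "take", "want"}
single, multi = split_lemmas(lemmas)
print(single)
print(multi)
```

Keeping the multi-word lemmas in a separate list, rather than discarding them, leaves the option of using them later as phrase-level replacements.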

### 2. filter synonyms using similarity measure 

##### using `gensim` with `GloVe` embeddings
https://radimrehurek.com/gensim/models/word2vec.html
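
The similarity measure used for filtering in this step is cosine similarity: the dot product of two vectors divided by the product of their norms. A minimal pure-Python sketch on made-up 3-dimensional vectors that stand in for real embeddings; `cosine_similarity` is a hypothetical helper name:

```python
import math

def cosine_similarity(v1, v2):
    """Return dot(v1, v2) / (|v1| * |v2|) for two equal-length vectors."""
    dot   = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)

same      = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # parallel vectors
unrelated = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])   # orthogonal vectors
print(round(same, 6), unrelated)
```

Parallel vectors score close to 1 regardless of their length, and orthogonal vectors score 0, which is why the measure suits comparing embedding directions.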

In [259]:
# working with two models to see which performs best
model_glove_twitter = api.load("glove-twitter-25")
model_gigaword = api.load("glove-wiki-gigaword-100")

In [260]:
model_gigaword.most_similar(positive=['need'],topn=10)

[('should', 0.8821427226066589),
 ('want', 0.8693705797195435),
 ('we', 0.8659201264381409),
 ('must', 0.8644395470619202),
 ('needed', 0.8635740280151367),
 ('needs', 0.8618939518928528),
 ('get', 0.8493343591690063),
 ('make', 0.8488180637359619),
 ('do', 0.8422726988792419),
 ('able', 0.8364372849464417)]

In [261]:
model_glove_twitter.most_similar(positive=['need'], topn=10)

[('take', 0.9688727259635925),
 ('get', 0.9679074883460999),
 ('give', 0.9652544856071472),
 ('make', 0.9647879004478455),
 ("n't", 0.9613597393035889),
 ('better', 0.9595354199409485),
 ("'ll", 0.9594433903694153),
 ('let', 0.9594159126281738),
 ('bring', 0.9558347463607788),
 ('have', 0.9553070068359375)]

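**Note:** judging by the two result lists above, the model trained on Wikipedia and Gigaword (`glove-wiki-gigaword-100`) appears to return more relevant 'synonyms' for 'need' than the model trained on Twitter (`glove-twitter-25`). The two models also differ in dimensionality (100 vs. 25), so the training corpus may not be the only factor. The code below works exclusively with `model_gigaword`.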

In [262]:
def embed(word, model):
    '''return the embedding of the word from the model (shape (100,) for model_gigaword),
    or an empty array if the word is not in the model's vocabulary.
    '''
    try:
        return model.get_vector(word)
    except KeyError:
        return np.empty(0)

    
def get_similarities(word, synonyms, model):
    '''return dictionary with synonym:cosine similarity key-value pairs.
    '''
    cosinesim = lambda v1, v2: np.dot(v1, v2) / (norm(v1) * norm(v2))
    
    sims = {word: 1.0}
    ref  = embed(word, model)
    if ref.size == 0: return sims   # reference word is out of vocabulary
    
    for s in synonyms:
        vec = embed(s, model)
        if vec.size > 0:            # skip out-of-vocabulary synonyms such as 'call_for'
            sims[s] = cosinesim(ref, vec)
    
    return sims

        
def filter_synonyms(similarities, threshold=0.70):
    '''return list of synonyms whose similarity is at least the threshold
    value, with a default of 0.70 cosine similarity.
    '''
    return [synonym 
            for synonym, similarity in similarities.items() 
            if similarity>=threshold]
    

def print_similarities(similarities):
    '''print each word with its cosine similarity to a reference vector.
    '''
    for synonym, similarity in similarities.items():
        print(f"word: {synonym}, cosine similarity: {similarity}")

In [263]:
model_gigaword.get_vector('doctor')

array([ 0.043244, -0.47529 ,  0.15808 ,  0.20413 , -0.15383 ,  0.72284 ,
        0.26145 ,  0.20892 , -0.3147  , -0.070307, -0.43367 ,  0.053109,
        0.73635 ,  0.98111 ,  0.23535 , -0.10449 ,  0.50258 , -0.033356,
       -0.35537 ,  0.64549 , -0.37103 , -0.10052 , -0.76929 , -0.16957 ,
       -0.15648 ,  0.53548 ,  0.35146 , -1.5126  ,  0.050984,  0.24445 ,
       -0.35688 ,  0.43968 , -0.62985 ,  0.32891 , -0.53009 ,  0.49832 ,
       -1.2061  ,  0.27797 ,  0.42734 ,  0.095773, -0.43527 ,  0.93561 ,
        0.36039 , -0.83114 ,  0.12966 , -0.1363  , -0.58124 ,  0.092946,
       -0.014708,  0.32562 ,  0.41204 ,  0.1451  ,  0.49803 ,  0.86926 ,
       -0.18033 , -1.6227  , -0.64565 ,  0.17504 ,  0.73849 ,  0.39156 ,
        0.83135 ,  0.51308 ,  0.12999 , -0.21288 ,  0.68456 ,  0.056297,
        0.090792,  0.28032 , -0.12233 ,  0.60761 , -0.57913 , -0.024127,
       -0.063252,  0.40747 ,  0.10775 ,  0.57977 ,  0.092789, -0.15588 ,
       -0.36494 , -0.46632 ,  0.37553 ,  0.164   , 

In [264]:
word = 'need' 
sims = get_similarities(word,
                        get_synonyms(word, clean),
                        model_gigaword)
print_similarities(sims)

word: need, cosine similarity: 1.0
word: ask, cosine similarity: 0.7399007678031921
word: necessitate, cosine similarity: 0.18295712769031525
word: demand, cosine similarity: 0.6003023386001587
word: postulate, cosine similarity: 0.018827084451913834
word: involve, cosine similarity: 0.5790201425552368
word: take, cosine similarity: 0.8283353447914124
word: want, cosine similarity: 0.8693705797195435
word: require, cosine similarity: 0.7587778568267822


In [265]:
print(f"The context for the word is: \n '{clean}'. \n")
print(f"Full list of synonyms is: \n {list(get_synonyms(word, clean))}. \n")
print(f"Filtered list of synonyms is: \n {filter_synonyms(sims)}.")

The context for the word is: 
 'will i need to go to the doctor again next monday'. 

Full list of synonyms is: 
 ['ask', 'necessitate', 'need', 'call_for', 'demand', 'postulate', 'involve', 'take', 'want', 'require']. 

Filtered list of synonyms is: 
 ['need', 'ask', 'take', 'want', 'require'].
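
The 0.70 threshold is a tunable choice. A small sketch of how the filtered list changes with the threshold, using the similarity values printed above (rounded to three decimals); `filter_by_threshold` is a hypothetical stand-in that mirrors `filter_synonyms`:

```python
# Cosine similarities to 'need', rounded from the values printed above.
sims = {
    "need": 1.0, "ask": 0.740, "necessitate": 0.183, "demand": 0.600,
    "postulate": 0.019, "involve": 0.579, "take": 0.828, "want": 0.869,
    "require": 0.759,
}

def filter_by_threshold(similarities, threshold):
    """Return the words whose similarity is at least the threshold."""
    return [w for w, s in similarities.items() if s >= threshold]

for threshold in (0.5, 0.7, 0.8):
    print(threshold, filter_by_threshold(sims, threshold))
```

A lower threshold admits 'demand' and 'involve', while a higher one keeps only 'take' and 'want' alongside the original word, so the setting trades variety against closeness of meaning.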


### 3. select which words to replace 
Note: only replacing nouns, verbs, pronouns, proper nouns, adjectives, and adverbs <br>
spacy doc: https://spacy.io/usage/linguistic-features#pos-tagging

##### using `spacy`

In [266]:
for word in nlp(sometext):
    print(word.pos_)

AUX
PRON
VERB
PART
VERB
ADP
DET
NOUN
ADV
ADP
PROPN
PUNCT
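
Selecting by token position rather than by word string keeps repeated words apart: 'to' appears twice in the sentence, tagged once as `PART` and once as `ADP`. A minimal sketch that picks replaceable positions from (token, tag) pairs reconstructed from the sentence and the tags printed above; `replaceable_positions` is a hypothetical helper, and the hard-coded pairs stand in for a spaCy `Doc`:

```python
REPLACEABLE = {"VERB", "NOUN", "PRON", "PROPN", "ADJ", "ADV"}

# (token, universal POS tag) pairs for the example sentence.
tagged = [
    ("Will", "AUX"), ("I", "PRON"), ("need", "VERB"), ("to", "PART"),
    ("go", "VERB"), ("to", "ADP"), ("the", "DET"), ("doctor", "NOUN"),
    ("again", "ADV"), ("next", "ADP"), ("Monday", "PROPN"), ("?", "PUNCT"),
]

def replaceable_positions(tagged_tokens):
    """Return indices of tokens whose tag is in REPLACEABLE."""
    return [i for i, (_, tag) in enumerate(tagged_tokens) if tag in REPLACEABLE]

positions = replaceable_positions(tagged)
print([tagged[i][0] for i in positions])
```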


In [267]:
get_pos('need', preprocess(sometext))

'VERB'

In [268]:
def synonyms_map(utterance, synonymfilter=False):
    '''return dictionary of words to synonyms for words, removing any words 
    that do not have synonyms returned.
    
    NOTE  current version removes multi-word synonyms (e.g. 'call_for').
    '''
    d     = {}
    clean = preprocess(utterance)   # local, so the result does not depend on a global
    
    for word in clean.split():
        synonyms = get_synonyms(word, clean)
        if not synonyms: continue
        if synonymfilter:
            similarities = get_similarities(word, synonyms, model_gigaword)
            synonyms     = filter_synonyms(similarities)
        synonyms = [synonym
                    for synonym in synonyms 
                    if len(synonym.split("_"))==1 and preprocess(synonym)!=word] 
        
        if synonyms: d[word] = synonyms 
            
    return d

In [269]:
# default, without filter
synonyms_map(sometext)

{'need': ['ask',
  'necessitate',
  'demand',
  'postulate',
  'involve',
  'take',
  'want',
  'require'],
 'go': ['survive',
  'depart',
  'blend',
  'last',
  'operate',
  'die',
  'exit',
  'plump',
  'fail',
  'endure',
  'extend',
  'function',
  'start',
  'break',
  'choke',
  'lead',
  'fit',
  'move',
  'perish',
  'travel',
  'locomote',
  'get',
  'sound',
  'run',
  'belong',
  'rifle',
  'live',
  'proceed',
  'decease',
  'croak',
  'pass',
  'expire',
  'become',
  'conk',
  'work'],
 'doctor': ['physician', 'Dr.', 'medico', 'doc', 'MD'],
 'next': ['future', 'succeeding', 'following', 'adjacent'],
 'monday': ['Mon']}

In [270]:
# with filter
synonyms_map(sometext, True)

{'need': ['ask', 'take', 'want', 'require'],
 'go': ['start', 'break', 'move', 'get', 'run'],
 'doctor': ['physician'],
 'next': ['future']}
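
Replacing one word at a time keeps the number of generated utterances linear in the number of synonyms, whereas combining replacements across words grows multiplicatively. A rough sketch of the two counts for the filtered map above; `count_variants` is a hypothetical helper:

```python
import math

# Filtered synonym map copied from the output above.
filtered = {
    "need": ["ask", "take", "want", "require"],
    "go": ["start", "break", "move", "get", "run"],
    "doctor": ["physician"],
    "next": ["future"],
}

def count_variants(synonym_map):
    """Return (one_at_a_time, all_combinations) counts of new utterances.

    one_at_a_time: exactly one word swapped per utterance.
    all_combinations: any non-empty set of words swapped at once.
    """
    one_at_a_time    = sum(len(s) for s in synonym_map.values())
    all_combinations = math.prod(len(s) + 1 for s in synonym_map.values()) - 1
    return one_at_a_time, all_combinations

print(count_variants(filtered))
```

Since every generated utterance currently needs manual validation, the smaller one-at-a-time set is the more practical starting point.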

In [271]:
# list of lists of most common phrases in questions
need = ['do i need to', 'must i', 'is it required that i', 'will i need to']
frequency = ['how often do i need', 'what is the timeframe for']
scheduling = ['when is my', 'on what date', 'when do i see']
insurance = ['is this covered', 'will my insurance cover', 'do i need to pay', 'how much will i pay', 
             'what is my bill']
location = ['where is', 'where can i find', 'how can i find', 'i cant find', 'what is the location', 
            'can i have the location']
ability = ['what can i', 'is there anything i can', 'can i']
preparation = ['what do i need', 'how do i prepare', 'how can i get ready for', 'what should i bring']
forgetfulness = ['what if i forgot', 'i forgot to', 'is it ok if i forgot']
explanation = ['what is', 'tell me what is', 'describe', 'i want to understand']
phrasebank = [need, 
              frequency, 
              scheduling, 
              insurance, 
              location, 
              ability, 
              preparation, 
              forgetfulness, 
              explanation]

In [272]:
phrasebank

[['do i need to', 'must i', 'is it required that i', 'will i need to'],
 ['how often do i need', 'what is the timeframe for'],
 ['when is my', 'on what date', 'when do i see'],
 ['is this covered',
  'will my insurance cover',
  'do i need to pay',
  'how much will i pay',
  'what is my bill'],
 ['where is',
  'where can i find',
  'how can i find',
  'i cant find',
  'what is the location',
  'can i have the location'],
 ['what can i', 'is there anything i can', 'can i'],
 ['what do i need',
  'how do i prepare',
  'how can i get ready for',
  'what should i bring'],
 ['what if i forgot', 'i forgot to', 'is it ok if i forgot'],
 ['what is', 'tell me what is', 'describe', 'i want to understand']]

In [273]:
def synonym_tokens(utterance, synonyms):
    '''return a list with one list per token in the utterance, each holding the token
    followed by its synonyms (if any) from the synonyms dictionary.
    '''
    newutterance = []
    
    for word in preprocess(utterance).split():
        if word in synonyms:
            newutterance.append([word, *synonyms[word]])
        else:
            newutterance.append([word])
            
    return newutterance

def phrase_tokens(utterance, phrases):
    '''return list of new utterances in which a phrase found in the utterance is
    replaced by each of the other phrases from the same phrase list.
    '''
    newutterances  = []
    cleanutterance = preprocess(utterance)
    
    for phraselist in phrases:
        for phrase in phraselist:               
            if phrase in cleanutterance:
                for alternative in phraselist:   # includes the last phrase in the list
                    if alternative==phrase: continue
                    newutterances.append(cleanutterance.replace(phrase, alternative))
                    
    return newutterances


In [274]:
synonym_tokens(sometext, synonyms_map(sometext))

[['will'],
 ['i'],
 ['need',
  'ask',
  'necessitate',
  'demand',
  'postulate',
  'involve',
  'take',
  'want',
  'require'],
 ['to'],
 ['go',
  'survive',
  'depart',
  'blend',
  'last',
  'operate',
  'die',
  'exit',
  'plump',
  'fail',
  'endure',
  'extend',
  'function',
  'start',
  'break',
  'choke',
  'lead',
  'fit',
  'move',
  'perish',
  'travel',
  'locomote',
  'get',
  'sound',
  'run',
  'belong',
  'rifle',
  'live',
  'proceed',
  'decease',
  'croak',
  'pass',
  'expire',
  'become',
  'conk',
  'work'],
 ['to'],
 ['the'],
 ['doctor', 'physician', 'Dr.', 'medico', 'doc', 'MD'],
 ['again'],
 ['next', 'future', 'succeeding', 'following', 'adjacent'],
 ['monday', 'Mon']]

In [275]:
phrase_tokens(sometext, phrasebank)

['do i need to go to the doctor again next monday',
 'must i go to the doctor again next monday',
 'is it required that i go to the doctor again next monday']
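
Plain substring matching can also fire inside longer words (for example, 'can i' would match inside 'scan i'). A minimal sketch of phrase swapping with word-boundary matching via `re`; `swap_phrases` is a hypothetical helper, and the phrase list is the `need` list defined above:

```python
import re

def swap_phrases(utterance, phraselist):
    """Return utterances with the first matching phrase swapped for each alternative.

    Matching uses word boundaries so a phrase only matches whole words.
    """
    for phrase in phraselist:
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b")
        if pattern.search(utterance):
            return [pattern.sub(alt, utterance, count=1)
                    for alt in phraselist if alt != phrase]
    return []

need = ["do i need to", "must i", "is it required that i", "will i need to"]
swapped = swap_phrases("will i need to go to the doctor again next monday", need)
for u in swapped:
    print(u)
```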

In [276]:
def add_synonyms(utterance, synonyms):
    '''utility for generate_utterances() to return a list of generated utterances where
    the inputted synonyms from synonym_tokens() are used to replace one word at a time
    in the original utterance.
    '''
    genlist = []
    clean   = preprocess(utterance)
    prev    = 0
    
    for slist in synonyms:
        word  = slist[0]                  # first entry is the original token
        start = clean.find(word, prev)    # search from the end of the previous token
        end   = start + len(word)
        for synonym in slist:
            genlist.append(clean[:start] + synonym + clean[end:])
        prev = end  
        
    return list(set(genlist))

def add_phrases(utterance, phrasebank):
    '''utility for generate_utterances() to return a list of generated utterances where
    the first phrase matched from each phrase list is replaced by every phrase in that
    list (including the matched phrase itself, which reproduces the original utterance).
    '''
    if not phrasebank: return []
    
    genlist = []
    clean   = preprocess(utterance)
    
    for plist in phrasebank:
        match = next((p for p in plist if p in clean), None)
        if not match: continue
            
        start = clean.index(match)    # position of the whole phrase, not just its first word
        end   = start + len(match)
        
        for p in plist:
            genlist.append(clean[:start] + p + clean[end:])
    
    return genlist

def generate_utterances(utterance, phrasebank=None):
    '''return new list of utterances including the inputted utterance and any generated
    utterances.
    
    NOTE  this current method requires validating the generated text manually.
    NOTE  synonyms is a list of lists with form [[token, synonym(s)...]].
    NOTE  phrasebank is a list of lists with form [[interchangeable phrases]].
    '''
    clean    = preprocess(utterance)
    synonyms = synonym_tokens(clean, synonyms_map(clean))
    genlist  = add_phrases(clean, phrasebank)
    genlist.extend(add_synonyms(clean, synonyms))
    
    return genlist


In [277]:
generate_utterances(sometext, phrasebank)

['do i need to go to the doctor again next monday',
 'must i go to the doctor again next monday',
 'is it required that i go to the doctor again next monday',
 'will i need to go to the doctor again next monday',
 'will i need to blend to the doctor again next monday',
 'will i need to endure to the doctor again next monday',
 'will i need to rifle to the doctor again next monday',
 'will i need to die to the doctor again next monday',
 'will i take to go to the doctor again next monday',
 'will i need to run to the doctor again next monday',
 'will i need to belong to the doctor again next monday',
 'will i need to go to the doctor again next monday',
 'will i involve to go to the doctor again next monday',
 'will i want to go to the doctor again next monday',
 'will i require to go to the doctor again next monday',
 'will i need to function to the doctor again next monday',
 'will i need to lead to the doctor again next monday',
 'will i need to go to the medico again next monday',
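
Because the replacement utilities above locate words with string search, a token-index version is one possible alternative that avoids offset bookkeeping. A minimal sketch under that assumption, using the filtered synonym map shown earlier; `replace_one_at_a_time` is a hypothetical name and not the notebook's implementation:

```python
def replace_one_at_a_time(utterance, synonym_map):
    """Return utterances that differ from the input in exactly one token."""
    tokens    = utterance.split()
    generated = []
    for i, token in enumerate(tokens):
        for synonym in synonym_map.get(token, []):
            generated.append(" ".join(tokens[:i] + [synonym] + tokens[i + 1:]))
    return generated

# Filtered synonym map copied from the synonyms_map(sometext, True) output.
filtered = {
    "need": ["ask", "take", "want", "require"],
    "go": ["start", "break", "move", "get", "run"],
    "doctor": ["physician"],
    "next": ["future"],
}
variants = replace_one_at_a_time(
    "will i need to go to the doctor again next monday", filtered)
print(len(variants))
print(variants[0])
```

Working on tokens preserves the order of generation and treats each occurrence of a repeated word separately; the results still need manual validation, as with the original approach.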
 

## *Approach 2:* text generation of similar sentences using GANs and/or BERT

## *Approach 3:* embedding question types and running similarity measure (as alternative if there isn't a direct hit)