# Utterance generation prototype notebook
Author: Matthew Stachyra <br>
Date: 16 June 2022 <br>
Version: 0.1 - prototyping 3 approaches

## *Approach 1:* replacement with similar words
Note: This approach generates sentences that stay close to the original in wording and structure. The generated sentences may also be used for the machine learning in approaches 2 and 3 below.

### Subproblems
1. generate similar words
2. identify which words to replace in an utterance
3. replace words in the utterance one at a time to generate a new set of utterances

In [26]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('omw-1.4')

import spacy
nlp = spacy.load("en_core_web_sm")

import re
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer() 

import itertools

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/matthewstachyra/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/matthewstachyra/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


#### 1. generate similar words

##### using `nltk.corpus.wordnet.synsets` and `spacy` part of speech tagging
nltk doc: https://www.nltk.org/howto/wordnet.html <br>
spacy doc: https://spacy.io/usage/linguistic-features#pos-tagging <br>
pos tags used in spacy: https://universaldependencies.org/u/pos/ <br>

In [60]:
posmap = {'VERB':'v', 'NOUN':'n', 'PRON':'n', 'PROPN':'n', 'ADJ':'a', 'ADV':'r'}

def preprocess(utterance):
    '''return the utterance as a single string, preprocessed to be lower case, with
    HTML-style tags, digits, and punctuation removed and tokens joined by single spaces.
    '''
    utterance = utterance.lower()
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', utterance)
    rem_num = re.sub('[0-9]+', '', cleantext)
    tokenizer = RegexpTokenizer(r'\w+')
    return " ".join(tokenizer.tokenize(rem_num))


def get_pos(word, utterance):
    '''return the WordNet part of speech tag of the word in the utterance if it is a verb,
    noun, pronoun, proper noun, adjective, or adverb; otherwise return None.
    '''
    # compare against spaCy tokens rather than substrings, so 'go' does not match 'again'
    pos = next((token.pos_ for token in nlp(utterance) if token.text == word), None)
    if pos is None: raise ValueError("Error: The word is not in the utterance.")
    print(pos) # debug output: the raw spaCy tag of the first matching token
    return posmap.get(pos) # None for tags outside posmap
        

def get_synonyms(word, pos):
    '''return synonyms given a word if its part of speech is a verb, noun, adverb,
    or adjective.
    
    Note: it is necessary to pass in the pos to get the relevant kind of synonym.
    '''
    if pos not in ['v', 'n', 'r', 'a']: return None # nothing if not a verb, noun, adjective, or adverb
    # a set comprehension already flattens the lemma names and removes duplicates
    return {synonym
            for synset in wn.synsets(word, pos=pos)
            for synonym in synset.lemma_names()}

In [61]:
sometext = "Will I need to go to the doctor again next Monday?"

In [62]:
preprocess(sometext)

'will i need to go to the doctor again next monday'

In [63]:
get_synonyms('need', 'v')

{'ask',
 'call_for',
 'demand',
 'involve',
 'necessitate',
 'need',
 'postulate',
 'require',
 'take',
 'want'}

In [64]:
print(get_synonyms('go', get_pos('go', sometext)))

VERB
{'exit', 'hold_up', 'pop_off', 'rifle', 'extend', 'last', 'plump', 'pass', 'get_going', 'die', 'belong', 'break_down', 'live_on', 'function', 'lead', 'give-up_the_ghost', 'travel', 'survive', 'live', 'operate', 'depart', "cash_in_one's_chips", 'drop_dead', 'fit', 'conk', 'blend', 'run', 'run_short', 'perish', 'go_bad', 'move', 'snuff_it', 'decease', 'go_away', 'give_out', 'become', 'fail', 'blend_in', 'work', 'go', 'sound', 'get', 'pass_away', 'conk_out', 'expire', 'buy_the_farm', 'endure', 'start', 'give_way', 'hold_out', 'locomote', 'kick_the_bucket', 'run_low', 'choke', 'proceed', 'croak', 'break'}
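
The sets above still contain the query word itself (`'need'`, `'go'`) and multi-word lemmas joined with underscores (`'call_for'`, `'kick_the_bucket'`). Normalizing them before substitution may help. The following is a minimal sketch, assuming a hypothetical helper `clean_synonyms` that is not part of this notebook; the sample set is copied from the `get_synonyms('need', 'v')` output above.

```python
def clean_synonyms(word, synonyms):
    """Return a sorted list of candidate replacements for `word`.

    Hypothetical helper (not part of the original notebook): it drops the
    word itself and turns WordNet's underscore-joined lemma names into
    space-separated phrases.
    """
    if synonyms is None:  # get_synonyms returns None for unsupported parts of speech
        return []
    cleaned = {s.replace('_', ' ') for s in synonyms if s.lower() != word.lower()}
    return sorted(cleaned)


# sample input copied from the output of get_synonyms('need', 'v')
need_synonyms = {'ask', 'call_for', 'demand', 'involve', 'necessitate',
                 'need', 'postulate', 'require', 'take', 'want'}
print(clean_synonyms('need', need_synonyms))
```

Cleaning does not address word sense: many of the synonyms returned for `'go'` (for example `'die'` or `'croak'`) belong to senses that do not fit the sample sentence, so some further filtering would likely still be needed.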


#### 2. select which words to replace 
Note: only replacing nouns, verbs, pronouns, proper nouns, adjectives, and adverbs <br>
spacy doc: https://spacy.io/usage/linguistic-features#pos-tagging

##### using `spacy`

In [20]:
for word in nlp(sometext):
    print(word.pos_)

PRON
AUX
DET
NOUN
NOUN
PUNCT


In [69]:
get_pos('need', preprocess(sometext))

VERB


'v'

In [70]:
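def map_synonyms(utterance):
    '''return dictionary mapping each replaceable word in the utterance to its synonyms.
    '''
    # iterate over spaCy tokens; iterating over the string itself yields single
    # characters, which get_pos cannot find among the tokens
    words = [token.text for token in nlp(utterance)]
    wordpos = [(word, get_pos(word, utterance)) for word in words]
    return {word : get_synonyms(word, pos)
            for word, pos in wordpos
            if pos is not None}

In [71]:
map_synonyms(preprocess(sometext))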

In [None]:
# list of lists of most common phrases in questions
need = ['do i need to', 'must i', 'is it required']
frequency = ['how often do i need', 'what is the timeframe for']
scheduling = ['when is my', 'on what date', 'when do i see']
insurance = ['is this covered', 'will my insurance cover', 'do i need to pay', 'how much will i pay', 
             'what is my bill']
location = ['where is', 'where can i find', 'how can i find', 'i cant find', 'what is the location', 
            'can i have the location']
ability = ['what can i', 'is there anything i can', 'can i']
preparation = ['what do i need', 'how do i prepare', 'how can i get ready for', 'what should i bring']
forgetfulness = ['what if i forgot', 'i forgot to', 'is it ok if i forgot']
explanation = ['what is', 'tell me what is', 'describe', 'i want to understand']
phrasebank = [need, 
              frequency, 
              scheduling, 
              insurance, 
              location, 
              ability, 
              preparation, 
              forgetfulness, 
              explanation]
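
One possible use of the phrase bank is to tag an utterance with its question type by looking for a direct phrase hit. The following is a minimal sketch, assuming a dictionary keyed by category name (the notebook keeps the categories as separate lists) and a hypothetical function `match_categories`; only three of the categories are copied here for brevity.

```python
# subset of the phrase bank, keyed by category name (an assumption:
# the notebook stores the categories as separate lists)
phrases_by_category = {
    'need': ['do i need to', 'must i', 'is it required'],
    'scheduling': ['when is my', 'on what date', 'when do i see'],
    'location': ['where is', 'where can i find', 'how can i find', 'i cant find',
                 'what is the location', 'can i have the location'],
}


def match_categories(utterance, phrases_by_category):
    """Return the names of all categories with a phrase contained in the utterance.

    Hypothetical helper: expects an utterance that is already lower case
    with punctuation removed, as produced by preprocess.
    """
    padded = f' {utterance} '  # pad so phrases only match at word boundaries
    return [category
            for category, phrases in phrases_by_category.items()
            if any(f' {phrase} ' in padded for phrase in phrases)]


print(match_categories('do i need to see the doctor on monday', phrases_by_category))
print(match_categories('where is the clinic', phrases_by_category))
```

An utterance with no direct hit returns an empty list, which is the case a similarity-based fallback would have to handle.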

In [None]:
# need to generate every possible combination
# so need to identify which combination is available and which combination has been tried
# so need to also check
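
The replacement step (subproblem 3) and the combination idea in the comments above could be sketched as follows. This is only an illustration under stated assumptions: `replace_one_at_a_time` and `replace_all_combinations` are hypothetical names, and the small synonym map is made up for the example rather than taken from `map_synonyms`.

```python
import itertools


def replace_one_at_a_time(utterance, synonyms):
    """Return the set of utterances obtained by swapping exactly one word for a synonym."""
    words = utterance.split()
    variants = set()
    for i, word in enumerate(words):
        for synonym in synonyms.get(word, ()):
            if synonym != word:
                variants.add(' '.join(words[:i] + [synonym] + words[i + 1:]))
    return variants


def replace_all_combinations(utterance, synonyms):
    """Return every utterance formed by picking the original word or a synonym per position."""
    words = utterance.split()
    # each position offers the original word first, then its synonyms
    choices = [[word] + sorted(set(synonyms.get(word, ())) - {word}) for word in words]
    return {' '.join(combo) for combo in itertools.product(*choices)}


# hypothetical synonym map; in the notebook it would come from map_synonyms
synonyms = {'need': {'require', 'want'}, 'doctor': {'physician'}}
utterance = 'i need the doctor'

for variant in sorted(replace_one_at_a_time(utterance, synonyms)):
    print(variant)
print(len(replace_all_combinations(utterance, synonyms)))
```

Collecting results in a set means a combination that has already been generated is not stored twice. The number of full combinations is the product of the choices per word, so it grows quickly for longer utterances or larger synonym sets.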

## *Approach 2:* text generation of similar sentences using GANs and/or BERT

## *Approach 3:* embedding question types and running similarity measure (as alternative if there isn't a direct hit)