# Utterance generation prototype notebook
Author: Matthew Stachyra <br>
Date: 16 June 2022 <br>
Version: 0.1 - prototyping 3 approaches

## *Approach 1:* replacement with similar words
Note: This approach generates new sentences that closely mirror the structure of the original. The generated sentences may also serve as training data for the machine learning in approaches 2 and 3 below.

### Subproblems
1. generate similar words
2. identify which words to replace in an utterance
3. replace words in utterance one at a time and generate new set
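
Subproblem 3 can be sketched without any NLP dependency. The block below is a minimal illustration, not the notebook's final implementation: `replace_one_at_a_time` is a hypothetical helper name, and `example_synonyms` is hand-written example data standing in for the WordNet lookups built later in this notebook.

```python
def replace_one_at_a_time(tokens, synonyms):
    """Return new utterances, each differing from the original by one word.

    tokens   -- list of words in the original utterance
    synonyms -- dict mapping a word to an iterable of replacement words
                (hypothetical stand-in for WordNet lookups)
    """
    variants = []
    for i, token in enumerate(tokens):
        # sorted() gives a deterministic order when synonyms are stored in sets
        for synonym in sorted(synonyms.get(token, ())):
            if synonym == token:
                continue  # skip the word itself; it would reproduce the original
            variants.append(" ".join(tokens[:i] + [synonym] + tokens[i + 1:]))
    return variants


# Hand-written example data, not real WordNet output.
example_synonyms = {"test": {"exam", "trial", "test"}, "sentence": {"phrase"}}
example_variants = replace_one_at_a_time(
    ["this", "is", "a", "test", "sentence"], example_synonyms)
print(example_variants)
```

Replacing a single position per variant keeps each generated utterance close to the original. Note that a plain swap does not adjust surrounding words, so an article such as "a" may no longer agree with its replacement noun.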

In [26]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('omw-1.4')

import spacy
nlp = spacy.load("en_core_web_sm")

import re
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer() 

import itertools

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/matthewstachyra/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/matthewstachyra/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


#### 1. generate similar words

##### using `nltk.corpus.wordnet.synsets` and `spacy` part of speech tagging
nltk doc: https://www.nltk.org/howto/wordnet.html <br>
spacy doc: https://spacy.io/usage/linguistic-features#pos-tagging <br>
pos tags used in spacy: https://universaldependencies.org/u/pos/ <br>

In [27]:
posmap = {'VERB':'v', 'NOUN':'n', 'PRON':'n', 'PROPN':'n', 'ADJ':'a', 'ADV':'r'}

def preprocess(utterance):
    '''return list of word tokens in utterance after lowercasing and removing
    HTML tags, URLs, and digits.
    '''
    utterance = utterance.lower()
    no_tags = re.sub(r'<.*?>', '', utterance)   # strip HTML-style tags
    no_urls = re.sub(r'http\S+', '', no_tags)   # strip URLs
    no_nums = re.sub(r'[0-9]+', '', no_urls)    # strip digits
    tokenizer = RegexpTokenizer(r'\w+')         # keep runs of word characters
    tokens = tokenizer.tokenize(no_nums)
    return tokens

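def get_pos(word, utterance):
    '''return the WordNet part of speech tag for word in utterance if spaCy tags
    it as a verb, noun, pronoun, proper noun, adjective, or adverb; otherwise
    return None.

    Note: word must match a spaCy token exactly; if it does not, next() raises
    StopIteration.
    '''
    pos = next(filter(lambda x : str(x[0])==word,
                      [(token, token.pos_)
                       for token in nlp(utterance)]))[1]
    print(pos)  # debug: show the spaCy tag
    return posmap.get(pos)  # None when the tag has no WordNet equivalent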

def get_synonyms(word, pos):
    '''return the set of synonyms for word if its part of speech is a verb, noun,
    adverb, or adjective; otherwise return None.
    
    Note: it is necessary to pass in the pos to get the relevant kind of synonym.
    '''
    if pos not in ['v', 'n', 'r', 'a']:
        return None  # nothing if not a verb, noun, adjective, or adverb
    # collect the lemma names of every synset matching the word and pos
    return {synonym
            for synset in wn.synsets(word, pos=pos)
            for synonym in synset.lemma_names()}

In [None]:
preprocess("This is a test sentence.")

In [18]:
get_synonyms("dog", "v")

{'chase',
 'chase_after',
 'dog',
 'give_chase',
 'go_after',
 'tag',
 'tail',
 'track',
 'trail'}
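
WordNet lemma names join multiword expressions with underscores, as in `chase_after` above, and the result set includes the query word itself. Before such names are inserted into an utterance they probably need light cleanup. The block below is a minimal sketch of that step; `clean_lemma_names` is a hypothetical helper, and dropping the original word is an assumption about what the replacement step wants.

```python
def clean_lemma_names(lemma_names, original):
    """Replace underscores with spaces and drop the original word."""
    return sorted(name.replace("_", " ")
                  for name in lemma_names
                  if name != original)


# Lemma names copied from the get_synonyms("dog", "v") output above.
dog_verbs = {'chase', 'chase_after', 'dog', 'give_chase', 'go_after',
             'tag', 'tail', 'track', 'trail'}
cleaned = clean_lemma_names(dog_verbs, "dog")
print(cleaned)
```

Returning a sorted list rather than a set makes the order of generated utterances reproducible across runs.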

In [19]:
sometext = "This is a test sentence."
print(get_synonyms("test", get_pos("test", sometext)))

NOUN
{'trial', 'trial_run', 'mental_test', 'exam', 'tryout', 'mental_testing', 'examination', 'psychometric_test', 'test', 'run'}


#### 2. select which words to replace 
Note: only replacing nouns, verbs, pronouns, proper nouns, adjectives, and adverbs <br>
spacy doc: https://spacy.io/usage/linguistic-features#pos-tagging

##### using `spacy`

In [20]:
for word in nlp(sometext):
    print(word.pos_)

PRON
AUX
DET
NOUN
NOUN
PUNCT


In [21]:
get_pos("test", sometext)

NOUN


'n'

In [22]:
def map_synonyms(utterance):
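    '''print each whitespace-separated word in utterance with its part of speech.

    Work in progress: the goal is to return a dictionary mapping each word to
    its synonyms (see the commented-out draft below).
    '''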
    for word in utterance.split():
        print(word)
        print(get_pos(word, utterance))
        print("\n")
        
#     for tup in wordpos:
#         if tup[1] in posmap:
#             s = get_synonyms(tup[0], tup[1])
#     return {tup[1] : get_synonyms(tup[0], tup[1]) 
#             for tup in wordpos
#             if tup[1] in posmap}

In [23]:
map_synonyms(sometext)

This
PRON
n


is
AUX
None


a
DET
None


test
NOUN
n


sentence.


StopIteration: 
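
The `StopIteration` arises because `utterance.split()` leaves punctuation attached (`sentence.`), while spaCy emits `sentence` and `.` as separate tokens. The `filter` in `get_pos` therefore matches nothing, and `next` has no item to return. One possible fix, sketched here under stated assumptions, is to iterate over the tagger's own tokens and to give `next` a default value. The `tagged` list is a hand-written stand-in for `[(str(t), t.pos_) for t in nlp(sometext)]`, with tags copied from the outputs above, and `safe_pos` is a hypothetical name.

```python
posmap = {'VERB': 'v', 'NOUN': 'n', 'PRON': 'n', 'PROPN': 'n', 'ADJ': 'a', 'ADV': 'r'}

# Stand-in for spaCy output on "This is a test sentence." (tags from the cells above).
tagged = [("This", "PRON"), ("is", "AUX"), ("a", "DET"),
          ("test", "NOUN"), ("sentence", "NOUN"), (".", "PUNCT")]


def safe_pos(word, tagged_tokens):
    """Return the WordNet tag for word, or None if the word is absent
    or its tag has no entry in posmap."""
    # next() with a default returns None instead of raising StopIteration
    pos = next((tag for token, tag in tagged_tokens if token == word), None)
    return posmap.get(pos)


# Whitespace splitting still yields "sentence.", but the lookup no longer fails.
whitespace_words = "This is a test sentence.".split()
print([safe_pos(w, tagged) for w in whitespace_words])

# Iterating over the tagger's own tokens avoids the mismatch altogether.
replaceable = {token: posmap[tag] for token, tag in tagged if tag in posmap}
print(replaceable)
```

Iterating over the tagged tokens directly also avoids re-running the tagger once per word, which the current `get_pos` does on every call.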

## *Approach 2:* text generation of similar sentences using GANs

## *Approach 3:* text generation of fixed length, similar sentences using BERT