### Tag generation prototype v0.1 
12 July 2022

parrot: https://github.com/PrithivirajDamodaran/Parrot_Paraphraser <br>
rasa prototype: https://colab.research.google.com/drive/1RGWrQv3e0CRDPDROQ3ZmUWTmlRljasGi#scrollTo=776uG9Q6DTnf <br>

In [142]:
import re
import itertools
import warnings
warnings.filterwarnings("ignore")

# numpy
import numpy as np
from numpy.linalg import norm

# nltk, spacy, gensim
import spacy
import nltk
from nltk.corpus import wordnet as wn
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api

# pytorch, parrot
from parrot import Parrot
import torch

nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/matthewstachyra/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/matthewstachyra/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/matthewstachyra/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
def random_state(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

random_state(1234)

## what will make up tags?
#### current philosophy: generate as many unique, semantically related tokens as possible and use them as tags

#### WIP tags generated
1. tokens from words in note
2. tokens from paraphrases of note
3. synonyms of nouns in note
4. synonyms of nouns in each paraphrase
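
The four sources above can be read as a union of token sets. A minimal sketch, assuming the paraphrases and the synonym map have already been computed; `collect_tags` and its parameters are hypothetical names, and for simplicity it adds synonyms for any known word rather than nouns only:

```python
def collect_tags(note, paraphrases, synonyms):
    """Union the tag sources into one set of unique, lowercased tokens.

    note        -- the original note text
    paraphrases -- list of paraphrase strings for the note
    synonyms    -- dict mapping a word to a list of its synonyms
    """
    tags = set(note.lower().split())          # tokens from the note
    for phrase in paraphrases:                # tokens from each paraphrase
        tags.update(phrase.lower().split())
    for word, syns in synonyms.items():       # synonyms of words seen so far
        if word in tags:
            tags.update(syns)
    return tags


tags = collect_tags(
    "failed my exam",
    ["failed the exam", "did not pass my test"],
    {"exam": ["examination"], "test": ["trial"], "car": ["automobile"]},
)
print(sorted(tags))
```

Using a set keeps the tags unique without any extra bookkeeping; synonyms for words that never appear (here `car`) are ignored.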

## using `parrot` paraphraser to generate paraphrases, demo below

In [3]:
parrot = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5")


In [138]:
phrases = ["Can you recommed some upscale restaurants in Newyork?",
           "I was sad that I failed my exam but I knew that I could score better next time."
]

for phrase in phrases:
    print("-"*100)
    print("Input_phrase: ", phrase)
    print("-"*100)
    para_phrases = parrot.augment(input_phrase=phrase, use_gpu=False)
    for para_phrase in para_phrases:
        print(para_phrase)

----------------------------------------------------------------------------------------------------
Input_phrase:  Can you recommed some upscale restaurants in Newyork?
----------------------------------------------------------------------------------------------------
('recommend some of the best upscale restaurants in new york?', 28)
('recommend some very upscale restaurants in new york?', 26)
('can you suggest some upscale restaurants in new york?', 21)
('can you recommend some elegant restaurants in new york?', 20)
('can you recommend some upscale restaurants in new york?', 14)
('can you recommend some upscale restaurants in newyork?', 13)
----------------------------------------------------------------------------------------------------
Input_phrase:  I was sad that I failed my exam but I knew that I could score better next time.
----------------------------------------------------------------------------------------------------
('i was sad because i failed but i knew i could sc
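
As the output above shows, each result is a `(phrase, score)` tuple. A small sketch of pulling out just the phrases, with an optional score cutoff; `phrases_above` is a hypothetical helper, and treating `None` as "no paraphrases" is an assumption about how `augment` behaves when nothing passes its filters, worth checking against the installed version:

```python
def phrases_above(results, min_score=0):
    """Return paraphrase strings whose score is at least min_score.

    `results` is a list of (phrase, score) tuples like the output above,
    or None, which is treated here as "no paraphrases".
    """
    if not results:
        return []
    return [phrase for phrase, score in results if score >= min_score]


# Sample tuples copied from the output above.
sample = [
    ("recommend some very upscale restaurants in new york?", 26),
    ("can you suggest some upscale restaurants in new york?", 21),
    ("can you recommend some upscale restaurants in newyork?", 13),
]
print(phrases_above(sample, min_score=20))
print(phrases_above(None))
```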

## creating classes to establish pipeline: preprocessing text, generating synonyms, generating paraphrases, generating tags

In [9]:
!python -m spacy download en_core_web_sm

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting en-core-web-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [106]:
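class Preprocessor:
    '''Lowercase a note and strip markup, digits, and punctuation.'''

    def __init__(self, note, remove_stopwords=False):
        # Reject empty input and non-strings, as the error message promises.
        if not note or not isinstance(note, str):
            raise ValueError("Error: Input is invalid. It should be a string.")
        self.note = note
        self.stopwords = remove_stopwords

    def __call__(self):
        tokenizer = RegexpTokenizer(r'\w+')
        re_strip = re.compile('<.*?>')  # matches HTML-style tags
        lemmatizer = WordNetLemmatizer()

        self.note = self.note.lower()
        self.note = re.sub(re_strip, '', self.note)          # drop tags
        self.note = re.sub('[0-9]+', '', self.note)          # drop digits
        self.note = " ".join(tokenizer.tokenize(self.note))  # drop punctuation
        # NOTE: iterating over a string yields characters, so this lemmatizes
        # one character at a time and leaves the text effectively unchanged.
        # Word-level lemmatization would need self.note.split() and a " " join,
        # which may change the outputs recorded in this notebook.
        self.note = "".join([lemmatizer.lemmatize(word) for word in self.note])

        if self.stopwords:
            return self.remove_stopwords()
        return self.note

    def remove_stopwords(self):
        stop = set(nltk.corpus.stopwords.words('english'))
        return " ".join([word for word in self.note.split() if word not in stop])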

In [107]:
p = Preprocessor("This is a test note!@$#!@#!")
p()

'this is a test note'

In [108]:
p = Preprocessor("This is a test note!@$#!@#!", remove_stopwords=True)
p()

'test note'
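
One thing to watch in `__call__`: the lemmatization step iterates over `self.note`, which is a string, so it visits characters rather than words. A toy sketch of the difference, using a made-up `toy_lemmatize` stand-in (not the WordNet lemmatizer) so it runs without any downloads:

```python
def toy_lemmatize(token):
    """Hypothetical stand-in for a lemmatizer: strips a trailing 's' from longer tokens."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


text = "two test notes"

# Iterating over the string visits single characters, so nothing changes.
by_char = "".join(toy_lemmatize(ch) for ch in text)

# Splitting first visits whole words, so the lemmatizer can act on them.
by_word = " ".join(toy_lemmatize(word) for word in text.split())

print(by_char)  # two test notes
print(by_word)  # two test note
```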

In [121]:
class Synonymizer:
    def __init__(self, note):
        self.preprocessor = Preprocessor(note, remove_stopwords=True)
        self.note = self.preprocessor()
        self.glovemodel = api.load("glove-wiki-gigaword-100")
        self.spacymodel = spacy.load("en_core_web_sm")
        self.posmap = {'VERB':'v', 'NOUN':'n', 'PRON':'n', 'PROPN':'n', 'ADJ':'a', 'ADV':'r'}
        
    def __call__(self):
        '''return dictionary of words : synonym(s) pairs.
        '''
        # NOTE current version removes n grams.
        d = {}

        for word in self.note.split():
            synonyms = self.synonyms_by_word(word)

            if synonyms: synonyms = list(set([synonym
                                     for synonym in synonyms
                                     if len(synonym.split("_"))==1]))

            if synonyms: d[word] = synonyms
            
        return d
        
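    def pos_by_word(self, word):
        # Tag the note with the spaCy pipeline loaded in __init__ and return
        # the part of speech of the first token matching the word.
        for w in self.spacymodel(self.note):
            if str(w)==word: return w.pos_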

    def similarities_by_word(self, word, synonyms):
        def cosinesim(v1, v2):
            return np.dot(v1, v2) / (norm(v1) * norm(v2))

        def embed(token):
            # Return the GloVe vector for a token, or an empty array if the
            # token is not in the vocabulary.
            try:
                return self.glovemodel.get_vector(token)
            except KeyError:
                return np.empty(0)

        sims = {word: 1.0}
        ref  = embed(word)
        if not ref.any():
            # No usable vector for the word itself, so nothing can be compared.
            return sims

        for s in synonyms:
            vec = embed(s)
            # .any() is False for an empty (or all-zero) vector.
            if vec.any():
                sims[s] = cosinesim(ref, vec)

        return sims

    def print_similarities(self, similarities):
        for synonym, similarity in similarities.items():
            print(f"word: {synonym}, cosine similarity: {similarity}")

    def synonyms_by_word(self, word):
        pos = self.pos_by_word(word)

        # Only parts of speech that map to a WordNet category are handled.
        if pos not in self.posmap: return

        # get full set of synonyms (skipped for single-character words)
        synonyms = {synonym
                    for synset in wn.synsets(word, pos=self.posmap[pos])
                    for synonym in synset.lemma_names()
                    if len(word)>1}

        # filter this set using cosine similarities
        similarities = self.similarities_by_word(word, synonyms)

        return [synonym
                for synonym, similarity in similarities.items()
                if similarity>=0.70]
    

In [122]:
s = Synonymizer("I was sad that I failed my exam but I knew that I could score better next time.")

In [123]:
s()

{'sad': ['sad', 'sorry'],
 'failed': ['failed', 'fail'],
 'exam': ['exam', 'examination'],
 'knew': ['knew', 'know'],
 'score': ['score'],
 'better': ['best', 'better', 'well', 'good'],
 'next': ['future', 'next'],
 'time': ['time']}
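
The filtering step keeps a WordNet synonym only if its embedding is close enough to the original word's. A dependency-free sketch of that idea, using made-up 3-dimensional vectors in place of GloVe; `filter_by_similarity` and `toy_vectors` are hypothetical names, and the 0.70 threshold mirrors the one in `synonyms_by_word`:

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


def filter_by_similarity(word, candidates, vectors, threshold=0.70):
    """Keep candidates whose vector is close enough to the word's vector.

    Candidates with no vector are skipped, and an unknown word keeps nothing.
    """
    if word not in vectors:
        return []
    ref = vectors[word]
    return [c for c in candidates
            if c in vectors and cosine_similarity(ref, vectors[c]) >= threshold]


# Made-up vectors, for illustration only.
toy_vectors = {
    "exam": [1.0, 0.0, 0.0],
    "examination": [0.9, 0.1, 0.0],
    "trial": [0.0, 1.0, 0.0],
}
print(filter_by_similarity("exam", ["examination", "trial", "quiz"], toy_vectors))
# ['examination']
```

Here `trial` is dropped because its vector is orthogonal to `exam`, and `quiz` is dropped because it has no vector at all.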

In [139]:
class Paraphraser:
    def __init__(self, note, with_gpu=False):
        self.preprocessor = Preprocessor(note)
        self.synonymizer = Synonymizer(note)
        self.parrot = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5")
        self.note = self.preprocessor()
        self.synonyms = self.synonymizer() 
        self.gpu = with_gpu
    
    def __call__(self):
        '''return list of paraphrases of the note.
        '''
        paraphrases = self.transformer_phrases()
        paraphrases.extend(self.synonym_phrases())
        return paraphrases
        
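    def transformer_phrases(self):
        # augment() yields (phrase, score) tuples; fall back to an empty list
        # in case it returns None.
        results = self.parrot.augment(input_phrase=self.note,
                                      use_gpu=self.gpu) or []
        phrases = [tup[0] for tup in results]
        print(phrases)
        return phrases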
        
    def synonym_phrases(self):
        genlist = []
        tokens  = []
        prev    = 0

        # For each word, list the word itself followed by its synonyms (if any).
        for word in self.note.split():
            if word in self.synonyms:
                tokens.append([word] + self.synonyms[word])
            else:
                tokens.append([word])

        # use tokens to return new utterances, swapping one word at a time
        for i in range(len(tokens)):
            word  = tokens[i][0]
            slist = tokens[i]

            # Locate this occurrence of the word, searching from the end of
            # the previous word so repeated words are matched in order.
            start = self.note.find(word, prev)
            end   = start + len(word)

            for candidate in slist:
                genlist.append(self.note[:start] + candidate + self.note[end:])

            prev = end

        return list(set(genlist))

In [140]:
p = Paraphraser("I was sad that I failed my exam but I knew that I could score better next time.")

In [141]:
p()

['i was sad to fail an exam but i knew i could get better next time', 'i was sad that i failed my exams but i knew i would be able to score better next time', 'i was sad that i failed the exams but i knew i could score better next time', 'i was sad that i failed my exam but i knew i could score better the next time', 'i was sad that i failed my exams but i knew i could score better next time', 'i was sad that i failed my exam but i knew i could score better next time', 'i was sad i failed my exam but i knew that i could score better next time', 'i was sad that i failed my exam but i knew that i could score better next time']


['i was sad to fail an exam but i knew i could get better next time',
 'i was sad that i failed my exams but i knew i would be able to score better next time',
 'i was sad that i failed the exams but i knew i could score better next time',
 'i was sad that i failed my exam but i knew i could score better the next time',
 'i was sad that i failed my exams but i knew i could score better next time',
 'i was sad that i failed my exam but i knew i could score better next time',
 'i was sad i failed my exam but i knew that i could score better next time',
 'i was sad that i failed my exam but i knew that i could score better next time',
 'i was sad that i failed my exam but i knew that i could score better future time',
 'i was sad that i failed my exam but i knew that i could score better next time',
 'i was sorry that i failed my exam but i knew that i could score better next time',
 'i was sad that i failed my exam but i knew that i could score well next time',
 'i was sad that i failed 
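
The synonym-based phrases at the end of this list each differ from the note by a single swapped word. A token-based sketch of the same idea, without the character offsets used in `synonym_phrases`; `substitute_synonyms` is a hypothetical helper, and unlike the class method it leaves out the unchanged note:

```python
def substitute_synonyms(note, synonyms):
    """Return variants of note, each with exactly one word replaced by a synonym."""
    tokens = note.split()
    variants = set()
    for i, token in enumerate(tokens):
        for synonym in synonyms.get(token, []):
            if synonym != token:
                variants.add(" ".join(tokens[:i] + [synonym] + tokens[i + 1:]))
    return sorted(variants)


for variant in substitute_synonyms(
    "i was sad that i failed my exam",
    {"sad": ["sad", "sorry"], "exam": ["exam", "examination"]},
):
    print(variant)
```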

In [42]:
class Tagger:
    # TODO
    pass