# Ubiquitous Vocab Demo
GTech Final - Kevin Zen


## Download pre-built models and data.

In [1]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [2]:
import sys
import nltk
nltk.download("wordnet")
from api import UbiVocab
from main import eval_different_filters

data_dir = "../../data/"


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/kevinzen/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Interactive Vocab Replacement.

* **article (str)**: The text of the news article to process.
* **num_sentences (int)**: The number of sentences around a target word to use as that word's context.
* **run_lesk_wsd (bool)**: Use the Lesk algorithm for word sense disambiguation (WSD).
* **run_bert_wsd (bool)**: Use BERT for word sense disambiguation by comparing each word's definition and usage examples with the word as it appears in the article. I have found that Lesk works better for this at the moment.
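
To make the `run_lesk_wsd` option more concrete, here is a minimal sketch of the idea behind Lesk: pick the sense whose dictionary gloss shares the most words with the surrounding context. This is not the `UbiVocab` implementation, which is not shown in this notebook. The glosses, the stopword list, and the helper names (`tokenize`, `simplified_lesk`, `TOY_SENSES`) are all hypothetical and written only for illustration. NLTK ships its own version as `nltk.wsd.lesk`, which works on WordNet synsets instead of a hand-written dictionary.

```python
import re

# Small, hand-picked stopword list (an assumption; real systems use a fuller list).
STOPWORDS = {"a", "an", "the", "of", "in", "to", "for", "or", "is", "that", "this", "on"}

def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens, minus stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

# Hypothetical glosses for two senses of "routine"; not copied from WordNet.
TOY_SENSES = {
    "routine_everyday": "found in the ordinary course of events; a regular daily exercise or habit",
    "routine_performance": "a short theatrical performance that is part of a longer show on stage",
}

def simplified_lesk(context, senses):
    """Return the sense whose gloss overlaps most with the context, plus all overlap counts."""
    context_words = tokenize(context)
    overlaps = {name: len(context_words & tokenize(gloss)) for name, gloss in senses.items()}
    return max(overlaps, key=overlaps.get), overlaps

context = "This program seeks to create a routine exercise for learning vocabulary over time."
best_sense, overlaps = simplified_lesk(context, TOY_SENSES)
print(best_sense, overlaps)
```

With these toy glosses, only the "everyday" sense shares a content word ("exercise") with the sentence, so it wins. A wider context window, as controlled by `num_sentences`, gives the overlap count more words to work with.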

In [3]:
ubi_vocab = UbiVocab(data_dir = data_dir)
article = ("This is an article about the damaging effects of not learning vocabulary. "
           " This program seeks to create a routine exercise for learning vocabulary over time.")

# Process the input article above.
ubi_vocab.process_article(article = article,
                         run_lesk_wsd = True,
                         run_bert_wsd = False,
                         num_sentences = 1)


Reading existing file at ../../data/gre_vocab.csv
2021-04-27 11:28:49,938 - main - INFO - Loading in spacy and sentence transformer models.
2021-04-27 11:28:50,444 - transformer - INFO - Reading in pretrained model bert-base-nli-mean-tokens
2021-04-27 11:28:50,444 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: bert-base-nli-mean-tokens
2021-04-27 11:28:50,445 - sentence_transformers.SentenceTransformer - INFO - Did not find folder bert-base-nli-mean-tokens
2021-04-27 11:28:50,445 - sentence_transformers.SentenceTransformer - INFO - Try to download model from server: https://sbert.net/models/bert-base-nli-mean-tokens.zip
2021-04-27 11:28:50,447 - sentence_transformers.SentenceTransformer - INFO - Load SentenceTransformer from folder: /Users/kevinzen/.cache/torch/sentence_transformers/sbert.net_models_bert-base-nli-mean-tokens
2021-04-27 11:28:51,910 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device: cpu
2021-04-27 11:28:51

In [4]:
# View the new article with replaced words.
# Here "damaging" was replaced with "detrimental", and "routine" was replaced with "quotidian".
ubi_vocab.new_article

0    This is an article about the detrimental effects of not learning vocabulary.  This program seeks to create a quotidian exercise for learning vocabulary over time.
dtype: object

In [5]:
# View the underlying reasons for the replacement.
# "routine" was matched to both "mundane" and "quotidian", and the program correctly chose "quotidian".
ubi_vocab.highlights_df

Unnamed: 0,word,syn,orig_text,mod_text,sim_score,syn_pos_context,syn_pos,synset,lesk
0,detrimental,damaging,This is an article about the damaging effects of not learning vocabulary,This is an article about the detrimental effects of not learning vocabulary,0.983425,adjective,adjective,Synset('damaging.s.01'),Synset('damaging.s.01')
1,mundane,routine,This program seeks to create a routine exercise for learning vocabulary over time,This program seeks to create a mundane exercise for learning vocabulary over time,0.787758,adjective,adjective,Synset('everyday.s.01'),Synset('mundane.s.03')
2,quotidian,routine,This program seeks to create a routine exercise for learning vocabulary over time,This program seeks to create a quotidian exercise for learning vocabulary over time,0.930106,adjective,adjective,Synset('everyday.s.01'),Synset('everyday.s.01')
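
The table above suggests how the final choice might be made: a candidate appears to survive when the sense picked by Lesk (`lesk`) equals the sense it shares with the original word (`synset`). The sketch below replays that rule on the three rows shown. It is an inference from the output, not the code inside `UbiVocab`. The helper names are hypothetical, and breaking ties by the highest `sim_score` is an assumption added for illustration.

```python
# The three candidate rows from highlights_df, reduced to the columns used here.
candidates = [
    {"word": "detrimental", "syn": "damaging", "sim_score": 0.983425,
     "synset": "damaging.s.01", "lesk": "damaging.s.01"},
    {"word": "mundane", "syn": "routine", "sim_score": 0.787758,
     "synset": "everyday.s.01", "lesk": "mundane.s.03"},
    {"word": "quotidian", "syn": "routine", "sim_score": 0.930106,
     "synset": "everyday.s.01", "lesk": "everyday.s.01"},
]

def keep_lesk_agreeing(rows):
    """Keep candidates whose Lesk-selected sense matches the shared synset."""
    return [row for row in rows if row["lesk"] == row["synset"]]

def best_per_original(rows):
    """For each original word, keep the candidate with the highest sim_score (assumed tie-break)."""
    best = {}
    for row in rows:
        current = best.get(row["syn"])
        if current is None or row["sim_score"] > current["sim_score"]:
            best[row["syn"]] = row
    return {syn: row["word"] for syn, row in best.items()}

replacements = best_per_original(keep_lesk_agreeing(candidates))
print(replacements)
```

Under this rule "mundane" is dropped because Lesk picked `mundane.s.03` rather than `everyday.s.01`, which matches the replacement seen in `new_article`.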


# POS filtering and Lesk are currently the most effective combination.
## Below are the metrics for each method.
For future work: BERT is the state of the art for WSD. While I didn't find it as effective here, I believe I would need to train a model on these specific words and compare the average of their sentence vectors with the modified sentence.

In [6]:
master_df, metrics_df = eval_different_filters(data_dir = data_dir)

Reading existing file at ../../data/gre_vocab.csv


In [7]:
metrics_df.sort_values("f1", ascending = False)

Unnamed: 0,model,precision,recall,f1,tp,fp,fn,tn
0,pos_lesk_score,0.333333,1.0,0.5,4,8,0,36
0,pos_bert_score,0.5,0.25,0.333333,1,1,3,43
0,pos_bert_lesk_score,0.5,0.25,0.333333,1,1,3,43
0,lesk_score,0.181818,1.0,0.307692,4,18,0,26
0,pos_score,0.173913,1.0,0.296296,4,19,0,25
0,bert_wsd_score,0.2,0.25,0.222222,1,4,3,40
0,base,0.083333,1.0,0.153846,4,44,0,0
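
As a check on how the `precision`, `recall`, and `f1` columns relate to the raw counts, the sketch below recomputes them from `tp`, `fp`, and `fn` using the standard definitions. The function name `precision_recall_f1` is hypothetical, and this is not necessarily how `eval_different_filters` computes them internally.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts.

    Returns 0.0 for any metric whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Counts taken from the pos_lesk_score and base rows of the table above.
pos_lesk = precision_recall_f1(tp=4, fp=8, fn=0)
base = precision_recall_f1(tp=4, fp=44, fn=0)
print([round(value, 6) for value in pos_lesk])
print([round(value, 6) for value in base])
```

The `base` row accepts every candidate, so it keeps perfect recall but pays for it in precision. Adding the POS and Lesk filters removes 36 of the 44 false positives without losing any of the 4 true positives.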


In [8]:
# For an example of the entire labeled dataset and results, see below.
master_df

Unnamed: 0,word,syn,orig_text,mod_text,sim_score,syn_pos_context,syn_pos,label,lesk,bert_wsd,synset,pos_score,lesk_score,bert_wsd_score,pos_lesk_score,pos_bert_score,pos_bert_lesk_score,base
0,check,see,"Yet, although we see all kinds of families on television now, it seems like some people still take their idea of what a ""real"" family looks like from the 1950s sitcoms","Yet, although we check all kinds of families on television now, it seems like some people still take their idea of what a ""real"" family looks like from the 1950s sitcoms",0.988222,verb,verb,0,Synset('check.v.18'),Synset('greathearted.s.01'),Synset('see.v.10'),1,0,0,0,0,0,1
1,check,see,""" No matter how the negative versus positive comments weigh, we're happy to see that this little girl has so many people in her corner, and that her parents and their new spouses are friendly enough to show up to her soccer game wearing custom-made shirts",""" No matter how the negative versus positive comments weigh, we're happy to check that this little girl has so many people in her corner, and that her parents and their new spouses are friendly enough to show up to her soccer game wearing custom-made shirts",0.991943,verb,verb,0,Synset('match.v.01'),Synset('patronize.v.04'),Synset('see.v.10'),1,0,0,0,0,0,1
2,corroborate,support,"But, thankfully, the positive responses and support for Plaayer's family seems to outweigh the bad","But, thankfully, the positive responses and corroborate for Plaayer's family seems to outweigh the bad",0.99042,noun,verb,0,Synset('validate.v.03'),Synset('minimize.v.03'),Synset('confirm.v.01'),0,0,0,0,0,0,1
3,corroborate,support,"Two Dad's, two moms, 4 grandmas and a big ass support system,"" another wrote","Two Dad's, two moms, 4 grandmas and a big ass corroborate system,"" another wrote",0.972239,noun,verb,0,Synset('validate.v.03'),Synset('confirm.v.01'),Synset('confirm.v.01'),0,0,1,0,0,0,1
4,corroborate,support,"Then, instead of being 'petty' as you are all claiming it is, they constantly worked together to support us and make us happy","Then, instead of being 'petty' as you are all claiming it is, they constantly worked together to corroborate us and make us happy",0.959698,verb,verb,0,Synset('corroborate.v.03'),Synset('incontrovertible.s.01'),Synset('confirm.v.01'),1,0,0,0,0,0,1
5,corroborate,support,I have an amazing support system of four people's families that treat me like their own and four parents whom I can go to with anything,I have an amazing corroborate system of four people's families that treat me like their own and four parents whom I can go to with anything,0.981328,noun,verb,0,Synset('corroborate.v.03'),Synset('incontrovertible.s.01'),Synset('confirm.v.01'),0,0,0,0,0,0,1
6,detrimental,damaging,"Some people attacked her message, saying that having a child spend time with both sets of parents is confusing and damaging, and that if the parents really wanted to spend time together, they should have stayed together","Some people attacked her message, saying that having a child spend time with both sets of parents is confusing and detrimental, and that if the parents really wanted to spend time together, they should have stayed together",0.996289,adjective,adjective,1,Synset('damaging.s.01'),Synset('damaging.s.01'),Synset('damaging.s.01'),1,1,1,1,1,1,1
7,devolve,fall,"""Because of us, I will never believe co-parenting can't work! I KNOW through experience it CAN WORK! Choose to do what's best for your child and everything will just fall into place,"" Plaayer wrote when she posted the photo to both Instagram and Facebook","""Because of us, I will never believe co-parenting can't work! I KNOW through experience it CAN WORK! Choose to do what's best for your child and everything will just devolve into place,"" Plaayer wrote when she posted the photo to both Instagram and Facebook",0.998385,verb,verb,0,Synset('devolve.v.01'),Synset('greathearted.s.01'),Synset('fall.v.21'),1,0,0,0,0,0,1
8,patronize,support,"But, thankfully, the positive responses and support for Plaayer's family seems to outweigh the bad","But, thankfully, the positive responses and patronize for Plaayer's family seems to outweigh the bad",0.987273,noun,verb,0,Synset('sponsor.v.01'),Synset('confirm.v.01'),Synset('patronize.v.04'),0,0,0,0,0,0,1
9,patronize,support,"Two Dad's, two moms, 4 grandmas and a big ass support system,"" another wrote","Two Dad's, two moms, 4 grandmas and a big ass patronize system,"" another wrote",0.965925,noun,verb,0,Synset('patronize.v.04'),Synset('patronize.v.04'),Synset('patronize.v.04'),0,1,1,0,0,0,1
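
The score columns in the rows above look like simple agreement flags: `pos_score` is 1 when `syn_pos_context` equals `syn_pos`, `lesk_score` and `bert_wsd_score` are 1 when the respective sense equals `synset`, and the combined columns behave like a logical AND of their parts. The sketch below reproduces the flags for rows 3, 6, and 9 under that reading. This is inferred from the displayed data, not taken from the source of `eval_different_filters`, and the helper name `filter_flags` is hypothetical.

```python
# Rows 3, 6 and 9 of master_df, reduced to the columns the flags depend on.
rows = [
    {"id": 3, "syn_pos_context": "noun", "syn_pos": "verb",
     "lesk": "validate.v.03", "bert_wsd": "confirm.v.01", "synset": "confirm.v.01"},
    {"id": 6, "syn_pos_context": "adjective", "syn_pos": "adjective",
     "lesk": "damaging.s.01", "bert_wsd": "damaging.s.01", "synset": "damaging.s.01"},
    {"id": 9, "syn_pos_context": "noun", "syn_pos": "verb",
     "lesk": "patronize.v.04", "bert_wsd": "patronize.v.04", "synset": "patronize.v.04"},
]

def filter_flags(row):
    """Derive the individual and combined filter flags for one candidate row."""
    pos = int(row["syn_pos_context"] == row["syn_pos"])
    lesk = int(row["lesk"] == row["synset"])
    bert = int(row["bert_wsd"] == row["synset"])
    return {
        "pos_score": pos,
        "lesk_score": lesk,
        "bert_wsd_score": bert,
        "pos_lesk_score": pos & lesk,        # both POS and Lesk must agree
        "pos_bert_score": pos & bert,        # both POS and BERT must agree
        "pos_bert_lesk_score": pos & lesk & bert,
    }

flags = {row["id"]: filter_flags(row) for row in rows}
for row_id, row_flags in flags.items():
    print(row_id, row_flags)
```

Row 9 shows why the POS filter matters: both Lesk and BERT agree on `patronize.v.04`, but "support" is used as a noun in that sentence, so the combined flags reject the replacement.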
