## Using word embeddings to detect new terms from a list of seed terms

### Workspace requirements
- Python 3.7+ (tested with Python 3.7.10)
- [fasttext](https://fasttext.cc/docs/en/python-module.html) module for Python
- [gensim](https://radimrehurek.com/gensim/) (tested with version 3.8.3)
- [numpy](https://numpy.org/) (tested with version 1.19.5)

In [1]:
# Import modules
import re
import sys
import os
import fasttext
import numpy as np

from collections import defaultdict
import gensim 
from gensim import corpora
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

### Initialize a model

Load the two pre-trained model files: the binary file (``.bin``), which includes subword information, and the plain-text vector file (``.vec``), which holds one vector per word seen in the corpus.

The embeddings were trained on a corpus of more than 14 million tokens of texts related to the COVID-19 pandemic.
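
To make "subword information" concrete, here is a minimal sketch of how a word can be split into character n-grams with `<` and `>` boundary markers, which is the idea fastText relies on to build vectors for unseen words. The function name `char_ngrams` and the `minn`/`maxn` values are assumptions for illustration; the n-gram lengths actually used to train these embeddings are not stated in this notebook.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(minn, maxn + 1):
        # Slide a window of length n over the wrapped word
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

# Hypothetical n-gram lengths (3 to 4), chosen only to keep the output short
print(char_ngrams("epi", minn=3, maxn=4))
# ['<ep', 'epi', 'pi>', '<epi', 'epi>']
```

An out-of-vocabulary word such as `arbidol` shares n-grams like `arb` with words that were seen during training, which is why the binary model can still produce a vector for it.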

In [2]:
vec_filename = "fasttext-covid19-embeddings/covid-19-embeddings-es-d100-min5-uncased.vec"
bin_filename = "fasttext-covid19-embeddings/covid-19-embeddings-es-d100-min5-uncased.bin"
model_word2vec = gensim.models.KeyedVectors.load_word2vec_format(vec_filename)
model_fasttext = fasttext.load_model(bin_filename)
print("Model loaded!")

2022-04-26 17:36:05,955 : INFO : loading projection weights from fasttext-covid19-embeddings/covid-19-embeddings-es-d100-min5-uncased.vec
2022-04-26 17:36:08,296 : INFO : loaded (32676, 100) matrix from fasttext-covid19-embeddings/covid-19-embeddings-es-d100-min5-uncased.vec


Model loaded!
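
The `(32676, 100)` shape in the log corresponds to the header of the ``.vec`` file, which uses the word2vec text format: a first line with the vocabulary size and the dimension, then one word and its values per line. The following is a minimal sketch of a reader for that format, run on a tiny in-memory example; `read_vec` and the toy values are hypothetical and are not part of gensim or fastText.

```python
import io

def read_vec(handle):
    """Parse word2vec text format: a 'vocab_size dim' header, then one word plus its floats per line."""
    vocab_size, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        # rstrip() also removes the trailing space some writers leave at the end of each line
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError(f"header announced {vocab_size} words, found {len(vectors)}")
    return vectors

# Toy file content: 2 words, 3 dimensions (invented values)
toy_vec = io.StringIO("2 3\ncovid-19 0.1 0.2 0.3 \nwuhan 0.4 0.5 0.6 \n")
vectors = read_vec(toy_vec)
print(len(vectors), len(vectors["wuhan"]))
# 2 3
```

In practice `load_word2vec_format` does this work; the sketch only shows why the ``.vec`` file can answer queries for in-vocabulary words but not for unseen ones.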


In [3]:
# Define the number of nearest neighbours to check
k = 10

In [4]:
# Open the list of "seed terms" to check
termList = []

with open("covid-19_terms",'r',encoding='utf8') as f:
    termList = f.readlines()
    termList = [term.lower().strip() for term in termList]
    
print(termList)

['arbidol', 'camrelizumab', 'covid-19', 'coronavirus', 'confinamiento', 'cuarentena', 'colchicina', 'danoprevir', 'epi', 'epp', 'hidroxicloroquina', 'favipiravir', 'ffp2', 'leronlimab', 'n95', 'opaganib', 'remdesivir', 'sars-cov-2', 'umifenovir', 'wuhan']
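
If the seed file may contain blank lines or repeated terms, a slightly more defensive normalisation can help, since an empty string would otherwise be queried like any other term. This is a minimal sketch; `clean_terms` is a hypothetical helper, not something the notebook defines.

```python
def clean_terms(lines):
    """Lowercase and strip each line, dropping blanks and repeated terms while keeping order."""
    seen = set()
    terms = []
    for line in lines:
        term = line.strip().lower()
        if term and term not in seen:
            seen.add(term)
            terms.append(term)
    return terms

# Invented input lines standing in for the content of the seed-term file
print(clean_terms(["Arbidol\n", "\n", "COVID-19\n", "arbidol\n", "  wuhan "]))
# ['arbidol', 'covid-19', 'wuhan']
```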


In [5]:
# File to write the results to (closed automatically at the end of the block)
with open('10-nearest-neighbours', 'w', encoding='utf8') as outFile:
    # Get the k nearest neighbours of each seed term
    for new_word in termList:
        print("\n" + new_word, file=outFile)
        print("\n" + new_word)
        try:
            # 1) Look the word up in the word2vec-format vectors (words already seen in the corpus)
            neighbours = model_word2vec.most_similar(new_word, topn=int(k))
        except KeyError:
            # 2) Out-of-vocabulary (OOV) word: use fastText to build an embedding from subword information
            print("Word not in corpus!", file=outFile)
            print("Word not in corpus!")
            # fastText returns a vector of the model's dimension (100 here), even for unseen words
            word_vector = model_fasttext[new_word]
            vector = np.array(word_vector, dtype='f')
            neighbours = model_word2vec.most_similar([vector], topn=int(k))
        # most_similar already returns (word, similarity) pairs sorted from nearest to furthest
        for word, similarity in neighbours:
            print(word, str(round(similarity, 4)), file=outFile)
            print(word, str(round(similarity, 4)))

2022-04-26 17:34:44,245 : INFO : precomputing L2-norms of word weight vectors



arbidol
Word not in corpus!


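NameError: name 'np' is not defined

This output comes from a run in which `np` was not defined; executing the import cell (`In [1]`) before this cell avoids the error.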
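
For readers unfamiliar with what a nearest-neighbour query computes, here is a minimal sketch that ranks words by cosine similarity to a query vector. The vectors are invented two-dimensional toys and `cosine` and `nearest_neighbours` are hypothetical helpers; gensim's `most_similar` is the real implementation used in this notebook and works on the full 100-dimensional matrix.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbours(query, vectors, topn=10):
    """Return the `topn` (word, similarity) pairs, most similar first."""
    scored = [(word, cosine(query, vec)) for word, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy vectors, invented for illustration only
toy_vectors = {
    "cuarentena": [1.0, 0.0],
    "confinamiento": [0.9, 0.1],
    "remdesivir": [0.0, 1.0],
}
for word, score in nearest_neighbours([1.0, 0.0], toy_vectors, topn=2):
    print(word, round(score, 4))
```

Unlike this sketch, `most_similar` leaves the query word out of the results when it is called with a word rather than a raw vector.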

### Try defining a larger number of nearest neighbours to check

In [13]:
k = 50

In [15]:
# File to write the results to (closed automatically at the end of the block)
with open('50-nearest-neighbours', 'w', encoding='utf8') as outFile:
    # Get the k nearest neighbours of each seed term
    for new_word in termList:
        print("\n" + new_word, file=outFile)
        print("\n" + new_word)
        try:
            # 1) Look the word up in the word2vec-format vectors (words already seen in the corpus)
            neighbours = model_word2vec.most_similar(new_word, topn=int(k))
        except KeyError:
            # 2) Out-of-vocabulary (OOV) word: use fastText to build an embedding from subword information
            print("Word not in corpus!", file=outFile)
            print("Word not in corpus!")
            # fastText returns a vector of the model's dimension (100 here), even for unseen words
            word_vector = model_fasttext[new_word]
            vector = np.array(word_vector, dtype='f')
            neighbours = model_word2vec.most_similar([vector], topn=int(k))
        # most_similar already returns (word, similarity) pairs sorted from nearest to furthest
        for word, similarity in neighbours:
            print(word, str(round(similarity, 4)), file=outFile)
            print(word, str(round(similarity, 4)))


arbidol
Word not in corpus!
arb 0.7704
valsartán 0.7296
enalapril 0.7241
vafidemstat 0.7213
764198 0.7165
cds 0.7158
ssc 0.6991
—un 0.697
dfv890 0.6944
ara2 0.6922
betadex 0.6901
davoudi-monfared 0.689
cosin 0.6886
hippisley-cox 0.688
acei 0.6875
pioglitazona 0.6866
renin-angiotensin-aldosterone 0.6852
comorbidities 0.6836
● 0.6836
imatinib 0.6825
pirfenidona 0.6814
gautret 0.6783
carbamazepina 0.6752
ethyl 0.6747
jak2 0.6746
sevoflurane 0.6739
lisinopril 0.6729
ruxolitinib 0.6722
losartán 0.6719
cannabidiol 0.6704
fbi 0.6697
acetaminofén 0.6694
baricitinib 0.6694
ara 0.6685
avifavir 0.6679
hydroxychloroquine 0.6671
corticosteroids 0.6665
beta-1b 0.6664
acetilsalicílico 0.6659
recalled 0.6656
pitavastatina 0.6653
behalf 0.665
retinol 0.6648
araii 0.6647
gastaminza 0.6636
ascórbico 0.6634
renin-angiotensin 0.6626
fatality 0.6626
klok 0.662
aloe 0.6618

camrelizumab
Word not in corpus!
ravulizumab 0.9516
vedolizumab 0.9464
eculizumab 0.9443
bevacizumab 0.942
tocilizumab 0.9348
pembroliz