In [1]:
import numpy as np
import pandas as pd
import datetime 

### Counter-fitted Vectors

TextFooler finds synonyms using the counter-fitted vectors of N. Mrkšić et al. (2016), word embeddings that have been adjusted with synonym and antonym constraints.  The source is at **https://github.com/nmrksic/counter-fitting**.  The extracted counter-fitted-vectors.txt needs preprocessing because it contains the raw words 'nan' and 'null', which pandas reads as missing values (NaN) rather than as strings.  The file holds 65713 words, each represented as a 300-dimensional counter-fitted vector.

In [2]:
# read counter-fitted vectors
cf = pd.read_table('counter-fitted-vectors.txt', header=None, sep=' ')
cf.set_index(0, inplace=True)
cf.shape

(65713, 300)

In [3]:
# check for any non-string index entries
[idx for idx, w in enumerate(cf.index) if not isinstance(w, str)]

[6167, 55574]

In [4]:
# check raw text
temp = pd.read_table('counter-fitted-vectors.txt', header=None)

In [5]:
temp.loc[[6167, 55574]] # 'nan' and 'null'

Unnamed: 0,0
6167,null 0.054147 0.062974 -0.018208 -0.000571 0.0...
55574,nan 0.034078 0.007384 0.013994 0.022337 -0.051...


In [6]:
# restore the two words that were parsed as NaN, then re-check
new_index = list(cf.index)
new_index[6167] = 'null'
new_index[55574] = 'nan'
cf.index = new_index
[idx for idx, w in enumerate(cf.index) if not isinstance(w, str)]

[]
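
The manual fix above could likely be avoided at read time. The following is a minimal sketch, not what this notebook ran, assuming the file is plain space-separated text: passing `keep_default_na=False` to `pd.read_csv` tells pandas not to treat strings such as 'null' and 'nan' as missing values. The three sample rows are made up for illustration.

```python
import io
import pandas as pd

# Three made-up rows in the same space-separated layout as the vectors file.
sample = io.StringIO("null 0.1 0.2\nnan 0.3 0.4\ncat 0.5 0.6\n")

# keep_default_na=False stops pandas from converting 'null' and 'nan' to NaN.
demo = pd.read_csv(sample, sep=' ', header=None, keep_default_na=False)
demo = demo.set_index(0)

words = demo.index.tolist()
all_strings = all(isinstance(w, str) for w in words)
```

With this option the index stays all-string, so no positions need to be patched afterwards.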

Synonym candidates for a target word are the words whose vectors have the largest (most positive) dot products with the target word's vector.  The synonyms are also selected to match the part of speech, e.g. adjectives.  Part of speech is considered for single words only.
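
The ranking cells in this notebook take the slice `[1:51]`, which skips the first ranked row on the assumption that it is the target word itself. That holds when all vectors have the same length, but not necessarily for raw dot products in general. The toy example below, with made-up 2-d vectors, is only an illustration of that caveat; whether the vector files used here are unit-normalized is not stated in this notebook.

```python
import numpy as np

# Toy, made-up 2-d vectors: 'big' has a much larger norm than 'small'.
words = ['small', 'big', 'other']
vecs = np.array([[1.0, 0.0],
                 [3.0, 0.5],
                 [0.0, 1.0]])

# Raw dot products against the vector for 'small'.
raw_scores = vecs @ vecs[0]
raw_top = words[int(np.argmax(raw_scores))]

# After scaling rows to unit length, the dot product equals cosine similarity.
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
cos_scores = unit @ unit[0]
cos_top = words[int(np.argmax(cos_scores))]
```

Here the raw dot product ranks 'big' above 'small' for the target 'small', while the normalized version ranks 'small' first. Checking the row norms before relying on the `[1:51]` slice would be a cheap safeguard.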

In [7]:
word2index_cf = dict(zip(cf.index.tolist(), range(len(cf))))
index2word_cf = dict(enumerate(cf.index.tolist()))

In [8]:
st = datetime.datetime.now()

In [9]:
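cfv = cf.copy()
cfv['vocab sn'] = np.arange(len(cf))  # serial number of each word in the vocabulary

with open('nn_matrix_cf.txt', 'w') as f:
    for idx, target_word in enumerate(cf.index):
        # dot product of every word vector with the target word vector
        target_wv = cf.loc[target_word]
        cfv['dot product'] = cf.dot(target_wv)

        # ranks 1..50; rank 0 is assumed to be the target word itself
        top50syn = cfv.sort_values(by='dot product', ascending=False)[1:51].index.tolist()
        t = cfv.loc[top50syn]['vocab sn'].values.astype('str')

        # one comma-separated line of 50 serial numbers per target word
        f.write(','.join(t) + '\n')

        # progress marker every 3000 words
        if (idx+1) % 3000 == 0:
            print('.', end='')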

.....................

In [10]:
et = datetime.datetime.now()
et-st

datetime.timedelta(seconds=12456, microseconds=808043)
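
The run above took 12456 seconds (about 3.5 hours), largely because each of the 65713 iterations sorts the full DataFrame. As a rough sketch of a faster alternative, not the method used in this notebook, the same kind of top-k ranking can be computed block-wise in NumPy with one matrix product per block and `np.argpartition` in place of a full sort. The helper name `top_k_neighbours`, the `block_size` parameter, and the random toy matrix are all hypothetical. Unlike the loop above, this version excludes each word by its own index rather than dropping whichever row ranks first.

```python
import numpy as np

def top_k_neighbours(matrix, k, block_size=1024):
    """Return an (n, k) array of row indices with the largest dot products,
    excluding each row's own index. Hypothetical helper, not from the notebook."""
    n = matrix.shape[0]
    result = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Dot products of this block of rows with every row: shape (block, n).
        scores = matrix[start:stop] @ matrix.T
        # Exclude each word from its own neighbour list.
        scores[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # Unordered top-k candidates, then order them by descending score.
        cand = np.argpartition(-scores, k - 1, axis=1)[:, :k]
        cand_scores = np.take_along_axis(scores, cand, axis=1)
        order = np.argsort(-cand_scores, axis=1)
        result[start:stop] = np.take_along_axis(cand, order, axis=1)
    return result

rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 16))  # made-up stand-in for the embedding matrix
neighbours = top_k_neighbours(toy, k=5, block_size=64)
```

Applied to the real data, the call would look like `top_k_neighbours(cf.to_numpy(), k=50)`, and the resulting integer array could be written in the same comma-separated layout with `np.savetxt(path, result, fmt='%d', delimiter=',')`. The block size trades memory for speed: each block holds `block_size * n` floats at once.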

### BERT Vectors

In [8]:
# read BERT vectors
bertv = pd.read_table('bertvocab.txt', header=None, sep=',')
bertv.index = new_index
bertv.shape

(65713, 768)

In [9]:
st = datetime.datetime.now()

In [10]:
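bertv_copy = bertv.copy()
bertv_copy['vocab sn'] = np.arange(len(bertv))  # serial number of each word in the vocabulary

with open('nn_matrix_bert.txt', 'w') as f:
    for idx, target_word in enumerate(bertv.index):
        # dot product of every word vector with the target word vector
        target_wv = bertv.loc[target_word]
        bertv_copy['dot product'] = bertv.dot(target_wv)

        # ranks 1..50; rank 0 is assumed to be the target word itself
        top50syn = bertv_copy.sort_values(by='dot product', ascending=False)[1:51].index.tolist()
        t = bertv_copy.loc[top50syn]['vocab sn'].values.astype('str')

        # one comma-separated line of 50 serial numbers per target word
        f.write(','.join(t) + '\n')

        # progress marker every 3000 words
        if (idx+1) % 3000 == 0:
            print('.', end='')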

.....................

In [11]:
et = datetime.datetime.now()
et-st

datetime.timedelta(seconds=26170, microseconds=219192)

### CLIP Vectors

In [12]:
# read CLIP vectors
clipv = pd.read_table('clipvocab.txt', header=None, sep=',')
clipv.index = new_index
clipv.shape

(65713, 512)

In [13]:
st = datetime.datetime.now()

In [14]:
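clipv_copy = clipv.copy()
clipv_copy['vocab sn'] = np.arange(len(clipv))  # serial number of each word in the vocabulary

with open('nn_matrix_clip.txt', 'w') as f:
    for idx, target_word in enumerate(clipv.index):
        # dot product of every word vector with the target word vector
        target_wv = clipv.loc[target_word]
        clipv_copy['dot product'] = clipv.dot(target_wv)

        # ranks 1..50; rank 0 is assumed to be the target word itself
        top50syn = clipv_copy.sort_values(by='dot product', ascending=False)[1:51].index.tolist()
        t = clipv_copy.loc[top50syn]['vocab sn'].values.astype('str')

        # one comma-separated line of 50 serial numbers per target word
        f.write(','.join(t) + '\n')

        # progress marker every 3000 words
        if (idx+1) % 3000 == 0:
            print('.', end='')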

.....................

In [15]:
et = datetime.datetime.now()
et-st

datetime.timedelta(seconds=18865, microseconds=54035)