GloVe processes a text corpus and associates each word with a vector.
For this notebook, the corpus was built from the entire 2017-03-08 it.wikipedia dump, 1000 assorted Italian ebooks and 1.6M chat conversations.

The compressed Wikipedia dump was parsed using `plaintext_from_wikidump.py` (from [here](https://github.com/jacopofar/markov-avro-tools)), ebook text was extracted with calibre, and the conversations came from [IngestAdiumLogs](https://github.com/jacopofar/adium-to-avro).
Punctuation was removed and the text was lowercased using `remove_punctuation.py`, then GloVe was executed on the resulting file of ~3 GB with these settings:
* vector size: 100
* ignore tokens appearing fewer than 15 times
* 15 iterations
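
The normalization step can be sketched as follows. This is not the actual `remove_punctuation.py`, only a minimal illustration of lowercasing and punctuation removal; the function name `normalize_line` is hypothetical, and replacing punctuation with a space (rather than deleting it) is an assumption made here so that elided forms such as `l'elefante` stay split into two tokens.

```python
import string

# Map every ASCII punctuation character to a space.
# Assumption: the real script may handle apostrophes or non-ASCII
# punctuation differently.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def normalize_line(line):
    """Lowercase a line, blank out ASCII punctuation, collapse whitespace."""
    cleaned = line.lower().translate(_PUNCT_TO_SPACE)
    return ' '.join(cleaned.split())

print(normalize_line("Il gatto, il topo e l'elefante!"))
```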


In [1]:
import math
# log of the squared euclidean distance; the log gives more readable and rounding-friendly values
# log is monotonic, so the value can still be used to compare distances
# the square root is skipped for the same reason: it is monotonic too
def log_distance(vec1, vec2):
    sq_sum = sum(map(lambda t: (t[0] - t[1]) ** 2, zip(vec1,vec2)))
    if sq_sum == 0:
        return -float('inf')
    return math.log(sq_sum)
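
A quick check of the monotonicity argument in the comments above: ranking candidates by the log of the squared distance should give the same order as ranking them by the plain Euclidean distance. The vectors below are made-up toy values, not GloVe output, and the helper names are only for this sketch.

```python
import math

def euclidean(vec1, vec2):
    # Plain Euclidean distance, square root included.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(vec1, vec2)))

def log_sq_distance(vec1, vec2):
    # Same quantity as log_distance above: log of the sum of squared differences.
    sq_sum = sum((x - y) ** 2 for x, y in zip(vec1, vec2))
    return -math.inf if sq_sum == 0 else math.log(sq_sum)

# Toy 3-dimensional vectors, invented for illustration.
target = [0.0, 0.0, 0.0]
candidates = {'a': [1.0, 0.0, 0.0], 'b': [0.0, 2.0, 2.0], 'c': [0.5, 0.5, 0.0]}

by_euclid = sorted(candidates, key=lambda w: euclidean(candidates[w], target))
by_log = sorted(candidates, key=lambda w: log_sq_distance(candidates[w], target))
print(by_euclid, by_log)
```

For any non-zero distance `d`, the value computed here is `2 * log(d)`, which is why the two orderings agree.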

In [2]:
# efficient memory representation of float arrays
from array import array
words_repr = {}
# each line holds a word followed by its space-separated vector components
with open('/Users/utente/Documents/vector_it_100.txt', 'r') as vectors_file:
    for count, line in enumerate(vectors_file):
        elems = line.split(' ')
        words_repr[elems[0]] = array('f', map(float, elems[1:]))
        if count % 50000 == 0:
            print(f'read {count} lines')

read 0 lines
read 50000 lines
read 100000 lines
read 150000 lines
read 200000 lines
read 250000 lines
read 300000 lines
read 350000 lines


Let's see how it works for sample words. As expected, the in-set distances among animals (gatto, topo, elefante) and among cities (milano, torino) are smaller than the cross-set ones.

In [3]:
def word_log_distance(word1, word2):
    return log_distance(words_repr[word1], words_repr[word2])

words = ['elefante', 'topo', 'gatto', 'milano', 'torino']

for a in words:
    for b in words:
        print(a)
        print(b)
        couple_distance = word_log_distance(a,b)
        print(f'the distance between {a} and {b} is {couple_distance}')

elefante
elefante
the distance between elefante and elefante is -inf
elefante
topo
the distance between elefante and topo is 2.846488685043265
elefante
gatto
the distance between elefante and gatto is 2.794266804542366
elefante
milano
the distance between elefante and milano is 3.9243535590223995
elefante
torino
the distance between elefante and torino is 3.7918914766441203
topo
elefante
the distance between topo and elefante is 2.846488685043265
topo
topo
the distance between topo and topo is -inf
topo
gatto
the distance between topo and gatto is 2.3345286151252216
topo
milano
the distance between topo and milano is 4.046828277989919
topo
torino
the distance between topo and torino is 3.9147062631693927
gatto
elefante
the distance between gatto and elefante is 2.794266804542366
gatto
topo
the distance between gatto and topo is 2.3345286151252216
gatto
gatto
the distance between gatto and gatto is -inf
gatto
milano
the distance between gatto and milano is 3.9327211009406557
gatto
torin
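
To put a rough number on the gap, the sketch below averages the values printed above, rounded to four decimals. The pairs whose values are cut off in the output (such as gatto/torino and milano/torino) are left out rather than guessed. Since each value is the log of a squared distance, half the difference between two values is the log of the ratio of the corresponding Euclidean distances.

```python
import math
from statistics import mean

# Log-distances copied from the printed output above, rounded to 4 decimals.
in_set_animals = {
    ('elefante', 'topo'): 2.8465,
    ('elefante', 'gatto'): 2.7943,
    ('topo', 'gatto'): 2.3345,
}
cross_set = {
    ('elefante', 'milano'): 3.9244,
    ('elefante', 'torino'): 3.7919,
    ('topo', 'milano'): 4.0468,
    ('topo', 'torino'): 3.9147,
    ('gatto', 'milano'): 3.9327,
}

mean_in = mean(in_set_animals.values())
mean_cross = mean(cross_set.values())
# Averaging logs corresponds to a geometric mean, so this is the ratio
# between the geometric-mean cross-set and in-set Euclidean distances.
distance_ratio = math.exp((mean_cross - mean_in) / 2)

print(f'mean in-set: {mean_in:.3f}, mean cross-set: {mean_cross:.3f}')
print(f'cross-set pairs are about {distance_ratio:.2f} times farther apart')
```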

Now let's look for the vectors closest to a given one.

In [4]:
def closest(vec, ignorelist=()):
    # linear scan over the whole vocabulary, keeping the smallest distance seen so far
    min_dist_so_far = float('inf')
    closest_word = None
    closest_vec = None
    for w,v in words_repr.items():
        if w in ignorelist:
            continue
        this_dist = log_distance(v, vec)
        if this_dist < min_dist_so_far:
            min_dist_so_far = this_dist
            # print(f'closest so far {w}, at distance {min_dist_so_far}')
            closest_word = w
            closest_vec = v
    return closest_word, closest_vec, min_dist_so_far


def difference(vec1, vec2):
    return array('f', map(lambda t: t[0] - t[1], zip(vec1,vec2)))

def addition(vec1, vec2):
    return array('f', map(lambda t: t[0] + t[1], zip(vec1,vec2)))

# "input1 is to output1 what input2 is to the return value"
# e.g.: paris is to france what rome is to italy - for (paris, france, rome) the returned vector should be close to italy
def word_analogy(input1, output1, input2):
    return addition(words_repr[input2], difference(words_repr[output1], words_repr[input1]))


#print(closest(word_analogy('roma', 'parigi', 'italia')))
a1 = 'mano'
a2 = 'braccio'
b1 = 'piede'
b2 = 'gamba'

v_a1 = words_repr[a1]
v_a2 = words_repr[a2]
v_b1 = words_repr[b1]
v_b2 = words_repr[b2]


print(f'{a1} vector: {v_a1}')
print(f'{a2} vector: {v_a2}')
print(f'{b1} vector: {v_b1}')
print(f'{b2} vector: {v_b2}')


ana_vector = word_analogy(a1, a2, b1)

print('analogy', ana_vector)
print(f'a1 from ana: {log_distance(ana_vector, v_a1)}')
print(f'a2 from ana: {log_distance(ana_vector, v_a2)}')
print(f'b1 from ana: {log_distance(ana_vector, v_b1)}')
print(f'b2 from ana: {log_distance(ana_vector, v_b2)}')
print('looking for words closest to analogy vector...')

ignoreus = [a1, a2, b1]
# report the three closest words, excluding the analogy inputs and the earlier matches
for _ in range(3):
    closest_tuple = closest(ana_vector, ignorelist=ignoreus)
    print(f'MOST LIKELY MATCH: {closest_tuple[0]} with distance {closest_tuple[2]} - is to {b1} what {a2} is to {a1}')
    ignoreus.append(closest_tuple[0])

mano vector: array('f', [0.6919599771499634, 0.045556001365184784, 0.5272369980812073, 0.486378014087677, -0.3290340006351471, 0.5753369927406311, 0.3282040059566498, 0.14870400726795197, 0.6593790054321289, -0.4817349910736084, 0.21461600065231323, -0.602387011051178, -0.06331700086593628, 0.2917340099811554, -0.34991100430488586, 0.007501999847590923, -0.22546599805355072, 0.7400839924812317, 0.6441119909286499, 0.7058770060539246, -0.39807599782943726, -0.07347699999809265, -0.023214999586343765, -0.47804999351501465, 0.4521610140800476, -0.4790300130844116, 0.19309300184249878, 0.27258700132369995, -0.9146460294723511, 0.34329500794410706, -0.6239799857139587, -0.04706500098109245, -0.33723101019859314, 0.36100199818611145, -0.5046340227127075, -0.5703129768371582, -0.6465820074081421, -0.09659499675035477, 0.6791830062866211, 0.03311900049448013, 0.6746820211410522, 0.17448699474334717, -0.8768200278282166, 0.42315199971199036, -0.44227099418640137, -0.6582059860229492, 0.31212401
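
The arithmetic behind `word_analogy` is easier to follow on vectors small enough to check by hand. The 2-dimensional values below are invented for illustration (first coordinate standing for the country, second for "is a capital") and are not GloVe output; `analogy_vector` and `nearest` are simplified stand-ins for the functions in the cell above.

```python
# Toy vocabulary with hand-picked 2-d vectors (an assumption, not real embeddings).
toy = {
    'italia': [1.0, 0.0],
    'roma': [1.0, 1.0],
    'francia': [2.0, 0.0],
    'parigi': [2.0, 1.0],
    'berlino': [3.0, 1.0],
}

def analogy_vector(input1, output1, input2):
    # input2 + (output1 - input1), computed component by component.
    return [c + (b - a) for a, b, c in zip(toy[input1], toy[output1], toy[input2])]

def sq_distance(vec1, vec2):
    return sum((x - y) ** 2 for x, y in zip(vec1, vec2))

def nearest(vec, exclude=()):
    # Closest vocabulary word to vec, skipping the excluded words.
    candidates = (w for w in toy if w not in exclude)
    return min(candidates, key=lambda w: sq_distance(toy[w], vec))

# roma is to italia what parigi is to ...?
target = analogy_vector('roma', 'italia', 'parigi')
print(target, nearest(target, exclude={'roma', 'italia', 'parigi'}))
```

Excluding the three input words matters: with real embeddings the analogy vector often lands closest to one of its own inputs, which is why the cell above passes them in `ignorelist`.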

In [6]:
import time
import operator
def distances_dict(target_vector):
    result = {}
    for w,v in words_repr.items():
        this_dist = log_distance(v, target_vector)
        result[w] = this_dist
    return result

start_time = time.time()
target_word = 'cubo'
target_vector = words_repr[target_word]
print('retrieving distance dictionary...')
dists = distances_dict(target_vector)
# sort KV pairs by ascending distance, so the closest words come first
sorted_dists = sorted(dists.items(), key=operator.itemgetter(1))
elapsed_time = round(time.time() - start_time)
for k in range(20):
    print(f'{k} - CLOSEST TO {target_word}: {sorted_dists[k][0]} with distance {sorted_dists[k][1]}, it took {elapsed_time} seconds')

#closest_tuple = closest(words_repr[a_word], ignorelist=[a_word, 'torino'])

retrieving distance dictionary...
0 - CLOSEST TO cubo: cubo with distance -inf, it took 36 seconds
1 - CLOSEST TO cubo: parallelepipedo with distance 2.5780685544093145, it took 36 seconds
2 - CLOSEST TO cubo: quadrato with distance 2.683741353846279, it took 36 seconds
3 - CLOSEST TO cubo: centimetro with distance 2.71290363888691, it took 36 seconds
4 - CLOSEST TO cubo: rettangolo with distance 2.7437298297038977, it took 36 seconds
5 - CLOSEST TO cubo: sferico with distance 2.7442105104057255, it took 36 seconds
6 - CLOSEST TO cubo: millimetro with distance 2.7760517644305347, it took 36 seconds
7 - CLOSEST TO cubo: avente with distance 2.813669030830409, it took 36 seconds
8 - CLOSEST TO cubo: mancante with distance 2.8395540458643462, it took 36 seconds
9 - CLOSEST TO cubo: moltiplicato with distance 2.842290351403379, it took 36 seconds
10 - CLOSEST TO cubo: cubi with distance 2.843216477237865, it took 36 seconds
11 - CLOSEST TO cubo: posizionepaese20px20px20pxtotale with distan
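
The cell above sorts the whole vocabulary only to read the first 20 entries. A possible alternative is `heapq.nsmallest`, which keeps just the `k` best candidates while scanning. This is a sketch on made-up 1-dimensional vectors with a hypothetical helper `k_closest`; how much time it would save on the real data is untested, since most of the 36 seconds is probably spent computing distances in pure Python rather than sorting.

```python
import heapq
import math

def log_sq_distance(vec1, vec2):
    # Same quantity as log_distance: log of the sum of squared differences.
    sq_sum = sum((x - y) ** 2 for x, y in zip(vec1, vec2))
    return -math.inf if sq_sum == 0 else math.log(sq_sum)

def k_closest(target_vector, vectors, k):
    # Lazily score every word, keeping only the k smallest (distance, word) pairs.
    scored = ((log_sq_distance(v, target_vector), w) for w, v in vectors.items())
    return heapq.nsmallest(k, scored)

# Toy 1-dimensional vectors, invented for illustration.
toy = {'cubo': [0.0], 'quadrato': [1.0], 'sfera': [3.0], 'milano': [10.0]}

for dist, word in k_closest(toy['cubo'], toy, 2):
    print(f'{word} with distance {dist}')
```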