## LT2304 Language Technology Resources: Assignment on word embeddings

#### Míriam Sánchez Alcón

In [1]:
from gensim.models.word2vec import Word2Vec
from multiprocessing import cpu_count
import gensim.downloader as api

In [2]:
# Download dataset
dataset = api.load("text8")
data = [d for d in dataset]

In [3]:
# Split the data into 2 parts. Part 2 will be used later to update the model
data_part1 = data[:1000]
data_part2 = data[1000:]

In [4]:
# Train a Word2Vec model. The default vector size is 100
model = Word2Vec(data_part1, min_count=0, workers=cpu_count())

In [5]:
# Get the word vector for given word
# model.wv['boat']

In [6]:
# Save and Load Model
model.save('newmodel')
model = Word2Vec.load('newmodel')

### 1.1 Nearest neighbours

#### For each of the closest neighbors, specify what linguistic relation exists. Is it semantic, morphological, spelling difference, or other? Specify beyond these categories when possible (for instance, what specific kinds of semantic or morphological relation).

(answered individually after each example)

#### What do all the relations compiled above have in common that makes word2vec be able to capture them?

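What they have in common is distributional: word2vec learns each vector from the words that surround a target word, so words that occur in similar contexts end up close together in the vector space. Near-synonyms, antonyms, and inflected or derived forms of the same stem all tend to appear in the same kinds of contexts, which is why the model captures these relations without any explicit knowledge of meaning or morphology.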
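To make "close together in the vector space" concrete, here is a minimal sketch of ranking neighbors by cosine similarity. The three-dimensional vectors are invented for illustration and are not taken from the trained model, and `cosine` and `nearest` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, invented for illustration only
toy_vectors = {
    "create": [0.9, 0.1, 0.2],
    "develop": [0.8, 0.2, 0.3],
    "stockholm": [0.1, 0.9, 0.1],
}

def nearest(word, vectors, topn=2):
    """Rank every other word by its cosine similarity to `word`."""
    scores = [(w, cosine(vectors[word], v)) for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

neighbours = nearest("create", toy_vectors)
print(neighbours)
```

With these toy values 'develop' comes out ahead of 'stockholm', because its vector points in almost the same direction as the one for 'create'.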

In [7]:
model.wv.most_similar_cosmul('create', topn=8)

[('develop', 0.9094280004501343),
 ('introduce', 0.8985365629196167),
 ('exploit', 0.8838707208633423),
 ('construct', 0.8829323649406433),
 ('adapt', 0.878865122795105),
 ('build', 0.878303050994873),
 ('enable', 0.8764412999153137),
 ('acquire', 0.8740254640579224)]

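The closest neighbors to 'create' are other verbs with a related meaning (roughly 'to bring something into existence or make use of it'), so the relation is semantic: near-synonyms such as 'develop', 'construct' and 'build'. They also share the same morphological form, the base form of the verb.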

In [8]:
model.wv.most_similar_cosmul('good', topn=8)

[('bad', 0.883431613445282),
 ('really', 0.8022475242614746),
 ('quick', 0.8021515607833862),
 ('luck', 0.8018017411231995),
 ('poor', 0.8013737797737122),
 ('everyone', 0.7982611656188965),
 ('my', 0.7962745428085327),
 ('you', 0.7953109741210938)]

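The closest neighbors to 'good' are mixed. 'bad' and 'poor' are antonyms (a semantic relation), and 'quick' is another evaluative adjective. 'really' and 'luck' look like collocates ('really good', 'good luck'), and the pronouns 'everyone', 'my' and 'you' probably reflect the informal contexts in which 'good' is used rather than any similarity in meaning.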

In [9]:
model.wv.most_similar_cosmul('king', topn=8)

[('emperor', 0.9077165126800537),
 ('throne', 0.9017484188079834),
 ('prince', 0.8884717226028442),
 ('elector', 0.8861269950866699),
 ('judah', 0.8861194849014282),
 ('queen', 0.8846652507781982),
 ('sultan', 0.8791512250900269),
 ('constantine', 0.8728420734405518)]

The closest neighbors to 'king' are nouns that are semantically related to it: equivalent titles in other cultures ('emperor', 'sultan'), other royal or noble ranks ('prince', 'elector'), the female counterpart ('queen'), an associated object ('throne'), and proper names that co-occur with the title ('judah', 'constantine').

In [10]:
model.wv.most_similar_cosmul('stockholm', topn=8)

[('rotterdam', 0.9175029397010803),
 ('stuttgart', 0.9094671607017517),
 ('mannheim', 0.9077668786048889),
 ('helsinki', 0.8997992277145386),
 ('dortmund', 0.8982261419296265),
 ('dresden', 0.8961442708969116),
 ('manitoba', 0.8938829898834229),
 ('bradford', 0.8931567072868347)]

The closest neighbors to 'stockholm' are all place names, so the relation is semantic (they belong to the same category). Seven of the eight are European cities; 'manitoba' is a Canadian province rather than a city.

In [11]:
model.wv.most_similar_cosmul('fail', topn=8)

[('respond', 0.9232657551765442),
 ('try', 0.9164349436759949),
 ('seek', 0.9024219512939453),
 ('decide', 0.8954998254776001),
 ('acquire', 0.887599527835846),
 ('perform', 0.8862231969833374),
 ('fix', 0.885450005531311),
 ('prove', 0.8851593732833862)]

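I was expecting the closest neighbors to 'fail' to be other verbs with the same meaning, but the list contains common verbs in the base form instead. Some of them can accompany 'fail', as in 'fail to respond' or 'fail to decide', so they may be collocations. Others ('try', 'seek', 'decide') are themselves typically followed by a to-infinitive, so the similarity may also be syntactic: the same part of speech and form used in the same kind of construction.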

In [12]:
model.wv.most_similar_cosmul('build', topn=8)

[('deliver', 0.8865854740142822),
 ('create', 0.878303050994873),
 ('expand', 0.8757200241088867),
 ('carry', 0.8716338276863098),
 ('buy', 0.8673720359802246),
 ('sell', 0.8645790815353394),
 ('accommodate', 0.862581729888916),
 ('keep', 0.8605408072471619)]

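The closest neighbors to 'build' are all verbs in the base form. Only 'create' and 'expand' are close in meaning; the others ('deliver', 'carry', 'buy', 'sell', 'accommodate', 'keep') are mostly transitive verbs that can take similar kinds of objects, so the relation is partly semantic and partly syntactic.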

### 1.2 Analogies

#### For each of these analogy tasks, what is the right result, and where is it ranked when computing the analogy using word2vec? (obs: it might be the case that the right answer is not among the 8 closest neighbors, or there may be more than one right answer). Specify what is the linguistic relation between the right results and the original query (the first word in the analogy).

In [13]:
model.wv.most_similar_cosmul(positive=['king', 'woman'], negative=['man'], topn=8)  # king + woman - man = queen

[('empress', 0.9364698529243469),
 ('son', 0.9166089296340942),
 ('elector', 0.906948983669281),
 ('castile', 0.9057071805000305),
 ('queen', 0.897747814655304),
 ('elizabeth', 0.8963039517402649),
 ('emperor', 0.8944636583328247),
 ('archbishop', 0.8938698172569275)]

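The right result should be 'queen', which is ranked 5th; 'empress', ranked 1st, is arguably also acceptable. The relation to 'king' is semantic (the female counterpart). The other neighbors are contextually related nouns rather than counterparts, and I am surprised that masculine terms such as 'son', 'emperor' and 'archbishop' still appear after subtracting 'man'.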
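As far as I understand, `most_similar_cosmul` does not literally add and subtract vectors. It uses the multiplicative objective (3CosMul) of Levy and Goldberg: each cosine is shifted into the range [0, 1], the similarities to the positive words are multiplied, and the product is divided by the similarities to the negative words. That would explain why some analogy scores are greater than 1. The sketch below reimplements the idea on invented two-dimensional vectors; `cosmul_score` and the vectors are hypothetical and not part of gensim.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosmul_score(candidate, positive, negative, eps=1e-6):
    """3CosMul score: product of shifted cosines to the positive vectors,
    divided by the product of shifted cosines to the negative vectors."""
    pos = 1.0
    for p in positive:
        pos *= (1 + cosine(candidate, p)) / 2  # shift cosine from [-1, 1] to [0, 1]
    neg = 1.0
    for n in negative:
        neg *= (1 + cosine(candidate, n)) / 2
    return pos / (neg + eps)  # eps avoids division by zero

# Hypothetical toy vectors, invented for illustration only
king, man, woman = [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]
candidates = {"queen": [0.5, 1.0], "prince": [1.0, 0.6]}

scores = {
    word: cosmul_score(vec, positive=[king, woman], negative=[man])
    for word, vec in candidates.items()
}
print(sorted(scores.items(), key=lambda pair: pair[1], reverse=True))
```

In this toy setup 'queen' scores above 1 and outranks 'prince', because it is close to both positive vectors and far from the negative one. With an empty `negative` list the denominator is about 1, so the score reduces to the product of the shifted cosines and stays below 1.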

In [27]:
model.wv.most_similar_cosmul(positive=['stockholm', 'denmark'], negative=['copenhagen'], topn=8)

[('norway', 1.1105077266693115),
 ('romania', 1.0747724771499634),
 ('sweden', 1.0710995197296143),
 ('cedes', 1.0662809610366821),
 ('hungary', 1.061793565750122),
 ('latvia', 1.0590490102767944),
 ('switzerland', 1.0568761825561523),
 ('lithuania', 1.055883765220642)]

The correct answer is 'sweden', which is ranked 3rd; its relation to 'stockholm' is semantic (capital to country). Instead, 'norway', another Scandinavian country, is ranked 1st.
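Since every task asks where the right answer is ranked, a small helper can read the rank off the returned list instead of counting by eye. This is a minimal sketch: `rank_of` is a hypothetical helper name, and the list is copied from the output of the 'stockholm' query.

```python
def rank_of(word, results):
    """Return the 1-based rank of `word` in a list of (word, score) pairs,
    or None if the word is not in the list."""
    for rank, (candidate, _score) in enumerate(results, start=1):
        if candidate == word:
            return rank
    return None

# Output of the stockholm + denmark - copenhagen query
results = [
    ('norway', 1.1105077266693115),
    ('romania', 1.0747724771499634),
    ('sweden', 1.0710995197296143),
    ('cedes', 1.0662809610366821),
    ('hungary', 1.061793565750122),
    ('latvia', 1.0590490102767944),
    ('switzerland', 1.0568761825561523),
    ('lithuania', 1.055883765220642),
]

sweden_rank = rank_of('sweden', results)
print(sweden_rank)  # 3
```

Calling the query with a larger `topn` and passing the result to the same helper would show how far down a missing answer actually is.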

In [28]:
model.wv.most_similar_cosmul(positive=['green', 'weaker'], negative=['weak'], topn=8)

[('holzman', 0.982444703578949),
 ('brighter', 0.9815743565559387),
 ('aegolius', 0.9712939262390137),
 ('cairngorm', 0.9613748788833618),
 ('jays', 0.9503967761993408),
 ('eagles', 0.9462083578109741),
 ('clutton', 0.9437892436981201),
 ('hall', 0.9422101378440857)]

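The expected result is the comparative 'greener' (a morphological relation), which is not among the 8 closest neighbors. The only comparative in the list is 'brighter', ranked 2nd. I don't really understand the other neighbors: 'holzman', ranked 1st, seems to be the surname of a basketball player and coach, so 'green' might be related to a playing field, although professional basketball is not usually played outdoors. It may simply be noise from keeping very rare words (min_count=0).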

In [29]:
model.wv.most_similar_cosmul(positive=['high', 'smallest'], negative=['small'], topn=8)

[('lowest', 1.0295003652572632),
 ('euug', 0.9826725125312805),
 ('highest', 0.9711028337478638),
 ('neel', 0.9577215909957886),
 ('db', 0.9483307003974915),
 ('ratio', 0.9368687868118286),
 ('diablotins', 0.9240896105766296),
 ('unmodulated', 0.9169269800186157)]

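The right result is the superlative 'highest' (a morphological relation), which is ranked 3rd. 'lowest' is ranked 1st even though it is the antonym; it is also a superlative and occurs in very similar contexts. The remaining neighbors ('euug', 'neel', 'db', 'ratio', 'diablotins', 'unmodulated') seem unrelated to the query.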

In [30]:
model.wv.most_similar_cosmul(positive=['build', 'bad'], negative=['good'], topn=8)

[('blow', 0.8911640048027039),
 ('releasing', 0.8881102204322815),
 ('fend', 0.886435866355896),
 ('samfu', 0.883983850479126),
 ('blew', 0.8808054327964783),
 ('attract', 0.877048134803772),
 ('buy', 0.8752694725990295),
 ('burn', 0.874954104423523)]

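The expected result would be an antonym of 'build', such as 'destroy', which is not among the 8 closest neighbors. Since 'bad' is added and 'good' is subtracted, the query is pushed towards negative contexts, so the nearest neighbors could be read as negative consequences of bad construction: weak structures can be blown away ('blow' is ranked 1st) or released ('releasing' is ranked 2nd). This reading is speculative.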

### 1.3 Semantic addition

#### For the following addition experiments, what is the rank of the correct solution/solutions?

In [63]:
model.wv.most_similar_cosmul(positive=['germany', 'river'], topn=8)

[('danube', 0.6821359992027283),
 ('rhine', 0.6648325324058533),
 ('northeast', 0.6514289379119873),
 ('border', 0.6455503702163696),
 ('canal', 0.6448549032211304),
 ('peninsula', 0.6424357891082764),
 ('peru', 0.6361529231071472),
 ('sudan', 0.6348578333854675)]

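Both 'danube' (ranked 1st) and 'rhine' (ranked 2nd) are correct solutions, since both rivers flow through Germany. The remaining neighbors are geographical terms or countries rather than rivers.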

In [26]:
model.wv.most_similar_cosmul(positive=['sweden', 'capital'], topn=8)

[('denmark', 0.7795769572257996),
 ('belgium', 0.7677285671234131),
 ('netherlands', 0.7469103336334229),
 ('norway', 0.746178388595581),
 ('luxembourg', 0.7400593757629395),
 ('switzerland', 0.7339285016059875),
 ('austria', 0.7328314185142517),
 ('catalonia', 0.7200887799263)]

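The most reasonable solution would be 'stockholm', but it is not among the 8 closest neighbors. Instead, the list contains only other countries and regions, with 'denmark' ranked 1st, which does not make much sense as an answer: the result seems to stay close to 'sweden' while 'capital' contributes little.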