<img src="https://designguide.ku.dk/download/co-branding/ku_logo_uk_h.png" alt="University logo" width="300" align="right"/>

## Language Processing 1

### Session 10 (part 2)

##### Manex Agirrezabal



### In the previous classes

We learned about:

 * Lexical semantics
 * WordNet

### In this class:

I would like you to learn about:

  * One-hot vector representation
  * Term-Document and Term-Term matrices
  * TF-IDF representation
  * Intuition of vector semantics

### Vector semantics

I would like to go back to a previous discussion.

One-hot encodings or sparse representations

  * Dimensionality
    * 50,000 words in the vocabulary means 50,000 features
    * can be expensive in computational resources
  * Relation between different instances
    * Dog, cat and table: every pair of one-hot vectors is at the same distance
    * low generalization power
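
To make the "same distance" point concrete, here is a minimal sketch with a three-word toy vocabulary (the vocabulary and the helper name `euclidean` are assumptions made for illustration). Every pair of one-hot vectors ends up equally far apart, so the representation cannot tell us that dog is closer to cat than to table.

```python
import numpy as np

# Toy vocabulary (an assumption for illustration; a real one might have 50,000 entries).
vocab = ["dog", "cat", "table"]

# Each word is a row of the identity matrix: a single 1 and zeros elsewhere.
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(u - v))

# All three pairs print the same distance.
for a, b in [("dog", "cat"), ("dog", "table"), ("cat", "table")]:
    print(a, b, round(euclidean(one_hot[a], one_hot[b]), 3))
```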


See the discussion in Section 3 of Goldberg (2016):

Goldberg, Y. (2016). A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research, 57, 345-420.

Actual discussion, sparse representations vs. dense representations

The main idea behind current word embeddings is to represent each word as a dense vector.

Advantages?

  * Model training will cause similar features to have similar vectors – information is shared between similar features
    * Generalization power
  * Computationally more efficient
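
A rough back-of-the-envelope calculation illustrates the efficiency argument. It assumes one vector per word stored naively as a full float32 array, and a dense size of 300 dimensions, which is a common choice. Real systems usually store a one-hot vector as a single index, so treat the first number as an upper bound on the naive approach, not as a measurement.

```python
vocab_size = 50_000      # vocabulary size from the one-hot example
dense_dim = 300          # assumed embedding size (a common choice)
bytes_per_float = 4      # assuming float32 storage

# One vector per word, stored naively as full float arrays.
one_hot_bytes = vocab_size * vocab_size * bytes_per_float
dense_bytes = vocab_size * dense_dim * bytes_per_float

print(f"one-hot: {one_hot_bytes / 1e9:.1f} GB")
print(f"dense:   {dense_bytes / 1e6:.1f} MB")
```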

In [5]:
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
%matplotlib inline

### Why use word embeddings?

#### Modeling semantics

 * Until now, we modeled semantics using WordNet (or Prolog, or something similar)
 * Look up two words in WordNet
 * Check the similarity value (according to the path in the **WORDNET GRAPH**!)

 * **High quality** resource, but **expensive**
 * There is a lot of manual annotation

If we want to create a new lexical resource for a specific language

 * We could do it automatically
 * Take all synset-synset relations for English and...
   * English $\rightarrow{}$ (Our desired language)

 * But a lot of synonymy pairings are different across languages
 
 <center>Can you think of an example that would not work?</center>

In [6]:
import nltk
from nltk.corpus import wordnet as wn

In [3]:
# English verb senses of "blue"
for syn in wn.synsets("blue", pos='v'):
    print(syn, syn.definition())

Synset('blue.v.01') turn blue


In [4]:
# All senses linked to the Basque ("eus") lemma "urdin"
for syn in wn.synsets("urdin", lang="eus"):
    print(syn, syn.definition())

Synset('blue.n.07') any of numerous small butterflies of the family Lycaenidae
Synset('blue.n.02') blue clothing
Synset('blue.n.01') blue color or pigment; resembling the color of the clear sky in the daytime
Synset('mold.n.05') a fungus that produces a superficial growth on various kinds of damp or decaying organic matter
Synset('mildew.n.02') a fungus that produces a superficial (usually white) growth on organic matter
Synset('mold.n.03') loose soil rich in organic matter
Synset('bluing.n.01') used to whiten laundry or hair or give it a bluish tinge


In [5]:
# Verb senses linked to the Basque lemma "urdindu", with usage examples
for syn in wn.synsets("urdindu", lang="eus", pos='v'):
    print(syn, syn.definition())
    print(syn.examples())

Synset('silver.v.02') make silver in color
['Her worries had silvered her hair']


#### Distributional hypothesis (Harris, 1954)

 * Mid 20th century
 * Intuitions
   * "*oculist and eye-doctor occur in almost the same environments*"
   * "*If A and B have almost identical environments we say that they are synonyms*"

#### Firth (1957)

 * You shall know a word by the company it keeps

#### In Spanish:

<center><i>Dime con quién andas y te diré quién eres</i><br/>(Tell me who you walk with and I will tell you who you are)</center>

#### Example inspired by Nida (1976) found in SLP2:

<center><b>pacharán</b></center>

 * a bottle table of is on the
 * likes everybody
 * have dinner
 * drunk you makes
 * make blackthorn and we anis with

#### Example inspired by Nida (1976) found in SLP2:

<center><b>pacharán</b></center>

 * A bottle of **pacharán** is on the table

 * Everybody likes **pacharán**

 * We will have a **pacharán** after dinner

 * **Pacharán** makes you drunk

 * We make **pacharán** with anis and blackthorn

Conclusion:

 * From the context words, humans can guess that pacharán is an alcoholic beverage made from blackthorn and anis

 * Intuition for algorithm: two words are similar if they have similar word contexts.
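
The intuition can be sketched in a few lines. The code below counts the words that co-occur with a target word and compares the resulting count vectors with cosine similarity. The last four sentences are hypothetical additions, included only to have something to compare against. The whole sentence is used as the context window, which is a simplification. The helper names `context_counts` and `cosine` are made up for this sketch.

```python
from collections import Counter
from math import sqrt

sentences = [
    "a bottle of pacharán is on the table",
    "everybody likes pacharán",
    "we will have a pacharán after dinner",
    "pacharán makes you drunk",
    "we make pacharán with anis and blackthorn",
    # Hypothetical extra sentences, added only for comparison.
    "a bottle of wine is on the table",
    "everybody likes wine",
    "wine makes you drunk",
    "the chair is next to the door",
]

def context_counts(target, sentences):
    """Count every other word in the sentences where `target` occurs."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if target in tokens:
            counts.update(t for t in tokens if t != target)
    return counts

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(c1[w] * c2[w] for w in c1 if w in c2)
    norm1 = sqrt(sum(v * v for v in c1.values()))
    norm2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

pacharan = context_counts("pacharán", sentences)
wine = context_counts("wine", sentences)
chair = context_counts("chair", sentences)

# pacharán shares far more context words with wine than with chair.
print(cosine(pacharan, wine))
print(cosine(pacharan, chair))
```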

#### Example KWIC

 * Tell me a weird word.

<center><a href="https://corpus.byu.edu/" target="_blank">BYU corpus</a></center>

### What is a word embedding?

  * A representation of a word
  * Dense representation (~300 dimensions)
  * Why are they relevant? What properties do they have?

You can download pretrained word embeddings from many websites, for instance:

 * https://absalon.ku.dk/courses/69089/files/7371057?module_item_id=2046662 (small version of embedding file, only 50,000 words)
 * http://vectors.nlpl.eu/repository/#

There are even pretrained word embeddings in the gensim package :-)

In [6]:
import gensim
from gensim.models import KeyedVectors

# Load vectors directly from the file
model = KeyedVectors.load_word2vec_format('./wiki.en.vec.short50K', binary=False)

How can we check which are the most similar words to a specific word?

In [7]:
model.most_similar("house")

[('houses', 0.6788979172706604),
 ('mansion', 0.6370158195495605),
 ('farmhouse', 0.6100958585739136),
 ('barn', 0.5588486194610596),
 ('representatives', 0.5545411705970764),
 ('townhouse', 0.5492976903915405),
 ('cottage', 0.5487219095230103),
 ('upstairs', 0.5245002508163452),
 ('outbuildings', 0.522941529750824),
 ('residence', 0.5194516777992249)]

How can we check the similarity between two words?

In [8]:
model.similarity("cat","dog")

0.63805175

In [9]:
model.similarity("cat","elephant")

0.42504558

How do we calculate this similarity?

Do it yourself:

  1. Get vectors (`model.get_vector`)
  2. Use the cosine similarity function we previously implemented (`cosine_similarity`)

In [None]:
#YOUR CODE HERE
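
One possible solution sketch follows. The `cosine_similarity` function from the earlier session is not shown in this notebook, so the version below is an assumption about what it looks like. The three vectors are invented stand-ins, so the cell runs without the embedding file.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-ins for model.get_vector("cat"), model.get_vector("dog"), ...
# The numbers are invented; real vectors would have about 300 dimensions.
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.8, 0.9, 0.2])
table = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(cat, dog))    # high: the vectors point in a similar direction
print(cosine_similarity(cat, table))  # lower: the vectors point in different directions
```

With the model loaded, `cosine_similarity(model.get_vector("cat"), model.get_vector("dog"))` should agree with `model.similarity("cat", "dog")` up to floating-point error.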

### Some nice properties:


<img src="https://nlp.stanford.edu/projects/glove/images/man_woman.jpg"/>

Shall we see some of these operations working?

In [10]:
vec = model.get_vector("berlin") - model.get_vector("germany") + model.get_vector("france")
model.most_similar_cosmul(vec)

[('paris', 2.3291807174682617),
 ('berlin', 2.1073265075683594),
 ('france', 1.9722492694854736),
 ('marseille', 1.932993769645691),
 ('toulouse', 1.9296311140060425),
 ('montpellier', 1.8973047733306885),
 ('rouen', 1.8967307806015015),
 ('avignon', 1.8911137580871582),
 ('rennes', 1.8815248012542725),
 ('ferrand', 1.8801523447036743)]

In [11]:
vec = model.get_vector("queen") - model.get_vector("woman") + model.get_vector("man")
model.most_similar_cosmul(vec)

[('queen', 2.3895578384399414),
 ('king', 2.0380735397338867),
 ('majesty', 1.7851786613464355),
 ('monarch', 1.7262828350067139),
 ('crown', 1.6237839460372925),
 ('queens', 1.5930700302124023),
 ('whitehall', 1.5785431861877441),
 ('reign', 1.576119065284729),
 ('coronation', 1.570044994354248),
 ('royal', 1.5298391580581665)]
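
Note that the query words themselves (`berlin`, `france`, `queen`) rank very high in the outputs above, because we pass a raw vector and nothing is filtered out. A common convention is to exclude the input words from the ranking. As far as I know, gensim's `most_similar(positive=[...], negative=[...])` does this when it is given words instead of a vector. The sketch below shows the idea on invented 2-d vectors. The function `nearest` and all the numbers are assumptions for illustration.

```python
import numpy as np

# Invented 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender". Purely illustrative.
vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def nearest(query, vectors, exclude=()):
    """Rank words by cosine similarity to `query`, skipping the words in `exclude`."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    scores = {w: cos(query, v) for w, v in vectors.items() if w not in exclude}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# queen - woman + man, with the three input words removed from the candidates.
query = vectors["queen"] - vectors["woman"] + vectors["man"]
ranking = nearest(query, vectors, exclude={"queen", "woman", "man"})
print(ranking[0][0])
```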

Can you find more relevant patterns?

In [12]:
#YOUR CODE HERE
vec = model.get_vector("dark") - model.get_vector("black") + model.get_vector("white")
model.most_similar_cosmul(vec)

[('dark', 2.387988567352295),
 ('white', 2.0791215896606445),
 ('pale', 2.0786049365997314),
 ('darker', 1.9954276084899902),
 ('darkened', 1.937967300415039),
 ('bright', 1.8600469827651978),
 ('pinkish', 1.8571182489395142),
 ('greenish', 1.8410722017288208),
 ('shades', 1.8392839431762695),
 ('dull', 1.8250279426574707)]

We can also find the word that doesn't match the others in a list

In [13]:
model.doesnt_match("breakfast cereal dinner lunch".split())

'cereal'

In [14]:
model.doesnt_match("screen keyboard pen apple usb".split())

'pen'

Can you get more examples like these?

In [27]:
#YOUR CODE HERE
model.doesnt_match("football basketball baseball orange rugby".split())

'orange'
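
How could such a function work? One simple approach, which to my understanding is close to what gensim does, is to average the unit-length vectors of all the words and return the word that is least similar to that average. The sketch below is a simplified re-implementation of that idea on invented 2-d vectors, not gensim's own code. The name `odd_one_out` is made up.

```python
import numpy as np

def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the mean of all vectors."""
    # Normalize each vector to unit length so that long vectors do not dominate the mean.
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    # Cosine similarity of each unit vector to the (unit-length) mean.
    sims = unit @ mean
    return words[int(np.argmin(sims))]

# Invented toy vectors: the three meals point one way, "cereal" another.
toy = {
    "breakfast": np.array([0.9, 0.1]),
    "lunch": np.array([0.8, 0.2]),
    "dinner": np.array([0.85, 0.15]),
    "cereal": np.array([0.3, 0.9]),
}
print(odd_one_out(["breakfast", "cereal", "dinner", "lunch"], toy))
```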

### Models

There are several models for estimating word embeddings.

Further reading:

Nice blog post: http://www.davidsbatista.net/blog/2018/12/06/Word_Embeddings/

 - Word2vec
   - https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf
   - (additional) https://arxiv.org/pdf/1301.3781.pdf
 - GloVe
   - https://www.aclweb.org/anthology/D14-1162.pdf
 - FastText
   - https://www.aclweb.org/anthology/Q17-1010.pdf

Code to play with:

 - Word2vec:
   - https://rare-technologies.com/word2vec-tutorial/ (here you can also load GloVe or FastText models)
   - https://radimrehurek.com/gensim/models/word2vec.html
   - https://code.google.com/archive/p/word2vec/
   
 - GloVe:
   - https://github.com/stanfordnlp/GloVe
   
 - FastText
   - https://fasttext.cc/
   - https://radimrehurek.com/gensim/models/fasttext.html