# Lecture 1: Intro and Word Vectors

## TL;DR

- Different ways computers represent words
  - WordNet: manually curated, the traditional method
  - Vector representations
    - One-hot vectors
    - Word vectors (embeddings)

## Meaning of a Word

**How can we represent the meaning of a word?**

### WordNet
The traditional NLP solution was WordNet, a hand-built lexical database covered in its own section below.

### Discrete Symbols
- Words can be represented as discrete symbols, encoded as one-hot vectors
- Problem
  - If a user searches for “Seattle motel”, we would like to match documents containing “Seattle hotel”.
  - However, the two vectors below are orthogonal, so one-hot vectors carry no notion of similarity between the two words

```
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
```
- Solution
  - Could try to rely on WordNet’s list of synonyms to get similarity?
    - But it is well-known to fail badly: incompleteness, etc.
  - Instead: learn to encode similarity in the vectors themselves
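
The orthogonality problem can be checked directly. The following is a minimal sketch using the two one-hot vectors shown above; the helper name `dot` is only an illustrative choice, and the vocabulary size of 15 is just what the example happens to use.

```python
# One-hot vectors copied from the example above.
motel = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
hotel = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


print(dot(motel, hotel))  # 0: two different words never share a non-zero position
print(dot(motel, motel))  # 1: a one-hot vector only matches itself
```

Every pair of distinct words gets a dot product of 0, so "motel" is exactly as unrelated to "hotel" as it is to any other word in the vocabulary.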

## WordNet
- **WordNet** is a lexical database of semantic relations between English words, first created by the Cognitive Science Laboratory of Princeton University.
- It covers nouns, verbs, adjectives, and adverbs but omits prepositions, determiners, and other function words.
- WordNets for other languages exist too.

### WordNet Example

Import NLTK and download the WordNet corpus:

In [None]:
import nltk
nltk.download('wordnet')

In [8]:
from nltk.corpus import wordnet as wn

print('Synsets for the word "invite" in WordNet:\n\n', wn.synsets('invite'))

Synsets for the word "invite" in WordNet:

 [Synset('invite.n.01'), Synset('invite.v.01'), Synset('invite.v.02'), Synset('tempt.v.03'), Synset('invite.v.04'), Synset('invite.v.05'), Synset('invite.v.06'), Synset('invite.v.07'), Synset('receive.v.05')]


In [9]:
# We can constrain the search by specifying the part of speech
# parts of speech available: ADJ, ADV, ADJ_SAT, NOUN, VERB
# ADJ_SAT: see https://stackoverflow.com/questions/18817396/what-part-of-speech-does-s-stand-for-in-wordnet-synsets

# Way one
print(f'{"-"*20}Way one{"-"*20}')
print('Synsets for the noun "invite" in WordNet:\n\n', wn.synsets('invite', pos=wn.NOUN))

# Way two
print(f'\n\n{"-"*20}Way two{"-"*20}')
# pos: {'n':'noun', 'v':'verb', 's':'adj (s)', 'a':'adj', 'r':'adv'}
print('Synsets for the noun "invite" in WordNet:\n\n', [s for s in wn.synsets('invite') if s.pos()=='n'])


--------------------Way one--------------------
Synsets for the noun "invite" in WordNet:

 [Synset('invite.n.01')]


--------------------Way two--------------------
Synsets for the noun "invite" in WordNet:

 [Synset('invite.n.01')]


In [10]:
# check definition of a synset
print(f'{"-"*20}Definition{"-"*20}')
print('The definition for invite as a noun:\n\n', wn.synset('invite.n.01').definition())

# check the related examples
print(f'\n\n{"-"*20}Examples{"-"*20}')
print('The examples for invite as a noun:\n\n', wn.synset('invite.n.01').examples())

# check the hypernyms
print(f'\n\n{"-"*20}Hypernyms{"-"*20}')
print('The hypernyms for invite as a noun:\n\n', wn.synset('invite.n.01').hypernyms())


--------------------Definition--------------------
The definition for invite as a noun:

 a colloquial expression for invitation


--------------------Examples--------------------
The examples for invite as a noun:

 ["he didn't get no invite to the party"]


--------------------Hypernyms--------------------
The hypernyms for invite as a noun:

 [Synset('invitation.n.01')]


### Limitations
- Requires human labor to create and maintain
  - Impossible to keep every word up to date
- Missing **nuance**
  - "proficient" is listed as a synonym for "good", which only holds in some contexts
- Misses new words
  - badass, nifty, etc.
- Cannot compute word similarity accurately (path-based scores range from 0 to 1)

In [11]:
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print('The path similarity between cat(noun) and dog(noun): ', dog.path_similarity(cat))

The path similarity between cat(noun) and dog(noun):  0.2
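
To see where a score like 0.2 comes from: NLTK's `path_similarity` is based on the shortest path between two synsets in the is-a (hypernym) taxonomy, scored as 1 / (path length + 1). The sketch below reproduces that arithmetic on a tiny hand-written taxonomy. The `toy_hypernyms` dictionary and both helper functions are hypothetical and not part of NLTK, and the real WordNet graph is far larger.

```python
from collections import deque

# Hand-written toy fragment of an is-a taxonomy (an assumption for illustration).
toy_hypernyms = {
    "dog": ["canine"],
    "canine": ["carnivore"],
    "cat": ["feline"],
    "feline": ["carnivore"],
    "carnivore": [],
}


def shortest_path_length(graph, start, goal):
    """Breadth-first search over is-a edges, treated as undirected."""
    neighbours = {node: set(parents) for node, parents in graph.items()}
    for node, parents in graph.items():
        for parent in parents:
            neighbours.setdefault(parent, set()).add(node)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no connecting path


def toy_path_similarity(graph, a, b):
    """Score in (0, 1]: 1 / (shortest path length + 1)."""
    dist = shortest_path_length(graph, a, b)
    return None if dist is None else 1 / (dist + 1)


# dog -> canine -> carnivore <- feline <- cat is 4 edges, so the score is 1 / 5.
print(toy_path_similarity(toy_hypernyms, "dog", "cat"))  # 0.2
```

The score depends only on how the taxonomy happens to be drawn, which is one reason such scores are a coarse measure of similarity.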


## Word Vectors (AKA Embeddings)

- When a word *w* appears in a text, the **context** is the set of words that appear nearby.
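- The **context words** of *w*, collected over its many occurrences, build up a representation of *w*
- A dense **vector** is built for each word so that words appearing in similar contexts get similar vectors; similarity is measured as the vector dot product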
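
As a rough illustration of the last point, the sketch below compares dense vectors by dot product. The numbers are made up by hand for this example (they are not learned embeddings), and the `vectors` dictionary and `dot` helper are hypothetical names.

```python
# Made-up 4-dimensional vectors, chosen by hand for illustration only;
# real embeddings are learned from data and usually have many more dimensions.
vectors = {
    "hotel": [0.8, 0.1, 0.6, 0.2],
    "motel": [0.7, 0.2, 0.5, 0.3],
    "banana": [0.1, 0.9, 0.0, 0.4],
}


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Related words are given similar vectors here, so their dot product is larger.
print("hotel . motel  =", round(dot(vectors["hotel"], vectors["motel"]), 2))
print("hotel . banana =", round(dot(vectors["hotel"], vectors["banana"]), 2))
```

Unlike one-hot vectors, dense vectors can give different words a non-zero similarity.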

<p align="center">
    <img src="img/j/j1.png" alt="Word Vectors" width="500"/>
</p>


**Word Space**

<p align="center">
    <img src="img/j/j2.png" alt="word vector" width="500"/>
</p>

- Note that:
  - has, have, had are grouped together
  - come, go are closely grouped



## Word2vec

**Word2Vec** is a framework for learning word vectors.

How it works:

1. Get a large **corpus** (Latin for "body") of text
2. Create a vector for each word in a fixed vocabulary
3. Go through each position *t* in the text, which has a center word *c* and context words *o*
4. Find the probability of *o* given *c* (or vice versa) using the similarity of the word vectors for *c* and *o*
5. Keep adjusting the word vectors to maximize this probability
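
Step 4 can be sketched in code. One common way to turn similarities into a probability, and the one used in the standard word2vec formulation, is a softmax over dot products: $P(o \mid c) = \exp(u_o \cdot v_c) / \sum_w \exp(u_w \cdot v_c)$. The vocabulary and the 2-dimensional vectors below are hand-picked, hypothetical values rather than learned ones, and the function names are illustrative.

```python
import math

# Hypothetical tiny vocabulary with hand-picked 2-d vectors (not learned values).
center_vectors = {"bank": [1.0, 0.5]}
context_vectors = {
    "money": [1.2, 0.4],
    "river": [0.3, 0.9],
    "banana": [-0.8, -0.2],
}


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def context_probabilities(center_word):
    """Softmax over dot products: P(o | c) for every candidate context word o."""
    v_c = center_vectors[center_word]
    scores = {o: dot(u_o, v_c) for o, u_o in context_vectors.items()}
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {o: math.exp(s - max_score) for o, s in scores.items()}
    total = sum(exps.values())
    return {o: e / total for o, e in exps.items()}


probs = context_probabilities("bank")
for word, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"P({word} | bank) = {p:.3f}")
```

Training (step 5) would then nudge the vectors so that words actually observed in the context receive a higher probability; that update is not shown here.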

**Core idea**: What is the probability of a word occurring in the context of the center word?


<p align="center">
    <img src="img/j/j3.png" alt="calc" width="500"/>
</p>

If the window size is 2, the model predicts the likelihood of the 2 words before and the 2 words after the center word.
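
The windowing can be sketched as follows. This is a minimal illustration, not a full Word2Vec implementation: the function name `center_context_pairs` and the sample sentence are assumptions chosen for the example.

```python
def center_context_pairs(tokens, window=2):
    """Return (center, context) pairs for every position t in the token list."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)                # clip the window at the start
        hi = min(len(tokens), t + window + 1)  # clip the window at the end
        for j in range(lo, hi):
            if j != t:                         # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "problems turning into banking crises as".split()
pairs = center_context_pairs(tokens, window=2)

# Context words for the center word "into" with window = 2.
print([o for c, o in pairs if c == "into"])
# ['problems', 'turning', 'banking', 'crises']
```

Words near the edges of the text simply get fewer context words, because the window is clipped there.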
