In [9]:
import numpy as np

### Word2Vec: A Better Word Embedding

- **Issues with tf-idf Embeddings**:
  - **Long and Sparse Vectors**: Each vector has one entry per vocabulary word, and most of those entries are zero.
  - **Not Ideal for Machine Learning**: Due to their length and sparsity, they lead to more model parameters and longer training times.
  - **Word Order & Synonyms**: Tf-idf ignores the order of words and treats synonyms as unrelated terms.

- How Do We Address These Issues?
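
To make the sparsity issue concrete, here is a minimal sketch that builds tf-idf vectors by hand for a three-document toy corpus. The corpus and the helper name `tfidf_vectors` are made up for illustration, and the weighting used is the common `tf * log(N / df)` form, which is only one of several tf-idf variants.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
]

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using tf * log(N / df) weighting."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([
            (counts[word] / len(tokens)) * math.log(n_docs / df[word])
            for word in vocab
        ])
    return vocab, vectors

vocab, vectors = tfidf_vectors(docs)
zero_fraction = sum(v == 0 for vec in vectors for v in vec) / (len(vocab) * len(vectors))
print(f"vocabulary size: {len(vocab)}")
print(f"fraction of zero entries: {zero_fraction:.2f}")
```

Even with three short documents, more than half of the entries are zero. With a realistic vocabulary of tens of thousands of words, almost every entry would be.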


### Using Taxonomy to Capture Similarity

- Taxonomies are one way to capture the similarity between words that tf-idf misses.
- What's a Taxonomy? 
  - It's a "knowledge organization system" that is often visualized as a tree or graph. 
  - It categorizes terms within a field and the relationships between them.
  - Taxonomies vary: from broad (like English Language taxonomy) to specific (like Computing or Amazon Product Taxonomy).
  
- A popular example is the WordNet Taxonomy:
  - Contains 155,327 English words.
  - [More about WordNet on Wikipedia](https://en.wikipedia.org/wiki/WordNet)

![WordNet Taxonomy Example](https://www.dropbox.com/s/9kapn0eq6v84g2m/word_net_taxonomy.png?dl=1)
[Source](https://escholarship.org/content/qt9j8221x8/qt9j8221x8.pdf)


### Semantic Similarity

* WordNet organizes words into sets of synonyms, known as synsets. 
  * Each synset represents a distinct concept.

* Hypernymy and Hyponymy Relations:
  * WordNet provides hypernyms (more general terms) and hyponyms (more specific terms) for each word.

* For each word in your dataset, create a feature vector where each entry corresponds to the semantic similarity between the word and a set of predefined words (or reference words). 

  * This will reduce the sparsity, as each word is now represented as a vector of similarity scores.

* Meronyms and Holonyms:
    * Consider incorporating part-whole relationships (meronyms) and whole-part relationships (holonyms) as additional features. For example, 'wheel' is a meronym of 'car'.
    
* Ideally, we want to **learn** a word's representation so that the representation encodes meaning.
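
The similarity-feature idea above can be sketched without WordNet itself. The mini-taxonomy below is hypothetical (it is not an excerpt from WordNet), and the score is a simple path-based one, `1 / (1 + path length)`, chosen here as an assumption. The names `path_similarity`, `similarity_features`, and `reference_words` are illustrative only.

```python
from collections import deque

# Hypothetical mini-taxonomy (child -> parent), invented for illustration.
parent = {
    "reptile": "animal",
    "mammal": "animal",
    "snake": "reptile",
    "lizard": "reptile",
    "dog": "mammal",
    "cat": "mammal",
}

def neighbors(term):
    """Terms one edge away in the taxonomy (children plus parent)."""
    linked = [child for child, par in parent.items() if par == term]
    if term in parent:
        linked.append(parent[term])
    return linked

def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest path between a and b)."""
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        term, steps = queue.popleft()
        if term == b:
            return 1 / (1 + steps)
        for nxt in neighbors(term):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return 0.0  # no path between the two terms

reference_words = ["snake", "dog", "animal"]

def similarity_features(word):
    """Dense feature vector: similarity of `word` to each reference word."""
    return [path_similarity(word, ref) for ref in reference_words]

print(similarity_features("lizard"))
print(similarity_features("cat"))
```

Each word now gets a short, dense vector with one entry per reference word, rather than one entry per vocabulary word.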

### Beyond Taxonomies: The Case for Word2Vec

- Constructing taxonomies is intricate and demands expert input.
- Taxonomies offer clear hierarchies, but a word's context in text often shapes its meaning.
  - Unlike dynamic embeddings, taxonomies can restrict a word to a singular definition.
- Representing Distances: Knowing "snake" is a "reptile" is useful, but how close is it to "lizard"? Taxonomies might fall short.
  
* While taxonomies provide organized structure, their rigidity limits comprehensive word representation. 

### Understanding Word2Vec: Turning Word Meanings into Math

- When we talk about a word's 'meaning', we're referring to the concept or idea it stands for. 
  
- Now, imagine if we could translate these concepts into numbers or formulas. This means we can actually do calculations with word meanings!
  
- Here's an interesting example: 
  if we do the math with the concepts behind the words, such as 
  King minus Man plus Woman, the answer should remind us of the word 'Queen'.


<img src="https://www.dropbox.com/s/hiqe4ql1ne9reu2/mwkq.png?dl=1" width=700>
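
As a rough sketch of what "doing math with meanings" looks like, the snippet below uses hand-made 2-D vectors. The two dimensions (royalty, masculinity), the numbers, and the helper `nearest_word` are all assumptions made up for illustration. Real embeddings are learned and have many more dimensions.

```python
import numpy as np

# Hypothetical 2-D "meaning" vectors: [royalty, masculinity]. Values are made up.
toy_vectors = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.1]),
    "apple": np.array([0.0, 0.5]),
}

def nearest_word(query, vectors, exclude=()):
    """Word whose vector has the smallest Euclidean distance to `query`."""
    candidates = [word for word in vectors if word not in exclude]
    return min(candidates, key=lambda word: np.linalg.norm(vectors[word] - query))

result = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
print(nearest_word(result, toy_vectors, exclude=("king", "man", "woman")))
```

Excluding the three input words from the candidates is a common convention for analogy queries, since the result often lands close to one of the inputs.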

### Drawing Parallels: Word Operations vs. Color Operations

- We already perform operations on words that represent colors, much like math.

- Think about how we mix and match colors. Can't we do the same with words?
  - For instance:
    - Mixing red and green gives us yellow.
    - Subtracting blue from magenta leaves red.
    - Yellow is closer to banana than it is to green.
    - Combine black and white, and you get a shade of grey.
    - Just as 'banana' is a shade of yellow, 'hunter green' is a shade of green.

[![Color Operations](https://www.dropbox.com/s/aon76xh7qlu1z2y/colors.png?dl=1)](https://www.dropbox.com/s/aon76xh7qlu1z2y/colors.png?dl=1)

Reference: [Exploring word color associations](https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469)


![](https://www.dropbox.com/s/5eh77i3byy0av50/xkcd%20colors.png?dl=1)

import json
# Load the xkcd color survey data (color names with hex codes).
with open("media/xkcd.json") as f:
    color_data = json.load(f)
print(color_data.keys())


In [2]:
color_data

{'description': 'The 954 most common RGB monitor colors, as defined by several hundred thousand participants in the xkcd color name survey.',
 'colors': [{'color': 'cloudy blue', 'hex': '#acc2d9'},
  {'color': 'dark pastel green', 'hex': '#56ae57'},
  {'color': 'dust', 'hex': '#b2996e'},
  {'color': 'electric lime', 'hex': '#a8ff04'},
  {'color': 'fresh green', 'hex': '#69d84f'},
  {'color': 'light eggplant', 'hex': '#894585'},
  {'color': 'nasty green', 'hex': '#70b23f'},
  {'color': 'really light blue', 'hex': '#d4ffff'},
  {'color': 'tea', 'hex': '#65ab7c'},
  {'color': 'warm purple', 'hex': '#952e8f'},
  {'color': 'yellowish tan', 'hex': '#fcfc81'},
  {'color': 'cement', 'hex': '#a5a391'},
  {'color': 'dark grass green', 'hex': '#388004'},
  {'color': 'dusty teal', 'hex': '#4c9085'},
  {'color': 'grey teal', 'hex': '#5e9b8a'},
  {'color': 'macaroni and cheese', 'hex': '#efb435'},
  {'color': 'pinkish tan', 'hex': '#d99b82'},
  {'color': 'spruce', 'hex': '#0a5f38'},
  {'color': 'str

In [3]:
color_data["colors"][:3]

[{'color': 'cloudy blue', 'hex': '#acc2d9'},
 {'color': 'dark pastel green', 'hex': '#56ae57'},
 {'color': 'dust', 'hex': '#b2996e'}]

In [5]:
from textwrap import wrap

x = '#acc2d9'
# Split the six hex digits into two-character R, G, B chunks.
wrap(x[1:], 2)

['ac', 'c2', 'd9']

In [6]:
colors = {col_info["color"]:tuple(wrap(col_info["hex"][1:], 2)) for col_info in color_data["colors"]}
colors["cloudy blue"]

('ac', 'c2', 'd9')

In [7]:
int('ac', 16), int('c2', 16), int('d9', 16), 

(172, 194, 217)

In [10]:
# Convert each pair of hex digits to an integer in 0-255, giving an RGB vector.
colors = {name: np.array([int(hex_v, 16) for hex_v in hex_t]) for name, hex_t in colors.items()}
colors["cloudy blue"]

array([172, 194, 217])

In [11]:
colors

{'cloudy blue': array([172, 194, 217]),
 'dark pastel green': array([ 86, 174,  87]),
 'dust': array([178, 153, 110]),
 'electric lime': array([168, 255,   4]),
 'fresh green': array([105, 216,  79]),
 'light eggplant': array([137,  69, 133]),
 'nasty green': array([112, 178,  63]),
 'really light blue': array([212, 255, 255]),
 'tea': array([101, 171, 124]),
 'warm purple': array([149,  46, 143]),
 'yellowish tan': array([252, 252, 129]),
 'cement': array([165, 163, 145]),
 'dark grass green': array([ 56, 128,   4]),
 'dusty teal': array([ 76, 144, 133]),
 'grey teal': array([ 94, 155, 138]),
 'macaroni and cheese': array([239, 180,  53]),
 'pinkish tan': array([217, 155, 130]),
 'spruce': array([10, 95, 56]),
 'strong blue': array([ 12,   6, 247]),
 'toxic green': array([ 97, 222,  42]),
 'windows blue': array([ 55, 120, 191]),
 'blue blue': array([ 34,  66, 199]),
 'blue with a hint of purple': array([ 83,  60, 198]),
 'booger': array([155, 181,  60]),
 'bright sea green': array([  

In [12]:
print("These colors were manually labelled by participants")

print(f"Black is {colors['black']}, white is: {colors['white']} and red is {colors['red']}")


These colors were manually labelled by participants
Black is [0 0 0], white is: [255 255 255] and red is [229   0   0]


![](https://www.dropbox.com/s/9k2828pyr0nypla/red.png?dl=1)

In [13]:
np.array([1,2,3]) + np.array([1,1,1])

array([2, 3, 4])

In [14]:
np.array([1,2,3]) - np.array([1,1,1])

array([0, 1, 2])

In [15]:
# Euclidean distance between two RGB vectors
def dist(coord1, coord2):
    return np.linalg.norm(coord1 - coord2)

dist(colors['red'], colors['blue'])



324.49036965678965

In [16]:
np.mean([[1,2,3], [1,2,3], [1,2,3]], axis =0)

array([1., 2., 3.])

In [17]:
dist(colors['red'], colors['green']) > dist(colors['cherry red'], colors['tomato red'])

True

In [18]:
def closest(query, colors, n=10):
    # Sort color names by distance to the query vector and keep the n nearest.
    return sorted(colors.keys(),
                  key=lambda name: dist(query, colors[name]))[:n]

closest(colors['red'], colors, n=5)

['red', 'fire engine red', 'bright red', 'tomato red', 'cherry red']

### Manipulating Colors with Their RGB "Embeddings"

- Using these RGB "embeddings", we can perform arithmetic on colors, much like we do with numbers.
  - Think of these operations:
     * Adding red and green gives us yellow. 
     * Subtracting blue from magenta leaves us with red.
     * In terms of distance, yellow is nearer to banana than to green.
     * Mix black and white in equal parts, and you'll land on grey.
     * In the same way "banana" relates to the color yellow, "hunter green" relates to the basic shade of green.


In [30]:
# Red + green should land near yellow
some_color = colors["red"] + colors["green"]
closest(some_color, colors, n=5)

['squash', 'orangey yellow', 'yellowish orange', 'saffron', 'amber']

In [19]:
some_color = colors["magenta"] - colors["blue"]
closest(some_color, colors, n=5)


['red', 'deep red', 'blood red', 'darkish red', 'dark red']

In [20]:
dist(colors["yellow"], colors["banana"]) < dist(colors["yellow"], colors["green"])


True

In [21]:
some_color =  np.mean([colors['black'], colors['white']], axis=0)
closest(some_color, colors,  n=5)


['medium grey', 'purple grey', 'steel grey', 'battleship grey', 'grey purple']

In [22]:
some_color

array([127.5, 127.5, 127.5])

### Relationship Between Colors

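* Banana yellow is to yellow what hunter green is to green.
  * This analogy is read off the diagram below.
  * In vector form, `yellow - banana + green` applies the banana-to-yellow shift to green.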

![](https://www.dropbox.com/s/aon76xh7qlu1z2y/colors.png?dl=1)

```
some_color = colors['yellow'] - colors['banana'] + colors['green']
closest(some_color, colors, n=5)

['true green',
 'grassy green',
 'vibrant green',
 'grass green',
 'dark grass green']
```
![](https://www.dropbox.com/s/tjgnw6cwf0kwju8/green_prediction.png?dl=1)

In [42]:
some_color = colors['yellow'] - colors['banana'] + colors['green']

closest(some_color, colors,  n=5)


['true green',
 'grassy green',
 'vibrant green',
 'grass green',
 'dark grass green']

### Making the Jump: Colors to Words

- Remember how we did math with colors? That's possible because each color has a meaningful numeric representation, or an "embedding".
- Similarly, words can have their own numeric representations. When words with similar meanings get close numeric values, we say they have good "embeddings".
- What does "semantic" mean? It's all about meaning in language.

- Enter Word2Vec: it gives each word a unique numeric vector. When you do math on these vectors, the results tell us about the relationships between words.


### Understanding Word Embeddings

- Think of word embeddings as machine-interpretable representations.
  
- When two words have similar meanings, their number values (or embeddings) are close to each other. And when their meanings are different, their number values are far apart.

- It's not just words! We can do this number magic for sentences, big chunks of text, even things outside language like proteins or even pictures.

- But how do we measure if two word embeddings are close or far? We use `cosine` similarity.

  - If two vectors (like the rating vectors for movies $M_1$ and $M_2$) have a similarity of 1, they point in exactly the same direction.
  
  - If they have a similarity of 0, they are orthogonal, which we read as unrelated.
  
  - If they have a similarity of -1, they point in opposite directions!
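
Here is a minimal sketch of cosine similarity, computed as the dot product divided by the product of the vector lengths. The function name `cosine_similarity` and the example vectors are chosen here for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity([1, 2], [2, 4]))    # same direction
print(cosine_similarity([1, 0], [0, 1]))    # orthogonal
print(cosine_similarity([1, 2], [-1, -2]))  # opposite direction
```

Note that `[1, 2]` and `[2, 4]` are not the same vector, yet their similarity is 1 (up to rounding). Cosine similarity only looks at direction, not length.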


### Grasping Word2Vec

- The linguist J.R. Firth famously said, "You shall know a word by the company it keeps." This means words are often defined by other words around them.

- Let's play a game: "Paris is a city and the ___ of France." 
  - What word fits in the blank? 
    - Is it pretzel, pizza, capital, painting, or shame? 

- By guessing which words are likely to appear near other words, we get a hint about the word's meaning.

- If we can predict the neighborhood of a word, we get its essence. And this simple yet strong idea helps in building computer programs that understand language.


### Word2Vec Versus Language Models

- At its core, a language model plays a huge role in many crucial tasks that help computers understand human language.

- Think of it this way: a language model predicts what word or phrase comes next. For example, after "How are you," what's likely to follow? Maybe "...doing?" or "...feeling?".

> Language modeling is the task of assigning a probability to sentences in a language. […] Besides assigning a probability to each sequence of words, the language models also assign a probability for the likelihood of a given word (or a sequence of words) to follow a sequence of words.
>
> From *Neural Network Methods in Natural Language Processing*
    
- Now, Word2Vec isn't exactly a language model. But it's a handy tool when you want to dig deep into how words relate to each other.
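
To see the "what comes next" idea in code, here is a minimal sketch of a bigram model, which conditions on just one previous word. That is a big simplification compared with real language models. The tiny corpus and the name `next_word_probabilities` are invented for illustration.

```python
from collections import Counter, defaultdict

# Tiny corpus, invented for illustration.
corpus = [
    "how are you doing",
    "how are you feeling",
    "how are you doing today",
]

# Count how often each word follows each preceding word.
following = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1

def next_word_probabilities(prev):
    """Estimate P(next word | previous word) from bigram counts."""
    counts = following[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_probabilities("you"))
```

In this corpus, "doing" follows "you" twice and "feeling" once, so the model gives "doing" the higher probability.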


### Getting a Feel for Word2Vec

- Imagine you have a huge book or a collection of articles. 
- Now, from this collection, we pick out some specific words and give each a special list of numbers called a vector.
- As you read through this collection, think of each word as having neighbors - some words come just before or after it.
- Now, for a chosen word (let's call it our main word), and its neighbors (we can call them context words), Word2Vec tries to guess the main word based on the neighbor words using their vectors.
- The cool part? We keep tweaking these vectors to get better and better guesses!
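
The "main word and its neighbors" setup can be sketched as follows. The function name `context_target_pairs` and the window size of 2 are assumptions for illustration, and the sketch only builds the training examples. It does not train any vectors.

```python
def context_target_pairs(tokens, window=2):
    """Yield (context_words, target_word) for every position in tokens."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target

sentence = "paris is a city and the capital of france".split()
for context, target in context_target_pairs(sentence, window=2):
    print(context, "->", target)
```

Each printed line is one guessing task: given the neighbors on the left, guess the main word on the right.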


### Word2Vec Process

<img src="https://www.dropbox.com/s/i9686ozir426221/process.png?dl=1" width=800>

### Word2Vec Visualized

<img src="https://www.dropbox.com/s/sfkstxxyeevpiau/data_2d_space.png?dl=1" width=800>


### How Capitals Relate to Countries: An Example

- Just like how we naturally understand that Paris is linked to France, Tokyo has a similar connection to Japan.
- Consider these sentences:
  - "Paris is France's heart."
  - "Ambassadors meet in Paris representing France."
  - "Big events happen in Paris, especially at French government locations."
- Guess what? Swap out 'Paris' with 'Tokyo' and 'France' with 'Japan'. The relationship still makes sense!
- This shows how words in context can be replaced to understand relationships between them.

![Capital Relationships](https://www.dropbox.com/s/mwdh6z9qc0pflyy/capitals_example.png?dl=1)


### Finding Closest Values

* When working with colors, we simply sorted every color by its distance to the query. 

  * Given a word's vector, how can we efficiently identify its closest neighbors?

  * How can we scale this using MapReduce?

* Modern implementations that use the GPU can compute millions of comparisons in seconds.

![](https://www.dropbox.com/s/1zzv2azdaxru9rk/mat_mult.png?dl=1)
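
Here is a sketch of the matrix-multiplication idea on the CPU with NumPy. The embeddings are random stand-ins, and `top_k_cosine` is a name made up for this example. If every row is normalized to unit length, a single matrix-vector product gives the cosine similarity of the query to every word at once.

```python
import numpy as np

def top_k_cosine(query, matrix, k=3):
    """Indices of the k rows of `matrix` most similar to `query` (cosine)."""
    # Normalize rows once so a single matrix-vector product gives all cosines.
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_rows @ unit_query
    # argsort is ascending, so reverse to get the highest scores first.
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 50))  # random stand-in vectors
print(top_k_cosine(embeddings[42], embeddings, k=3))
```

The first index returned is the query row itself, since a vector always has the highest similarity with itself. For very large vocabularies, `np.argpartition` can avoid the full sort.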

### The Skip-gram Algorithm

![](https://www.dropbox.com/s/ykyjsroxu1utwd0/skipgram.png?dl=1)

### The Continuous Bag of Words (CBOW) Algorithm
![](https://www.dropbox.com/s/sae7f1sp84xuwwy/cbow.png?dl=1)
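
As a rough sketch of the CBOW prediction step (not a full implementation), the snippet below averages the context-word vectors and scores every vocabulary word with a softmax. The vocabulary, the vector size, and the names `W_in`, `W_out`, and `cbow_probabilities` are all assumptions for illustration. The vectors are random and untrained, so the probabilities come out close to uniform.

```python
import numpy as np

toy_vocab = ["paris", "is", "the", "capital", "of", "france"]
word_to_id = {word: i for i, word in enumerate(toy_vocab)}

rng = np.random.default_rng(0)
dim = 4
W_in = rng.normal(scale=0.1, size=(len(toy_vocab), dim))   # context (input) vectors
W_out = rng.normal(scale=0.1, size=(len(toy_vocab), dim))  # target (output) vectors

def cbow_probabilities(context_words):
    """P(target | context) for every vocabulary word, via a full softmax."""
    ids = [word_to_id[w] for w in context_words]
    hidden = W_in[ids].mean(axis=0)             # average the context vectors
    scores = W_out @ hidden                     # one score per vocabulary word
    exp_scores = np.exp(scores - scores.max())  # subtract max for stability
    return exp_scores / exp_scores.sum()

probs = cbow_probabilities(["the", "of"])
print(dict(zip(toy_vocab, probs.round(3))))
```

Training would nudge `W_in` and `W_out` so that the true target word (here, "capital") gets a higher probability. Skip-gram flips the direction: it starts from the target word's vector and scores each context word. Practical implementations usually replace the full softmax with cheaper approximations such as negative sampling.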

### Word2Vec and Modern Word Representations: A Comparison

- Word2Vec gives us quick and static word embeddings.
  - Benefits? We can get them without labeled data. Just plain text works!
  
- Newer methods offer context-specific embeddings.
  - They lean on advanced deep learning techniques.

- Expanding the horizon: We've got embeddings for sentences, whole documents, and more.
  - Even speech isn't left out. Think of 'wav2vec' by Facebook AI.

- As AI and deep learning evolve, embeddings are becoming drastically more accurate.
