In [None]:
import numpy as np

### Word2Vec: A Better Word Embedding

- **Issues with tf-idf Embeddings**:
  - **Long and Sparse Vectors**: Each vector has one entry per vocabulary word, so it is very long and consists mostly of zeros.
  - **Not Ideal for Machine Learning**: Due to their length and sparsity, they lead to more model parameters and longer training times.
  - **Word Order & Synonyms**: Tf-idf ignores word order and treats synonyms as unrelated words.

- How Do We Address These Issues?
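
Before looking at alternatives, the sparsity problem can be made concrete with a small sketch. It builds tf-idf vectors by hand for a hypothetical three-sentence corpus; the `docs` list and the plain `log(N / df)` weighting are assumptions chosen for illustration, not the formula of any particular library.

```python
import numpy as np

# Hypothetical toy corpus; real vocabularies are orders of magnitude larger.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as the bank raised rates",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
index = {w: i for i, w in enumerate(vocab)}

# Term-frequency matrix: one row per document, one column per vocabulary word.
tf = np.zeros((len(docs), len(vocab)))
for row, doc in enumerate(tokenized):
    for w in doc:
        tf[row, index[w]] += 1

# Simple idf variant (assumed here): log(N / document frequency).
df = (tf > 0).sum(axis=0)
idf = np.log(len(docs) / df)
tfidf = tf * idf

sparsity = np.mean(tfidf == 0)
print(f"vector length: {len(vocab)}, fraction of zero entries: {sparsity:.2f}")
```

Even with only three short sentences, most entries are zero, and every new vocabulary word adds another dimension to every vector.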


### Using Taxonomy to Capture Similarity

- One way to capture the word similarity that tf-idf misses is to use a taxonomy.
- What's a Taxonomy? 
  - It's a "knowledge organization system" that is often visualized as a tree or graph. 
  - It categorizes terms within a field and the relationships between them.
  - Taxonomies vary: from broad (like English Language taxonomy) to specific (like Computing or Amazon Product Taxonomy).
  
- A popular example is the WordNet Taxonomy:
  - Contains 155,327 English words.
  - [More about WordNet on Wikipedia](https://en.wikipedia.org/wiki/WordNet)

![WordNet Taxonomy Example](https://www.dropbox.com/s/9kapn0eq6v84g2m/word_net_taxonomy.png?dl=1)
[Source](https://escholarship.org/content/qt9j8221x8/qt9j8221x8.pdf)


### Semantic Similarity

* WordNet organizes words into sets of synonyms, known as synsets. 
  * Each synset represents a distinct concept.

* Hypernymy and Hyponymy Relations:
  * WordNet provides hypernyms (more general terms) and hyponyms (more specific terms) for each word.
  * Canine is a `hypernym` of Dog
  * Poodle is a `hyponym` of Dog

* For each word in your dataset, create a feature vector where each entry corresponds to the semantic similarity between the word and a set of predefined words (or reference words). 
  * This will reduce the sparsity, as each word is now represented as a vector of similarity scores.

* Meronyms and Holonyms:
    * Consider incorporating part-whole relationships (meronyms) and whole-part relationships (holonyms) as additional features. For example, 'wheel' is a meronym of 'car'.
      * Branch is a `meronym` of tree
      * Forest is a `holonym` of tree    
* Ideally we want to be able to **learn** a word's representation, such that the representation encodes meaning
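
The similarity-vector idea above can be sketched with a hand-built taxonomy. Everything here is hypothetical: the `hypernym` dictionary is a made-up miniature tree rather than WordNet data, and `path_similarity` is a simple path-based score (one over one plus the number of edges between two terms) assumed for illustration.

```python
# Hypothetical miniature taxonomy: each term maps to its hypernym (parent).
hypernym = {
    "poodle": "dog",
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
    "snake": "reptile",
    "lizard": "reptile",
    "reptile": "animal",
}

def path_to_root(term):
    """Return the chain [term, parent, grandparent, ..., root]."""
    path = [term]
    while path[-1] in hypernym:
        path.append(hypernym[path[-1]])
    return path

def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest path between a and b)."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    depth_in_b = {term: i for i, term in enumerate(path_b)}
    for steps_a, term in enumerate(path_a):
        if term in depth_in_b:  # first shared ancestor
            return 1 / (1 + steps_a + depth_in_b[term])
    return 0.0  # no shared ancestor

reference_words = ["dog", "cat", "snake"]

def similarity_vector(word):
    """Dense feature vector: similarity of `word` to each reference word."""
    return [path_similarity(word, ref) for ref in reference_words]

print(similarity_vector("poodle"))
print(similarity_vector("lizard"))
```

Each word is now a short dense vector with one entry per reference word, instead of one entry per vocabulary word.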

### Beyond Taxonomies: The Case for Word2Vec

- Constructing taxonomies is intricate and demands expert input.
- Taxonomies offer clear hierarchies, but a word's context in text often shapes its meaning.
  - Unlike dynamic embeddings, taxonomies can restrict a word to a single definition.
    - Bank can mean many different things (verb, geographical entity, financial institution, etc.) 
- Representing Distances: Knowing "snake" is a "reptile" is useful, but how close is it to "lizard"? Taxonomies might fall short.
  
* While taxonomies provide organized structure, their rigidity limits comprehensive word representation. 

### Understanding Word2Vec: Turning Word Meanings into Math

- When we talk about a word's 'meaning', we're referring to the concept or idea it stands for. 
  
- Now, imagine if we could translate these concepts into numbers to use in formulas.
  * This means we can actually do calculations with word meanings!
  
- Here's an interesting example: 
  If we do the math with the concepts behind the words, such as 
  King minus Man plus Woman, the answer should remind us of the word 'Queen'.
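
To see what "doing the math" could look like, here is a minimal sketch with hand-made two-dimensional vectors. The numbers are invented for illustration (one axis loosely standing for "royalty", the other for "gender"); learned embeddings have many more dimensions and no such tidy axes.

```python
import numpy as np

# Hand-made 2-D toy vectors (assumption): axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def nearest_word(vector, exclude=()):
    """Return the vocabulary word whose vector is closest (Euclidean) to `vector`."""
    candidates = [w for w in toy_vectors if w not in exclude]
    return min(candidates, key=lambda w: np.linalg.norm(toy_vectors[w] - vector))

result = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
print(nearest_word(result, exclude=("king", "man", "woman")))
```

The query words are excluded from the search, a common convention in analogy tests, so that the answer is not simply one of the inputs.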


<img src="https://www.dropbox.com/s/hiqe4ql1ne9reu2/mwkq.png?dl=1" width=700>

### Drawing Parallels: Word Operations vs. Color Operations

- We already perform operations on words that represent colors, much like math.

- Think about how we mix and match colors. Can't we do the same with words?
  - For instance:
    - Mixing red and green gives us yellow.
    - Subtracting blue from magenta leaves us with red.
    - Yellow brings to mind bananas more than it does green.
    - Combine black and white, and you get a shade of grey.
    - Just as 'banana' is a shade of yellow, 'sky' is a shade of blue.

[![Color Operations](https://www.dropbox.com/s/aon76xh7qlu1z2y/colors.png?dl=1)](https://www.dropbox.com/s/aon76xh7qlu1z2y/colors.png?dl=1)

Reference: [Exploring word color associations](https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469)


![](https://www.dropbox.com/s/5eh77i3byy0av50/xkcd%20colors.png?dl=1)

In [None]:
import json
# Load the xkcd color survey data (color names and hex codes)
with open("media/xkcd.json") as f:
    color_data = json.load(f)
print(color_data.keys())


In [None]:
color_data

In [None]:
color_data["colors"][:3]

In [None]:
x = '#acc2d9'
from textwrap import wrap
wrap(x[1:], 2)

In [None]:
colors = {col_info["color"]:tuple(wrap(col_info["hex"][1:], 2)) for col_info in color_data["colors"]}
colors["cloudy blue"]

In [None]:
int('ac', 16), int('c2', 16), int('d9', 16)

In [None]:
# Convert each pair of hex digits to an integer in 0-255, giving one RGB vector per color
colors = {name: np.array([int(hex_v, 16) for hex_v in hex_t]) for name, hex_t in colors.items()}
colors["cloudy blue"]

In [None]:
colors

In [None]:
print("These colors were manually labelled by participants")

print(f"Black is {colors['black']}, white is: {colors['white']} and red is {colors['red']}")


![](https://www.dropbox.com/s/9k2828pyr0nypla/red.png?dl=1)

In [None]:
np.array([1,2,3]) + np.array([1,1,1])

In [None]:
np.array([1,2,3]) - np.array([1,1,1])

In [None]:
# Euclidean distance between two RGB vectors
def dist(coord1, coord2):
    return np.linalg.norm(coord1 - coord2)

dist(colors['red'], colors['blue'])



In [None]:
np.mean([[1,2,3], [1,2,3], [1,2,3]], axis =0)

In [None]:
dist(colors['red'], colors['green']) > dist(colors['cherry red'], colors['tomato red'])

In [None]:
def closest(query, colors, n=10):
    # Return the names of the n colors nearest to the query vector
    return sorted(colors.keys(),
                  key=lambda name: dist(query, colors[name]))[:n]

closest(colors['red'], colors,  n=5)

### Manipulating Colors with Their RGB "Embeddings"

- Using these RGB "embeddings", we can perform arithmetic on colors, much like we do with numbers.
  - Think of these operations:
     * Adding red and green gives us yellow. 
     * Subtracting blue from magenta leaves us with red.
     * In terms of distance, yellow is nearer to banana than to green in RGB space.
     * Mix black and white in equal parts, and you'll land on grey.
     * In the same way "banana" relates to the color yellow, "hunter green" relates to the basic shade of green.


In [None]:
# Red + green = yellow
some_color = colors["red"] + colors["green"]
closest(some_color, colors, n=5)

In [None]:
# Magenta - blue = red
some_color = colors["magenta"] - colors["blue"]
closest(some_color, colors, n=5)


In [None]:
dist(colors["yellow"], colors["banana"]) < dist(colors["yellow"], colors["green"])


In [None]:
some_color =  np.mean([colors['black'], colors['white']], axis=0)
closest(some_color, colors,  n=5)


In [None]:
some_color

### Relationship Between Colors

* Banana yellow is to yellow what hunter green is to green
  * Taken from the same diagram as before

![](https://www.dropbox.com/s/aon76xh7qlu1z2y/colors.png?dl=1)

```
some_color = colors['yellow'] - colors['banana'] + colors['green']
closest(some_color, colors, n=5)

['true green',
 'grassy green',
 'vibrant green',
 'grass green',
 'dark grass green']
```
![](https://www.dropbox.com/s/tjgnw6cwf0kwju8/green_prediction.png?dl=1)

In [None]:
some_color = colors['yellow'] - colors['banana'] + colors['green']

closest(some_color, colors,  n=5)


### Making the Jump: Colors to Words

- Remember how we did math with colors? That's possible because each color has a meaningful numeric representation, or an "embedding".
- Similarly, words can have their own numeric representations. When words with similar meanings (semantically similar words) get nearby numeric values, we say they have good "embeddings".

- Enter Word2Vec: it gives each word a unique numeric vector. When you do math on these vectors, the results tell us about the relationships between words.


### Understanding Word Embeddings

- Word embeddings are machine-interpretable numerical representations, analogous to how colors can be encoded as RGB values

- Words with similar semantic meanings have embeddings that are mathematically close in vector space, while words with different meanings have embeddings that are further apart

- This embedding concept extends beyond individual words - we can create vector representations for sentences, documents, and even non-linguistic data like protein sequences and images
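
"Mathematically close" is often measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; the `embeddings` dictionary is hypothetical and does not come from a trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 3-D embeddings, for illustration only.
embeddings = {
    "cat":   np.array([0.9, 0.8, 0.1]),
    "dog":   np.array([0.8, 0.9, 0.2]),
    "stock": np.array([0.1, 0.2, 0.9]),
}

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))
print(cosine_similarity(embeddings["cat"], embeddings["stock"]))
```

Cosine similarity compares direction rather than length, which is one reason it is a popular choice for comparing embeddings; the Euclidean `dist` used for the colors is another reasonable option.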



### Understanding Word2Vec

- "You shall know a word by the company it keeps" - J.R. Firth
  - Words derive their semantic meaning from their contextual relationships

- Consider: "Paris is a city and the ___ of France."
  - Among options like 'pretzel', 'pizza', 'capital', or 'painting', 
  - 'capital' fits naturally because it appears frequently in similar contexts

- Word2Vec learns semantic relationships by predicting which words occur near each other
  - The underlying idea, that words appearing in similar contexts tend to have similar meanings, is the distributional hypothesis, which is fundamental to computational linguistics

- By capturing these contextual patterns, Word2Vec generates vector representations that preserve semantic relationships, enabling machines to process natural language
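
The "company a word keeps" can be made tangible by counting neighbors. This is not Word2Vec itself, only raw context counting on a hypothetical four-sentence corpus, with a window size of 2 chosen arbitrarily.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus.
corpus = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "rome is the capital of italy",
    "pizza is a popular food in italy",
]

window = 2  # number of neighbors considered on each side
context_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                context_counts[word][tokens[j]] += 1

def shared_context(a, b):
    """Number of co-occurrence counts that words a and b have in common."""
    return sum((context_counts[a] & context_counts[b]).values())

print(shared_context("paris", "berlin"))
print(shared_context("paris", "pizza"))
```

Words used in the same way ("paris", "berlin") share more context than words that are not ("paris", "pizza"), and this is the signal Word2Vec compresses into dense vectors.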


### Word2Vec and Language Models: Key Distinctions

- Language models serve as foundational components in natural language processing systems, enabling various comprehension tasks

- Language models predict the probability distribution of words or tokens in a sequence:
  - Given "How are you ...", they compute probabilities for subsequent words like "doing", "feeling" or "pizza"
  - They model both local and long-range dependencies in text

```
Language modeling is the task of assigning a probability to sentences in a language. [...] Besides assigning a probability to each sequence of words, the language models also assign a probability for the likelihood of a given word (or a sequence of words) to follow a sequence of words.
```
- Source: Neural Network Methods in Natural Language Processing

- While Word2Vec shares some application similarities with language models, it differs in its primary objective:
  - Language models: Predict probability distributions over entire vocabularies
  - Word2Vec: Generate dense vector representations that capture semantic relationships between words
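
To make the language-model side concrete, here is a count-based bigram model, a deliberately simple stand-in that conditions only on the single previous word. The training sentences are hypothetical, and unlike the models described above it captures no long-range dependencies.

```python
from collections import Counter, defaultdict

# Hypothetical training sentences.
sentences = [
    "how are you doing",
    "how are you feeling",
    "how are you doing today",
    "where are you going",
]

bigram_counts = defaultdict(Counter)
for sentence in sentences:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1

def next_word_distribution(prev):
    """Maximum-likelihood estimate of P(next word | previous word)."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

print(next_word_distribution("you"))
```

The output of a language model is a probability distribution over next words, whereas the useful output of Word2Vec is the learned vector for each word.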



### Word2Vec: Core Mechanism

- Word2Vec operates on large text corpora and assigns each vocabulary word an n-dimensional vector in a continuous vector space

- For each word occurrence in the corpus:
  - A center word is selected
  - A context window defines neighboring words within a specified distance
  - The relationship between center and context words informs the learning process

- The training objective:
  - Given context words, predict the center word (CBOW model)
  - Or predict context words given the center word (Skipgram model)
  - Vectors are iteratively optimized through gradient descent to maximize prediction accuracy

- Through this iterative optimization process, the resulting word vectors capture meaningful semantic relationships in the embedding space
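
The center-word and context-window idea can be sketched as plain data preparation. In this sketch, skip-gram uses the center word as input and each neighbor as a target, while CBOW uses the neighbors as input and the center word as the target. The function names and the example sentence are hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word is the input, each neighbor a target."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context_words, center) examples: the neighbors are the input, the center the target."""
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, center))
    return examples

tokens = "the quick brown fox jumps".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Both functions slide the same window over the text; they differ only in which side of the pair is treated as input and which as the prediction target.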

### Word2Vec Process

<img src="https://www.dropbox.com/s/i9686ozir426221/process.png?dl=1" width=800>

### Word2Vec Visualized

<img src="https://www.dropbox.com/s/sfkstxxyeevpiau/data_2d_space.png?dl=1" width=800>


### How Capitals Relate to Countries: An Example

![Capital Relationships](https://www.dropbox.com/s/mwdh6z9qc0pflyy/capitals_example.png?dl=1)


### The Skipgram Algorithm
![](https://www.dropbox.com/s/ykyjsroxu1utwd0/skipgram.png?dl=1)
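
A minimal NumPy sketch of the skip-gram idea follows. It is a simplified illustration under several assumptions: a tiny made-up corpus, a full softmax over the vocabulary, plain SGD, and arbitrary choices for `dim`, the learning rate, and the number of epochs. Practical Word2Vec implementations typically replace the full softmax with negative sampling or hierarchical softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(tokens))
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = [word_to_id[w] for w in tokens]

# (center, context) index pairs with a window of 1 word on each side.
pairs = [(ids[i], ids[j])
         for i in range(len(ids))
         for j in (i - 1, i + 1)
         if 0 <= j < len(ids)]

dim = 8                                                # embedding size (assumed)
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # center-word vectors
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # context-word vectors

def epoch(lr=0.1):
    """One pass of SGD over all pairs; returns the mean cross-entropy loss."""
    total = 0.0
    for center, context in pairs:
        h = W_in[center].copy()               # hidden layer = center embedding
        scores = W_out @ h
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                  # softmax over the vocabulary
        total += -np.log(probs[context])
        grad_scores = probs.copy()
        grad_scores[context] -= 1.0           # d(loss)/d(scores)
        grad_h = W_out.T @ grad_scores        # computed before W_out is updated
        W_out[:] -= lr * np.outer(grad_scores, h)
        W_in[center] -= lr * grad_h
    return total / len(pairs)

losses = [epoch() for _ in range(50)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

After training, the rows of `W_in` serve as the word embeddings; the prediction task itself is only a means of shaping those vectors.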

### The Continuous Bag of Words (CBOW) Algorithm
![](https://www.dropbox.com/s/sae7f1sp84xuwwy/cbow.png?dl=1)
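
The CBOW forward pass can be sketched in the same style: average the context-word vectors, score every vocabulary word against that average, and apply a softmax. The vocabulary, `dim`, and the random initial weights are assumptions for illustration, and no training is performed here.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

dim = 4  # embedding size (assumed)
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # context-word vectors
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # center-word vectors

def cbow_probabilities(context_words):
    """P(center word | context): average the context vectors, then softmax."""
    context_ids = [word_to_id[w] for w in context_words]
    h = W_in[context_ids].mean(axis=0)        # the "bag": order is ignored
    scores = W_out @ h
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

probs = cbow_probabilities(["the", "sat"])      # context around "cat"
loss = -np.log(probs[word_to_id["cat"]])        # cross-entropy for the true center
print(probs.round(3), float(loss))
```

Because the context vectors are averaged, swapping the order of the context words leaves the prediction unchanged, which is where the "bag of words" in the name comes from.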

### Word2Vec and Modern Embedding Architectures: A Comparative Analysis

- Word2Vec generates efficient, static word embeddings
  - Requires only unlabeled text corpora for training
  - Computationally efficient for basic NLP tasks

- Contemporary architectures produce contextual embeddings
  - Leverage transformer-based architectures
  - Generate dynamic representations based on surrounding context

- The embedding paradigm extends beyond word-level analysis
  - Encompasses sentence, document, and multi-modal representations
  - Notable examples include Facebook AI's wav2vec for speech audio embeddings

- Recent advances in deep learning have significantly enhanced embedding quality
  - Improved semantic and syntactic relationship capture
  - Greater precision in downstream NLP tasks
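
The static versus contextual distinction can be illustrated with a toy example. The static lookup below returns the same vector for "bank" in any sentence. The `toy_contextual_embedding` function is a crude, hypothetical stand-in that simply averages a word with its neighbors; it is not how transformer models work, but it shows what it means for a representation to depend on surrounding context.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["i", "sat", "by", "the", "river", "bank", "deposited", "cash", "at"]
static = {w: rng.normal(size=4) for w in vocab}   # one fixed vector per word

def static_embedding(tokens, position):
    """Static lookup: the surrounding words are ignored."""
    return static[tokens[position]]

def toy_contextual_embedding(tokens, position, window=2):
    """Crude stand-in for a contextual model: mix the word with its neighbors."""
    lo, hi = max(0, position - window), min(len(tokens), position + window + 1)
    return np.mean([static[t] for t in tokens[lo:hi]], axis=0)

s1 = "i sat by the river bank".split()
s2 = "i deposited cash at the bank".split()
i1, i2 = s1.index("bank"), s2.index("bank")

# Same word, different sentences: static vectors match, contextual ones differ.
print(np.allclose(static_embedding(s1, i1), static_embedding(s2, i2)))
print(np.allclose(toy_contextual_embedding(s1, i1), toy_contextual_embedding(s2, i2)))
```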