# 🌟 Word2Vec - The Ultimate Word Embedding

**The Game Changer**: Word2Vec captures semantic meaning!

## Why Word2Vec?

### Problems with Previous Methods:

| Method | Problem |
|--------|---------|
| One-Hot | Sparse vectors, no meaning |
| BoW | No word order, no semantics |
| TF-IDF | Still sparse, no true similarity |

### Word2Vec Solution ✅

- **Dense vectors** (typically 300 dimensions)
- **Captures semantic relationships**
- **Similar words → similar vectors**
- **Vector arithmetic works!**
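
To make the contrast with one-hot encoding concrete, here is a minimal sketch. The three-word vocabulary and the 3-dimensional dense vectors are made-up illustrative values, not real Word2Vec output.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = ["good", "great", "table"]

# One-hot: each word gets its own axis, so distinct words are always orthogonal.
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# Dense (hypothetical values): related words can point in similar directions.
dense = {
    "good":  np.array([0.9, 0.8, 0.1]),
    "great": np.array([0.8, 0.9, 0.2]),
    "table": np.array([-0.1, 0.2, 0.9]),
}

print(cosine_similarity(one_hot["good"], one_hot["great"]))  # 0.0 for any two distinct words
print(cosine_similarity(dense["good"], dense["great"]))      # close to 1
print(cosine_similarity(dense["good"], dense["table"]))      # much lower
```

With one-hot vectors every pair of different words looks equally unrelated, which is the "no meaning" problem from the table above.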

---

## The Magic of Word2Vec 🪄

Word2Vec understands relationships:

```
king - man + woman ≈ queen
Paris - France + Germany ≈ Berlin
good ≈ great (close vectors)
good ≠ bad in meaning, yet their vectors are often close too (antonyms share contexts)
```

---

## How It Works (Simple Explanation)

Word2Vec trains a **shallow neural network** to predict words from their context; the weights it learns become the word vectors.

**Two approaches**:

### 1. CBOW (Continuous Bag of Words)
```
Context → Predict Center Word
[I, love, learning] → predict: "machine"
```
✅ Faster  
⚠️ Weaker on rare words

### 2. Skip-Gram
```
Center Word → Predict Context
"machine" → predict: [I, love, learning]
```
✅ Better for rare words  
✅ Better semantic learning  
⚠️ Slower
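
The two approaches differ mainly in how training examples are built from a sentence. The sketch below shows one way to generate them; the function names and the `window` parameter are illustrative choices, not part of any library.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding the center word."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: (context words, center word). The context predicts the center."""
    return [(context_window(tokens, i, window), center)
            for i, center in enumerate(tokens)]

def skipgram_examples(tokens, window=2):
    """Skip-Gram: (center word, context word). One pair per context word."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]

tokens = ["I", "love", "machine", "learning"]

print(cbow_examples(tokens)[2])
# (['I', 'love', 'learning'], 'machine')

print([pair for pair in skipgram_examples(tokens) if pair[0] == "machine"])
# [('machine', 'I'), ('machine', 'love'), ('machine', 'learning')]
```

Skip-Gram produces one training pair per context word, so it sees more examples per sentence than CBOW, which is consistent with it being slower to train.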

---

## Word Similarity - Cosine Distance

To measure how similar two words are:

```
distance = 1 - cosine_similarity(vector1, vector2)
```

**Interpretation**:
- Distance = 0 → Identical direction (the same word)
- Distance ≈ 0.3 → Very similar (king, queen)
- Distance ≈ 0.7 → Somewhat related
- Distance ≈ 1.0 → Unrelated
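
As a minimal sketch, the formula above can be written with NumPy. The vectors here are small made-up examples rather than real word embeddings.

```python
import numpy as np

def cosine_distance(v1, v2):
    """1 - cosine similarity: 0 = same direction, 1 = orthogonal, 2 = opposite."""
    cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(1.0 - cos_sim)

a = np.array([1.0, 2.0, 3.0])

print(cosine_distance(a, 2 * a))   # approximately 0: same direction, length is ignored
print(cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # 1.0: orthogonal
print(cosine_distance(a, -a))      # approximately 2: opposite direction
```

Because only the angle matters, scaling a vector does not change its distance to other vectors.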

---

## Pre-trained Models

**Google Word2Vec**:
- Trained on about 100 billion words of Google News text
- 300-dimensional vectors
- Size: ~1.5 GB
- Vocabulary of about 3 million words and phrases

**Recommendation**: Use pre-trained models instead of training from scratch (unless you have domain-specific data)

In [None]:
!pip install gensim

In [1]:
import gensim

In [2]:
from gensim.models import Word2Vec, KeyedVectors

In [3]:
## References: https://stackoverflow.com/questions/46433778/import-googlenews-vectors-negative300-bin


In [4]:
import gensim.downloader as api

wv = api.load('word2vec-google-news-300')

vec_king = wv['king']



## 📥 Loading Pre-trained Google Word2Vec

**What we're doing**: Loading Google's pre-trained Word2Vec model

**Model details**:
- Trained on Google News corpus
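- About 100 billion training words, with 3 million words and phrases in the vocabulary
- 300-dimensional vectors
- File size: ~1.5 GB

**Note**: The first call downloads the model, so it can take a few minutes! Later calls reuse the locally cached copy.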

In [5]:
vec_king

array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  

## 🔍 Exploring Word Vectors

Each word is represented as a 300-dimensional vector.

**Example**: The word "apple" becomes a list of 300 numbers like:
`[0.13, -0.25, 0.87, ..., 0.42]`

These numbers capture the **meaning** of the word based on how it's used in context!

In [6]:
vec_king.shape

(300,)

In [7]:
wv['cricket']

array([-3.67187500e-01, -1.21582031e-01,  2.85156250e-01,  8.15429688e-02,
        3.19824219e-02, -3.19824219e-02,  1.34765625e-01, -2.73437500e-01,
        9.46044922e-03, -1.07421875e-01,  2.48046875e-01, -6.05468750e-01,
        5.02929688e-02,  2.98828125e-01,  9.57031250e-02,  1.39648438e-01,
       -5.41992188e-02,  2.91015625e-01,  2.85156250e-01,  1.51367188e-01,
       -2.89062500e-01, -3.46679688e-02,  1.81884766e-02, -3.92578125e-01,
        2.46093750e-01,  2.51953125e-01, -9.86328125e-02,  3.22265625e-01,
        4.49218750e-01, -1.36718750e-01, -2.34375000e-01,  4.12597656e-02,
       -2.15820312e-01,  1.69921875e-01,  2.56347656e-02,  1.50146484e-02,
       -3.75976562e-02,  6.95800781e-03,  4.00390625e-01,  2.09960938e-01,
        1.17675781e-01, -4.19921875e-02,  2.34375000e-01,  2.03125000e-01,
       -1.86523438e-01, -2.46093750e-01,  3.12500000e-01, -2.59765625e-01,
       -1.06933594e-01,  1.04003906e-01, -1.79687500e-01,  5.71289062e-02,
       -7.41577148e-03, -

In [8]:
wv.most_similar('cricket')

[('cricketing', 0.8372225761413574),
 ('cricketers', 0.8165745735168457),
 ('Test_cricket', 0.8094819188117981),
 ('Twenty##_cricket', 0.8068488240242004),
 ('Twenty##', 0.7624265551567078),
 ('Cricket', 0.75413978099823),
 ('cricketer', 0.7372578382492065),
 ('twenty##', 0.7316356897354126),
 ('T##_cricket', 0.7304614186286926),
 ('West_Indies_cricket', 0.6987985968589783)]

In [None]:
wv.most_similar('happy')

## 🎯 Finding Similar Words

**`most_similar()`** finds words with similar meanings!

**How it works**:
1. Gets the vector for your word
2. Calculates cosine similarity with all other words
3. Returns top N most similar words with similarity scores

**Similarity score**:
- 1.0 = Identical
- 0.7-0.9 = Very similar
- 0.5-0.7 = Somewhat related
- < 0.5 = Not very related
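
The three steps above can be sketched as a brute-force search. This is only an illustration of the idea on made-up 3-dimensional vectors, not gensim's actual implementation.

```python
import numpy as np

def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    words = list(vectors)
    matrix = np.array([vectors[w] for w in words], dtype=float)
    # Normalize rows so that a dot product equals cosine similarity.
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = matrix @ matrix[words.index(word)]
    ranked = sorted(zip(words, sims), key=lambda pair: pair[1], reverse=True)
    # Drop the query word itself, then keep the top N.
    return [(w, float(s)) for w, s in ranked if w != word][:topn]

# Made-up 3-d vectors for illustration only.
toy_vectors = {
    "cricket":  [0.9, 0.1, 0.0],
    "football": [0.8, 0.3, 0.1],
    "hockey":   [0.7, 0.2, 0.3],
    "banana":   [0.0, 0.2, 0.9],
}

for neighbor, score in most_similar("cricket", toy_vectors):
    print(f"{neighbor}: {score:.3f}")
```

In this toy vocabulary the two sports rank well above `banana`, mirroring how the real model ranks cricket-related terms first.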

In [None]:
wv.similarity("hockey","sports")

## 📏 Calculating Distance Between Words

**Distance = 1 - cosine_similarity**

**Example results**:
- Distance(king, queen) ≈ 0.3 → Very similar! ✅
- Distance(king, apple) ≈ 0.8 → Not related ❌

Lower distance = more similar meaning!

In [9]:
vec=wv['king']-wv['man']+wv['woman']

In [10]:
vec

array([ 4.29687500e-02, -1.78222656e-01, -1.29089355e-01,  1.15234375e-01,
        2.68554688e-03, -1.02294922e-01,  1.95800781e-01, -1.79504395e-01,
        1.95312500e-02,  4.09919739e-01, -3.68164062e-01, -3.96484375e-01,
       -1.56738281e-01,  1.46484375e-03, -9.30175781e-02, -1.16455078e-01,
       -5.51757812e-02, -1.07574463e-01,  7.91015625e-02,  1.98974609e-01,
        2.38525391e-01,  6.34002686e-02, -2.17285156e-02,  0.00000000e+00,
        4.72412109e-02, -2.17773438e-01, -3.44726562e-01,  6.37207031e-02,
        3.16406250e-01, -1.97631836e-01,  8.59375000e-02, -8.11767578e-02,
       -3.71093750e-02,  3.15551758e-01, -3.41796875e-01, -4.68750000e-02,
        9.76562500e-02,  8.39843750e-02, -9.71679688e-02,  5.17578125e-02,
       -5.00488281e-02, -2.20947266e-01,  2.29492188e-01,  1.26403809e-01,
        2.49023438e-01,  2.09960938e-02, -1.09863281e-01,  5.81054688e-02,
       -3.35693359e-02,  1.29577637e-01,  2.41699219e-02,  3.48129272e-02,
       -2.60009766e-01,  

In [11]:
wv.most_similar([vec])

[('king', 0.8449392318725586),
 ('queen', 0.7300517559051514),
 ('monarch', 0.645466148853302),
 ('princess', 0.6156251430511475),
 ('crown_prince', 0.5818676352500916),
 ('prince', 0.5777117609977722),
 ('kings', 0.5613663792610168),
 ('sultan', 0.5376775860786438),
 ('Queen_Consort', 0.5344247817993164),
 ('queens', 0.5289887189865112)]
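
In the output above, `king` ranks first because the query was passed as a raw vector, so the search has no way to know which words went into it. gensim's `most_similar` also accepts `positive` and `negative` word lists, for example `wv.most_similar(positive=['king', 'woman'], negative=['man'])`, and in that form the input words are left out of the results. The sketch below reproduces the effect on made-up 3-dimensional vectors; the `nearest` helper is hypothetical, not a gensim function.

```python
import numpy as np

# Made-up vectors chosen so the arithmetic mirrors the real output (illustrative only).
toy = {
    "king":  np.array([1.0,  0.2, 0.0]),
    "queen": np.array([0.8, -0.2, 0.6]),
    "man":   np.array([0.0,  0.2, 0.1]),
    "woman": np.array([0.0, -0.2, 0.1]),
    "apple": np.array([-0.5, 0.1, 0.9]),
}

def nearest(query, vectors, skip=()):
    """Words sorted by cosine similarity to `query`, ignoring any in `skip`."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(w, cos(query, v)) for w, v in vectors.items() if w not in skip]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

query = toy["king"] - toy["man"] + toy["woman"]

print(nearest(query, toy)[0][0])                                  # king
print(nearest(query, toy, skip={"king", "man", "woman"})[0][0])   # queen
```

The query vector stays closest to `king` because subtracting `man` and adding `woman` only shifts it slightly; excluding the input words is what lets `queen` come out on top.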

---

## 🎓 Bonus Concept: Average Word2Vec

**Problem**: Word2Vec gives one vector per word. But most ML models need one fixed-size vector per sentence!

**Solution**: Average Word2Vec

### How it works:

For sentence: "The food is good"

1. Get vectors for each word:
   - the → 300-d vector
   - food → 300-d vector
   - is → 300-d vector
   - good → 300-d vector

2. **Average them**:
   ```
   Sentence Vector = (vec_the + vec_food + vec_is + vec_good) / 4
   ```

3. Result: **One 300-dimensional vector representing the entire sentence!**

### Why it works:
- Combines semantic meaning of all words ✅
- Fixed-size vector for any sentence length ✅
- Keeps the overall topic of the sentence, although word order is lost ⚠️
- Ready for ML models (classification, clustering, etc.) ✅

**Use case**: Text classification, sentiment analysis, document similarity
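
A minimal sketch of the averaging step, using made-up 3-dimensional vectors in place of the 300-dimensional ones. The `average_vector` helper and its zero-vector fallback for sentences with no known words are illustrative assumptions.

```python
import numpy as np

def average_vector(tokens, vectors, dim):
    """Mean of the vectors for tokens found in `vectors`; zeros if none are found."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

# Made-up 3-d vectors standing in for the real 300-d ones.
toy_vectors = {
    "the":  np.array([0.0, 0.0, 0.0]),
    "food": np.array([1.0, 2.0, 3.0]),
    "is":   np.array([0.0, 0.0, 0.0]),
    "good": np.array([3.0, 2.0, 1.0]),
}

sentence_vec = average_vector("the food is good".split(), toy_vectors, dim=3)
print(sentence_vec)   # [1. 1. 1.]
```

Words missing from the vocabulary are skipped rather than raising an error, which matters with real text because no pre-trained model covers every token.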

---

## 🎯 Summary: Complete NLP Pipeline

```
Raw Text
    ↓
1. Tokenization (split into words)
    ↓
2. Remove Stopwords
    ↓
3. Lemmatization (dictionary root)
    ↓
4. Word2Vec (convert words to dense vectors)
    ↓
5. Average Word2Vec (sentence-level vector)
    ↓
ML Model (Classification, Clustering, etc.)
    ↓
Predictions! 🎉
```
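
Steps 1 to 3 of the pipeline can be sketched with the standard library alone. The stopword set and the lemma lookup table below are tiny hand-written stand-ins; a real pipeline would more likely rely on a library such as NLTK or spaCy for both.

```python
import re

# Hypothetical miniature resources, for illustration only.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "was", "were"}
LEMMAS = {"dishes": "dish", "loved": "love", "guests": "guest"}

def preprocess(text):
    """Tokenize, drop stopwords, then map each word to its dictionary form."""
    tokens = re.findall(r"[a-z']+", text.lower())        # 1. tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]   # 2. stopword removal
    return [LEMMAS.get(t, t) for t in tokens]            # 3. lemmatization (lookup)

print(preprocess("The dishes were great and the staff loved the guests"))
# ['dish', 'great', 'staff', 'love', 'guest']
```

The resulting token list is what gets looked up in the Word2Vec model and averaged in steps 4 and 5.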

**Congratulations!** 🎊 You now understand the complete journey from raw text to ML-ready features!

**Next Steps**: 
- Deep Learning for NLP (RNNs, LSTMs)
- Transformers (BERT, GPT)
- Generative AI