# Word2Vec

https://medium.com/@enozeren/word2vec-from-scratch-with-python-1bba88d9f221
https://medium.com/@anmoltalwar/cbow-word2vec-854a043ee8f3

- One Hot Encodings
- Co-occurrence Matrix
- Word Embeddings

Steps Involved

- Corpus Creation: Collect a large corpus of text data. This could be a collection of books, articles, or even social media posts.
- Tokenization: Break the text down into individual words or tokens.
- Vocabulary Building: Create a vocabulary of all unique words in the corpus.
- Word Embedding Model Selection: Choose a suitable word embedding model. Popular options include:
    - Word2Vec (skip-gram or CBOW)
    - GloVe
    - FastText
- Model Training: Train the selected model on the corpus. This involves iteratively adjusting the word embeddings to minimize a loss function.
- Vector Lookup: Once trained, the model can be used to look up the vector representation of any word in the vocabulary.
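
The first three steps (corpus, tokenization, vocabulary) can be sketched in a few lines of plain Python. This is only a minimal illustration under simple assumptions: the regex-based `tokenize` and the `build_vocab` helper are hypothetical names introduced here, and the two-sentence corpus is toy data rather than a realistic training set.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocab(tokens):
    """Map each unique token to an integer index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

# Toy corpus (assumption: a real corpus would be far larger)
corpus = "The cat sat on the mat. The dog chased the cat."
tokens = tokenize(corpus)
word_to_idx = build_vocab(tokens)

print(tokens)
print(word_to_idx)
```

The resulting `word_to_idx` mapping is what a later lookup step would use to find the row of a word in the embedding matrix.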


NOTE: In LLM training, a single corpus (a rich text document) is split into sentences. Here `batch` = 30, where `batch` is the number of sentences.

#### One Hot Encodings of each word in our Corpus
- One-hot vectors capture no semantic similarity: any two distinct word vectors are orthogonal, so every pair of words looks equally unrelated.

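```python
# `and` is a reserved keyword in Python, so the vectors are stored in a dict
one_hot = {
    "dogs":   [1, 0, 0, 0, 0, 0],
    "and":    [0, 1, 0, 0, 0, 0],
    "cats":   [0, 0, 1, 0, 0, 0],
    "love":   [0, 0, 0, 1, 0, 0],
    "this":   [0, 0, 0, 0, 1, 0],
    "meadow": [0, 0, 0, 0, 0, 1],
}
```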
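
To make the "no semantic similarity" point concrete, the sketch below builds the same six one-hot vectors with NumPy and takes dot products. The name `one_hot_matrix` is chosen here for illustration.

```python
import numpy as np

vocab = ["dogs", "and", "cats", "love", "this", "meadow"]
# Row i of the identity matrix is the one-hot vector of vocab[i]
one_hot_matrix = np.eye(len(vocab), dtype=int)

dogs = one_hot_matrix[vocab.index("dogs")]
cats = one_hot_matrix[vocab.index("cats")]

print(dogs)
print(int(dogs @ cats))  # dot product of two different words
print(int(dogs @ dogs))  # dot product of a word with itself
```

Because every pair of distinct one-hot vectors has a dot product of zero, "dogs" is exactly as similar to "cats" as it is to "meadow".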



#### Co-occurrence Matrix

- A large vocabulary makes this matrix expensive to store: it has one row and one column per word, so memory grows quadratically with vocabulary size.

<img src="images/word2vec_s1.png">
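
As a rough sketch of how such a matrix might be built, the code below counts how often each pair of words appears within a fixed window of each other. The function name `cooccurrence_matrix`, the window of 1, and the toy sentence (made from the six words of the one-hot example) are assumptions for illustration.

```python
import numpy as np

def cooccurrence_matrix(tokens, window=1):
    """Count, for every word, how often each other word appears within `window` positions."""
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)), dtype=int)
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:  # skip the word itself
                M[idx[word], idx[tokens[ctx_pos]]] += 1
    return vocab, M

tokens = "dogs and cats love this meadow".split()  # toy sentence
vocab, M = cooccurrence_matrix(tokens, window=1)

print(vocab)
print(M)
print(M.shape)  # (V, V): one row and one column per vocabulary word
```

The matrix is V × V, which is why it becomes impractical to store densely once the vocabulary reaches tens or hundreds of thousands of words.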



#### Word Embedding

- Compress the co-occurrence matrix to a lower dimension with Singular Value Decomposition (SVD). In the literature this approach is called latent semantic analysis (LSA).
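```python
# `and` is a reserved keyword in Python, so the vectors are stored in a dict
embeddings = {
    "dogs": [0.14, 3.42],
    "and":  [0.90, 1.12],
    "cats": [2.30, -0.56],
    # ...
}
```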
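
The compression step can be sketched with `numpy.linalg.svd`. The 4 × 4 count matrix below is made-up data for four hypothetical words, and scaling the left singular vectors by the singular values is one common convention, not the only one.

```python
import numpy as np

# Made-up symmetric co-occurrence counts for 4 hypothetical words
M = np.array([
    [0, 2, 1, 0],
    [2, 0, 3, 1],
    [1, 3, 0, 2],
    [0, 1, 2, 0],
], dtype=float)

k = 2  # target embedding dimension (assumption)

# Full SVD: M = U @ diag(S) @ Vt, with singular values sorted in descending order
U, S, Vt = np.linalg.svd(M)

# Keep the k strongest directions; each row is a k-dimensional word vector
embeddings = U[:, :k] * S[:k]

print(embeddings.shape)
print(embeddings)
```

Each word is now described by `k` numbers instead of one count per vocabulary word.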


#### CBOW

CBOW (Continuous Bag of Words) is a neural network model used to learn word embeddings. It predicts a target word given its surrounding context words. It's one of the two main architectures used in Word2Vec.
1. Word embedding: represent each word as a one-hot encoding or a randomly initialized vector.
2. Define the context window size and extract the context words around each target word.
3. Aggregate the context vectors: average or sum the vectors of all words in the context window to form the input to the shallow neural network.
4. Train the shallow neural network.
5. Test the model.

##### Example:

- Corpus: "The cat sat on the mat. The dog chased the cat."
- Context window size: 2 (user-defined; here, one word on each side of the target)
- Vocabulary: {"the", "cat", "sat", "on", "mat", "dog", "chased"}
- Embedding dimension: 3 (user-defined)

##### Step 1: Embedding Initialization
Let's randomly initialize the embeddings for each word:

```json
{
  "the": [0.1, 0.2, 0.3],
  "cat": [0.4, 0.5, 0.6],
  "sat": [0.7, 0.8, 0.9],
  "on": [1.0, 1.1, 1.2],
  "mat": [1.3, 1.4, 1.5],
  "dog": [1.6, 1.7, 1.8],
  "chased": [1.9, 2.0, 2.1]
}
```

##### Step 2: Context Window and Embedding Lookup

- Target word: "cat"
- Context window: ["the", "sat"]

2.A Word Embeddings:
```json
{
  "the": [0.1, 0.2, 0.3],
  "sat": [0.7, 0.8, 0.9]
}
```

Alternatively, the context words can be represented as one-hot encodings (OHC).

2.B One-Hot Encoding (OHC):
```json
{
  "the": [1, 0, 0, 0, 0, 0, 0],
  "sat": [0, 0, 1, 0, 0, 0, 0]
}
```
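
Step 2 can be sketched as a small helper that walks over the tokens and collects the neighbours of each target word. The names `context_pairs` and `half_window` are hypothetical, and the sketch assumes the window covers one word on each side of the target, which matches the ["the", "sat"] context shown for "cat".

```python
def context_pairs(tokens, half_window=1):
    """Return a list of (context_words, target_word) for every position in tokens."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - half_window):i]
        right = tokens[i + 1:i + 1 + half_window]
        pairs.append((left + right, target))
    return pairs

tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = context_pairs(tokens, half_window=1)

# Subset of the embeddings from Step 1, enough for the "cat" example
embeddings = {
    "the": [0.1, 0.2, 0.3],
    "cat": [0.4, 0.5, 0.6],
    "sat": [0.7, 0.8, 0.9],
}

context, target = pairs[1]  # the pair whose target is "cat"
context_vectors = [embeddings[w] for w in context]

print(context, target)
print(context_vectors)
```

Words at the edges of the text simply get a shorter context, which is one of several possible ways to handle boundaries.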

##### Step 3: Aggregated context vector (average or sum)
Average or sum the context vectors to create a single vector representation of the context words.

3.A Average the word embeddings:
```text
(0.1 + 0.7) / 2 = 0.4
(0.2 + 0.8) / 2 = 0.5
(0.3 + 0.9) / 2 = 0.6

[0.4, 0.5, 0.6]
```

3.B Average the one-hot encodings (OHC):
```text
(1 + 0) / 2 = 0.5
(0 + 0) / 2 = 0
(0 + 1) / 2 = 0.5
(0 + 0) / 2 = 0
(0 + 0) / 2 = 0
(0 + 0) / 2 = 0
(0 + 0) / 2 = 0

[0.5, 0, 0.5, 0, 0, 0, 0]
```
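
The two variants are closely related: multiplying the averaged one-hot vector by the embedding matrix gives the same result as averaging the embeddings directly. The NumPy sketch below checks this with the numbers from Step 1. The matrix name `E` is introduced here for illustration.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "dog", "chased"]
# Embedding matrix from Step 1: row i is the vector of vocab[i]
E = np.array([
    [0.1, 0.2, 0.3],  # the
    [0.4, 0.5, 0.6],  # cat
    [0.7, 0.8, 0.9],  # sat
    [1.0, 1.1, 1.2],  # on
    [1.3, 1.4, 1.5],  # mat
    [1.6, 1.7, 1.8],  # dog
    [1.9, 2.0, 2.1],  # chased
])

context = ["the", "sat"]
idx = [vocab.index(w) for w in context]

avg_embedding = E[idx].mean(axis=0)                 # 3.A: average the embeddings
avg_one_hot = np.eye(len(vocab))[idx].mean(axis=0)  # 3.B: average the one-hot vectors

print(avg_embedding)
print(avg_one_hot)
print(np.allclose(avg_one_hot @ E, avg_embedding))
```

In other words, an embedding lookup is a shortcut for multiplying a one-hot vector by the embedding matrix.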


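##### Step 4: Training with a Shallow Neural Network

- INPUT: `[0.4, 0.5, 0.6]` (the averaged CBOW context vector from Step 3.A)
- WEIGHTS: `[w1, w2, w3]` for each vocabulary word, i.e. a 3 × 7 matrix, randomly initialized in the first round
- OUTPUT: one predicted probability per vocabulary word; the most probable word is the predicted target
- PROCESS:
    - get the input embedding
    - initialize the weights (or reuse the weights adjusted in the previous round)
    - take the dot product of the input and the weights
    - apply the softmax activation function
    - compute the loss with cross-entropy
    - backpropagate to update the weights, then repeat with the updated weights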
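
The training loop above can be sketched in NumPy for a single (context, target) pair. This is a simplified illustration under stated assumptions: the output weight matrix `W_out` is assumed to have shape 3 × 7, the learning rate and step count are arbitrary, and only the output weights are updated, whereas a full CBOW implementation would also backpropagate into the input embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 7, 3                                 # vocabulary size, embedding dimension
W_out = rng.normal(scale=0.1, size=(D, V))  # randomly initialized weights (assumed shape)
h = np.array([0.4, 0.5, 0.6])               # averaged context vector from Step 3.A
target = 1                                  # index of "cat" in the vocabulary
lr = 0.5                                    # learning rate (arbitrary choice)

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

losses = []
for step in range(50):
    scores = h @ W_out                      # dot product of input and weights
    probs = softmax(scores)                 # probability for each vocabulary word
    losses.append(-np.log(probs[target]))   # cross-entropy loss for the true target
    grad_scores = probs.copy()
    grad_scores[target] -= 1.0              # gradient of the loss w.r.t. the scores
    W_out -= lr * np.outer(h, grad_scores)  # gradient-descent update of the weights

predicted = int(np.argmax(softmax(h @ W_out)))
print(losses[0], losses[-1])
print(predicted)
```

The loss shrinks as the weights are adjusted so that the target word receives more of the probability mass.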



##### Step 5: Testing new input with the trained Word2Vec model
- INPUT: new context words
- OUTPUT: predicted target word

##### Example of CBOW Input Calculation
<b>Given</b>:



Word Embeddings (input):
|Word|Random Vector|
|-|-|
|cat|[0.1, 0.2, 0.3]|
|dog|[0.4, 0.5, 0.6]|
|bird|[0.7, 0.8, 0.9]|
|tree|[0.1, 0.2, 0.3]|
|flower|[0.4, 0.5, 0.6]|

<br>

|Parameters|Value|
|-|-|
|Vocabulary|`{cat, dog, bird, tree, flower}`|
|Window Size|2|
|Target Word|"dog"|
|Input Weight Matrix (W_in)|[[0.2, 0.3], [0.4, 0.5], [0.6, 0.7]]|

##### Steps:

- Create a window: The window around "dog" is ["cat", "dog", "bird"]. The context words are "cat" and "bird", since the target itself is not part of the CBOW input.
- Represent words as vectors: The vectors for "cat" and "bird" are given in the table above.
- Calculate the average context vector: ([0.1+0.7]/2, [0.2+0.8]/2, [0.3+0.9]/2) = [0.4, 0.5, 0.6].
- Multiply by the input weights: [0.4, 0.5, 0.6] · W_in = [0.4·0.2 + 0.5·0.4 + 0.6·0.6, 0.4·0.3 + 0.5·0.5 + 0.6·0.7] = [0.64, 0.79].
- Apply activation function: Assuming a sigmoid activation function (standard Word2Vec keeps the hidden layer linear, so this step is specific to this example):
    - sigmoid(0.64) ≈ 0.65
    - sigmoid(0.79) ≈ 0.69

- Result: The hidden-layer output for the target word "dog" is approximately [0.65, 0.69]. These values are passed on to the next layer of the CBOW model.
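
The arithmetic of this example can be recomputed with NumPy, starting from the table's vectors for the two neighbours of "dog" and the given `W_in`. With these inputs the matrix product evaluates to [0.64, 0.79] and the sigmoid to roughly [0.65, 0.69]. The variable names are chosen here for illustration.

```python
import numpy as np

# Vectors of the two context words around the target "dog"
vectors = {
    "cat":  np.array([0.1, 0.2, 0.3]),
    "bird": np.array([0.7, 0.8, 0.9]),
}
W_in = np.array([[0.2, 0.3],
                 [0.4, 0.5],
                 [0.6, 0.7]])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

context_avg = np.mean([vectors["cat"], vectors["bird"]], axis=0)  # average context vector
hidden_pre = context_avg @ W_in                                   # multiply by the input weights
hidden = sigmoid(hidden_pre)                                      # sigmoid, as assumed in the example

print(context_avg)
print(hidden_pre)
print(np.round(hidden, 2))
```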