Here are short notes on **Word2Vec** 

---

```markdown
# 📌 Word2Vec - Quick Notes

## 🔹 What is Word2Vec?
Word2Vec is a word embedding technique based on a shallow neural network. It transforms words into dense vector representations that capture their semantic meaning.

## 🔹 Types of Word2Vec Models
1. **CBOW (Continuous Bag of Words)**  
   - Predicts a target word based on its surrounding context words.  
   - Faster to train; tends to give slightly better accuracy for frequent words.  

2. **Skip-Gram**  
   - Predicts surrounding words based on a given target word.  
   - Slower to train, but works well with smaller datasets and represents rare words better.  

## 🔹 How Does Word2Vec Work?
- Uses a **shallow neural network** with one hidden layer.  
- Converts words into **dense vectors** based on co-occurrence relationships.  
- Word vectors are learned by optimizing the network through **backpropagation**.  

## 🔹 Key Parameters
- `vector_size`: Number of dimensions in word embeddings.  
- `window`: Number of words considered around a target word.  
- `min_count`: Ignores words that appear fewer times than this threshold.  
- `workers`: Number of CPU threads used in training.  

## 🔹 Example: Training a Word2Vec Model in Python
```python
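import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's Punkt tokenizer data; recent NLTK releases
# may ask for "punkt_tab" instead of "punkt".
nltk.download("punkt", quiet=True)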

sentences = ["Word embeddings capture meaning", "NLP is amazing", "Deep learning improves AI"]
tokenized_sentences = [word_tokenize(sent.lower()) for sent in sentences]

# Train Word2Vec model
model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get word vector
print(model.wv["learning"])
```

## 🔹 Advantages of Word2Vec
✅ Captures semantic relationships (e.g., "king - man + woman ≈ queen")  
✅ Produces dense, meaningful vector representations  
✅ Efficient to train and far more compact than sparse representations such as one-hot or bag-of-words vectors  

## 🔹 Limitations
❌ Struggles with **out-of-vocabulary (OOV) words**  
❌ Ignores **word order and sentence structure**  
❌ Not **context-aware** (unlike BERT/GPT)  

## 🔹 Applications
- Sentiment Analysis  
- Machine Translation  
- Chatbots & Virtual Assistants  
- Document Clustering  

🔹 **Alternatives**: FastText (handles OOV words via subword n-grams), GloVe (learns embeddings from global co-occurrence counts), BERT (contextual embeddings).  
```
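
The difference between the two model types in the notes above comes down to how training examples are framed. The sketch below is a simplified illustration only: the helper `training_pairs` and the `window=2` setting are assumptions made for this example, and real Word2Vec training adds further machinery (such as negative sampling) that is not shown here.

```python
def training_pairs(tokens, window=2):
    """Yield (context_words, target_word) for each position in a sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target


tokens = ["the", "cat", "sat", "on", "the", "mat"]

# CBOW: all context words together are used to predict the target word.
cbow_examples = [(context, target) for context, target in training_pairs(tokens)]

# Skip-gram: the target word is used to predict each context word separately.
skipgram_examples = [
    (target, ctx) for context, target in training_pairs(tokens) for ctx in context
]

print(cbow_examples[2])       # (['the', 'cat', 'on', 'the'], 'sat')
print(skipgram_examples[:3])  # [('the', 'cat'), ('the', 'sat'), ('cat', 'the')]
```

Skip-gram produces one example per (target, context) pair, which is one reason it is slower to train than CBOW on the same text.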



In [1]:
!pip install gensim nltk pandas





In [6]:

import pandas as pd

# Sample dataset
text = {
    "Sentence_ID": [1, 2, 3, 4, 5],
    "Text": [
        "The cat sat on the mat",
        "The dog barked at the stranger",
        "A man walked into the park with his dog",
        "She enjoys reading books in the library",
        "The sun is shining brightly in the sky"
    ]
}

# Convert to DataFrame
df = pd.DataFrame(text)

# Save as CSV
df.to_csv("word2vec_data.csv", index=False)

print("CSV file 'word2vec_data.csv' created successfully!")


CSV file 'word2vec_data.csv' created successfully!


In [7]:
df = pd.read_csv("word2vec_data.csv")

In [9]:
df.head()

   Sentence_ID                                     Text
0            1                   The cat sat on the mat
1            2           The dog barked at the stranger
2            3  A man walked into the park with his dog
3            4  She enjoys reading books in the library
4            5   The sun is shining brightly in the sky


In [16]:
df.columns 



Index(['Sentence_ID', 'Text'], dtype='object')

In [None]:
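import nltk
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

# word_tokenize needs NLTK's Punkt tokenizer data; recent NLTK releases
# may ask for "punkt_tab" instead of "punkt".
nltk.download("punkt", quiet=True)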


In [18]:
df["Tokenized_Text"] = df["Text"].apply(lambda x: word_tokenize(x.lower()))
print(df[["Text", "Tokenized_Text"]])


                                      Text  \
0                   The cat sat on the mat   
1           The dog barked at the stranger   
2  A man walked into the park with his dog   
3  She enjoys reading books in the library   
4   The sun is shining brightly in the sky   

                                      Tokenized_Text  
0                      [the, cat, sat, on, the, mat]  
1              [the, dog, barked, at, the, stranger]  
2  [a, man, walked, into, the, park, with, his, dog]  
3    [she, enjoys, reading, books, in, the, library]  
4    [the, sun, is, shining, brightly, in, the, sky]  


In [23]:
tokenized_sentences = df["Tokenized_Text"].tolist()

In [28]:
w2v = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=1)


w2v.save("w2v.model")
print("Word2Vec model trained successfully!")


Word2Vec model trained successfully!


In [29]:
w2v = Word2Vec.load("w2v.model")


In [30]:
word_vector = w2v.wv["cat"]
print("Vector for 'cat': \n", word_vector)


Vector for 'cat': 
 [-0.00713902  0.00124103 -0.00717672 -0.00224462  0.0037193   0.00583312
  0.00119818  0.00210273 -0.00411039  0.00722533 -0.00630704  0.00464722
 -0.00821997  0.00203647 -0.00497705 -0.00424769 -0.00310898  0.00565521
  0.0057984  -0.00497465  0.00077333 -0.00849578  0.00780981  0.00925729
 -0.00274233  0.00080022  0.00074665  0.00547788 -0.00860608  0.00058446
  0.00686942  0.00223159  0.00112468 -0.00932216  0.00848237 -0.00626413
 -0.00299237  0.00349379 -0.00077263  0.00141129  0.00178199 -0.0068289
 -0.00972481  0.00904058  0.00619805 -0.00691293  0.00340348  0.00020606
  0.00475375 -0.00711994  0.00402695  0.00434743  0.00995737 -0.00447374
 -0.00138926 -0.00731732 -0.00969783 -0.00908026 -0.00102275 -0.00650329
  0.00484973 -0.00616403  0.00251919  0.00073944 -0.00339215 -0.00097922
  0.00997913  0.00914589 -0.00446183  0.00908303 -0.00564176  0.00593092
 -0.00309722  0.00343175  0.00301723  0.00690046 -0.00237388  0.00877504
  0.00758943 -0.00954765 -0.0080

In [31]:
print(w2v.wv.most_similar("dog"))


[('barked', 0.19900402426719666), ('a', 0.1727856993675232), ('cat', 0.17032110691070557), ('she', 0.1528787463903427), ('reading', 0.1485336273908615), ('on', 0.14597296714782715), ('sat', 0.06404566019773483), ('park', 0.05354084074497223), ('man', 0.047006841748952866), ('shining', 0.013658874668180943)]


🔹 Step 7: Find Similarity Between Words

In [32]:
similarity = w2v.wv.similarity("dog", "cat")
print("Similarity between 'dog' and 'cat':", similarity)


Similarity between 'dog' and 'cat': 0.1703211
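
The `similarity("dog", "cat")` value above (0.1703211) matches the `cat` score returned by `most_similar("dog")`, which is consistent with both calls measuring cosine similarity between word vectors. The sketch below shows that computation in plain Python. The function names and the 3-dimensional `toy_vectors` are hypothetical, invented for illustration; they are not taken from the trained model. With only five training sentences, the real scores above are too noisy to interpret.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "dog": [1.0, 0.0, 0.0],
    "cat": [1.0, 1.0, 0.0],
    "sun": [0.0, 0.0, 1.0],
}

# "cat" ranks first because its vector points in a direction closer to "dog".
print(most_similar("dog", toy_vectors))
```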


### **Explanation of Word2Vec Model Parameters**
```python
w2v = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=1)
```
This line initializes and trains a **Word2Vec** model using `gensim`. Let's break down each parameter:

---

### **1️⃣ `sentences=tokenized_sentences`**
- **What it does**: 
  - Takes **preprocessed tokenized sentences** as input.
  - Each sentence is a **list of words** (tokens).
  
- **Example:**
  ```python
  tokenized_sentences = [
      ['the', 'cat', 'sat', 'on', 'the', 'mat'],
      ['the', 'dog', 'barked', 'at', 'the', 'stranger']
  ]
  ```
  Here, `Word2Vec` learns relationships between words based on their **context in these sentences**.
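
A common pitfall is passing a list of raw strings instead of a list of token lists. A string is itself iterable, character by character, so the "words" seen by the model would be single characters. The sketch below illustrates the difference; `is_tokenized` is a hypothetical helper written for this example, and a whitespace split stands in for `word_tokenize` to keep the example dependency-free.

```python
raw_sentences = ["The cat sat on the mat", "The dog barked at the stranger"]

# Iterating over a plain string yields characters, not words.
print(list(raw_sentences[0])[:5])   # ['T', 'h', 'e', ' ', 'c']

# Whitespace split used here as a stand-in for word_tokenize (assumption).
tokenized = [sentence.lower().split() for sentence in raw_sentences]


def is_tokenized(sentences):
    """True if every sentence is a list of string tokens."""
    return all(
        isinstance(sentence, list) and all(isinstance(tok, str) for tok in sentence)
        for sentence in sentences
    )


print(is_tokenized(raw_sentences))  # False
print(is_tokenized(tokenized))      # True
```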

---

### **2️⃣ `vector_size=100`**
- **What it does**:
  - Sets the **number of dimensions** of the word embeddings.
  - Each word is represented as a **100-dimensional vector**.

- **Why 100?**
  - Higher dimensions can capture finer **semantic distinctions**, but need more training data to be useful.
  - Too large = computationally expensive.
  - Too small = may lose important relationships.
  - Common values: **50, 100, 200, 300**.

- **Example:**
  ```python
  word_vector = w2v.wv["cat"]
  print(word_vector.shape)  # Output: (100,)
  ```
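
To make the cost side of this trade-off concrete, the sketch below estimates the size of the word-vector matrix alone, which grows linearly with both vocabulary size and `vector_size`. It assumes 4 bytes per value (32-bit floats); the helper `matrix_bytes` and the 100,000-word vocabulary are hypothetical. Training keeps additional weights, so treat the numbers as a lower bound.

```python
# The five sample sentences from this notebook, lowercased.
sentences = [
    "the cat sat on the mat",
    "the dog barked at the stranger",
    "a man walked into the park with his dog",
    "she enjoys reading books in the library",
    "the sun is shining brightly in the sky",
]

vocab = {token for sentence in sentences for token in sentence.split()}
vocab_size = len(vocab)


def matrix_bytes(vocab_size, vector_size, bytes_per_value=4):
    """Approximate size of a vocab_size x vector_size float32 matrix."""
    return vocab_size * vector_size * bytes_per_value


print(vocab_size)                      # 27
print(matrix_bytes(vocab_size, 100))   # 10800
print(matrix_bytes(100_000, 300))      # 120000000 (about 120 MB)
```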

---

### **3️⃣ `window=5`**
- **What it does**:
  - Defines the **maximum distance** between the target word and neighboring words.
  - If `window=5`, Word2Vec considers **up to 5 words before and after** the target word for training.

- **Example:**
  ```
  Sentence:  "The dog barked at the stranger"
  
  For "barked", the context words (window=5) are:
  ['The', 'dog', 'at', 'the', 'stranger']
  ```

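- **Smaller `window` (e.g., 2)** = more focused relationships; similar words tend to be ones that can replace each other in a sentence.  
- **Larger `window` (e.g., 10)** = more generalized meaning; similar words tend to be ones that share a topic.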
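
The window logic described above can be sketched in a few lines. `context_window` is a hypothetical helper written for this example, not part of gensim. Note that gensim's implementation may additionally shrink the window at random for each target word, so `window` is best read as the maximum span.

```python
def context_window(tokens, index, window):
    """Return up to `window` tokens on each side of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


tokens = ["the", "dog", "barked", "at", "the", "stranger"]
target = tokens.index("barked")

print(context_window(tokens, target, window=5))
# ['the', 'dog', 'at', 'the', 'stranger']
print(context_window(tokens, target, window=1))
# ['dog', 'at']
```

Because the sentence has only six tokens, `window=5` already covers every other word; on longer sentences the window would cut the context off.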

---

### **4️⃣ `min_count=1`**
- **What it does**:
  - Ignores words that appear **fewer than `min_count` times** in the dataset.
  - If `min_count=1`, **all words** are considered.
  - If `min_count=5`, only words appearing **at least 5 times** are included.

- **Why use `min_count`?**
  - **Filters rare words** that may be **noise**.
  - Improves **training efficiency**.

- **Example:**
  ```python
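  # Note: on a tiny corpus where no word reaches 5 occurrences, the vocabulary
  # ends up empty and training fails with an error.
  w2v = Word2Vec(sentences=tokenized_sentences, min_count=5)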
  ```
  - Here, words appearing **fewer than 5 times** are ignored.
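
The effect of `min_count` on this notebook's five sample sentences can be previewed with a plain frequency count before training. This is a minimal sketch of the filtering idea, not gensim's internal code; `kept_vocabulary` is a hypothetical helper.

```python
from collections import Counter

# The five sample sentences from this notebook, lowercased and split on spaces.
sentences = [
    "the cat sat on the mat".split(),
    "the dog barked at the stranger".split(),
    "a man walked into the park with his dog".split(),
    "she enjoys reading books in the library".split(),
    "the sun is shining brightly in the sky".split(),
]

counts = Counter(token for sentence in sentences for token in sentence)


def kept_vocabulary(counts, min_count):
    """Words that occur at least `min_count` times."""
    return {word for word, n in counts.items() if n >= min_count}


print(len(kept_vocabulary(counts, 1)))     # 27
print(sorted(kept_vocabulary(counts, 2)))  # ['dog', 'in', 'the']
print(sorted(kept_vocabulary(counts, 5)))  # ['the']
```

On a corpus this small, any `min_count` above 1 removes almost the entire vocabulary, which is why the notebook trains with `min_count=1`.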

---

### **5️⃣ `workers=1`**
- **What it does**:
  - Defines the **number of CPU threads** used for training.
  - If `workers=4`, training uses **up to 4 worker threads**, which can run on separate CPU cores to speed up training.

- **Why use multiple workers?**
  - Speeds up training on large datasets.
  - On small datasets, `workers=1` is fine.

- **Example:**
  ```python
  import multiprocessing
  num_workers = multiprocessing.cpu_count()  # Get number of CPU cores
  w2v = Word2Vec(sentences=tokenized_sentences, workers=num_workers)
  ```
  - This automatically sets `workers` to **use all available CPU cores**.

---


# Word2Vec Model Parameters

### 🔹 `sentences=tokenized_sentences`
- Input tokenized sentences (list of words).
- Helps the model learn **word relationships**.

### 🔹 `vector_size=100`
- Defines **word embedding dimensions**.
- More dimensions can capture finer distinctions, but need more data and compute.
- Common values: **50, 100, 200, 300**.

### 🔹 `window=5`
- Sets the maximum number of words before & after the target word that are considered.
- **Smaller `window`** = focused meaning.  
- **Larger `window`** = generalized meaning.

### 🔹 `min_count=1`
- Filters out words appearing **fewer than `min_count` times**.
- **Higher `min_count`** = removes rare words, speeds up training.

### 🔹 `workers=1`
- Number of CPU threads used for training.
- **More workers** = faster training on large corpora, given enough CPU cores.

---

### ✅ **Example**
```python
from gensim.models import Word2Vec

# Tokenized text data
tokenized_sentences = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'barked', 'at', 'the', 'stranger']
]

# Train Word2Vec model
w2v = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)

# Save model
w2v.save("word2vec_model.model")

# Get vector of 'cat'
print(w2v.wv["cat"])
```
- ✅ **Trains Word2Vec** on example sentences.
- ✅ **Saves the trained model**.
- ✅ **Retrieves vector representation** of "cat".

