### Understanding GloVe and the IMDb Dataset

GloVe (Global Vectors for Word Representation) is a word embedding model that maps each word to a dense vector, so that semantic relationships between words are reflected in the geometry of the vector space. Instead of relying only on pre-trained embeddings, we will train our own GloVe embeddings on the IMDb dataset; the last part of the notebook shows how to load pre-trained GloVe vectors with Gensim.

In this notebook, we will:
1. Use Hugging Face's `datasets` library to load the IMDb dataset.
2. Preprocess the dataset by tokenizing sentences.
3. Build a co-occurrence matrix.
4. Train GloVe embeddings from scratch.
5. Compute average GloVe embeddings for IMDb reviews.
6. Train a simple classifier using the embeddings.

The IMDb dataset is a binary sentiment classification dataset containing 25,000 training and 25,000 test reviews. Each review is labeled as positive or negative.

In [None]:
!pip install datasets


In [None]:
import datasets

train_data, test_data = datasets.load_dataset("imdb", split=["train", "test"])


In [None]:
# Each sample is a dict with 'text' and 'label' keys
for sample in train_data:
  print(sample['text'])

In [None]:
sample


In [None]:
# Slice the first 10 rows, then select the 'text' column (a list of 10 strings)
train_text = train_data[:10]['text']

### Tokenization

Before computing GloVe embeddings, we need to preprocess the text data:
1. Tokenize the reviews into words.
2. Build a vocabulary from the tokenized data.
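
The two steps above can be sketched as follows. This is a minimal illustration rather than the notebook's own implementation: the helper names `tokenize` and `build_vocab`, the regular expression, and the `min_count` parameter are assumptions made for the example. IMDb reviews contain HTML line breaks (`<br />`), so the sketch removes them before splitting.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase, drop HTML line breaks, and split into word tokens."""
    text = re.sub(r"<br\s*/?>", " ", text.lower())
    return re.findall(r"[a-z0-9']+", text)

def build_vocab(tokenized_docs, min_count=1):
    """Map each token seen at least `min_count` times to an integer id.

    Ids are assigned in order of decreasing frequency.
    """
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    kept = [tok for tok, c in counts.most_common() if c >= min_count]
    return {tok: idx for idx, tok in enumerate(kept)}

# Toy reviews standing in for the IMDb text
reviews = [
    "A great movie.<br /><br />Great acting!",
    "Not a great plot.",
]
tokenized = [tokenize(r) for r in reviews]
vocab = build_vocab(tokenized)
print(tokenized[0])
print(vocab)
```

Raising `min_count` drops rare tokens, which keeps the co-occurrence matrix smaller on the full dataset.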

### Co-occurrence Matrix

To train GloVe embeddings, we need to build a co-occurrence matrix. This matrix $ X $ captures the frequency with which words co-occur within a defined context window.

$$
X_{ij} = \text{Number of times } w_i \text{ and } w_j \text{ appear within the context window}
$$

The context window can be set to a specific number of words before and after the target word.
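
A minimal sketch of this counting step, assuming the reviews are already tokenized and a `vocab` dictionary maps tokens to integer ids. The function name `build_cooccurrence` and the sparse dictionary representation are choices made for this example. It uses plain symmetric counts, matching the definition of $ X_{ij} $ above.

```python
from collections import defaultdict

def build_cooccurrence(tokenized_docs, vocab, window_size=2):
    """Count how often word ids co-occur within `window_size` positions.

    Returns a sparse dict {(i, j): count}; counts are symmetric.
    """
    cooc = defaultdict(float)
    for doc in tokenized_docs:
        ids = [vocab[t] for t in doc if t in vocab]  # skip out-of-vocabulary tokens
        for pos, center in enumerate(ids):
            start = max(0, pos - window_size)
            # Look only at the left context and add both directions,
            # so each unordered pair of positions is counted once.
            for ctx_pos in range(start, pos):
                context = ids[ctx_pos]
                cooc[(center, context)] += 1.0
                cooc[(context, center)] += 1.0
    return dict(cooc)

docs = [["the", "movie", "was", "great"], ["the", "plot", "was", "great"]]
vocab = {"the": 0, "movie": 1, "was": 2, "great": 3, "plot": 4}
X = build_cooccurrence(docs, vocab, window_size=2)
print(X[(vocab["was"], vocab["great"])])
```

The original GloVe implementation additionally weights each co-occurrence by the inverse of the distance between the two words; that variant would replace `1.0` with `1.0 / (pos - ctx_pos)`.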

### The Mathematics Behind GloVe

The GloVe algorithm leverages the co-occurrence matrix of words to compute their embeddings. For two words $ w_i $ and $ w_j $, their co-occurrence count is represented as $ X_{ij} $. The GloVe model optimizes the following objective function:

$$
J = \sum_{i,j} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2
$$

Where:
- $ w_i $ and $ \tilde{w}_j $ are word and context embeddings.
- $ b_i $ and $ \tilde{b}_j $ are biases.
- $ f(X_{ij}) $ is a weighting function to balance the contribution of frequent and infrequent co-occurrences.

The result is a vector space where semantic relationships between words are preserved.
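
To make the objective concrete, the sketch below evaluates $ J $ with NumPy over the nonzero entries of $ X $ (zero counts are skipped because $ \log(0) $ is undefined and they receive zero weight). The function name `glove_loss` and the sparse `rows`/`cols`/`counts` layout are assumptions for this example; the weighting uses the form described in the Weighting Function section with assumed defaults `x_max=100` and `alpha=0.75`.

```python
import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, rows, cols, counts, x_max=100.0, alpha=0.75):
    """Weighted least-squares GloVe objective over nonzero co-occurrences.

    rows[k], cols[k], counts[k] describe one nonzero entry X_ij.
    """
    weights = np.minimum(1.0, (counts / x_max) ** alpha)   # f(X_ij)
    dots = np.sum(W[rows] * W_tilde[cols], axis=1)         # w_i^T w~_j
    errors = dots + b[rows] + b_tilde[cols] - np.log(counts)
    return float(np.sum(weights * errors ** 2))

# Toy setup: 3 words, 2-dimensional embeddings, all parameters zero
rows = np.array([0, 1, 2])
cols = np.array([1, 2, 0])
counts = np.array([1.0, 100.0, 200.0])
W = np.zeros((3, 2))
W_tilde = np.zeros((3, 2))
b = np.zeros(3)
b_tilde = np.zeros(3)

loss = glove_loss(W, W_tilde, b, b_tilde, rows, cols, counts)
print(loss)
```

With all parameters at zero, each error term reduces to $ -\log(X_{ij}) $, so the pair with count 1 contributes nothing and the loss is driven by the two frequent pairs.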



### Weighting Function
The weighting function $ f(X_{ij}) $ controls the influence of each co-occurrence pair based on its frequency:

$$
f(X_{ij}) =
\begin{cases}
\left( \frac{X_{ij}}{x_{\text{max}}} \right)^\alpha & \text{if } X_{ij} < x_{\text{max}} \\
1 & \text{if } X_{ij} \geq x_{\text{max}}
\end{cases}
$$

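- $ x_{\text{max}} $: A threshold beyond which the weighting function is capped at 1 (the original GloVe paper uses $ x_{\text{max}} = 100 $).
- $ \alpha $: A hyperparameter that controls how quickly the weight grows with the count (usually $ \alpha = 0.75 $).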
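
The piecewise definition translates directly into code. The sketch below is illustrative only; the name `glove_weight` and the default `x_max=100.0` are assumptions for this example.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(X_ij): grows as (x / x_max)^alpha, capped at 1."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0

for count in (1, 10, 50, 100, 500):
    print(count, round(glove_weight(count), 4))
```

Rare pairs receive small weights, so noisy low counts contribute little to the loss, while very frequent pairs are capped and cannot dominate it.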

### Gradient Update
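The model is optimized with gradient descent: for each nonzero entry $ X_{ij} $, the gradient of the weighted squared error is used to update $ w_i $, $ \tilde{w}_j $, $ b_i $, and $ \tilde{b}_j $. The original GloVe implementation uses AdaGrad, a variant that adapts the learning rate per parameter.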
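
A minimal training sketch under stated assumptions: it takes a sparse co-occurrence dictionary `{(i, j): count}` and applies AdaGrad updates one pair at a time. Writing $ e_{ij} = w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij}) $, the gradients are $ 2 f(X_{ij}) e_{ij} \tilde{w}_j $ for $ w_i $, $ 2 f(X_{ij}) e_{ij} w_i $ for $ \tilde{w}_j $, and $ 2 f(X_{ij}) e_{ij} $ for each bias. The function name `train_glove`, the hyperparameter defaults, and the toy counts are all choices made for this example.

```python
import numpy as np

def train_glove(cooc, vocab_size, dim=8, epochs=50, lr=0.05,
                x_max=100.0, alpha=0.75, seed=0):
    """Fit GloVe parameters with per-pair AdaGrad updates (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-0.5, 0.5, (vocab_size, dim)) / dim
    W_tilde = rng.uniform(-0.5, 0.5, (vocab_size, dim)) / dim
    b = np.zeros(vocab_size)
    b_tilde = np.zeros(vocab_size)
    # AdaGrad accumulators of squared gradients (start at 1 to avoid dividing by 0)
    gW, gWt = np.ones_like(W), np.ones_like(W_tilde)
    gb, gbt = np.ones_like(b), np.ones_like(b_tilde)

    pairs = list(cooc.items())
    history = []
    for _ in range(epochs):
        total = 0.0
        for k in rng.permutation(len(pairs)):  # visit pairs in random order
            (i, j), x = pairs[k]
            weight = min(1.0, (x / x_max) ** alpha)
            err = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(x)
            total += weight * err ** 2
            common = 2.0 * weight * err        # shared factor of all four gradients
            grad_w = common * W_tilde[j]
            grad_wt = common * W[i]
            # Scale each step by the root of the accumulated squared gradients
            W[i] -= lr * grad_w / np.sqrt(gW[i])
            W_tilde[j] -= lr * grad_wt / np.sqrt(gWt[j])
            b[i] -= lr * common / np.sqrt(gb[i])
            b_tilde[j] -= lr * common / np.sqrt(gbt[j])
            gW[i] += grad_w ** 2
            gWt[j] += grad_wt ** 2
            gb[i] += common ** 2
            gbt[j] += common ** 2
        history.append(total)
    return W, W_tilde, b, b_tilde, history

# Toy symmetric co-occurrence counts for a 3-word vocabulary
toy_cooc = {(0, 1): 4.0, (1, 0): 4.0, (0, 2): 1.0,
            (2, 0): 1.0, (1, 2): 2.0, (2, 1): 2.0}
W, W_tilde, b, b_tilde, history = train_glove(toy_cooc, vocab_size=3)
print(history[0], history[-1])
```

On the full IMDb co-occurrence matrix a per-pair Python loop like this would be slow; it is meant to show the update rule, not to be an efficient trainer.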


### Final Word Representation
After training, the final word embedding for a word $ i $ is computed as:

$$
v_i = (w_i + \tilde{w}_i)/2
$$

This combines the word and context embeddings to create the final word vector.
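
In NumPy this is a single element-wise operation on the two trained matrices. The arrays below are random stand-ins, since the names `W` and `W_tilde` and their shapes are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 4))        # stand-in for trained word vectors
W_tilde = rng.normal(size=(5, 4))  # stand-in for trained context vectors

# v_i = (w_i + w~_i) / 2 for every word i at once
embeddings = (W + W_tilde) / 2.0
print(embeddings.shape)
```

The original GloVe paper uses the sum $ w_i + \tilde{w}_i $ instead; the two differ only by a constant factor, so cosine similarities between words are unaffected.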



### Average Word Embedding

To represent a review, we compute the average of GloVe vectors for all words in the review. If a word is not in the vocabulary, we skip it.
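
A minimal sketch of this averaging step, assuming a `vocab` dictionary from tokens to row indices and an `embeddings` matrix with one row per word. The function name `average_embedding` and the zero-vector fallback for reviews with no known words are choices made for this example.

```python
import numpy as np

def average_embedding(tokens, vocab, embeddings):
    """Mean of the embedding rows for in-vocabulary tokens.

    Unknown tokens are skipped; a review with no known tokens maps to zeros.
    """
    ids = [vocab[t] for t in tokens if t in vocab]
    if not ids:
        return np.zeros(embeddings.shape[1])
    return embeddings[ids].mean(axis=0)

# Toy vocabulary and 2-dimensional embeddings
vocab = {"great": 0, "movie": 1, "bad": 2}
embeddings = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [-1.0, 0.0]])

review_vector = average_embedding(["great", "movie", "unseen"], vocab, embeddings)
print(review_vector)
```

Stacking one such vector per review gives a fixed-size feature matrix that a simple classifier can be trained on.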

# Loading GloVe Model with Gensim and Working with Word and Sentence Embeddings

The following steps demonstrate how to load a pre-trained GloVe model using Gensim, extract word embeddings, compute similarity between words, and create sentence embeddings:

## 1. Load the GloVe Model
To use the GloVe embeddings, download the pre-trained GloVe vectors from [GloVe's official website](https://nlp.stanford.edu/projects/glove/). Convert the GloVe file into a Gensim-compatible format and load it into the model.

### Code Example
```python
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Path to the GloVe file (e.g., 'glove.6B.100d.txt')
glove_file = 'glove.6B.100d.txt'

# Convert the GloVe file to Word2Vec format (adds the header line Gensim expects).
# With Gensim 4.x you can skip this step and instead call
# KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True).
word2vec_output_file = 'glove.6B.100d.word2vec.txt'
glove2word2vec(glove_file, word2vec_output_file)

# Load the model
model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)
```



## 2. Extract Word Embeddings
Once the model is loaded, you can retrieve embeddings for any word in the vocabulary.

### Example
```python
word_vector = model['king']  # Get the vector for the word "king"
print(word_vector)           # Prints the embedding as a numpy array
```



## 3. Compute Similarity Between Words
You can use the GloVe model to compute semantic similarity between two words.

### Example
```python
similarity = model.similarity('king', 'queen')  # Compute similarity between "king" and "queen"
print(f"Similarity between 'king' and 'queen': {similarity}")
```
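
Gensim's `similarity` returns the cosine similarity of the two word vectors. The sketch below computes the same quantity by hand on toy vectors so that it runs without downloading the model; the helper name `cosine_similarity` and the example vectors are assumptions for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: u.v / (|u| |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
same_direction = 2.0 * a                 # same direction, different length
orthogonal = np.array([3.0, 0.0, -1.0])  # dot product with a is zero

print(cosine_similarity(a, same_direction))
print(cosine_similarity(a, orthogonal))
```

Because the measure ignores vector length, it compares the direction of two embeddings rather than their magnitude.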



## 4. Generate Sentence Embeddings
Sentence embeddings can be created by averaging the word vectors for all words in the sentence.

### Example
```python
import re

import numpy as np

def get_sentence_embedding(sentence, model):
    # Lowercase and drop punctuation so tokens such as "The" and "kingdom."
    # match the lowercase glove.6B vocabulary
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    word_vectors = [model[word] for word in words if word in model.key_to_index]
    if not word_vectors:  # Handle case where no words are in the model's vocabulary
        return np.zeros(model.vector_size)
    return np.mean(word_vectors, axis=0)

sentence = "The king and queen ruled the kingdom."
sentence_embedding = get_sentence_embedding(sentence, model)
print(sentence_embedding)
```
