### Understanding GloVe and the IMDb Dataset

GloVe (Global Vectors for Word Representation) is a word embedding model that maps each word to a dense vector learned from corpus-wide word co-occurrence statistics, so that words used in similar contexts receive similar vectors. Instead of using pre-trained embeddings, we will train our own GloVe embeddings on the IMDb dataset.

In this notebook, we will:
1. Use Hugging Face's `datasets` library to load the IMDb dataset.
2. Preprocess the dataset by tokenizing sentences.
3. Build a co-occurrence matrix.
4. Train GloVe embeddings from scratch.
5. Compute average GloVe embeddings for IMDb reviews.
6. Train a simple classifier using the embeddings.

The IMDb dataset is a binary sentiment classification dataset containing 25,000 training and 25,000 test reviews. Each review is labeled as positive or negative.

### Tokenization

Before computing GloVe embeddings, we need to preprocess the text data:
1. Tokenize the reviews into words.
2. Build a vocabulary from the tokenized data.
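
The two steps above can be sketched as follows. This is a minimal illustration, not the notebook's definitive preprocessing: the helper names `tokenize` and `build_vocab`, the regex-based word splitting, the removal of `<br />` markup (which appears in raw IMDb reviews), and the `min_count` threshold are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text, drop HTML line breaks, and split into word tokens."""
    text = text.lower().replace("<br />", " ")
    # Assumption: a token is a run of letters, digits, or apostrophes.
    return re.findall(r"[a-z0-9']+", text)

def build_vocab(tokenized_reviews, min_count=1):
    """Map each word seen at least `min_count` times to an integer index."""
    counts = Counter(tok for review in tokenized_reviews for tok in review)
    # Most frequent words get the smallest indices.
    words = [w for w, c in counts.most_common() if c >= min_count]
    return {w: i for i, w in enumerate(words)}

# Toy reviews standing in for the IMDb text.
reviews = ["A great movie!<br />Great cast.", "Not a great plot."]
tokenized = [tokenize(r) for r in reviews]
vocab = build_vocab(tokenized)
```

On the full dataset, raising `min_count` is a common way to keep the vocabulary, and therefore the co-occurrence matrix, at a manageable size.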

### Co-occurrence Matrix

To train GloVe embeddings, we need to build a co-occurrence matrix. This matrix $ X $ captures the frequency with which words co-occur within a defined context window.

$$
X_{ij} = \text{Number of times } w_i \text{ and } w_j \text{ appear within the context window}
$$

The context window is a fixed number of words on each side of the target word. With a symmetric window like this, $ X_{ij} = X_{ji} $, so the matrix is symmetric.
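
A minimal sketch of this counting step is shown below. It stores only the nonzero entries in a dictionary keyed by `(i, j)` index pairs, since a dense vocabulary-by-vocabulary matrix is mostly zeros. The function name `build_cooccurrence`, the `window_size` parameter, and the choice to drop out-of-vocabulary tokens before windowing are assumptions for illustration. The sketch uses plain counts, as in the definition above; the reference GloVe implementation instead weights each pair by the inverse of the distance between the two words.

```python
from collections import defaultdict

def build_cooccurrence(tokenized_reviews, vocab, window_size=2):
    """Count symmetric co-occurrences within `window_size` words on each side."""
    cooc = defaultdict(float)
    for review in tokenized_reviews:
        # Assumption: out-of-vocabulary tokens are dropped before windowing.
        ids = [vocab[tok] for tok in review if tok in vocab]
        for pos, i in enumerate(ids):
            # Look only at the left context and record both directions,
            # so each pair of positions is visited exactly once.
            for ctx_pos in range(max(0, pos - window_size), pos):
                j = ids[ctx_pos]
                cooc[(i, j)] += 1.0
                cooc[(j, i)] += 1.0
    return dict(cooc)

# Toy data standing in for the tokenized IMDb reviews.
toy_vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
toy_reviews = [["the", "movie", "was", "great"]]
X = build_cooccurrence(toy_reviews, toy_vocab, window_size=1)
```

With `window_size=1`, only adjacent words are counted, so "the" and "was" do not co-occur in the toy review.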

### The Mathematics Behind GloVe

GloVe learns embeddings directly from the co-occurrence matrix. For the words at vocabulary indices $ i $ and $ j $, the co-occurrence count is $ X_{ij} $. The model minimizes the following weighted least-squares objective:

$$
J = \sum_{i,j \,:\, X_{ij} > 0} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2
$$

Where:
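- $ w_i $ and $ \tilde{w}_j $ are the embedding vectors of word $ i $ as a target word and of word $ j $ as a context word.
- $ b_i $ and $ \tilde{b}_j $ are the corresponding scalar biases.
- $ f(X_{ij}) $ is a weighting function that down-weights rare co-occurrences and caps the influence of very frequent ones. The original GloVe paper uses $ f(x) = (x / x_{\max})^{\alpha} $ for $ x < x_{\max} $ and $ f(x) = 1 $ otherwise, with $ x_{\max} = 100 $ and $ \alpha = 3/4 $.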

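Because the objective fits dot products of embeddings to log co-occurrence counts, words that appear in similar contexts end up with similar vectors, and the resulting vector space preserves semantic relationships between words.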
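
To make the objective concrete, here is a small dependency-free sketch that evaluates $ J $ and minimizes it on a toy set of counts. Several things here are assumptions rather than the notebook's method: the names `weight`, `glove_loss`, and `train_glove`, the hyperparameter defaults, and the use of plain stochastic gradient descent (the reference GloVe implementation uses AdaGrad). Plain Python lists keep the example short; at IMDb scale, NumPy arrays or a deep learning framework would be the more practical choice.

```python
import math
import random

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): grows as (x / x_max)^alpha, capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def glove_loss(cooc, W, W_ctx, b, b_ctx):
    """Evaluate J over the nonzero co-occurrence entries."""
    total = 0.0
    for (i, j), x in cooc.items():
        err = dot(W[i], W_ctx[j]) + b[i] + b_ctx[j] - math.log(x)
        total += weight(x) * err ** 2
    return total

def train_glove(cooc, vocab_size, dim=8, epochs=100, lr=0.05, seed=0):
    """Minimize J with plain SGD (an assumption; not the reference optimizer)."""
    rng = random.Random(seed)

    def init():
        # Small random vectors, one per vocabulary word.
        return [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]
                for _ in range(vocab_size)]

    W, W_ctx = init(), init()
    b, b_ctx = [0.0] * vocab_size, [0.0] * vocab_size
    entries = list(cooc.items())
    for _ in range(epochs):
        rng.shuffle(entries)
        for (i, j), x in entries:
            err = dot(W[i], W_ctx[j]) + b[i] + b_ctx[j] - math.log(x)
            g = 2.0 * weight(x) * err  # derivative of the term w.r.t. its prediction
            for k in range(dim):
                wi, wj = W[i][k], W_ctx[j][k]
                W[i][k] -= lr * g * wj
                W_ctx[j][k] -= lr * g * wi
            b[i] -= lr * g
            b_ctx[j] -= lr * g
    return W, W_ctx, b, b_ctx

# Toy counts standing in for the IMDb co-occurrence matrix.
toy_cooc = {(0, 1): 10.0, (1, 0): 10.0, (1, 2): 4.0, (2, 1): 4.0}
params = train_glove(toy_cooc, vocab_size=3)
print(glove_loss(toy_cooc, *params))
```

After training, a common choice is to use $ w_i + \tilde{w}_i $ as the final embedding of word $ i $, although using $ w_i $ alone also works.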

### Average Word Embedding

To represent a review as a single fixed-length vector, we average the GloVe vectors of all words in the review, skipping any word that is not in the vocabulary.
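
The averaging step might look like the sketch below. The function name `average_embedding` and its arguments are hypothetical, and returning a zero vector for a review with no in-vocabulary words is an assumption added so that every review still gets a vector of the same length.

```python
def average_embedding(tokens, vocab, embeddings, dim):
    """Average the vectors of in-vocabulary tokens; skip unknown words."""
    vectors = [embeddings[vocab[tok]] for tok in tokens if tok in vocab]
    if not vectors:
        # Assumption: fall back to a zero vector when no token is known.
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension across all vectors.
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy vocabulary and 2-dimensional embeddings for illustration.
toy_vocab = {"great": 0, "movie": 1}
toy_embeddings = [[1.0, 0.0], [0.0, 1.0]]
review_vector = average_embedding(["a", "great", "movie"], toy_vocab, toy_embeddings, dim=2)
```

Here "a" is not in the toy vocabulary, so the result is the mean of the vectors for "great" and "movie" only.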