# Random Indexing

**Random Indexing** is a way to represent words as vectors (similar in purpose to Word2Vec, but simpler and more efficient). It works in two steps:

1. Assign each word a random "index" vector whose entries are -1, 0, or 1.
2. Build each word's final vector by summing the index vectors of the words in its context (in this case, the words it appears together with in documents or tweets, so frequent co-occurrence contributes more).
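
As a rough illustration of the two steps, the sketch below uses a toy three-word vocabulary and two documents. The names `toy_vocab`, `toy_docs`, and `dim`, as well as the fixed seed, are made up for this example and are not part of the notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy data: 3 words and 2 "documents" given as lists of word ids.
toy_vocab = ["cat", "dog", "car"]
toy_docs = [[0, 1], [1, 2]]  # doc 0: cat, dog; doc 1: dog, car
dim = 8

# Step 1: one random index vector per word, entries drawn from {-1, 0, 1}.
index_vec = rng.choice([-1, 0, 1], size=(len(toy_vocab), dim))

# Step 2: each word accumulates the index vectors of every word that shares
# a document with it (the word itself is included in the sum).
context_vec = np.zeros((len(toy_vocab), dim))
for doc in toy_docs:
    doc_sum = index_vec[doc].sum(axis=0)
    for w in doc:
        context_vec[w] += doc_sum

print(context_vec.shape)
```

In this toy setup "dog" appears in both documents, so its vector mixes the index vectors of all three words, while "cat" and "car" only share "dog" as common context.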

In [1]:
# Import libraries
import nltk
import numpy as np
from nltk.tokenize import TweetTokenizer
from sklearn.manifold import TSNE

In [1]:
# 📌 This notebook assumes that corpus processing, tokenization, and BoW construction were already performed in the notebook:
# 👉 'feature-extraction/bag_of_words.ipynb'

# The variables used here (such as `BoW_tr`, `tr_txt`, `V1`, `dict_indices1`) were built there.
# If you want to re-run the pipeline from scratch, check that file first.

> 🔗 **Note:** The corpus loading, tokenization, and Bag-of-Words construction are in
> [`bag_of_words.ipynb`](./feature-extraction/bag_of_words.ipynb)

In [3]:
def random_index(BoW, vocab, dimension=100):
    """Builds Random Indexing word vectors from document-level co-occurrence.

    Args:
        BoW (np.ndarray): Dense Bag-of-Words matrix where each row represents a document and each column a term.
        vocab (list): Vocabulary list containing the terms corresponding to the columns of BoW.
        dimension (int, optional): Dimension of the random index vectors. Defaults to 100.

    Raises:
        ValueError: If the length of the vocabulary does not match the number of terms in BoW.

    Returns:
        np.ndarray: A matrix of shape (terms, dimension) where each row is a word vector normalized to unit length.
    """
    docs, terms = BoW.shape
    if len(vocab) != terms:
        raise ValueError("The vocabulary doesn't match in dimension.")

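    # One random index vector per term, with entries drawn from {-1, 0, 1}.
    index_vec = np.random.choice([-1, 0, 1], size=(terms, dimension))
    word_vec = np.zeros((terms, dimension))

    for doc in BoW:
        # Terms present in this document (only presence is used, not the counts).
        idx = np.nonzero(doc)[0]
        # Every term in the document receives the summed index vectors of all
        # terms in the document, itself included.
        word_vec[idx] += index_vec[idx].sum(axis=0)

    # Normalize each word vector to unit length; the epsilon avoids division by zero.
    norms = np.linalg.norm(word_vec, axis=1, keepdims=True)
    reduced_matrix = word_vec / (norms + 1e-10)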

    return reduced_matrix
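
Because the rows returned by `random_index` are normalized to unit length, the cosine similarity between two words reduces to a dot product. The sketch below shows one way this could be used to look up related words. The helper `most_similar` and the small matrix `toy_vectors` (standing in for the function's output) are hypothetical and only serve the illustration.

```python
import numpy as np

def most_similar(vectors, vocab, word, top_n=3):
    """Hypothetical helper: rank words by cosine similarity to `word`.

    Assumes every row of `vectors` already has unit length, so the
    dot product equals the cosine similarity.
    """
    i = vocab.index(word)
    sims = vectors @ vectors[i]   # similarity of every word to `word`
    order = np.argsort(-sims)     # indices sorted from most to least similar
    return [(vocab[j], float(sims[j])) for j in order if j != i][:top_n]

# Made-up unit-length vectors for three words.
toy_vocab = ["cat", "dog", "car"]
toy_vectors = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])

neighbours = most_similar(toy_vectors, toy_vocab, "cat", top_n=2)
print(neighbours)
```

With the real data, the same call would take the matrix returned by `random_index` (before any t-SNE projection) together with the vocabulary list passed to it.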


In [4]:
V = list(dict_indices1.keys())                               # vocabulary terms (keys of dict_indices1)
reduced_matrixRI = random_index(BoW_tr, V, dimension=300)    # 300-dimensional Random Indexing word vectors
tsne = TSNE(n_components=2, perplexity=30, random_state=42)  # project the word vectors to 2D for visualization
reduced_matrixRI = tsne.fit_transform(reduced_matrixRI)      # 2D coordinates, one row per word