In [None]:
%load_ext autoreload
%autoreload 2


In [156]:
import altair as alt
from embeddings_analysis import EmbeddingsLoader

from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import TSNE
from umap import UMAP

alt.data_transformers.disable_max_rows()
alt.renderers.set_embed_options(theme="dark")

RendererRegistry.enable('default')

# Numeric embedding analysis - OLMo-2-1124-7B

The model was chosen as one of the targets of this analysis because it is research-oriented and easy to hack on. Like the other models considered, OLMo uses a BPE tokenizer in which each number in the 0-999 range appears to be hardcoded to a single token.
Only numbers in this range are considered, even though the BPE tokenizer might also encode some larger integers as a single token.

It is also notable that OLMo shares many similarities with LLaMa.

To find which numbers are encoded as a single token (and hence a single embedding vector), we run the tokenizer on each number in increasing order until we reach the first one that is split into more than one token.

In [2]:
model_id = "allenai/OLMo-2-1124-7B"
loader = EmbeddingsLoader(model_id)
loader.smallest_multitoken_number()

1000
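The scan described above can be sketched independently of any particular model. This is a minimal illustration, not the implementation behind `EmbeddingsLoader.smallest_multitoken_number`: the `encode` argument and the `toy_encode` tokenizer are hypothetical stand-ins introduced only for this example.

```python
from typing import Callable, Optional, Sequence


def smallest_multitoken_number(
    encode: Callable[[str], Sequence], limit: int = 100_000
) -> Optional[int]:
    """Return the first non-negative integer whose decimal string needs more than one token."""
    for n in range(limit):
        if len(encode(str(n))) > 1:
            return n
    return None  # every number below `limit` was a single token


def toy_encode(text: str) -> list:
    """Hypothetical stand-in tokenizer: splits a digit string into chunks of up to three digits."""
    return [text[i : i + 3] for i in range(0, len(text), 3)]


first_split = smallest_multitoken_number(toy_encode)
print(first_split)  # 1000 for the toy tokenizer
```

With a Hugging Face tokenizer, `encode` could plausibly be something like `lambda s: tokenizer.encode(s, add_special_tokens=False)`, so that special tokens do not inflate the count; whether the loader does exactly this is an assumption.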

One of the goals of this analysis is to find structures in the embeddings.

Hypotheses:
- The representation of different models converges to similar structures
- Numerical embeddings have a representation that favors numerical calculation tasks
- There are structures that the embeddings converge towards in the pursuit of certain tasks
    - Some structures provide affordances that make certain tasks easier to solve.

It's also notable that assigning dedicated tokens to the numbers in the 0-999 range biases the model toward a direct representation of non-negative integers, possibly breaking symmetries with negative numbers.

TODO: Compare this with research about model convergence

In [161]:
# Loading the number embeddings and 1000 random embeddings for comparison

number_embeddings = loader.numbers()
random_embeddings = loader.random()

number_embeddings.data.shape

(1000, 4096)

Dimensionality reduction techniques are used to visualize structures that might emerge from the embeddings. Each projection is also compared with the same projection of random embeddings, to show that the structure is specific to the number embeddings.
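One way such a baseline might be drawn is sketched below. It is only an illustration on a synthetic matrix: the vocabulary size, the hidden size, and the assumption that the number tokens occupy a contiguous id range are all made up for the example, and `loader.random()` may well sample differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a model's embedding matrix: (vocab_size, hidden_size).
vocab_size, hidden_size = 5_000, 64
embedding_matrix = rng.normal(size=(vocab_size, hidden_size))

# Assumption for this sketch: the tokens for 0..999 occupy ids 100..1099.
number_ids = np.arange(100, 1100)
number_vectors = embedding_matrix[number_ids]

# Baseline: the same number of rows, drawn without replacement from the rest of the vocabulary.
other_ids = np.setdiff1d(np.arange(vocab_size), number_ids)
random_ids = rng.choice(other_ids, size=len(number_ids), replace=False)
random_vectors = embedding_matrix[random_ids]

print(number_vectors.shape, random_vectors.shape)
```

Excluding the number ids from the baseline keeps the two sets disjoint, so any structure shared by both plots cannot come from overlapping rows.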

# Linear Dimensionality Reduction

## Principal Component Analysis

The numerical embeddings form a clear curve, suggesting they might follow a meaningful geometric pattern. There are several possible explanations for this shape:

- The embeddings capture non-linear relationships between the number tokens, which may be present in natural language data.
- PCA tries to preserve large distances in the data, which can cause a "bending" of inherently sequential data when projected to lower dimensions.
- Curves in PCA might be caused by the Guttman effect, see [Camiz](https://www.researchgate.net/publication/228760485_The_Guttman_effect_Its_interpretation_and_a_new_redressing_method)
    - Maybe not, as similar structures appear using just SVD?
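The Guttman (arch) effect mentioned in the list can be reproduced on toy data that has nothing to do with the OLMo embeddings. The sketch below builds a "perfect scale" (an assumption chosen for illustration: row `i` has ones in columns `0..i`), runs PCA through an SVD of the centered matrix, and checks that the first component follows the ordering while the second is close to a quadratic function of the first, which is what draws the curve.

```python
import numpy as np

n = 200
# Toy "perfect scale": row i has ones in columns 0..i, so the rows are strictly ordered.
X = np.tril(np.ones((n, n)))

# PCA via SVD of the column-centered matrix.
X_centered = X - X.mean(axis=0)
U, S, _ = np.linalg.svd(X_centered, full_matrices=False)
scores = U * S  # row coordinates on the principal components
pc1, pc2 = scores[:, 0], scores[:, 1]

order = np.arange(n)
corr_order = np.corrcoef(pc1, order)[0, 1]  # PC1 tracks the ordering
corr_arch = np.corrcoef(pc1**2, pc2)[0, 1]  # PC2 behaves like a quadratic function of PC1

print(f"|corr(PC1, order)| = {abs(corr_order):.3f}")
print(f"|corr(PC1^2, PC2)| = {abs(corr_arch):.3f}")
```

Absolute values are used because the sign of each component is arbitrary. A curve in the first two components is therefore expected for any strongly ordered data, so on its own it does not prove that the embeddings encode anything beyond the ordering.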

The color gradient is smooth, showing that the embedding space captures numerical proximity. The top right part of the curve shows what looks like a smear. It might seem incidental, but I will argue, with further visualizations, that it represents a recursive encoding of the one- and two-digit numbers in the embedding space. Lower numbers find themselves in the right part of the color gradient, and they also happen to be in the right place

In [153]:
number_pca = number_embeddings.dim_reduction(PCA(n_components=1000))
random_pca = random_embeddings.dim_reduction(PCA(n_components=1000))

alt.hconcat(number_pca.plot(), random_pca.plot()).resolve_scale(color="independent")

### Explained variance

The explained variance distribution plot shows a sharp elbow around the 50-component mark. The cumulative explained variance plot shows that 90% of the variance can be explained by approximately 600 components, suggesting that the intrinsic dimensionality of the numerical embeddings is much lower than the 4096 dimensions of the embedding vectors. This supports the hypothesis that the data resides on a lower-dimensional manifold.
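For reference, the cumulative figure can be computed directly from the singular values of the centered data. The sketch below uses synthetic data (5 latent factors in 50 dimensions, an assumption made only for this example) rather than the embeddings, and `components_for` is a hypothetical helper. One caveat when reading the real numbers: with 1000 samples the centered matrix has rank at most 999, so the component count is bounded by the sample count as well as by the 4096 dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 5 latent factors embedded in 50 dimensions.
latent = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(300, 50))

# Squared singular values of the centered data are proportional to the PCA variances.
X_centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)


def components_for(threshold: float) -> int:
    """Smallest number of leading components whose cumulative ratio reaches `threshold`."""
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(cumulative))


print(components_for(0.90))
```

On this toy data the count cannot exceed the number of latent factors, since the added noise is tiny compared with the signal.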

In [158]:
number_pca.plot_variance_overview()

## Singular Value Decomposition

Applying SVD instead of PCA, and thereby skipping the mean centering, reveals some even more interesting patterns. In the plot of the first two components the same localized digit clusters appear, but they repeat much more consistently, suggesting that the model learns the same structure for each digit-count cluster (one-, two- and three-digit numbers).

- The structure seems to be fractal, which raises the question of whether the same structure would repeat if numbers in higher ranges (1000-9999 and so on) were also encoded as single tokens.
    - As much as I want to say fractal, this appears to happen only on this component pair, as the following plots show.

Skipping PCA's centering shows a much clearer structure, which suggests that information may be encoded in the absolute distance from the origin.
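The difference between the two decompositions comes down to centering: scikit-learn's `PCA` subtracts the per-feature mean before decomposing, while `TruncatedSVD` works on the data as given. The NumPy sketch below illustrates the consequence on a synthetic cloud (unit-variance noise shifted away from the origin, an assumption made only for this example): without centering the leading direction is dominated by the offset from the origin, with centering it is not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cloud (an assumption for illustration): unit-variance noise shifted far from the origin.
offset = np.full(20, 5.0)
X = rng.normal(size=(500, 20)) + offset
offset_direction = offset / np.linalg.norm(offset)


def first_direction(matrix: np.ndarray) -> np.ndarray:
    """First right singular vector, i.e. the leading projection axis."""
    _, _, Vt = np.linalg.svd(matrix, full_matrices=False)
    return Vt[0]


uncentered = first_direction(X)  # what an SVD without centering sees
centered = first_direction(X - X.mean(axis=0))  # what PCA sees

# Absolute cosine between each leading axis and the offset direction.
print("uncentered:", abs(uncentered @ offset_direction))
print("centered:  ", abs(centered @ offset_direction))
```

If the number embeddings sit far from the origin, the first SVD component will largely describe that shared offset, which is worth keeping in mind when comparing the SVD and PCA plots.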

In [None]:
number_svd = number_embeddings.dim_reduction(TruncatedSVD(n_components=100))
random_svd = random_embeddings.dim_reduction(TruncatedSVD(n_components=100))

alt.hconcat(number_svd.plot(), random_svd.plot()).resolve_scale(color="independent")

In [None]:
number_svd.plot_digit_overview()

In [163]:
tsne_kwargs = dict(
    perplexity=75,
    max_iter=3000,
    learning_rate=500,
    early_exaggeration=20,
    random_state=42,
)

number_tsne = number_embeddings.dim_reduction(TSNE(**tsne_kwargs))
random_tsne = random_embeddings.dim_reduction(TSNE(**tsne_kwargs))

alt.hconcat(number_tsne.plot(), random_tsne.plot()).resolve_scale(color="independent")

In [133]:
number_svd.plot_correlation_heatmap(20)

In [None]:
top_correlations_df = number_svd.top_correlations_df()
top_correlations_df

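| index | Component1 | Component2 | Correlation |
|---|---|---|---|
| 1 | 0 | 2 | -0.402066 |
| 0 | 0 | 1 | -0.365321 |
| 15 | 0 | 16 | -0.291709 |
| 4 | 0 | 5 | -0.245732 |
| 13 | 0 | 14 | 0.238143 |
| 3 | 0 | 4 | 0.210048 |
| 8 | 0 | 9 | 0.192235 |
| 5 | 0 | 6 | 0.166646 |
| 14 | 0 | 15 | 0.145475 |
| 6 | 0 | 7 | 0.123315 |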


In [159]:
number_svd.plot_top_correlated_components()

In [103]:
number_svd.plot_digit_overview(0, 12)

# Non-Linear Dimensionality Reduction

Other representations are shown here for reference, even though PCA seems to be the best one, as it gives a striking visual representation using only linear transformations. The other methods tried were t-SNE and UMAP, which in this particular case give the impression of an organized structure even in the random embeddings. There might be interesting observations to make about the digit encoding in this case.

In [None]:
# t-SNE hyperparameters, shared by the number and random projections
tsne_kwargs = dict(
    perplexity=75,
    max_iter=3000,
    learning_rate=500,
    early_exaggeration=20,
    random_state=42,
)

number_tsne = number_embeddings.dim_reduction(TSNE(**tsne_kwargs))
random_tsne = random_embeddings.dim_reduction(TSNE(**tsne_kwargs))

alt.hconcat(number_tsne.plot(), random_tsne.plot()).resolve_scale(color="independent")


In [None]:
number_tsne.plot_digit_overview()

In [146]:

umap_kwargs = dict()  # UMAP defaults

number_umap = number_embeddings.dim_reduction(UMAP(**umap_kwargs))
random_umap = random_embeddings.dim_reduction(UMAP(**umap_kwargs))

alt.hconcat(number_umap.plot(), random_umap.plot()).resolve_scale(color="independent")


