# Sentence Embeddings, Custom Distances, and Spherical Projections
This notebook walks through a workflow for generating, analyzing, and interpreting sentence embeddings. If you're used to the basic workflow of computing embeddings and projecting them to two dimensions, this notebook departs from it in a few ways:

1. It uses the `paraphrase-distilroberta-base-v1` model from the `sentence-transformers` library instead of Universal Sentence Encoder. I find this gives much better results.
2. It uses a custom distance metric, TS-SS, from the paper: [A. Heidarian and M. J. Dinneen, "A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering,"](https://ieeexplore.ieee.org/document/7474366?reload=true). I used [this implementation](https://github.com/taki0112/Vector_Similarity/blob/master/README.md) as a reference, but corrected an error and turned it into a numba function for faster computation.
3. When projecting with UMAP, it projects onto spherical coordinates instead of a traditional Euclidean 2D plane. I personally believe this leads to a much more intuitive understanding of embeddings. On a 2D plane we are in some sense "bound" by the limits of the x-axis and y-axis; it makes more sense to me that the embedding space is "wrapped" and doesn't have an end.
  1. The spherical projection also lets us cluster with haversine distance, so the clusters are derived from the spherical projection rather than the planar projection.
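
For reference, the TS-SS computation as implemented in the `tsss` function in this notebook can be written out as follows. The symbols are shorthand chosen for this note and are not necessarily the paper's notation: $a$ and $b$ are the two embedding vectors, $\mathrm{ED}$ is their Euclidean distance, and $\mathrm{MD}$ is the absolute difference of their magnitudes.

$$
\begin{aligned}
\theta' &= \arccos\!\left(\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}\right) + 10^\circ \\
\mathrm{ED} &= \lVert a - b \rVert, \qquad \mathrm{MD} = \bigl|\, \lVert a \rVert - \lVert b \rVert \,\bigr| \\
\mathrm{TS} &= \frac{\lVert a \rVert \, \lVert b \rVert \, \sin\theta'}{2} \\
\mathrm{SS} &= \pi \, (\mathrm{ED} + \mathrm{MD})^2 \, \frac{\theta'}{360^\circ} \\
\text{TS-SS} &= \mathrm{TS} \cdot \mathrm{SS}
\end{aligned}
$$

The triangle term grows with the angle between the vectors and with their magnitudes, while the sector term adds sensitivity to differences in distance and magnitude.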

In [1]:
import altair as alt
import numpy as np
import pandas as pd
import plotly.express as px
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from umap import UMAP
import numba


In [2]:
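@numba.njit()
def tsss(vec1, vec2):
    # TS-SS: triangle area similarity (TS) multiplied by sector area similarity (SS).
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    euclidean_distance = np.linalg.norm(vec1 - vec2)
    magnitude_difference = abs(norm1 - norm2)
    cosine_similarity = np.dot(vec1, vec2.T) / (norm1 * norm2)
    # Clamp to [-1, 1] so floating-point round-off cannot push arccos outside its domain.
    clamped_similarity = min(1.0, max(-1.0, cosine_similarity))
    # Angle between the vectors plus a 10-degree offset, in radians.
    theta = np.arccos(clamped_similarity) + np.radians(10.0)
    triangle = (norm1 * norm2 * np.sin(theta)) / 2
    sector = np.pi * ((euclidean_distance + magnitude_difference) ** 2) * (np.degrees(theta) / 360)
    return triangle * sector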
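
To get a feel for how TS-SS behaves, here is a minimal pure-Python sketch of the same arithmetic, using only the `math` module so it runs without numba. The helper name `tsss_reference` and the toy vectors are illustrative assumptions, not part of the notebook's pipeline, and the clamp before `acos` is a defensive addition. It suggests that two vectors pointing in the same direction but with different lengths get a nonzero TS-SS, whereas cosine similarity alone would treat them as identical.

```python
import math


def tsss_reference(a, b):
    """Hypothetical pure-Python mirror of the tsss arithmetic, for illustration only."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    magnitude_diff = abs(norm_a - norm_b)
    cos_sim = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_sim = max(-1.0, min(1.0, cos_sim))  # guard acos against round-off
    theta = math.acos(cos_sim) + math.radians(10.0)
    triangle = norm_a * norm_b * math.sin(theta) / 2
    sector = math.pi * (euclidean + magnitude_diff) ** 2 * (math.degrees(theta) / 360)
    return triangle * sector


same = tsss_reference([1.0, 0.0], [1.0, 0.0])        # identical vectors
scaled = tsss_reference([1.0, 0.0], [2.0, 0.0])      # same direction, different length
orthogonal = tsss_reference([1.0, 0.0], [0.0, 1.0])  # same length, different direction
print(same, scaled, orthogonal)
```

Identical vectors score zero because the sector term vanishes, and the score grows as the vectors diverge in either length or direction.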


In [3]:
df = (
    pd.read_csv("cleaned_hm.csv.xz")
    .dropna(subset=["ground_truth_category", "cleaned_hm"], how="any")
    .drop_duplicates(subset=["cleaned_hm"])
    .loc[:, ["cleaned_hm", "ground_truth_category"]]
    .reset_index(drop=True)
)


In [4]:
model = SentenceTransformer("paraphrase-distilroberta-base-v1")



In [5]:
docs = df["cleaned_hm"].str.lower().tolist()
embeddings = model.encode(docs, show_progress_bar=True, device="cpu", batch_size=64)


Batches: 100%|██████████| 199/199 [05:54<00:00,  1.78s/it]


In [6]:
umap = UMAP(min_dist=0.00, n_neighbors=30, metric=tsss, random_state=1234, verbose=True)
embeddings_umap = umap.fit_transform(embeddings)


UMAP(dens_frac=0.0, dens_lambda=0.0,
     metric=CPUDispatcher(<function tsss at 0x7fd04b217290>), min_dist=0.0,
     n_neighbors=30, random_state=1234, verbose=True)
Construct fuzzy simplicial set
Fri Jan 29 08:42:42 2021 Finding Nearest Neighbors
Fri Jan 29 08:42:42 2021 Building RP forest with 11 trees
Fri Jan 29 08:42:43 2021 NN descent for 14 iterations
	 1  /  14
	 2  /  14
	 3  /  14
	 4  /  14
	Stopping threshold met -- exiting after 4 iterations
Fri Jan 29 08:44:31 2021 Finished Nearest Neighbor Search
Fri Jan 29 08:44:34 2021 Construct embedding
	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Fri Jan 29 08:44:44 2021 Finished embedding


In [7]:
embeddings_df = pd.DataFrame()
embeddings_df["Document"] = docs
embeddings_df["Component 1"] = embeddings_umap[:, 0]
embeddings_df["Component 2"] = embeddings_umap[:, 1]
embeddings_df["Ground Truth"] = df["ground_truth_category"]


In [8]:
hdbscan = HDBSCAN(cluster_selection_method="leaf")



In [9]:
clusters = hdbscan.fit_predict(embeddings_umap)
embeddings_df["Cluster"] = clusters


In [10]:
alt.data_transformers.disable_max_rows()


DataTransformerRegistry.enable('default')

In [11]:
chart = (
    (
        alt.Chart(
            embeddings_df,
            height=1000,
            width=1000,
            title="Happy Moments - SBERT → UMAP (TSSS) → HDBSCAN",
        )
        .mark_point()
        .encode(
            x=alt.X("Component 1", axis=None),
            y=alt.Y("Component 2", axis=None),
            tooltip=["Document", "Cluster", "Ground Truth"],
            color="Cluster:N",
        )
    )
    .configure_axis(grid=False)
    .configure_view(strokeWidth=0)
    .interactive()
)


In [12]:
chart.save("docs/nlp-sbert-umap-tsss-hdbscan-chart.html")


In [13]:
sphere_mapper = UMAP(
    min_dist=0.00,
    n_neighbors=30,
    metric=tsss,
    output_metric="haversine",
    random_state=42,
    verbose=True,
)
umap_sphere_embeddings = sphere_mapper.fit_transform(embeddings)

e1, e2 = umap_sphere_embeddings[:, 0], umap_sphere_embeddings[:, 1]

x = np.sin(e1) * np.cos(e2)
y = np.sin(e1) * np.sin(e2)
z = np.cos(e1)

embeddings_df["Sphere Embedding X"] = x
embeddings_df["Sphere Embedding Y"] = y
embeddings_df["Sphere Embedding Z"] = z

embeddings_df["2D Sphere Projection X"] = np.arctan2(x, y)
embeddings_df["2D Sphere Projection Y"] = -np.arccos(z)


UMAP(dens_frac=0.0, dens_lambda=0.0,
     metric=CPUDispatcher(<function tsss at 0x7fd04b217290>), min_dist=0.0,
     n_neighbors=30, output_metric='haversine', random_state=42, verbose=True)
Construct fuzzy simplicial set
Fri Jan 29 08:44:44 2021 Finding Nearest Neighbors
Fri Jan 29 08:44:44 2021 Building RP forest with 11 trees
Fri Jan 29 08:44:45 2021 NN descent for 14 iterations
	 1  /  14
	 2  /  14
	 3  /  14
	 4  /  14
	Stopping threshold met -- exiting after 4 iterations
Fri Jan 29 08:46:24 2021 Finished Nearest Neighbor Search
Fri Jan 29 08:46:24 2021 Construct embedding
	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Fri Jan 29 08:47:33 2021 Finished embedding
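
The coordinate conversion in the preceding cell can be checked in isolation. The following sketch assumes the two UMAP output columns are angles in radians; the function names `to_sphere` and `to_plane` and the sample angles are made up for illustration. It maps a pair of angles to a point on the unit sphere and then flattens that point to the 2D coordinates used for plotting.

```python
import math


def to_sphere(e1, e2):
    """Map two angles (radians) to Cartesian coordinates on the unit sphere."""
    x = math.sin(e1) * math.cos(e2)
    y = math.sin(e1) * math.sin(e2)
    z = math.cos(e1)
    return x, y, z


def to_plane(x, y, z):
    """Flatten a unit-sphere point to the 2D coordinates used for plotting."""
    z = max(-1.0, min(1.0, z))  # guard acos against round-off
    return math.atan2(x, y), -math.acos(z)


x, y, z = to_sphere(1.0, 0.5)  # hypothetical angles, in radians
radius = math.sqrt(x * x + y * y + z * z)
proj_x, proj_y = to_plane(x, y, z)
print(radius, proj_x, proj_y)
```

Both projected values come out in radians: the first lies in [-π, π] and the second in [-π, 0].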


In [14]:
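# The projection columns are already in radians, so they must not be passed through np.radians.
# Haversine expects (latitude, longitude) pairs: shift the negated polar angle from
# [-pi, 0] to a latitude in [-pi/2, pi/2] and use the azimuth as the longitude.
latitude = embeddings_df["2D Sphere Projection Y"].to_numpy() + np.pi / 2
longitude = embeddings_df["2D Sphere Projection X"].to_numpy()
hdbscan_sphere = HDBSCAN(cluster_selection_method="eom", metric="haversine")
clusters_sphere = hdbscan_sphere.fit_predict(np.column_stack([latitude, longitude]))
embeddings_df["Cluster Sphere"] = clusters_sphere
embeddings_df["Cluster Sphere (str)"] = embeddings_df["Cluster Sphere"].astype(str)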
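
To illustrate why haversine distance suits a "wrapped" space, here is a small sketch of the standard haversine formula on the unit sphere, taking (latitude, longitude) pairs in radians. The function name and sample points are illustrative assumptions. Two points on either side of the ±π longitude seam are far apart in the flattened plane but close on the sphere.

```python
import math


def haversine(p, q):
    """Great-circle distance on the unit sphere; p and q are (lat, lon) in radians."""
    (lat1, lon1), (lat2, lon2) = p, q
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * math.asin(math.sqrt(min(1.0, h)))


a = (0.0, math.pi - 0.05)   # longitude just below +pi
b = (0.0, -math.pi + 0.05)  # longitude just above -pi
planar = math.dist(a, b)    # straight-line distance in the flattened plane
wrapped = haversine(a, b)   # distance along the sphere's surface
print(planar, wrapped)
```

A clustering step that uses the planar distance would separate these two points, while one that uses haversine distance treats them as neighbors.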


In [15]:
fig = px.scatter(
    embeddings_df,
    x="2D Sphere Projection X",
    y="2D Sphere Projection Y",
    color="Cluster Sphere (str)",
    hover_data=["Document", "Cluster Sphere", "Ground Truth"],
    opacity=0.8,
    color_discrete_sequence=px.colors.qualitative.Alphabet,
)

fig.write_html("docs/nlp-sbert-umap-hdbscan-tsss-sphere-2d.html")


In [16]:
fig = px.scatter_3d(
    embeddings_df,
    x="Sphere Embedding X",
    y="Sphere Embedding Y",
    z="Sphere Embedding Z",
    color="Cluster Sphere (str)",
    hover_data=["Document", "Cluster Sphere", "Ground Truth"],
    color_discrete_sequence=px.colors.qualitative.Alphabet,
    opacity=0.8,
)

camera = dict(up=dict(x=0, y=0, z=0), center=dict(x=0, y=0, z=0), eye=dict(x=0, y=0, z=0))

fig.update_layout(
    scene_camera=camera,
    scene=dict(
        bgcolor="black",
        xaxis=dict(
            showbackground=False,
            gridcolor="black",
            zerolinecolor="black",
        ),
        yaxis=dict(
            showbackground=False,
            gridcolor="black",
            zerolinecolor="black",
        ),
        zaxis=dict(
            showbackground=False,
            gridcolor="black",
            zerolinecolor="black",
        ),
    ),
)

fig.write_html("docs/nlp-sbert-umap-tsss-hdbscan-sphere.html")
