# XAI W8 Assignment

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/purplevjs/XAI_W8)

In [67]:
from huggingface_hub import notebook_login
notebook_login()



# **GIST-all-MiniLM-L6-v2**  
This model is fine-tuned from sentence-transformers/all-MiniLM-L6-v2 on the MEDI dataset, augmented with triplets mined from the MTEB Classification training datasets (excluding data from the Amazon Polarity Classification task).

#### Data
- The dataset used is a compilation of the MEDI and MTEB Classification training datasets  
- Dataset: avsolatorio/medi-data-mteb_avs_triplets



In [68]:
!pip install gensim==4.3.2 matplotlib==3.7.1 scikit-learn==1.2.2 umap-learn==0.5.6 plotly==5.15.0 "sentence-transformers>=3.0"



In [69]:
# Basic
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

# Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap


In [70]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("avsolatorio/GIST-all-MiniLM-L6-v2")

sentences = [
    "That is a happy person",
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day"
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

torch.Size([4, 4])
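
As a rough illustration of what the 4 x 4 matrix contains, the sketch below computes pairwise cosine similarity with NumPy. It assumes `model.similarity` is using its default cosine metric. `cosine_similarity_matrix` and the random `toy_embeddings` are hypothetical stand-ins rather than part of the notebook above, so the sketch runs without downloading the model.

```python
import numpy as np


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Return the pairwise cosine similarity between the rows of `vectors`."""
    # Scale every row to unit length so the dot product equals the cosine.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T


# Stand-in data (assumption): 4 random vectors with 384 dimensions,
# the embedding size of all-MiniLM-L6-v2.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(4, 384))

toy_similarities = cosine_similarity_matrix(toy_embeddings)
print(toy_similarities.shape)
```

Each diagonal entry is 1 because every vector is compared with itself, and the matrix is symmetric. Passing the real `embeddings` array instead of `toy_embeddings` should give values close to those returned by `model.similarity`.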


#### PCA (Principal Components Analysis)


In [71]:
# Apply PCA
pca = PCA(n_components=2)
embeddings_pca = pca.fit_transform(embeddings)

# Plot PCA results using Plotly for interactivity
fig_pca = px.scatter(
    embeddings_pca, x=0, y=1,
    text=sentences,
    title="PCA of GIST-all-MiniLM-L6-v2 Embeddings",
    labels={'0': 'Principal Component 1', '1': 'Principal Component 2'}
)
fig_pca.update_traces(marker=dict(size=8))
fig_pca.show()

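PCA is a linear projection that keeps the directions of greatest variance, so its plot appears more condensed, with less separation between points than the t-SNE and UMAP plots. It does not emphasize local neighborhoods the way those methods do, and points that are dissimilar in the high-dimensional space may land close together in two dimensions.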
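
One way to judge how much the 2D view leaves out is the explained variance ratio. The sketch below reimplements the projection with a NumPy SVD to show what is being measured. `pca_project` and the random `toy_embeddings` are hypothetical stand-ins, not the notebook's data.

```python
import numpy as np


def pca_project(data: np.ndarray, n_components: int = 2):
    """Project `data` onto its top principal axes via SVD.

    Returns the projected points and the share of total variance
    captured by each kept component.
    """
    centered = data - data.mean(axis=0)
    # Rows of `vt` are the principal axes, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    variance_share = singular_values**2 / np.sum(singular_values**2)
    return projected, variance_share[:n_components]


# Stand-in data (assumption): 4 random vectors with 384 dimensions.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(4, 384))

toy_projected, toy_ratio = pca_project(toy_embeddings)
print(toy_projected.shape)
print(toy_ratio.sum())  # fraction of variance kept in 2D
```

With the fitted scikit-learn object from the cell above, the same quantity should be available as `pca.explained_variance_ratio_`. Because four centered points span at most three dimensions, three components would already capture all of the variance in this dataset.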

#### t-SNE (t-distributed Stochastic Neighbor Embedding)

In [73]:
# Apply t-SNE
# perplexity must be smaller than the number of samples (4 sentences here)
tsne = TSNE(n_components=2, perplexity=3, n_iter=300, random_state=42)
embeddings_tsne = tsne.fit_transform(embeddings)

# Plot t-SNE results using Plotly
fig_tsne = px.scatter(
    embeddings_tsne, x=0, y=1,
    text=sentences,
    title="t-SNE of GIST-all-MiniLM-L6-v2 Embeddings",
    labels={'0': 'Component 1', '1': 'Component 2'}
)
fig_tsne.update_traces(marker=dict(size=8))
fig_tsne.show()

In the t-SNE plot, the points are arranged quite differently from the UMAP plot because t-SNE emphasizes local neighborhoods. Some clustering is visible, but t-SNE does not preserve global distances, so the gaps between groups can look larger than they are in the original space.
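
To put a number on how much a projection distorts distances, one option is to correlate the pairwise distances before and after the reduction. The sketch below is a simple check under that assumption, not an established quality metric. `distance_preservation`, `toy_high`, and `toy_low` are hypothetical names introduced for illustration.

```python
import numpy as np


def pairwise_distances(points: np.ndarray) -> np.ndarray:
    """Return the Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))


def distance_preservation(high: np.ndarray, low: np.ndarray) -> float:
    """Pearson correlation between pairwise distances in two spaces."""
    # Use the upper triangle only so each pair is counted once.
    idx = np.triu_indices(len(high), k=1)
    d_high = pairwise_distances(high)[idx]
    d_low = pairwise_distances(low)[idx]
    return float(np.corrcoef(d_high, d_low)[0, 1])


# Stand-in data (assumption): random vectors and a crude 2D "projection"
# made by keeping only the first two coordinates.
rng = np.random.default_rng(0)
toy_high = rng.normal(size=(4, 384))
toy_low = toy_high[:, :2]

toy_score = distance_preservation(toy_high, toy_low)
print(toy_score)
```

A value near 1 means the layout keeps the relative distances of the original space. In the notebook, calling `distance_preservation(embeddings, embeddings_tsne)` and repeating the call for the PCA and UMAP outputs would allow a like-for-like comparison.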

#### UMAP (Uniform Manifold Approximation and Projection)

In [74]:
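# n_neighbors=15 exceeds the 4 available sentences, so UMAP truncates it (see the warning in the output)
umap_model = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)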
embeddings_umap = umap_model.fit_transform(embeddings)

# Plot UMAP results using Plotly
fig_umap = px.scatter(
    embeddings_umap, x=0, y=1,
    text=sentences,
    title="UMAP of GIST-all-MiniLM-L6-v2 Embeddings",
    labels={'0': 'Component 1', '1': 'Component 2'}
)
fig_umap.update_traces(marker=dict(size=8))
fig_umap.show()



n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.


n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1



In the UMAP plot, the sentences are clearly separated. UMAP often places similar sentences in distinct clusters, which makes it well suited to exploring cluster structure and finer-grained differences.
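
The warning in the output above shows that `n_neighbors=15` was truncated to `X.shape[0] - 1`, which is 3 for four sentences. A small helper can pick a valid value up front so the setting is explicit. `clamp_n_neighbors` is a hypothetical helper, and the lower bound of 2 is an assumption about what UMAP accepts.

```python
def clamp_n_neighbors(requested: int, n_samples: int) -> int:
    """Return an n_neighbors value that fits a dataset of `n_samples` rows.

    Mirrors the truncation described in the UMAP warning: the value is
    capped at n_samples - 1. The lower bound of 2 is an assumption.
    """
    if n_samples < 3:
        raise ValueError("need at least 3 samples to choose n_neighbors")
    return max(2, min(requested, n_samples - 1))


print(clamp_n_neighbors(15, 4))    # 3
print(clamp_n_neighbors(15, 100))  # 15
```

In the notebook this could be used as `umap.UMAP(n_neighbors=clamp_n_neighbors(15, len(sentences)), ...)`, which should avoid the truncation warning while producing the same effective setting.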

#### Citation


@article{solatorio2024gistembed,
    title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
    author={Aivin V. Solatorio},
    journal={arXiv preprint arXiv:2402.16829},
    year={2024},
    URL={https://arxiv.org/abs/2402.16829},
    eprint={2402.16829},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
