# Scientific literature mining

Embeddings are vector representations that capture semantic meaning, allowing us to measure similarity between documents in a high-dimensional space. Using embeddings, we will mine the scientific literature to identify relationships between papers and find papers similar to a target query.
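
To make "measuring similarity" concrete, here is a minimal sketch of cosine similarity, the measure used later in this notebook, on toy 3-dimensional vectors. The vectors and the `cosine_sim` helper are made up for illustration; the real embeddings have many more dimensions.

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy "embeddings" (illustrative values, not produced by any model).
doc_a = np.array([1.0, 2.0, 0.0])
doc_b = np.array([2.0, 4.0, 0.0])  # Same direction as doc_a, twice as long.
doc_c = np.array([0.0, 0.0, 3.0])  # Orthogonal to doc_a.

print(f"a vs b: {cosine_sim(doc_a, doc_b):.3f}")  # Close to 1: same direction.
print(f"a vs c: {cosine_sim(doc_a, doc_c):.3f}")  # 0: nothing in common.
```

Cosine similarity ignores the length of the vectors and only compares their direction, which is why `doc_a` and `doc_b` count as maximally similar.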

Since embeddings are high-dimensional vectors, they cannot be easily visualized. We will use a non-linear dimensionality reduction technique, UMAP, to project these vectors into a 2D space, allowing us to visualize the relationships between papers.

We have already extracted the title and abstract of several preprints from arXiv. These papers cover multiple topics:
- *nanoporous materials*
- *many-body*
- *machine learning*
- *quantum computing*
- *biomolecular modeling*

## Load required libraries

In [18]:
from itertools import cycle
import json
from pprint import pprint

from fastembed import TextEmbedding
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
from sklearn.metrics.pairwise import cosine_similarity
from tqdm.auto import tqdm
import umap.umap_ as umap

tqdm.pandas()

## Load the model

See [sentence_embeddings.ipynb](sentence_embeddings.ipynb) for details about the model.

Loading the model can take a few minutes, so be patient.

In [19]:
model = TextEmbedding("nomic-ai/nomic-embed-text-v1.5-Q")

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

## Load the data

The file `arxiv_papers.json` contains the metadata of the preprints: title, abstract, publication date, arXiv id and search category. It was generated following the procedure described in [extract_arxiv.ipynb](extract_arxiv.ipynb). You don't need to run that notebook: the file is already available in the `content` folder.

In [20]:
# Load the JSON data; the context manager closes the file afterwards.
with open("arxiv_papers.json", encoding="utf-8") as f_in:
    data = json.load(f_in)
# Convert to a pandas DataFrame.
papers = pd.json_normalize(data)

Display number of papers and the first paper:

In [21]:
print(f"Dimensions of papers dataframe: {papers.shape}")
print("First paper:")
print(papers.iloc[0])

Dimensions of papers dataframe: (400, 5)
First paper:
id                          http://arxiv.org/abs/2402.01321v1
date                                     2024-02-02T11:17:55Z
title       Ionic Current Rectification in Nanopores: Effe...
abstract    Ionic Current Rectification (ICR) can appear i...
category                                 nanoporous materials
Name: 0, dtype: object


## Generate embeddings

Compute embedding for an example text:

In [22]:
example = "Sample text to embed."
embedding = list(model.embed(example))[0]
print(f"Dimensions of embedding: {len(embedding)}")
print("Embedding:\n", embedding)

Dimensions of embedding: 768
Embedding:
 [ 6.85422361e-01  4.58688557e-01 -3.03724766e+00 -1.82744217e+00
  6.51969671e-01 -1.29643595e+00  4.81724292e-01  1.07385135e+00
 -1.17998578e-01 -7.65763342e-01 -8.64521801e-01  1.75569341e-01
  1.18530381e+00  1.41094482e+00 -1.63788497e+00  5.00717998e-01
  1.13027537e+00 -1.26548994e+00 -3.73842865e-01  9.51564729e-01
 -8.08863118e-02 -7.42730260e-01 -4.02823269e-01  3.35277945e-01
  2.52884483e+00 -6.13623746e-02  3.61596733e-01  2.89210647e-01
 -9.97816205e-01 -7.75394812e-02  1.63237587e-01 -9.55110788e-02
 -6.27715170e-01 -1.04973698e+00  3.31976026e-01 -2.75149763e-01
 -5.32612741e-01  7.78294563e-01 -1.19025135e+00  1.85220286e-01
  1.06071424e+00  2.93117553e-01 -7.58224964e-01 -3.90468806e-01
  5.27690351e-01 -1.20820570e+00  3.43614668e-01 -5.94544113e-02
  9.78886843e-01 -1.47128510e+00  7.82864809e-01 -3.54484141e-01
  1.63565308e-01 -1.43609643e+00  2.02690530e+00 -3.45402271e-01
 -2.20752686e-01 -1.56293499e+00  8.58447477e-02 

We obtained a vector of 768 dimensions, which is the size of the embedding space for this model.

We will now compute embeddings for each paper using the model we loaded earlier. The input text is the concatenation of the title and abstract of each paper, since both convey important semantic meaning.

This step can take a few minutes, so be patient. We will store the embeddings in a new column of the dataframe.

In [23]:
def get_embedding(row: pd.Series) -> np.ndarray:
    """Get the embedding for a paper's title and abstract (merged)."""
    text_to_embed = row["title"] + " " + row["abstract"]
    return list(model.embed(text_to_embed))[0]

papers["embedding"] = papers.progress_apply(get_embedding, axis="columns")

  0%|          | 0/400 [00:00<?, ?it/s]
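
As a possible alternative to the row-by-row `progress_apply`, all texts can be passed to the embedding function in a single call. The sketch below assumes that the embedding function accepts a list of strings and yields one vector per text, which is how fastembed's `TextEmbedding.embed` is documented to behave. `toy_embed`, `build_texts` and `embed_all` are hypothetical names, and `toy_embed` is only a stand-in so that the example runs without downloading the model.

```python
from typing import Callable, Iterable, List

import numpy as np


def build_texts(titles: Iterable[str], abstracts: Iterable[str]) -> List[str]:
    """Concatenate each title with its abstract, as done in get_embedding()."""
    return [f"{title} {abstract}" for title, abstract in zip(titles, abstracts)]


def embed_all(
    texts: List[str],
    embed_fn: Callable[[List[str]], Iterable[np.ndarray]],
) -> np.ndarray:
    """Embed all texts in one call and stack the vectors into a 2D array."""
    return np.vstack(list(embed_fn(texts)))


def toy_embed(texts: List[str]) -> Iterable[np.ndarray]:
    """Hypothetical stand-in for a real model: yields a 2D vector per text."""
    for text in texts:
        yield np.array([len(text), text.count(" ")], dtype=np.float32)


titles = ["Paper A", "Paper B"]
abstracts = ["First abstract.", "Second, longer abstract."]
texts = build_texts(titles, abstracts)
matrix = embed_all(texts, toy_embed)
print(matrix.shape)  # One row per paper, one column per embedding dimension.
```

With the real model, the assumed equivalent call would be `embed_all(texts, model.embed)`, which leaves batching to the library instead of invoking the model once per row.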

Let's display the first paper with its embedding:

In [24]:
print(papers.iloc[0])

id                           http://arxiv.org/abs/2402.01321v1
date                                      2024-02-02T11:17:55Z
title        Ionic Current Rectification in Nanopores: Effe...
abstract     Ionic Current Rectification (ICR) can appear i...
category                                  nanoporous materials
embedding    [1.384288, 1.9946833, -3.2957623, -0.6682733, ...
Name: 0, dtype: object


## Embeddings visualization

Embeddings are high-dimensional vectors, and visualizing them directly is not feasible. Instead, we will reduce their dimensionality to 2D using UMAP. UMAP is a non-linear dimensionality reduction technique, particularly well-suited for visualizing high-dimensional data like embeddings.
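
For contrast with a linear method, the sketch below projects toy data to 2D with PCA computed from a singular value decomposition. PCA is swapped in here purely as a point of comparison and is not what this notebook uses; the random data and the `pca_2d` helper are illustrative assumptions.

```python
import numpy as np


def pca_2d(x: np.ndarray) -> np.ndarray:
    """Project the rows of x onto their first two principal components."""
    centered = x - x.mean(axis=0)
    # The rows of vt are the principal axes, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Toy data: 50 random "embeddings" with 16 dimensions.
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 16))
coords = pca_2d(toy)
print(coords.shape)  # Two coordinates per input row.
```

A linear projection like this one can only rotate and flatten the data, whereas a non-linear method such as UMAP tries to keep neighboring points close together, which tends to produce more readable clusters for embeddings.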

In [25]:
# Extract the embeddings into a 2D array for UMAP.
embeddings_array = np.vstack(papers["embedding"].values)
print(f"Shape of embeddings array: {embeddings_array.shape}")
# Create a UMAP reducer and fit-transform the embeddings.
umap_reducer = umap.UMAP(metric="cosine", n_components=2, random_state=42)
reduce_embeddings = umap_reducer.fit_transform(embeddings_array)
# Add the UMAP coordinates to the DataFrame as two new columns.
papers[["umap_x", "umap_y"]] = reduce_embeddings

Shape of embeddings array: (400, 768)






Let's display the first paper with its new columns:

In [26]:
print(papers.iloc[0])

id                           http://arxiv.org/abs/2402.01321v1
date                                      2024-02-02T11:17:55Z
title        Ionic Current Rectification in Nanopores: Effe...
abstract     Ionic Current Rectification (ICR) can appear i...
category                                  nanoporous materials
embedding    [1.384288, 1.9946833, -3.2957623, -0.6682733, ...
umap_x                                               -6.604837
umap_y                                                6.002138
Name: 0, dtype: object


We now define a helper function to prepare the content of the tooltip displayed when hovering over a point in the plot.

In [27]:
def set_tooltip(row: pd.Series) -> str:
    """Create a tooltip for each paper."""
    label = (
        f"<b>Title:</b> {row['title']}<br>"
        f"<b>Category:</b> {row['category']}<br>"
    )
    return label

# We apply the tooltip function to each row (= paper) of the DataFrame.
papers["tooltip"] = papers.apply(set_tooltip, axis="columns")

Display the figure. Each point represents a paper. The points are colored by their category.

In [28]:
colors = cycle(px.colors.qualitative.Plotly)
layout = {
    "title": "2D UMAP Embeddings",
    "width": 800,
    "height": 600,
    "plot_bgcolor": "rgba(0,0,0,0)",
    "hovermode": "closest",
}

fig = go.Figure(layout=layout)
for label in papers["category"].unique():
    color = next(colors)
    subset = papers[papers["category"] == label]
    trace = go.Scattergl(
        x=subset["umap_x"],
        y=subset["umap_y"],
        name=label,
        mode="markers",
        marker=dict(
            color=color,
            size=8,
            line=dict(width=0.5),
            opacity=0.75,
        ),
        text=subset["tooltip"],
    )
    fig.add_trace(trace)

fig.show()

We observe that papers from the same category tend to cluster together, indicating that the embeddings capture semantic relationships between papers. However, a few papers fall outside the cluster of their assigned category, which suggests that their content is closer to papers from another category.

This is, for instance, the case for the following paper, which is labelled *biomolecular modeling* but is clustered with the *machine learning* papers.

In [29]:
target_paper = "K-means and Cluster Models"
pprint(papers
    .query("title.str.contains(@target_paper, case=False, na=False)")
    .loc[:, ["title", "category", "abstract"]]
    .transpose()
    .to_dict()
)

{391: {'abstract': 'We present *K-means clustering algorithm and source code '
                   'by expanding statistical clustering methods applied in '
                   'https://ssrn.com/abstract=2802753 to quantitative finance. '
                   '*K-means is statistically deterministic without specifying '
                   'initial centers, etc. We apply *K-means to extracting '
                   'cancer signatures from genome data without using '
                   "nonnegative matrix factorization (NMF). *K-means' "
                   "computational cost is a fraction of NMF's. Using 1,389 "
                   'published samples for 14 cancer types, we find that 3 '
                   'cancers (liver cancer, lung cancer and renal cell '
                   'carcinoma) stand out and do not have cluster-like '
                   'structures. Two clusters have especially high '
                   'within-cluster correlations with 11 other cancers '
                   'indica
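
Such out-of-place papers can also be flagged numerically rather than by eye. The sketch below uses a simple heuristic that is not part of this notebook: for each paper, find its nearest neighbor by cosine similarity and check whether both share the same category. The toy 2D vectors, the labels and the `nearest_neighbor_labels` helper are invented for illustration.

```python
import numpy as np


def nearest_neighbor_labels(embeddings: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Return, for each row, the label of its most cosine-similar other row."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T  # Pairwise cosine similarities.
    np.fill_diagonal(sims, -np.inf)  # A paper must not be its own neighbor.
    return labels[sims.argmax(axis=1)]


toy_embeddings = np.array([
    [1.0, 0.10],  # ml
    [1.0, 0.12],  # ml
    [1.0, 0.08],  # ml
    [0.1, 1.00],  # bio
    [0.2, 1.00],  # bio
    [1.0, 0.30],  # Labelled bio, but pointing toward the ml papers.
])
toy_labels = np.array(["ml", "ml", "ml", "bio", "bio", "bio"])

neighbor_labels = nearest_neighbor_labels(toy_embeddings, toy_labels)
mismatched = np.flatnonzero(neighbor_labels != toy_labels)
print(mismatched)  # Only the last paper (index 5) disagrees with its neighbor.
```

With the real data, one would presumably pass `embeddings_array` and `papers["category"].to_numpy()`. A single nearest neighbor is a noisy signal, so a majority vote over several neighbors may be more robust.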

## Find papers similar to a target query

We will now look for the papers most similar to a target query. We compute the embedding of the query, then find its nearest neighbors in the embedding space.

We will also visualize the nearest neighbors in the embedding space to see how they relate to the target query.

In [30]:
target = """
We find that long DNA molecules that have binding affinity
for the nanostars are preferentially enriched on the interface
"""
target_embedding = list(model.embed(target))[0]

In [31]:

# Compute the cosine similarity between the target embedding and each paper embedding.
similarities = cosine_similarity(embeddings_array, target_embedding.reshape(1, -1))

In [32]:
# Flatten the (n_papers, 1) similarity array and add it to the DataFrame.
papers["similarity_score"] = similarities.flatten()

### Finding the most similar papers

In [33]:
# Sort papers by similarity score in descending order
most_similar_papers = papers.sort_values(by="similarity_score", ascending=False)

# Display the top 10 most similar papers
print("Top 10 papers most similar to query:")
print("-"*50)
print(target)
print("-"*50)
for i, (index, row) in enumerate(most_similar_papers.head(10).iterrows(), 1):
    print(f"{i:2d}: {row['title']} (Score: {row['similarity_score']:.3f})")

Top 10 papers most similar to query:
--------------------------------------------------

We find that long DNA molecules that have binding affinity
for the nanostars are preferentially enriched on the interface

--------------------------------------------------
 1: Controlling the size and adhesion of DNA droplets using surface-active DNA molecules (Score: 0.760)
 2: Identification of DNA Bases Using Nanopores Created in Finite-Size Nanoribbons from Graphene, Phosphorene, and Silicene (Score: 0.689)
 3: DNA translocation through nanopores with salt gradients: The role of osmotic flow (Score: 0.659)
 4: First principles investigation of nanopore sequencing using variable voltage bias on graphene-based nanoribbons (Score: 0.658)
 5: Condensation and activator/repressor control of a transcription-regulated biomolecular liquid (Score: 0.657)
 6: Quantum Capacitance Modifies Interionic Interactions in Semiconducting Nanopores (Score: 0.642)
 7: A zero-depth nanopore capillary for the analy
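
Sorting the whole DataFrame is perfectly fine for a few hundred papers. For much larger collections, a partial selection of the best scores avoids a full sort. The following is a minimal sketch on made-up scores; `top_k_indices` is a hypothetical helper, not something defined in this notebook.

```python
import numpy as np


def top_k_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k highest scores, best first."""
    k = min(k, len(scores))
    if k <= 0:
        return np.array([], dtype=int)
    # argpartition places the k largest scores in the last k positions
    # without fully sorting the array.
    candidates = np.argpartition(scores, -k)[-k:]
    # Only these k candidates are then sorted, in descending order.
    return candidates[np.argsort(scores[candidates])[::-1]]


# Toy similarity scores (illustrative values only).
scores = np.array([0.12, 0.76, 0.33, 0.69, 0.05, 0.66])
print(top_k_indices(scores, 3))  # [1 3 5]
```

The returned positions could then be used with `papers.iloc[...]` to retrieve the matching rows, assuming the scores are in the same order as the DataFrame.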

### Visualizing similarity for all papers

In [34]:
# Define the layout for the plot.
layout = {
    "title": "2D UMAP Embeddings Colored by Similarity Score",
    "width": 800,
    "height": 600,
    "plot_bgcolor": "rgba(0,0,0,0)",  # Set background to transparent.
    "hovermode": "closest",  # Show the tooltip of the nearest point on hover.
}
# Create a new figure with the defined layout.
fig = go.Figure(layout=layout)
# Add a scatter plot with points representing papers.
# Points are colored based on their similarity to the target query.
trace = go.Scattergl(
    x=papers["umap_x"],
    y=papers["umap_y"],
    mode="markers",
    marker=dict(
        color=papers["similarity_score"],
        colorscale="Viridis",
        colorbar=dict(title="Similarity Score"),
        size=8,
        line=dict(width=0.5),
        opacity=0.50,
    ),
    text=papers["tooltip"],
)
# Add the trace to the figure, then show it.
fig.add_trace(trace)
fig.show()