In [69]:
from transformers import AutoTokenizer, AutoModel
import torch

sentences = [
    "We investigate the performance of QuantiScale, a new multi-touch interaction technique for the quantification of distances in medical images and discuss the benefits and prospects of redesigning interactions with multi-touch devices. Taking advantage of the multi-touch capabilities, QuantiScale behaves like a tape measure, but automatically adjusts the view onto the measured object to improve precision and speed. The technique has been studied in a real-world scenario measuring the diameter of structures for the diagnostic reading of medical images and provides hints for the replacement of traditional mouse-based interaction with gestural interaction. Results of the quantitative evaluation indicate a high measurement precision particularly for small objects. Participants experienced QuantiScale as being more fun, natural, and intuitive in comparison to mouse-based interaction even though the subjective preference for speed and precision was still in favor of the mouse.",
    "Brain MR images are one of the most important instruments for diagnosing neurological disorders such as tumors, infections or trauma. In particular, grade I-IV brain tumors are a well-studied subject for supervised deep learning approaches. However, for a clinical use of these approaches, a very large annotated database that covers all of the occurring variance is necessary. As MR scanners are not quantitative, it is unclear how good supervised approaches, trained on a specific database, will actually perform on a new set of images that may stem from a yet other scanner.",
    "Neuroscientists investigate neural circuits in the brain of the common fruit fly Drosophila melanogaster to discover how complex behavior is generated. Hypothesis building on potential connections between individual neurons is an essential step in the discovery of circuits that govern a specific behavior. ",
    "Real-Time Visualization of 3D Amyloid-Beta Fibrils from 2D Cryo-EM Density Maps",
    "The BundleExplorer: A Focus and Context Rendering Framework for Complex Fiber Distributions",
    "Mammogram Classification and Abnormality Detection from Nonlocal Labels using Deep Multiple Instance Neural Network",
    "Recent Advances in MRI and Ultrasound Perfusion Imaging",
    "Molecular Sombreros: Abstract Visualization of Binding Sites within Proteins",
    "CT-Based Navigation Guidance for Liver Tumor Ablation",
    "Illustrated Ultrasound for Multimodal Data Interpretation of Liver Examinations"]

tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")
model = AutoModel.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")
input_ids = []
attention_masks = []

for sent in sentences:
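    inputs = tokenizer.encode_plus(
        sent,  # Sentence to encode.
        add_special_tokens=True,  # Add the [CLS] and [SEP] tokens.
        padding='max_length',  # Pad every sequence to max_length.
        truncation=True,  # Cut longer inputs so torch.cat below sees equal lengths.
        max_length=512,
        return_attention_mask=True,  # Construct attn. masks.
        return_tensors='pt',  # Return pytorch tensors.
    )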

    # Add the encoded sentence to the list.
    input_ids.append(inputs['input_ids'])

    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(inputs['attention_mask'])

input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)

print(input_ids)
print(attention_masks)



Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tensor([[    2,  1802,  3357,  ...,     0,     0,     0],
        [    2,  2826,  2817,  ...,     0,     0,     0],
        [    2, 16573, 10735,  ...,     0,     0,     0],
        ...,
        [    2,  2894,  5313,  ...,     0,     0,     0],
        [    2,  3215,    16,  ...,     0,     0,     0],
        [    2, 11685,  5305,  ...,     0,     0,     0]])
tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])
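
The two tensors printed above are linked: each row of `input_ids` is filled with the pad id `0` up to `max_length`, and the attention mask holds `1` for real tokens and `0` for padding. A minimal pure-Python sketch of that relationship follows. The helper `pad_batch` and the toy token ids are hypothetical and are not part of the tokenizer API; the pad id `0` is taken from the output above.

```python
def pad_batch(sequences, max_length, pad_id=0):
    """Pad token-id lists to max_length and build matching attention masks."""
    input_ids, attention_masks = [], []
    for seq in sequences:
        seq = seq[:max_length]  # Truncate anything longer than max_length.
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)  # Real ids, then padding.
        attention_masks.append([1] * len(seq) + [0] * n_pad)  # 1 = real, 0 = pad.
    return input_ids, attention_masks


# Toy ids for illustration only.
ids, masks = pad_batch([[2, 1802, 3357, 3], [2, 2826, 3]], max_length=6)
print(ids)
print(masks)
```

The tokenizer does this internally; the sketch only makes visible why the mask rows end in zeros wherever the id rows do.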


In [73]:
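with torch.no_grad():  # Inference only, so skip gradient tracking.
    outputs = model(input_ids=input_ids, attention_mask=attention_masks)
print(outputs)
# outputs[1] is the pooler output: one vector per input sequence, derived from
# the [CLS] token. It is not the per-token hidden states (those are outputs[0]).
hidden_states = outputs[1]
print("'")
print(hidden_states)
hidden_states = hidden_states.numpy()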

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-4.3167e-01, -5.6587e-01, -4.9761e-01,  ...,  3.1688e-02,
           6.5449e-01, -9.4807e-02],
         [-5.8376e-01,  9.0268e-01,  4.7493e-01,  ..., -6.3690e-01,
           5.4595e-01, -6.3757e-01],
         [-8.3982e-01,  1.0161e+00,  5.5878e-02,  ..., -8.3432e-01,
           1.2640e-01,  2.8915e-01],
         ...,
         [-1.5023e-01,  1.5020e-01, -2.2009e-01,  ...,  4.1392e-01,
           1.9182e-01, -3.6286e-02],
         [-1.5594e-01, -1.7931e-02,  1.3513e-01,  ...,  4.4648e-01,
           1.8536e-01, -2.8775e-01],
         [-3.1814e-02, -1.0040e-01, -1.0307e-03,  ...,  3.0099e-01,
           3.2990e-01, -3.7890e-01]],

        [[-7.1307e-01,  8.1360e-01,  3.1438e-01,  ..., -9.5665e-02,
           3.2612e-01, -4.4256e-01],
         [-4.1611e-01,  1.1066e+00,  7.8100e-01,  ..., -6.7798e-01,
           9.2582e-01, -1.4066e-01],
         [ 3.3041e-01,  1.0707e+00,  7.8344e-01,  ...,  4.6264e-01,
          -7.
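
The cell above takes `outputs[1]`, the pooled `[CLS]` vector, as the embedding for each text. A commonly used alternative is to average the per-token vectors in `last_hidden_state` while ignoring padded positions. The sketch below illustrates that idea on toy tensors. The function `masked_mean_pool` is a hypothetical helper, not part of `transformers`, and whether it yields better embeddings for these texts is not tested here.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    # (batch, seq) -> (batch, seq, 1) so the mask broadcasts over the hidden dim.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)         # Guard against division by zero.
    return summed / counts


# Toy stand-ins for outputs.last_hidden_state and attention_masks.
toy_hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
                           [[5.0, 6.0], [9.0, 9.0], [9.0, 9.0]]])
toy_mask = torch.tensor([[1, 1, 0], [1, 0, 0]])
pooled = masked_mean_pool(toy_hidden, toy_mask)
print(pooled)
```

With the real model this would be called as `masked_mean_pool(outputs.last_hidden_state, attention_masks)`, and the result could be passed to UMAP in place of the pooler output.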

In [74]:
from umap import UMAP

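# Project the sentence embeddings to 2D for plotting.
# n_neighbors must stay below the number of samples (10 here).
reducer = UMAP(n_neighbors=5, n_components=2, metric='cosine', random_state=42)
new_values = reducer.fit_transform(hidden_states)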
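
The `metric='cosine'` setting makes UMAP compare embeddings by direction rather than by magnitude. As a rough sketch, assuming the usual definition of cosine distance as one minus cosine similarity, the quantity can be written in plain Python as below. `cosine_distance` is a hypothetical helper for illustration, not UMAP's internal implementation.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # Same direction.
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # Orthogonal.
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # Opposite direction.
```

Two vectors pointing the same way have distance 0 regardless of length, which is why cosine is a common choice for comparing text embeddings.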

In [75]:
import plotly.express as px

fig = px.scatter(new_values, x=0, y=1, opacity=1, hover_name=sentences)
fig.show()

In [82]:
sentences = "We investigate the performance of QuantiScale, a new multi-touch interaction technique for the quantification of distances in medical images and discuss the benefits and prospects of redesigning interactions with multi-touch devices. Taking advantage of the multi-touch capabilities, QuantiScale behaves like a tape measure, but automatically adjusts the view onto the measured object to improve precision and speed. The technique has been studied in a real-world scenario measuring the diameter of structures for the diagnostic reading of medical images and provides hints for the replacement of traditional mouse-based interaction with gestural interaction. Results of the quantitative evaluation indicate a high measurement precision particularly for small objects. Participants experienced QuantiScale as being more fun, natural, and intuitive in comparison to mouse-based interaction even though the subjective preference for speed and precision was still in favor of the mouse."
encoded_inputs = tokenizer(sentences, padding=True)
encoded_inputs

{'input_ids': [2, 1802, 3357, 1680, 3131, 1685, 10651, 11853, 2666, 15, 42, 2333, 2556, 16, 15058, 2848, 3276, 1725, 1680, 7844, 1685, 10599, 1682, 3045, 4542, 1690, 2861, 1680, 5631, 1690, 20054, 1685, 25820, 1821, 1700, 3763, 1715, 2556, 16, 15058, 5991, 17, 6377, 6883, 1685, 1680, 2556, 16, 15058, 12362, 15, 10651, 11853, 2666, 27263, 3223, 42, 20092, 2674, 15, 2027, 14982, 6031, 1026, 1680, 4614, 8116, 1680, 2786, 8465, 1701, 2814, 8169, 1690, 6448, 17, 1680, 3276, 2029, 2030, 2801, 1682, 42, 4005, 16, 4517, 12816, 6159, 1680, 5177, 1685, 3818, 1725, 1680, 3559, 7895, 1685, 3045, 4542, 1690, 3853, 24582, 1727, 1725, 1680, 5941, 1685, 5373, 3877, 16, 2234, 2848, 1715, 5263, 2404, 2848, 17, 1890, 1685, 1680, 4176, 3517, 3275, 42, 1877, 4423, 8169, 4124, 1725, 2718, 10054, 17, 3289, 5452, 10651, 11853, 2666, 1732, 3203, 2051, 18915, 15, 4132, 15, 1690, 24782, 1682, 3801, 1701, 3877, 16, 2234, 2848, 3537, 4522, 1680, 7672, 8520, 1725, 6448, 1690, 8169, 1734, 3941, 1682, 6475, 1685, 168

In [83]:
tokenizer.decode(encoded_inputs["input_ids"])

'[CLS] we investigate the performance of quantiscale, a new multi - touch interaction technique for the quantification of distances in medical images and discuss the benefits and prospects of redesigning interactions with multi - touch devices. taking advantage of the multi - touch capabilities, quantiscale behaves like a tape measure, but automatically adjusts the view onto the measured object to improve precision and speed. the technique has been studied in a real - world scenario measuring the diameter of structures for the diagnostic reading of medical images and provides hints for the replacement of traditional mouse - based interaction with gestural interaction. results of the quantitative evaluation indicate a high measurement precision particularly for small objects. participants experienced quantiscale as being more fun, natural, and intuitive in comparison to mouse - based interaction even though the subjective preference for speed and precision was still in favor of the mous