# Publication Embeddings

This notebook converts a PubMed XML file to a pandas DataFrame, classifies publication abstracts into groups, computes embeddings with PubMedBERT, and processes the embeddings.

## Setting up Workspace

### Set up GPUs

In [None]:
# GPU information:

gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Mon Mar 11 19:17:00 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

To use a GPU with your notebook, select the **Runtime > Change runtime type** menu, and then set the hardware accelerator dropdown to GPU.

### High RAM

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 54.8 gigabytes of available RAM

You are using a high-RAM runtime!


Users who have purchased one of Colab's paid plans have access to high-memory VMs when they are available.

You can see how much memory you have available at any time by running the code cell above. If it prints "Not using a high-RAM runtime", you can enable a high-RAM runtime via **Runtime > Change runtime type** in the menu, then select High-RAM in the Runtime shape dropdown. Afterwards, re-execute the code cell.

In [None]:
# Mount google drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
# List directory in google drive
!ln -s /content/gdrive/My\ Drive/ /mydrive
!ls /mydrive

 current				       'Meet Recordings'   stored
'Double Bind and Mimetic Theory '$'\n''.gdoc'  'My Drive'	  'Wander - Book Design Draft.pdf'


In [None]:
import os

# Change the current working directory to '/mydrive/current/Metabolomics Landscape'
os.chdir('/mydrive/current/Metabolomics Landscape')

# Verify the current working directory
print("Current Working Directory: ", os.getcwd())

Current Working Directory:  /content/gdrive/My Drive/current/Metabolomics Landscape


### Set up Libraries

In [None]:
# Installing all library dependencies with their versions.
# This could take up to 3 minutes to run.

!pip install --quiet bio==1.6.2 h5py==3.9.0 lxml==4.9.4 numpy==1.25.2 pandas==1.5.3 plotly==5.15.0 psutil==5.9.5 scikit-learn==1.2.2 torch==2.1.0 transformers==4.38.1 WordCloud==1.9.3 umap-learn==0.5.5 kaleido==0.2.1



In [None]:
# If the torch version in the output above differs from the one printed by this cell,
# the version printed here is the correct one.
import torch
import sklearn
print(sklearn.__version__)
print(torch.__version__)

1.2.2
2.1.0+cu121


## Download PubMed Dataset

In [None]:
from Bio import Entrez
#import xml.etree.ElementTree as ET
from lxml import etree as ET
import pandas as pd

### Download Abstract in Notebook

Download the first 3,000 entries.
Please note that this section of the code was used solely for testing purposes.

In [None]:
# Set the email address and API key to use for the Entrez API
Entrez.email = ' '
Entrez.api_key = ' '  # Replace with your actual API key

# Search PubMed for papers with the desired terms, retrieving only the first 3000
handle = Entrez.esearch(db='pubmed',
                        term='metabolomics OR metabonomics',
                        retmax=3000, api_key=Entrez.api_key)
record = Entrez.read(handle)

# Retrieve the XML records for the papers with the retrieved PubMed IDs
id_list = record['IdList']
handle = Entrez.efetch(db='pubmed', id=id_list, retmode='xml', api_key=Entrez.api_key)
records = ET.fromstring(handle.read())

# Initialize a list to hold all article data
articles_data = []

# Extract the desired information from the XML records
for record in records.findall('.//PubmedArticle'):
    article_data = {}

    # Extract the PubMed ID
    article_data['pmid'] = record.find('.//PMID').text

    # Extract the title
    article_data['title'] = record.find('.//ArticleTitle').text

    # Extract the abstract
    abstract_text_element = record.find('.//AbstractText')
    article_data['abstract'] = abstract_text_element.text if abstract_text_element is not None else None

    # Extract the language
    article_data['language'] = record.find('.//Language').text

    # Extract the journal title
    article_data['journal_title'] = record.find('.//Title').text

    # Extract the ISSN
    #issn_element = record.find('.//ISSN')
    #article_data['issn'] = issn_element.text if issn_element is not None else None

    # Extract the publication date
    #article_data['pub_date'] = record.find('.//PubDate').text

    year_element = record.find('.//PubDate/Year')
    if year_element is not None:
        pub_year = year_element.text
    else:
        pub_year = None

    article_data['pub_year'] = pub_year

    # Extract the author names
    authors = []
    for author in record.findall('.//Author'):
        last_name = author.find('.//LastName')
        initials = author.find('.//Initials')
        if last_name is not None and initials is not None:
            authors.append(f'{last_name.text} {initials.text}')
    article_data['authors'] = ', '.join(authors)

    # Add the article data to the list
    articles_data.append(article_data)

# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(articles_data)

In [None]:
df

Unnamed: 0,pmid,title,abstract,language,journal_title,pub_year,authors
0,38427197,High-Throughput Lipidomic and Metabolomic Prof...,Recent research has revealed the potential of ...,eng,"Methods in molecular biology (Clifton, N.J.)",2024,"Thompson BM, Astarita G"
1,38427189,Mass Spectrometry-Based Metabolomics Multi-pla...,The integration of complementary analytical pl...,eng,"Methods in molecular biology (Clifton, N.J.)",2024,"González-Domínguez Á, Sayago A, Fernández-Reca..."
2,38427142,An exploratory investigation of the CSF metabo...,Because cerebrospinal fluid (CSF) samples are...,eng,Metabolomics : Official journal of the Metabol...,2024,"Thirion A, Loots DT, Williams ME, Solomons R, ..."
3,38427120,Research progress on the multi-omics and survi...,"In the dynamic process of metastasis, circulat...",eng,Clinical and experimental medicine,2024,"Xie Q, Liu S, Zhang S, Liao L, Xiao Z, Wang S,..."
4,38427076,Metabolic effect of adrenaline infusion in peo...,As a result of early loss of the glucagon resp...,eng,Diabetologia,2024,"She R, Suvitaival T, Andersen HU, Hommel E, Nø..."
...,...,...,...,...,...,...,...
2994,38068940,Nutrient Solution Flowing Environment Affects ...,The principal difference between hydroponics a...,eng,International journal of molecular sciences,2023,"Baiyin B, Xiang Y, Hu J, Tagawa K, Son JE, Yam..."
2995,38068926,Human Serum and Salivary Metabolomes: Diversit...,"Saliva, which contains molecular information t...",eng,International journal of molecular sciences,2023,"Ferrari E, Gallo M, Spisni A, Antonelli R, Mel..."
2996,38068898,Tumor Necrosis Factor-Alpha Induces Proangioge...,"Ischemic heart disease and its complications, ...",eng,International journal of molecular sciences,2023,"Dergilev K, Zubkova E, Guseva A, Tsokolaeva Z,..."
2997,38068889,Transcriptome and Metabolome Analyses Reveal T...,Cucumber green mottle mosaic virus (CGMMV) is ...,eng,International journal of molecular sciences,2023,"Li Z, Tang Y, Lan G, Yu L, Ding S, She X, He Z"


In [None]:
df['journal_title'];

In [None]:
# Assuming 'df' is your DataFrame and 'index_label' is the index of the row
# abstract = df.loc[98, 'abstract']
# print(abstract)


In [None]:
#pub_date_element = record.find('.//PubDate')
#print(ET.tostring(pub_date_element))

### Convert XML file to Pandas Dataframe

Convert the XML file that was downloaded from PubMed to a pandas DataFrame.

In [None]:
# Load your XML file
tree = ET.parse(" ")  # Replace with your XML file path
records = tree.getroot()

In [None]:
articles_data = []  # Initialize an empty list to store the articles data

# Extract the desired information from the XML records
for record in records.findall('.//PubmedArticle'):
    article_data = {}

    # Extract the PubMed ID
    article_data['pmid'] = record.find('.//PMID').text

    # Extract the title
    article_data['title'] = record.find('.//ArticleTitle').text

    # Extract the abstract
    abstract_text_element = record.find('.//AbstractText')
    article_data['abstract'] = abstract_text_element.text if abstract_text_element is not None else None

    # Extract the language
    article_data['language'] = record.find('.//Language').text

    # Extract the journal title
    article_data['journal_title'] = record.find('.//Journal/Title').text  # Updated path

    # Extract the publication year
    year_element = record.find('.//PubDate/Year')
    article_data['pub_year'] = year_element.text if year_element is not None else None

    # Extract the author names
    authors = []
    for author in record.findall('.//Author'):
        last_name = author.find('.//LastName')
        initials = author.find('.//Initials')
        if last_name is not None and initials is not None:
            authors.append(f'{last_name.text} {initials.text}')
    article_data['authors'] = ', '.join(authors)

    # Add the article data to the list
    articles_data.append(article_data)
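
The loop above reads only the first `AbstractText` element and uses `.text`, which stops at the first nested tag. The following is a minimal sketch, not the notebook's method, of one way to join all abstract sections, keep text inside inline markup, and fall back to `MedlineDate` when `PubDate` has no `Year` child. The sample record and the helper names `element_text` and `parse_article` are hypothetical, and the standard-library `xml.etree.ElementTree` is used here in place of `lxml` so the sketch runs without extra dependencies.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree in this sketch

# Hypothetical, hand-written record for illustration; not taken from the dataset.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <Journal>
          <JournalIssue><PubDate><MedlineDate>1998 Dec-1999 Jan</MedlineDate></PubDate></JournalIssue>
          <Title>Example journal</Title>
        </Journal>
        <ArticleTitle>Profiling of <i>E. coli</i> metabolites</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second part.</AbstractText>
        </Abstract>
        <Language>eng</Language>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

def element_text(element):
    """Return all text inside an element, including text in nested tags, or None."""
    if element is None:
        return None
    text = "".join(element.itertext()).strip()
    return text or None

def parse_article(record):
    """Extract a few fields from one PubmedArticle element into a dict."""
    # Join every AbstractText section instead of keeping only the first one.
    sections = [element_text(el) for el in record.findall(".//AbstractText")]
    sections = [s for s in sections if s]
    # Fall back to the leading year of MedlineDate when PubDate has no Year child.
    year = element_text(record.find(".//PubDate/Year"))
    if year is None:
        medline_date = element_text(record.find(".//PubDate/MedlineDate"))
        if medline_date and medline_date[:4].isdigit():
            year = medline_date[:4]
    return {
        "pmid": element_text(record.find(".//PMID")),
        "title": element_text(record.find(".//ArticleTitle")),
        "abstract": " ".join(sections) if sections else None,
        "journal_title": element_text(record.find(".//Journal/Title")),
        "pub_year": year,
    }

root = ET.fromstring(SAMPLE_XML.strip())
parsed = [parse_article(r) for r in root.findall(".//PubmedArticle")]
print(parsed[0])
```

Because every lookup goes through `element_text`, a missing element yields `None` rather than raising an `AttributeError`.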

In [None]:
# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(articles_data)

In [None]:
df.head()

Unnamed: 0,pmid,title,abstract,language,journal_title,pub_year,authors,predicted_category
0,9748443,Effect of slow growth on metabolism of Escheri...,Escherichia coli growing on glucose in minimal...,eng,Journal of bacteriology,1998.0,"Tweeddale H, Notley-McRobb L, Ferenci T",Microbiology
1,10675895,On the optimization of classes for the assignm...,"At present, the assignment of function to nove...",eng,Trends in biotechnology,2000.0,"Kell DB, King RD",unlabeled
2,10731098,Assessing the effect of reactive oxygen specie...,A two-dimensional thin-layer chromatographic a...,eng,Redox report : communications in free radical ...,1999.0,"Tweeddale H, Notley-McRobb L, Ferenci T",unlabeled
3,10797602,Current awareness on comparative and functiona...,In order to keep subscribers up-to-date with t...,eng,"Yeast (Chichester, England)",2000.0,,Microbiology
4,10894722,Global adaptations resulting from high populat...,The scope of population density effects was in...,eng,Journal of bacteriology,2000.0,"Liu X, Ng C, Ferenci T",Microbiology


In [None]:
df.shape

(82540, 7)

## Classification of Publication


### `bart-large-mnli` for zero-shot classification

**We decided against `bart-large-mnli`.** It was too slow and not accurate.

In [None]:
from transformers import pipeline
import pandas as pd

# Load the text classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Define the candidate labels
candidate_labels = [
    "Oncology", "Plant Biology", "Endocrinology", "Nephrology", "Microbiology",
    "Analytical Chemistry", "Pharmacology", "Neuroscience", "Nutrition and Food Science",
    "Environmental Science", "Toxicology", "Animal Science", "Sports Medicine",
    "Epidemiology", "Developmental Biology", "Gerontology", "Immunology"
]

# Define a function to classify the text and add the output to the DataFrame
def classify_and_add_output(text):
    result = classifier(text, candidate_labels)
    return result['labels'][0]

# Apply the function to the text column and add the output to a new column
df['predicted_category'] = df['abstract'].apply(lambda x: classify_and_add_output(x) if pd.notnull(x) else None)

### Using journal title match

The journal in which each paper was published was used to infer the label for that publication.

**Creating new category labels**

In [None]:
# Define candidate labels
candidate_labels = [
    "Oncology + Cancers + Cancer + Oncogene + Anticancer + Oncotarget + Oncoimmunology + Carcinogenesis + Metastasis + Tumori + Tumor",
    "Plant + Botany + Planta + Phytopathology + Horticulture",
    "Kidney + Nephrology + Nephron + Dialysis",
    "Endocrinology + Endocrine + Hormone + Endocrinological",
    "Microbiology + Bacteriology + Leeuwenhoek + Yeast + mBio + mSphere + Microbiome + Microbes + MicrobiologyOpen + mSystems + Microorganisms",
    "Analytical Chemistry + Chromatography + Mass Spectrometry + Analyst + Analytica + Bioanalysis + Separation + Spectroscopy",
    "Pharmacology + Pharmaceutical + Pharmacological + Pharmacologica + Pharmacogenomics + Pharmacogenetics + Drug + Drugs + Ethnopharmacology + Medicinal + Natural + Pharmaceutics + Phytopharmacology + Pharmacognosy",
    "Neuroscience + Neurochemistry + Brain + Neuroinflammation + Neurology + Neurochemical + Neuroimmunology + Cerebral + Neuroimage + Neurotrauma + Neurological + Neuro-oncology + Neurodegeneration + Neuropsychiatric + Neuropsychiatry + Neuroendocrinology + Headache",
    "Nutrition + Food + Foods + Nutritional + Nutrients + Dairy + Foodborne",
    "Toxicology + Toxicological",
    "Environmental + Environment + Hazardous + Pollution",
    "Animal + Animals + Poultry + Veterinary + Livestock + Ruminant + Theriogenology + Zoology",
    "Sports + Sport + Exercise + Knee + Arthroscopy + Athletic",
    "Epidemiology + Infectious + Public",
    "Developmental + Development",
    "Gerontology + Ageing + Geriatrics + Aging + Geroscience",
    "Immunology + Immunity + Leukocyte + Autoimmunity + Immunobiology + Immunotargets + Immunotherapy + Vaccines",
    "Bioinformatics + Chemometrics + Cheminformatics + Computational",
    "Genetics + Genomics + Genome"
]

# The label mapping
label_mapping = {
    "Oncology + Cancers + Cancer + Oncogene + Anticancer + Oncotarget + Oncoimmunology + Carcinogenesis + Metastasis + Tumori + Tumor": "Cancer Research",
    "Plant + Botany + Planta + Phytopathology + Horticulture": "Plant Biology",
    "Kidney + Nephrology + Nephron + Dialysis": "Nephrology",
    "Endocrinology + Endocrine + Hormone + Endocrinological": "Endocrinology",
    "Microbiology + Bacteriology + Leeuwenhoek + Yeast + mBio + mSphere + Microbiome + Microbes + MicrobiologyOpen + mSystems + Microorganisms": "Microbiology",
    "Analytical Chemistry + Chromatography + Mass Spectrometry + Analyst + Analytica + Bioanalysis + Separation + Spectroscopy": "Analytical Chemistry",
    "Pharmacology + Pharmaceutical + Pharmacological + Pharmacologica + Pharmacogenomics + Pharmacogenetics + Drug + Drugs + Ethnopharmacology + Medicinal + Natural + Pharmaceutics + Phytopharmacology + Pharmacognosy": "Pharmacology",
    "Neuroscience + Neurochemistry + Brain + Neuroinflammation + Neurology + Neurochemical + Neuroimmunology + Cerebral + Neuroimage + Neurotrauma + Neurological + Neuro-oncology + Neurodegeneration + Neuropsychiatric + Neuropsychiatry + Neuroendocrinology + Headache": "Neuroscience",
    "Nutrition + Food + Foods + Nutritional + Nutrients + Dairy + Foodborne": "Food Science & Nutrition",
    "Toxicology + Toxicological": "Toxicology",
    "Environmental + Environment + Hazardous + Pollution": "Environmental Science",
    "Animal + Animals + Poultry + Veterinary + Livestock + Ruminant + Theriogenology + Zoology": "Animal Science",
    "Sports + Sport + Exercise + Knee + Arthroscopy + Athletic": "Sports Science & Medicine",
    "Epidemiology + Infectious + Public": "Epidemiology & Public Health",
    "Developmental + Development": "Developmental Biology",
    "Gerontology + Ageing + Geriatrics + Aging + Geroscience": "Aging & Gerontology",
    "Immunology + Immunity + Leukocyte + Autoimmunity + Immunobiology + Immunotargets + Immunotherapy + Vaccines": "Immunology & Vaccine Research",
    "Bioinformatics + Chemometrics + Cheminformatics + Computational": "Computational Biology",
    "Genetics + Genomics + Genome": "Genetics & Genomics"
}


In [None]:
# Function to find a match between journal title and candidate labels
def find_category_match(journal_title, candidate_labels, label_mapping):
    # Rows with a missing journal title cannot be matched
    if not isinstance(journal_title, str):
        return 'unlabeled'
    journal_title_lower = journal_title.lower()
    for label in candidate_labels:
        # Split compound labels into individual lowercase words; multi-word
        # keywords such as "Mass Spectrometry" become separate words here
        keywords = label.replace(' + ', ' ').lower().split()
        # Check if any word occurs as a substring of the journal title
        if any(keyword in journal_title_lower for keyword in keywords):
            # Return the corresponding category from label_mapping
            return label_mapping[label]
    return 'unlabeled'
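
The function above matches keywords as substrings, so a word like "plant" also matches "Transplantation". The following is a minimal sketch of an alternative that matches whole words and whole phrases with regular expressions. It is not the method used to produce the counts in this notebook, and applying it would change those counts. The dictionary `keyword_groups` is a shortened, illustrative subset of the labels, and `compile_patterns` and `match_category` are hypothetical helper names.

```python
import re

# Shortened, illustrative subset of the keyword groups; not the full label list.
keyword_groups = {
    "Plant Biology": ["Plant", "Botany"],
    "Analytical Chemistry": ["Analytical Chemistry", "Mass Spectrometry", "Chromatography"],
    "Epidemiology & Public Health": ["Epidemiology", "Public"],
}

def compile_patterns(groups):
    """Build one case-insensitive, word-bounded regex per category."""
    patterns = {}
    for category, keywords in groups.items():
        alternatives = "|".join(re.escape(k) for k in keywords)
        patterns[category] = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return patterns

def match_category(journal_title, patterns):
    """Return the first category whose keyword occurs as a whole word or phrase."""
    if not isinstance(journal_title, str):
        return "unlabeled"
    for category, pattern in patterns.items():
        if pattern.search(journal_title):
            return category
    return "unlabeled"

patterns = compile_patterns(keyword_groups)
for title in ["Transplantation", "Journal of mass spectrometry", "Plant physiology"]:
    print(title, "->", match_category(title, patterns))
```

As in the original function, categories are tried in insertion order, so the order of `keyword_groups` still decides ties between categories.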

In [None]:
# Applying the function to each row in the DataFrame to create the 'predicted_category' column
df['predicted_category'] = df['journal_title'].apply(lambda title: find_category_match(title, candidate_labels, label_mapping))

In [None]:
# Dropping rows where 'title', 'abstract' or 'predicted_category' is missing
df.dropna(subset=['title', 'abstract', 'predicted_category'], inplace=True)

In [None]:
df.shape

(80656, 8)

In [None]:
df['predicted_category'].value_counts()

unlabeled                        41721
Analytical Chemistry              8981
Plant Biology                     4648
Pharmacology                      4229
Food Science & Nutrition          4053
Microbiology                      3605
Cancer Research                   2657
Environmental Science             1945
Genetics & Genomics               1312
Toxicology                        1175
Neuroscience                       980
Endocrinology                      950
Computational Biology              943
Immunology & Vaccine Research      924
Animal Science                     876
Epidemiology & Public Health       547
Aging & Gerontology                448
Developmental Biology              309
Nephrology                         277
Sports Science & Medicine           76
Name: predicted_category, dtype: int64

In [None]:
df.head(5)

Unnamed: 0,pmid,title,abstract,language,journal_title,pub_year,authors,predicted_category
0,9748443,Effect of slow growth on metabolism of Escheri...,Escherichia coli growing on glucose in minimal...,eng,Journal of bacteriology,1998.0,"Tweeddale H, Notley-McRobb L, Ferenci T",Microbiology
1,10675895,On the optimization of classes for the assignm...,"At present, the assignment of function to nove...",eng,Trends in biotechnology,2000.0,"Kell DB, King RD",unlabeled
2,10731098,Assessing the effect of reactive oxygen specie...,A two-dimensional thin-layer chromatographic a...,eng,Redox report : communications in free radical ...,1999.0,"Tweeddale H, Notley-McRobb L, Ferenci T",unlabeled
3,10797602,Current awareness on comparative and functiona...,In order to keep subscribers up-to-date with t...,eng,"Yeast (Chichester, England)",2000.0,,Microbiology
4,10894722,Global adaptations resulting from high populat...,The scope of population density effects was in...,eng,Journal of bacteriology,2000.0,"Liu X, Ng C, Ferenci T",Microbiology


## Compute embeddings with PubMedBERT


**Load the PubMedBERT model and tokenizer.**

In [None]:
import numpy as np
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")
model = AutoModel.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.




**Tokenize the abstracts and generate embeddings.**

***Use GPUs if available***

In [None]:
import torch

# Check if CUDA (GPU support) is available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


The `get_embedding` function tokenizes the text (truncating it to at most 512 tokens), feeds it into the model, and returns the mean of the last hidden states as the embedding. The pandas `apply` method then applies this function to each abstract in the DataFrame.

Note that this operation can be very slow for a large number of abstracts, because it processes them one at a time. For large datasets, you would typically batch the operation and run it on a machine with a GPU to speed up the computation.

In [None]:
# Assuming 'model' and 'tokenizer' are already defined
# Move the model to the specified device (GPU or CPU)
model.to(device)

def get_embedding(text, tokenizer, model):
    # Encode the text inputs for the model
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

    # Move the inputs to the same device as the model
    inputs = inputs.to(device)

    # Perform the forward pass without tracking gradients (inference only)
    with torch.no_grad():
        outputs = model(**inputs)

    # Compute the mean of the last hidden state and move it back to CPU as a NumPy array
    embeddings = outputs.last_hidden_state.mean(dim=1).cpu().numpy()

    return embeddings
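
If the abstracts are batched as suggested above, shorter texts are padded to the length of the longest one in the batch, and a plain mean over the token axis would then include the padded positions. The following is a minimal sketch of mask-aware mean pooling. It uses NumPy with toy arrays so it runs without downloading the model, and `masked_mean` is a hypothetical helper, not part of the notebook. With the real model, the tokenizer's attention mask would play the role of `mask`.

```python
import numpy as np

def masked_mean(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring positions where the mask is 0.

    hidden_states: array of shape (batch, tokens, dim)
    attention_mask: array of shape (batch, tokens), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

# Toy batch: two "abstracts" with 3 and 1 real tokens, padded to 3 tokens, dim 2.
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 2.0], [9.0, 9.0], [9.0, 9.0]],  # last two rows stand in for padding output
])
mask = np.array([[1, 1, 1], [1, 0, 0]])

pooled = masked_mean(hidden, mask)
print(pooled)             # padded positions ignored
print(hidden.mean(axis=1))  # plain mean, skewed by padding in the second row
```

For a single unpadded text, as in `get_embedding`, the two means coincide, so the difference only matters once batching is introduced.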

In [None]:
# Assuming 'df' is your DataFrame and it has a column 'abstract'
# Apply the function to each abstract, using GPU acceleration where available
df['full_embeddings'] = df['abstract'].apply(lambda x: get_embedding(x, tokenizer, model) if pd.notnull(x) else None)

In [None]:
# reset index
df.reset_index(drop=True, inplace=True)

In [None]:
df.head()

Unnamed: 0,pmid,title,abstract,language,journal_title,pub_year,authors,predicted_category,full_embeddings
0,9748443,Effect of slow growth on metabolism of Escheri...,Escherichia coli growing on glucose in minimal...,eng,Journal of bacteriology,1998.0,"Tweeddale H, Notley-McRobb L, Ferenci T",Microbiology,"[[0.04921199, 0.1013429, 0.009529841, -0.08067..."
1,10675895,On the optimization of classes for the assignm...,"At present, the assignment of function to nove...",eng,Trends in biotechnology,2000.0,"Kell DB, King RD",unlabeled,"[[0.074717656, 0.12005615, 0.023376802, 0.0167..."
2,10731098,Assessing the effect of reactive oxygen specie...,A two-dimensional thin-layer chromatographic a...,eng,Redox report : communications in free radical ...,1999.0,"Tweeddale H, Notley-McRobb L, Ferenci T",unlabeled,"[[-0.009071778, 0.013007838, -0.0069063944, -0..."
3,10797602,Current awareness on comparative and functiona...,In order to keep subscribers up-to-date with t...,eng,"Yeast (Chichester, England)",2000.0,,Microbiology,"[[-0.07259704, 0.09568493, -0.023760073, -0.06..."
4,10894722,Global adaptations resulting from high populat...,The scope of population density effects was in...,eng,Journal of bacteriology,2000.0,"Liu X, Ng C, Ferenci T",Microbiology,"[[0.05038802, 0.1111884, -0.044020668, -0.1404..."


In [None]:
print('The shape of each full embedding is', df['full_embeddings'][0].shape)

The shape of each full embedding is (1, 768)


**Save the embeddings and the complete table for later use**

In [None]:
import h5py
df.to_hdf('embeddings_full_01MAR2024.h5', key='embeddings', mode='w')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] [items->Index(['title', 'abstract', 'language', 'journal_title', 'authors',
       'predicted_category', 'full_embeddings'],
      dtype='object')]

  df.to_hdf('embeddings_full_01MAR2024.h5', key='embeddings', mode='w')
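
The warning above arises because the object columns, including the per-row embedding arrays, are pickled. One possible way around this for the embeddings, sketched here under assumptions, is to stack them into a single numeric matrix and store that separately from the table. The random arrays stand in for `df['full_embeddings']`, the file name is hypothetical, and NumPy's `.npy` format is used only to keep the sketch dependency-free.

```python
import os
import tempfile

import numpy as np

# Stand-ins for df['full_embeddings']: each entry has shape (1, 768), as printed above.
rng = np.random.default_rng(0)
embeddings = [rng.normal(size=(1, 768)).astype(np.float32) for _ in range(4)]

# Stack the per-row arrays into one (n_rows, 768) float matrix.
matrix = np.vstack(embeddings)

# Save and reload the matrix as a plain numeric array (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "embeddings_matrix.npy")
np.save(path, matrix)
restored = np.load(path)

print(matrix.shape, restored.dtype)
```

Row `i` of the matrix corresponds to row `i` of the DataFrame only if the index has been reset first, as is done earlier in this notebook.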
