# Document Embedding

Here, we call the GPT-3 Embedding API to generate text-similarity embeddings from the input text. Since the API only accepts inputs with fewer than 2048 tokens, we use heuristics to remove abstracts that are likely to exceed this limit.

**:warning: As of January 2022, the GPT-3 Embedding API is NOT free anymore. You will probably need to pay to run this notebook. Cached results from 2021 are available in the `models/gpt3/` folder.**

The OpenAI API results contain one row per document. Documents with a missing PMID or abstract, or with a very long abstract, are discarded: calling the API is slow, and GPT-3 only accepts documents of up to 2047 tokens.
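
The filtering rule described above can be sketched on a toy list of records. This is only an illustration: the `MAX_CHARS` cutoff mirrors the 3000-character heuristic used later in this notebook, the record layout and the helper name `keep_document` are hypothetical, and character length is only a rough proxy for the token count.

```python
# Hypothetical helper; the notebook itself filters with pandas queries.
MAX_CHARS = 3000  # character-length proxy for the 2047-token limit

def keep_document(doc: dict) -> bool:
    """Return True if the record has a PMID and a sufficiently short abstract."""
    pmid, abstract = doc.get('pmid'), doc.get('abstract')
    if pmid is None or abstract is None:
        return False
    return len(abstract) < MAX_CHARS

docs = [
    {'pmid': 1, 'abstract': 'A short abstract.'},
    {'pmid': 2, 'abstract': None},            # missing abstract
    {'pmid': None, 'abstract': 'No PMID.'},   # missing PMID
    {'pmid': 3, 'abstract': 'x' * 5000},      # too long
]

kept = [d['pmid'] for d in docs if keep_document(d)]
print(kept)
```

A character cutoff is cheap to compute but conservative, so some abstracts that would fit within the token limit may be dropped as well.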

## Input
- `data/pubmed/abstracts.csv.gz` contains the raw, unpreprocessed texts collected from PubMed. To speed things up, duplicate documents are queried only once. Documents with identical PMIDs are considered duplicates.

## Outputs

- `models/gpt3/abstracts_gpt3ada.nc`, in NetCDF4 format, contains the PMIDs and corresponding embedding weights; one row per document. Use XArray to open this dataset file.
    - We also cache the incremental outputs that were used to build up the NetCDF4 dataset in the following paths:
        - `models/gpt3/abstracts_gpt3ada.npz` for the embedding weights; one row per document.
        - `models/gpt3/abstracts_gpt3ada_pmids.csv` contains the PMIDs for the rows of the above matrix. This can be used to connect the weights to the actual PubMed datasets.
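
Because row `i` of the embedding matrix belongs to row `i` of the PMID file, a plain position lookup is enough to fetch the vector of a given document. The following is a minimal sketch with made-up data: the toy arrays stand in for the two cached files, and the helper `embedding_for` is hypothetical.

```python
import numpy as np

# Toy stand-ins for the cached files (values are made up):
# `embeddings` plays the role of the .npz matrix, `pmids` the CSV column.
embeddings = np.arange(12, dtype=float).reshape(3, 4)  # 3 documents, 4 dims
pmids = [11111, 22222, 33333]

# Map each PMID to its row position in the matrix.
row_of = {pmid: row for row, pmid in enumerate(pmids)}

def embedding_for(pmid: int) -> np.ndarray:
    """Return the embedding row that belongs to `pmid`."""
    return embeddings[row_of[pmid]]

print(embedding_for(22222))
```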

## Requirements

```bash
# create and activate the `cogtext` environment if you haven't already 
# mamba create -n cogtext
# mamba activate cogtext

mamba install pandas scikit-learn tqdm ipykernel
mamba install xarray "dask[dataframe]" netCDF4 bottleneck
mamba install tensorflow tensorflow-probability
mamba install openai
```

In [2]:
# Setup and imports

import numpy as np
import pandas as pd
import xarray as xr

from tqdm import tqdm
from pathlib import Path
import openai
from python.cogtext.datasets.pubmed import PubMedDataLoader
import re

from IPython.display import display

In [32]:
GPT3_MODEL_ID = 'ada'  # 1024-dim embeddings
DATA_DIR = Path('../cogtext_data/')
OUTPUT_PATH = DATA_DIR / 'gpt3' / f'abstracts_gpt3{GPT3_MODEL_ID}.npz'

Prepare and cleanup the input data:

In [20]:
# load and prep pubmed document
pubmed = PubMedDataLoader(root_dir=DATA_DIR / 'pubmed',
                          preprocessed=False,
                          drop_low_occurred_labels=True).load()
pubmed = pubmed.query('pmid.notna() and abstract.notna() and title.notna()')
pubmed['abstract'] = pubmed['abstract'].apply(lambda x: x.replace('\n', ' '))
pubmed = pubmed.drop_duplicates(subset=['pmid'])

In [28]:
#### REMOVE VERY LONG ABSTRACTS; GPT-3 is limited to 2047 tokens per document

# 1. drop the 10 longest abstracts (a very long document prevented GPT-3 from encoding the others)
very_long_docs = pubmed['abstract'].str.len().sort_values().iloc[:-11:-1]
pubmed = pubmed.drop(index=very_long_docs.index)

# 2. heuristic: keep only abstracts shorter than 3000 characters to avoid GPT-3 errors
pubmed = pubmed.query('abstract.str.len() < 3000')

# 2alt. or a slower RegEx approach
# abstract_tokens = pubmed['abstract'].apply(lambda x: len(re.split(r'\W+|\s+', x)))
# pubmed = pubmed[abstract_tokens < 2000]

pubmed[['pmid']].to_csv(DATA_DIR / 'gpt3' / f'abstracts_gpt3{GPT3_MODEL_ID}_pmids.csv')

print(f'* {pubmed.shape[0]} abstracts (pmids in abstracts_gpt3{GPT3_MODEL_ID}_pmids.csv)')

* 382825 abstracts (pmids in abstracts_gpt3ada_pmids.csv)


In [9]:
n_available_embeddings = 0

if OUTPUT_PATH.exists():
  n_available_embeddings = np.load(OUTPUT_PATH)['arr_0'].shape[0]

print(f'* {n_available_embeddings} documents are already embedded.')

* 382855 documents are already embedded.
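
The count printed above serves as the starting offset of the embedding loop, so an interrupted run continues with the first document that has not been embedded yet. A small sketch of that arithmetic follows; `batch_starts` is a hypothetical helper, not part of the notebook, and it assumes the cached matrix covers a prefix of the documents in the same order.

```python
def batch_starts(n_done: int, n_total: int, batch_size: int) -> list[int]:
    """Start offsets of the batches that still have to be embedded."""
    return list(range(n_done, n_total, batch_size))

# 250 documents, 100 already embedded, batches of 100:
print(batch_starts(100, 250, 100))   # [100, 200]

# Nothing left to do when all documents are already embedded:
print(batch_starts(250, 250, 100))   # []
```

An empty list corresponds to a progress bar that reports `0batch`: the loop has nothing left to embed.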


In [12]:
gpt3_embeddings_dims = {
  'ada': 1024,
  'babbage': 2048,
  'curie': 4096,
  'davinci': 12288
}

batch_size = 100

model = openai.Engine(id=f'{GPT3_MODEL_ID}-similarity')

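# optional retry with backoff; requires
# `from tenacity import retry, stop_after_attempt, wait_random_exponential`
# @retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def gpt3_embed(texts: list[str]):
  """Embed a batch of texts; returns an array with one row per text."""
  try:
    Z = model.embeddings(input=texts)
    Z = [z['embedding'] for z in Z['data']]
    Z = np.array(Z)
  except Exception as e:
    # NOTE: a failed batch is stored as all-zero rows
    print('GPT-3 failed! Filling the batch with zeros.', e)
    Z_dim = gpt3_embeddings_dims[GPT3_MODEL_ID]
    Z = np.zeros((len(texts), Z_dim))
  return Z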

for i in tqdm(range(n_available_embeddings, len(pubmed), batch_size), unit='batch'):
  batch = pubmed[i:i+batch_size]['abstract'].tolist()
  batch_embeddings = gpt3_embed(batch)
  
  # cache
  np.savez(
    f'tmp/gpt3/abstracts_gpt3{GPT3_MODEL_ID}_b{(i // batch_size + 1):05d}.npz',
    batch_embeddings)

print('Done!')

0batch [00:00, ?batch/s]

Done!
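
The loop writes one `.npz` file per batch, while the conversion step reads a single matrix. The notebook does not show how the batch files are combined; one plausible way, sketched here under the assumption that the file names sort in batch order, is to concatenate them with NumPy. Since failed batches are filled with zeros, the sketch also reports all-zero rows. `merge_batches` and `zero_rows` are hypothetical helpers, and the demo writes toy files to a temporary directory.

```python
import tempfile
from pathlib import Path

import numpy as np

def merge_batches(batch_dir: Path, pattern: str = '*.npz') -> np.ndarray:
    """Stack all cached batch files in file-name order."""
    arrays = []
    for f in sorted(batch_dir.glob(pattern)):
        with np.load(f) as data:
            # np.savez stores a positional array under the key 'arr_0'
            arrays.append(data['arr_0'])
    return np.concatenate(arrays, axis=0)

def zero_rows(Z: np.ndarray) -> np.ndarray:
    """Indices of rows that contain only zeros (failed requests)."""
    return np.flatnonzero(~Z.any(axis=1))

# Demo with toy data in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    np.savez(tmp / 'toy_b00001.npz', np.ones((2, 3)))
    np.savez(tmp / 'toy_b00002.npz', np.zeros((1, 3)))  # simulated failure
    Z = merge_batches(tmp)

print(Z.shape)        # (3, 3)
print(zero_rows(Z))   # [2]
```

The zero-padded batch number in the cached file names keeps lexicographic and numeric order identical, which is what the sorting step relies on.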





For convenience, we convert the embeddings to NetCDF and store them in a single file, `models/gpt3/abstracts_gpt3ada.nc`. Later, you can load the dataset with a single call to XArray.

In [3]:
pmids = pd.read_csv(DATA_DIR / 'gpt3' / f'abstracts_gpt3{GPT3_MODEL_ID}_pmids.csv', index_col=0)

gpt3_embeddings = np.load(DATA_DIR / 'gpt3' / f'abstracts_gpt3{GPT3_MODEL_ID}.npz')['arr_0']

# create the dataset
dataset = xr.Dataset({
  'gpt3_embeddings': (['pmid', 'gpt3_embedding_dim'], gpt3_embeddings)
},
coords={
  'pmid': pmids['pmid'].values,
  'original_index': pmids.index.values
})

# documentation
dataset.coords['pmid'].attrs['description'] = 'PubMed unique identifier'
dataset['gpt3_embeddings'].attrs['description'] = 'GPT-3 embeddings of the PubMed abstracts.'
dataset.coords['original_index'].attrs['description'] = (
  'original row index of the document in the abstracts.csv.gz file.')

# store
dataset.to_netcdf(DATA_DIR / 'gpt3' / f'abstracts_gpt3{GPT3_MODEL_ID}.nc',
                  encoding={'gpt3_embeddings': {'zlib': True, 'complevel': 5}})
dataset.close()

# done!
print(f'NetCDF4 dataset stored in `cogtext_data/gpt3/abstracts_gpt3{GPT3_MODEL_ID}.nc`.')

# validation: reload the stored dataset and display it
with xr.open_dataset(DATA_DIR / 'gpt3' / f'abstracts_gpt3{GPT3_MODEL_ID}.nc') as dataset:
  dataset.load()
  display(dataset)