# Paper Embedding Generation Pipeline

This notebook generates dense vector embeddings for academic papers using the SPECTER2 model. The embeddings capture semantic similarity between papers based on their titles and abstracts, enabling downstream tasks and analysis.

**Model:** [SPECTER2](https://huggingface.co/allenai/specter2) - A transformer-based model from Allen AI, specifically trained on scientific documents for document-level representation learning.

**Input:**
- `S2_papers_cleaned.db`: cleaned Semantic Scholar papers with titles and abstracts, produced by `clean_and_merge_dbs.ipynb` or downloaded from Hugging Face Hub: [`lalit3c/S2_CS_PHY_PYSCH_papers`](https://huggingface.co/datasets/lalit3c/S2_CS_PHY_PYSCH_papers)

**Output:**
- DuckDB files of embeddings, uploaded incrementally as `embeddings/embeddings_<n>.db` to Hugging Face Hub: [`lalit3c/S2_CS_PHY_PYSCH_papers`](https://huggingface.co/datasets/lalit3c/S2_CS_PHY_PYSCH_papers)

**Notes:**
- Embedding runs on a GPU when one is available (CUDA, then MPS), with a CPU fallback
- Embeddings are uploaded to Hugging Face Hub incrementally for fault tolerance
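
As a rough illustration of how the resulting vectors might be compared downstream, the sketch below computes cosine similarity between two embeddings in plain Python. Cosine similarity is an assumption here, chosen as one common metric; the notebook itself does not prescribe one. The 3-dimensional toy vectors stand in for real 768-dimensional outputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, not real SPECTER2 outputs)
paper_a = [1.0, 0.0, 1.0]
paper_b = [1.0, 0.0, 1.0]
paper_c = [0.0, 1.0, 0.0]

print(cosine_similarity(paper_a, paper_b))  # identical direction, close to 1
print(cosine_similarity(paper_a, paper_c))  # orthogonal vectors
```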

In [11]:
!pip install torch transformers adapters huggingface_hub duckdb pandas numpy --quiet

In [13]:
import torch

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("Total GPU memory:",
          torch.cuda.get_device_properties(0).total_memory / 1024**3, "GB")
else:
    print("CUDA not available")


GPU: NVIDIA L40S
Total GPU memory: 44.52056884765625 GB


In [14]:
import os

import duckdb
import numpy as np
import torch
from tqdm.notebook import tqdm

In [15]:
from transformers import AutoTokenizer
from adapters import AutoAdapterModel
from huggingface_hub import login, hf_hub_download

## Data and Model Loading


In [None]:
login(token='')  # Hugging Face access token (left blank here); uploading requires write access

### Download Papers Database

Download the cleaned papers database from Hugging Face Hub containing titles and abstracts for embedding generation.

In [None]:
local_filename = hf_hub_download(
    repo_id="lalit3c/S2_CS_PHY_PYSCH_papers",
    repo_type="dataset",  # Important: specify it's a dataset repo
    filename="S2_papers_cleaned.db",
)

S2_papers_cleaned_additional_papers.db:   0%|          | 0.00/283M [00:00<?, ?B/s]

In [18]:
cleaned_db = duckdb.connect(local_filename)

### Load SPECTER2 Model

Load the SPECTER2 base model and its proximity adapter. SPECTER2 is designed for scientific document embeddings and uses a SciBERT backbone with task-specific adapters.

**Model Architecture:**
- Base Model: `allenai/specter2_base` (SciBERT-based)
- Adapter: `allenai/specter2` (proximity task adapter)
- Embedding Dimension: 768
- Max Sequence Length: 512 tokens

In [20]:
# --- Load model and adapter ---
hf_model_name = "allenai/specter2_base"
adapter_name = "allenai/specter2"

device = None
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")
tokenizer = AutoTokenizer.from_pretrained(hf_model_name)

model = AutoAdapterModel.from_pretrained(hf_model_name)
model.load_adapter(adapter_name, source="hf", load_as="proximity", set_active=True)
model.to(device)
model.eval()

Using device: cuda


There are adapters available but none are activated for the forward pass.


BertAdapterModel(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(31090, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttentionWithAdapters(
              (query): LoRALinearTorch(
                in_features=768, out_features=768, bias=True
                (shared_parameters): ModuleDict()
                (loras): ModuleDict()
              )
              (key): LoRALinearTorch(
                in_features=768, out_features=768, bias=True
                (shared_parameters): ModuleDict()
                (loras): ModuleDict()
              )
              (value): LoRALinearTorch(
             

## Embeddings Database Setup

Initialize the DuckDB database to store paper embeddings. Each embedding is a 768-dimensional float vector associated with a paper's `corpusid`.
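
A minimal sketch of a pre-insert check, assuming rows are handled as `(corpusid, embedding)` pairs: a `FLOAT[768]` column is a fixed-size array, so a vector of any other length is worth catching before it reaches the table. `validate_row` and `EMBEDDING_DIM` are hypothetical names used only for illustration.

```python
EMBEDDING_DIM = 768  # matches the FLOAT[768] column (assumed constant name)

def validate_row(corpusid, embedding):
    """Return a (str, list[float]) pair ready for insertion, or raise ValueError."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(
            f"corpusid {corpusid}: expected {EMBEDDING_DIM} values, got {len(embedding)}"
        )
    # corpusid is stored as VARCHAR, so normalise it to a string
    return str(corpusid), [float(x) for x in embedding]

row = validate_row(12345, [0.0] * EMBEDDING_DIM)
print(row[0], len(row[1]))
```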

In [22]:
# Single database for all embeddings
embeddings_db = duckdb.connect('all_embeddings.db')

In [24]:
embeddings_db.execute("""
    CREATE TABLE IF NOT EXISTS embeddings (
        corpusid VARCHAR PRIMARY KEY,
        embedding FLOAT[768]
    )
""")

<_duckdb.DuckDBPyConnection at 0x7f460008ce30>

In [25]:
existing_count = embeddings_db.execute("SELECT COUNT(*) FROM embeddings").fetchone()[0]
print(f"Already embedded: {existing_count:,} papers")

Already embedded: 0 papers


## Embedding Generation Functions

### Batch Embedding Function

Define the core embedding function that processes papers in batches for efficient GPU utilization.

**Input Format:** `"Title [SEP] Abstract"` - SPECTER2 expects the title and abstract joined by the tokenizer's separator token (`tokenizer.sep_token`).

**Processing:**
1. Tokenize text with padding and truncation (max 512 tokens)
2. Forward pass through SPECTER2 model
3. Extract CLS token embedding (first token) as document representation
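
The input format above can be sketched as a small helper. This is an illustrative assumption rather than the notebook's own code: `build_input_text` is a hypothetical name, `[SEP]` is hard-coded where the pipeline reads `tokenizer.sep_token`, and missing fields are replaced with empty strings, a choice the source does not specify.

```python
def build_input_text(title, abstract, sep_token="[SEP]"):
    """Join title and abstract as 'Title [SEP] Abstract'."""
    # Treat missing values as empty strings (an assumption, see the note above)
    title = (title or "").strip()
    abstract = (abstract or "").strip()
    return f"{title} {sep_token} {abstract}"

# Toy inputs, not taken from the dataset
print(build_input_text("A Toy Title", "A toy abstract."))
```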

In [None]:
def embed_batch(texts, batch_size=64, pbar=None):
    """Embed a list of texts in batches; returns an array of shape (len(texts), 768)."""
    all_embeddings = []
    
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i + batch_size]
        
        # Tokenize
        inputs = tokenizer(
            batch_texts,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt",
            return_token_type_ids=False
        ).to(device)
        
        # Generate embeddings
        with torch.no_grad():
            outputs = model(**inputs)
            embeddings = outputs.last_hidden_state[:, 0, :]  # CLS token
        
        all_embeddings.append(embeddings.cpu().numpy())
        if pbar is not None:  # avoid tqdm's truthiness check, which depends on its total
            pbar.update(len(batch_texts))
    
    return np.vstack(all_embeddings)



In [None]:
# Count all papers 
total_count = cleaned_db.execute("SELECT COUNT(*) FROM papers_with_abstracts").fetchone()[0]
print(f"Total papers to embed: {total_count:,}")
total_count

## Hugging Face Hub Integration

### Configuration Parameters

Set up batch sizes and upload thresholds for the embedding pipeline:
- **UPLOAD_EVERY**: Number of embeddings before uploading to HF Hub (fault tolerance)
- **FETCH_BATCH_SIZE**: Number of papers to fetch from source DB per iteration
- **EMBED_BATCH_SIZE**: Number of papers to process through the model at once
- **DB_INSERT_BATCH_SIZE**: Number of records to insert into embeddings DB at once
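
To make the interaction between these settings concrete, the arithmetic below works out a few derived quantities for the values used in this notebook. It is a back-of-the-envelope sketch: the size estimate counts only the raw 32-bit float payload and ignores keys, DuckDB storage overhead, and any compression.

```python
import math

UPLOAD_EVERY = 100_000
FETCH_BATCH_SIZE = 10_000
EMBED_BATCH_SIZE = 64
EMBEDDING_DIM = 768
BYTES_PER_FLOAT = 4  # 32-bit float

# Model forward passes needed to embed one fetched batch
forward_passes_per_fetch = math.ceil(FETCH_BATCH_SIZE / EMBED_BATCH_SIZE)

# Fetch iterations between two uploads
fetches_per_upload = math.ceil(UPLOAD_EVERY / FETCH_BATCH_SIZE)

# Raw size of the embedding vectors in one uploaded file
raw_bytes_per_upload = UPLOAD_EVERY * EMBEDDING_DIM * BYTES_PER_FLOAT

print(forward_passes_per_fetch)               # 157
print(fetches_per_upload)                     # 10
print(round(raw_bytes_per_upload / 1024**2))  # 293 (MiB)
```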

In [28]:
from huggingface_hub import HfApi, create_repo

In [None]:
HF_REPO_ID = "lalit3c/S2_CS_PHY_PYSCH_papers"
HF_REPO_TYPE = "dataset"
UPLOAD_EVERY = 100_000  # Upload every 100k embeddings
FETCH_BATCH_SIZE = 10_000  # Fetch from DB
EMBED_BATCH_SIZE = 64  # Process through model
DB_INSERT_BATCH_SIZE = 10_000  # Insert to DB

### Upload and Cleanup Function

Define a helper function that uploads the current embeddings database to Hugging Face Hub and then deletes the local copy. This enables:
- Fault-tolerant processing (progress saved to cloud)
- Efficient local storage usage (delete after upload)
- Resumable processing from last successful upload
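
One way resuming from the last successful upload could be sketched is to derive the next upload index and the source offset from the file names already in the repository. This is a hypothetical helper, not part of the notebook: `resume_state` is an invented name, the file list would in practice come from a call such as `HfApi.list_repo_files`, and the offset assumes every completed upload holds exactly `upload_every` rows.

```python
import re

def resume_state(repo_files, upload_every=100_000):
    """Return (next_upload_index, resume_offset) from uploaded file names."""
    pattern = re.compile(r"^embeddings/embeddings_(\d+)\.db$")
    indices = [int(m.group(1)) for f in repo_files if (m := pattern.match(f))]
    last = max(indices, default=0)
    # Assumes uploads 1..last each contain exactly `upload_every` rows
    return last + 1, last * upload_every

# Hypothetical repository listing
files = [
    "README.md",
    "S2_papers_cleaned.db",
    "embeddings/embeddings_1.db",
    "embeddings/embeddings_2.db",
]
print(resume_state(files))  # (3, 200000)
```

Starting the upload counter from the returned index would also avoid overwriting files that an earlier run already uploaded.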

In [None]:
# Initialize HF API
api = HfApi()

def upload_and_cleanup(db_path, upload_count):
    """Upload the local embeddings DB to the Hub, delete it, and return a fresh connection.

    If the upload fails, the local file is kept so its rows go out with the next upload.
    """
    # Close the global connection so the file is fully written before upload
    embeddings_db.close()
    filename = f"embeddings/embeddings_{upload_count}.db"
    try:
        print(f"\nUploading {db_path} to HuggingFace as {filename}...")
        api.upload_file(
            path_or_fileobj=db_path,
            path_in_repo=filename,
            repo_id=HF_REPO_ID,
            repo_type=HF_REPO_TYPE
        )
        print(f"Uploaded {filename}")

        # Delete local file to free space
        if os.path.exists(db_path):
            os.remove(db_path)
            print(f"Deleted local {db_path}")
    except Exception as e:
        print(f"Error during upload/cleanup: {e}")

    # Reconnect (to a fresh file after a successful upload) and
    # make sure the table exists in either case
    new_db = duckdb.connect(db_path)
    new_db.execute("""
        CREATE TABLE IF NOT EXISTS embeddings (
            corpusid VARCHAR PRIMARY KEY,
            embedding FLOAT[768]
        )
    """)
    return new_db


## Main Embedding Pipeline

Execute the main embedding loop that processes all papers:

1. **Fetch** papers from source database in batches
2. **Prepare** input text as `"Title [SEP] Abstract"` format
3. **Embed** batch using SPECTER2 model
4. **Store** embeddings in local DuckDB
5. **Upload** to Hugging Face Hub every `UPLOAD_EVERY` papers
6. **Resume** from the number of rows already in the local embeddings database (`existing_count`) if interrupted
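
The control flow of these steps can be traced without a model or database. The dry run below is a simplified sketch under stated assumptions: `plan_uploads` is a hypothetical name, embedding and storage are skipped, and only the counters are simulated to show when uploads fire and that a final partial file is flushed at the end.

```python
def plan_uploads(total_count, fetch_batch_size, upload_every, start_offset=0):
    """Simulate the fetch/upload counters; return [(upload_index, rows_in_file), ...]."""
    uploads = []
    offset = start_offset
    since_last_upload = 0
    while offset < total_count:
        # Rows a query with LIMIT fetch_batch_size OFFSET offset would return
        fetched = min(fetch_batch_size, total_count - offset)
        offset += fetched
        since_last_upload += fetched
        if since_last_upload >= upload_every:
            uploads.append((len(uploads) + 1, since_last_upload))
            since_last_upload = 0
    # Final upload for any remainder
    if since_last_upload > 0:
        uploads.append((len(uploads) + 1, since_last_upload))
    return uploads

print(plan_uploads(total_count=25, fetch_batch_size=5, upload_every=10))
# [(1, 10), (2, 10), (3, 5)]
```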

In [None]:

# Process all papers in batches
offset = existing_count
total_processed = existing_count
upload_counter = 0
embeddings_since_last_upload = existing_count % UPLOAD_EVERY

print(f"Starting embedding from offset {offset:,}")
print(f"Will upload every {UPLOAD_EVERY:,} embeddings")

# Create progress bar
pbar = tqdm(total=total_count, initial=existing_count, desc="Embedding papers", unit="papers")

while offset < total_count:
    # Fetch batch of papers from database
    papers_df = cleaned_db.execute(f"""
        SELECT corpusid, title, abstract
        FROM papers_with_abstracts
        ORDER BY corpusid
        LIMIT {FETCH_BATCH_SIZE}
        OFFSET {offset}
    """).df()
    
    if len(papers_df) == 0:
        break
    
    # Prepare texts for embedding: "Title [SEP] Abstract"
    texts = [f"{title} {tokenizer.sep_token} {abstract}"
             for title, abstract in zip(papers_df['title'], papers_df['abstract'])]
    
    # Generate embeddings
    embeddings = embed_batch(texts, batch_size=EMBED_BATCH_SIZE, pbar=pbar)
    
    # Pair each corpusid with its embedding (row order matches papers_df)
    insert_data = [(str(corpusid), embedding.tolist())
                   for corpusid, embedding in zip(papers_df['corpusid'], embeddings)]
    
    # Insert in smaller batches to avoid memory issues
    for i in range(0, len(insert_data), DB_INSERT_BATCH_SIZE):
        embeddings_db.executemany("""
            INSERT OR REPLACE INTO embeddings
            (corpusid, embedding)
            VALUES (?, ?)
        """, insert_data[i:i + DB_INSERT_BATCH_SIZE])
    
    total_processed += len(papers_df)
    offset += FETCH_BATCH_SIZE
    embeddings_since_last_upload += len(papers_df)
    
    # Commit periodically
    embeddings_db.commit()
    
    # Update progress bar
    pbar.set_postfix({
        'processed': f"{total_processed:,}",
        'pct': f"{100*total_processed/total_count:.1f}%",
        'next_upload': f"{UPLOAD_EVERY - embeddings_since_last_upload:,}"
    })
    
    # Upload and clean up once UPLOAD_EVERY embeddings have accumulated
    if embeddings_since_last_upload >= UPLOAD_EVERY:
        upload_counter += 1
        embeddings_db = upload_and_cleanup('all_embeddings.db', upload_counter)
        embeddings_since_last_upload = 0

# Final upload if there are remaining embeddings
if embeddings_since_last_upload > 0:
    upload_counter += 1
    embeddings_db = upload_and_cleanup('all_embeddings.db', upload_counter)

pbar.close()
print(f"\nEmbedding complete! Total papers embedded: {total_processed:,}")
print(f" Uploaded {upload_counter} database files to HuggingFace")
print(f" Repository: https://huggingface.co/datasets/{HF_REPO_ID}")