# Text Embeddings with SPECTER

In this notebook, we demonstrate how to convert raw text from academic papers (titles and abstracts) into 768-dimensional vectors using the SPECTER model.

## Introduction

Converting raw text into embeddings is computationally intensive. The model can run on a CPU, but a transformer such as SPECTER is typically far slower there, so a GPU is recommended whenever one is available.

Memory matters as much as compute. Instead of loading the entire dataset into memory, we process the data iteratively: read a small batch of papers from disk, generate an embedding for each paper, and write the results to storage before reading the next batch. Only a small amount of data is held in memory at any given time.
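
The streaming idea above can be sketched as follows. This is a minimal illustration, not the notebook's actual pipeline: it handles one record at a time, `fake_embed` is a hypothetical stand-in for the SPECTER model, and the column names (`paper_id`, `title`, `abstract`) and the 4-dimensional vectors are assumptions chosen to keep the example small.

```python
import csv
import io


def fake_embed(text, dim=4):
    """Hypothetical stand-in for a real embedding model (NOT SPECTER).

    Returns a deterministic list of `dim` floats derived from the text length.
    """
    return [float((len(text) + i) % 7) for i in range(dim)]


def stream_embeddings(reader, writer, embed=fake_embed):
    """Read one paper at a time, embed it, and write the row immediately."""
    count = 0
    for row in csv.DictReader(reader):
        # Assumed column names; adapt them to the real input file.
        text = row["title"] + " " + row["abstract"]
        writer.writerow([row["paper_id"]] + embed(text))
        count += 1
    return count


# Usage with in-memory files; real code would pass open file handles instead.
source = io.StringIO("paper_id,title,abstract\np1,A title,An abstract\np2,B,C\n")
target = io.StringIO()
n_written = stream_embeddings(source, csv.writer(target))
```

Because each row is written as soon as its embedding exists, memory use stays roughly constant regardless of how many papers the input contains.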

## Considerations for Storing Embeddings

When deciding how to store the generated embeddings, several factors come into play:

- **Read/Write Speed**:
    - When speed is crucial, NumPy's binary formats are recommended: `.npy` stores a single array, and `.npz` bundles several arrays into one archive (optionally compressed). Both offer faster read/write speeds than CSV files because no text parsing is involved.

- **Interoperability**: 
    - If the embeddings need to be accessed by various software or tools, the CSV format is more universal. However, it's worth noting that CSV files tend to be larger and slower to read/write compared to binary formats.

- **Data Volume**:
    - With a very large number of embeddings, it is beneficial to process and store the data in chunks, so that memory usage depends on the chunk size and not on the size of the dataset.
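
One way to combine the binary format with chunked writes is to preallocate the `.npy` file as a memory map and fill it slice by slice. The sketch below assumes toy sizes and placeholder data instead of real embeddings; the temporary path is illustrative only.

```python
import os
import tempfile

import numpy as np

n_rows, dim, chunk_size = 10, 4, 3  # toy sizes; the notebook uses 768 dimensions
path = os.path.join(tempfile.mkdtemp(), "vectors.npy")

# Create a valid .npy file of the final shape without allocating it in RAM.
out = np.lib.format.open_memmap(
    path, mode="w+", dtype=np.float32, shape=(n_rows, dim)
)

for start in range(0, n_rows, chunk_size):
    stop = min(start + chunk_size, n_rows)
    # Placeholder data standing in for a chunk of real embeddings.
    chunk = np.full((stop - start, dim), start, dtype=np.float32)
    out[start:stop] = chunk  # written through to the file on disk

out.flush()
del out  # close the memory map

# Any NumPy reader can now load the file, optionally memory-mapped as well.
loaded = np.load(path, mmap_mode="r")
```

The resulting file is an ordinary `.npy` array, so downstream code does not need to know it was written in chunks.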

With these considerations in mind, we'll now delve into the process of generating and storing embeddings using the SPECTER model.


In [2]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.insert(1, '../scripts/')

import embeddings
import csv
from tqdm.notebook import tqdm

# Raise the csv field size limit; otherwise very long fields raise an error
maxInt = sys.maxsize

while True:
    # On platforms where sys.maxsize does not fit in a C long,
    # csv.field_size_limit raises OverflowError; shrink the value
    # by a factor of 10 until it is accepted.
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = maxInt // 10

In [10]:
import numpy as np
import csv
from tqdm.notebook import tqdm
import torch

# Load the embedding model
print('Load the embedding model...')
tokenizer, model = embeddings.load_model()

# Move the model to GPU if available
if torch.cuda.is_available():
    model = model.to('cuda')
    print("Model moved to GPU.")
else:
    print("Using CPU.")

# Count the number of papers
print('Get the number of papers to process...')
with open('../data/raw/papers_raw.csv', 'r', encoding='utf-8') as file:
    line_count = sum(1 for line in file)
total_papers = line_count - 1  # Subtract 1 for the header

# Choose storage method: 'csv' or 'numpy'
storage = 'numpy'  # Change to 'csv' if needed

# Initialize storage
if storage == 'csv':
    with open('../data/vectors/papers_vectors.csv', 'w', encoding='utf-8', newline='') as writer:
        csv_writer = csv.writer(writer, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        csv_writer.writerow(["PaperID"] + list(range(0, 768)))
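elif storage == 'numpy':
    # Preallocate the .npy file on disk as a memory map instead of building the full array in RAM
    empty_data = np.lib.format.open_memmap('../data/vectors/papers_vectors.npy', mode='w+',
                                           dtype=np.float64, shape=(total_papers, 768))
    del empty_data  # Flush and close; rows are filled in during processing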

# Process data in chunks for memory efficiency
chunk_size = 50
print('Processing...')
with open('../data/raw/papers_raw.csv', 'r', encoding='utf-8') as reader:
    csv_reader = csv.reader(reader, delimiter='\t', quotechar='"')
    next(csv_reader)  # Skip header

    for chunk_start in tqdm(range(0, total_papers, chunk_size)):
        chunk_data = [line for _, line in zip(range(chunk_size), csv_reader)]
        
        # Generate embeddings for the chunk
        vectors_with_ids = []
        for line in chunk_data:
            text = line[2] + ' ' + line[3]  # Join the two text columns with a space so words do not run together
            vector = embeddings.get_embedding(text, tokenizer, model)
            vectors_with_ids.append([line[0]] + list(vector))  # Add PaperID at the beginning

        # Store embeddings
        if storage == 'csv':
            with open('../data/vectors/papers_vectors.csv', 'a', encoding='utf-8', newline='') as writer:
                csv_writer = csv.writer(writer, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
                csv_writer.writerows(vectors_with_ids)
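        elif storage == 'numpy':
            # Memory-map the preallocated .npy file and write this chunk's rows in place.
            # Only the vectors are stored; row order follows the input file (PaperIDs are dropped).
            out = np.load('../data/vectors/papers_vectors.npy', mmap_mode='r+')
            for idx, row in enumerate(vectors_with_ids):
                out[chunk_start + idx] = np.asarray(row[1:], dtype=out.dtype)
            out.flush()
            del out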


Load the embedding model...
Using CPU.
Get the number of papers to process...
Processing...


  0%|          | 0/70 [00:00<?, ?it/s]

KeyboardInterrupt: 