# Text processor and text embeddings

In this notebook, we first create a new column `title_abstract` and then use a Hugging Face model to create text embeddings for this column.

The `title_abstract` column is a concatenation of the `title` and `abstract` columns. Before merging, the text processor cleans the text in the following way:

- removes leading phrases like `abstract` and `introduction` from the start of the text
- removes trailing statements such as copyright notices, journal version numbers, and journal names
- merges the title and abstract columns with `. ` as a separator
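
The steps above can be sketched roughly as follows. This is a minimal illustration only: the actual rules live in `TextProcessor` and are not shown in this notebook, so the regular expressions and the helper names `clean_abstract` and `merge_title_abstract` are hypothetical, and stripping a trailing period from the title is an assumption made to avoid a doubled `..`.

```python
import re

# Hypothetical patterns; the real TextProcessor rules are not shown here.
# Leading section labels such as "Abstract:" or "Introduction".
# Crude: an abstract that genuinely begins "Introduction of ..." is also trimmed.
LEADING = re.compile(r"^\s*(abstract|introduction)\b[\s:.\-]*", re.IGNORECASE)
# Everything from the first copyright marker to the end of the text.
TRAILING = re.compile(r"\s*(©|copyright\b).*$", re.IGNORECASE | re.DOTALL)


def clean_abstract(text: str) -> str:
    """Strip a leading section label and a trailing copyright statement."""
    text = LEADING.sub("", text)
    text = TRAILING.sub("", text)
    return text.strip()


def merge_title_abstract(title: str, abstract: str) -> str:
    """Join title and cleaned abstract with '. ' as the separator."""
    # Assumption: drop a trailing period on the title so the result has no "..".
    title = title.strip().rstrip(".")
    return f"{title}. {clean_abstract(abstract)}"
```

For example, `merge_title_abstract("Deep learning.", "Abstract: We study X. © 2020 Elsevier Ltd.")` would give `"Deep learning. We study X."`.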

We then use SPECTER2 (`allenai/specter2_base` with the `specter2` adapter) to create text embeddings for this column.


In [1]:
import pandas as pd
from src.nlp.TextProcessor import TextProcessor

pd.options.mode.chained_assignment = None  # default='warn'

from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Access environment variables
python_path = os.getenv("PYTHONPATH")
data_dir = os.getenv("DATA_DIR")
src_dir = os.getenv("SRC_DIR")
output_dir = os.getenv("OUTPUT_DIR")
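
`os.getenv` returns `None` when a variable is missing, and the later `data_dir + "/..."` concatenations would then fail with a `TypeError` that does not say which variable is unset. One possible guard, sketched here with a hypothetical helper `require_env_dir` that is not part of this project, is to fail early with a clear message and return a `pathlib.Path`:

```python
import os
from pathlib import Path


def require_env_dir(name: str) -> Path:
    """Return the directory named by environment variable `name`.

    Hypothetical helper: raises a descriptive error if the variable is
    unset or empty instead of silently returning None.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; check your .env file"
        )
    return Path(value)


# Usage (after load_dotenv() has run):
#   data_dir = require_env_dir("DATA_DIR")
#   pkl_path = data_dir / "03-connected" / "scopus_cleaned_connected_20250326.pkl"
```

Using `Path` objects also lets the paths be built with `/` rather than string concatenation.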

# Read in data and process text


In [2]:
df = pd.read_pickle(data_dir + "/03-connected/scopus_cleaned_connected_20250326.pkl")
cols = ["abstract", "title"]

file_path = output_dir + "/descriptive-stats-logs/na_log_text_cols_20250326.json"

tp = TextProcessor(df)
tp.save_na_dict_to_json(cols, file_path)
df = tp.clean_text_and_remove_start_and_ending_statements(
    return_cleaned_text_separately=True
)

df.reset_index(drop=True, inplace=True)
print(f"Papers to embed: {len(df)}")

NA dict saved to output/descriptive-stats-logs/na_log_text_cols_20250326.json
cleaned text and removed start and ending statements
cleaned text - embed me now :)
Papers to embed: 38961


In [3]:
df.columns

Index(['eid', 'title', 'date', 'first_author', 'abstract', 'doi', 'year',
       'auth_year', 'unique_auth_year', 'pubmed_id', 'api_url', 'scopus_id',
       'journal', 'citedby_count', 'publication_type', 'publication_subtype',
       'publication_subtype_description', 'author_count', 'authors_json',
       'authkeywords', 'funding_no', 'openaccess', 'openaccess_flag',
       'freetoread', 'freetoread_label', 'fund_acr', 'fund_sponsor',
       'article_number', 'reference_eids', 'nr_references',
       'filtered_reference_eids', 'nr_filtered_references', 'title_abstract',
       'clean_title', 'clean_abstract'],
      dtype='object')

# Create embeddings


In [4]:
from src.nlp.EmbeddingCreator import PaperEmbeddingProcessor

In [5]:
processor = PaperEmbeddingProcessor(
    df=df,
    model_name="allenai/specter2_base",
    adapter_name="specter2",  # SPECTER2 "proximity" adapter
    save_dir=data_dir + "/04-embeddings/2025",
    batch_size=32,
    chunk_size=2500,  # papers per chunk (38961 papers -> 16 chunks)
)
total_embeddings = processor.process_papers()
processor.save_embeddings_with_data(total_embeddings)

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

100%|██████████| 16/16 [4:19:29<00:00, 973.09s/it]  
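
The `16/16` in the progress bar is consistent with the 38,961 papers being split into chunks of `chunk_size=2500`. The internals of `PaperEmbeddingProcessor` are not shown in this notebook, so the two-level split below (chunks of papers, then model batches of `batch_size=32` within each chunk) is an assumption, and `iter_chunks` is a hypothetical helper used only to illustrate the arithmetic.

```python
import math
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_chunks(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


n_papers, chunk_size, batch_size = 38961, 2500, 32

# Number of chunks: 15 full chunks plus one partial chunk.
n_chunks = math.ceil(n_papers / chunk_size)

# Assumed inner loop: each chunk is fed to the model in batches.
paper_ids = range(n_papers)
batches_per_chunk = [
    sum(1 for _ in iter_chunks(chunk, batch_size))
    for chunk in iter_chunks(paper_ids, chunk_size)
]

print(n_chunks)               # 16
print(batches_per_chunk[0])   # 79 batches in a full chunk
print(batches_per_chunk[-1])  # 46 batches in the final chunk of 1461 papers
```

At roughly 973 s per chunk, 16 chunks account for the total of about 4 h 19 min shown in the progress bar.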
