# Text processor and text embeddings

In this notebook, we first create a new column `title_abstract` and then use a Hugging Face model to create text embeddings for this column.

This column is a concatenation of the `title` and `abstract` columns. The text is processed in the following way:

- remove leading phrases like `abstract` and `introduction` from the text
- remove ending phrases like copyright statements, journal version numbers, and journal names
- merge the title and abstract columns with `. ` as a separator
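
As a rough illustration of the cleaning steps listed above, the following sketch strips a leading `abstract`/`introduction` marker, cuts off a trailing copyright statement, and joins title and abstract with `. `. The regular expressions and the function names `clean_abstract` and `merge_title_abstract` are hypothetical and only meant to show the idea; the actual rules live in `TextProcessor` and are likely more extensive.

```python
import re

# Hypothetical patterns for illustration; the real TextProcessor rules are not shown in this notebook.
# Leading marker such as "Abstract:" or "Introduction -" at the very start of the text.
LEADING = re.compile(r"^\s*(abstract|introduction)\b[\s:.\-]*", re.IGNORECASE)
# Everything from the first copyright marker to the end of the text.
TRAILING = re.compile(r"\s*(©|\(c\)|copyright\b).*$", re.IGNORECASE | re.DOTALL)


def clean_abstract(text: str) -> str:
    """Remove a leading section marker and a trailing copyright statement."""
    text = LEADING.sub("", text)
    text = TRAILING.sub("", text)
    return text.strip()


def merge_title_abstract(title: str, abstract: str) -> str:
    """Join the title and the cleaned abstract with '. ' as a separator."""
    return f"{title.strip()}. {clean_abstract(abstract)}"


example = merge_title_abstract(
    "Graph learning",
    "Abstract: We study graphs. © 2020 Elsevier Ltd.",
)
print(example)
```

Note that a pattern this simple would also truncate an abstract that legitimately mentions the word "copyright" mid-text, so a production version would need stricter anchoring.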

We then use SPECTER2 to create text embeddings for this column.


In [1]:
import pandas as pd
from src.nlp.TextProcessor import TextProcessor

pd.options.mode.chained_assignment = None  # default='warn'

# Read in data and process text


In [2]:
df = pd.read_pickle("../data/03-connected/scopus_cleaned_connected.pkl")
cols = ["abstract", "title"]
file_path = "../output/descriptive-stats-logs/na_log_text_cols.json"
tp = TextProcessor(df)
tp.save_na_dict_to_json(cols, file_path)
df = tp.clean_text_and_remove_start_and_ending_statements(
    return_cleaned_text_separately=True
)

df.reset_index(drop=True, inplace=True)
print(f"Papers to embed: {len(df)}")

NA dict saved to ../output/descriptive-stats-logs/na_log_text_cols.json
cleaned text and removed start and ending statements
cleaned text - embed me now :)
Papers to embed: 40643


# Create Embeddings


In [3]:
from src.nlp.EmbeddingCreator import PaperEmbeddingProcessor

In [4]:
processor = PaperEmbeddingProcessor(
    df=df,
    model_name="allenai/specter2_base",
    adapter_name="specter2",  # selects the "proximity" adapter
    save_dir="../data/04-embeddings",
    batch_size=32,
    chunk_size=2500,
)
total_embeddings = processor.process_papers()
processor.save_embeddings_with_data(total_embeddings)

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

100%|██████████| 17/17 [3:59:04<00:00, 843.77s/it]  
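
The internals of `PaperEmbeddingProcessor` are not shown in this notebook. The sketch below is one plausible way the `chunk_size` and `batch_size` parameters could interact, assuming the papers are split into chunks and each chunk is fed to the model in smaller batches. The names `iter_slices`, `embed_in_chunks`, and `embed_batch` are hypothetical, and the model call is replaced by a placeholder callback so the example runs without downloading anything.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def iter_slices(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_chunks(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 32,
    chunk_size: int = 2500,
) -> List[Vector]:
    """Embed all texts chunk by chunk, calling `embed_batch` once per batch."""
    embeddings: List[Vector] = []
    for chunk in iter_slices(texts, chunk_size):      # outer loop: one step per chunk
        for batch in iter_slices(chunk, batch_size):  # inner loop: one model call per batch
            embeddings.extend(embed_batch(batch))
        # A real implementation might write the finished chunk to disk here,
        # so that a long run can resume after an interruption (assumption).
    return embeddings


def dummy_embed_batch(batch: Sequence[str]) -> List[Vector]:
    """Placeholder for the model: returns a 1-dimensional 'embedding' per text."""
    return [[float(len(text))] for text in batch]


vectors = embed_in_chunks(["a", "bb", "ccc"], dummy_embed_batch, batch_size=2, chunk_size=2)
print(len(vectors))
```

Under this assumption, 40643 papers with `chunk_size=2500` give 17 chunks, which would be consistent with the 17 steps of the progress bar above.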
