# Documents to vectors using a text embedding model

In this notebook we show how to use a text embedding model to turn articles in raw-text form into vectors. The vectors are meant to capture the "semantic meaning" of the text, so that articles with similar content and writing style end up close to each other. For this notebook we use articles published in the Physics Education Research Conference Proceedings (PERC Proceedings) from 2001 to 2023. The text has already been scraped using the library PyMuPDF (which can handle text extraction from PDFs with multiple columns) and is stored in a DataFrame together with metadata such as authors, title, and year of publication. This DataFrame has been saved as a pickle file (extension .pkl) and cleaned so that it is ready to be used in a text embedding model.
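
To make "close to each other" concrete, here is a minimal sketch using cosine similarity, one common way to compare embedding vectors. The notebook itself does not fix a particular metric, so treat this choice as an assumption; the three toy vectors are made up for illustration and are far shorter than real embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (made up; real ones have hundreds of dimensions)
article_a = [0.9, 0.1, 0.0]
article_b = [0.8, 0.2, 0.1]
article_c = [0.0, 0.1, 0.9]

sim_ab = cosine_similarity(article_a, article_b)
sim_ac = cosine_similarity(article_a, article_c)
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

Here `article_a` and `article_b` point in nearly the same direction, so their similarity is much higher than that of `article_a` and `article_c`.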

To embed the articles we use the Hugging Face Transformers library, which gives access to a large number of pre-trained models. We use a model from jinaai called jina-embeddings-v2-small-en. It was chosen because it is a small model with a long context window (roughly 8k tokens), which is important because the articles are quite long. Although other models are available, many of them have smaller context windows, are much larger, or perform worse.
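
Because the context window matters here, it can be useful to flag texts that may not fit. The sketch below is only a rough illustration: `MAX_TOKENS = 8192` is an assumed reading of "roughly 8k tokens", the helper names are hypothetical, and the default whitespace split undercounts compared with a real subword tokenizer. For a real count you would pass in the tokenizer that belongs to the model.

```python
# Assumption: "roughly 8k tokens" is taken as 8192 here
MAX_TOKENS = 8192

def count_tokens(text, tokenize=str.split):
    """Count tokens using `tokenize`; the whitespace default is a crude stand-in."""
    return len(tokenize(text))

def find_too_long(texts, max_tokens=MAX_TOKENS, tokenize=str.split):
    """Return (index, token_count) for each text whose count exceeds max_tokens."""
    too_long = []
    for i, text in enumerate(texts):
        n = count_tokens(text, tokenize)
        if n > max_tokens:
            too_long.append((i, n))
    return too_long

# Made-up texts: one short, one longer than the assumed limit
texts = ["a short abstract", "word " * 9000]
print(find_too_long(texts))
```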

To access this model, you will need to go through a few initial steps:
1. Make an account on Hugging Face (https://huggingface.co/) and generate an access token (https://huggingface.co/docs/hub/en/security-tokens).
2. In your command line, run `huggingface-cli login` and enter your access token.
3. Open your Jupyter notebook and run the code below.

## Imports
First, we import our libraries: transformers to access the embedding model, pickle to load and store the DataFrame, and numpy for some data handling.

In [1]:
# Importing the libraries
from transformers import AutoModel
import pickle as pkl 
import numpy as np



Next, we load our embedding model. Remember, if you haven't made an account with Hugging Face and authenticated yourself with an access token, this won't work.

In [2]:
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v2-small-en", trust_remote_code=True)

## Loading our text data

We use pickle to load the dataset. The cleaned/processed text is in the column "raw". The other columns hold metadata for the texts we're working with.

In [3]:
with open("data/PERC2001-2023_ExtraArticles/processed_text.pkl", "rb") as f:
    df = pkl.load(f)
df.head()

| | title | authors | PDF Link | doi | year | raw |
|---|---|---|---|---|---|---|
| 0 | Inductive Influence of Related Quantitative an... | Philip Dukes and David E. Pritchard | https://www.per-central.org/../items/perc/990.pdf | 10.1119/perc.2001.inv.001 | 2001 | (Invited paper for proceedings of Physics Educ... |
| 1 | An Investigation on the Impact of Implementing... | Lawrence T. Escalada | https://www.per-central.org/../items/perc/1023... | 10.1119/perc.2001.inv.002 | 2001 | An Investigation on the Impact of Implementing... |
| 2 | Context in the Context of Physics and Learning | Noah D. Finkelstein | https://www.per-central.org/../items/perc/1025... | 10.1119/perc.2001.inv.003 | 2001 | Context in the Context of Physics and Learning... |
| 3 | Observing Students' Use of Computer-based Tool... | Elizabeth George, Maan Jiang Broadstock, and J... | https://www.per-central.org/../items/perc/1027... | 10.1119/perc.2001.inv.004 | 2001 | important for learning.9 With VBL, \ngraphs ar... |
| 4 | Problem Solving and Conceptual Understanding | William J. Gerace | https://www.per-central.org/../items/perc/1028... | 10.1119/perc.2001.inv.005 | 2001 | Problem Solving and Conceptual Understanding \... |
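
Before spending a long time on embedding, it may be worth checking that every row actually has text. The sketch below runs on a small made-up DataFrame; only the column name `raw` comes from this notebook, and the helper `find_unusable_rows` is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Small stand-in for the real DataFrame (made-up rows)
toy = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "raw": ["Some article text.", None, "   "],
})

def find_unusable_rows(frame, text_col="raw"):
    """Return index labels of rows whose text is missing or only whitespace."""
    mask = frame[text_col].map(lambda t: not isinstance(t, str) or not t.strip())
    return frame.loc[mask].index.tolist()

bad_rows = find_unusable_rows(toy)
print(bad_rows)
```

On the real data the same call would be `find_unusable_rows(df)`; any rows it reports could be inspected or dropped before embedding.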


## Creating embeddings

Next, we iterate over the rows of our DataFrame and create an embedding from the text in each row's "raw" column. This takes a while (approximately 2 hours 20 minutes on a 2023-era M2 MacBook Pro).

In [5]:
%%time
def encode_text(row):
    text = row['raw']
    embedding = model.encode(text)
    return embedding.tolist()

# Apply the function along the rows and assign the result to the new 'embedding' column
df['embedding'] = df.apply(encode_text, axis=1)
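
Since the run is long, one possible variation is to encode the texts in small batches and print progress along the way. This is a sketch, not the method used above: `embed_in_batches` is a hypothetical helper, and it assumes the encoder accepts a list of strings and returns one vector per string. A fake encoder stands in for the model so the example runs without downloading anything.

```python
import numpy as np

def embed_in_batches(texts, encode_fn, batch_size=8):
    """Encode texts batch by batch and stack the results into one 2-D array.

    encode_fn is any callable mapping a list of strings to an array of
    shape (len(batch), dim).
    """
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(np.asarray(encode_fn(batch)))
        done = min(start + batch_size, len(texts))
        print(f"embedded {done}/{len(texts)} texts")
    return np.vstack(chunks)

# Hypothetical stand-in encoder: a 4-dimensional vector holding each text's length
def fake_encode(batch):
    return np.array([[float(len(t))] * 4 for t in batch])

vectors = embed_in_batches(["a", "bb", "ccc"], fake_encode, batch_size=2)
print(vectors.shape)
```

With the real model, the call might look like `embed_in_batches(df['raw'].tolist(), model.encode)`. Long articles use a lot of memory per batch, so a small `batch_size` is the safer starting point.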

## Saving our dataframe

Finally, we store the updated DataFrame, now including the embeddings, in a pickle file. This file will be used later when we analyze the texts by their embeddings. This marks the end of the notebook. Thanks for following along :)

In [None]:
with open("../data/PERC2001-2023_ExtraArticles/embeddings_jina.pkl", "wb") as f:
    pkl.dump(df, f)
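
For later analysis, the embeddings stored as one list per row usually need to be stacked into a single matrix. The sketch below shows one way this could be done, using a toy DataFrame with made-up values and a temporary file in place of the real pickle. Only the column name `embedding` comes from this notebook.

```python
import os
import pickle as pkl
import tempfile

import numpy as np
import pandas as pd

# Toy DataFrame with the same layout: one list of floats per row (values made up)
toy = pd.DataFrame({
    "title": ["Paper A", "Paper B"],
    "embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})

# Round-trip through a temporary pickle file, mirroring the save step
path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.pkl")
with open(path, "wb") as f:
    pkl.dump(toy, f)
with open(path, "rb") as f:
    loaded = pkl.load(f)

# Stack the per-row lists into one matrix of shape (n_articles, embedding_dim)
matrix = np.array(loaded["embedding"].tolist())
print(matrix.shape)
```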