## Documents to vectors using a text embedding model

In this notebook we show how to use a text embedding model to turn articles, given as raw text, into vectors. The vectors are meant to capture the "semantic meaning" of the text, so that similar articles end up close to each other. For this notebook we use articles published in PERC between 2001 and 2023. The text is stored in a DataFrame together with metadata such as author, title, and year of publication. The DataFrame has already been saved as a pickle file (extension .pkl) and cleaned so that it is ready to be passed to a text embedding model.
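To make "close to each other" concrete, one common measure of closeness between embeddings is cosine similarity. The sketch below uses made-up 3-dimensional vectors rather than real embeddings, and `cosine_similarity` is a hypothetical helper; this notebook does not specify which measure is used in the later analysis.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for article embeddings (made up for illustration).
article_a = [0.9, 0.1, 0.0]
article_b = [0.8, 0.2, 0.1]  # points in nearly the same direction as article_a
article_c = [0.0, 0.1, 0.9]  # points in a very different direction

sim_ab = cosine_similarity(article_a, article_b)
sim_ac = cosine_similarity(article_a, article_c)
print(sim_ab > sim_ac)
```

With real embeddings the same function would be applied to two of the 512-element vectors produced by the model.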

In [1]:
# Importing the libraries
from transformers import AutoModel
import pickle as pkl 
import numpy as np



To embed the articles we use the Hugging Face transformers library. The library gives access to a large number of pre-trained models, and we use a model from jinaai called jina-embeddings-v2-small-en. We chose it because it is a small model with a long context window (roughly 8k tokens), which matters because the articles are quite long. There are stronger models available, but many of them have a shorter context window or are much larger.

In [2]:
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v2-small-en", trust_remote_code=True)
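Because the context window is the reason for choosing this model, it can be useful to check whether an article is likely to fit. The following is a minimal sketch under stated assumptions: `count_tokens` and `fits_context` are hypothetical helpers, the limit of 8192 tokens is an assumed value for "roughly 8k", and whitespace splitting is only a stand-in for the model's real tokenizer.

```python
def count_tokens(text, tokenize):
    """Number of tokens that `tokenize` produces for `text`."""
    return len(tokenize(text))

def fits_context(text, tokenize, max_tokens=8192):
    # max_tokens=8192 is an assumed limit; check the model card for the exact value.
    return count_tokens(text, tokenize) <= max_tokens

# Stand-in tokenizer: whitespace splitting. A real subword tokenizer usually
# produces more tokens than this, so treat the count as a rough lower bound.
whitespace_tokenize = str.split

short_article = "physics education research " * 10
long_article = "word " * 10_000

print(fits_context(short_article, whitespace_tokenize))
print(fits_context(long_article, whitespace_tokenize))
```

In practice one would probably pass the `tokenize` method of the tokenizer that belongs to the model (for example one loaded with `AutoTokenizer.from_pretrained`) instead of `str.split`.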

We load the dataset with pickle. The cleaned/processed text is in the column "raw".

In [3]:
with open("data/PERC2001-2023_ExtraArticles/processed_text.pkl", "rb") as f:
    df = pkl.load(f)
df.head()

| Unnamed: 0 | title | authors | PDF Link | doi | year | raw |
|---|---|---|---|---|---|---|
| 0 | Inductive Influence of Related Quantitative an... | Philip Dukes and David E. Pritchard | https://www.per-central.org/../items/perc/990.pdf | 10.1119/perc.2001.inv.001 | 2001 | (Invited paper for proceedings of Physics Educ... |
| 1 | An Investigation on the Impact of Implementing... | Lawrence T. Escalada | https://www.per-central.org/../items/perc/1023... | 10.1119/perc.2001.inv.002 | 2001 | An Investigation on the Impact of Implementing... |
| 2 | Context in the Context of Physics and Learning | Noah D. Finkelstein | https://www.per-central.org/../items/perc/1025... | 10.1119/perc.2001.inv.003 | 2001 | Context in the Context of Physics and Learning... |
| 3 | Observing Students' Use of Computer-based Tool... | Elizabeth George, Maan Jiang Broadstock, and J... | https://www.per-central.org/../items/perc/1027... | 10.1119/perc.2001.inv.004 | 2001 | important for learning.9 With VBL, \ngraphs ar... |
| 4 | Problem Solving and Conceptual Understanding | William J. Gerace | https://www.per-central.org/../items/perc/1028... | 10.1119/perc.2001.inv.005 | 2001 | Problem Solving and Conceptual Understanding \... |


This notebook only computes the embeddings, so we store them alongside the metadata already present in the DataFrame. We therefore create a new column called "embedding". Since our model outputs a vector of length 512, each entry in that column is a list of 512 elements. The output length varies between models, so check the documentation of the model you are using.

In [5]:
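# Placeholder column: one list of 512 zeros per article, to be overwritten with the real embeddings
placeholder = np.zeros((len(df), 512))
df["embedding"] = placeholder.tolist()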

Now we are ready to run the model and store its output in the DataFrame. It is as simple as calling model.encode() with the text as input. Called on a single string, as we do here, it returns the embedding of that text as a NumPy array, which we convert to a list. The code below loops over the rows of the DataFrame, embeds the text in each row, and stores the result back into the DataFrame.

In [None]:
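# Assumes the default integer index 0..len(df)-1, as shown in df.head()
for i in range(len(df)):
    text = df.loc[i, "raw"]
    embedding = model.encode(text)
    # .at writes to a single cell and avoids chained assignment
    df.at[i, "embedding"] = embedding.tolist()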
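The row-by-row loop can also be written in batches, assuming the encoder accepts a list of texts and returns one row per text. The sketch below is not the notebook's method: `embed_in_batches` is a hypothetical helper, and `fake_encode` is a stand-in that returns random vectors of the right shape so the example runs without downloading the model.

```python
import numpy as np

EMBEDDING_DIM = 512  # output size stated in this notebook for the model used

def fake_encode(texts):
    """Stand-in for an encoder: returns one random vector per input text."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), EMBEDDING_DIM))

def embed_in_batches(texts, encode, batch_size=8):
    """Encode texts in chunks and return one list of floats per text."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors = np.asarray(encode(batch))
        # Fail early if the encoder does not return one row per input text.
        if vectors.ndim != 2 or vectors.shape[0] != len(batch):
            raise ValueError(f"unexpected encoder output shape {vectors.shape}")
        embeddings.extend(vectors.tolist())
    return embeddings

texts = [f"article {i}" for i in range(20)]
embeddings = embed_in_batches(texts, fake_encode, batch_size=8)
print(len(embeddings), len(embeddings[0]))
```

With the real model, the result could be assigned in one step, for example `df["embedding"] = embed_in_batches(df["raw"].tolist(), model.encode)`. Since the articles are long, a small batch size is probably the safer choice for memory.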

Finally, we store the updated DataFrame, now including the embeddings, in a pickle file. It will be used later when we analyze the texts by their embeddings. This marks the end of the notebook, thanks for following along :)

In [None]:
with open("../data/PERC2001-2023_ExtraArticles/embeddings_jina.pkl", "wb") as f:
    pkl.dump(df, f)
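For the later analysis, the pickled DataFrame can be read back and the per-row lists stacked into a single array. The sketch below shows that round trip on a tiny made-up DataFrame with 2-dimensional "embeddings"; the file name `embeddings_demo.pkl` and the variable names are hypothetical.

```python
import os
import pickle as pkl
import tempfile

import numpy as np
import pandas as pd

# Tiny stand-in DataFrame; the real one has one row per article.
df_demo = pd.DataFrame({
    "title": ["a", "b", "c"],
    "embedding": [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings_demo.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pkl.dump(df_demo, f)
    with open(path, "rb") as f:
        df_loaded = pkl.load(f)

# Stack the per-row lists into one (n_articles, embedding_dim) array.
embedding_matrix = np.array(df_loaded["embedding"].tolist())
print(embedding_matrix.shape)
```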