## Feed policy text into a sentence-transformer to get its position in vector space

In [1]:
import os
# must be set before anything imports TensorFlow, otherwise the log level is ignored
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import configparser
from pathlib import Path
import string
import time

import numpy as np
import pandas as pd
import spacy
from spacy import displacy
from sentence_transformers import SentenceTransformer, util



In [2]:
config = configparser.ConfigParser()
config.read("config.ini")

# access values
raw_path = Path(config["default"]["raw_path"])
interim_path = Path(config["default"]["interim_path"])
processed_path = Path(config["default"]["processed_path"])
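
`ConfigParser.read` silently skips files it cannot find, so a missing `config.ini` would only surface afterwards as a `KeyError` on `config["default"]`. A minimal sketch of a stricter loader is shown here; the helper name `load_paths` is hypothetical, and the section and key names simply mirror the ones used in the cell above.

```python
import configparser
from pathlib import Path


def load_paths(config_file="config.ini", section="default"):
    """Return (raw, interim, processed) paths, failing early if the config is missing."""
    config = configparser.ConfigParser()
    # read() returns the list of files it managed to parse; an empty list means none were found
    if not config.read(config_file):
        raise FileNotFoundError(f"Could not read config file: {config_file}")
    keys = ("raw_path", "interim_path", "processed_path")
    return tuple(Path(config[section][key]) for key in keys)
```

In this notebook it could be used as `raw_path, interim_path, processed_path = load_paths()`.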

In [11]:
df = pd.read_csv(interim_path / "tokenised_policy_incentives_country.csv")  # alternative input: tokenised_policy_incentives_subsectioned.csv
df

Unnamed: 0,country,text_clean
0,Austria,federal purchase subsidy scheme e mobilität 20...
1,Belgium,regional purchase subsidies are no longer avai...
2,Bulgaria,no purchase subsidies are currently available ...
3,Croatia,call for investment in zero emission vehicles ...
4,Cyprus,under national electromobility promotion schem...
5,Czech Republic,no purchase subsidies available in 2025 bevs f...
6,Denmark,no direct national purchase subsidies are avai...
7,Estonia,since december 2023 estonia no longer offers d...
8,Finland,2025 finland does not offer direct purchase su...
9,France,ecological bonus bonus écologique remains in p...


#### Import sentence-transformer model
* Documentation on the pretrained options: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
* Chose `all-MiniLM-L6-v2`: the `all-*` models are the general-purpose ones and were trained on all available training data.
* The 'Mini' version is faster, and small enough to run locally without any issues.
* Encodes text into a 384-dimensional vector space.

In [12]:
model = SentenceTransformer('all-MiniLM-L6-v2')

Quick check that the model loads and returns an embedding of the expected shape (one row, 384 dimensions)

In [5]:
# TEST EMBEDDINGS
embeddings = model.encode(["Hello world"])
print(embeddings.shape)

(1, 384)
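
The shape check confirms the model loads, but not that the vectors carry meaning. One further sanity check, sketched here under assumptions, is to compare cosine similarities: a paraphrase pair should score higher than an unrelated pair. The helper below is plain NumPy and its name `cosine_similarity` is hypothetical; `sentence_transformers.util.cos_sim` provides a comparable computation on tensors.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # dot product divided by the product of the vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

With the model loaded, one could encode three illustrative sentences (two paraphrases and one unrelated) with `model.encode([...])` and check that `cosine_similarity(emb[0], emb[1])` exceeds `cosine_similarity(emb[0], emb[2])`.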


Now apply the embedding model to the cleaned policy text column, storing the result in a new `embeddings` column of the same dataframe so the metadata stays attached
* For now, I haven't included section subheadings in the text; may explore doing so.
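
One caveat worth checking: according to its model card, `all-MiniLM-L6-v2` truncates input longer than 256 word pieces by default, so a long country-level text may only be partly represented in its embedding. A possible workaround, sketched here and not what this notebook currently does, is to split the text into chunks, embed each chunk, and average the results. The helper name `embed_long_text` and the chunk size of 150 words are assumptions, and the chunk size is untuned.

```python
import numpy as np


def embed_long_text(text, encode, max_words=150):
    """Embed a long text by averaging the embeddings of fixed-size word chunks.

    `encode` is any callable mapping a list of strings to a 2-D array
    (for example `model.encode`); `max_words` is an assumed chunk size.
    """
    words = text.split()
    # consecutive windows of at most `max_words` words
    chunks = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    if not chunks:
        chunks = [""]  # keep empty input from producing an empty batch
    vectors = np.asarray(encode(chunks), dtype=float)
    # mean-pool the chunk embeddings into one vector for the whole text
    return vectors.mean(axis=0)
```

It could be applied as `df['text_clean'].apply(lambda x: embed_long_text(x, model.encode).tolist())`. Averaging weights every chunk equally, which is a design choice and not the only reasonable one.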

In [13]:
# apply the embedding model to the policy text column, storing each embedding as a plain Python list
df['embeddings'] = df['text_clean'].apply(lambda x: model.encode(x).tolist())

# check results
# print(len(df['embeddings'][0]))  # embedding dimension (lists have no .shape)
df

Unnamed: 0,country,text_clean,embeddings
0,Austria,federal purchase subsidy scheme e mobilität 20...,"[-0.03710819408297539, 0.02366330847144127, 0...."
1,Belgium,regional purchase subsidies are no longer avai...,"[-0.0001772841642377898, 0.028538541868329048,..."
2,Bulgaria,no purchase subsidies are currently available ...,"[-0.04008038714528084, 0.0288896132260561, 0.0..."
3,Croatia,call for investment in zero emission vehicles ...,"[-0.003128104144707322, 0.044598959386348724, ..."
4,Cyprus,under national electromobility promotion schem...,"[-0.025078408420085907, 0.06687013059854507, -..."
5,Czech Republic,no purchase subsidies available in 2025 bevs f...,"[-0.04738859832286835, 0.05021532252430916, -0..."
6,Denmark,no direct national purchase subsidies are avai...,"[-0.06332932412624359, 0.05011808127164841, 0...."
7,Estonia,since december 2023 estonia no longer offers d...,"[-0.029093794524669647, 0.003588747698813677, ..."
8,Finland,2025 finland does not offer direct purchase su...,"[-0.06467203795909882, 0.0142294242978096, 0.0..."
9,France,ecological bonus bonus écologique remains in p...,"[0.018083663657307625, 0.05167808011174202, 0...."
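
The row-by-row `apply` above calls the model once per text. `model.encode` also accepts a list of strings and returns one row per input, so the whole column can be passed in a single call and batched by the encoder. A minimal sketch of that variant follows; the helper name `add_embeddings` is hypothetical, and `encode` stands for any callable with that list-in, rows-out behaviour.

```python
import pandas as pd


def add_embeddings(df, encode, text_col="text_clean", out_col="embeddings"):
    """Return a copy of `df` with one embedding list per row, encoded in a single call."""
    out = df.copy()
    # one call over the whole column lets the encoder batch its inputs
    vectors = encode(out[text_col].tolist())
    # store each row vector as a plain list of floats, aligned on the original index
    out[out_col] = pd.Series([list(map(float, v)) for v in vectors], index=out.index)
    return out
```

In this notebook the call would be `df = add_embeddings(df, model.encode)`. For ten rows the difference is negligible; it matters more if the subsectioned file with many more rows is used.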


In [14]:
# save out DF with embeddings
df.to_csv(interim_path/"policy_embeddings_country.csv", index=False)
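
A point to keep in mind when reading this file back: CSV has no list type, so each embedding is written as its string representation and comes back from `pd.read_csv` as a string, not a list. The sketch below demonstrates the round trip on a toy frame (the values are illustrative, not real embeddings) and parses the strings back with `json.loads`.

```python
import io
import json

import pandas as pd

# toy frame standing in for the notebook's `df` (values are illustrative)
toy = pd.DataFrame({"country": ["A", "B"], "embeddings": [[0.1, -0.2], [0.3, 0.4]]})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
# lists come back as strings such as "[0.1, -0.2]"
is_string = isinstance(reloaded["embeddings"][0], str)
# parse them back into lists of floats
reloaded["embeddings"] = reloaded["embeddings"].apply(json.loads)
```

If the embeddings are only ever consumed from Python, a binary format that preserves arrays may be a simpler choice, at the cost of a file that is no longer human-readable.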