In [40]:
%conda install pydantic pandas langchain langchain-openai tqdm sentence-transformers torch

Channels:
 - defaults
 - conda-forge
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/samu/projects/vtt-innovation-duplication/.conda

  added / updated specs:
    - langchain
    - langchain-openai
    - pandas
    - pydantic
    - sentence-transformers
    - tqdm


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    aiohappyeyeballs-2.4.4     |  py313h06a4308_0          27 KB
    aiohttp-3.11.10            |  py313h5eee18b_0         945 KB
    aiosignal-1.2.0            |     pyhd3eb1b0_0          12 KB
    aws-c-auth-0.8.1           |       h205f482_0         106 KB  conda-forge
    aws-c-cal-0.8.1            |       h1a47875_3          46 KB  conda-forge
    aws-c-common-0.10.6        |       hb9d3cd8_0         231 KB  conda-forge
    aws-c-compression-0.3.0    |       h4e1184b_5          19 

### VTT challenge: Innovation Ambiguity

#### Source Dataframe
- Websites from finnish companies that mention 'VTT' on their website
- `Orbis ID`, also `VAT id` is a unique identifier for organizations, later used to merge different alias of the same organization to one unique id

In [4]:

# 1. original source dataframe
import pandas as pd
df = pd.read_csv('data/dataframes/vtt_mentions_comp_domain.csv')
df = df[df['Website'].str.startswith('www.')]
df['source_index'] = df.index

print(f"DF with content from {len(df)} websites of {len(df['Company name'].unique())} different companies ")
df.head(3)

DF with content from 1100 websites of 270 different companies 


Unnamed: 0,Orbis ID,Company name,Website,Link,Title,date_obtained,Type,text_content,Company_Name_Alias,source_index
0,FI14636114,FORTUM OYJ,www.fortum.com,https://www.fortum.com/media/2020/04/fortum-aw...,Fortum awarded contract for decommissioning of...,2024-09-18,Website,"Press release\n 08 April 2020, 7:00 EEST ...","['Fortum Oyj', 'Fortum Markets AB', 'Fortum Ab...",0
1,FI14636114,FORTUM OYJ,www.fortum.com,https://www.fortum.com/media/2023/03/finnish-g...,Finnish Government grants operating licence fo...,2024-08-16,Website,"Press release\n 30 March 2023, 13:25 EEST ...","['Fortum Oyj', 'Fortum Markets AB', 'Fortum Ab...",1
2,FI14636114,FORTUM OYJ,www.fortum.com,https://www.fortum.com/media/2021/05/apros-sof...,"AprosÂ® software, developed by Fortum and VTT,...",2024-09-12,Website,"Online news\n 21 May 2021, 9:30 EEST ...","['Fortum Oyj', 'Fortum Markets AB', 'Fortum Ab...",2


##### End-to-End relationship extraction
- Based on the above website content, entities of the type `Organization` and `Innovation` are extracted, as well as their type of relationship
- `Collaboration` between Organization and `Developed_by` between Innovation and Organization
- The relationships are stored in a custom object as displayed below: 

In [5]:

# 2.1. example of custom python object of data
from typing import List, Dict, Optional
from pydantic import BaseModel, Field

class Node(BaseModel):
    """Represents a node in the knowledge graph."""
    id: str # unique identifier for node of type 'Organisation', else 'name provided by llm' for type: 'Innovation'
    type: str # allowed node types: 'Organization', 'Innovation'
    properties: Dict[str, str] = Field(default_factory=dict)

class Relationship(BaseModel):
    """Represents a relationship between two nodes in the knowledge graph."""
    source: str
    source_type: str # allowed node types: 'Organization', 'Innovation'
    target: str 
    target_type: str # allowed node types: 'Organization', 'Innovation'
    type: str # allowed relationship types: 'DEVELOPED_BY', 'COLLABORATION'
    properties: Dict[str, str] = Field(default_factory=dict)

class Document(BaseModel):
    page_content:str # manually appended - source text of website
    metadata: Dict[str, str] = Field(default_factory=dict) # metadata including source URL and document ID

class GraphDocument(BaseModel):
    """Represents a complete knowledge graph extracted from a document."""
    nodes: List[Node] = Field(default_factory=list)
    relationships: List[Relationship] = Field(default_factory=list)
    source_document: Optional[Document] = None

# 2.2 loading example of custom graph document
# example file naming convention
print("The extracted graph documents are saved as f'{df['Company name'].replace(' ','_')}_{df['source_index'].pkl}.pkl, under data/graph_docs/ \n")

for i, row in df[:3].iterrows():
    print(f"{i}: 'data/graph_docs/' + {row['Company name'].replace(' ','_')}_{row['source_index']}.pkl")

The extracted graph documents are saved as f'{df['Company name'].replace(' ','_')}_{df['source_index'].pkl}.pkl, under data/graph_docs/ 

0: 'data/graph_docs/' + FORTUM_OYJ_0.pkl
1: 'data/graph_docs/' + FORTUM_OYJ_1.pkl
2: 'data/graph_docs/' + FORTUM_OYJ_2.pkl


In [6]:
# 2.3 loading example of custom graph document

import pickle, os
path = 'data/graph_docs/'
index = 0

# load graph document
with open(os.path.join(path, os.listdir(path)[index]), 'rb') as doc:
    graph_doc = pickle.load(doc)

print(f"Example custom graph document:\n\n {graph_doc} \n\n ")

print("Example custom graph document nodes :\n")
for doc in graph_doc:
    for node in doc.nodes:
        print(f"- {node.id} ({node.type})    :   {node.properties['description']}")

print("\nExample custom graph document relationships:\n")
for doc in graph_doc:
    for relationship in doc.relationships:
        print(f"- {relationship.source} ({relationship.source_type}) - {relationship.type} -> {relationship.target} ({relationship.target_type})    :    description: {relationship.properties['description']}")


Example custom graph document:

 [GraphDocument(nodes=[Node(id='pilottilinja', type='Innovation', properties={'english_id': 'pilottilinja', 'description': "A pilot production line at VTT's Koivuranta facilities in Jyväskylä that produces fiber fabrics using new manufacturing methods to reduce fossil dependency."}), Node(id='Teknologian tutkimuskeskus VTT', type='Organization', properties={'english_id': 'Teknologian tutkimuskeskus VTT', 'description': 'VTT is a technology research center that has developed new manufacturing methods for fiber fabrics using Finnish raw materials and energy-efficient foam technology.'})], relationships=[Relationship(source='pilottilinja', source_type='Innovation', target='Teknologian tutkimuskeskus VTT', target_type='Organization', type='DEVELOPED_BY', properties={'description': 'The pilot production line was developed by VTT Technology Research Center in Jyväskylä, focusing on reducing fossil dependency in fiber fabrics.'})], source_document=Document(meta

#### Name ambiguity resolution
- within the source text, variation/ alias of organization name lead to ambiguity
- this ambiguity is partly solved by mapping organization to a unique identifier: `VAT ID`
- the dict: `entity_glossary` stores Ids and Alias as key-value pairs

In [7]:
# 3. load entity glossary
import json
entity_glossary = json.load(open('data/entity_glossary/entity_glossary.json', 'r', encoding = 'utf-8'))
print(entity_glossary.get('FI26473754'))

{'alias': ['Valtion teknillinen tutkimuskeskus', 'VTT:ltä', 'Teknologian tutkimuskeskus VTT:llÃ¤', 'VTT â\x80\x93 beyond the obvious', 'VTT Oy', 'VTT TECHNICAL RESEAR CH CENTRE OF FINLAND LT D', 'VTT Technical Research Centres of Finland', 'VTT iBEX', 'Tehnologian tutkimuskeskus VTT', 'VTT:n', 'VTT technical research centre', 'VTT Technical Research Centre of Finland Ltd.', 'VTT:ltÃ¤', 'VTT (Technical Research Center of Finland)', 'Teknologian tutkimuskeskus VTT Oy', 'centre de recherche technique de Finlande', 'VTT:llÃ¤', 'Teknologian tutkimuskeskus VTT', 'TEKNOLOGIAN TUTKIMUSKESKUS VTT OY', 'VTT T echnical Research Centre of Finland', 'VTT TECHNICAL RESEARCH CENTRE OF FINLAND LTD', 'Technical Research Centre VTT', 'VTT Technical Research Centre', 'VTT Technical Research Centre in Finland', 'VTT Technical Research Centre of Finland Ltd', 'VTT-R-0025 5-20', 'VTT:t\x0224', 'VTT:llä', 'VTT LaunchPad', 'Technologian tutkimuskeskus VTT Oy', 'Facebook, LinkedIn, YouTube and Instagram', 'Tek

In [8]:
# 2.3 loading example of custom graph document

import pickle, os
path = 'data/graph_docs_names_resolved/'
index = 0

# load graph document
with open(os.path.join(path, os.listdir(path)[index]), 'rb') as doc:
    graph_doc = pickle.load(doc)

print(f"Example custom graph document:\n\n {graph_doc} \n\n ")

print("Example custom graph document nodes :\n")
for doc in graph_doc:
    for node in doc.nodes[:3]:
        print(f"- {node.id} ({node.type})    :   {node.properties['description']}")

print("\nExample custom graph document relationships:\n")
for doc in graph_doc:
    for relationship in doc.relationships[:3]:
        print(f"- {relationship.source} ({relationship.source_type}) - {relationship.type} -> {relationship.target} ({relationship.target_type})    :    description: {relationship.properties['description']}")


Example custom graph document:

 [GraphDocument(nodes=[Node(id='pilottilinja', type='Innovation', properties={'english_id': 'pilottilinja', 'description': "A pilot production line at VTT's Koivuranta facilities in Jyväskylä that produces fiber fabrics using new manufacturing methods to reduce fossil dependency."}), Node(id='FI26473754', type='Organization', properties={'english_id': 'Teknologian tutkimuskeskus VTT', 'description': 'VTT is a technology research center that has developed new manufacturing methods for fiber fabrics using Finnish raw materials and energy-efficient foam technology.'})], relationships=[Relationship(source='pilottilinja', source_type='Innovation', target='FI26473754', target_type='Organization', type='DEVELOPED_BY', properties={'description': 'The pilot production line was developed by VTT Technology Research Center in Jyväskylä, focusing on reducing fossil dependency in fiber fabrics.'})], source_document=Document(metadata={'source_id': 'KESKI-SUOMEN MEDIA O

In [9]:
# transform graph document into dataframe
import pandas as pd
from tqdm import tqdm


df_relationships_comp_url = pd.DataFrame(index= None)

with tqdm(total= len(df), desc="Entities resolved") as pbar:
    for i, row in df.iterrows(): 
        try:     
            Graph_Docs = pickle.load(open(os.path.join('data/graph_docs_names_resolved/', f"{row['Company name'].replace(' ','_')}_{i}.pkl"), 'rb'))[0] # load graph doc
                
            node_description = {} # unique identifier
            node_en_id = {}
            for node in Graph_Docs.nodes:
                node_description[node.id] = node.properties['description']
                node_en_id[node.id] = node.properties['english_id']

            # get relationship triplets
            relationship_rows = []
            for i in range(len(Graph_Docs.relationships)):
            
                relationship_rows.append({
                    "Document number": row['source_index'],
                    "Source Company": row["Company name"],
                    "relationship description": Graph_Docs.relationships[i].properties['description'],
                    "source id": Graph_Docs.relationships[i].source,
                    "source type": Graph_Docs.relationships[i].source_type,
                    "source english_id": node_en_id.get(Graph_Docs.relationships[i].source, None),
                    "source description": node_description.get(Graph_Docs.relationships[i].source, None),
                    "relationship type": Graph_Docs.relationships[i].type,
                    "target id": Graph_Docs.relationships[i].target,
                    "target type": Graph_Docs.relationships[i].target_type,
                    "target english_id": node_en_id.get(Graph_Docs.relationships[i].target, None),
                    "target description": node_description.get(Graph_Docs.relationships[i].target, None),
                    "Link Source Text": row["Link"],
                    "Source Text": row["text_content"],
                })

            df_relationships_comp_url = pd.concat([df_relationships_comp_url, pd.DataFrame(relationship_rows, index= None)], ignore_index=True)

        except:
            continue

        pbar.update(1)

df_relationships_comp_url.head(5)

Entities resolved:   0%|          | 0/1100 [00:00<?, ?it/s]

Entities resolved:  96%|█████████▌| 1053/1100 [00:02<00:00, 362.82it/s]


Unnamed: 0,Document number,Source Company,relationship description,source id,source type,source english_id,source description,relationship type,target id,target type,target english_id,target description,Link Source Text,Source Text
0,0,FORTUM OYJ,Fortum Corporation is responsible for executin...,First nuclear decommissioning project in Finla...,Innovation,First nuclear decommissioning project in Finla...,The first nuclear reactor decommissioning proj...,DEVELOPED_BY,FI14636114,Organization,Fortum Corporation,A company with over 40 years of experience in ...,https://www.fortum.com/media/2020/04/fortum-aw...,"Press release\n 08 April 2020, 7:00 EEST ..."
1,0,FORTUM OYJ,VTT Technical Research Centre of Finland Ltd c...,First nuclear decommissioning project in Finla...,Innovation,First nuclear decommissioning project in Finla...,The first nuclear reactor decommissioning proj...,DEVELOPED_BY,FI26473754,Organization,VTT Technical Research Centre of Finland Ltd,A research organization in Finland contracting...,https://www.fortum.com/media/2020/04/fortum-aw...,"Press release\n 08 April 2020, 7:00 EEST ..."
2,0,FORTUM OYJ,Fortum Corporation and VTT Technical Research ...,FI14636114,Organization,Fortum Corporation,A company with over 40 years of experience in ...,COLLABORATION,FI26473754,Organization,VTT Technical Research Centre of Finland Ltd,A research organization in Finland contracting...,https://www.fortum.com/media/2020/04/fortum-aw...,"Press release\n 08 April 2020, 7:00 EEST ..."
3,0,FORTUM OYJ,Fortum Corporation has been awarded and is con...,FiR1 nuclear reactor decommissioning,Innovation,FiR1 nuclear reactor decommissioning,"The process of planning, preparatory measures,...",DEVELOPED_BY,FI14636114,Organization,Fortum Corporation,A company with over 40 years of experience in ...,https://www.fortum.com/media/2020/04/fortum-aw...,"Press release\n 08 April 2020, 7:00 EEST ..."
4,0,FORTUM OYJ,The Radiation and Nuclear Safety Authority of ...,FiR1 nuclear reactor decommissioning,Innovation,FiR1 nuclear reactor decommissioning,"The process of planning, preparatory measures,...",DEVELOPED_BY,temp_1,Organization,Radiation and Nuclear Safety Authority of Finl...,The national regulatory authority supervising ...,https://www.fortum.com/media/2020/04/fortum-aw...,"Press release\n 08 April 2020, 7:00 EEST ..."


#### Innovation and Collaboration disclosure on VTT-domain
- in addition to the discussion of VTT contribution on company websites, the second datasource includes websites under the vtt domain that discuss collaboration with other companies
- the list of source urls is provided under `data/dataframes/comp_mentions_vtt_domain.vsc`
- the extract relationships as custom objects are provided under `data/dataframes/graph_docs_vtt_domain`
- the extract relationships with organization resolution under `data/dataframes/graph_docs_vtt_domain`

In [26]:
# transform graph document into dataframe
import pandas as pd
from tqdm import tqdm

df_relationships_vtt_domain = pd.DataFrame(index= None)
df_vtt_domain = pd.read_csv('data/dataframes/comp_mentions_vtt_domain.csv')

with tqdm(total= len(df_vtt_domain), desc="Entities resolved") as pbar:
    for index_source, row in df_vtt_domain.iterrows(): 
        try:     
            Graph_Docs = pickle.load(open(os.path.join('data/graph_docs_vtt_domain_names_resolved/', f"{row['Vat_id'].replace(' ','_')}_{index_source}.pkl"), 'rb'))[0] # load graph doc
                
            node_description = {} # unique identifier
            node_en_id = {}
            for node in Graph_Docs.nodes:
                node_description[node.id] = node.properties['description']
                node_en_id[node.id] = node.properties['english_id']

            # get relationship triplets
            relationship_rows = []
            for i in range(len(Graph_Docs.relationships)):
            
                relationship_rows.append({
                    "Document number": index_source,
                    "VAT id": row["Vat_id"],
                    "relationship description": Graph_Docs.relationships[i].properties['description'],
                    "source id": Graph_Docs.relationships[i].source,
                    "source type": Graph_Docs.relationships[i].source_type,
                    "source english_id": node_en_id.get(Graph_Docs.relationships[i].source, None),
                    "source description": node_description.get(Graph_Docs.relationships[i].source, None),
                    "relationship type": Graph_Docs.relationships[i].type,
                    "target id": Graph_Docs.relationships[i].target,
                    "target type": Graph_Docs.relationships[i].target_type,
                    "target english_id": node_en_id.get(Graph_Docs.relationships[i].target, None),
                    "target description": node_description.get(Graph_Docs.relationships[i].target, None),
                    "Link Source Text": row["source_url"],
                    "Source Text": row["main_body"],
                })

            df_relationships_vtt_domain = pd.concat([df_relationships_vtt_domain, pd.DataFrame(relationship_rows, index= None)], ignore_index=True)

        except:
            continue

        
        pbar.update(1)

df_relationships_vtt_domain.head(5)

Entities resolved:  84%|████████▎ | 1159/1387 [00:03<00:00, 339.66it/s]


Unnamed: 0,Document number,VAT id,relationship description,source id,source type,source english_id,source description,relationship type,target id,target type,target english_id,target description,Link Source Text,Source Text
0,0,FI10292588,"FiR 1 nuclear research reactor was developed, ...",FiR 1,Innovation,FiR 1,FiR 1 is a Triga-type nuclear research reactor...,DEVELOPED_BY,FI26473754,Organization,VTT Technical Research Centre of Finland Ltd.,VTT is a Finnish research and innovation partn...,https://www.vttresearch.com/en/news-and-ideas/...,Skip to main content Beyond the obvious Open m...
1,0,FI10292588,Centre for Nuclear Safety is being developed a...,Centre for Nuclear Safety,Innovation,Centre for Nuclear Safety,A modern research facility under construction ...,DEVELOPED_BY,FI26473754,Organization,VTT Technical Research Centre of Finland Ltd.,VTT is a Finnish research and innovation partn...,https://www.vttresearch.com/en/news-and-ideas/...,Skip to main content Beyond the obvious Open m...
2,3,FI08932048,The innovation approach 'Beyond the obvious' i...,Beyond the obvious,Innovation,Beyond the obvious,An innovation approach promising to provide so...,DEVELOPED_BY,FI26473754,Organization,VTT Technical Research Centre of Finland Ltd,A visionary research and innovation partner fo...,https://www.vttresearch.com/en/news-and-ideas/...,Skip to main content Beyond the obvious Open m...
3,4,FI01111693,Data-Driven Bioeconomy project is developed by...,Data-Driven Bioeconomy project,Innovation,Data-Driven Bioeconomy project,An innovation using Big Data for sustainable u...,DEVELOPED_BY,FI26473754,Organization,VTT Technical Research Centre of Finland,A Finnish research and innovation partner work...,https://www.vttresearch.com/en/news-and-ideas/...,Skip to main content Beyond the obvious Open m...
4,4,FI01111693,Data-Driven Bioeconomy project's forestry pilo...,Data-Driven Bioeconomy project,Innovation,Data-Driven Bioeconomy project,An innovation using Big Data for sustainable u...,DEVELOPED_BY,temp_1141,Organization,MHG Systems,An organization leading pilots developing fore...,https://www.vttresearch.com/en/news-and-ideas/...,Skip to main content Beyond the obvious Open m...


In [41]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

  from .autonotebook import tqdm as notebook_tqdm
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


In [25]:
df_relationships_vtt_domain.to_csv("results/df_relationships_vtt_domain.csv")

NameError: name 'df_relationships_vtt_domain' is not defined

In [43]:
from langchain_openai import AzureOpenAIEmbeddings
import json
import numpy as np

def initialize_embeddings(config_file_path: str = 'data/keys/azure_config.json', model_key: str = 'gpt-4o-mini'):
    """Initialize Azure OpenAI embeddings client"""
    with open(config_file_path, 'r') as jsonfile:
        config = json.load(jsonfile)
    
    # Extract base URL (remove the specific deployment path)
    api_base = config[model_key]['api_base'].split('/openai/deployments')[0]
    
    return AzureOpenAIEmbeddings(
        api_key=config[model_key]['api_key'],
        azure_endpoint=api_base,
        api_version=config[model_key]['api_version'],
        azure_deployment=config[model_key]['emb_deployment']  # text-embedding-3-large
    )

# Initialize the embedding model
embedding_model = initialize_embeddings()

In [44]:
df_relationships_vtt_domain["text_to_compare"] = (
    df_relationships_vtt_domain["source id"].fillna("") + " " +
    df_relationships_vtt_domain["source english_id"].fillna("") + " " +
    df_relationships_vtt_domain["source description"].fillna(""))

df_relationships_vtt_domain = df_relationships_vtt_domain[df_relationships_vtt_domain["source type"] != "Organization"]
df_relationships_vtt_domain = df_relationships_vtt_domain.drop_duplicates(subset="source description", keep="first")


In [45]:
texts_to_embed = df_relationships_vtt_domain["text_to_compare"].tolist()
embeddings = embedding_model.embed_documents(texts_to_embed)

# Convert to tensor if needed (embeddings will be a list of lists)
import torch
embeddings = torch.tensor(embeddings)

In [46]:
print(embeddings)

tensor([[-0.0183,  0.0210, -0.0209,  ..., -0.0014,  0.0089, -0.0081],
        [-0.0058,  0.0123, -0.0204,  ...,  0.0015,  0.0048, -0.0181],
        [-0.0203,  0.0031, -0.0167,  ..., -0.0059, -0.0102, -0.0097],
        ...,
        [ 0.0032,  0.0341, -0.0055,  ...,  0.0063,  0.0216, -0.0100],
        [-0.0119, -0.0013, -0.0217,  ...,  0.0063, -0.0168, -0.0109],
        [-0.0087,  0.0203, -0.0250,  ..., -0.0002, -0.0234, -0.0125]])


In [49]:
# 4. load api access credentials 
from langchain_openai import AzureChatOpenAI
import json

def initialize_llm(deployment_model:str, config_file_path:str= 'data/azure_config.json')->AzureChatOpenAI: 
    with open(config_file_path, 'r') as jsonfile:
        config = json.load(jsonfile)
    
    return AzureChatOpenAI(model =deployment_model,
                    api_key=config[deployment_model]['api_key'],
                    azure_endpoint = config[deployment_model]['api_base'],
                    api_version = config[deployment_model]['api_version'])

# initialize
#model = initialize_llm(deployment_model= 'gpt-4o-mini', config_file_path= 'data/keys/azure_config.json')
model = initialize_llm(deployment_model= 'gpt-4.1-mini', config_file_path= 'data/keys/azure_config.json')
#model = initialize_llm(deployment_model= 'gpt-4.1', config_file_path= 'data/keys/azure_config.json')

# example use:
prompt = 'Say hi'
model.invoke(prompt).content

system_prompt = "Compare the innovation description and return a score between 0 and 100 based on how confident you are that the two descriptions are the same innovation. Give a reasoning first and at the end return the score in the following format: 'Score: <score>'. e.g. score 80"

In [None]:
threshold = 0.80
similar_pairs = []

print("Comparing entries pair by pair...")
texts = df_relationships_vtt_domain["text_to_compare"].tolist()
i = 0
for i in tqdm(range(len(texts))):
    for j in range(i + 1, len(texts)):
        sim = util.cos_sim(embeddings[i], embeddings[j]).item()
        
        if sim > threshold:
            source_i = df_relationships_vtt_domain.iloc[i]["source description"]
            source_j = df_relationships_vtt_domain.iloc[j]["source description"]
            
            print(f"\nSimilarity: {sim:.3f} | From: {i+1} To: {j+1}")
            print(f"→ Source {i+1} description: {source_i}")
            print(f"→ Source {j+1} description: {source_j}")
            
            prompt = f"Innovation 1: {texts[i]}\nInnovation 2: {texts[j]}"
            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt}
             ]   
            response = model.invoke(messages).content
            print(response)
            
            similar_pairs.append((i, j, sim))
        else:
            i += 1
            if i != 10:
                continue
            i = 0
            print("NOT SIMILAR:")
            print(f"→ Source {i+1} description: {source_i}")
            print(f"→ Source {j+1} description: {source_j}")
            print(f"→ Similarity: {sim:.3f}")
            prompt = f"Innovation 1: {texts[i]}\nInnovation 2: {texts[j]}"
            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt}
             ]   
            response = model.invoke(messages).content
            print(response)

# Output results
print(f"\nFound {len(similar_pairs)} high-similarity pairs (> {threshold}):")
for i, j, score in similar_pairs:
    print(f"[{i}] '{df.iloc[i]['source english_id']}' <--> [{j}] '{df.iloc[j]['source english_id']}' | Similarity: {score:.3f}")

Comparing entries pair by pair...


  0%|          | 0/1504 [00:00<?, ?it/s]

NOT SIMILAR:
→ Source 1 description: A multiscale digital material modelling tool developed by VTT to forecast wear and optimize materials for reduced friction and wear in various industrial applications.
→ Source 2 description: A multiscale materials modeling concept developed at VTT Technical Research Centre of Finland aimed at tackling challenging material design, especially for hydrogen-related applications.
→ Similarity: 0.507
The first innovation describes FiR 1, a Triga-type nuclear research reactor in Espoo, Finland, that was operational from 1962 until it was decommissioned. It has been used for nuclear research, training, and medical radiation therapy.

The second innovation describes the Centre for Nuclear Safety, a modern research facility currently under construction by the same institution, VTT, in the same city, Otaniemi, Espoo. This new facility focuses on nuclear safety research and is equipped with advanced technology designed for studying radioactive materials and im

  0%|          | 0/1504 [00:28<?, ?it/s]


KeyboardInterrupt: 

#### assess to OpenAI endpoint
- for this challenge we want to provide you access to OpenAI models: 4o-mini, 4.1 or 4.1-mini
- `ASK @ VTT-stand for key :)`

In [32]:
# 4. load api access credentials 
from langchain_openai import AzureChatOpenAI
import json

def initialize_llm(deployment_model:str, config_file_path:str= 'data/azure_config.json')->AzureChatOpenAI: 
    with open(config_file_path, 'r') as jsonfile:
        config = json.load(jsonfile)
    
    return AzureChatOpenAI(model =deployment_model,
                    api_key=config[deployment_model]['api_key'],
                    azure_endpoint = config[deployment_model]['api_base'],
                    api_version = config[deployment_model]['api_version'])

# initialize
#model = initialize_llm(deployment_model= 'gpt-4o-mini', config_file_path= 'data/keys/azure_config.json')
model = initialize_llm(deployment_model= 'gpt-4.1-mini', config_file_path= 'data/keys/azure_config.json')
#model = initialize_llm(deployment_model= 'gpt-4.1', config_file_path= 'data/keys/azure_config.json')

# example use:
prompt = 'Say hi'
model.invoke(prompt).content

system_prompt = "Compare the innovation description and return a score between 0 and 100 based on how confident you are that the two descriptions are the same innovation. Give a reasoning first and at the end return the score in the following format: 'Score: <score>'. e.g. score 80"

In [34]:
threshold = 0.80
similar_pairs = []

print("Comparing entries pair by pair using openai prompt...")
texts = df_relationships_vtt_domain["text_to_compare"].tolist()
for i in tqdm(range(len(texts))):
    for j in range(i + 1, len(texts)):
        prompt = f"Innovation 1: {texts[i]}\nInnovation 2: {texts[j]}"
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]   
        response = model.invoke(messages).content
        print(response)
        sim = float(response.split("Score: ")[1].split("\n")[0])
        
        if sim > threshold:
            source_i = df_relationships_vtt_domain.iloc[i]["source description"]
            source_j = df_relationships_vtt_domain.iloc[j]["source description"]
            
            print(f"\nSimilarity: {sim:.3f} | From: {i+1} To: {j+1}")
            print(f"→ Source {i+1} description: {source_i}")
            print(f"→ Source {j+1} description: {source_j}")
            
            similar_pairs.append((i, j, sim))
            

# Output results
print(f"\nFound {len(similar_pairs)} high-similarity pairs (> {threshold}):")
for i, j, score in similar_pairs:
    print(f"[{i}] '{df.iloc[i]['source english_id']}' <--> [{j}] '{df.iloc[j]['source english_id']}' | Similarity: {score:.3f}")

Comparing entries pair by pair using openai prompt...


  0%|          | 0/1504 [00:00<?, ?it/s]

The first description refers to FiR 1, a specific Triga-type nuclear research reactor in Otaniemi, Espoo, Finland, that has been decommissioned after serving since 1962. It was used for nuclear research, training, and medical radiation therapy.

The second description refers to the Centre for Nuclear Safety, a modern research facility currently under construction by the same institute (VTT) in the same location, Otaniemi, Espoo, but focused on nuclear safety research with advanced technology for studying radioactive materials.

While both innovations are related to nuclear research and are associated with VTT in Otaniemi, Espoo, and share a thematic focus on nuclear science, they are different entities: one is an old, decommissioned research reactor, and the other is a new, modern facility under construction for safety research. They do not describe the same innovation.

Score: 20

Similarity: 20.000 | From: 1 To: 2
→ Source 1 description: FiR 1 is a Triga-type nuclear research reactor

  0%|          | 0/1504 [02:30<?, ?it/s]


KeyboardInterrupt: 