### VTT challenge: Innovation Ambiguity

#### Source Dataframe
- Web pages of Finnish companies that mention 'VTT'
- `Orbis ID` (also called `VAT id`) is a unique identifier for organizations; it is later used to merge different aliases of the same organization into one unique ID

In [11]:
# 1. original source dataframe
import pandas as pd
df = pd.read_csv('data/dataframes/vtt_mentions_comp_domain.csv')
df = df[df['Website'].str.startswith('www.')].copy()  # copy so the new column below is not set on a slice
df['source_index'] = df.index

print(f"DF with content from {len(df)} websites of {len(df['Company name'].unique())} different companies ")
df.head(3)

DF with content from 1100 websites of 270 different companies 


Unnamed: 0,Orbis ID,Company name,Website,Link,Title,date_obtained,Type,text_content,Company_Name_Alias,source_index
0,FI14636114,FORTUM OYJ,www.fortum.com,https://www.fortum.com/media/2020/04/fortum-aw...,Fortum awarded contract for decommissioning of...,2024-09-18,Website,"Press release\n 08 April 2020, 7:00 EEST ...","['Fortum Oyj', 'Fortum Markets AB', 'Fortum Ab...",0
1,FI14636114,FORTUM OYJ,www.fortum.com,https://www.fortum.com/media/2023/03/finnish-g...,Finnish Government grants operating licence fo...,2024-08-16,Website,"Press release\n 30 March 2023, 13:25 EEST ...","['Fortum Oyj', 'Fortum Markets AB', 'Fortum Ab...",1
2,FI14636114,FORTUM OYJ,www.fortum.com,https://www.fortum.com/media/2021/05/apros-sof...,"Apros® software, developed by Fortum and VTT,...",2024-09-12,Website,"Online news\n 21 May 2021, 9:30 EEST ...","['Fortum Oyj', 'Fortum Markets AB', 'Fortum Ab...",2


#### End-to-end relationship extraction
- Based on the website content above, entities of type `Organization` and `Innovation` are extracted, together with the type of relationship between them
- Relationship types: `COLLABORATION` between two Organizations and `DEVELOPED_BY` between an Innovation and an Organization
- The relationships are stored in a custom object, as displayed below:

In [12]:
# 2.1. example of custom python object of data
from typing import List, Dict, Optional
from pydantic import BaseModel, Field

class Node(BaseModel):
    """Represents a node in the knowledge graph."""
    id: str # unique identifier for nodes of type 'Organization'; name provided by the LLM for nodes of type 'Innovation'
    type: str # allowed node types: 'Organization', 'Innovation'
    properties: Dict[str, str] = Field(default_factory=dict)

class Relationship(BaseModel):
    """Represents a relationship between two nodes in the knowledge graph."""
    source: str
    source_type: str # allowed node types: 'Organization', 'Innovation'
    target: str 
    target_type: str # allowed node types: 'Organization', 'Innovation'
    type: str # allowed relationship types: 'DEVELOPED_BY', 'COLLABORATION'
    properties: Dict[str, str] = Field(default_factory=dict)

class Document(BaseModel):
    page_content: str # manually appended - source text of website
    metadata: Dict[str, str] = Field(default_factory=dict) # metadata including source URL and document ID

class GraphDocument(BaseModel):
    """Represents a complete knowledge graph extracted from a document."""
    nodes: List[Node] = Field(default_factory=list)
    relationships: List[Relationship] = Field(default_factory=list)
    source_document: Optional[Document] = None

# 2.2 file naming convention of the saved graph documents
print("The extracted graph documents are saved under data/graph_docs/ as f\"{row['Company name'].replace(' ', '_')}_{row['source_index']}.pkl\"\n")

for i, row in df[:3].iterrows():
    print(f"{i}: 'data/graph_docs/' + {row['Company name'].replace(' ','_')}_{row['source_index']}.pkl")

The extracted graph documents are saved under data/graph_docs/ as f"{row['Company name'].replace(' ', '_')}_{row['source_index']}.pkl"

0: 'data/graph_docs/' + FORTUM_OYJ_0.pkl
1: 'data/graph_docs/' + FORTUM_OYJ_1.pkl
2: 'data/graph_docs/' + FORTUM_OYJ_2.pkl
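
The naming convention printed above can be wrapped in a small helper so that the path is built the same way everywhere. This is only a sketch: `graph_doc_path` and `GRAPH_DOC_DIR` are hypothetical names introduced here, not part of the challenge code.

```python
import os

GRAPH_DOC_DIR = "data/graph_docs"  # directory named in the text above


def graph_doc_path(company_name: str, source_index: int, base_dir: str = GRAPH_DOC_DIR) -> str:
    """Build the pickle path for one row of the source dataframe (hypothetical helper)."""
    # spaces in the company name are replaced by underscores, then the source index is appended
    file_name = f"{company_name.replace(' ', '_')}_{source_index}.pkl"
    return os.path.join(base_dir, file_name)


print(graph_doc_path("FORTUM OYJ", 0))
```

Passing `row['Company name']` and `row['source_index']` from the source dataframe should reproduce the file names listed above.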


In [13]:
# 2.3 loading example of custom graph document

import pickle, os
path = 'data/graph_docs/'
index = 0

# load graph document (os.listdir order is arbitrary, so `index` picks an unspecified file)
with open(os.path.join(path, os.listdir(path)[index]), 'rb') as f:
    graph_doc = pickle.load(f)

print(f"Example custom graph document:\n\n {graph_doc} \n\n ")

print("Example custom graph document nodes :\n")
for doc in graph_doc:
    for node in doc.nodes:
        print(f"- {node.id} ({node.type})    :   {node.properties['description']}")

print("\nExample custom graph document relationships:\n")
for doc in graph_doc:
    for relationship in doc.relationships:
        print(f"- {relationship.source} ({relationship.source_type}) - {relationship.type} -> {relationship.target} ({relationship.target_type})    :    description: {relationship.properties['description']}")


Example custom graph document:

 [GraphDocument(nodes=[], relationships=[], source_document=Document(metadata={'source_id': 'TS-YHTYMA OY_797'}, page_content='\r\nTurun Sanomat                              Teknologian tutkimuskeskus VTT on nimittänyt hallitukseensa Turun yliopiston rehtorin Jukka Kolan. Kola valittiin tehtävään VTT:n yhtiökokouksessa. Toisena uutena jäsenenä hallitukseen valittiin Futuricen toimitusjohtaja Teemu Moisala. VTT:n hallituksen puheenjohtajana toimii Suomen ABB:n toimitusjohtaja Pekka Tiitinen. Hallituksessa jatkavat Pekka Tiitisen lisäksi teknologiajohtaja Heli Antila Fortum, talousjohtaja Harri Leiviskä Dayton Groupista, toimitusjohtaja Matti Hietanen (Suomen Malmijalostuksesta ja osastopäällikkö Marja-Riitta Pihlman työ- ja elinkeinoministeriöstä. Seuraa ja lue uutisia aiheesta Huomio! Tutustu ennen kommentoimista Turun Sanomien keskustelupalstan sääntöihin täällä. Uudet näkökulmat keskustelussa vievät asioita eteenpäin. Siksi Tur

#### Name ambiguity resolution
- Within the source text, variations and aliases of an organization's name lead to ambiguity
- This ambiguity is partly resolved by mapping each organization to a unique identifier, the `VAT ID`
- The dict `entity_glossary` stores IDs and aliases as key-value pairs
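
To illustrate how such a glossary can resolve a name, the sketch below inverts a toy glossary into an alias-to-ID lookup. It assumes each entry holds at least an `alias` list, as in the output printed further down; `build_alias_index` and `resolve_name` are hypothetical helpers, and the case-insensitive matching is one possible choice, not necessarily the one used to produce the resolved graph documents.

```python
from typing import Dict, List, Optional

# Toy glossary in the assumed shape of entity_glossary.json: ID -> {'alias': [...]}.
# IDs and aliases are a small subset of values that appear in this notebook's outputs.
toy_glossary: Dict[str, Dict[str, List[str]]] = {
    "FI26473754": {"alias": ["VTT Oy", "Teknologian tutkimuskeskus VTT", "VTT Technical Research Centre of Finland Ltd"]},
    "FI14636114": {"alias": ["Fortum Oyj", "Fortum Markets AB"]},
}


def build_alias_index(glossary: Dict[str, Dict[str, List[str]]]) -> Dict[str, str]:
    """Invert the glossary into a case-insensitive alias -> ID lookup."""
    index: Dict[str, str] = {}
    for vat_id, entry in glossary.items():
        for alias in entry.get("alias", []):
            # keep the first ID seen if two organizations share an alias
            index.setdefault(alias.strip().casefold(), vat_id)
    return index


def resolve_name(name: str, alias_index: Dict[str, str]) -> Optional[str]:
    """Return the ID for a known alias, or None if the name is not in the glossary."""
    return alias_index.get(name.strip().casefold())


alias_index = build_alias_index(toy_glossary)
print(resolve_name("vtt oy", alias_index))
print(resolve_name("Unknown Company", alias_index))
```

Exact matching like this only covers aliases already listed in the glossary; names that are not found would need a fallback such as a temporary ID.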

In [14]:
# 3. load entity glossary
import json
with open('data/entity_glossary/entity_glossary.json', 'r', encoding='utf-8') as f:
    entity_glossary = json.load(f)
print(entity_glossary.get('FI26473754'))

{'alias': ['Valtion teknillinen tutkimuskeskus', 'VTT:ltä', 'Teknologian tutkimuskeskus VTT:llÃ¤', 'VTT â\x80\x93 beyond the obvious', 'VTT Oy', 'VTT TECHNICAL RESEAR CH CENTRE OF FINLAND LT D', 'VTT Technical Research Centres of Finland', 'VTT iBEX', 'Tehnologian tutkimuskeskus VTT', 'VTT:n', 'VTT technical research centre', 'VTT Technical Research Centre of Finland Ltd.', 'VTT:ltÃ¤', 'VTT (Technical Research Center of Finland)', 'Teknologian tutkimuskeskus VTT Oy', 'centre de recherche technique de Finlande', 'VTT:llÃ¤', 'Teknologian tutkimuskeskus VTT', 'TEKNOLOGIAN TUTKIMUSKESKUS VTT OY', 'VTT T echnical Research Centre of Finland', 'VTT TECHNICAL RESEARCH CENTRE OF FINLAND LTD', 'Technical Research Centre VTT', 'VTT Technical Research Centre', 'VTT Technical Research Centre in Finland', 'VTT Technical Research Centre of Finland Ltd', 'VTT-R-0025 5-20', 'VTT:t\x0224', 'VTT:llä', 'VTT LaunchPad', 'Technologian tutkimuskeskus VTT Oy', 'Facebook, LinkedIn, YouTube and Instagram', 'Tek

In [15]:
# 3.1 loading example of a graph document with resolved organization names

import pickle, os
path = 'data/graph_docs_names_resolved/'
index = 0

# load graph document (os.listdir order is arbitrary, so `index` picks an unspecified file)
with open(os.path.join(path, os.listdir(path)[index]), 'rb') as f:
    graph_doc = pickle.load(f)

print(f"Example custom graph document:\n\n {graph_doc} \n\n ")

print("Example custom graph document nodes :\n")
for doc in graph_doc:
    for node in doc.nodes[:3]:
        print(f"- {node.id} ({node.type})    :   {node.properties['description']}")

print("\nExample custom graph document relationships:\n")
for doc in graph_doc:
    for relationship in doc.relationships[:3]:
        print(f"- {relationship.source} ({relationship.source_type}) - {relationship.type} -> {relationship.target} ({relationship.target_type})    :    description: {relationship.properties['description']}")


Example custom graph document:

 [GraphDocument(nodes=[], relationships=[], source_document=Document(metadata={'source_id': 'TS-YHTYMA OY_797'}, page_content='\r\nTurun Sanomat                              Teknologian tutkimuskeskus VTT on nimittänyt hallitukseensa Turun yliopiston rehtorin Jukka Kolan. Kola valittiin tehtävään VTT:n yhtiökokouksessa. Toisena uutena jäsenenä hallitukseen valittiin Futuricen toimitusjohtaja Teemu Moisala. VTT:n hallituksen puheenjohtajana toimii Suomen ABB:n toimitusjohtaja Pekka Tiitinen. Hallituksessa jatkavat Pekka Tiitisen lisäksi teknologiajohtaja Heli Antila Fortum, talousjohtaja Harri Leiviskä Dayton Groupista, toimitusjohtaja Matti Hietanen (Suomen Malmijalostuksesta ja osastopäällikkö Marja-Riitta Pihlman työ- ja elinkeinoministeriöstä. Seuraa ja lue uutisia aiheesta Huomio! Tutustu ennen kommentoimista Turun Sanomien keskustelupalstan sääntöihin täällä. Uudet näkökulmat keskustelussa vievät asioita eteenpäin. Siksi Tur

In [16]:
# transform graph documents into a dataframe of relationship triplets
import os
import pickle

import pandas as pd
from tqdm import tqdm

relationship_rows = []  # one dict per relationship, collected across all documents

with tqdm(total=len(df), desc="Entities resolved") as pbar:
    for source_index, row in df.iterrows():
        file_name = f"{row['Company name'].replace(' ', '_')}_{source_index}.pkl"
        try:
            with open(os.path.join('data/graph_docs_names_resolved/', file_name), 'rb') as f:
                graph_document = pickle.load(f)[0]  # each file holds a list of graph documents; use the first

            # look-up tables from node id to description and English name
            node_description = {node.id: node.properties['description'] for node in graph_document.nodes}
            node_en_id = {node.id: node.properties['english_id'] for node in graph_document.nodes}

            # one row per relationship triplet
            doc_rows = [{
                "Document number": row['source_index'],
                "Source Company": row["Company name"],
                "relationship description": rel.properties['description'],
                "source id": rel.source,
                "source type": rel.source_type,
                "source english_id": node_en_id.get(rel.source),
                "source description": node_description.get(rel.source),
                "relationship type": rel.type,
                "target id": rel.target,
                "target type": rel.target_type,
                "target english_id": node_en_id.get(rel.target),
                "target description": node_description.get(rel.target),
                "Link Source Text": row["Link"],
                "Source Text": row["text_content"],
            } for rel in graph_document.relationships]

        except (OSError, KeyError, IndexError):
            continue  # skip documents that are missing, empty, or lack a required property

        relationship_rows.extend(doc_rows)
        pbar.update(1)  # counts only successfully processed documents

# build the dataframe once instead of concatenating inside the loop
df_relationships_comp_url = pd.DataFrame(relationship_rows)
df_relationships_comp_url.head(5)

Entities resolved:  96%|█████████▌| 1053/1100 [00:00<00:00, 1760.65it/s]


Unnamed: 0,Document number,Source Company,relationship description,source id,source type,source english_id,source description,relationship type,target id,target type,target english_id,target description,Link Source Text,Source Text
0,0,FORTUM OYJ,Fortum Corporation is responsible for executin...,First nuclear decommissioning project in Finla...,Innovation,First nuclear decommissioning project in Finla...,The first nuclear reactor decommissioning proj...,DEVELOPED_BY,FI14636114,Organization,Fortum Corporation,A company with over 40 years of experience in ...,https://www.fortum.com/media/2020/04/fortum-aw...,"Press release\n 08 April 2020, 7:00 EEST ..."
1,0,FORTUM OYJ,VTT Technical Research Centre of Finland Ltd c...,First nuclear decommissioning project in Finla...,Innovation,First nuclear decommissioning project in Finla...,The first nuclear reactor decommissioning proj...,DEVELOPED_BY,FI26473754,Organization,VTT Technical Research Centre of Finland Ltd,A research organization in Finland contracting...,https://www.fortum.com/media/2020/04/fortum-aw...,"Press release\n 08 April 2020, 7:00 EEST ..."
2,0,FORTUM OYJ,Fortum Corporation and VTT Technical Research ...,FI14636114,Organization,Fortum Corporation,A company with over 40 years of experience in ...,COLLABORATION,FI26473754,Organization,VTT Technical Research Centre of Finland Ltd,A research organization in Finland contracting...,https://www.fortum.com/media/2020/04/fortum-aw...,"Press release\n 08 April 2020, 7:00 EEST ..."
3,0,FORTUM OYJ,Fortum Corporation has been awarded and is con...,FiR1 nuclear reactor decommissioning,Innovation,FiR1 nuclear reactor decommissioning,"The process of planning, preparatory measures,...",DEVELOPED_BY,FI14636114,Organization,Fortum Corporation,A company with over 40 years of experience in ...,https://www.fortum.com/media/2020/04/fortum-aw...,"Press release\n 08 April 2020, 7:00 EEST ..."
4,0,FORTUM OYJ,The Radiation and Nuclear Safety Authority of ...,FiR1 nuclear reactor decommissioning,Innovation,FiR1 nuclear reactor decommissioning,"The process of planning, preparatory measures,...",DEVELOPED_BY,temp_1,Organization,Radiation and Nuclear Safety Authority of Finl...,The national regulatory authority supervising ...,https://www.fortum.com/media/2020/04/fortum-aw...,"Press release\n 08 April 2020, 7:00 EEST ..."


#### Innovation and collaboration disclosure on the VTT domain
- In addition to company websites that discuss VTT's contributions, a second data source covers pages under the VTT domain that discuss collaboration with other companies
- The list of source URLs is provided under `data/dataframes/comp_mentions_vtt_domain.csv`
- The extracted relationships as custom objects are provided under `data/dataframes/graph_docs_vtt_domain`
- The extracted relationships with resolved organization names are provided under `data/graph_docs_vtt_domain_names_resolved/`

In [17]:
# transform graph documents from the VTT domain into a dataframe of relationship triplets
import os
import pickle

import pandas as pd
from tqdm import tqdm

df_vtt_domain = pd.read_csv('data/dataframes/comp_mentions_vtt_domain.csv')
relationship_rows = []  # one dict per relationship, collected across all documents

with tqdm(total=len(df_vtt_domain), desc="Entities resolved") as pbar:
    for index_source, row in df_vtt_domain.iterrows():
        file_name = f"{row['Vat_id'].replace(' ', '_')}_{index_source}.pkl"
        try:
            with open(os.path.join('data/graph_docs_vtt_domain_names_resolved/', file_name), 'rb') as f:
                graph_document = pickle.load(f)[0]  # each file holds a list of graph documents; use the first

            # look-up tables from node id to description and English name
            node_description = {node.id: node.properties['description'] for node in graph_document.nodes}
            node_en_id = {node.id: node.properties['english_id'] for node in graph_document.nodes}

            # one row per relationship triplet
            doc_rows = [{
                "Document number": index_source,
                "VAT id": row["Vat_id"],
                "relationship description": rel.properties['description'],
                "source id": rel.source,
                "source type": rel.source_type,
                "source english_id": node_en_id.get(rel.source),
                "source description": node_description.get(rel.source),
                "relationship type": rel.type,
                "target id": rel.target,
                "target type": rel.target_type,
                "target english_id": node_en_id.get(rel.target),
                "target description": node_description.get(rel.target),
                "Link Source Text": row["source_url"],
                "Source Text": row["main_body"],
            } for rel in graph_document.relationships]

        except (OSError, KeyError, IndexError):
            continue  # skip documents that are missing, empty, or lack a required property

        relationship_rows.extend(doc_rows)
        pbar.update(1)  # counts only successfully processed documents

# build the dataframe once instead of concatenating inside the loop
df_relationships_vtt_domain = pd.DataFrame(relationship_rows)
df_relationships_vtt_domain.head(5)

Entities resolved:  84%|████████▎ | 1159/1387 [00:00<00:00, 1489.97it/s]


Unnamed: 0,Document number,VAT id,relationship description,source id,source type,source english_id,source description,relationship type,target id,target type,target english_id,target description,Link Source Text,Source Text
0,0,FI10292588,"FiR 1 nuclear research reactor was developed, ...",FiR 1,Innovation,FiR 1,FiR 1 is a Triga-type nuclear research reactor...,DEVELOPED_BY,FI26473754,Organization,VTT Technical Research Centre of Finland Ltd.,VTT is a Finnish research and innovation partn...,https://www.vttresearch.com/en/news-and-ideas/...,Skip to main content Beyond the obvious Open m...
1,0,FI10292588,Centre for Nuclear Safety is being developed a...,Centre for Nuclear Safety,Innovation,Centre for Nuclear Safety,A modern research facility under construction ...,DEVELOPED_BY,FI26473754,Organization,VTT Technical Research Centre of Finland Ltd.,VTT is a Finnish research and innovation partn...,https://www.vttresearch.com/en/news-and-ideas/...,Skip to main content Beyond the obvious Open m...
2,3,FI08932048,The innovation approach 'Beyond the obvious' i...,Beyond the obvious,Innovation,Beyond the obvious,An innovation approach promising to provide so...,DEVELOPED_BY,FI26473754,Organization,VTT Technical Research Centre of Finland Ltd,A visionary research and innovation partner fo...,https://www.vttresearch.com/en/news-and-ideas/...,Skip to main content Beyond the obvious Open m...
3,4,FI01111693,Data-Driven Bioeconomy project is developed by...,Data-Driven Bioeconomy project,Innovation,Data-Driven Bioeconomy project,An innovation using Big Data for sustainable u...,DEVELOPED_BY,FI26473754,Organization,VTT Technical Research Centre of Finland,A Finnish research and innovation partner work...,https://www.vttresearch.com/en/news-and-ideas/...,Skip to main content Beyond the obvious Open m...
4,4,FI01111693,Data-Driven Bioeconomy project's forestry pilo...,Data-Driven Bioeconomy project,Innovation,Data-Driven Bioeconomy project,An innovation using Big Data for sustainable u...,DEVELOPED_BY,temp_1141,Organization,MHG Systems,An organization leading pilots developing fore...,https://www.vttresearch.com/en/news-and-ideas/...,Skip to main content Beyond the obvious Open m...


In [31]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

In [29]:
df_relationships_vtt_domain.to_csv("results/df_relationships_vtt_domain.csv")

In [51]:
# concatenate id, English name and description of each source node into one string
df_relationships_vtt_domain["text_to_compare"] = (
    df_relationships_vtt_domain["source id"].fillna("") + " " +
    df_relationships_vtt_domain["source english_id"].fillna("") + " " +
    df_relationships_vtt_domain["source description"].fillna(""))

# keep only rows whose source is not an organization, and drop rows with identical descriptions
df_relationships_vtt_domain = df_relationships_vtt_domain[df_relationships_vtt_domain["source type"] != "Organization"]
df_relationships_vtt_domain = df_relationships_vtt_domain.drop_duplicates(subset="source description", keep="first")

# embed the combined texts for the similarity comparison
embeddings = model.encode(df_relationships_vtt_domain["text_to_compare"].tolist(), convert_to_tensor=True)

In [52]:
len(df_relationships_vtt_domain)

1504
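
With 1504 entries there are 1504 × 1503 / 2 = 1,130,256 distinct pairs, so scoring them one at a time in Python is slow. A minimal sketch of computing all cosine similarities in one matrix product is shown below; it uses NumPy and toy vectors in place of the sentence embeddings, and `similar_pairs_above` is a hypothetical helper name.

```python
import numpy as np


def similar_pairs_above(embeddings: np.ndarray, threshold: float):
    """Return (i, j, similarity) for all i < j whose cosine similarity exceeds threshold."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # normalise rows, guarding against zero vectors
    sim = unit @ unit.T  # full cosine-similarity matrix in one product
    rows, cols = np.triu_indices(len(unit), k=1)  # upper triangle: each pair exactly once
    mask = sim[rows, cols] > threshold
    return [(int(i), int(j), float(sim[i, j])) for i, j in zip(rows[mask], cols[mask])]


# toy vectors standing in for sentence embeddings
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(similar_pairs_above(toy, threshold=0.80))
```

For the real data, the embeddings would first have to be available as a NumPy array (for example by encoding with `convert_to_numpy=True` instead of `convert_to_tensor=True`). A 1504 × 1504 similarity matrix fits comfortably in memory.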

In [54]:
threshold = 0.80
similar_pairs = []

print("Comparing entries pair by pair...")
texts = df_relationships_vtt_domain["text_to_compare"].tolist()
for i in tqdm(range(len(texts))):
    for j in range(i + 1, len(texts)):
        sim = util.cos_sim(embeddings[i], embeddings[j]).item()
        
        if sim > threshold:
            source_i = df_relationships_vtt_domain.iloc[i]["source description"]
            source_j = df_relationships_vtt_domain.iloc[j]["source description"]
            
            print(f"\nSimilarity: {sim:.3f} | From: {i+1} To: {j+1}")
            print(f"→ Source {i+1} description: {source_i}")
            print(f"→ Source {j+1} description: {source_j}")
            
            similar_pairs.append((i, j, sim))

# Output results
print(f"\nFound {len(similar_pairs)} high-similarity pairs (> {threshold}):")
for i, j, score in similar_pairs:
    print(f"[{i}] '{df_relationships_vtt_domain.iloc[i]['source english_id']}' <--> [{j}] '{df_relationships_vtt_domain.iloc[j]['source english_id']}' | Similarity: {score:.3f}")

Comparing entries pair by pair...


  1%|          | 8/1504 [00:05<15:56,  1.56it/s]


Similarity: 0.874 | From: 9 To: 10
→ Source 9 description: A compostable and lightweight packaging material developed by combining cellulose films with different, complementary properties, suitable for dry and greasy products, which extends shelf life and reduces food waste and microplastics problem.
→ Source 10 description: A compostable and lightweight packaging material made of cellulose films developed by VTT that reduces the use of plastics, extends shelf life of food, and reduces food waste and microplastics.


  1%|          | 10/1504 [00:06<15:38,  1.59it/s]


Similarity: 0.858 | From: 11 To: 12
→ Source 11 description: A new fungal production platform known as C1, developed for vaccines and therapeutic proteins, notable for low protease activity and high production efficiency.
→ Source 12 description: A new fungal production platform for vaccines and therapeutic proteins developed by improving Dyadic's C1 Myceliophthora thermophila fungus.


  1%|          | 13/1504 [00:08<15:37,  1.59it/s]


Similarity: 0.808 | From: 14 To: 38
→ Source 14 description: New technological components in silicon photonics and optical data transfer developed and commercialised through VTT and Hitachi cooperation.
→ Source 38 description: The silicon photonics products developed to create compact, efficient, and affordable optical signal-based devices for measuring, imaging, and data transfer.


  1%|▏         | 20/1504 [00:12<15:27,  1.60it/s]


Similarity: 0.828 | From: 21 To: 216
→ Source 21 description: An innovation project focused on printed intelligence, manufacturing rapid tests, and piloting them with real use cases, including rapid diagnostics test development and sweat sample analytics.
→ Source 216 description: VTT-led commercialization innovation project for rapid diagnostic tests for health and well-being, focusing on printed intelligence and development of diagnostic test samples and analysis methods.


  2%|▏         | 28/1504 [00:17<15:21,  1.60it/s]


Similarity: 0.863 | From: 28 To: 1273
→ Source 28 description: A new material made of renewable resources developed by Paptic that can replace plastic, paper, and canvas bags, with plastic-like features such as flexibility, moisture resistance, and heat-sealing, and is more water-resistant than paper.
→ Source 1273 description: A new renewable material developed by Paptic Oy to replace plastic, paper, and fabric, with plastic-like properties such as flexibility, water resistance, and heat sealability, made from Finnish softwood fibers and other biobased raw materials.

Similarity: 0.926 | From: 29 To: 30
→ Source 29 description: Innovative technologies and solutions for mechanical and rotating propulsion technology in extreme Arctic conditions, developed to increase thruster lifetimes, decrease maintenance needs, and improve reliability.
→ Source 30 description: Innovative technologies and solutions developed for mechanical and rotating propulsion technology in extreme Arctic conditio

  2%|▏         | 29/1504 [00:18<15:29,  1.59it/s]


Similarity: 0.973 | From: 30 To: 31
→ Source 30 description: Innovative technologies and solutions developed for mechanical and rotating propulsion technology in extreme Arctic conditions, improving thruster durability, reducing maintenance needs, and enhancing reliability.
→ Source 31 description: Innovative technologies and solutions developed for mechanical and rotating propulsion technology in extreme Arctic conditions, improving thruster durability, reducing maintenance, and increasing reliability, specifically under ice loads.

Similarity: 0.874 | From: 30 To: 234
→ Source 30 description: Innovative technologies and solutions developed for mechanical and rotating propulsion technology in extreme Arctic conditions, improving thruster durability, reducing maintenance needs, and enhancing reliability.
→ Source 234 description: Innovative technical solutions developed under ArTEco project for ship rudder propeller systems to improve durability, reduce maintenance needs and enhance r

  2%|▏         | 30/1504 [00:19<15:35,  1.57it/s]


Similarity: 0.878 | From: 31 To: 234
→ Source 31 description: Innovative technologies and solutions developed for mechanical and rotating propulsion technology in extreme Arctic conditions, improving thruster durability, reducing maintenance, and increasing reliability, specifically under ice loads.
→ Source 234 description: Innovative technical solutions developed under ArTEco project for ship rudder propeller systems to improve durability, reduce maintenance needs and enhance reliability in extreme conditions.


  2%|▏         | 31/1504 [00:19<15:36,  1.57it/s]


Similarity: 0.889 | From: 32 To: 89
→ Source 32 description: A project exploring the use of the 5G mobile network to improve road safety through fast data transmission solutions, collecting vehicle and road data for applications such as road weather services, road maintenance, and self-driving car control.
→ Source 89 description: An innovation project completed in 2018 aimed at improving road safety and reducing accidents by developing new road weather and road safety services to assist vehicle drivers, automated driving and road maintenance using 5G communication solutions.


  2%|▏         | 32/1504 [00:20<15:38,  1.57it/s]


Similarity: 0.844 | From: 33 To: 34
→ Source 33 description: A prototype adhesive ID tag based on blockchain technology that protects valuables without revealing their location.
→ Source 34 description: A prototype adhesive ID tag developed to protect valuables without revealing their location, using blockchain technology and smart contracts.


  2%|▏         | 35/1504 [00:22<15:23,  1.59it/s]


Similarity: 0.955 | From: 36 To: 37


  2%|▏         | 36/1504 [00:22<15:29,  1.58it/s]


Similarity: 0.811 | From: 36 To: 1222


  2%|▏         | 37/1504 [00:23<15:25,  1.58it/s]


Similarity: 0.805 | From: 37 To: 1222

Similarity: 0.873 | From: 38 To: 601
→ Source 38 description: The silicon photonics products developed to create compact, efficient, and affordable optical signal-based devices for measuring, imaging, and data transfer.
→ Source 601 description: Silicon photonics is a technology combining light and electricity for high-speed, energy-efficient, microscopic photonic integrated circuits enabling applications from medicine to autonomous transport.


  3%|▎         | 41/1504 [00:26<16:17,  1.50it/s]


Similarity: 0.939 | From: 42 To: 43
→ Source 42 description: A novel innovation developed by VTT and partner companies that uses 5G technology to transmit large 3D views between vehicles, increasing communication distances and improving road safety.
→ Source 43 description: A new solution developed by VTT and partners that enables real-time transmission of 3D views between vehicles using 5G technology to enhance road safety and automated driving.

Similarity: 0.875 | From: 42 To: 45
→ Source 42 description: A novel innovation developed by VTT and partner companies that uses 5G technology to transmit large 3D views between vehicles, increasing communication distances and improving road safety.
→ Source 45 description: An innovation utilizing 5G technology to transmit large 3D views between vehicles in real-time to increase communication range and safety on the roads.


  3%|▎         | 42/1504 [00:26<16:11,  1.51it/s]


Similarity: 0.918 | From: 43 To: 45
→ Source 43 description: A new solution developed by VTT and partners that enables real-time transmission of 3D views between vehicles using 5G technology to enhance road safety and automated driving.
→ Source 45 description: An innovation utilizing 5G technology to transmit large 3D views between vehicles in real-time to increase communication range and safety on the roads.


  3%|▎         | 45/1504 [00:28<15:52,  1.53it/s]


Similarity: 0.921 | From: 46 To: 156
→ Source 46 description: A European project coordinated by VTT developing commercial applications based on SOFC fuel cell technology, implementing 25 SOFC-based power generation solutions globally for energy-efficient and low-emission electricity and heat production.
→ Source 156 description: A five-year European project developing commercial-scale solid oxide fuel cell (SOFC) systems producing efficient and low-emission electricity and heat


  3%|▎         | 46/1504 [00:29<15:33,  1.56it/s]


Similarity: 0.847 | From: 47 To: 193
→ Source 47 description: A multiscale digital material modelling tool developed by VTT to forecast wear and optimize materials for reduced friction and wear in various industrial applications.
→ Source 193 description: An innovative multiscale digital material modeling tool developed by VTT to optimize material structures and predict wear, improving energy efficiency and reducing greenhouse gas emissions.


  3%|▎         | 52/1504 [00:33<15:32,  1.56it/s]


KeyboardInterrupt: 
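
Several of the pairs above overlap (for example 30/31, 30/234 and 31/234), which suggests they describe the same innovation. One way to turn pairs into groups is a union-find pass over the pair list. The sketch below is an assumption about a possible next step, not part of the original pipeline; `cluster_pairs` is a hypothetical helper, and the example pairs are copied from the output above using its 1-based numbering.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster_pairs(pairs: Iterable[Tuple[int, int, float]]) -> List[List[int]]:
    """Group indices that are connected by at least one high-similarity pair (union-find)."""
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for i, j, _score in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)  # merge the two groups

    groups = defaultdict(list)
    for x in parent:
        groups[find(x)].append(x)
    return sorted(sorted(members) for members in groups.values())


# (from, to, similarity) triples taken from the printed output
example_pairs = [
    (29, 30, 0.926), (30, 31, 0.973), (30, 234, 0.874), (31, 234, 0.878),
    (42, 43, 0.939), (42, 45, 0.875), (43, 45, 0.918),
]
print(cluster_pairs(example_pairs))
```

Grouping is transitive, so a chain of loosely related entries can end up in one cluster; a stricter threshold or a manual check of each group may be needed before merging innovations.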

#### Access to OpenAI endpoint
- For this challenge we provide access to the OpenAI models gpt-4o-mini, gpt-4.1 and gpt-4.1-mini
- `ASK @ VTT-stand for key :)`

In [21]:
# 4. load api access credentials
# requires the `langchain-openai` package (pip install langchain-openai)
from langchain_openai import AzureChatOpenAI
import json

def initialize_llm(deployment_model: str, config_file_path: str = 'data/azure_config.json') -> AzureChatOpenAI:
    """Create a chat client for one deployment, using the credentials stored in the JSON config file."""
    with open(config_file_path, 'r') as jsonfile:
        config = json.load(jsonfile)

    return AzureChatOpenAI(model=deployment_model,
                    api_key=config[deployment_model]['api_key'],
                    azure_endpoint=config[deployment_model]['api_base'],
                    api_version=config[deployment_model]['api_version'])

# initialize one of the available deployments; `llm` is used so the embedding `model` above is not overwritten
# llm = initialize_llm(deployment_model='gpt-4o-mini', config_file_path='data/keys/azure_config.json')
# llm = initialize_llm(deployment_model='gpt-4.1-mini', config_file_path='data/keys/azure_config.json')
llm = initialize_llm(deployment_model='gpt-4.1', config_file_path='data/keys/azure_config.json')

# example use:
prompt = ''
llm.invoke(prompt).content

ModuleNotFoundError: No module named 'langchain_openai'
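
`initialize_llm` reads `api_key`, `api_base` and `api_version` from an entry keyed by the deployment name. The sketch below shows the layout this implies, with placeholder values only, and a small check for missing keys. `missing_config_keys` is a hypothetical helper, and the endpoint format is an assumption; the real values come with the key from the VTT stand.

```python
import json

REQUIRED_KEYS = ("api_key", "api_base", "api_version")  # keys read by initialize_llm


def missing_config_keys(config: dict, deployment_model: str) -> list:
    """Return the required keys that are absent for one deployment (hypothetical helper)."""
    entry = config.get(deployment_model, {})
    return [key for key in REQUIRED_KEYS if key not in entry]


# Placeholder values only; nothing here is a real credential or endpoint.
example_config = json.loads("""
{
  "gpt-4o-mini": {
    "api_key": "<your-key>",
    "api_base": "https://<your-resource>.openai.azure.com/",
    "api_version": "<api-version>"
  }
}
""")

print(missing_config_keys(example_config, "gpt-4o-mini"))  # nothing missing
print(missing_config_keys(example_config, "gpt-4.1"))      # deployment not configured
```

Running such a check before calling `initialize_llm` turns a `KeyError` deep inside the helper into a readable list of what is missing.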