# **Graph creation and cleaning**

Because the relation data needed further manipulation, the relations were extracted again from the spaCy files labelled in the first part of the project, following the same methodology.
By the end of this notebook, three files will have been produced:
 - labels_mapping.json, a dictionary that maps each entity category to the values it can take
 - graph_data.json, which holds the processed graph in the format required by the networkx library
 - node_id_text.json, which maps the ids of the graph nodes to their text

## **Support functions:**

- **extract_relations(doc)**: Extracts the relations from a document processed by spaCy. For each sentence, it extracts the relations between entities and appends the sentence index to each relation tuple.

- **get_entities_from_adp_new(adps)**: Extracts the entities from a list of adpositional (prepositional) phrases (adp) and returns a list of dictionaries holding each entity's text and label.

- **extract_labelled_relations_proj(relations, projectName)**: Builds labelled relations from a set of relations, standardizing the entity text and producing one four-element list `[subject, verb, object, adps]` per relation.

In [None]:
from Project.relationExtractor import *

def extract_relations(doc):
    relations = []

    for index, sent in enumerate(doc.sents):
        print("Sentence: ", sent.text)
        # Extract relations
        phase_relations = []
        sent_relation = extract_filtered_relations(sent)
        print("Sent relations: ", sent_relation)
        for el in sent_relation:
            #add the sentence index to el tuple
            new_el = el + (index,)

            phase_relations.append(new_el)
            print("Relation: ", new_el)

        relations.append(phase_relations)

    relations = [remove_relative_pronouns(rel) for rel in relations]

    return relations


def get_entities_from_adp_new(adps):
    labels=[]
    for adp in adps:
        adp_text=[]
        adp_label = ""
        for token in adp:
            if token.ent_iob_ != "O":
                adp_text.append(token)
                if adp_label == "":
                    adp_label = token.ent_type_

        #Convert adp_text to string
        adp_text_string=""
        for token in adp_text:
            adp_text_string += token.text + " "
        labels.append({"adp_text": adp_text_string, "label": adp_label})

    return labels


def extract_labelled_relations_proj(relations, projectName):
    relations_return = []
    for relations_group in relations:
        for relation in relations_group:
            #print(f"Relation: {relation}")
            subjects = relation[0]
            entities_subject = get_entities_from_subject(subjects)
            # Skip relations whose subject contains no entity
            if len(entities_subject["entities"]) == 0:
                continue
            verb_text = ""
            verb = relation[1]
            if isinstance(verb, list):
                for v in verb:
                    verb_text += v.text + " "
            else:
                verb_text = verb

            # Strip the trailing space, join words with underscores, and drop slashes
            verb_text = verb_text.strip()
            verb_text = verb_text.replace(" ", "_")
            verb_text = verb_text.replace("/", "")

            objects = relation[2]
            entities_obj = get_entities_from_subject(objects)

            adp_relations = get_entities_from_adp_new(relation[3])

            adp_relations.append({"adp_text": relation[4], "label": "sentencesIndex"})

            adp_relations.append({"adp_text": projectName, "label": "documentName"})

            # Method to unify same word syntax
            entities_subject = standardize_entity_text(entities_subject, False)
            entities_obj = standardize_entity_text(entities_obj, False)
            adp_relations = standardize_entity_text(adp_relations, True)

            # Create the relation dictionary
            relation_dict = [ entities_subject,
                verb_text,
                entities_obj,
                adp_relations]

            #Add the relation to the dictionary
            relations_return.append(relation_dict)

    return relations_return
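
The two small transformations used above (tagging each relation tuple with its sentence index, and normalizing the verb text) can be tried in isolation. The following is a minimal sketch on toy tuples, not the project's code: `normalize_verb` and `add_sentence_index` are hypothetical helper names, and the toy relation stands in for the output of `extract_filtered_relations`.

```python
def normalize_verb(verb_text: str) -> str:
    """Strip surrounding spaces, join words with underscores, drop slashes."""
    return verb_text.strip().replace(" ", "_").replace("/", "")


def add_sentence_index(sentence_relations, index):
    """Append the sentence index to every relation tuple of one sentence."""
    return [relation + (index,) for relation in sentence_relations]


# Toy relation tuple: (subject, verb, object, adps). Hypothetical data.
toy_relations = [("IEA", "publishes", "report", [])]

tagged = add_sentence_index(toy_relations, 3)
print(tagged)
print(normalize_verb("is expected to reach "))
```

Keeping the sentence index inside the tuple is what later allows each relation to be traced back to its sentence through the `sentencesIndex` entry.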

## **Text Extraction from .spacy Files:**

- **Text extraction**: Reads the .spacy files from a directory, loads them as spaCy `Doc` objects, and extracts the relations with the support functions.

In [None]:
import spacy
from Project.graphBuilder import *
import os
nlp = spacy.load("en_core_web_trf")

relations=[]
for root, dirs, files in os.walk("newDocFiles"):
    for file in files:
        filename= file.split(".")[0]
        with open(os.path.join(root, file), "rb") as f:
            doc = spacy.tokens.Doc(nlp.vocab).from_bytes(f.read())

        file_relations= extract_relations(doc)

        relations_dict = extract_labelled_relations_proj(file_relations, filename)

        relations.append(relations_dict)


    break  # stop after the top-level directory; subfolders are not visited

new_relations = []
for relation in relations:
    for rel in relation:
        new_relations.append(rel)

relations_dict = new_relations
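
The last step of the cell above flattens the per-file lists into a single list of relations. As a hedged alternative, the same flattening can be written with `itertools.chain`; the nested toy lists below are illustrative placeholders, not real extraction output.

```python
from itertools import chain


def flatten(per_file_relations):
    """Concatenate the relation lists of all files into one flat list."""
    return list(chain.from_iterable(per_file_relations))


# One inner list per file; a file without relations yields an empty list.
per_file = [
    [["s1", "v1", "o1"], ["s2", "v2", "o2"]],
    [],
    [["s3", "v3", "o3"]],
]

flat = flatten(per_file)
print(len(flat))
```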

## **Cleaning relations:**

- Removes relations whose subject has no valid entity, i.e. only entities with ignored labels (e.g. "TIME", "MONEY"). `DATE` entities are kept only when their text starts with a four-digit year.
- Removes relations that have neither a valid object entity nor a valid adposition (adp).

In [None]:
import re
relations_to_remove = []
label_to_ignore= ["TIME", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL", "PERCENT"]
new_relations = []
date_pattern = r'\b\d{4}\b'
for relation in relations_dict:

    subject = relation[0]
    object = relation[2]
    verb = relation[1]


    adp = relation[3]
    print("Relations: ", relation[3])
    adp_file= []
    for el in adp:
        if el['label'] == 'sentencesIndex' or el['label'] == 'documentName':
            adp_file.append(el)

    #remove adp_file from adp
    adp = [el for el in adp if el['label'] != 'sentencesIndex' and el['label'] != 'documentName']


    # If the subject only has "TIME", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL", "PERCENT" labels --> REMOVE
    subject_label = []
    for el in subject['entities']:
        if not el['label'] in label_to_ignore:
            if el['label'] == "DATE" and not re.match(date_pattern, el['entity']):
                continue
            subject_label.append(el)

    object_labels=[]
    for el in object['entities']:
        if not el['label'] in label_to_ignore:
            if el['label'] == "DATE" and not re.match(date_pattern, el['entity']):
                continue
            object_labels.append(el)

    if len(subject_label) == 0:
        relations_to_remove.append(relation)
        continue

    # Initialise here so adp_label is always defined when it is read further down
    adp_label = []
    if len(relation) == 4:
        for el in adp:
            if not el['label'] in label_to_ignore and el['label'] != "":
                if el['label'] == "DATE" and not re.match(date_pattern, el['adp_text']):
                    continue
                adp_label.append(el)

        # No labelled object and no valid adp --> REMOVE
        # Unlabelled object whose adps are all "TIME", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL", "PERCENT" --> REMOVE
        if len(adp_label)==0 and len(object_labels) == 0:
            relations_to_remove.append(relation)
            continue

    new_subject = {
        'subject': subject['subject'],
        'entities': subject_label


    }

    new_object = {
        'object': object['subject'],
        'entities': object_labels
    }

    if adp_label:
        new_relations.append([new_subject, relation[1], new_object, adp_label, adp_file])
    else:
        new_relations.append([new_subject, relation[1], new_object, adp_file])
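
The entity filter applied to subjects and objects above can be isolated as a small predicate. This is a sketch under the same rules as the cell (ignored labels are dropped, `DATE` is kept only if the text matches the year pattern); `keep_entity` and `filter_entities` are hypothetical names, and the sample entities are made up. Because `re.match` anchors at the start of the string, a date such as "by 2030" is rejected even though it contains a year.

```python
import re

LABELS_TO_IGNORE = {"TIME", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL", "PERCENT"}
DATE_PATTERN = re.compile(r"\b\d{4}\b")


def keep_entity(label, text):
    """Return True if an entity with this label and text should be kept."""
    if label in LABELS_TO_IGNORE:
        return False
    # re.match only succeeds when the four-digit year is at the start of the text
    if label == "DATE" and not DATE_PATTERN.match(text):
        return False
    return True


def filter_entities(entities, text_key="entity"):
    """Keep only the entities accepted by keep_entity."""
    return [e for e in entities if keep_entity(e["label"], e[text_key])]


sample = [
    {"entity": "China", "label": "GPE"},
    {"entity": "2030", "label": "DATE"},
    {"entity": "by 2030", "label": "DATE"},
    {"entity": "last year", "label": "DATE"},
    {"entity": "30%", "label": "PERCENT"},
]

kept = filter_entities(sample)
print([e["entity"] for e in kept])
```

If dates such as "by 2030" should also be kept, `DATE_PATTERN.search` would be the variant to consider; the cell above uses `re.match`.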




## **Splitting relations:**

- **Creating new relations**: For each relation, takes the valid subject, object, and adposition entities and combines them pairwise into new relations.



In [None]:
splitted_relations = []
rel_counter=0
for relation in new_relations:
    subject = relation[0]
    object = relation[2]
    verb = relation[1]

    adp_file = relation[4]  if len(relation) == 5 else relation[3]

    element=[]

    #Entity to combine
    for entity in object['entities']:
        new_el= {
            'subject': entity['entity'],
            'entities': [entity]
        }
        element.append(new_el)

    for entity in subject['entities']:
        new_el= {
            'subject': entity['entity'],
            'entities': [entity]
        }
        element.append(new_el)

    if len(relation) > 4:
        adp = relation[3]
        for entity in adp:
            new_el= {
                'subject': entity['adp_text'],
                'entities': [{
                    'entity': entity['adp_text'],
                    'label': entity['label']
                }]
            }
            element.append(new_el)

    for i in range(len(element)):
        for j in range(i+1, len(element)):
            new_rel=[
                element[i],
                rel_counter,
                element[j],
                adp_file
            ]
            splitted_relations.append(new_rel)
            rel_counter+=1
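
The nested `i`/`j` loops above enumerate every unordered pair of elements, which is what `itertools.combinations` produces. The sketch below shows the same pairing on toy strings; `split_into_pairs` is a hypothetical name, and the strings stand in for the entity dictionaries used in the notebook.

```python
from itertools import combinations


def split_into_pairs(elements, provenance, start_id=0):
    """Build one [left, id, right, provenance] record per unordered pair."""
    pairs = []
    for offset, (left, right) in enumerate(combinations(elements, 2)):
        pairs.append([left, start_id + offset, right, provenance])
    return pairs


# Toy elements in the same order as the cell: object, subject, then adp entities.
elements = ["wind energy", "China", "2030"]
provenance = [{"adp_text": "Renewables2024", "label": "documentName"}]

pairs = split_into_pairs(elements, provenance)
for pair in pairs:
    print(pair[0], pair[1], pair[2])
```

A relation with n valid elements therefore produces n(n-1)/2 pair relations, each with its own counter value.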



## **Cypher Query Creation:**

- **build_Cypher_query_adps(subject, verb, obj, adps)**: Builds the Cypher queries that add nodes and relationships to the Neo4j graph. Node text and relationship properties are cleaned before being written into the queries.

In [None]:
def build_Cypher_query_adps(subject, verb, obj, adps):
    print(f"Subject: {subject}, Verb: {verb}, Object: {obj}, Adps: {adps}")

    # print("Inizio query")
    queries = []

    # Prepare subject labels
    subject_labels = ['Node'] + [entity['label'].strip() for entity in subject['entities']]
    subject_labels_str = ':' + ':'.join(subject_labels)

    # Create subject node with cleaned text
    subject_text = clean_text(subject['subject'].strip())
    subject_query = f"MERGE (s{subject_labels_str} {{text: '{subject_text}'}})"
    queries.append(subject_query)

    # Handle object nodes and relationships
    if obj["subject"]:
        object_query = ""
        object_text = clean_text(obj["subject"].strip())
        # Create object node with cleaned text
        if len(obj["entities"]) != 0:
            object_label = obj["entities"][0]["label"].strip()
            object_query = f"MERGE (o:Node:{object_label} {{text: '{object_text}'}})"
        else:
            object_query = f"MERGE (o:Node {{text: '{object_text}'}})"
        queries.append(object_query)

        # Create relationship with cleaned properties from adps
        properties = {clean_property_key(adp["properties"]): clean_text(adp["adp_text"].strip())
                      for adp in adps if adp["properties"].strip()}  # Skip empty property keys
        props_str = ", ".join([f"{k}: '{v}'" for k, v in properties.items()])

        if len(obj["entities"]) != 0:
            object_label = obj["entities"][0]["label"].strip()
            relationship_query = f"""MATCH (s{subject_labels_str} {{text: '{subject_text}'}})
    MATCH (o:Node:{object_label} {{text: '{object_text}'}})
    MERGE (s)-[r:{verb} {{{props_str}}}]->(o)"""
            queries.append(relationship_query)
        else:
            relationship_query = f"""MATCH (s{subject_labels_str} {{text: '{subject_text}'}})
    MATCH (o:Node {{text: '{object_text}'}})
    MERGE (s)-[r:{verb} {{{props_str}}}]->(o)"""
            queries.append(relationship_query)

    else:
        # Reflexive relationship with cleaned properties
        properties = {clean_property_key(adp["properties"]): clean_text(adp["adp_text"].strip())
                      for adp in adps if adp["properties"].strip()}  # Skip empty property keys
        props_str = ", ".join([f"{k}: '{v}'" for k, v in properties.items()])

        relationship_query = f"""MATCH (s{subject_labels_str} {{text: '{subject_text}'}})
MERGE (s)-[r:{verb} {{{props_str}}}]->(s)"""
        queries.append(relationship_query)

    return queries
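
The function above interpolates cleaned text directly into the query string. A possible alternative, sketched here under assumptions, is to pass values as Cypher parameters (`$text`, `$props`) and to validate the parts that are still interpolated, since labels and relationship types generally cannot be supplied as ordinary parameters. The helper names are hypothetical, and setting properties with `SET r += $props` after the `MERGE` is not identical to merging on the properties as the original does.

```python
import re

# Conservative pattern for labels and relationship types that get interpolated
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def check_identifier(name):
    """Reject anything that is not a plain identifier before interpolating it."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"Unsafe label or relationship type: {name!r}")
    return name


def merge_node_query(labels, text):
    """Return (query, parameters) that MERGE a node carrying the given labels."""
    label_str = ":".join(["Node"] + [check_identifier(label) for label in labels])
    return f"MERGE (n:{label_str} {{text: $text}})", {"text": text}


def merge_relationship_query(rel_type, subject_text, object_text, properties):
    """Return (query, parameters) that MERGE a relationship between two nodes."""
    check_identifier(rel_type)
    query = (
        "MATCH (s:Node {text: $subject_text}) "
        "MATCH (o:Node {text: $object_text}) "
        f"MERGE (s)-[r:{rel_type}]->(o) "
        "SET r += $props"
    )
    params = {
        "subject_text": subject_text,
        "object_text": object_text,
        "props": properties,
    }
    return query, params


node_query, node_params = merge_node_query(["GPE"], "Côte d'Ivoire")
print(node_query, node_params)
```

With the Neo4j Python driver, such a pair could be executed as `session.run(query, params)`; text containing quotes then needs no manual escaping.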

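## **Graph creation:**

- **Creating the graph**: Runs the generated Cypher queries to create the graph in a Neo4j database, adding each relation in turn. Neo4j normalizes database names to lowercase, so the database created here as `CleanedGraph2` is opened later as `cleanedgraph2`.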

In [None]:
from neo4j import GraphDatabase

uri = "bolt://localhost:7687"
driver = GraphDatabase.driver(uri, auth=("neo4j", "14052001"))
session = driver.session()
projectName = "CleanedGraph2"
session.run(f"CREATE DATABASE {projectName}")
project_session=driver.session(database=projectName)

for rel in splitted_relations:
    queries = build_Cypher_query_adps(rel[0], f"REL_{rel[1]}", rel[2], rel[3])

    for q in queries:
        result = project_session.run(q)


# **Information Extraction from the Graph**

Category and entity extraction: extracts the entities and their categories from the .spacy files and builds a mapping of the labels.

- **execute_query(session, labels_mapping)**: Runs a Cypher query that retrieves the whole graph from Neo4j and returns the nodes, the edges, and the relation texts.

- **extract_graph_data(session, labels_mapping)**: Takes the result of `execute_query` and arranges it into the format used to build the dataset, one record per edge, of the form:

  *{"nodo1": {"id": 0, "labels": [{"categoria_id": 2, "entita_id": 85}]}, "nodo2": {"id": 0, "labels": [{"categoria_id": 2, "entita_id": 85}]}, "arco": {"id": 1152921504606846976}}*

  In the saved output the entity key appears as `'"entita_id"'`, with literal double quotes, because `execute_query` writes the key that way.

In [None]:
import os
import spacy
from spacy.tokens import Span

labels_mapping = {}
labels_id = 0

directory_path = 'newDocFiles'

nlp = spacy.load("en_core_web_trf")
# Iterate over all folders in the directory

for root, dirs, files in os.walk(directory_path):
    #Iterate on files in the directory
    for file in files:
        if ".spacy" in file and not ".json" in file:
            doc_path = os.path.join(root, file)
            print(f"Processing {doc_path}")
            with open(doc_path, 'rb') as f:
                doc = spacy.tokens.Doc(nlp.vocab).from_bytes(f.read())

            for ent in doc.ents:
                if not ent.label_ in labels_mapping:
                    labels_mapping[ent.label_] = {
                        "id": labels_id,
                        "labels_count": 0,
                        "labels": {}
                    }
                    labels_id += 1
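                # NOTE: this tests the category dict (keys "id", "labels_count", "labels"),
                # not its "labels" sub-dict, so it holds for practically every mention:
                # the entity id is overwritten each time and "labels_count" counts mentions.
                if not ent.text in labels_mapping[ent.label_]: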
                    labels_mapping[ent.label_]["labels"][ent.text] = labels_mapping[ent.label_]["labels_count"]
                    labels_mapping[ent.label_]["labels_count"] += 1



Processing newDocFiles\COP28TriplingRenewableCapacityPledge.spacy
Processing newDocFiles\Electricity2024.spacy
Processing newDocFiles\GlobalCriticalMineralsOutlook2024.spacy
Processing newDocFiles\IRENAGreenhydrogenderivativestrade2024.spacy
Processing newDocFiles\Oil2024.spacy
Processing newDocFiles\Renewables2024.spacy
Processing newDocFiles\WorldEnergyInvestment2024.spacy
Processing newDocFiles\WorldEnergyOutlook2024.spacy
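
In the cell above, membership is tested against `labels_mapping[ent.label_]` rather than its `"labels"` sub-dictionary, so ids are reassigned on every mention. If one id per distinct entity text is wanted, a variant could look like the sketch below. It works on plain `(text, label)` pairs instead of spaCy entities, and `build_labels_mapping` is a hypothetical name; it would produce different ids from those in the outputs recorded in this notebook.

```python
def build_labels_mapping(entities):
    """Map each label to a category id and each distinct text to a stable id."""
    mapping = {}
    for text, label in entities:
        if label not in mapping:
            mapping[label] = {"id": len(mapping), "labels_count": 0, "labels": {}}
        category = mapping[label]
        # Test membership in the "labels" sub-dict so repeated mentions keep their id
        if text not in category["labels"]:
            category["labels"][text] = category["labels_count"]
            category["labels_count"] += 1
    return mapping


# Toy (text, label) pairs standing in for (ent.text, ent.label_)
mentions = [("China", "GPE"), ("IEA", "ORG"), ("China", "GPE"), ("India", "GPE")]

toy_mapping = build_labels_mapping(mentions)
print(toy_mapping)
```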


In [None]:
import json

with open('Project/GraphAnalysis/energyReportsGraph/labels_mapping.json', 'w') as f:
    json.dump(labels_mapping, f)

In [None]:
labels_mapping["GPE"]

{'id': 7,
 'labels_count': 6380,
 'labels': {'China': 6344,
  'Germany': 5419,
  'Spain': 4935,
  'Italy': 4937,
  'France': 6064,
  'United Kingdom': 5850,
  'India': 6370,
  'Japan': 6358,
  'Brazil': 6041,
  'Saudi Arabia': 6130,
  'Egypt': 5689,
  'Algeria': 3836,
  'Nigeria': 6080,
  'South Africa': 6256,
  'Ethiopia': 5927,
  'United States': 6379,
  'Canada': 5833,
  'Indonesia': 6366,
  'Philippines': 5932,
  'Ireland': 4124,
  'Netherlands': 5501,
  'Taipei': 3926,
  'Republic of Türkiye': 74,
  'Russia': 6179,
  'Ukraine': 6152,
  'Denmark': 6063,
  'Poland': 6047,
  'Belgium': 6048,
  'Sweden': 6065,
  'Portugal': 5457,
  'Finland': 4938,
  'Croatia': 3923,
  'Slovakia': 5460,
  'Australia': 6099,
  'Viet Nam': 6367,
  'Malaysia': 5532,
  'Thailand': 5788,
  'New Zealand': 4137,
  'US': 6010,
  'Chile': 6040,
  'Argentina': 6034,
  'Mexico': 6257,
  'Eritrea': 183,
  'Kenya': 5948,
  'Tanzania': 5944,
  'Botswana': 4941,
  'Angola': 4607,
  'Benin': 203,
  'Côte d’Ivoire': 3

In [None]:
import json

def execute_query(session, labels_mapping):
    try:
        query = "MATCH (n)-[r]-(m) RETURN n, r, m"
        result = session.run(query)
        # Containers for nodes, edges, and relation texts
        nodes = {}
        edges = []
        relation_texts = []

        phase_already_taken = []

        for record in result:
            relation_dict = {}
            relation = {}
            phase_index = -1
            document_name = ""

            # Iterate over every element of the record
            for value in record.values():

                # Handle nodes
                if hasattr(value, 'labels'):
                    node_id = value.id  # ID univoco del nodo
                    if node_id not in nodes:
                        labels = [l for l in value.labels if l!="Node"]
                        node_text = value['text']
                        map=[]
                        for l in labels:
                            new_mapping={}
                            if l in labels_mapping:
                                new_mapping['categoria_id']=labels_mapping[l]['id']
                                for key in labels_mapping[l]['labels']:
                                    if key in node_text:
                                        new_mapping["\"entita_id\""]=labels_mapping[l]['labels'][key]
                                        break
                            map.append(new_mapping)

                        nodes[node_id] = {
                            'id': node_id,
                            'labels': map,
                            'properties': dict(value)
                        }

                # Handle relationships
                elif hasattr(value, 'type'):
                    # Get the start and end nodes of the relationship
                    start_node = value.start_node
                    end_node = value.end_node

                    relation['subject'] = start_node['text']
                    if (relation['subject'] != end_node['text']):
                        relation['object'] = end_node['text']

                    # Add the nodes if they are not already present
                    if start_node.id not in nodes:
                        labels_start_node = [l for l in start_node.labels if l!="Node"]
                        node_text = start_node['text']
                        map=[]
                        for l in labels_start_node:
                            new_mapping={}
                            if l in labels_mapping:
                                new_mapping['categoria_id']=labels_mapping[l]['id']
                                for key in labels_mapping[l]['labels']:
                                    if key in node_text:
                                        new_mapping["\"entita_id\""]=labels_mapping[l]['labels'][key]
                                        break
                            map.append(new_mapping)

                        nodes[start_node.id] = {
                            'id': start_node.id,
                            'labels': map,
                            'properties': dict(start_node)
                        }

                    if end_node.id not in nodes:
                        labels_end_node= [l for l in end_node.labels if l!="Node"]
                        node_text = end_node['text']
                        map=[]
                        for l in labels_end_node:
                            new_mapping={}
                            if l in labels_mapping:
                                new_mapping['categoria_id']=labels_mapping[l]['id']
                                for key in labels_mapping[l]['labels']:
                                    if key in node_text:
                                        new_mapping["\"entita_id\""]=labels_mapping[l]['labels'][key]
                                        break
                            map.append(new_mapping)
                        nodes[end_node.id] = {
                            'id': end_node.id,
                            'labels': map,
                            'properties': dict(end_node)
                        }


                    new_value_dict = {}
                    for key in value:
                        if key == 'documentName':
                            document_name = value[key]
                        elif key == 'sentencesIndex':
                            phase_index = value[key]
                        else:
                            new_value_dict[key] = value[key]

                    # Build the edge
                    edge = {
                        'id': value.id,
                        'source': start_node.id,
                        'target': end_node.id,
                        'type': value.type,
                        'properties': dict(new_value_dict)
                    }
                    edges.append(edge)

                    relation['verb'] = value.type
                    relation['properties'] = dict(value)

            relation_dict['relation'] = relation
            #Get the relation text
            if document_name != "" and phase_index != -1:
                if phase_index in phase_already_taken:
                    relation_dict['text'] = ""
                else:

                    phase_already_taken.append(phase_index)

            relation_texts.append(relation_dict)

        # Return the nodes and edges as lists
        return {
            "nodes": list(nodes.values()),
            "edges": edges,
            'relation_texts': relation_texts
        }

    except Exception as e:
        print(f"Error while running the query: {e}")
        return None

def extract_graph_data(session,labels_mapping):

    result = execute_query(session,labels_mapping)

    if not result:
        return None

    graph_data = []

    for edge in result['edges']:
        node1_id = edge['source']
        node2_id = edge['target']

        node1 = next(node for node in result['nodes'] if node['id'] == node1_id)
        node2 = next(node for node in result['nodes'] if node['id'] == node2_id)

        graph_data.append({
            "nodo1": {
                "id": node1['id'],
                "labels": node1['labels']
            },
            "nodo2": {
                "id": node2['id'],
                "labels": node2['labels']
            },
            "arco": {
                "id": edge['id']
            }
        })

    return graph_data
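
`execute_query` repeats the same label-mapping loop three times (for the record node, the start node, and the end node). That loop could be factored into one helper, sketched here on plain Python values. `map_labels` is a hypothetical name, and the sketch writes a plain `entita_id` key, whereas the cell above writes the key with embedded double quotes.

```python
def map_labels(labels, node_text, labels_mapping):
    """Translate node labels into category ids and, when possible, entity ids."""
    mapped = []
    for label in labels:
        entry = {}
        if label in labels_mapping:
            entry["categoria_id"] = labels_mapping[label]["id"]
            # First known entity text contained in the node text wins
            for entity_text, entity_id in labels_mapping[label]["labels"].items():
                if entity_text in node_text:
                    entry["entita_id"] = entity_id
                    break
        mapped.append(entry)
    return mapped


# Toy mapping with the same shape as labels_mapping
toy_labels_mapping = {
    "GPE": {"id": 7, "labels_count": 2, "labels": {"China": 0, "India": 1}},
}

print(map_labels(["GPE"], "China", toy_labels_mapping))
```

Because the match is a substring test, a node text such as "Indian Ocean" would also be mapped to "India"; whether that is desirable depends on the dataset.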

In [None]:
from neo4j import GraphDatabase

import warnings

warnings.simplefilter("ignore", DeprecationWarning)


uri = "bolt://localhost:7687"
driver = GraphDatabase.driver(uri, auth=("neo4j", "14052001"))
session = driver.session(database= "cleanedgraph2")

graph_data = extract_graph_data(session,labels_mapping)

In [None]:
graph_data

[{'nodo1': {'id': 0, 'labels': [{'categoria_id': 2, '"entita_id"': 85}]},
  'nodo2': {'id': 0, 'labels': [{'categoria_id': 2, '"entita_id"': 85}]},
  'arco': {'id': 1152921504606846976}},
 {'nodo1': {'id': 0, 'labels': [{'categoria_id': 2, '"entita_id"': 85}]},
  'nodo2': {'id': 1, 'labels': [{'categoria_id': 2, '"entita_id"': 1}]},
  'arco': {'id': 1152922604118474752}},
 {'nodo1': {'id': 0, 'labels': [{'categoria_id': 2, '"entita_id"': 85}]},
  'nodo2': {'id': 1, 'labels': [{'categoria_id': 2, '"entita_id"': 1}]},
  'arco': {'id': 1152922604118474752}},
 {'nodo1': {'id': 0, 'labels': [{'categoria_id': 2, '"entita_id"': 85}]},
  'nodo2': {'id': 8, 'labels': [{'categoria_id': 6, '"entita_id"': 588}]},
  'arco': {'id': 1152932499723124736}},
 {'nodo1': {'id': 0, 'labels': [{'categoria_id': 2, '"entita_id"': 85}]},
  'nodo2': {'id': 8, 'labels': [{'categoria_id': 6, '"entita_id"': 588}]},
  'arco': {'id': 1152932499723124736}},
 {'nodo1': {'id': 0, 'labels': [{'categoria_id': 2, '"entita

In [None]:
#Save the graph data in a json file
with open('graph_data.json', 'w') as f:
    json.dump(graph_data, f)
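
The output above lists some arcs twice in consecutive records, presumably because the undirected pattern `(n)-[r]-(m)` matches a relationship once per direction. A consumer of `graph_data.json` may want to deduplicate by arc id before building a graph. The sketch below does this with the standard library only; `build_adjacency` is a hypothetical name, and the records use small made-up arc ids in the same shape as the saved data.

```python
from collections import defaultdict


def build_adjacency(records):
    """Return (adjacency dict, number of distinct arcs), skipping repeated arcs."""
    seen_arcs = set()
    adjacency = defaultdict(set)
    for record in records:
        arc_id = record["arco"]["id"]
        if arc_id in seen_arcs:
            continue  # same arc already seen in an earlier record
        seen_arcs.add(arc_id)
        source = record["nodo1"]["id"]
        target = record["nodo2"]["id"]
        adjacency[source].add(target)
        adjacency[target].add(source)
    return dict(adjacency), len(seen_arcs)


# Toy records with the same structure as graph_data.json
records = [
    {"nodo1": {"id": 0, "labels": []}, "nodo2": {"id": 0, "labels": []}, "arco": {"id": 10}},
    {"nodo1": {"id": 0, "labels": []}, "nodo2": {"id": 1, "labels": []}, "arco": {"id": 11}},
    {"nodo1": {"id": 0, "labels": []}, "nodo2": {"id": 1, "labels": []}, "arco": {"id": 11}},
    {"nodo1": {"id": 0, "labels": []}, "nodo2": {"id": 8, "labels": []}, "arco": {"id": 12}},
    {"nodo1": {"id": 0, "labels": []}, "nodo2": {"id": 8, "labels": []}, "arco": {"id": 12}},
]

adjacency, arc_count = build_adjacency(records)
print(adjacency, arc_count)
```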

Mapping between node_id and text, so that the text can be retrieved from a node id
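
One detail worth keeping in mind when this mapping is saved as JSON: object keys in JSON are always strings, so the integer node ids come back as `"0"`, `"1"`, and so on. A minimal round-trip sketch, using a few entries from the mapping as sample data, shows how the integer keys can be restored after loading.

```python
import json

# A few (id, text) pairs in the shape returned by get_nodes_dict
node_dict = {0: "NDC", 1: "Party", 2: "NDCs"}

serialized = json.dumps(node_dict)
loaded = json.loads(serialized)  # keys are now the strings "0", "1", "2"

# Convert the keys back to integers so lookups by node id work again
restored = {int(node_id): text for node_id, text in loaded.items()}

print(loaded)
print(restored[1])
```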

In [None]:
from neo4j import GraphDatabase

uri = "bolt://localhost:7687"
driver = GraphDatabase.driver(uri, auth=("neo4j", "14052001"))
session = driver.session(database= "cleanedgraph2")

def get_nodes_dict(session):
    query = "MATCH (n) RETURN id(n) AS id, n.text AS text"
    result = session.run(query)
    return {record["id"]: record["text"] for record in result}

node_dict= get_nodes_dict(session)


#Save as json with node_id_text.json
with open('Project/GraphAnalysis/energyReportsGraph/node_id_text.json', 'w') as f:
    json.dump(node_dict, f)




{0: 'NDC',
 1: 'Party',
 2: 'NDCs',
 3: 'renewable energy',
 4: 'renewable electricity capacity',
 5: '2030',
 6: 'solar pv',
 7: 'China',
 8: 'renewable capacity',
 9: 'wind energy',
 10: 'policy stocktaking',
 11: 'IEA',
 12: 'renewable capacity ambitions',
 13: 'European',
 14: 'Europe',
 15: 'Spain',
 16: 'Italy',
 17: 'France',
 18: 'United Kingdom',
 19: 'European Union',
 20: 'climate goals',
 21: 'NECP',
 22: 'European Commission',
 23: 'Asia Pacific region',
 24: 'India',
 25: 'Japan',
 26: 'fossil fuel capacity',
 27: 'electricity generation share',
 28: 'renewable energy capacity',
 29: 'Latin America',
 30: '2022',
 31: 'Sub Saharan Africa',
 32: 'Eurasia',
 33: 'renewable energy capacity ambitions',
 34: 'MENA',
 35: 'hydropower',
 36: 'wind',
 37: '2023',
 38: 'cop28',
 39: 'IEA Emissions 2050',
 40: 'EMDEs',
 41: 'global greenhouse gas emissions',
 42: 'emissions',
 43: 'co2 emissions',
 44: 'solar power capacity',
 45: '2024',
 46: 'generation costs',
 47: 'generation c