Part 1 of this question is "what is the conceptual difference between nodes and entities?"
Part 2 of this question is "why are entities sometimes repeated twice in the nodes file?"

The output files in question are create_final_entities.parquet and create_final_nodes.parquet (create_final_relationships.parquet is also related).

GraphRAG extracts entities and relationships from text content and generates a graph. This graph is then used as the entry point for algorithms to summarize and answer questions about your dataset. When we extract entities, we create a canonical list of entities including the text units they were found within. This entity data is saved to create_final_entities.parquet.

We then combine the entities table and the relationships table to create a graph (network). Once we put each entity into the graph, it becomes a node in that graph (and the relationships become edges), and thus adopts new semantic meaning and analytic properties. You'll notice, for example, that in the nodes table each entity has a degree, x, y, and size. Degree is the node degree (connectedness), and x and y can be populated with a position in 2D coordinate space for visualizing the graph (see the configs for Node2Vec embeddings and UMAP). We use the degree to represent the size by default, so those two columns are equivalent, but you could use any measure you deem important to set the size of a node in a graph visualization.
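As a rough illustration of the paragraph above, the sketch below builds a graph from a toy entities table and a toy relationships table, then derives degree and size per node. The rows are hypothetical stand-ins, and this is not GraphRAG's own implementation; it only assumes that relationships reference entities by title, as the relationships output in this notebook suggests.

```python
import networkx as nx
import pandas as pd

# Toy stand-ins for the entities and relationships tables (hypothetical rows).
entities = pd.DataFrame({
    "id": ["e1", "e2", "e3"],
    "title": ["PRINTED CIRCUIT BOARD", "LANDS", "COPPER"],
})
relationships = pd.DataFrame({
    "source": ["PRINTED CIRCUIT BOARD", "PRINTED CIRCUIT BOARD"],
    "target": ["LANDS", "COPPER"],
    "weight": [9.0, 8.0],
})

# Each relationship becomes an edge; each entity title becomes a node.
graph = nx.from_pandas_edgelist(relationships, "source", "target", edge_attr="weight")
graph.add_nodes_from(entities["title"])  # keep entities that have no edges

# Graph-space properties: degree comes from the graph, not from the entity itself.
nodes = entities.assign(degree=entities["title"].map(dict(graph.degree())))
nodes["size"] = nodes["degree"]  # size mirrors degree, matching the default described above
print(nodes)
```

Any other per-node measure (for example a centrality score) could be assigned to `size` in the same way.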

As for the duplication: one of the graph analysis steps we run is hierarchical community detection with Leiden. A community will be assigned for every node, at every level in the hierarchy (unless that node becomes too distinct and becomes "orphaned" at some depth). This results in a duplicate entry in the nodes table for each computed community level. So each row in create_final_entities.parquet maps to one or more rows in create_final_nodes.parquet (a one-to-many relationship), using the id field as the join key.
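The one-to-many relationship can be sketched with two toy tables. The rows and community numbers here are made up for illustration; only the column names `id`, `level`, and `community` are taken from the nodes output in this notebook.

```python
import pandas as pd

# One canonical row per entity (hypothetical rows).
entities = pd.DataFrame({
    "id": ["e1", "e2"],
    "title": ["PRINTED CIRCUIT BOARD", "LANDS"],
})

# One node row per entity per community level (hypothetical community ids).
nodes = pd.DataFrame({
    "id": ["e1", "e2", "e1", "e2"],
    "level": [0, 0, 1, 1],
    "community": [0, 1, 4, 5],
})

# validate="one_to_many" makes pandas raise if `id` is not unique in entities.
joined = entities.merge(nodes, on="id", how="left", validate="one_to_many")

# Count how many hierarchy levels each entity appears at.
rows_per_entity = joined.groupby("id")["level"].nunique()
print(rows_per_entity)
```

An entity that is orphaned at some depth would simply show fewer levels in `rows_per_entity`.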

To summarize: entities are canonical, nodes are a representation of that entity in graph space, and duplication is because we compute hierarchical communities and add an entry for each in the nodes table.

In [2]:
import pandas as pd


In [None]:
input_eval= {"id":0,"text":"""Printed circuit board
Electronics assemblies are based on use of a printed circuit board of one
form or another to hold components. Construction of these printed circuit
boards is critical to soldering processes in that different printed circuit
board types have different thermal characteristics which can greatly affect
how they must be soldered.
In principle a printed circuit board (PCB) sometimes called a printed
wiring board (PWB) or simply printed board comprises: a base which is
a thin board of insulating material supporting all the components which
make up a circuit; conducting tracks usually copper on one or bOth sides
of the base making up the interconnections between components. Component
connecting leads are electrically connected in some form of permanent
or semi-permanent way usually by soldering to lands sometimes called
pads ~ the areas of track specially designated for component connection
purposes. If lands have holes drilled or punched through the board to
facilitate component mounting the board is a through-hole printed circuit
board. If lands have no holes the board is a surface mounted printed circuit
board.
To clarify the term printed is somewhat misleading as tracks are not
printed directly onto the board. It refers instead to just one stage within the
whole printed circuit board manufacturing process where the conducting
track layout sometimes called pattern or image may be produced using
some form of printing technique.
Printed circuit boards can be made in one of two main ways. First in an
additive process the conductive track may be added to the surface of the
base material. There's a number of ways in which this can be done. Second
in a subtractive process where base material is supplied with its whole
surface covered with a conductive layer track pattern is defined and excess
conductive material is removed leaving the required track. Sometimes
both processes may be combined to produce printed circuit boards with
more than one layer of conductive track."""


}

pd.DataFrame([input_eval]).to_csv('/home/cip/ce/ix05ogym/Majid/LLM/GraphRag/elec_graph/input_eval/input_eval.csv'
,index=False)
pd.read_csv('/home/cip/ce/ix05ogym/Majid/LLM/GraphRag/elec_graph/input_eval/input_eval.csv').head()

In [29]:
info ={}

In [13]:
output_path = "/home/cip/ce/ix05ogym/Majid/LLM/GraphRag/elec_graph/output_eval/"
final_node = pd.read_parquet(output_path+'create_final_nodes.parquet')
final_node.sort_values('degree',ascending=False)

Unnamed: 0,level,title,type,description,source_id,community,degree,human_readable_id,id,size,graph_embedding,top_level_node_id,x,y
0,0,PRINTED CIRCUIT BOARD,TECHNOLOGY,A printed circuit board (PCB) is a thin board ...,dc02e730db9813c2bc1463c9a951d015,0,16,0,f6611b55254a48188f9ffc670786d147,16,"[-0.004821797367185354, -0.001710553653538227,...",f6611b55254a48188f9ffc670786d147,2.408991,-14.375688
31,1,PRINTED CIRCUIT BOARD,TECHNOLOGY,A printed circuit board (PCB) is a thin board ...,dc02e730db9813c2bc1463c9a951d015,4,16,0,f6611b55254a48188f9ffc670786d147,16,"[-0.004821797367185354, -0.001710553653538227,...",f6611b55254a48188f9ffc670786d147,2.408991,-14.375688
61,0,COMPONENT,,,dc02e730db9813c2bc1463c9a951d015,1,3,22,e731b9baaafa4398b8943640d7dff143,3,"[-0.0038543029222637415, -0.000373646791558712...",e731b9baaafa4398b8943640d7dff143,1.628155,-13.981526
60,0,COMPONENT,,,dc02e730db9813c2bc1463c9a951d015,1,3,22,e731b9baaafa4398b8943640d7dff143,3,"[-0.0038543029222637415, -0.000373646791558712...",e731b9baaafa4398b8943640d7dff143,1.628155,-13.981526
29,0,COMPONENT,,,dc02e730db9813c2bc1463c9a951d015,1,3,22,e731b9baaafa4398b8943640d7dff143,3,"[-0.0038543029222637415, -0.000373646791558712...",e731b9baaafa4398b8943640d7dff143,1.628155,-13.981526
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1,0,PRINTED WIRING BOARD,TECHNOLOGY,A printed wiring board (PWB) is another name f...,dc02e730db9813c2bc1463c9a951d015,0,1,1,b820e765cb7c4b179a7b65ec3dda8b6e,1,"[-0.001985871698707342, -0.001004505786113441,...",b820e765cb7c4b179a7b65ec3dda8b6e,-11.945440,17.922535
3,0,SURFACE MOUNTED PRINTED CIRCUIT BOARD,TECHNOLOGY,A surface mounted printed circuit board has la...,dc02e730db9813c2bc1463c9a951d015,0,1,3,cdb8cd24d4884fe28b4743f40237400a,1,"[-0.001315498724579811, -0.0003250970912631601...",cdb8cd24d4884fe28b4743f40237400a,-15.065989,20.480062
33,1,THROUGH-HOLE PRINTED CIRCUIT BOARD,TECHNOLOGY,A through-hole printed circuit board has lands...,dc02e730db9813c2bc1463c9a951d015,4,1,2,fd2c914bf9254fb0bd4448f818ccbb2a,1,"[-0.0018933044048026204, 1.5758490917505696e-0...",fd2c914bf9254fb0bd4448f818ccbb2a,-14.731735,19.267538
34,1,SURFACE MOUNTED PRINTED CIRCUIT BOARD,TECHNOLOGY,A surface mounted printed circuit board has la...,dc02e730db9813c2bc1463c9a951d015,4,1,3,cdb8cd24d4884fe28b4743f40237400a,1,"[-0.001315498724579811, -0.0003250970912631601...",cdb8cd24d4884fe28b4743f40237400a,-15.065989,20.480062
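The output above contains both kinds of repetition: the same id at different levels (expected, one row per community level) and the same id repeated at the same level with the same community (COMPONENT appears three times at level 0). The cause of the second kind is not clear from this output. A minimal sketch for collapsing both, on a toy frame with hypothetical ids:

```python
import pandas as pd

# Toy nodes table mimicking the pattern seen above (hypothetical ids).
nodes = pd.DataFrame({
    "id": ["a", "a", "b", "b", "b"],
    "title": ["PRINTED CIRCUIT BOARD"] * 2 + ["COMPONENT"] * 3,
    "level": [0, 1, 0, 0, 0],
    "community": [0, 4, 1, 1, 1],
    "degree": [16, 16, 3, 3, 3],
})

# Drop exact repeats within a level. A subset is used because columns holding
# arrays (such as graph_embedding) are not hashable, so a plain
# drop_duplicates() would fail on the real table.
per_level = nodes.drop_duplicates(subset=["id", "level"])

# Keep a single row per entity, preferring the lowest level.
per_entity = per_level.sort_values("level").drop_duplicates(subset="id", keep="first")
print(per_entity)
```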


In [7]:
pd.set_option('display.max_rows', 10)


In [None]:
final_entities = pd.read_parquet(output_path+'create_final_entities.parquet')
types = final_entities['type'].value_counts()
info['number of entities']= len(final_entities)
info['number of classes']= len(types)
info['classes'] = types
types

type
PROCESS         5
TECHNOLOGY      4
COMPONENT       3
MATERIAL        3
PROPERTY        1
DESIGN          1
SYSTEM          1
RELATIONSHIP    1
EVENT           1
STRUCTURE       1
Name: count, dtype: int64

In [None]:
final_relationships = pd.read_parquet(output_path+'create_final_relationships.parquet')
cleaned_relationships = final_relationships.drop_duplicates(['source','target']).reset_index(drop=True)
info['number of relationships']= len(cleaned_relationships)
info['number of relationships outliner']= len(cleaned_relationships[cleaned_relationships['rank']<3])
info['max source_degree']= cleaned_relationships['source_degree'].max() 
info['max target_degree']= cleaned_relationships['target_degree'].max()
info['max rank']= cleaned_relationships['rank'].max()
info['min rank']= cleaned_relationships['rank'].min()
info['source_degree/number of relationships']= cleaned_relationships['source_degree'].sum()/len(cleaned_relationships)
info['target_degree/number of relationships']= cleaned_relationships['target_degree'].sum()/len(cleaned_relationships)


cleaned_relationships


Unnamed: 0,source,target,weight,description,text_unit_ids,id,human_readable_id,source_degree,target_degree,rank
0,PRINTED CIRCUIT BOARD,PRINTED WIRING BOARD,10.0,Printed circuit board and printed wiring board...,[dc02e730db9813c2bc1463c9a951d015],58d56184101e4bbcba17dbbee3bfa933,0,16,1,17
1,PRINTED CIRCUIT BOARD,THROUGH-HOLE PRINTED CIRCUIT BOARD,8.0,A through-hole printed circuit board is a type...,[dc02e730db9813c2bc1463c9a951d015],ae72d90ee34847a1841e27b07196525e,1,16,1,17
2,PRINTED CIRCUIT BOARD,SURFACE MOUNTED PRINTED CIRCUIT BOARD,8.0,A surface mounted printed circuit board is a t...,[dc02e730db9813c2bc1463c9a951d015],8c2c6a77ce434508983f75a570e8e462,2,16,1,17
3,PRINTED CIRCUIT BOARD,ADDITIVE PROCESS,7.0,The additive process is a method for manufactu...,[dc02e730db9813c2bc1463c9a951d015],89ab20ea34b44e58afe0c9cba4954664,3,16,1,17
4,PRINTED CIRCUIT BOARD,SUBTRACTIVE PROCESS,1.0,The subtractive process is a method for manufa...,[dc02e730db9813c2bc1463c9a951d015],3f67c602e31648f08d0ae7b6339c9551,4,16,1,17
5,PRINTED CIRCUIT BOARD,ELECTRONICS ASSEMBLIES,9.0,Electronics assemblies are built upon printed ...,[dc02e730db9813c2bc1463c9a951d015],85fc5c441fb94d62a3c34c970894da5c,5,16,1,17
6,PRINTED CIRCUIT BOARD,SOLDERING PROCESSES,9.0,Soldering processes are essential for connecti...,[dc02e730db9813c2bc1463c9a951d015],2fe47c32986745ca8b6126d07f1e5d92,6,16,1,17
7,PRINTED CIRCUIT BOARD,THERMAL CHARACTERISTICS,8.0,Different printed circuit board types have dif...,[dc02e730db9813c2bc1463c9a951d015],46d09b2d1e7d4e24a9127d3bdb3aca41,7,16,1,17
8,PRINTED CIRCUIT BOARD,LANDS,9.0,Lands are areas on a printed circuit board spe...,[dc02e730db9813c2bc1463c9a951d015],955fe66c65d544cabfc97f2e9e095405,8,16,2,18
9,PRINTED CIRCUIT BOARD,BASE MATERIAL,9.0,The base material forms the foundation of a pr...,[dc02e730db9813c2bc1463c9a951d015],b13a3518a7dc492d9753ff946048942e,9,16,1,17
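In the rows shown, rank equals source_degree + target_degree (16 + 1 = 17, 16 + 2 = 18). Whether that holds for every row is an assumption; the sketch below checks it on three rows copied from the relationship outputs in this notebook, and the same expression could be run against the full table.

```python
import pandas as pd

# Three rows copied from the relationship outputs in this notebook.
rels = pd.DataFrame({
    "source": ["PRINTED CIRCUIT BOARD", "PRINTED CIRCUIT BOARD", "LANDS"],
    "target": ["PRINTED WIRING BOARD", "LANDS", "COMPONENT CONNECTING LEADS"],
    "source_degree": [16, 16, 2],
    "target_degree": [1, 2, 1],
    "rank": [17, 18, 3],
})

# True for a row when rank is exactly the sum of the two endpoint degrees.
rank_matches = (rels["source_degree"] + rels["target_degree"]).eq(rels["rank"])
print(rank_matches.all())
```

If the relation holds in general, the `rank < 3` filter used above selects edges whose endpoints have no other connections.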


In [34]:
final_relationships.sort_values('weight',ascending=False)


Unnamed: 0,source,target,weight,description,text_unit_ids,id,human_readable_id,source_degree,target_degree,rank
0,PRINTED CIRCUIT BOARD,PRINTED WIRING BOARD,10.0,Printed circuit board and printed wiring board...,[dc02e730db9813c2bc1463c9a951d015],58d56184101e4bbcba17dbbee3bfa933,0,16,1,17
36,LANDS,COMPONENT CONNECTING LEADS,9.0,Component connecting leads are connected to la...,[dc02e730db9813c2bc1463c9a951d015],ace10f18f92c4fd1a00a0f9eac568cb7,16,2,1,3
60,CIRCUIT,COMPONENT,9.0,Components are the building blocks of circuits...,[dc02e730db9813c2bc1463c9a951d015],d3e78fb41e8649968a5c9e59a0a768ec,19,1,3,4
59,CIRCUIT,COMPONENT,9.0,Components are the building blocks of circuits...,[dc02e730db9813c2bc1463c9a951d015],d3e78fb41e8649968a5c9e59a0a768ec,19,1,3,4
58,CIRCUIT,COMPONENT,9.0,Components are the building blocks of circuits...,[dc02e730db9813c2bc1463c9a951d015],d3e78fb41e8649968a5c9e59a0a768ec,19,1,3,4
...,...,...,...,...,...,...,...,...,...,...
55,CONDUCTIVE TRACK,COPPER,8.0,Copper is a common material used for conductiv...,[dc02e730db9813c2bc1463c9a951d015],b041fbc0a30d4dc889813e16d7890bd4,17,2,1,3
3,PRINTED CIRCUIT BOARD,ADDITIVE PROCESS,7.0,The additive process is a method for manufactu...,[dc02e730db9813c2bc1463c9a951d015],89ab20ea34b44e58afe0c9cba4954664,3,16,1,17
24,PRINTED CIRCUIT BOARD,LAYER,1.0,Multi-layered printed circuit boards have mult...,[dc02e730db9813c2bc1463c9a951d015],97ab164496a04b639e83f75128e953d1,15,16,1,17
4,PRINTED CIRCUIT BOARD,SUBTRACTIVE PROCESS,1.0,The subtractive process is a method for manufa...,[dc02e730db9813c2bc1463c9a951d015],3f67c602e31648f08d0ae7b6339c9551,4,16,1,17


In [35]:
be = pd.read_parquet(output_path+'create_base_entity_graph.parquet')
be

Unnamed: 0,level,clustered_graph,embeddings
0,0,"<graphml xmlns=""http://graphml.graphdrawing.or...","{'ADDITIVE PROCESS': [-0.002070424612611532, -..."
1,1,"<graphml xmlns=""http://graphml.graphdrawing.or...","{'ADDITIVE PROCESS': [-0.002070424612611532, -..."


In [37]:
import networkx as nx

#graph = nx.read_graphml(output_path+'clustered_graph.0.graphml')

# Load the graph in this cell so it does not depend on `g` being defined elsewhere
g = nx.read_graphml(output_path+'summarized_graph.graphml')

def get_subgraph_with_descendants(graph, node):
    # ego_graph with radius 1 returns the node itself plus its direct neighbours
    # (not all descendants), so the node does not need to be appended again
    neighbourhood = list(nx.ego_graph(graph, node, radius=1))

    # Create the subgraph induced by that neighbourhood
    subgraph = graph.subgraph(neighbourhood)

    return subgraph

subgraph = get_subgraph_with_descendants(g, 'QUALITY')

pos = nx.spring_layout(subgraph)
nx.draw(subgraph, pos, with_labels=True, node_size=700, node_color='lightblue', font_size=10, font_weight='bold', edge_color='gray')

In [34]:
descendants = list(nx.bfs_successors(graph,'QUALITY',1))
descendants[0][1]

['SOLDERING',
 'SOLIDIFICATION',
 'SOLDERING PROCESS',
 'COST',
 'CONSISTENCY',
 'TRAINING',
 'MACHINE SOLDERING',
 'TOUCH-UP',
 'PROCESS CONTROL']
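For comparison, `nx.bfs_successors` with `depth_limit=1` and `nx.ego_graph` with `radius=1` reach the same neighbours on an undirected graph; the difference is that `ego_graph` returns a subgraph that includes the centre node and the edges among the selected nodes. A small sketch on a toy graph (a few edge names borrowed from the outputs here, the structure itself is hypothetical):

```python
import networkx as nx

# Toy undirected graph (hypothetical structure).
toy = nx.Graph()
toy.add_edges_from([
    ("QUALITY", "SOLDERING"),
    ("QUALITY", "COST"),
    ("SOLDERING", "SOLDER"),
])

# bfs_successors yields (node, [successors]) pairs; depth_limit=1 stops at direct neighbours.
neighbours = dict(nx.bfs_successors(toy, "QUALITY", depth_limit=1))["QUALITY"]

# ego_graph returns a subgraph containing the centre node, its neighbours, and their edges.
ego = nx.ego_graph(toy, "QUALITY", radius=1)

print(sorted(neighbours))
print(sorted(ego.nodes))
```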

In [51]:
path = nx.shortest_path(graph, source='ADDRESSES', target='ALLOY')
for i in range(len(path) - 1):
    u = path[i]
    v = path[i + 1]
    description = graph[u][v]['description']
    print(f"{u} -> {v}: {description}")

ADDRESSES -> ORGANIZATIONS: Addresses are used to identify the locations of organizations
ORGANIZATIONS -> STANDARDS: Organizations are responsible for developing and publishing standards
STANDARDS -> SOLDERING: Standards are used to define the requirements for soldering, such as the types of solder that can be used, the temperature of the soldering iron, and the cleanliness of the surfaces being soldered
SOLDERING -> SOLDER: Soldering is a process that uses solder to join metal parts together. Solder is the material used in soldering to join metal parts together. Soldering involves melting and flowing solder into the joint between two metal parts. 

SOLDER -> ALLOY: Solder is an alloy, typically made of tin and lead, used to join metals together


In [47]:
g = nx.read_graphml(output_path+'summarized_graph.graphml')
g

<networkx.classes.graph.Graph at 0x7fdfba009650>

Unnamed: 0,id,text
0,0,Printed circuit board\nElectronics assemblies ...
