# Overview of ClinGraph & ClinVec

### Downloading ClinGraph from Harvard Dataverse

Navigate to the dataset's repository. There is no login required: [https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/Z6H1A8](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/Z6H1A8)

<img src="img/screen_1.png"  width="500"/>

Click the Download button. This opens a tab where you can choose the download format; pick the original format. A brief description of each file follows. Column names are detailed in the README.

- `ClinGraph_node.csv`: this contains all the node metadata and index information.
- `ClinGraph_edges.csv`: this contains the triplet information used to construct the KG. For convenience, each node's metadata from `ClinGraph_node.csv` is also included.
- `ClinGraph_dgl.bin`: ClinGraph in DGL binary format. We store the node types and node features under the node data (`ndata`) attribute. 
- `ClinGraph_adjlist.csv`: ClinGraph in adjacency list format; format matches NetworkX syntax. This format does not include node features. 
- `ClinGraph_pyg.pt`: ClinGraph as a PyTorch Geometric object. Node features are saved under the `x` attribute.
- `ClinGraph_features.csv`: CSV file containing the 1x1024 vectors used as node features in HGT training. Values were generated with Xavier initialization, but we provide the exact values for reproducibility.
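
Before relying on specific column names, it can help to preview a file. The following is a minimal sketch, not part of the released code: `peek_table` is a hypothetical helper that asks pandas to detect the delimiter rather than assuming one. The README remains the authoritative description of the columns.

```python
import os
import pandas as pd


def peek_table(path, n_rows=5):
    """Read the first few rows of a delimited text file.

    sep=None with the Python engine asks pandas to sniff the delimiter
    (comma, tab, ...) instead of assuming one.
    """
    return pd.read_csv(path, sep=None, engine="python", nrows=n_rows)


# File names as listed above; adjust the paths to your download location.
for name in ["ClinGraph_node.csv", "ClinGraph_edges.csv"]:
    if os.path.exists(name):  # skip files that are not in the working directory
        preview = peek_table(name)
        print(name, list(preview.columns))
```

Reading only a few rows keeps the preview fast even for large files.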

<img src="img/screen_2.png"  width="500"/>

### Reading ClinGraph into Python

Once the files are downloaded, each graph object is read in differently depending on its format:

In [None]:
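import os

# Set the working directory to the folder that holds the downloaded files.
# Replace this path with your own download location.
os.chdir("/n/holylfs06/LABS/mzitnik_lab/Lab/ruthjohnson/kg_paper_revision/harvard_dataverse/ClinGraph")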

In [None]:
# DGL
from dgl.data.utils import load_graphs
graph_list, _ = load_graphs("ClinGraph_dgl.bin")
g = graph_list[0]

# node features
print(g.ndata['feat'])

# node type (as indices)
print(g.ndata['ntype'])
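
Since node types are stored as integer indices, a quick tally per index is a useful sanity check. This is a minimal sketch under stated assumptions: `count_node_types` is a hypothetical helper, and the toy list stands in for the real tensor. With the loaded graph, something like `count_node_types(g.ndata['ntype'].numpy())` should work if the tensor lives on the CPU.

```python
import numpy as np


def count_node_types(ntype_ids):
    """Return {type_index: number_of_nodes} for an array of node-type indices."""
    ids = np.asarray(ntype_ids)
    values, counts = np.unique(ids, return_counts=True)
    return {int(v): int(c) for v, c in zip(values, counts)}


# Toy stand-in for g.ndata['ntype']; the real values come from the graph.
toy_ntype = [0, 0, 1, 2, 2, 2]
print(count_node_types(toy_ntype))  # {0: 2, 1: 1, 2: 3}
```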

In [None]:
# NetworkX
import networkx as nx
g = nx.read_adjlist("ClinGraph_adjlist.csv")
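
For reference, the NetworkX adjacency list format puts a source node first on each line, followed by its neighbors. The toy example below is a sketch, not taken from the ClinGraph file. By default NetworkX splits on whitespace and builds an undirected `Graph`, so `delimiter` and `create_using` may need adjusting depending on how the file is laid out, which is an assumption to verify against the actual data.

```python
import networkx as nx

# Each line: source node, then its neighbors (whitespace-separated by default).
lines = ["0 1 2", "1 2", "2"]
g_demo = nx.parse_adjlist(lines, nodetype=int)
print(g_demo.number_of_nodes(), g_demo.number_of_edges())

# The same graph from comma-delimited lines, read as a directed graph.
csv_lines = ["0,1,2", "1,2", "2"]
g_directed = nx.parse_adjlist(
    csv_lines, delimiter=",", nodetype=int, create_using=nx.DiGraph
)
print(sorted(g_directed.edges()))
```

`nx.read_adjlist` accepts the same `delimiter`, `nodetype`, and `create_using` arguments when reading from a file.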

In [None]:
# PyTorch Geometric
from torch_geometric.data import Data
import torch

# weights_only=False unpickles the full graph object; only use it on files you trust
g = torch.load('ClinGraph_pyg.pt', weights_only=False)

# node features
print(g.x)

### Downloading ClinVec embeddings from Harvard Dataverse

The embeddings are located in the same repository as ClinGraph. We separate embedding files by source vocabulary. Each set of embeddings is saved as a CSV file that loads as a pandas DataFrame: the 128 columns correspond to the embedding dimensions, and the row index matches the `node_index` described in `ClinGraph_node.csv`.

There are a total of 9 files:

- `ClinVec_atc.csv` 
- `ClinVec_cpt.csv`
- `ClinVec_icd10cm.csv`
- `ClinVec_icd9cm.csv`
- `ClinVec_lnc.csv`
- `ClinVec_phecode.csv`
- `ClinVec_rxnorm.csv`
- `ClinVec_snomedct.csv`
- `ClinVec_umls.csv`
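
Because the files share one naming pattern, they can be loaded in a loop. The sketch below is illustrative only: `load_clinvec` and `VOCABS` are hypothetical names, and it assumes the first column of each file holds the `node_index`.

```python
from pathlib import Path

import pandas as pd

# Vocabulary suffixes taken from the file list above.
VOCABS = ["atc", "cpt", "icd10cm", "icd9cm", "lnc",
          "phecode", "rxnorm", "snomedct", "umls"]


def load_clinvec(directory, vocabs=VOCABS):
    """Return {vocab: DataFrame} for every ClinVec_<vocab>.csv found in directory."""
    tables = {}
    for vocab in vocabs:
        path = Path(directory) / f"ClinVec_{vocab}.csv"
        if not path.exists():
            continue  # skip files that were not downloaded
        # Assumption: the first column is the node_index and becomes the row index.
        tables[vocab] = pd.read_csv(path, index_col=0)
    return tables


# Example: load whatever ClinVec files are in the current directory.
for vocab, table in load_clinvec(".").items():
    print(vocab, table.shape)
```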


### Reading ClinVec embeddings into Python

In [None]:
import pandas as pd

# load phecode embeddings
df = pd.read_csv("ClinVec_phecode.csv")
df = df.set_index(df.columns[0])

# get matrix of embeddings
emb_mat = df.values

# get node metadata (read here as tab-separated)
node_df = pd.read_csv("ClinGraph_nodes.csv", sep='\t')

# expose the row index as a column so it can serve as the merge key
df['node_index'] = df.index
phecode_emb_df = df.merge(node_df, how='inner', on='node_index')
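
One common way to explore an embedding matrix such as `emb_mat` is to rank rows by cosine similarity to a query row. This is an illustrative sketch rather than a method prescribed by the authors: `top_k_similar` is a hypothetical helper, and the toy matrix stands in for the real 128-dimensional embeddings.

```python
import numpy as np


def top_k_similar(emb_mat, query_row, k=5):
    """Return [(row_position, cosine_similarity)] for the k rows closest to query_row."""
    emb = np.asarray(emb_mat, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    sims = unit @ unit[query_row]
    sims[query_row] = -np.inf  # exclude the query itself
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]


# Toy 2-dimensional stand-in for emb_mat.
toy_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(top_k_similar(toy_emb, query_row=0, k=2))
```

The returned positions are row positions in the matrix. With the DataFrame from the cell above, `df.index[i]` maps a position back to its `node_index`.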