# Organizing and Enriching Data Entities

Let's say you have resolved some data entities and stored them in a knowledge graph, but now it's time to organize them.

Often, this is done ad hoc following a hierarchical schema, like a taxonomy.

## Structuring Metadata in Graphs

Let's consider a couple of aspects of knowledge graph development by example.

### Example: Organize your music taxonomically

Consider the familiar "schema" of popular music. In pseudo-Cypher, we can represent it like this:

    (Artist)-[:RELEASED]->(Album)-[:INCLUDES]->(Song)

Here, the nodes are modeled with node labels (`Artist`, `Album`, `Song`) and the connections between them with relationship types (`RELEASED`, `INCLUDES`).

Each label might have some common properties that could be indexed, for example:

    (Artist {knownAs, stageNames, nickNames, performerName})
    (Album {name, releaseDate, recordLabel, runtime})
    (Song {name, releaseDate, artist, producer, details})

Once the data is organized in this kind of schema, it becomes straightforward to answer questions "semantically" with queries, e.g.:
- *List all albums or songs released by a specific artist or group of artists*
- *Find all artists who have collaborated with each other*
- *List all artists or albums containing a song with a word or phrase in its name*
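
As a rough illustration of these questions, the sketch below uses plain Python tuples as a stand-in for the graph. It is not Neo4j, the artist, album, and song names are invented placeholders, and treating "collaboration" as "sharing an album" is an assumption made only for this example.

```python
# Toy stand-in for (Artist)-[:RELEASED]->(Album)-[:INCLUDES]->(Song).
# All names below are hypothetical placeholders, not real catalog data.
RELEASED = [
    ("Artist A", "Album One"),
    ("Artist A", "Album Two"),
    ("Artist B", "Album Two"),   # two artists on one album
    ("Artist C", "Album Three"),
]
INCLUDES = [
    ("Album One", "Morning Song"),
    ("Album Two", "Night Drive"),
    ("Album Three", "Morning Light"),
]


def albums_by(artists):
    """Albums released by any artist in the given collection."""
    wanted = set(artists)
    return sorted({album for artist, album in RELEASED if artist in wanted})


def collaborators_of(artist):
    """Other artists who share at least one album with `artist` (assumed definition)."""
    own = {album for a, album in RELEASED if a == artist}
    return sorted({a for a, album in RELEASED if album in own and a != artist})


def albums_with_song_matching(phrase):
    """Albums that include a song whose name contains `phrase`, ignoring case."""
    needle = phrase.lower()
    return sorted({album for album, song in INCLUDES if needle in song.lower()})
```

Each function follows one or two relationship hops, which is what the corresponding graph query would do by pattern matching.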

### Example: Link additional nodes to music taxa

What's more, this schema can be extended as needed. For example, I might want to group songs by key, by lyrics, or by version. How this is done is a design choice. Here are two examples.

In the first case, say we often look for songs in the same key, so we decide that the key should be a built-in property of a `Song`. We might use Cypher to add that property:

    MERGE (s:Song {name: $name, artist: $artist, releaseDate: $releaseDate})
    SET s += {key: $key}

If we query by key often enough, an alternative approach is to create a dedicated `Key` node and link songs to it:

    (Key {name: $name, rootNote: $root_note, mode: $mode})<-[:IN]-(Song)

Both approaches have their merit! We do not have hard rules about this.
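
To make the trade-off concrete, here is a minimal sketch of both designs using plain Python structures in place of a graph database. The songs and keys are made up, and splitting a key name into `rootNote` and `mode` on whitespace is an assumption that only holds for simple names like "C major".

```python
# Hypothetical songs; the keys are illustrative only.
songs = [
    {"name": "Song X", "key": "C major"},
    {"name": "Song Y", "key": "A minor"},
    {"name": "Song Z", "key": "C major"},
]


# Approach 1: the key is a property on each Song.
# Finding songs in a key means scanning (or indexing) that property.
def songs_in_key_by_property(key):
    return [s["name"] for s in songs if s["key"] == key]


# Approach 2: Key is promoted to its own node with (Song)-[:IN]->(Key) edges.
# Each key is stored once, and its songs are reached by following edges.
key_nodes = {}  # key name -> Key node properties
in_edges = {}   # key name -> names of songs linked to that Key
for s in songs:
    root_note, mode = s["key"].split()
    key_nodes.setdefault(
        s["key"], {"name": s["key"], "rootNote": root_note, "mode": mode}
    )
    in_edges.setdefault(s["key"], []).append(s["name"])


def songs_in_key_by_node(key):
    return in_edges.get(key, [])
```

Both lookups return the same songs. The second design adds structure up front but gives the key its own properties (`rootNote`, `mode`) that other nodes can link to.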

## Demo: Research Data

The rest of this notebook focuses on research data. Our approach starts by scanning a filesystem and resolving all of the folders as `Folder` nodes based on information found in the NCDU scan. (Note: files can also be included if the user chooses, as long as doing so does not exhaust the system.)

**Goal:** *Create data entity models (aka "labels" or "classes"), with properties that represent the concept of data we are working on.*

Since we are working with `Image` data that was collected following the user's concept of cohort design, we are left with her categorical approach. In this case, she generally had three levels of folders where she stored the data. Each subject imaged is a `Mouse`.

The user provided an (understandably) messy version of the following details for each based on their workflow during acquisition:
- `Study`: e.g. U37, U19; these correspond with planned efforts to collect images
- `Date`: e.g. 211010, 20230221, 20Feb2022; these are handwritten dates of acquisition (there is actual metadata in the image headers that we can scrape)
- `Cage`: e.g. Cage1, Cage2, etc; these correspond with the animal housing, which was broken up into groups
- `CageID`: e.g. NT, 1L, 2R; subjects were housed together and distinguished by ear tags, which these codes identify
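
The handwritten `Date` examples above mix at least three layouts. One way to normalize them is sketched here, under the assumption that six-digit values are YYMMDD, eight-digit values are YYYYMMDD, and the third form is day, English month abbreviation, then four-digit year. `parse_acquisition_date` is a hypothetical helper name, and `%b` depends on the locale, so this assumes English month names.

```python
import re
from datetime import date, datetime

# Each entry pairs a shape check with the strptime format assumed for that shape.
_DATE_FORMATS = [
    (re.compile(r"^\d{6}$"), "%y%m%d"),                 # e.g. 211010 (assumed YYMMDD)
    (re.compile(r"^\d{8}$"), "%Y%m%d"),                 # e.g. 20230221
    (re.compile(r"^\d{1,2}[A-Za-z]{3}\d{4}$"), "%d%b%Y"),  # e.g. 20Feb2022
]


def parse_acquisition_date(text):
    """Return a datetime.date for a handwritten folder date, or raise ValueError."""
    text = text.strip()
    for pattern, fmt in _DATE_FORMATS:
        if pattern.match(text):
            return datetime.strptime(text, fmt).date()
    raise ValueError(f"unrecognized date layout: {text!r}")
```

Checking the shape before calling `strptime` avoids relying on how an ambiguous string happens to parse. Where the image headers carry an acquisition date, that value is likely the better source, with the folder date kept as a cross-check.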

Wherever we find a unique (`Study`, `Cage`, `CageID`) combination, we effectively have a unique `Mouse`, so we can continue adding entities. Given these details, we might create a schema like this:

    (Study)-[:INCLUDED]->(Cage)-[:HOUSED]->(Mouse)-[:REPRESENTED_IN]->(Image)-[:STORED_IN]->(Folder)

This links each `Mouse` to its `Image` nodes and to their `Folder` locations. Now we have something really powerful that we can attach to our data management graph.
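
The rule that a unique (`Study`, `Cage`, `CageID`) combination identifies a `Mouse` can be sketched as a simple grouping step. The records below are invented for illustration (the values only mimic the examples in the list above), and the `Image` identifiers are hypothetical.

```python
# Hypothetical image records; none of these rows come from the real dataset.
images = [
    {"Study": "U37", "Cage": "Cage1", "CageID": "NT", "Image": "scan_001"},
    {"Study": "U37", "Cage": "Cage1", "CageID": "NT", "Image": "scan_002"},
    {"Study": "U37", "Cage": "Cage1", "CageID": "1L", "Image": "scan_003"},
    {"Study": "U19", "Cage": "Cage1", "CageID": "NT", "Image": "scan_004"},
]


def mouse_key(record):
    """A tuple key avoids collisions that a joined string could cause."""
    return (record["Study"], record["Cage"], record["CageID"])


# Each distinct key becomes one Mouse; the list holds its REPRESENTED_IN images.
mice = {}
for record in images:
    mice.setdefault(mouse_key(record), []).append(record["Image"])
```

Note that the same `CageID` (here `NT`) appears in two studies and still yields two separate mice, which is why all three fields are needed in the key.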

Once we have stored `Image` and `Mouse` nodes in the system, we can pull a table of either on the fly, just like a relational database. Follow along below to see how this is done.

### Working with node data

In the following cell, load the CSV that you've exported from the "Index" page of the app.

Once loaded, you can process data in a few steps:
- rename columns to work better with Neo4j
- add columns to describe the new nodes
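
Column names containing spaces or parentheses, such as `Size (Bytes)`, have to be backtick-quoted when used as property names in Cypher, so renaming them first keeps later queries simpler. The sketch below shows one possible convention (camelCase); `to_property_name` is a hypothetical helper, and it assumes each header starts with a letter.

```python
import re


def to_property_name(column):
    """Convert a CSV header such as 'Disk Usage (Bytes)' to camelCase."""
    words = re.findall(r"[A-Za-z0-9]+", column)
    if not words:
        raise ValueError(f"no usable characters in column name: {column!r}")
    first, rest = words[0], words[1:]
    return first[0].lower() + first[1:] + "".join(w[0].upper() + w[1:] for w in rest)


# Column names taken from the export used in this notebook.
columns = ["Name", "Type", "Path", "Size (Bytes)", "Disk Usage (Bytes)"]
rename_map = {c: to_property_name(c) for c in columns}

# With pandas, the mapping could then be applied as: df.rename(columns=rename_map)
```

The second step, adding columns, is plain column assignment: each new column holds a constant or derived value that describes the node the row will become.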

In [3]:
import pandas as pd
from pathlib import Path
from datetime import datetime

# input_fname = '/home/patch/Documents/2025-03-03T18-50_export.csv'
input_fname = '/home/patch/PycharmProjects/science_data_kit/ipynb/2025-03-03T21-33_export.csv'

def output_fname(input_fname):
    # Append a timestamp to the input file's stem and keep its original suffix.
    # Note: ctime() contains ':' characters, which some filesystems do not allow.
    stamp = datetime.today().ctime().replace(' ', '_')
    return f"{Path(input_fname).stem}_{stamp}{Path(input_fname).suffix}"

df = pd.read_csv(input_fname)

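# Drop the exported index column if present; errors="ignore" avoids a KeyError when it is absent.
df.drop(columns="Unnamed: 0", inplace=True, errors="ignore")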
df['Name'] = df['Path'].apply(lambda x: Path(x).name)

files = df[df['Type'] == 'File'].copy()
folders = df[df['Type'] == 'Directory'].copy()

files.reset_index(drop=True, inplace=True)
folders.reset_index(drop=True, inplace=True)

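def filename_suffixes_join(row):
    # Join all suffixes so that e.g. "scan.nii.gz" yields ".nii.gz".
    if row['Type'] == 'File':
        return ''.join(Path(row['Name']).suffixes)
    # Non-files get an empty suffix so the column stays string-typed.
    return ''

files['Suffix'] = files.apply(filename_suffixes_join, axis=1)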

def sizeof_fmt(num, suffix="B"):
    for unit in ("", "K", "M", "G", "T", "P", "E", "Z"):
        if abs(num) < 1024.0:
            return f"{num:3.1f}{unit}{suffix}"
        num /= 1024.0
    return f"{num:.1f}Yi{suffix}"

def sum_size_and_du_of_folder(folder_path, files):
    # Match on the folder path plus a separator, so "/a/b" does not also capture "/a/bc/...".
    _prefix = folder_path.rstrip('/') + '/'
    _files_inside = files[files['Path'].str.startswith(_prefix)]
    _size_byte = _files_inside["Size (Bytes)"].sum()
    _du_byte = _files_inside["Disk Usage (Bytes)"].sum()
    return [_size_byte, _du_byte, sizeof_fmt(_size_byte), sizeof_fmt(_du_byte)]

# One row of [size, disk usage, formatted size, formatted disk usage] per folder.
folders[["Size (Bytes)", "Disk Usage (Bytes)", "Size", "DiskUsage"]] = pd.DataFrame(
    folders['Path'].apply(lambda x: sum_size_and_du_of_folder(x, files)).tolist(),
    index=folders.index)

folders = folders[['Name', 'Size', 'DiskUsage', 'Type', 'Path', 'Size (Bytes)', 'Disk Usage (Bytes)']]
# Reassign instead of sorting in place, since folders is now a column subset.
folders = folders.sort_values(by=['Size (Bytes)', 'Disk Usage (Bytes)'], ascending=False)
folders = folders.reset_index(drop=True)

def assess_parentage(df_filetree):
    # Depth is counted in path components, then shifted so the shallowest folder is 0.
    df_filetree['Depth'] = df_filetree['Path'].apply(lambda x: len(Path(x).parts))
    df_filetree['Parent'] = df_filetree['Path'].apply(lambda x: Path(x).parent)

    _bottom = min(df_filetree['Depth'])
    df_filetree['Depth'] = df_filetree['Depth'] - _bottom
    folder_trunk_rows = df_filetree[df_filetree['Depth'] == 0]

    if len(folder_trunk_rows) != 1:
        # Without a single top-level folder, Root and Branch cannot be derived.
        raise ValueError(
            f"Could not identify Trunk and Root directories: found {len(folder_trunk_rows)} candidates.")

    _trunk_row = folder_trunk_rows.iloc[0]
    _trunk = Path(_trunk_row['Path']).name
    _root = Path(_trunk_row['Parent'])
    print(f" Root: {_root}")
    print(f"Trunk: {_trunk}")

    # Branch is the path relative to Root, i.e. it starts with the Trunk folder.
    df_filetree['Branch'] = df_filetree['Path'].apply(lambda x: Path(x).relative_to(_root).as_posix())

    return _trunk, _root

trunk, root = assess_parentage(folders)

folders['Label'] = 'AnalysisModule'
folders['Instrument'] = 'Computer'
folders['Trunk'] = trunk
folders['Root'] = root
folders['Host'] = "bmc-lab6.mit.edu"
folders['Share'] = 'atwai'
folders['filepath'] = folders['Path']

folders.to_csv(output_fname(input_fname), index=False)
folders

 Root: /mnt/server/bmc-lab6/atwai/archive
Trunk: nnUnet_LauraMTrainingData


Unnamed: 0,Name,Size,DiskUsage,Type,Path,Size (Bytes),Disk Usage (Bytes),Depth,Parent,Branch,Label,Instrument,Trunk,Root,Host,Share,filepath
0,nnUnet_LauraMTrainingData,93.5GB,93.6GB,Directory,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,100432868107,100456116224,0,/mnt/server/bmc-lab6/atwai/archive,nnUnet_LauraMTrainingData,AnalysisModule,Computer,nnUnet_LauraMTrainingData,/mnt/server/bmc-lab6/atwai/archive,bmc-lab6.mit.edu,atwai,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...
1,Dataset001_LungTumor,62.9GB,62.9GB,Directory,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,67537990351,67541086208,1,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,nnUnet_LauraMTrainingData/Dataset001_LungTumor,AnalysisModule,Computer,nnUnet_LauraMTrainingData,/mnt/server/bmc-lab6/atwai/archive,bmc-lab6.mit.edu,atwai,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...
2,imagesTr,54.0GB,54.0GB,Directory,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,57958869255,57960099840,2,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,nnUnet_LauraMTrainingData/Dataset001_LungTumor...,AnalysisModule,Computer,nnUnet_LauraMTrainingData,/mnt/server/bmc-lab6/atwai/archive,bmc-lab6.mit.edu,atwai,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...
3,data_preprocessed,30.6GB,30.7GB,Directory,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,32894810742,32914956288,1,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,nnUnet_LauraMTrainingData/data_preprocessed,AnalysisModule,Computer,nnUnet_LauraMTrainingData,/mnt/server/bmc-lab6/atwai/archive,bmc-lab6.mit.edu,atwai,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...
4,Dataset001_LungTumor,30.6GB,30.7GB,Directory,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,32894722213,32914866176,2,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,nnUnet_LauraMTrainingData/data_preprocessed/Da...,AnalysisModule,Computer,nnUnet_LauraMTrainingData,/mnt/server/bmc-lab6/atwai/archive,bmc-lab6.mit.edu,atwai,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...
5,nnUNetPlans_2d,11.9GB,11.9GB,Directory,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,12774916797,12781572096,3,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,nnUnet_LauraMTrainingData/data_preprocessed/Da...,AnalysisModule,Computer,nnUnet_LauraMTrainingData,/mnt/server/bmc-lab6/atwai/archive,bmc-lab6.mit.edu,atwai,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...
6,nnUNetPlans_3d_fullres,11.4GB,11.4GB,Directory,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,12218483585,12224630784,3,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,nnUnet_LauraMTrainingData/data_preprocessed/Da...,AnalysisModule,Computer,nnUnet_LauraMTrainingData,/mnt/server/bmc-lab6/atwai/archive,bmc-lab6.mit.edu,atwai,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...
7,nnUNetPlans_3d_lowres,7.2GB,7.2GB,Directory,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,7683371223,7689433088,3,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,nnUnet_LauraMTrainingData/data_preprocessed/Da...,AnalysisModule,Computer,nnUnet_LauraMTrainingData,/mnt/server/bmc-lab6/atwai/archive,bmc-lab6.mit.edu,atwai,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...
8,test,5.0GB,5.0GB,Directory,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,5341377480,5341593600,2,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,nnUnet_LauraMTrainingData/Dataset001_LungTumor...,AnalysisModule,Computer,nnUnet_LauraMTrainingData,/mnt/server/bmc-lab6/atwai/archive,bmc-lab6.mit.edu,atwai,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...
9,NoTumor,5.0GB,5.0GB,Directory,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,5341377480,5341593600,3,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...,nnUnet_LauraMTrainingData/Dataset001_LungTumor...,AnalysisModule,Computer,nnUnet_LauraMTrainingData,/mnt/server/bmc-lab6/atwai/archive,bmc-lab6.mit.edu,atwai,/mnt/server/bmc-lab6/atwai/archive/nnUnet_Laur...
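
Rows like those in the table above could later be written to the graph with a parameterized query. The following is only a sketch: `build_merge_query` is a hypothetical helper, the example row is invented, and using `filepath` as the merge key is an assumption. Because Cypher generally does not accept a node label as an ordinary query parameter, the label is interpolated into the query text and is therefore validated first.

```python
import re

# Conservative pattern for names that are safe to place directly in query text.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_merge_query(label, key="filepath"):
    """Build a Cypher query that merges one node per row on `key`."""
    for name in (label, key):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier for query text: {name!r}")
    return (
        "UNWIND $rows AS row\n"
        f"MERGE (n:{label} {{{key}: row.{key}}})\n"
        "SET n += row"
    )


# Invented example row; real rows would come from the processed DataFrame,
# for instance via folders.to_dict("records").
rows = [{"filepath": "/data/example/folder_a", "Name": "folder_a"}]
query = build_merge_query("AnalysisModule")

# With the official Neo4j Python driver, the query and rows could be sent as
# session.run(query, rows=rows); no connection is opened in this sketch.
```

`MERGE` on a single identifying property keeps reruns of the notebook from creating duplicate nodes, while `SET n += row` refreshes the remaining properties.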


In [3]:
!ls

2025-03-03T18-50_export_Mon_Mar__3_16:01:46_2025.csv
2025-03-03T18-50_export_Mon_Mar__3_16:02:43_2025.csv
Example_Enrich_2.ipynb
Example_RandomFiles.ipynb
LauraM_USGI_uCT_Entities-Copy1.ipynb
Tutorial_Enrich_1.ipynb
Tutorial_Enrich_2.ipynb
Tutorial_ResolveEntities.ipynb
