# Pull imaging session data and enrich

Using Python and spreadsheets, we can update the knowledge graph. Follow along below.

# Enrich a dataset

We can use this tool to enrich datasets ad hoc in a knowledge graph.

## Get the data, work with it, and then enrich the KG

Each project will differ. Here, we need to:

1. Pull nodes of interest (NOI) from the database
2. Compute on the NOI metadata in Python (for example, in a Jupyter notebook) or another tool
3. Merge results into a DataFrame
4. Push the results back into the graph using one of the methods of `science_data_kit.utils.graph_utils.Neo4jConnection`

## Connect to the database with the included driver

First, make sure you are pointing to the correct configuration file.

By default, this software looks for `.db_config.yaml`, although you can point it at a different file. The general recommendation is to copy the template, `db_config.yaml`, to `.db_config.yaml`, and then edit `.db_config.yaml`.

    cp db_config.yaml .db_config.yaml
    vi .db_config.yaml #enter proper credentials with editor of choice
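
Before connecting, it can help to confirm that the configuration file actually exists. The sketch below is a minimal, hypothetical helper (`resolve_config_path` is not part of `science_data_kit`) that returns the path if the file is present and raises a clear error otherwise. It assumes the file names used above.

```python
from pathlib import Path


def resolve_config_path(config_file=".db_config.yaml", template="db_config.yaml", base_dir="."):
    """Return the path to the config file, or raise with a hint if it is missing.

    Hypothetical helper for illustration; not part of science_data_kit.
    """
    base = Path(base_dir)
    path = base / config_file
    if path.is_file():
        return path
    hint = ""
    # Only suggest copying the template when the template itself is present
    if (base / template).is_file():
        hint = f" Copy the template first: cp {template} {config_file}"
    raise FileNotFoundError(f"Config file not found: {path}.{hint}")
```

With the real file in place, the result can be passed straight through, e.g. `Neo4jConnection(config_file=str(resolve_config_path()))`.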
    
Once that is in place, you should be able to run the following.

In [3]:
from utils.graph_utils import Neo4jConnection, load_db_config

neocon = Neo4jConnection(config_file='.db_config.yaml')
neocon.test_connection()

Connection successful!


True

## Build a submodule

We are often working with data ad hoc. This means we cannot define the constraints of the problem in a clear and general sense. As a result, `science_data_kit` focuses on providing barebones skeleton code to expedite building out your use case on a project-by-project basis.

Below are some example queries that use the built-in functions to collect relevant information.

In [None]:
import numpy as np


def fetch_imaging_sessions_by_mouse(session_count_min=1, contains_tbd=False):
    _substr = ""
    if not contains_tbd:
        _substr = "NOT"
    return neocon.query_to_dataframe(f"""
        MATCH (m:Mouse)<-[i:IMAGED]-(ims:ImagingSession)
        WITH m, COUNT(DISTINCT ims) AS session_count
        WHERE session_count >= {session_count_min}
        AND {_substr} ( m.uid CONTAINS 'TBD' or m.uid CONTAINS 'nan' ) 
        MATCH (m)<-[i:IMAGED]-(ims)  // Re-match after filtering
        RETURN m.uid AS uid, COUNT(DISTINCT ims.filepath) AS session_count, COLLECT(ims.filepath) AS filepaths
        ORDER BY session_count DESC 
    """)

def fetch_imaging_session_by_mouse_and_date(session_count_min=1, contains_tbd=False, tbd_filter=True):
    _filter = ""
    if tbd_filter:
        _substr = ""
        if not contains_tbd:
            _substr = "NOT"
        _filter = f"        AND {_substr} ( m.uid CONTAINS 'TBD' or m.uid CONTAINS 'nan' )"
    return neocon.query_to_dataframe(f"""
        MATCH (m:Mouse)<-[i:IMAGED]-(ims:ImagingSession)
        WITH m, COUNT(DISTINCT ims) AS session_count
        WHERE session_count >= {session_count_min}
        {_filter}
        MATCH (m)<-[i:IMAGED]-(ims)  // Re-match after filtering
        RETURN m.uid AS uid, ims.Date AS date, ims.filepath AS filepath
    """)

def fetch_imaging_session_count_across_mice(session_count_min=1, contains_tbd=False, tbd_filter=True):
    _filter = ""
    if tbd_filter:
        _substr = ""
        if not contains_tbd:
            _substr = "NOT"
        _filter = f"        AND {_substr} ( m.uid CONTAINS 'TBD' or m.uid CONTAINS 'nan' )"
    return neocon.query_to_value(f"""
        MATCH (m:Mouse)<-[i:IMAGED]-(ims:ImagingSession)
        WITH m, COUNT(DISTINCT ims) AS session_count
        WHERE session_count >= {session_count_min}
        {_filter}
        MATCH (m)<-[i:IMAGED]-(ims)  // Re-match after filtering
        RETURN COUNT(DISTINCT ims.filepath) AS session_count
    """)

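The three functions above repeat the same `TBD`/`nan` filter logic. As a sketch, that logic could be factored into small helpers. `build_tbd_filter` and `build_session_count_where` are hypothetical names, not part of `science_data_kit`, and they only build the Cypher fragment as a string. The second helper also coerces `session_count_min` to `int`, since the value is interpolated directly into the query text.

```python
def build_tbd_filter(contains_tbd=False, tbd_filter=True, alias="m"):
    """Build the optional Cypher clause that keeps or drops placeholder mouse uids.

    Hypothetical helper; mirrors the filter strings used in the functions above.
    """
    if not tbd_filter:
        return ""  # no filtering: every mouse is included
    negate = "" if contains_tbd else "NOT "
    return f"AND {negate}( {alias}.uid CONTAINS 'TBD' OR {alias}.uid CONTAINS 'nan' )"


def build_session_count_where(session_count_min=1, **filter_kwargs):
    """Build the WHERE clause shared by the three queries."""
    # int() raises ValueError if a non-numeric value would end up in the query text
    minimum = int(session_count_min)
    return f"WHERE session_count >= {minimum} {build_tbd_filter(**filter_kwargs)}".rstrip()


print(build_session_count_where(3))
# WHERE session_count >= 3 AND NOT ( m.uid CONTAINS 'TBD' OR m.uid CONTAINS 'nan' )
```

Each query function could then interpolate the returned clause in place of its own `WHERE ... AND ...` lines.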


## Aggregate relevant information for the analysis directly from the database

Building out notebooks like this allows us to run a preliminary analysis and then just re-run when all the data arrives.

Here are some ways I've used the above functions to collect relevant information.

In [2]:
d = {}

d['Mouse'] = {}
d['Mouse']['ImagingSessions'] = {}
d['Mouse']['ImagingSessions']['unique'] = fetch_imaging_sessions_by_mouse(contains_tbd=False)
d['Mouse']['ImagingSessions']['ambiguity'] = fetch_imaging_sessions_by_mouse(contains_tbd=True)

d['ImagingSession'] = {}
d['ImagingSession']['count'] = fetch_imaging_session_count_across_mice(
    session_count_min=1, tbd_filter=False, contains_tbd=True)
d['ImagingSession']['count_w_unique_mouse_uid'] = fetch_imaging_session_count_across_mice(
    session_count_min=1, tbd_filter=True, contains_tbd=False)
d['ImagingSession']['count_w_ambiguous_mouse_uid'] = fetch_imaging_session_count_across_mice(
    session_count_min=1, tbd_filter=True, contains_tbd=True)

## Unit Testing Data

Here are some examples of checks we can run to ensure that data looks correct from a few different perspectives at runtime!

First, check that the total session count equals the sum of the sessions with unique mouse uids and the sessions with ambiguous mouse uids. Then check that each mouse's session count matches the number of filepaths collected for it.

In [3]:
count_total = d['ImagingSession']['count']
count_combined = d['ImagingSession']['count_w_unique_mouse_uid'] + d['ImagingSession']['count_w_ambiguous_mouse_uid']
if not count_total == count_combined:
    print(f"WARNING: Counts not matching!")
    print(f"            TOTAL: {count_total:5d}")
    print(f"         COMBINED: {count_combined:5d}")
    
else:
    print(f"Success! (Counts match)")
    for _k, _v in d['ImagingSession'].items():
        print(f"{_k:>30s}: {_v:5d}")


mice = d['Mouse']['ImagingSessions']['unique']
_filt = mice.apply(lambda row: len(row['filepaths']) == row['session_count'], axis=1)
if len(mice[~_filt]):
    print("WARNING: Session counts do not match found filepaths.")
else:
    print("Success!")

mice = mice[mice['session_count'] >= 3].reset_index(drop=True)

Success! (Counts match)
                         count:  2765
      count_w_unique_mouse_uid:  1818
   count_w_ambiguous_mouse_uid:   947
Success!
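
The same checks can be written as small functions that raise or return the offending rows instead of printing, which makes them easier to reuse when the notebook is re-run on new data. This is a sketch with hypothetical helper names, using the counts from the output above and plain dictionaries in place of DataFrame rows.

```python
def check_partition(total, parts):
    """Raise if the partition counts do not add up to the total."""
    combined = sum(parts.values())
    if combined != total:
        raise ValueError(f"Counts do not match: total={total}, combined={combined}")
    return combined


def find_filepath_mismatches(rows):
    """Return the uids whose session_count differs from the number of filepaths."""
    return [r["uid"] for r in rows if len(r["filepaths"]) != r["session_count"]]


# Counts taken from the notebook output above
check_partition(2765, {"unique": 1818, "ambiguous": 947})

# Hypothetical rows standing in for mice.to_dict("records");
# mouse_b has one distinct filepath that was collected twice
rows = [
    {"uid": "mouse_a", "session_count": 2, "filepaths": ["a1.tif", "a2.tif"]},
    {"uid": "mouse_b", "session_count": 1, "filepaths": ["b1.tif", "b1.tif"]},
]
print(find_filepath_mismatches(rows))  # ['mouse_b']
```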


## An example pull, enrich, and then push

### Pull
First, we pull the set of data we want to enrich.

In [6]:
sessions = fetch_imaging_session_by_mouse_and_date(session_count_min=3)

### Enrich
Then we enrich the dataset. Here the enrichment is just a random number standing in for a real statistic.

In [None]:
sessions['Label'] = "Analysis"
sessions['is'] = 'ml_model_12345_v2'  # change this per measurement
sessions['value'] = np.random.rand(len(sessions))  # replace with whatever you want to push back in
sessions['units'] = 'probability (0-1)'  # edit this to describe the measurement clearly
sessions['Link_Label'] = 'ImagingSession'
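
In practice, the random `value` column would be replaced by results computed elsewhere. One way to attach them, sketched here with a small hypothetical table and made-up scores, is to merge on the same keys used for matching (`uid` and `date`). The names `sessions_demo` and `results_demo` are illustrative only; `validate='one_to_one'` makes pandas raise if either side has duplicate keys.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by the pull step
sessions_demo = pd.DataFrame({
    "uid": ["mouse_a", "mouse_a", "mouse_b"],
    "date": ["2024-01-01", "2024-01-08", "2024-01-01"],
    "filepath": ["a1.tif", "a2.tif", "b1.tif"],
})

# Hypothetical results computed outside the database (values are made up)
results_demo = pd.DataFrame({
    "uid": ["mouse_a", "mouse_b", "mouse_a"],
    "date": ["2024-01-08", "2024-01-01", "2024-01-01"],
    "value": [0.82, 0.10, 0.35],
})

# A left merge keeps every pulled session and preserves the original row order
enriched = sessions_demo.merge(
    results_demo, on=["uid", "date"], how="left", validate="one_to_one"
)

# Sessions without a result get NaN in 'value'; drop them before pushing
enriched = enriched.dropna(subset=["value"]).reset_index(drop=True)
print(enriched[["uid", "date", "value"]])
```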

### Push

Finally, we will use one of the `Neo4jConnection` member functions to push and link to existing nodes.

In [8]:
# neocon.push_dataframe(df=sessions, label_col='Label', property_cols=['uid', 'date', ''],  match_cols=['uid','date'])

neocon.push_and_link_dataframe(
    sessions,
    label_col='Label', 
    property_cols=['uid', 'date', 'is', 'value', 'units'], 
    match_cols=['uid', 'date', 'is'],
    node_match_label='Link_Label', 
    node_match_properties=['uid', 'date'], 
    node_match_relationship_type='ANALYZED'
)

NOTE: This process can be adapted and repeated with the same `match_cols` and `property_cols`; the data just needs to be Cypher-friendly.
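
One reading of the note above is that column names used as property keys should be valid unquoted Cypher identifiers. The check below is a conservative sketch under that assumption. `find_unfriendly_columns` is a hypothetical helper, and the pattern is stricter than what Cypher accepts, since names with other characters can still be used with backtick quoting.

```python
import re

# Conservative pattern: a letter or underscore, then letters, digits, or underscores
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def find_unfriendly_columns(columns):
    """Return column names that do not match the conservative identifier pattern."""
    return [c for c in columns if not _IDENTIFIER.fullmatch(str(c))]


print(find_unfriendly_columns(["uid", "date", "is", "value", "units", "p value", "2nd_pass"]))
# ['p value', '2nd_pass']
```

Running such a check on `property_cols` and `match_cols` before a push gives an early warning about names that may need renaming.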