## Collecting **Summary Statistics** across multiple experiments - Part 2.6
--------------------
Now that a function has been built to quantify features of **organelles** and **regions** across multiple cells, we can create a function that **summarizes** the resulting metrics.

## **OBJECTIVE**
### <input type="checkbox"/> Summarize ***organelle and region quantification***
This notebook outlines the logic for restructuring and summarizing the **organelle and region** metric tables produced by the combined quantification.

---------
## **Batch Summary Stats**

### summary of steps

🛠️ **BUILD FUNCTION PROTOTYPE**

- **`0`** - Establish csv paths *(preliminary step)*

- **`1`** - Read in and categorize csv files

    - read in and categorize csv files for all listed paths
    - Combine comprehensive metrics tables to be summarized and restructured

- **`2`** - Restructure comprehensive organelle two-way interaction metrics table

    - break down the interaction table column names
    - group observations by the **first** organelle involved in interactions
    - unstack the grouped table to create a column for every unique organelle interaction type
    - correct column names for unstacked tables to accurately describe the **first** organelle involved in each unique interaction site
    - repeat last three substeps for the **second** organelle involved in the interactions
    - combine and merge the data from **both** unstacked tables to include interaction metrics from all organelle objects

- **`3`** - Apply aggregate statistics for summarization

    - determine aggregate statistics to be applied per organelle object
    - summarize the metrics shared between the organelle morphology and interaction tables
    - summarize metrics in the region morphology table
    - summarize additional metrics in the organelle morphology table

- **`4`** - Restructure distribution metrics tables

    - for the XY-distribution, collect summary statistics for voxel bins and wedges
    - for the Z-distribution, collect summary statistics for Z slices
    - calculate the **mean**, **median**, and **standard deviation** of the coefficient of variation (CV) across the XY-distribution bins
    - repeat the first two substeps for the nucleus distribution metrics
    - combine nucleus and organelle distribution tables


- **`5`** - Add normalized metrics

    - calculate the fraction of the cell volume taken up by each organelle (called the "area fraction" in the code)
    - calculate fraction of organelle objects involved in specific interorganelle contacts

- **`6`** - Unstack and finalize summary stats tables

    - unstack and reorder organelle morphology summary table columns
    - fill "NaN" values with 0 where necessary to finalize the organelle morphology summary table
    - unstack and reorder organelle interactions summary table columns
    - fill "NaN" values with 0 where necessary to finalize the organelle interactions summary table
    - unstack and reorder distribution measurements summary table columns to create finalized table
    - unstack and reorder region morphology summary table columns
    - add normalization to finalize region morphology summary table
    - combine all four tables to create a complete summary table

- **`7`** - Export summary stats tables as .csv files

⚙️ **EXECUTE FUNCTION PROTOTYPE**

- Define prototype `_batch_summary_stats` function

- Run prototype `_batch_summary_stats` function

## **IMPORTS**

#### &#x1F3C3; **Run code; no user input required**

&#x1F453; **FYI:** This code block loads all of the Python packages and functions needed for this notebook.

In [None]:
from pathlib import Path
from typing import List

import numpy as np
import pandas as pd
import os

from infer_subc.utils.stats import *
from infer_subc.utils.stats_helpers import *

%load_ext autoreload
%autoreload 2

# ***BUILD FUNCTION PROTOTYPE***

## **`0` - Establish csv paths *(preliminary step)***

In [None]:
# List of file paths to be included in the summary
csv_path_list=[Path(os.getcwd()).parents[1] / "sample_data" /  "batch_example" / "quant"]

## **`1` - Read in and categorize csv files**

- read in and categorize csv files for all listed paths

In [None]:
ds_count = 0
fl_count = 0

org_tabs = []
contact_tabs = []
dist_tabs = []
region_tabs = []

for loc in csv_path_list:
    ds_count = ds_count + 1
    loc=Path(loc)
    files_store = sorted(loc.glob("*.csv"))
    for file in files_store:
        fl_count = fl_count + 1
        stem = file.stem

        org = "organelles"
        contacts = "contacts"
        dist = "distributions"
        regions = "_regions"

        if org in stem:
            test_orgs = pd.read_csv(file, index_col=0)
            test_orgs.insert(0, "dataset", stem[:-11])
            org_tabs.append(test_orgs)
        if contacts in stem:
            test_contact = pd.read_csv(file, index_col=0)
            test_contact.insert(0, "dataset", stem[:-9])
            contact_tabs.append(test_contact)
        if dist in stem:
            test_dist = pd.read_csv(file, index_col=0)
            test_dist.insert(0, "dataset", stem[:-14])
            dist_tabs.append(test_dist)
        if regions in stem:
            test_regions = pd.read_csv(file, index_col=0)
            test_regions.insert(0, "dataset", stem[:-8])
            region_tabs.append(test_regions)
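
The negative slices above (`stem[:-11]`, `stem[:-9]`, `stem[:-14]`, `stem[:-8]`) match the lengths of the suffixes `_organelles`, `_contacts`, `_distributions`, and `_regions`. The sketch below makes that relationship explicit. The helper `split_stem` and the example file stems are hypothetical, and it uses a stricter `endswith` test rather than the substring test in the cell above.

```python
# Hypothetical helper: derive the dataset name by removing a known suffix.
SUFFIXES = ("_organelles", "_contacts", "_distributions", "_regions")

def split_stem(stem: str):
    """Return (dataset, table_type) for a stem ending in a known suffix, else None."""
    for suffix in SUFFIXES:
        if stem.endswith(suffix):
            return stem[: -len(suffix)], suffix.lstrip("_")
    return None

# Invented file stems for illustration only.
examples = ["exp1_organelles", "exp1_contacts", "exp1_distributions", "exp1_regions"]
parsed = [split_stem(s) for s in examples]
print(parsed)
```

Computing the slice from `len(suffix)` avoids keeping the hard-coded numbers in sync with the suffix names by hand.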


- Combine comprehensive metrics tables to be summarized and restructured

In [None]:
org_df = pd.concat(org_tabs,axis=0, join='outer')
contacts_df = pd.concat(contact_tabs,axis=0, join='outer')
dist_df = pd.concat(dist_tabs,axis=0, join='outer')
regions_df = pd.concat(region_tabs,axis=0, join='outer')

In [None]:
org_df.head()

In [None]:
contacts_df.head()

In [None]:
dist_df.head()

In [None]:
regions_df.head()

## **`2` - Restructure comprehensive two-way organelle interaction metrics table**

> ###### **📝 Please note that in the following steps, a specific procedure will be repeated to ensure that all unique organelle objects are described by their interaction metrics, regardless of whether they are the first organelle (A) or the second organelle (B) involved in the two-way contact.**

- break down the interaction table column names

In [None]:
# copy the selection so the new columns below are added to an independent table
contact_cnt = contacts_df[["dataset", "image_name", "object", "label", "volume"]].copy()
contact_cnt[["orgA", "orgB"]] = contact_cnt["object"].str.split('X', expand=True)
contact_cnt[["A_ID", "B_ID"]] = contact_cnt["label"].str.split('_', expand=True)
contact_cnt["A"] = contact_cnt["orgA"] +"_" + contact_cnt["A_ID"].astype(str)
contact_cnt["B"] = contact_cnt["orgB"] +"_" + contact_cnt["B_ID"].astype(str)
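
As a minimal sketch of the split above, the toy table below uses invented contact names and labels. It assumes, as the cell above does, that contact types are written as `<orgA>X<orgB>` and labels as `<A_ID>_<B_ID>`.

```python
import pandas as pd

# Toy contact rows; the values are invented for illustration only.
toy = pd.DataFrame({
    "object": ["lysoXmito", "mitoXER"],
    "label": ["3_7", "7_1"],
})

# Split the contact type into its two organelles and the label into its two IDs.
toy[["orgA", "orgB"]] = toy["object"].str.split("X", expand=True)
toy[["A_ID", "B_ID"]] = toy["label"].str.split("_", expand=True)
print(toy)
```

Note that the IDs remain strings after the split, and that an organelle name containing a capital `X` would be split in the wrong place.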

- group observations by the **first** organelle involved in interactions

In [None]:
contact_cnt_percell = contact_cnt[["dataset", "image_name", "orgA", "A_ID", "object", "volume"]].groupby(["dataset", "image_name", "orgA", "A_ID", "object"]).agg(["count", "sum"])
contact_cnt_percell.columns = ["_".join(col_name).rstrip('_') for col_name in contact_cnt_percell.columns.to_flat_index()]

- unstack the grouped table to create a column for every unique organelle interaction type

In [None]:
unstacked = contact_cnt_percell.unstack(level='object')
unstacked.columns = ["_".join(col_name).rstrip('_') for col_name in unstacked.columns.to_flat_index()]
unstacked = unstacked.reset_index()
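
The group, flatten, and unstack sequence can be hard to picture, so here is a small sketch on invented data. Column names follow the cells above; the `dataset` key is omitted to keep the example short.

```python
import pandas as pd

# Toy contacts: lysosome 1 touches mitochondria twice, lysosome 2 touches the ER once.
toy = pd.DataFrame({
    "image_name": ["cell1"] * 4,
    "orgA": ["lyso", "lyso", "lyso", "mito"],
    "A_ID": ["1", "1", "2", "5"],
    "object": ["lysoXmito", "lysoXmito", "lysoXER", "mitoXER"],
    "volume": [2.0, 3.0, 1.5, 4.0],
})
keys = ["image_name", "orgA", "A_ID", "object"]

# Count and total contact volume per organelle object and contact type.
per_obj = toy.groupby(keys).agg(["count", "sum"])
per_obj.columns = ["_".join(c) for c in per_obj.columns.to_flat_index()]

# One column per contact type; combinations that never occur become NaN.
wide = per_obj.unstack(level="object")
wide.columns = ["_".join(c) for c in wide.columns.to_flat_index()]
wide = wide.reset_index()
print(wide)
```

Each row of `wide` is one organelle object, with columns such as `volume_count_lysoXmito` and `volume_sum_lysoXmito`.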

- correct column names for unstacked tables to accurately describe the **first** organelle involved in each unique interaction site

In [None]:
# Fixes the count and volume metrics
for col in unstacked.columns:
    if col.startswith("volume_count_"):
        newname = col.split("_")[-1] + "_count"
        unstacked.rename(columns={col:newname}, inplace=True)
    if col.startswith("volume_sum_"):
        newname = col.split("_")[-1] + "_volume"
        unstacked.rename(columns={col:newname}, inplace=True)

# the first organelle is simply referred to as "object"
# the label of the first organelle is simply referred to as "label"
unstacked.rename(columns={"orgA":"object", "A_ID":"label"}, inplace=True)
unstacked.set_index(['dataset', 'image_name', 'object', 'label'])


- repeat last three substeps for the **second** organelle involved in the interactions

In [None]:
contact_percellB = contact_cnt[["dataset", "image_name", "orgB", "B_ID", "object", "volume"]].groupby(["dataset", "image_name", "orgB", "B_ID", "object"]).agg(["count", "sum"])
contact_percellB.columns = ["_".join(col_name).rstrip('_') for col_name in contact_percellB.columns.to_flat_index()]
unstackedB = contact_percellB.unstack(level='object')
unstackedB.columns = ["_".join(col_name).rstrip('_') for col_name in unstackedB.columns.to_flat_index()]
unstackedB = unstackedB.reset_index()
for col in unstackedB.columns:
    if col.startswith("volume_count_"):
        newname = col.split("_")[-1] + "_count"
        unstackedB.rename(columns={col:newname}, inplace=True)
    if col.startswith("volume_sum_"):
        newname = col.split("_")[-1] + "_volume"
        unstackedB.rename(columns={col:newname}, inplace=True)
unstackedB.rename(columns={"orgB":"object", "B_ID":"label"}, inplace=True)
unstackedB.set_index(['dataset', 'image_name', 'object', 'label'])

- combine and merge the data from **both** unstacked tables to include interaction metrics from all organelle objects

In [None]:
contact_cnt = pd.concat([unstacked, unstackedB], axis=0).sort_index(axis=0)
contact_cnt = contact_cnt.groupby(['dataset', 'image_name', 'object', 'label']).sum().reset_index()
contact_cnt['label']=contact_cnt['label'].astype("Int64")

org_df = pd.merge(org_df, contact_cnt, how='left', on=['dataset', 'image_name', 'object', 'label'], sort=True)
org_df[contact_cnt.columns] = org_df[contact_cnt.columns].fillna(0)
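
A minimal sketch of the combine-and-merge step, using invented tables: one mitochondrion appears as the first organelle in one contact type and as the second organelle in another, and a second mitochondrion has no contacts at all. The `dataset` key is left out for brevity.

```python
import pandas as pd

keys = ["image_name", "object", "label"]

# The same object seen as organelle A in one contact type and as organelle B in another.
as_first = pd.DataFrame({"image_name": ["cell1"], "object": ["mito"], "label": [5],
                         "mitoXER_count": [1.0]})
as_second = pd.DataFrame({"image_name": ["cell1"], "object": ["mito"], "label": [5],
                          "lysoXmito_count": [2.0]})

# Stack both views, then collapse to one row per object (NaN is skipped by sum).
contacts = pd.concat([as_first, as_second], axis=0).groupby(keys).sum().reset_index()

# Left merge keeps every organelle; objects without contacts get 0 instead of NaN.
organelles = pd.DataFrame({"image_name": ["cell1", "cell1"], "object": ["mito", "mito"],
                           "label": [5, 6], "volume": [10.0, 4.0]})
merged = organelles.merge(contacts, how="left", on=keys)
count_cols = [c for c in merged.columns if c.endswith("_count")]
merged[count_cols] = merged[count_cols].fillna(0)
print(merged)
```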

In [None]:
contact_cnt_percell

In [None]:
unstacked

In [None]:
contact_cnt

In [None]:
org_df

## **`3` - Apply aggregate statistics for summarization**

- determine aggregate statistics to be applied per organelle object

In [12]:
###################
# summary stat group
###################

# ensure summary statistics are applied on a per organelle object level
group_by = ['dataset', 'image_name', 'object']

# metrics to be observed
sharedcolumns = ["SA_to_volume_ratio", "equivalent_diameter", "extent", "euler_number", "solidity", "axis_major_length"]

# statistical functions to be performed on the metrics
ag_func_standard = ['mean', 'median', 'std']

- summarize shared metrics between the organelle morphology and interaction tables

In [None]:
###################
# summarize shared measurements between org_df and contacts_df
###################
org_cont_tabs = []
for tab in [org_df, contacts_df]:
    tab1 = tab[group_by + ['volume']].groupby(group_by).agg(['count', 'sum'] + ag_func_standard)
    tab2 = tab[group_by + ['surface_area']].groupby(group_by).agg(['sum'] + ag_func_standard)
    tab3 = tab[group_by + sharedcolumns].groupby(group_by).agg(ag_func_standard)
    shared_metrics = pd.merge(tab1, tab2, 'outer', on=group_by)
    shared_metrics = pd.merge(shared_metrics, tab3, 'outer', on=group_by)
    org_cont_tabs.append(shared_metrics)

org_summary = org_cont_tabs[0]
contact_summary = org_cont_tabs[1]
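
Passing a list of function names to `agg` produces two-level column names of the form `(metric, statistic)`. The sketch below shows this on invented volumes; only `volume` is summarized to keep it short.

```python
import pandas as pd

# Toy per-object volumes for one organelle type in two cells.
toy = pd.DataFrame({
    "image_name": ["cell1"] * 3 + ["cell2"] * 2,
    "object": ["mito"] * 5,
    "volume": [1.0, 2.0, 6.0, 4.0, 8.0],
})
group_by = ["image_name", "object"]

# One row per (image, organelle type); columns are (metric, statistic) pairs.
summary = toy.groupby(group_by).agg(["count", "sum", "mean", "median", "std"])
print(summary)
print(summary.loc[("cell1", "mito"), ("volume", "median")])
```

The `(metric, statistic)` tuples are what the later normalization steps index with, for example `('volume', 'sum')`.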

- summarize metrics in the region morphology table

In [None]:
###################
# group metrics from regions_df similar to the above
###################
regions_summary = regions_df[group_by + ['volume', 'surface_area'] + sharedcolumns].set_index(group_by)

- summarize additional metrics in the organelle morphology table

In [None]:
###################
# summarize extra metrics from org_df
###################
columns2 = [col for col in org_df.columns if col.endswith(("_count", "_volume"))]
contact_counts_summary = org_df[group_by + columns2].groupby(group_by).agg(['sum'] + ag_func_standard)
org_summary = pd.merge(org_summary, contact_counts_summary, 'outer', on=group_by)#left_on=group_by, right_on=True)

In [None]:
org_summary

In [14]:
pd.set_option('display.max_columns', None)

## **`4` - Restructure distribution metrics tables**

- for the XY-distribution, collect summary statistics for voxel bins and wedges

- for the Z-distribution, collect summary statistics for Z slices

> ###### Statistics collected: **mean, median, mode, minimum, maximum, range, standard deviation, skew, kurtosis and variance**

- calculate the **mean**, **median**, and **standard deviation** of the coefficient of variation (CV) across the XY-distribution bins

In [None]:
hist_dfs = []
for ind in range(len(dist_df.index)):  # process every row, as in the final function
    selection = dist_df.iloc[[ind]] #    selection = dist_df.loc[[ind]]
    bins_df = pd.DataFrame()
    wedges_df = pd.DataFrame()
    Z_df = pd.DataFrame()
    CV_df = pd.DataFrame()

    bins_df[['bins', 'masks', 'obj']] = selection[['XY_bins', 'XY_mask_vox_cnt_perbin', 'XY_obj_vox_cnt_perbin']]
    wedges_df[['bins', 'masks', 'obj']] = selection[['XY_wedges', 'XY_mask_vox_cnt_perwedge', 'XY_obj_vox_cnt_perwedge']]
    Z_df[['bins', 'masks', 'obj']] = selection[['Z_slices', 'Z_mask_vox_cnt', 'Z_obj_vox_cnt']]

    dfs = [selection[['dataset', 'image_name', 'object']].reset_index()]
    for df, prefix in zip([bins_df, wedges_df, Z_df], ["XY_bins_", "XY_wedges_", "Z_slices_"]):
        single_df = pd.DataFrame(list(zip(df["bins"].values[0][1:-1].split(", "), 
                                        df["obj"].values[0][1:-1].split(", "), 
                                        df["masks"].values[0][1:-1].split(", "))), columns =['bins', 'obj', 'mask']).astype(int)
        
        if "Z_" in prefix:
            single_df =  single_df.drop(single_df[single_df['mask'] == 0].index)
            single_df['bins'] = (single_df["bins"]/max(single_df.bins)*9.99).apply(np.floor)+1
            single_df = single_df.groupby("bins").agg(['sum']).reset_index()
            single_df.columns = ['bins',"obj","mask"]
    
        single_df['mask_fract'] = single_df['mask']/single_df['mask'].max()
        # single_df['obj_normed_tocell'] = (single_df["obj"]*single_df["mask_fract"]).fillna(0)
        single_df['obj_perc_per_bin'] = (single_df["obj"] / single_df["obj"].sum())*100
        single_df['obj_portion_normed_tobin'] = (single_df["obj_perc_per_bin"]/single_df["mask_fract"]).fillna(0)

        sumstats_df = pd.DataFrame()

        s = single_df['bins'].repeat(single_df['obj_portion_normed_tobin']*100)
        ###################################
        #SUB-STEPS 1 & 2
        ###################################
        sumstats_df['hist_mean']=[s.mean()]
        sumstats_df['hist_median']=[s.median()]
        # the mode is undefined for an empty histogram; store a real missing value
        if single_df['obj_portion_normed_tobin'].sum() != 0:
            sumstats_df['hist_mode']=[s.mode().iloc[0]]
        else:
            sumstats_df['hist_mode']=[np.nan]
        sumstats_df['hist_min']=[s.min()]
        sumstats_df['hist_max']=[s.max()]
        sumstats_df['hist_range']=[s.max() - s.min()]
        sumstats_df['hist_stdev']=[s.std()]
        sumstats_df['hist_skew']=[s.skew()]
        sumstats_df['hist_kurtosis']=[s.kurtosis()]
        sumstats_df['hist_var']=[s.var()]
        sumstats_df.columns = [prefix+col for col in sumstats_df.columns]
        dfs.append(sumstats_df.reset_index())
    
    ###################################
    #SUB-STEP 3
    ###################################
    CV_df = pd.DataFrame(list(zip(selection["XY_obj_cv_perbin"].values[0][1:-1].split(", "))), columns =['CV']).astype(float)
    sumstats_CV_df = pd.DataFrame()
    sumstats_CV_df['XY_bin_CV_mean'] = CV_df.mean()
    sumstats_CV_df['XY_bin_CV_median'] = CV_df.median()
    sumstats_CV_df['XY_bin_CV_std'] = CV_df.std()
    dfs.append(sumstats_CV_df.reset_index().drop(['index'], axis=1))
    
    ###################################
    # Combine all resulting tables
    ###################################
    combined_df = pd.concat(dfs, axis=1).drop(columns="index")
    hist_dfs.append(combined_df)
dist_org_summary = pd.concat(hist_dfs, ignore_index=True)
dist_org_summary
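
The statistics above are computed by expanding a binned histogram back into individual observations with `Series.repeat`. The sketch below isolates that idea. It assumes the bins and counts are stored as stringified lists such as `"[1, 2, 3]"`, as the `[1:-1].split(", ")` parsing above implies, and it uses invented integer counts rather than the normalized percentages.

```python
import pandas as pd

# Stringified lists, as they would be read back from a csv cell (invented values).
raw_bins = "[1, 2, 3, 4]"
raw_counts = "[0, 2, 1, 1]"

# Strip the brackets and split on ", " to recover the integers.
bins = pd.Series(raw_bins[1:-1].split(", ")).astype(int)
counts = pd.Series(raw_counts[1:-1].split(", ")).astype(int)

# Repeat each bin value by its count so ordinary Series statistics apply.
expanded = bins.repeat(counts.to_numpy())

stats = {
    "mean": expanded.mean(),
    "median": expanded.median(),
    "mode": expanded.mode().iloc[0],
    "range": expanded.max() - expanded.min(),
}
print(expanded.tolist(), stats)
```

Because each bin value is repeated in proportion to its weight, the mean, median, and mode describe where along the bins the organelle signal is concentrated.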

- repeat the first two substeps for the nucleus distribution metrics

In [None]:
# nucleus distribution
nuc_dist_df = dist_df[["dataset", "image_name", 
                    "XY_bins", "XY_center_vox_cnt_perbin", "XY_mask_vox_cnt_perbin",
                    "XY_wedges", "XY_center_vox_cnt_perwedge", "XY_mask_vox_cnt_perwedge",
                    "Z_slices", "Z_center_vox_cnt", "Z_mask_vox_cnt"]].set_index(["dataset", "image_name"])
nuc_hist_dfs = []
for idx in nuc_dist_df.index.unique():
    selection = nuc_dist_df.loc[idx].iloc[[0]].reset_index()
    bins_df = pd.DataFrame()
    wedges_df = pd.DataFrame()
    Z_df = pd.DataFrame()

    bins_df[['bins', 'center', 'masks']] = selection[['XY_bins', 'XY_center_vox_cnt_perbin', 'XY_mask_vox_cnt_perbin']]
    wedges_df[['bins', 'center', 'masks']] = selection[['XY_wedges', 'XY_center_vox_cnt_perwedge', 'XY_mask_vox_cnt_perwedge']]
    Z_df[['bins', 'center', 'masks']] = selection[['Z_slices', 'Z_center_vox_cnt', 'Z_mask_vox_cnt']]

    dfs = [selection[['dataset', 'image_name']]]
    
    for df, prefix in zip([bins_df, wedges_df, Z_df], ["XY_bins_", "XY_wedges_", "Z_slices_"]):
        single_df = pd.DataFrame(list(zip(df["bins"].values[0][1:-1].split(", "), 
                                        df["masks"].values[0][1:-1].split(", "),
                                        df["center"].values[0][1:-1].split(", "))), columns =['bins', 'mask', 'obj']).astype(int)
        
        if "Z_" in prefix:
            single_df =  single_df.drop(single_df[single_df['mask'] == 0].index)
            single_df['bins'] = (single_df["bins"]/max(single_df.bins)*9.99).apply(np.floor)+1
            single_df = single_df.groupby("bins").agg(['sum']).reset_index()
            single_df.columns = ['bins',"mask","obj"]

        single_df['mask_fract'] = single_df['mask']/single_df['mask'].max()
        # single_df['obj_normed_tocell'] = (single_df["obj"]*single_df["mask_fract"]).fillna(0)
        single_df['obj_perc_per_bin'] = (single_df["obj"] / single_df["obj"].sum())*100
        single_df['obj_portion_normed_tobin'] = (single_df["obj_perc_per_bin"]/single_df["mask_fract"]).fillna(0)

        sumstats_df = pd.DataFrame()

        s = single_df['bins'].repeat(single_df['obj_portion_normed_tobin']*100)
        ###################################
        #SUB-STEPS 1 & 2 FOR NUC
        ###################################
        sumstats_df['hist_mean']=[s.mean()]
        sumstats_df['hist_median']=[s.median()]
        # the mode is undefined for an empty histogram; store a real missing value
        if single_df['obj_portion_normed_tobin'].sum() != 0:
            sumstats_df['hist_mode']=[s.mode().iloc[0]]
        else:
            sumstats_df['hist_mode']=[np.nan]
        sumstats_df['hist_min']=[s.min()]
        sumstats_df['hist_max']=[s.max()]
        sumstats_df['hist_range']=[s.max() - s.min()]
        sumstats_df['hist_stdev']=[s.std()]
        sumstats_df['hist_skew']=[s.skew()]
        sumstats_df['hist_kurtosis']=[s.kurtosis()]
        sumstats_df['hist_var']=[s.var()]
        sumstats_df.columns = [prefix+col for col in sumstats_df.columns]
        dfs.append(sumstats_df.reset_index())
        
    ###################################
    # Combine all resulting tables
    ###################################
    combined_df = pd.concat(dfs, axis=1).drop(columns="index")
    nuc_hist_dfs.append(combined_df)
dist_center_summary = pd.concat(nuc_hist_dfs, ignore_index=True)
dist_center_summary.insert(2, column="object", value="nuc")
dist_center_summary
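
For the Z-distribution, the cells above rescale the slice indices into ten bins with `floor(bins / max(bins) * 9.99) + 1` before summing. The sketch below reproduces only that re-binning on an invented stack of 20 slices with one voxel each.

```python
import numpy as np
import pandas as pd

# Toy Z profile: 20 slices (indices 0-19), one object voxel and one mask voxel per slice.
z = pd.DataFrame({
    "bins": list(range(20)),
    "obj": [1] * 20,
    "mask": [1] * 20,
})

# Rescale slice indices to 1..10; the 9.99 factor keeps the last slice in bin 10.
z["bins"] = (z["bins"] / z["bins"].max() * 9.99).apply(np.floor) + 1

# Sum the voxel counts of all slices that fall into the same bin.
rebinned = z.groupby("bins").sum().reset_index()
print(rebinned)
```

Re-binning to a fixed number of bins makes cells with different numbers of Z slices comparable.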

- combine nucleus and organelle distribution tables

In [None]:
dist_summary = pd.concat([dist_org_summary, dist_center_summary], axis=0).set_index(group_by).sort_index()
dist_summary

## **`5` - Add normalized metrics**

- calculate the fraction of the cell volume taken up by each organelle (called the "area fraction" in the code)

In [None]:
###################
# add normalization
###################
# organelle area fraction
area_fractions = []
for idx in org_summary.index.unique():
    org_vol = org_summary.loc[idx][('volume', 'sum')]
    cell_vol = regions_summary.loc[idx[:-1] + ('cell',)]["volume"]
    afrac = org_vol/cell_vol
    area_fractions.append(afrac)
org_summary[('volume', 'fraction')] = area_fractions
# TODO: add in line to reorder the level=0 columns here
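
The loop above looks up the cell volume for each summary row. As an illustrative alternative, not the notebook's method, the same ratio can be computed without a loop by mapping each row's image name to its cell volume. All names and values below are invented.

```python
import pandas as pd

# Toy summed organelle volumes per (image, organelle type).
org = pd.DataFrame({
    "image_name": ["cell1", "cell1", "cell2"],
    "object": ["mito", "lyso", "mito"],
    "volume_sum": [30.0, 10.0, 20.0],
}).set_index(["image_name", "object"])

# Toy cell volumes, one per image.
cell = pd.Series({"cell1": 200.0, "cell2": 100.0})

# Look up each row's cell volume by image name, then divide element-wise.
cell_per_row = org.index.get_level_values("image_name").map(cell).to_numpy()
org["volume_fraction"] = org["volume_sum"] / cell_per_row
print(org)
```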

- calculate fraction of organelle objects involved in specific interorganelle contacts

In [None]:
# contact sites volume normalized
norm_toA_list = []
norm_toB_list = []
for col in contact_summary.index:
    norm_toA_list.append(contact_summary.loc[col][('volume', 'sum')]/org_summary.loc[col[:-1]+(col[-1].split('X')[0],)][('volume', 'sum')])
    norm_toB_list.append(contact_summary.loc[col][('volume', 'sum')]/org_summary.loc[col[:-1]+(col[-1].split('X')[1],)][('volume', 'sum')])
contact_summary[('volume', 'norm_to_A')] = norm_toA_list
contact_summary[('volume', 'norm_to_B')] = norm_toB_list

# number and fraction of individual organelles involved in each contact type
cont_cnt = org_df[group_by].copy()
cont_cnt[[col.split('_')[0] for col in org_df.columns if col.endswith(("_count"))]] = org_df[[col for col in org_df.columns if col.endswith(("_count"))]].astype(bool)
cont_cnt_perorg = cont_cnt.groupby(group_by).agg('sum')
cont_cnt_perorg.columns = pd.MultiIndex.from_product([cont_cnt_perorg.columns, ['count_in']])
for col in cont_cnt_perorg.columns:
    cont_cnt_perorg[(col[0], 'num_fraction_in')] = cont_cnt_perorg[col].values/org_summary[('volume', 'count')].values
cont_cnt_perorg.sort_index(axis=1, inplace=True)
org_summary = pd.merge(org_summary, cont_cnt_perorg, on=group_by, how='outer')
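
The fraction of organelle objects involved in a contact relies on converting the per-object contact counts to booleans, so that an object with several contacts of the same type is counted once. A small sketch with invented counts for a single contact type:

```python
import pandas as pd

# Toy per-object contact counts: two of the four mitochondria touch the ER.
org = pd.DataFrame({
    "image_name": ["cell1"] * 4,
    "object": ["mito"] * 4,
    "mitoXER_count": [0, 2, 1, 0],
})
group_by = ["image_name", "object"]

# True where the object has at least one contact of this type.
in_contact = org[group_by].copy()
in_contact["mitoXER"] = org["mitoXER_count"].astype(bool)

# Number of objects in contact, divided by the total number of objects.
count_in = in_contact.groupby(group_by)["mitoXER"].sum()
total = org.groupby(group_by).size()
fraction_in = count_in / total
print(fraction_in)
```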

In [None]:
cont_cnt_perorg

In [None]:
org_summary

## **`6` - Unstack and finalize summary stats tables**

- unstack and reorder organelle morphology summary table columns

In [15]:
###################
# flatten datasheets and combine
# TODO: restructure this so that all of the datasheets are unstacked and then reordered based on shared level 0 columns before flattening
###################
# org flattening
org_final = org_summary.unstack(-1)
for col in org_final.columns:
    if col[1] in ('count_in', 'num_fraction_in') or col[0].endswith(('_count', '_volume')):
        if col[2] not in col[0]:
            org_final.drop(col,axis=1, inplace=True)
new_col_order = ['dataset', 'image_name', 'object', 'volume', 'surface_area', 'SA_to_volume_ratio', 
                'equivalent_diameter', 'extent', 'euler_number', 'solidity', 'axis_major_length', 
                'ERXLD', 'ERXLD_count', 'ERXLD_volume', 'golgiXER', 'golgiXER_count', 'golgiXER_volume', 
                'golgiXLD', 'golgiXLD_count', 'golgiXLD_volume', 'golgiXperox', 'golgiXperox_count', 'golgiXperox_volume', 
                'lysoXER', 'lysoXER_count', 'lysoXER_volume', 'lysoXLD', 'lysoXLD_count', 'lysoXLD_volume', 
                'lysoXgolgi', 'lysoXgolgi_count', 'lysoXgolgi_volume', 'lysoXmito', 'lysoXmito_count', 'lysoXmito_volume', 
                'lysoXperox', 'lysoXperox_count', 'lysoXperox_volume', 'mitoXER', 'mitoXER_count', 'mitoXER_volume', 
                'mitoXLD', 'mitoXLD_count', 'mitoXLD_volume', 'mitoXgolgi', 'mitoXgolgi_count', 'mitoXgolgi_volume', 
                'mitoXperox', 'mitoXperox_count', 'mitoXperox_volume', 'peroxXER', 'peroxXER_count', 'peroxXER_volume', 
                'peroxXLD', 'peroxXLD_count', 'peroxXLD_volume']
new_cols = org_final.columns.reindex(new_col_order, level=0)
org_final = org_final.reindex(columns=new_cols[0])
org_final.columns = ["_".join((col_name[-1], col_name[1], col_name[0])) for col_name in org_final.columns.to_flat_index()]
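
The flattening above turns each `(metric, statistic, organelle)` column tuple into a single name ordered as `organelle_statistic_metric`. The sketch below shows the unstack and the renaming on invented data with a single metric.

```python
import pandas as pd

# Toy per-object volumes for two organelle types in one cell.
toy = pd.DataFrame({
    "image_name": ["cell1", "cell1", "cell1"],
    "object": ["mito", "mito", "lyso"],
    "volume": [1.0, 3.0, 5.0],
})
summary = toy.groupby(["image_name", "object"]).agg(["sum", "mean"])

# Move the organelle type from the row index into the columns: one row per image.
wide = summary.unstack(-1)

# Column tuples are (metric, statistic, organelle); rename to organelle_statistic_metric.
wide.columns = ["_".join((obj, stat, metric))
                for metric, stat, obj in wide.columns.to_flat_index()]
print(wide)
```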

- fill "NaN" values with 0 where necessary to finalize the organelle morphology summary table

In [None]:
#renaming, filling "NaN" with 0 when needed, and removing ER_std columns
for col in org_final.columns:
    if '_count_in_' in col or '_fraction_in_' in col:
        org_final[col] = org_final[col].fillna(0)
    if col.endswith(("_count_volume","_sum_volume", "_mean_volume", "_median_volume")):
        org_final[col] = org_final[col].fillna(0)
    if col.endswith("_count_volume"):
        org_final.rename(columns={col:col.split("_")[0]+"_count"}, inplace=True)
    if col.startswith("ER_std_"):
        org_final.drop(columns=[col], inplace=True)
org_final = org_final.reset_index()

- unstack and reorder organelle interactions summary table columns

In [None]:
# contacts flattened
contact_final = contact_summary.unstack(-1)
contact_final.columns = ["_".join((col_name[-1], col_name[1], col_name[0])) for col_name in contact_final.columns.to_flat_index()]

- fill "NaN" values with 0 where necessary to finalize the organelle interactions summary table

In [None]:
#renaming and filling "NaN" with 0 when needed
for col in contact_final.columns:
    if col.endswith(("_count_volume","_sum_volume", "_mean_volume", "_median_volume")):
        contact_final[col] = contact_final[col].fillna(0)
    if col.endswith("_count_volume"):
        contact_final.rename(columns={col:col.split("_")[0]+"_count"}, inplace=True)
contact_final = contact_final.reset_index()

- unstack and reorder distribution measurements summary table columns to create finalized table

In [17]:
# distributions flattened
dist_final = dist_summary.unstack(-1)
dist_final.columns = ["_".join((col_name[1], col_name[0])) for col_name in dist_final.columns.to_flat_index()]
dist_final = dist_final.reset_index()

- unstack and reorder region morphology summary table columns

In [None]:
# regions flattened
regions_final = regions_summary.unstack(-1)
regions_final.columns = ["_".join((col_name[1], col_name[0])) for col_name in regions_final.columns.to_flat_index()]

- add normalization to finalize region morphology summary table

In [None]:
# normalization added
regions_final['nuc_area_fraction'] = regions_final['nuc_volume'] / regions_final['cell_volume']
regions_final = regions_final.reset_index()

- combine all four tables to create a complete summary table

In [None]:
# combining them all
combined = pd.merge(org_final, contact_final, on=["dataset", "image_name"], how="outer")
combined = pd.merge(combined, dist_final, on=["dataset", "image_name"], how="outer")
combined = pd.merge(combined, regions_final, on=["dataset", "image_name"], how="outer").set_index(["dataset", "image_name"])
combined.columns = [col.replace('sum', 'total') for col in combined.columns]
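
A minimal sketch of the final combination step on two invented wide tables. An outer merge keeps images that are missing from one of the tables, leaving NaN in the columns they lack.

```python
import pandas as pd

keys = ["dataset", "image_name"]

# Toy wide tables with one row per image; cell2 is missing from the regions table.
org_wide = pd.DataFrame({"dataset": ["d1", "d1"], "image_name": ["cell1", "cell2"],
                         "mito_sum_volume": [4.0, 6.0]})
regions_wide = pd.DataFrame({"dataset": ["d1"], "image_name": ["cell1"],
                             "cell_volume": [200.0]})

combined = org_wide.merge(regions_wide, on=keys, how="outer").set_index(keys)

# Rename "sum" to "total" in the column names, as in the cell above.
combined.columns = [c.replace("sum", "total") for c in combined.columns]
print(combined)
```

Because `str.replace` acts on any occurrence of `sum`, a column whose name happens to contain those letters elsewhere would also be renamed.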

## **`7` - Export summary stats tables as .csv files**

In [None]:
###################
# export summary sheets
###################

# location for the final csv files to be exported to
out_path = Path(os.getcwd()).parents[1] / "sample_data" /  "batch_example" / "quant"

# prefix added to summary tables
out_preffix = "example_prototype_"

# out_path is a Path object, so file names are joined with "/" rather than "+"
org_summary.to_csv(out_path / f"{out_preffix}per_org_summarystats.csv")
contact_summary.to_csv(out_path / f"{out_preffix}per_contact_summarystats.csv")
dist_summary.to_csv(out_path / f"{out_preffix}distribution_summarystats.csv")
regions_summary.to_csv(out_path / f"{out_preffix}per_region_summarystats.csv")
combined.to_csv(out_path / f"{out_preffix}summarystats_combined.csv")
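
As a sketch of the export pattern, the example below writes a toy table to a temporary folder and reads it back. The folder, the prefix, and the table contents are all invented; the real output location is whatever `out_path` is set to above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy summary table with the image name as the index.
table = pd.DataFrame({"image_name": ["cell1"],
                      "mito_total_volume": [4.0]}).set_index("image_name")

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp)        # stand-in for the real export folder
    out_prefix = "example_"     # hypothetical prefix
    target = out_path / f"{out_prefix}summarystats_combined.csv"

    # Write the table, then read it back using the first column as the index.
    table.to_csv(target)
    round_trip = pd.read_csv(target, index_col=0)

print(target.name)
print(round_trip)
```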

# ***EXECUTE FUNCTION PROTOTYPE***

## **Define prototype `_batch_summary_stats` function**

In [55]:
def _batch_summary_stats(csv_path_list: List[str],
                         out_path: str,
                         out_preffix: str):
    """
    csv_path_list: List[str]
        A list of path strings to the folders where the .csv files to analyze are located.
    out_path: str
        A path string to the folder where the summary data files will be saved.
    out_preffix: str
        The prefix used to name the output files.
    """
    ds_count = 0
    fl_count = 0
    ###################
    # Read in the csv files and combine them into one of each type
    ###################
    org_tabs = []
    contact_tabs = []
    dist_tabs = []
    region_tabs = []

    for loc in csv_path_list:
        ds_count = ds_count + 1
        loc=Path(loc)
        files_store = sorted(loc.glob("*.csv"))
        for file in files_store:
            fl_count = fl_count + 1
            stem = file.stem

            org = "organelles"
            contacts = "contacts"
            dist = "distributions"
            regions = "_regions"

            if org in stem:
                test_orgs = pd.read_csv(file, index_col=0)
                test_orgs.insert(0, "dataset", stem[:-11])
                org_tabs.append(test_orgs)
            if contacts in stem:
                test_contact = pd.read_csv(file, index_col=0)
                test_contact.insert(0, "dataset", stem[:-9])
                contact_tabs.append(test_contact)
            if dist in stem:
                test_dist = pd.read_csv(file, index_col=0)
                test_dist.insert(0, "dataset", stem[:-14])
                dist_tabs.append(test_dist)
            if regions in stem:
                test_regions = pd.read_csv(file, index_col=0)
                test_regions.insert(0, "dataset", stem[:-8])
                region_tabs.append(test_regions)
            
    org_df = pd.concat(org_tabs,axis=0, join='outer')
    contacts_df = pd.concat(contact_tabs,axis=0, join='outer')
    dist_df = pd.concat(dist_tabs,axis=0, join='outer')
    regions_df = pd.concat(region_tabs,axis=0, join='outer')

    ###################
    # adding new metrics to the original sheets
    ###################
    # TODO: include these labels when creating the original sheets
    # copy the selection so the new columns below are added to an independent table
    contact_cnt = contacts_df[["dataset", "image_name", "object", "label", "volume"]].copy()
    contact_cnt[["orgA", "orgB"]] = contact_cnt["object"].str.split('X', expand=True)
    contact_cnt[["A_ID", "B_ID"]] = contact_cnt["label"].str.split('_', expand=True)
    contact_cnt["A"] = contact_cnt["orgA"] +"_" + contact_cnt["A_ID"].astype(str)
    contact_cnt["B"] = contact_cnt["orgB"] +"_" + contact_cnt["B_ID"].astype(str)

    contact_cnt_percell = contact_cnt[["dataset", "image_name", "orgA", "A_ID", "object", "volume"]].groupby(["dataset", "image_name", "orgA", "A_ID", "object"]).agg(["count", "sum"])
    contact_cnt_percell.columns = ["_".join(col_name).rstrip('_') for col_name in contact_cnt_percell.columns.to_flat_index()]
    unstacked = contact_cnt_percell.unstack(level='object')
    unstacked.columns = ["_".join(col_name).rstrip('_') for col_name in unstacked.columns.to_flat_index()]
    unstacked = unstacked.reset_index()
    for col in unstacked.columns:
        if col.startswith("volume_count_"):
            newname = col.split("_")[-1] + "_count"
            unstacked.rename(columns={col:newname}, inplace=True)
        if col.startswith("volume_sum_"):
            newname = col.split("_")[-1] + "_volume"
            unstacked.rename(columns={col:newname}, inplace=True)
    unstacked.rename(columns={"orgA":"object", "A_ID":"label"}, inplace=True)
    unstacked.set_index(['dataset', 'image_name', 'object', 'label'])

    contact_percellB = contact_cnt[["dataset", "image_name", "orgB", "B_ID", "object", "volume"]].groupby(["dataset", "image_name", "orgB", "B_ID", "object"]).agg(["count", "sum"])
    contact_percellB.columns = ["_".join(col_name).rstrip('_') for col_name in contact_percellB.columns.to_flat_index()]
    unstackedB = contact_percellB.unstack(level='object')
    unstackedB.columns = ["_".join(col_name).rstrip('_') for col_name in unstackedB.columns.to_flat_index()]
    unstackedB = unstackedB.reset_index()
    for col in unstackedB.columns:
        if col.startswith("volume_count_"):
            newname = col.split("_")[-1] + "_count"
            unstackedB.rename(columns={col:newname}, inplace=True)
        if col.startswith("volume_sum_"):
            newname = col.split("_")[-1] + "_volume"
            unstackedB.rename(columns={col:newname}, inplace=True)
    unstackedB.rename(columns={"orgB":"object", "B_ID":"label"}, inplace=True)
    unstackedB.set_index(['dataset', 'image_name', 'object', 'label'])

    contact_cnt = pd.concat([unstacked, unstackedB], axis=0).sort_index(axis=0)
    contact_cnt = contact_cnt.groupby(['dataset', 'image_name', 'object', 'label']).sum().reset_index()
    contact_cnt['label']=contact_cnt['label'].astype("Int64")

    org_df = pd.merge(org_df, contact_cnt, how='left', on=['dataset', 'image_name', 'object', 'label'], sort=True)
    org_df[contact_cnt.columns] = org_df[contact_cnt.columns].fillna(0)

    ###################
    # summary stat group
    ###################
    group_by = ['dataset', 'image_name', 'object']
    sharedcolumns = ["SA_to_volume_ratio", "equivalent_diameter", "extent", "euler_number", "solidity", "axis_major_length"]
    ag_func_standard = ['mean', 'median', 'std']

    ###################
    # summarize shared measurements between org_df and contacts_df
    ###################
    org_cont_tabs = []
    for tab in [org_df, contacts_df]:
        tab1 = tab[group_by + ['volume']].groupby(group_by).agg(['count', 'sum'] + ag_func_standard)
        tab2 = tab[group_by + ['surface_area']].groupby(group_by).agg(['sum'] + ag_func_standard)
        tab3 = tab[group_by + sharedcolumns].groupby(group_by).agg(ag_func_standard)
        shared_metrics = pd.merge(tab1, tab2, 'outer', on=group_by)
        shared_metrics = pd.merge(shared_metrics, tab3, 'outer', on=group_by)
        org_cont_tabs.append(shared_metrics)

    org_summary = org_cont_tabs[0]
    contact_summary = org_cont_tabs[1]

    ###################
    # group metrics from regions_df similar to the above
    ###################
    regions_summary = regions_df[group_by + ['volume', 'surface_area'] + sharedcolumns].set_index(group_by)

    ###################
    # summarize extra metrics from org_df
    ###################
    columns2 = [col for col in org_df.columns if col.endswith(("_count", "_volume"))]
    contact_counts_summary = org_df[group_by + columns2].groupby(group_by).agg(['sum'] + ag_func_standard)
    org_summary = pd.merge(org_summary, contact_counts_summary, 'outer', on=group_by)

    ###################
    # summarize distribution measurements
    ###################
    # organelle distributions
    hist_dfs = []
    for ind in range(len(dist_df.index)):
        # select by position so a non-default index on dist_df is handled
        selection = dist_df.iloc[[ind]]
        bins_df = pd.DataFrame()
        wedges_df = pd.DataFrame()
        Z_df = pd.DataFrame()
        CV_df = pd.DataFrame()

        bins_df[['bins', 'masks', 'obj']] = selection[['XY_bins', 'XY_mask_vox_cnt_perbin', 'XY_obj_vox_cnt_perbin']]
        wedges_df[['bins', 'masks', 'obj']] = selection[['XY_wedges', 'XY_mask_vox_cnt_perwedge', 'XY_obj_vox_cnt_perwedge']]
        Z_df[['bins', 'masks', 'obj']] = selection[['Z_slices', 'Z_mask_vox_cnt', 'Z_obj_vox_cnt']]

        dfs = [selection[['dataset', 'image_name', 'object']].reset_index()]
        for df, prefix in zip([bins_df, wedges_df, Z_df], ["XY_bins_", "XY_wedges_", "Z_slices_"]):
            single_df = pd.DataFrame(list(zip(df["bins"].values[0][1:-1].split(", "), 
                                            df["obj"].values[0][1:-1].split(", "), 
                                            df["masks"].values[0][1:-1].split(", "))), columns =['bins', 'obj', 'mask']).astype(int)
            
            if "Z_" in prefix:
                single_df =  single_df.drop(single_df[single_df['mask'] == 0].index)
                single_df['bins'] = (single_df["bins"]/max(single_df.bins)*9.99).apply(np.floor)+1
                single_df = single_df.groupby("bins").agg(['sum']).reset_index()
                single_df.columns = ['bins',"obj","mask"]
        
            single_df['mask_fract'] = single_df['mask']/single_df['mask'].max()
            # single_df['obj_normed_tocell'] = (single_df["obj"]*single_df["mask_fract"]).fillna(0)
            single_df['obj_perc_per_bin'] = (single_df["obj"] / single_df["obj"].sum())*100
            single_df['obj_portion_normed_tobin'] = (single_df["obj_perc_per_bin"]/single_df["mask_fract"]).fillna(0)

            sumstats_df = pd.DataFrame()

            s = single_df['bins'].repeat(single_df['obj_portion_normed_tobin']*100)

            sumstats_df['hist_mean']=[s.mean()]
            sumstats_df['hist_median']=[s.median()]
            if single_df['obj_portion_normed_tobin'].sum() != 0:
                sumstats_df['hist_mode'] = [s.mode().iloc[0]]
            else:
                # no object signal in any bin, so the mode is undefined
                sumstats_df['hist_mode'] = [np.nan]
            sumstats_df['hist_min']=[s.min()]
            sumstats_df['hist_max']=[s.max()]
            sumstats_df['hist_range']=[s.max() - s.min()]
            sumstats_df['hist_stdev']=[s.std()]
            sumstats_df['hist_skew']=[s.skew()]
            sumstats_df['hist_kurtosis']=[s.kurtosis()]
            sumstats_df['hist_var']=[s.var()]
            sumstats_df.columns = [prefix+col for col in sumstats_df.columns]
            dfs.append(sumstats_df.reset_index())

        CV_df = pd.DataFrame(list(zip(selection["XY_obj_cv_perbin"].values[0][1:-1].split(", "))), columns =['CV']).astype(float)
        sumstats_CV_df = pd.DataFrame()
        sumstats_CV_df['XY_bin_CV_mean'] = CV_df.mean()
        sumstats_CV_df['XY_bin_CV_median'] = CV_df.median()
        sumstats_CV_df['XY_bin_CV_std'] = CV_df.std()
        dfs.append(sumstats_CV_df.reset_index().drop(['index'], axis=1))

        combined_df = pd.concat(dfs, axis=1).drop(columns="index")
        hist_dfs.append(combined_df)
    dist_org_summary = pd.concat(hist_dfs, ignore_index=True)

    # nucleus distribution
    nuc_dist_df = dist_df[["dataset", "image_name", 
                        "XY_bins", "XY_center_vox_cnt_perbin", "XY_mask_vox_cnt_perbin",
                        "XY_wedges", "XY_center_vox_cnt_perwedge", "XY_mask_vox_cnt_perwedge",
                        "Z_slices", "Z_center_vox_cnt", "Z_mask_vox_cnt"]].set_index(["dataset", "image_name"])
    nuc_hist_dfs = []
    for idx in nuc_dist_df.index.unique():
        selection = nuc_dist_df.loc[idx].iloc[[0]].reset_index()
        bins_df = pd.DataFrame()
        wedges_df = pd.DataFrame()
        Z_df = pd.DataFrame()

        bins_df[['bins', 'center', 'masks']] = selection[['XY_bins', 'XY_center_vox_cnt_perbin', 'XY_mask_vox_cnt_perbin']]
        wedges_df[['bins', 'center', 'masks']] = selection[['XY_wedges', 'XY_center_vox_cnt_perwedge', 'XY_mask_vox_cnt_perwedge']]
        Z_df[['bins', 'center', 'masks']] = selection[['Z_slices', 'Z_center_vox_cnt', 'Z_mask_vox_cnt']]

        dfs = [selection[['dataset', 'image_name']]]
        for df, prefix in zip([bins_df, wedges_df, Z_df], ["XY_bins_", "XY_wedges_", "Z_slices_"]):
            single_df = pd.DataFrame(list(zip(df["bins"].values[0][1:-1].split(", "), 
                                            df["masks"].values[0][1:-1].split(", "),
                                            df["center"].values[0][1:-1].split(", "))), columns =['bins', 'mask', 'obj']).astype(int)

            if "Z_" in prefix:
                single_df =  single_df.drop(single_df[single_df['mask'] == 0].index)
                single_df['bins'] = (single_df["bins"]/max(single_df.bins)*9.99).apply(np.floor)+1
                single_df = single_df.groupby("bins").agg(['sum']).reset_index()
                single_df.columns = ['bins',"mask","obj"]
        
            single_df['mask_fract'] = single_df['mask']/single_df['mask'].max()
            # single_df['obj_normed_tocell'] = (single_df["obj"]*single_df["mask_fract"]).fillna(0)
            single_df['obj_perc_per_bin'] = (single_df["obj"] / single_df["obj"].sum())*100
            single_df['obj_portion_normed_tobin'] = (single_df["obj_perc_per_bin"]/single_df["mask_fract"]).fillna(0)

            sumstats_df = pd.DataFrame()

            s = single_df['bins'].repeat(single_df['obj_portion_normed_tobin']*100)

            sumstats_df['hist_mean']=[s.mean()]
            sumstats_df['hist_median']=[s.median()]
            if single_df['obj_portion_normed_tobin'].sum() != 0:
                sumstats_df['hist_mode'] = [s.mode().iloc[0]]
            else:
                # no object signal in any bin, so the mode is undefined
                sumstats_df['hist_mode'] = [np.nan]
            sumstats_df['hist_min']=[s.min()]
            sumstats_df['hist_max']=[s.max()]
            sumstats_df['hist_range']=[s.max() - s.min()]
            sumstats_df['hist_stdev']=[s.std()]
            sumstats_df['hist_skew']=[s.skew()]
            sumstats_df['hist_kurtosis']=[s.kurtosis()]
            sumstats_df['hist_var']=[s.var()]
            sumstats_df.columns = [prefix+col for col in sumstats_df.columns]
            dfs.append(sumstats_df.reset_index())
        combined_df = pd.concat(dfs, axis=1).drop(columns="index")
        nuc_hist_dfs.append(combined_df)
    dist_center_summary = pd.concat(nuc_hist_dfs, ignore_index=True)
    dist_center_summary.insert(2, column="object", value="nuc")

    dist_summary = pd.concat([dist_org_summary, dist_center_summary], axis=0).set_index(group_by).sort_index()


    ###################
    # add normalization
    ###################
    # organelle area fraction
    area_fractions = []
    for idx in org_summary.index.unique():
        org_vol = org_summary.loc[idx][('volume', 'sum')]
        cell_vol = regions_summary.loc[idx[:-1] + ('cell',)]["volume"]
        afrac = org_vol/cell_vol
        area_fractions.append(afrac)
    org_summary[('volume', 'fraction')] = area_fractions
    # TODO: add in line to reorder the level=0 columns here

    # contact sites volume normalized
    norm_toA_list = []
    norm_toB_list = []
    for col in contact_summary.index:
        norm_toA_list.append(contact_summary.loc[col][('volume', 'sum')]/org_summary.loc[col[:-1]+(col[-1].split('X')[0],)][('volume', 'sum')])
        norm_toB_list.append(contact_summary.loc[col][('volume', 'sum')]/org_summary.loc[col[:-1]+(col[-1].split('X')[1],)][('volume', 'sum')])
    contact_summary[('volume', 'norm_to_A')] = norm_toA_list
    contact_summary[('volume', 'norm_to_B')] = norm_toB_list

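    # number and fraction of individual organelles involved in each contact type
    cont_cnt = org_df[group_by].copy()  # copy so new columns are not written to a view of org_df
    count_cols = [col for col in org_df.columns if col.endswith("_count")]
    # True where an organelle object takes part in at least one contact of that type
    cont_cnt[[col.split('_')[0] for col in count_cols]] = org_df[count_cols].astype(bool)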
    cont_cnt_perorg = cont_cnt.groupby(group_by).agg('sum')
    cont_cnt_perorg.columns = pd.MultiIndex.from_product([cont_cnt_perorg.columns, ['count_in']])
    for col in cont_cnt_perorg.columns:
        cont_cnt_perorg[(col[0], 'num_fraction_in')] = cont_cnt_perorg[col].values/org_summary[('volume', 'count')].values
    cont_cnt_perorg.sort_index(axis=1, inplace=True)
    org_summary = pd.merge(org_summary, cont_cnt_perorg, on=group_by, how='outer')


    ###################
    # flatten datasheets and combine
    # TODO: restructure this so that all of the datasheets are unstacked and then reordered based on shared level 0 columns before flattening
    ###################
    # org flattening
    org_final = org_summary.unstack(-1)
    for col in org_final.columns:
        if col[1] in ('count_in', 'num_fraction_in') or col[0].endswith(('_count', '_volume')):
            if col[2] not in col[0]:
                org_final.drop(col,axis=1, inplace=True)
    new_col_order = ['dataset', 'image_name', 'object', 'volume', 'surface_area', 'SA_to_volume_ratio', 
                 'equivalent_diameter', 'extent', 'euler_number', 'solidity', 'axis_major_length', 
                 'ERXLD', 'ERXLD_count', 'ERXLD_volume', 'golgiXER', 'golgiXER_count', 'golgiXER_volume', 
                 'golgiXLD', 'golgiXLD_count', 'golgiXLD_volume', 'golgiXperox', 'golgiXperox_count', 'golgiXperox_volume', 
                 'lysoXER', 'lysoXER_count', 'lysoXER_volume', 'lysoXLD', 'lysoXLD_count', 'lysoXLD_volume', 
                 'lysoXgolgi', 'lysoXgolgi_count', 'lysoXgolgi_volume', 'lysoXmito', 'lysoXmito_count', 'lysoXmito_volume', 
                 'lysoXperox', 'lysoXperox_count', 'lysoXperox_volume', 'mitoXER', 'mitoXER_count', 'mitoXER_volume', 
                 'mitoXLD', 'mitoXLD_count', 'mitoXLD_volume', 'mitoXgolgi', 'mitoXgolgi_count', 'mitoXgolgi_volume', 
                 'mitoXperox', 'mitoXperox_count', 'mitoXperox_volume', 'peroxXER', 'peroxXER_count', 'peroxXER_volume', 
                 'peroxXLD', 'peroxXLD_count', 'peroxXLD_volume']
    new_cols = org_final.columns.reindex(new_col_order, level=0)
    org_final = org_final.reindex(columns=new_cols[0])
    org_final.columns = ["_".join((col_name[-1], col_name[1], col_name[0])) for col_name in org_final.columns.to_flat_index()]

    #renaming, filling "NaN" with 0 when needed, and removing ER_std columns
    for col in org_final.columns:
        if '_count_in_' in col or '_fraction_in_' in col:
            org_final[col] = org_final[col].fillna(0)
        if col.endswith(("_count_volume","_sum_volume", "_mean_volume", "_median_volume")):
            org_final[col] = org_final[col].fillna(0)
        if col.endswith("_count_volume"):
            org_final.rename(columns={col:col.split("_")[0]+"_count"}, inplace=True)
        if col.startswith("ER_std_"):
            org_final.drop(columns=[col], inplace=True)
    org_final = org_final.reset_index()

    # contacts flattened
    contact_final = contact_summary.unstack(-1)
    contact_final.columns = ["_".join((col_name[-1], col_name[1], col_name[0])) for col_name in contact_final.columns.to_flat_index()]

    #renaming and filling "NaN" with 0 when needed
    for col in contact_final.columns:
        if col.endswith(("_count_volume","_sum_volume", "_mean_volume", "_median_volume")):
            contact_final[col] = contact_final[col].fillna(0)
        if col.endswith("_count_volume"):
            contact_final.rename(columns={col:col.split("_")[0]+"_count"}, inplace=True)
    contact_final = contact_final.reset_index()

    # distributions flattened
    dist_final = dist_summary.unstack(-1)
    dist_final.columns = ["_".join((col_name[1], col_name[0])) for col_name in dist_final.columns.to_flat_index()]
    dist_final = dist_final.reset_index()

    # regions flattened & normalization added
    regions_final = regions_summary.unstack(-1)
    regions_final.columns = ["_".join((col_name[1], col_name[0])) for col_name in regions_final.columns.to_flat_index()]
    regions_final['nuc_area_fraction'] = regions_final['nuc_volume'] / regions_final['cell_volume']
    regions_final = regions_final.reset_index()

    # combining them all
    combined = pd.merge(org_final, contact_final, on=["dataset", "image_name"], how="outer")
    combined = pd.merge(combined, dist_final, on=["dataset", "image_name"], how="outer")
    combined = pd.merge(combined, regions_final, on=["dataset", "image_name"], how="outer").set_index(["dataset", "image_name"])
    combined.columns = [col.replace('sum', 'total') for col in combined.columns]

    ###################
    # export summary sheets
    ###################
    out_dir = Path(out_path)  # accepts a str or a Path; `Path + str` would raise a TypeError
    org_summary.to_csv(out_dir / f"{out_preffix}per_org_summarystats.csv")
    contact_summary.to_csv(out_dir / f"{out_preffix}per_contact_summarystats.csv")
    dist_summary.to_csv(out_dir / f"{out_preffix}distribution_summarystats.csv")
    regions_summary.to_csv(out_dir / f"{out_preffix}per_region_summarystats.csv")
    combined.to_csv(out_dir / f"{out_preffix}summarystats_combined.csv")

    print(f"Processing of {fl_count} files from {ds_count} dataset(s) is complete.")
    return f"{fl_count} files from {ds_count} dataset(s) were processed"
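
The distribution step above treats each binned voxel count as a histogram: the object signal per bin is normalized to the size of the bin's mask, each bin index is repeated in proportion to that normalized value, and ordinary statistics are taken on the expanded series. Below is a minimal sketch of that idea on made-up numbers. The column names mirror the function, but the toy values and the explicit rounding of the repeat counts to integers are assumptions of this sketch, not part of the prototype.

```python
import pandas as pd

# Toy distribution for one object type; all numbers are invented for illustration.
single_df = pd.DataFrame({
    "bins": [1, 2, 3, 4],        # bin index (e.g. concentric XY bins)
    "obj": [0, 10, 30, 10],      # object voxels per bin
    "mask": [50, 100, 100, 50],  # cell-mask voxels per bin
})

# Size of each bin's mask relative to the largest bin.
single_df["mask_fract"] = single_df["mask"] / single_df["mask"].max()
# Percentage of all object voxels that fall in each bin.
single_df["obj_perc_per_bin"] = single_df["obj"] / single_df["obj"].sum() * 100
# Correct for bin size so that small bins are not under-weighted.
single_df["obj_portion_normed_tobin"] = (
    single_df["obj_perc_per_bin"] / single_df["mask_fract"]
).fillna(0)

# Repeat counts must be whole numbers, so they are rounded here (assumption of this sketch).
weights = (single_df["obj_portion_normed_tobin"] * 100).round().astype(int)
samples = single_df["bins"].repeat(weights)

stats = {
    "hist_mean": samples.mean(),
    "hist_median": samples.median(),
    "hist_mode": samples.mode().iloc[0],
    "hist_range": samples.max() - samples.min(),
}
print(stats)
```

In this toy case the object is concentrated in bin 3, so the median and mode both land on that bin while the mean is pulled slightly outward by the size-corrected signal in bin 4.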

## **Run prototype `_batch_summary_stats` function**

In [None]:
out=_batch_summary_stats(csv_path_list = csv_path_list,
                         out_path = Path(os.getcwd()).parents[1] / "sample_data" /  "batch_example" / "quant",
                         out_preffix = "example_prototype_")
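
When reading `summarystats_combined.csv`, it helps to know how its column names are formed: the `object` index level is unstacked into the columns, the column levels are joined as `object_statistic_metric`, and `sum` is finally renamed to `total`. The following is a minimal sketch of that pattern on toy data. The values and the reduced set of statistics are assumptions for illustration only; the real tables contain many more metrics.

```python
import pandas as pd

# Toy per-object table; column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "dataset": ["d1"] * 4,
    "image_name": ["img1"] * 4,
    "object": ["ER", "LD", "LD", "LD"],
    "volume": [10.0, 1.0, 2.0, 3.0],
})
group_by = ["dataset", "image_name", "object"]

# One row per (dataset, image, object); columns form a (metric, statistic) MultiIndex.
summary = toy.groupby(group_by).agg(["count", "sum", "mean"])

# Move 'object' from the row index into the columns: one row per image.
wide = summary.unstack(-1)

# Flatten (metric, statistic, object) into object_statistic_metric names.
wide.columns = [
    "_".join((obj, stat, metric))
    for metric, stat, obj in wide.columns.to_flat_index()
]
wide = wide.reset_index()

# Same renaming as the last step before export.
wide.columns = [col.replace("sum", "total") for col in wide.columns]
print(wide.columns.tolist())
```

A column such as `LD_total_volume` therefore holds the summed volume of all lipid droplet objects in one image, and `LD_count_volume` holds the number of objects that contributed to it.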