This notebook computes basic statistics for the dataset. As a result, three CSV files are created:
- [stats_summary.csv](https://ligands.blob.core.windows.net/ligands/stats_summary.csv) - statistics for the whole dataset: the mean of each feature over all blobs,
- [stats_by_label.csv](https://ligands.blob.core.windows.net/ligands/stats_by_label.csv) - statistics for each label: the mean of each feature over the blobs with that label,
- [stats_all.csv](https://ligands.blob.core.windows.net/ligands/stats_all.csv) - statistics for each blob, not grouped nor aggregated.
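
The three aggregation levels can be illustrated with a minimal plain-Python sketch. The labels (`ATP`, `HEM`) and the values are made up for illustration, and the notebook itself relies on pandas rather than on this code.

```python
from collections import defaultdict

# Hypothetical per-blob rows: (label, value of one feature). Not real data.
rows = [("ATP", 0.5), ("ATP", 1.5), ("HEM", 3.0)]

# stats_all.csv: one row per blob, neither grouped nor aggregated.
stats_all = list(rows)

# stats_by_label.csv: mean of the feature within each label.
grouped = defaultdict(list)
for label, value in rows:
    grouped[label].append(value)
stats_by_label = {label: sum(vals) / len(vals) for label, vals in grouped.items()}

# stats_summary.csv: mean of the feature over all blobs.
stats_summary = sum(value for _, value in rows) / len(rows)

print(stats_by_label)
print(stats_summary)
```

Note that the summary is a mean over blobs, so labels with more blobs weigh more; it is generally not the same as the mean of the per-label means.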

Column names:

The `blob` prefix denotes the whole blob ($B$). <br>
The `wout_0` prefix denotes statistics computed over voxels with nonzero values, i.e. without 0 ($B_{+}$); the code selects them as `blob > 0`.

Legend:

- `label` - label of the given ligand
- `blob_shape` - dimensions of the blob; ($B_{x}$, $B_{y}$, $B_{z}$)
- `blob_n` - number of all voxels in the blob; $|B| = B_{x} \times B_{y} \times B_{z}$
- `wout_0_n` - number of voxels containing nonzero values; $|B_{+}|$
- `wout_0_%` - share of nonzero-valued voxels in the whole blob, expressed as a fraction; $\frac{|B_{+}|}{|B|}$
- `wout_0_min` - minimum value of voxels containing nonzero values; $min(B_{+})$
- `wout_0_1_qrtl` - first quartile of the given blob, computed using voxels containing nonzero values; $Q1(B_{+})$
- `wout_0_mean` - mean value of voxels containing nonzero values; $\frac{1}{|B_{+}|} \sum_{B_{+}} b$
- `wout_0_3_qrtl` - third quartile of the given blob, computed using voxels containing nonzero values; $Q3(B_{+})$
- `wout_0_max` - maximum value of voxels containing nonzero values; $max(B_{+})$
- `wout_0_sum` - sum of values of voxels containing nonzero values; $\sum_{B_{+}} b$
- `wout_0_median` - median of values of voxels containing nonzero values; $median(B_{+})$
- `wout_0_std` - standard deviation of values of voxels containing nonzero values; $std(B_{+})$
- `wout_0_skewness` - skewness of values of voxels containing nonzero values; $skewness(B_{+})$
- `wout_0_kurtosis` - excess kurtosis (kurtosis minus 3) of values of voxels containing nonzero values; $kurtosis(B_{+}) - 3$
- `wout_0_zscore_2_n` - number of nonzero-valued voxels with z-score greater than 2; $|\{b \in B_{+} : z(b) > 2\}|$, where $z(b) = \frac{b - mean(B_{+})}{std(B_{+})}$
- `wout_0_zscore_2_%` - share of nonzero-valued voxels with z-score greater than 2 among all nonzero-valued voxels, expressed as a fraction; $\frac{|\{b \in B_{+} : z(b) > 2\}|}{|B_{+}|}$
- `wout_0_zscore_3_n` - number of nonzero-valued voxels with z-score greater than 3; $|\{b \in B_{+} : z(b) > 3\}|$
- `wout_0_zscore_3_%` - share of nonzero-valued voxels with z-score greater than 3 among all nonzero-valued voxels, expressed as a fraction; $\frac{|\{b \in B_{+} : z(b) > 3\}|}{|B_{+}|}$
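
As a rough illustration of the legend, the following plain-Python sketch computes a few of the columns for a made-up, already flattened blob. It assumes the sample standard deviation (n - 1 in the denominator), which is the default of `torch.Tensor.std()`. The variable names mirror the column names, but the data is hypothetical and the notebook itself works on torch tensors.

```python
import statistics

# Hypothetical flattened blob: 6 empty voxels and 10 nonzero ones (made-up values).
blob = [0.0] * 6 + [1.0] * 9 + [10.0]

blob_n = len(blob)                     # blob_n, |B|
wout_0 = [v for v in blob if v > 0]    # B_+, selected like `blob[blob > 0]`
wout_0_n = len(wout_0)                 # wout_0_n, |B_+|
wout_0_frac = wout_0_n / blob_n        # wout_0_%

wout_0_mean = statistics.mean(wout_0)  # wout_0_mean
wout_0_std = statistics.stdev(wout_0)  # wout_0_std (sample standard deviation)

# z-score of every nonzero voxel, then the counts above each threshold.
zscores = [(v - wout_0_mean) / wout_0_std for v in wout_0]
wout_0_zscore_2_n = sum(1 for z in zscores if z > 2.0)
wout_0_zscore_3_n = sum(1 for z in zscores if z > 3.0)

print(blob_n, wout_0_n, wout_0_frac)
print(wout_0_zscore_2_n, wout_0_zscore_3_n)
```

Only the single voxel with value 10 lies more than two standard deviations above the mean here, and none lies more than three above it.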


In [1]:
import pandas as pd
import torch

import sys
sys.path.append('../src/')

from simple_reader import LigandDataset, DataLoader

In [2]:
dataset = LigandDataset('../data')
dataloader = DataLoader(dataset)

In [3]:
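def zscore_threshold(tensor, threshold):
    """Count the elements of `tensor` whose z-score exceeds `threshold`."""
    diffs = tensor - tensor.mean()
    zscores = diffs / tensor.std()  # std() is the sample standard deviation by default
    return zscores[zscores > threshold].shape[0]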

In [4]:
stats = []

for blob, label in dataloader:
    wout_0 = blob[blob > 0]                         # voxels with values greater than 0
    wout_0_n = wout_0.shape[0]                      # wout_0_n
    blob_n = blob.flatten().shape[0]                # blob_n
    wout_0_mean = float(wout_0.mean())              # wout_0_mean
    wout_0_std = float(wout_0.std())                # wout_0_std
    wout_0_1_qrtl = float(wout_0.quantile(0.25))    # wout_0_1_qrtl
    wout_0_3_qrtl = float(wout_0.quantile(0.75))    # wout_0_3_qrtl

    diffs = wout_0 - wout_0_mean
    zscores = diffs / wout_0_std

    wout_0_zscore_2 = zscores[zscores > 2.0].shape[0]   # wout_0_zscore_2_n
    wout_0_zscore_3 = zscores[zscores > 3.0].shape[0]   # wout_0_zscore_3_n

    wout_0_skewness = float(torch.pow(zscores, 3.0).mean())         # wout_0_skewness
    wout_0_kurtosis = float(torch.pow(zscores, 4.0).mean() - 3.0)   # wout_0_kurtosis
    
    stats.append([
        label,                          # label
        list(blob.shape),               # blob_shape
        blob_n,                         # blob_n
        wout_0_n,                       # wout_0_n
        wout_0_n / blob_n,              # wout_0_%
        float(wout_0.min()),            # wout_0_min
        wout_0_1_qrtl,                  # wout_0_1_qrtl
        wout_0_mean,                    # wout_0_mean
        wout_0_3_qrtl,                  # wout_0_3_qrtl
        float(wout_0.max()),            # wout_0_max
        float(wout_0.sum()),            # wout_0_sum         
        float(wout_0.median()),         # wout_0_median
        wout_0_std,                     # wout_0_std
        wout_0_skewness,                # wout_0_skewness
        wout_0_kurtosis,                # wout_0_kurtosis
        wout_0_zscore_2,                # wout_0_zscore_2_n
        wout_0_zscore_2 / wout_0_n,     # wout_0_zscore_2_%
        wout_0_zscore_3,                # wout_0_zscore_3_n
        wout_0_zscore_3 / wout_0_n,     # wout_0_zscore_3_%
    ])
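
The skewness and kurtosis in the loop above are computed as the mean of the cubed z-scores and the mean of the fourth powers of the z-scores minus 3. A minimal plain-Python sketch of the same idea follows; `standardized_moments` is a hypothetical helper name, the inputs are made up, and the sample standard deviation is assumed to match the default of `torch.Tensor.std()`.

```python
import statistics

def standardized_moments(values):
    """Return (skewness, excess kurtosis) computed from z-scores.

    Hypothetical helper: mirrors mean(z**3) and mean(z**4) - 3,
    with z based on the sample standard deviation (n - 1).
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    zscores = [(v - mean) / std for v in values]
    skewness = sum(z ** 3 for z in zscores) / len(zscores)
    kurtosis = sum(z ** 4 for z in zscores) / len(zscores) - 3.0
    return skewness, kurtosis

# Symmetric data: skewness is (numerically) zero.
sym_skew, sym_kurt = standardized_moments([1.0, 2.0, 3.0, 4.0, 5.0])

# Data with one large value: the long right tail gives positive skewness.
right_skew, _ = standardized_moments([1.0, 1.0, 1.0, 1.0, 10.0])

print(sym_skew, sym_kurt, right_skew)
```

Because the z-scores use the sample standard deviation, these values differ slightly from estimators that divide by n, especially for small blobs.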

In [5]:
df = pd.DataFrame(
    stats,
    columns=[
        'label',
        'blob_shape',
        'blob_n',
        'wout_0_n',
        'wout_0_%',        
        'wout_0_min',
        'wout_0_1_qrtl',
        'wout_0_mean',
        'wout_0_3_qrtl',
        'wout_0_max',
        'wout_0_sum',
        'wout_0_median',
        'wout_0_std',
        'wout_0_skewness',
        'wout_0_kurtosis',
        'wout_0_zscore_2_n',
        'wout_0_zscore_2_%',
        'wout_0_zscore_3_n',
        'wout_0_zscore_3_%'
    ]
)
df.to_csv('stats_all.csv')

In [6]:
# blob_shape holds lists, so restrict the aggregation to numeric columns
df.groupby('label').mean(numeric_only=True).to_csv('stats_by_label.csv')

In [7]:
df.mean(axis=0, numeric_only=True).to_csv(
    'stats_summary.csv',
    index_label='stat',
    header=['value']    
)