This notebook computes basic statistics for the dataset. As a result, three CSV files are created:
- [stats_summary.csv](https://ligands.blob.core.windows.net/ligands/stats_summary.csv) - statistics for the whole dataset: the min, mean, median, max and std of each feature, taken over all blobs,
- [stats_by_label.csv](https://ligands.blob.core.windows.net/ligands/stats_by_label.csv) - statistics for each label: the min, mean, median, max and std of each feature, grouped by label. To determine which statistic a column contains, look at the aggregation function name after the last underscore (e.g. `blob_n`**`_mean`**, `nonzero_max`**`_max`**),
- [stats_all.csv](https://ligands.blob.core.windows.net/ligands/stats_all.csv) - statistics for each blob, neither grouped nor aggregated.
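
The suffix convention used in `stats_by_label.csv` can be unpacked programmatically. The following is a minimal sketch, assuming every aggregated column is named `<feature>_<aggregation>`; the helper `split_stat_column` is hypothetical and not part of the notebook.

```python
# The five aggregation functions named in the list above.
AGGREGATIONS = {"min", "mean", "median", "max", "std"}


def split_stat_column(name):
    """Split e.g. 'nonzero_max_mean' into ('nonzero_max', 'mean')."""
    # Only the text after the LAST underscore is the aggregation name.
    feature, _, aggregation = name.rpartition("_")
    if aggregation not in AGGREGATIONS:
        raise ValueError(f"{name!r} has no aggregation suffix")
    return feature, aggregation


pairs = [
    split_stat_column(column)
    for column in ("blob_n_mean", "nonzero_max_max", "nonzero_zscore_2_%_std")
]
print(pairs)
```

Splitting on the last underscore matters because the feature names themselves contain underscores. A column without an aggregation suffix, such as `count`, makes the helper raise `ValueError`.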

Column names:

`blob` prefix stands for the whole blob ($B$) <br>
`nonzero` prefix stands for statistics computed only over voxels with nonzero values ($B_{+}$); the code below selects them with `blob > 0`

Legend:

- `label` - label of the given ligand
- `blob_shape` - dimensions of the blob; ($B_{x}$, $B_{y}$, $B_{z}$)
- `blob_n` - number of all voxels in the blob; $|B| = B_{x} \times B_{y} \times B_{z}$
- `nonzero_n` - number of voxels containing nonzero values; $|B_{+}|$
- `nonzero_%` - share of nonzero voxels in the whole blob, expressed as a fraction; $\frac{|B_{+}|}{|B|}$
- `nonzero_min` - minimum value of the nonzero voxels; $\min(B_{+})$
- `nonzero_1_qrtl` - first quartile of the nonzero voxel values; $Q1(B_{+})$
- `nonzero_mean` - mean value of the nonzero voxels; $\frac{1}{|B_{+}|} \sum_{b \in B_{+}} b$
- `nonzero_3_qrtl` - third quartile of the nonzero voxel values; $Q3(B_{+})$
- `nonzero_max` - maximum value of the nonzero voxels; $\max(B_{+})$
- `nonzero_sum` - sum of the nonzero voxel values; $\sum_{b \in B_{+}} b$
- `nonzero_median` - median of the nonzero voxel values; $\mathrm{median}(B_{+})$
- `nonzero_std` - standard deviation of the nonzero voxel values; $\mathrm{std}(B_{+})$
- `nonzero_skewness` - skewness of the nonzero voxel values; $\mathrm{skewness}(B_{+})$
- `nonzero_kurtosis` - excess kurtosis (kurtosis minus 3) of the nonzero voxel values; $\mathrm{kurtosis}(B_{+}) - 3$
- `nonzero_zscore_2_n` - number of nonzero voxels with a z-score greater than 2; $|\{b \in B_{+} : \mathrm{zscore}(b) > 2\}|$
- `nonzero_zscore_2_%` - share of nonzero voxels with a z-score greater than 2 among all nonzero voxels, expressed as a fraction; $\frac{|\{b \in B_{+} : \mathrm{zscore}(b) > 2\}|}{|B_{+}|}$
- `nonzero_zscore_3_n` - number of nonzero voxels with a z-score greater than 3; $|\{b \in B_{+} : \mathrm{zscore}(b) > 3\}|$
- `nonzero_zscore_3_%` - share of nonzero voxels with a z-score greater than 3 among all nonzero voxels, expressed as a fraction; $\frac{|\{b \in B_{+} : \mathrm{zscore}(b) > 3\}|}{|B_{+}|}$
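
To make the legend concrete, here is a small worked example on a toy blob of eight voxels. It is a pure-Python sketch, not the notebook's implementation: the function name `nonzero_stats` and the toy values are made up for illustration, and it assumes the sample standard deviation (dividing by $|B_{+}| - 1$), which is what `torch.Tensor.std()` returns by default.

```python
import statistics


def nonzero_stats(values):
    """Compute a subset of the legend's statistics for a flat list of voxels."""
    nonzero = [v for v in values if v > 0]   # B_+
    n = len(nonzero)                         # |B_+|
    mean = statistics.fmean(nonzero)
    std = statistics.stdev(nonzero)          # sample std (n - 1 in the denominator)
    zscores = [(v - mean) / std for v in nonzero]
    return {
        "blob_n": len(values),
        "nonzero_n": n,
        "nonzero_%": n / len(values),
        "nonzero_mean": mean,
        "nonzero_std": std,
        # Third and fourth standardized moments; 3 is subtracted from the
        # fourth so that the result is the excess kurtosis.
        "nonzero_skewness": sum(z ** 3 for z in zscores) / n,
        "nonzero_kurtosis": sum(z ** 4 for z in zscores) / n - 3.0,
        "nonzero_zscore_2_n": sum(1 for z in zscores if z > 2.0),
        "nonzero_zscore_3_n": sum(1 for z in zscores if z > 3.0),
    }


# Toy blob: four zero voxels and four nonzero voxels (1, 2, 3, 2).
example = nonzero_stats([0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 2.0, 0.0])
print(example)
```

For this toy blob half of the voxels are nonzero, their mean is 2, the values are symmetric around the mean (skewness 0), and no voxel lies more than two standard deviations above it.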

In [None]:
import pandas as pd
import torch

import sys
sys.path.append('../src/')

from simple_reader import LigandDataset, DataLoader

In [None]:
dataset = LigandDataset('../data')
dataloader = DataLoader(dataset)

In [None]:
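def zscore_threshold(tensor, threshold):
    """Count the elements of `tensor` whose z-score exceeds `threshold`."""
    # Standalone helper; the loop below inlines the same computation.
    diffs = tensor - tensor.mean()
    zscores = diffs / tensor.std()
    return zscores[zscores > threshold].shape[0]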

In [None]:
stats = []

for idx, (blob, label) in enumerate(dataloader):
    nonzero = blob[blob > 0]
    nonzero_n = nonzero.shape[0]                        # nonzero_n
    blob_n = blob.flatten().shape[0]                    # blob_n
    nonzero_mean = float(nonzero.mean())                # nonzero_mean
    nonzero_std = float(nonzero.std())                  # nonzero_std
    nonzero_1_qrtl = float(nonzero.quantile(0.25))      # nonzero_1_qrtl
    nonzero_3_qrtl = float(nonzero.quantile(0.75))      # nonzero_3_qrtl

    diffs = nonzero - nonzero_mean
    zscores = diffs / nonzero_std

    nonzero_zscore_2 = zscores[zscores > 2.0].shape[0]      # nonzero_zscore_2_n
    nonzero_zscore_3 = zscores[zscores > 3.0].shape[0]      # nonzero_zscore_3_n

    nonzero_skewness = float(torch.pow(zscores, 3.0).mean())        # nonzero_skewness
    nonzero_kurtosis = float(torch.pow(zscores, 4.0).mean() - 3.0)  # nonzero_kurtosis
    
    stats.append([
        label,                          # label
        list(blob.shape),               # blob_shape
        blob_n,                         # blob_n
        nonzero_n,                      # nonzero_n
        nonzero_n / blob_n,             # nonzero_%
        float(nonzero.min()),           # nonzero_min
        nonzero_1_qrtl,                 # nonzero_1_qrtl
        nonzero_mean,                   # nonzero_mean
        nonzero_3_qrtl,                 # nonzero_3_qrtl
        float(nonzero.max()),           # nonzero_max
        float(nonzero.sum()),           # nonzero_sum         
        float(nonzero.median()),        # nonzero_median
        nonzero_std,                    # nonzero_std
        nonzero_skewness,               # nonzero_skewness
        nonzero_kurtosis,               # nonzero_kurtosis
        nonzero_zscore_2,               # nonzero_zscore_2_n
        nonzero_zscore_2 / nonzero_n,   # nonzero_zscore_2_%
        nonzero_zscore_3,               # nonzero_zscore_3_n
        nonzero_zscore_3 / nonzero_n,   # nonzero_zscore_3_%
    ])

In [None]:
df = pd.DataFrame(
    stats,
    columns=[
        'label',
        'blob_shape',
        'blob_n',
        'nonzero_n',
        'nonzero_%',        
        'nonzero_min',
        'nonzero_1_qrtl',
        'nonzero_mean',
        'nonzero_3_qrtl',
        'nonzero_max',
        'nonzero_sum',
        'nonzero_median',
        'nonzero_std',
        'nonzero_skewness',
        'nonzero_kurtosis',
        'nonzero_zscore_2_n',
        'nonzero_zscore_2_%',
        'nonzero_zscore_3_n',
        'nonzero_zscore_3_%'
    ]
)

[stats_all.csv](https://ligands.blob.core.windows.net/ligands/stats_all.csv)

In [None]:
df.to_csv('stats_all.csv')

[stats_by_label.csv](https://ligands.blob.core.windows.net/ligands/stats_by_label.csv)

In [None]:
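# Drop the list-valued `blob_shape` column, which cannot be aggregated;
# `label` has to stay in the frame because it is the groupby key.
df_by_label = df.drop(columns=['blob_shape']).groupby('label').agg(['min', 'mean', 'median', 'max', 'std'])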
df_by_label.columns = ['_'.join(column) for column in df_by_label.columns.values]
df_by_label['count'] = df.groupby('label').count().iloc[:, 0]
df_by_label.to_csv('stats_by_label.csv')
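
The column flattening in the cell above is easier to follow on a toy frame. The sketch below uses made-up labels and values and only two aggregation functions; it assumes nothing beyond standard pandas `groupby`/`agg` behaviour.

```python
import pandas as pd

# Toy stand-in for the per-blob statistics (labels and values are hypothetical).
toy = pd.DataFrame({
    "label": ["ATP", "ATP", "HEM"],
    "blob_n": [8, 12, 27],
    "nonzero_n": [4, 6, 9],
})

by_label = toy.groupby("label").agg(["min", "mean"])
# After agg() with a list, the columns are (feature, aggregation) pairs;
# joining each pair with "_" yields names such as "blob_n_mean".
by_label.columns = ["_".join(pair) for pair in by_label.columns.values]
# Number of blobs per label, aligned on the label index.
by_label["count"] = toy.groupby("label").size()
print(by_label)
```

Here `size()` is used for the per-label count as an alternative to `count().iloc[:, 0]`; the two agree as long as the first column has no missing values.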

[stats_summary.csv](https://ligands.blob.core.windows.net/ligands/stats_summary.csv)

In [None]:
df_summary = df.drop(df.columns[:3], axis=1).agg(['min', 'mean', 'median', 'max', 'std'], axis=0)
df_summary.to_csv('stats_summary.csv', index_label='stat')