# Dynamical Network Biomarker (DNB) tools for tabular data

---

DNB (Dynamical Network Biomarker) analysis aims to detect early warning signals, in the form of collective fluctuations, at the pre-disease state that precedes a critical transition from a healthy state to a disease state [1, 2, 3].

Although the original DNB theory [1] is based on calculating standard deviations and Pearson's correlation coefficients of collective fluctuations in multivariate time-series data, such analysis is not easy for omics data such as gene or protein expression profiles. For omics data, it is instead useful to calculate the statistical indices over multiple samples rather than over time [1, 4]. This software provides tools for calculating these statistical indices from omics data of multiple samples.

Two indices are important for characterizing a DNB: one for the strength of fluctuations in each variable of the DNB, and the other for the correlation of fluctuations between each pair of variables in the DNB. For the former you can choose `mad` (median absolute deviation) or `std` (standard deviation); for the latter, `spearman` (Spearman's rank correlation coefficient) or `pearson` (Pearson's correlation coefficient). It has been reported that the median absolute deviation and Spearman's correlation coefficient are more appropriate when the data include outliers [5], although deciding which values are outliers is itself a difficult problem.
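
As a minimal sketch of these two indices (not the package's own implementation), the following computes the per-gene MAD and standard deviation across samples, and the pairwise Spearman and Pearson correlations between genes. The table, gene names, and sample labels are made up for illustration, and the unscaled MAD used here is an assumption about the intended definition.

```python
import numpy as np
import pandas as pd
from scipy.stats import median_abs_deviation

# Toy data (made up): rows are genes, columns are samples of one group.
rng = np.random.default_rng(0)
values = rng.normal(size=(3, 6))
values[0, 0] = 50.0  # inject one outlier into gene_a
df = pd.DataFrame(values, index=["gene_a", "gene_b", "gene_c"],
                  columns=[f"expr_{i}" for i in range(1, 7)])

# Strength of fluctuation of each gene across samples.
# median_abs_deviation uses scale=1.0 by default (no normal-consistency factor).
mad = pd.Series(median_abs_deviation(df.to_numpy(), axis=1), index=df.index)
std = df.std(axis=1)  # sample standard deviation (ddof=1)

# Correlation of fluctuations between each pair of genes.
# DataFrame.corr correlates columns, so transpose to put genes in columns.
cor_spearman = df.T.corr(method="spearman")
cor_pearson = df.T.corr(method="pearson")

print(pd.DataFrame({"mad": mad, "std": std}))
print(cor_spearman.round(2))
```

In this toy table the single outlier inflates the standard deviation of `gene_a` far more than its MAD, which is the kind of robustness the paragraph above refers to.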


## References

1. L. Chen, R. Liu, Z.-P. Liu, M. Li, and K. Aihara: “Detecting Early-warning Signals for Sudden Deterioration of Complex Diseases by Dynamical Network Biomarkers,” Scientific Reports, 2, 342, 1-8, doi:10.1038/srep00342 (2012).
2. M. Oku and K. Aihara: “On the Covariance Matrix of the Stationary Distribution of a Noisy Dynamical System,” Nonlinear Theory and Its Applications, IEICE, 9(2), 166-184, doi:10.1587/nolta.9.166 (2018).
3. K. Aihara, R. Liu, K. Koizumi, X. Liu, and L. Chen: “Dynamical Network Biomarkers: Theory and Application,” Gene, 808, 145997, 1-10, doi:10.1016/j.gene.2021.145997 (2022).
4. K. Koizumi, M. Oku, S. Hayashi, A. Inujima, N. Shibahara, L. Chen, Y. Igarashi, K. Tobe, S. Saito, M. Kadowaki, and K. Aihara: “Identifying Pre-disease Signals before Metabolic Syndrome in Mice by Dynamical Network Biomarkers,” Scientific Reports, 9, 8767, 1-11, doi:10.1038/s41598-019-45119-w (2019).
5. M. Oku: “Two Novel Methods for Extracting Synchronously Fluctuated Genes,” IPSJ Transactions on Bioinformatics, 12, 9-16, doi:10.2197/ipsjtbio.12.9 (2019).


## Requirements

This package needs the following:

```code
numpy
scipy
pandas
matplotlib
PyYAML
```

In [None]:
# first run this code box to connect to runtime

## Usage

1. Install package

In [None]:
!pip install git+https://github.com/hiroshi-yamashita/dnb-tools.git

2. Prepare your data in `input/` following the format instructions below.
  - When using Colaboratory, upload your CSV files: click the folder icon on the far left and upload the files via "Upload to Session Storage".
  - You can get example data with the following command:

In [None]:
!dnb_example_tabular
# See "dnb_example_tabular --help" for more information.
# !dnb_example_tabular --help

### Input format

- The files in `input` folder are regarded as input files.
- Format
  - Only CSV files are accepted. MS Excel files are not supported.
  - Delimiters should be commas. Tab-separated files are not supported.
  - **Data of the experimental and control groups should be contained in the same file.** The columns are split into control and experimental groups **depending on the prefix of the label.**
- File names
  - **Input files must have the same prefix**. Specifically, the names of the input files must follow the format `sample_data_T.csv`.
    - `sample_data`: prefix of the input files.
    - `T`: index of the input (e.g. T=1,2,...,10)
      - Typically `T` corresponds to the time point.
- Row and Columns
  - The first column should contain row names, which must be unique.
    - Typically they correspond to gene IDs.
  - The first row should contain column names. Columns are split into control and experimental groups depending on the prefix of the label (e.g. `expr` vs. `ctrl`).
    - Each group should have at least 4 columns.
  - Because of tie-breaking in the clustering method, the result depends on the order of rows and columns. We recommend fixing their order to improve the reproducibility of the result.
- The name of the input folder (`input`), the prefix of the input files (`sample_data`), and the labels (`expr`, `ctrl`) can be changed. Please modify the configuration box below accordingly.
- The following is an example of the CSV file layout.

|  | `ctrl` | ... | `ctrl` | `expr` | ... | `expr` |
|----|----|----|----|----|----|----|
|gene_name1|$C_{1,1}$ | ... | $C_{M,1}$ | $E_{1,1}$ | ... |$E_{M',1}$ |
|gene_name2|$C_{1,2}$ | ... |$C_{M,2}$ | $E_{1,2}$ | ... |$E_{M',2}$ |
|...| ... | ... | ... | ... | ... | ... |
|gene_nameN|$C_{1,N}$ | ... |$C_{M,N}$ | $E_{1,N}$ | ... |$E_{M',N}$ |
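
The layout above can be illustrated with a short pandas sketch. It is not the package's `read_csv_and_split`; the CSV content, the numbered column labels (`ctrl_1`, `expr_1`, ...), and the use of `str.startswith` to split by prefix are assumptions made for illustration.

```python
import io
import pandas as pd

# A made-up CSV in the layout above: first column = unique gene IDs,
# remaining columns = samples labelled with a group prefix.
csv_text = """gene,ctrl_1,ctrl_2,ctrl_3,ctrl_4,expr_1,expr_2,expr_3,expr_4
gene_name1,1.0,1.1,0.9,1.0,2.0,0.5,1.8,0.4
gene_name2,3.0,3.2,2.9,3.1,3.0,3.1,2.9,3.2
"""
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Split the columns by the prefix of their labels.
df_ctrl = df.loc[:, df.columns.str.startswith("ctrl")]
df_expr = df.loc[:, df.columns.str.startswith("expr")]

# Basic checks matching the requirements listed above.
assert df.index.is_unique, "row names must be unique"
assert df_ctrl.shape[1] >= 4 and df_expr.shape[1] >= 4, "need at least 4 columns per group"

print(df_ctrl.shape, df_expr.shape)  # (2, 4) (2, 4)
```

Running a check like this on your own files before the analysis can catch duplicated gene IDs or mislabelled columns early.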

3. If necessary, change the configuration box according to your experimental settings.
4. Then run the code boxes in the `DNB analysis` section below in order.
  - Simpler instructions are given in the `How to run in terminal` section.
5. Check the output file (`output.csv` by default).
  - When using Colaboratory, download the output file from the `file` section of the left column.
6. Before using the result, validate it through additional experiments or the literature.

---

## DNB analysis

### Step 0: preparation

In [None]:
from dnb_tool.tabular.dnb_iterate import dnb_tb_iterate
from dnb_tool.tabular.read_files import get_filenames
from dnb_tool.tabular.read_files import read_csv_and_split
import pandas as pd

### Step 1: configuration

- Please modify the settings below to suit your setup.
  - See the note after `#` for each option.

In [None]:
input_path = "input" # the name of folder that contains input .csv files
prefix = "sample_data" # the prefix of input .csv files
key_control = "ctrl" # columns whose labels contain this key are treated as the control group
key_experimental = "expr" # columns whose labels contain this key are treated as the experimental group
output_filename = "output.csv" # DNB candidates calculated from each file are written to this file
ignore_extra_columns = False # if True, ignore columns that belong to neither the control nor the experimental group
kwargs_DNB = {
    "deviation_metric": "mad", # the metric for deviation. "mad": median absolute deviation. "std": standard deviation.
    "thres_gene_filtering": 2, # genes whose deviation in the experimental group is larger than X*100 % of that in the control group are selected as DNB candidates.
    "linkage_metric": "spearman", # metric used for clustering. "spearman": Spearman's rank correlation, "pearson": Pearson's correlation coefficient.
    "linkage_method": "average", # linkage method used for clustering. Please see the documentation of scipy.cluster.hierarchy.
    "linkage_threshold": 0.75, # the threshold for cluster division
    "thres_cluster_selection": 0.5, # clusters whose size is larger than X*100 % of the maximum cluster size are selected for output.
    "output_metrics": True, # if True, the output includes detailed metrics for the DNB candidates
    "plot_correlation": True, # if True, the correlation of the DNB candidates is plotted
    "plot_heatmap": True, # if True, the input values for the DNB candidates are plotted
    "plot_file_prefix": None # (path and) prefix of the plot filenames (if None, plots are displayed on screen)
}

### Step 2: check input files

- obtain file names

In [None]:
keys, filenames = get_filenames(input_path, prefix)

- check file names

In [None]:
display(pd.DataFrame(filenames, index=pd.Series(keys, name="key"), columns=["name"]))
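
For readers who want to see how file names of the form `<prefix>_<T>.csv` can be collected and ordered, here is a minimal standalone sketch. It is not the package's `get_filenames`; the helper name `list_inputs` is hypothetical, and it assumes that `T` is a non-negative integer and that files should be sorted numerically.

```python
import re
import tempfile
from pathlib import Path


def list_inputs(input_path, prefix):
    """Return (keys, filenames) for files named '<prefix>_<T>.csv', sorted by integer T."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)\.csv$")
    found = []
    for path in Path(input_path).iterdir():
        match = pattern.match(path.name)
        if match:
            found.append((int(match.group(1)), str(path)))
    found.sort()  # numeric order, so T=10 comes after T=2
    return [k for k, _ in found], [f for _, f in found]


# Demonstration on a temporary folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for t in (10, 1, 2):
        (Path(tmp) / f"sample_data_{t}.csv").touch()
    (Path(tmp) / "notes.txt").touch()  # ignored: does not match the pattern
    demo_keys, demo_files = list_inputs(tmp, "sample_data")

print(demo_keys)  # [1, 2, 10]
```

Sorting on the parsed integer rather than on the file name avoids the lexicographic order `1, 10, 2` when there are ten or more inputs.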

- Check that the input data are properly split into control and experimental groups. A summary of the first input file is displayed.

In [None]:
print(f"control group key:\n\t{key_control}")
print(f"experimental group key:\n\t{key_experimental}")

print(f"input file:\n\t{filenames[0]}")
df_c, df_e = read_csv_and_split(
    filenames[0], key_control, key_experimental, ignore_extra_columns=ignore_extra_columns)
print("control group:")
display(df_c)
print("experimental group:")
display(df_e)

for filename in filenames:
    _, _ = read_csv_and_split(
        filename, key_control, key_experimental, ignore_extra_columns=ignore_extra_columns)

### Step 3: calculate SFGs (synchronously fluctuated genes; DNB candidates)

In [None]:
result = dnb_tb_iterate(keys, filenames, key_control, key_experimental, kwargs_DNB)
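
To give a rough idea of what the options in `kwargs_DNB` control, the sketch below chains the three stages described in the option comments: gene filtering by deviation ratio, hierarchical clustering, and cluster selection by size. It is written from those descriptions only and is not the code behind `dnb_tb_iterate`. The function name `sketch_sfg`, the distance `1 - |Spearman correlation|`, the use of `linkage_threshold` as a distance cutoff, and the toy data are all assumptions.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import median_abs_deviation


def sketch_sfg(df_ctrl, df_expr, thres_gene=2.0, link_thres=0.75, thres_cluster=0.5):
    """Illustrative filter -> cluster -> select pipeline (not the package's code)."""
    # 1. Keep genes whose MAD in the experimental group exceeds
    #    thres_gene times their MAD in the control group.
    dev_c = median_abs_deviation(df_ctrl.to_numpy(), axis=1)
    dev_e = median_abs_deviation(df_expr.to_numpy(), axis=1)
    kept = df_expr.loc[dev_e > thres_gene * dev_c]
    if len(kept) < 2:
        return pd.DataFrame(columns=["cluster", "clustersize"])

    # 2. Average-linkage clustering of the kept genes.
    #    ASSUMPTION: distance = 1 - |Spearman correlation|, cut at link_thres.
    cor = kept.T.corr(method="spearman").to_numpy()
    dist = np.clip(1.0 - np.abs(cor), 0.0, None)
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = pd.Series(fcluster(tree, t=link_thres, criterion="distance"),
                       index=kept.index)

    # 3. Keep clusters larger than thres_cluster times the largest cluster.
    sizes = labels.map(labels.value_counts())
    selected = sizes > thres_cluster * sizes.max()
    return pd.DataFrame({"cluster": labels[selected], "clustersize": sizes[selected]})


# Toy data (made up): 6 genes, 8 samples per group.
rng = np.random.default_rng(0)
genes = [f"g{i}" for i in range(6)]
ctrl_values = rng.normal(scale=0.1, size=(6, 8))
expr_values = rng.normal(scale=0.1, size=(6, 8))
shared = np.array([-6.0, -4.0, -2.0, -1.0, 1.0, 2.0, 4.0, 6.0])
expr_values[:3] += shared  # g0..g2 fluctuate together in the experimental group
ctrl = pd.DataFrame(ctrl_values, index=genes)
expr = pd.DataFrame(expr_values, index=genes)

sfg = sketch_sfg(ctrl, expr)
print(sfg)
```

In this toy example the three genes that share a large common fluctuation in the experimental group pass the filter and fall into one cluster. The exact distance definition and threshold semantics of the real tool may differ, so treat this only as a reading aid for the configuration options.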

### Step 4: output result to csv file


- Write the results to `output.csv`. After running the box below, find and download `output.csv` from the left column for further analysis. (The filename may differ depending on the configuration.)

In [None]:
print("#### output table ####")

#### output columns:
#### `dnb`: DNB candidates
#### `cluster`: index of the cluster that the gene belongs to
#### `clustersize`: size of the cluster that the gene belongs to
#### `dev_expr` (`dev_ctrl`): the deviation metric (default: MAD) of the gene in the experimental (control) group
#### `cor_mean`: mean of the correlation metric (default: Spearman's rank correlation) in the cluster
#### `time_point`: index of the input file

#### If you are analyzing a single dataset,
#### you can run the line below to output only the "dnb" column.
# result = result[["dnb"]]

result.to_csv(output_filename, index=False)
display(result)
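
Once the table is written, a per-time-point summary can help when comparing several input files. The following sketch uses a small made-up table with the documented column names; the values, and the assumption that `time_point` holds integer indices, are illustrative only.

```python
import pandas as pd

# Made-up rows using the documented output columns.
toy = pd.DataFrame({
    "dnb": ["g1", "g2", "g3", "g1", "g4"],
    "cluster": [1, 1, 1, 2, 2],
    "clustersize": [3, 3, 3, 2, 2],
    "dev_expr": [2.0, 1.5, 1.8, 0.9, 1.1],
    "dev_ctrl": [0.5, 0.4, 0.6, 0.3, 0.4],
    "cor_mean": [0.9, 0.9, 0.9, 0.8, 0.8],
    "time_point": [1, 1, 1, 2, 2],
})

# Number of candidates and average metrics for each input file.
summary = toy.groupby("time_point").agg(
    n_candidates=("dnb", "nunique"),
    mean_dev_expr=("dev_expr", "mean"),
    mean_cor=("cor_mean", "mean"),
)
print(summary)
```

With real results, the same call could be applied to `pd.read_csv(output_filename)`, provided the detailed metric columns were written (`output_metrics` set to `True`).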


---

## How to run in terminal

### Step 1: Make example data


In [None]:
!dnb_example_tabular
# See "dnb_example_tabular --help" for more information.
# !dnb_example_tabular --help

### Step 2: Set your data and configure
- The example data and a configuration file are saved to `input/`. Check these files and modify them as needed.

### Step 3: Run the analysis script

In [None]:
!dnb_tabular --config_file input/sample_params.json
# See "dnb_tabular --help" for more information.
# !dnb_tabular --help

### Step 4: Check the output
- Check the output .csv file. By default, the output is saved to `output.csv`. The filename may differ depending on the configuration.

---

## Author of this code
Hiroshi Yamashita : h.yamashita@ist.osaka-u.ac.jp

GitHub: https://github.com/hiroshi-yamashita/dnb_tool